How does a C compiler know how to store the addresses of a 2D array?

2016-08-11

c gcc stackexchange stackoverflow

An answer to this question on Stack Overflow.

Question

Consider your typical gcc compiler (C99 mode) and consider an array:

char array[2][4];

clearly the compiler would compile (I assume it's in the translation process?) the code to make the target machine (let's suppose it's an X86) work in the way that it "settles", or shall we say, "assigns" both addresses of array[0][0] and array[1][0] before accessing either one of them (I could be completely wrong). My problem is how does the compiler "know" this, since it is just a dumb program? Is it a simple recursion algorithm of some sort done amazingly right so we don't really have to care how many dimensions there ever will be (as in "oh, there's a bracket pair following the name "array"? I'll just translate it into an address, wait, there are 2? Address of an address then) or those people who designed the compiler specifically studied the situation and coded the compiler to tackle it?

If you were confused by my question, consider an one dimensional array arr[2]. I can get arr involved in all sorts of calculation knowing it's just an address, a "beginning address" so to speak. But for a 1D array, you only need one "beginning address" which is easily accomplished during compilation since the compiler will just translate that name (in this case, arr) into an uninitialized address (again, I could be completely wrong), but for a 2D array, the compiler need to deal with more than one address, how does it work?

And what would the assembly code look like?

Answer

A 2D array such as

int arr[3][2] = {{0, 1}, {2, 3}, {4, 5}};

is laid out in memory as:

0 1 2 3 4 5

because C is a row-major language.

This is the same layout as:

int arrflat[6] = { 0, 1, 2, 3, 4, 5 };

And you can access and manipulate both arrays using just their address (arr and arrflat, respectively).

However, when you access elements via arr[y][x] or arrflat[i] a translation occurs.

arrflat[i] becomes

arrflat+i

whereas arr[y][x] becomes

arr+(y*width+x)

and, in fact, you can do pointer mathematics on arr in this way.

A simple test program is:

#include <stdio.h>
int main(){
  int arr[3][2]  = {{0, 1}, {2, 3}, {4, 5}};
  int arrflat[6] = { 0, 1, 2, 3, 4, 5 };
  for(int y=0;y<3;y++)
  for(int x=0;x<2;x++)
    printf("%d\n",arr[y][x]);
  for(int i=0;i<6;i++)
    printf("%d\n",arrflat[i]);
}

Compile this to generate assembly with

gcc -g -Wa,-adhls  test.c

The (abbreviated) output is:

9:test.c        ****     printf("%d\n",arr[y][x]);
49                  .loc 1 9 0 discriminator 3
50 007d 8B45B8      movl  -72(%rbp), %eax
51 0080 4898        cltq
52 0082 8B55B4      movl  -76(%rbp), %edx
53 0085 4863D2      movslq  %edx, %rdx
54 0088 4801D2      addq  %rdx, %rdx
55 008b 4801D0      addq  %rdx, %rax
56 008e 8B4485C0    movl  -64(%rbp,%rax,4), %eax
57 0092 89C6        movl  %eax, %esi
58 0094 BF000000    movl  $.LC0, %edi
58      00
59 0099 B8000000    movl  $0, %eax
59      00
60 009e E8000000    call  printf
12:test.c        ****     printf("%d\n",arrflat[i]);
80                  .loc 1 12 0 discriminator 3
81 00c0 8B45BC      movl  -68(%rbp), %eax
82 00c3 4898        cltq
83 00c5 8B4485E0    movl  -32(%rbp,%rax,4), %eax
84 00c9 89C6        movl  %eax, %esi
85 00cb BF000000    movl  $.LC0, %edi
85      00
86 00d0 B8000000    movl  $0, %eax
86      00
87 00d5 E8000000    call  printf

Eliminating common code between the two calls to printf and annotating gives:

9:test.c        ****     printf("%d\n",arr[y][x]);
49                  .loc 1 9 0 discriminator 3
50 007d 8B45B8      movl  -72(%rbp), %eax        #Load address -72 bytes from the memory pointed to by %rbp
51 0080 4898        cltq                         #Turn this into a 64-bit integer address (where is `arr`?)
52 0082 8B55B4      movl  -76(%rbp), %edx        #Load address -76 bytes from the memory pointed to by %rbp
53 0085 4863D2      movslq  %edx, %rdx           #Turn %edx into a signed 64-bit offset
54 0088 4801D2      addq  %rdx, %rdx             #Add rdx to itself
55 008b 4801D0      addq  %rdx, %rax             #Add offset to the address
56 008e 8B4485C0    movl  -64(%rbp,%rax,4), %eax #Load *(rbp - 4 + (rax * 4)) into eax (get arr[y][x])
12:test.c        ****     printf("%d\n",arrflat[i]);
80                  .loc 1 12 0 discriminator 3
81 00c0 8B45BC      movl  -68(%rbp), %eax        #Load address -62 bytes from the memory pointed to by %rbp
82 00c3 4898        cltq                         #Convert this into a 64-bit integer address (where is `arrflat`?)
83 00c5 8B4485E0    movl  -32(%rbp,%rax,4), %eax #Load *(rbp - 4 + (rax * 4)) into eax (get arrflat[i])