c - Array syntax vs pointer syntax and code generation?

Question

In Richard Reese's book "Understanding and Using C Pointers", it says on page 85,

int vector[5] = {1, 2, 3, 4, 5};
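The code generated by vector[i] is different from the code generated by *(vector+i). The notation vector[i] generates machine code that starts at location vector, moves i positions from this location, and uses its content. The notation *(vector+i) generates machine code that starts at location vector, adds i to the address, and then uses the contents at that address. While the result is the same, the generated machine code is different. This difference is rarely of significance to most programmers.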

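You can see the excerpt here. What does this passage mean? In what context would any compiler generate different code for those two? Is there a difference between "move" from base and "add" to base? I was unable to reproduce this with GCC, that is, to make it generate different machine code.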

score 97 · Accepted Answer

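The quote is just wrong. It is pretty tragic that such garbage is still published in this decade. In fact, the C Standard defines x[y] as *(x+y).

The part about lvalues later on the page is completely wrong too.

IMHO, the best way to use this book is to put it in a recycling bin or burn it.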
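The definition of x[y] as *(x+y) can be checked directly. The following is a minimal sketch, not taken from the book or the Standard; the function names same_element and demo_vector are made up for illustration. Because the binary + operator is commutative, the odd-looking spelling 3[vector] is also legal and designates the same element.

```c
#include <assert.h>

/* Returns 1 if v[i], *(v + i) and i[v] designate the same object
   for every index in [0, n), and 0 otherwise. */
int same_element(const int *v, int n)
{
    for (int i = 0; i < n; i++) {
        /* All three spellings are the same expression by definition,
           so the addresses they yield must compare equal. */
        if (&v[i] != v + i || &i[v] != &*(v + i))
            return 0;
    }
    return 1;
}

/* Hypothetical demo using the vector from the question. */
int demo_vector(void)
{
    int vector[5] = {1, 2, 3, 4, 5};

    assert(vector[3] == *(vector + 3));
    assert(3[vector] == 4);   /* legal: 3[vector] is *(3 + vector) */

    return same_element(vector, 5);
}
```

Calling demo_vector() returns 1 on a conforming implementation, since the comparisons hold by the language definition rather than by luck of code generation.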

score 33 · Accepted Answer

I have 2 C files: ex1.c

% cat ex1.c
#include <stdio.h>

int main (void) {
    int vector[5] = { 1, 2, 3, 4, 5 };
    printf("%d\n", vector[3]);
}

and ex2.c,

% cat ex2.c
#include <stdio.h>

int main (void) {
    int vector[5] = { 1, 2, 3, 4, 5 };
    printf("%d\n", *(vector + 3));
}

I compile both into assembly and show the difference between the generated assembly files:

% gcc -S ex1.c; gcc -S ex2.c; diff -u ex1.s ex2.s
--- ex1.s       2018-07-17 08:19:25.425826813 +0300
+++ ex2.s       2018-07-17 08:19:25.441826756 +0300
@@ -1,4 +1,4 @@
-       .file   "ex1.c"
+       .file   "ex2.c"
        .text
        .section        .rodata
 .LC0:

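Q.E.D.

The C standard states very explicitly (C11 n1570 6.5.2.1p2):

A postfix expression followed by an expression in square brackets [] is a subscripted designation of an element of an array object. The definition of the subscript operator [] is that E1[E2] is identical to (*((E1)+(E2))). Because of the conversion rules that apply to the binary + operator, if E1 is an array object (equivalently, a pointer to the initial element of an array object) and E2 is an integer, E1[E2] designates the E2-th element of E1 (counting from zero).

Additionally, the as-if rule applies here: if the behaviour of the program is the same, the compiler can generate the same code even if the semantics were not the same.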
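Applying the quoted rule twice covers multidimensional arrays as well: grid[r][c] is (*((*((grid)+(r)))+(c))). The sketch below illustrates that reading; the names read_by_pointer, read_by_subscript and demo_grid are assumptions made up for this example.

```c
#include <assert.h>

/* Reads element (r, c) using only pointer arithmetic and dereference. */
int read_by_pointer(int (*grid)[3], int r, int c)
{
    return *(*(grid + r) + c);
}

/* Reads the same element using subscripts. */
int read_by_subscript(int (*grid)[3], int r, int c)
{
    return grid[r][c];
}

/* Compares both spellings over a small grid; returns the last element
   if they agree everywhere, or -1 if they ever differ. */
int demo_grid(void)
{
    int grid[2][3] = { {10, 11, 12}, {20, 21, 22} };

    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 3; c++)
            if (read_by_pointer(grid, r, c) != read_by_subscript(grid, r, c))
                return -1;

    return read_by_pointer(grid, 1, 2);
}
```

Here grid decays to a pointer to its first row, grid + r selects row r, and the inner dereference yields that row, which decays again before c is added.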

score 19 · Accepted Answer

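The cited passage is quite wrong. The expressions vector[i] and *(vector+i) are perfectly identical and can be expected to generate identical code under all circumstances.

The expressions vector[i] and *(vector+i) are identical by definition. This is a central and fundamental property of the C programming language. Any competent C programmer understands this. Any author of a book entitled Understanding and Using C Pointers must understand this. Any author of a C compiler will understand this. The two fragments will generate identical code not by accident, but because virtually any C compiler will, in effect, translate one form into the other almost immediately, so that by the time it gets to its code generation phase, it won't even know which form had been used initially. (I would be quite surprised if a C compiler ever generated significantly different code for vector[i] as opposed to *(vector+i).)

In fact, the cited text contradicts itself. As you noted, the two passages

The notation vector[i] generates machine code that starts at location vector, moves i positions from this location, and uses its content.

and

The notation *(vector+i) generates machine code that starts at location vector, adds i to the address, and then uses the contents at that address.

say basically the same thing.

His language is eerily similar to that in question 6.2 of the old C FAQ list:

...when the compiler sees the expression a[3], it emits code to start at the location "a", move three past it, and fetch the character there. When it sees the expression p[3], it emits code to start at the location "p", fetch the pointer value there, add three to the pointer, and finally fetch the character pointed to.

But of course the key difference here is that a is an array and p is a pointer. The FAQ list is not talking about a[3] versus *(a+3), but rather about a[3] (or *(a+3)) where a is an array, versus p[3] (or *(p+3)) where p is a pointer. (Of course these two cases generate different code, because arrays and pointers are different. As the FAQ list explains, fetching an address from a pointer variable is fundamentally different from using the address of an array.)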
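The array-versus-pointer distinction from the FAQ can be made visible in source code. This is a small sketch under the FAQ's setup (an array a and a pointer p); the helper function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

char a[] = "hello";        /* array: the characters are stored in a itself */
const char *p = "world";   /* pointer: p stores the address of characters kept elsewhere */

/* The address of the array and the address of its first element coincide. */
int array_is_its_own_storage(void)
{
    return (void *)&a == (void *)&a[0];
}

/* The pointer variable is a separate object from the characters it points to. */
int pointer_is_separate_object(void)
{
    return (const void *)&p != (const void *)p;
}

size_t array_size(void)   { return sizeof a; }  /* five letters plus the terminating '\0' */
size_t pointer_size(void) { return sizeof p; }  /* size of a pointer, platform dependent */

char fetch_from_array(void)   { return a[3]; }  /* address is a + 3, no load of a pointer needed */
char fetch_from_pointer(void) { return p[3]; }  /* first load the value of p, then add 3 */
```

Both fetch functions use the same subscript syntax, and each could equally be written with *(a+3) or *(p+3); the difference in generated code comes from the type of the operand, not from the spelling.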

score 6 · Accepted Answer

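The Standard specifies the behavior of arr[i], when arr is an array object, as being equivalent to decaying arr to a pointer, adding i, and dereferencing the result. Although the behaviors are equivalent in all Standard-defined cases, there are some cases where compilers process actions usefully even though the Standard does not require it, and the handling of arrayLvalue[i] and *(arrayLvalue+i) may differ as a consequence.

For example, given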

char arr[5][5];
union { unsigned short h[4]; unsigned int w[2]; } u;

int atest1(int i, int j)
{
    if (arr[1][i])
        arr[0][j]++;
    return arr[1][i];
}
int atest2(int i, int j)
{
    if (*(arr[1]+i))
        *((arr[0])+j)+=1;
    return *(arr[1]+i);
}
int utest1(int i, int j)
{
    if (u.h[i])
        u.w[j]=1;
    return u.h[i];
}
int utest2(int i, int j)
{
    if (*(u.h+i))
        *(u.w+j)=1;
    return *(u.h+i);
}

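The code GCC generates for atest1 will assume that arr[1][i] and arr[0][j] cannot alias, but the code it generates for atest2 will allow for pointer arithmetic accessing the entire array. On the other hand, gcc will recognize that in utest1 the lvalue expressions u.h[i] and u.w[j] both access the same union, but it is not sophisticated enough to notice the same about *(u.h+i) and *(u.w+j) in utest2.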

score 6 · Accepted Answer

I think the original text might refer to some optimizations that some compilers may or may not perform.

Example:

for ( int i = 0; i < 5; i++ ) {
  vector[i] = something;
}

vs.

for ( int i = 0; i < 5; i++ ) {
  *(vector+i) = something;
}

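In the first case, an optimizing compiler might detect that the array vector is iterated over element by element, and thus generate something like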

char* tempPtr = (char*)vector; // byte pointer, so the += below is well-defined standard C
for ( int i = 0; i < 5; i++ ) {
  *((int*)tempPtr) = something;
  tempPtr += sizeof(int); // _move_ the pointer; simple addition of a constant.
}

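It might even use the target CPU's pointer post-increment instructions where available.

For the second case, it is "harder" for the compiler to see that the address computed through some "arbitrary" pointer arithmetic expression has the same property of advancing monotonically by a fixed amount in each iteration. It might therefore fail to find the optimization and compute ((char*)vector+i*sizeof(int)) in each iteration, which uses an additional multiplication. In this case no (temporary) pointer gets "moved"; a temporary address is simply recalculated.

However, the statement probably does not hold universally for all C compilers in all versions.

Update:

I checked the above example. It appears that without optimizations enabled, at least gcc-8.1 x86-64 generates more code (2 extra instructions) for the second (pointer arithmetic) form than for the first (array index) form.

See: https://godbolt.org/g/7DaPHG

However, with any optimization turned on (-O ... -O3), the generated code is the same (length) for both.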
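To compare the forms on a given compiler, the loops can be put into separate functions and inspected with gcc -S at different optimization levels. The sketch below is only one way to set that up; the function names, the constant VEC_LEN and the hand-written "walking pointer" variant are assumptions added for illustration, not output of any compiler.

```c
#include <assert.h>

enum { VEC_LEN = 5 };

/* Array-index form from the first loop. */
void fill_indexed(int *vector, int something)
{
    for (int i = 0; i < VEC_LEN; i++)
        vector[i] = something;
}

/* Pointer-arithmetic form from the second loop. */
void fill_arith(int *vector, int something)
{
    for (int i = 0; i < VEC_LEN; i++)
        *(vector + i) = something;
}

/* Hand-written "moving pointer" form, roughly what the described
   optimization would turn either loop into. */
void fill_walking(int *vector, int something)
{
    for (int *t = vector; t != vector + VEC_LEN; t++)
        *t = something;
}

/* Returns 1 if all three forms store the same values. */
int all_forms_agree(int something)
{
    int a[VEC_LEN], b[VEC_LEN], c[VEC_LEN];

    fill_indexed(a, something);
    fill_arith(b, something);
    fill_walking(c, something);

    for (int i = 0; i < VEC_LEN; i++)
        if (a[i] != something || b[i] != something || c[i] != something)
            return 0;
    return 1;
}
```

The observable behavior is identical by definition; only the emitted instructions may differ, and whether they do depends on the compiler and the flags used.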

score 3 · Accepted Answer

Let me try to answer this "in the narrow" (others have already described why the description "as-is" is somewhat lacking/incomplete/misleading):

In what context would any compiler generate different code for those two?

A "not-very-optimizing" compiler might generate different code in just about any context, because, while parsing, there's a difference: x[y] is one expression (index into an array), while *(x+y) are two expressions (add an integer to a pointer, then dereference it). Sure, it's not very hard to recognize this (even while parsing) and treat it the same, but, if you're writing a simple/fast compiler, then you avoid putting "too much smarts into it". As an example:

char vector[] = ...;
char f(int i) {
    return vector[i];
}
char g(int i) {
    return *(vector + i);
}

Compiler, while parsing f(), sees the "indexing" and may generate something like (for some 68000-like CPU):

MOVE D0, [A0 + D1] ; A0/vector, D1/i, D0/result of function

OTOH, for g(), compiler sees two things: first a dereference (of "something yet to come") and then the adding of integer to pointer/array, so being not-very-optimizing, it could end up with:

MOVE A1, A0   ; A1/t = A0/vector
ADD A1, D1    ; t += i/D1
MOVE D0, [A1] ; D0/result = *t

Obviously, this is very implementation dependent. Some compilers might also avoid complex instructions like the one used for f() (using complex instructions makes it harder to debug the compiler), the CPU might not have such complex instructions, etc.

Is there a difference between "move" from base, and "add" to base?

The description in the book is arguably not well-worded. But, I think the author wanted to describe the distinction shown above - indexing ("move" from base) is one expression, while "add and then dereference" are two expressions.

This is about compiler implementation, not language definition, a distinction the book should also have stated explicitly.
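The "one expression vs two expressions" point can be sketched with a toy syntax tree. The code below is a hypothetical illustration, not taken from any real compiler: the node kinds, mk, lower, same_tree and demo_lowering are all invented names. A parser that keeps a dedicated index node sees different trees for f() and g(); a single lowering pass that rewrites x[y] into *(x + y) makes them identical before code generation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { N_VAR, N_ADD, N_DEREF, N_INDEX } Kind;

typedef struct Node {
    Kind kind;
    const char *name;         /* used by N_VAR only */
    struct Node *lhs, *rhs;   /* operands; rhs is NULL for N_DEREF */
} Node;

static Node *mk(Kind kind, const char *name, Node *lhs, Node *rhs)
{
    Node *n = malloc(sizeof *n);
    if (!n) abort();
    n->kind = kind; n->name = name; n->lhs = lhs; n->rhs = rhs;
    return n;
}

static void free_tree(Node *n)
{
    if (!n) return;
    free_tree(n->lhs);
    free_tree(n->rhs);
    free(n);
}

/* Rewrites every x[y] node into *(x + y), bottom-up, in place. */
Node *lower(Node *n)
{
    if (!n) return NULL;
    n->lhs = lower(n->lhs);
    n->rhs = lower(n->rhs);
    if (n->kind == N_INDEX) {
        n->lhs = mk(N_ADD, NULL, n->lhs, n->rhs);
        n->rhs = NULL;
        n->kind = N_DEREF;
    }
    return n;
}

/* Structural equality of two trees. */
int same_tree(const Node *a, const Node *b)
{
    if (!a || !b) return a == b;
    if (a->kind != b->kind) return 0;
    if (a->kind == N_VAR) return strcmp(a->name, b->name) == 0;
    return same_tree(a->lhs, b->lhs) && same_tree(a->rhs, b->rhs);
}

/* Returns 1 if the trees differ as parsed but match after lowering. */
int demo_lowering(void)
{
    /* vector[i] as a naive parser might represent it */
    Node *f = mk(N_INDEX, NULL, mk(N_VAR, "vector", NULL, NULL),
                                mk(N_VAR, "i", NULL, NULL));
    /* *(vector + i) */
    Node *g = mk(N_DEREF, NULL,
                 mk(N_ADD, NULL, mk(N_VAR, "vector", NULL, NULL),
                                 mk(N_VAR, "i", NULL, NULL)),
                 NULL);

    int differ_before = !same_tree(f, g);
    int equal_after = same_tree(lower(f), g);

    free_tree(f);
    free_tree(g);
    return differ_before && equal_after;
}
```

A compiler that skips such a pass and generates code straight from the parsed tree is the kind of "not-very-optimizing" compiler described above.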

score 2 · Accepted Answer

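I tested the code on several compiler variants, and most of them produce the same assembly code for both statements (tested for x86 with no optimizations). Interestingly, gcc 4.4.7 does exactly what you mentioned: Example:

Other targets such as ARM or MIPS sometimes do the same, but I did not test them all. So it seems there was a difference, but later gcc versions "fixed" this bug.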

score -2 · Accepted Answer

This is an example of the array syntax used in C.

int a[10] = {1,2,3,4,5,6,7,8,9,10};

Answered 2018-07-23T16:43:55.133
