c - Is the "struct hack" technically undefined behavior?

Question

What I am asking about is the well-known "last member of a struct has variable length" trick. It goes something like this:

struct T {
    int len;
    char s[1];
};

struct T *p = malloc(sizeof(struct T) + 100);
p->len = 100;
strcpy(p->s, "hello world");

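Because of the way the struct is laid out in memory, we are able to overlay the struct onto a larger-than-necessary block and treat the last member as if it were larger than the 1 char specified.

So the question is: is this technique technically undefined behavior? I would expect that it is, but I am curious what the standard says about it.

PS: I am aware of the C99 approach to this; I would like the answers to stick specifically to the version of the trick listed above.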

score 54 · Accepted Answer

As the C FAQ says:

It's not clear if it's legal or portable, but it is rather popular.

and:

... an official interpretation has deemed that it is not strictly conforming with the C Standard, although it does seem to work under all known implementations. (Compilers which check array bounds carefully might issue warnings.)

The rationale behind the 'strictly conforming' bit is in the spec, section J.2 Undefined behavior, which includes in the list of undefined behavior:

An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6).

Paragraph 8 of Section 6.5.6 Additive operators has another mention that access beyond defined array bounds is undefined:

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
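To make the J.2 example concrete, here is a minimal sketch (the helper name and constants are hypothetical, not from the answer) that rewrites an out-of-range subscript pair such as a[1][7] for int a[4][5] into the in-range spelling of the same storage location. It only does index arithmetic and never evaluates the out-of-range subscript itself.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 4, COLS = 5 };   /* matches the declaration int a[4][5] */

/* Hypothetical helper: maps (row, col), where col may exceed COLS - 1,
   to an in-range (row, col) pair naming the same element.
   Returns 0 on success, -1 if the element lies outside the whole array. */
int normalize_index(size_t row, size_t col, size_t *out_row, size_t *out_col)
{
    assert(out_row != NULL && out_col != NULL);
    size_t flat = row * COLS + col;          /* position in row-major order */
    if (flat >= (size_t)ROWS * COLS)
        return -1;
    *out_row = flat / COLS;
    *out_col = flat % COLS;
    return 0;
}
```

For (1, 7) this yields (2, 2): a[1][7] would land on the storage of a[2][2], yet only the second spelling stays inside the bounds of the inner array, which is the distinction the quoted text draws.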

score 35 · Accepted Answer

I believe that technically it's undefined behavior. The standard (arguably) doesn't address it directly, so it falls under the "or by the omission of any explicit definition of behavior." clause (§4/2 of C99, §3.16/2 of C89) that says it's undefined behavior.

The "arguably" above depends on the definition of the array subscripting operator. Specifically, it says: "A postfix expression followed by an expression in square brackets [] is a subscripted designation of an array object." (C89, §6.3.2.1/2).

You can argue that the "of an array object" is being violated here (since you're subscripting outside the defined range of the array object), in which case the behavior is (a tiny bit more) explicitly undefined, instead of just undefined courtesy of nothing quite defining it.

In theory, I can imagine a compiler that does array bounds checking and (for example) would abort the program when/if you attempted to use an out of range subscript. In fact, I don't know of such a thing existing, and given the popularity of this style of code, even if a compiler tried to enforce subscripts under some circumstances, it's hard to imagine that anybody would put up with its doing so in this situation.

score 16 · Accepted Answer

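Yes, it is undefined behavior.

C Language Defect Report #051 gives a definitive answer to this question:

The idiom, while common, is not strictly conforming

http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_051.html

In the C99 Rationale document, the C Committee adds:

The validity of this construct has always been questionable. In the response to one Defect Report, the Committee decided that it was undefined behavior because the array p->items contains only one item, irrespective of whether the space exists.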

score 12 · Accepted Answer

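This particular way of doing it is not explicitly defined in any C standard, but C99 does include the "struct hack" as part of the language. In C99, the last member of a struct may be a "flexible array member", declared as char foo[] (with whatever type you wish in place of char).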
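A minimal sketch of the C99 form, assuming a hypothetical struct fam_string and helper fam_string_new (neither name comes from the answer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for illustration. */
struct fam_string {
    size_t len;
    char s[];   /* C99 flexible array member; must be the last member */
};

/* Allocates a fam_string holding a copy of text; returns NULL on failure. */
struct fam_string *fam_string_new(const char *text)
{
    assert(text != NULL);
    size_t len = strlen(text);
    /* sizeof *p does not count the flexible member, so add the
       string bytes and the terminating NUL explicitly. */
    struct fam_string *p = malloc(sizeof *p + len + 1);
    if (p == NULL)
        return NULL;
    p->len = len;
    memcpy(p->s, text, len + 1);
    return p;
}
```

The caller releases the whole object with a single free(p), since the header and the array live in one allocation.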

score 8 · Accepted Answer

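It's not undefined behavior, regardless of what anyone, official or otherwise, says, because it's defined by the standard. p->s, except when used as an lvalue, evaluates to a pointer identical to (char *)p + offsetof(struct T, s). In particular, this is a valid char pointer inside the malloc'd object, and there are 100 (or more, depending on alignment considerations) successive addresses immediately following it which are also valid as char objects inside the allocated object. The fact that the pointer was derived by using -> instead of explicitly adding the offset to the pointer returned by malloc, cast to char *, is irrelevant.

Technically, p->s[0] is the single element of the char array inside the struct. The next few elements (e.g. p->s[1] through p->s[3]) are likely padding bytes inside the struct, which could be corrupted if you perform assignment to the struct as a whole, but not if you merely access individual members. The rest of the elements are additional space in the allocated object, which you are free to use however you like, as long as you obey alignment requirements (and char has no alignment requirements).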
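The address computation this answer relies on can be sketched as follows. The helper names t_alloc, tail_of and bytes_after_s0 are hypothetical, and the sketch only shows that the two pointers compare equal; whether subscripting p->s past index 0 is sanctioned by the standard is exactly what the other answers dispute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct T {
    int len;
    char s[1];
};

/* Hypothetical helper: allocate the header plus `extra` additional bytes. */
struct T *t_alloc(size_t extra)
{
    struct T *p = malloc(sizeof(struct T) + extra);
    if (p != NULL)
        p->len = (int)extra;
    return p;
}

/* Start of the trailing storage, derived from the block's base address
   rather than from the member s. */
char *tail_of(struct T *p)
{
    assert(p != NULL);
    return (char *)p + offsetof(struct T, s);
}

/* Number of bytes of struct T that follow s[0]; on many ABIs these
   are the padding bytes the answer mentions. */
size_t bytes_after_s0(void)
{
    return sizeof(struct T) - offsetof(struct T, s) - sizeof(char);
}
```

Writing through tail_of(p) treats the allocation as a plain run of char objects, which is the reading of the situation the answer argues for.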

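If you are worried that the possibility of overlapping with padding bytes in the struct might somehow invoke nasal demons, you could avoid this by replacing the 1 in [1] with a value which ensures that there is no padding at the end of the struct. A simple but wasteful way to do this would be to make a struct with identical members except no array at the end, and use s[sizeof(struct that_other_struct)]; for the array. Then p->s[i] is clearly defined as an element of the array in the struct for i<sizeof(struct that_other_struct), and as a char object at an address following the end of the struct for i>=sizeof(struct that_other_struct).

Edit: Actually, in the above trick for getting the right size, you might also need to put a union containing every simple type before the array, to ensure that the array itself begins with maximal alignment rather than in the middle of some other element's padding. Again, I don't believe any of this is necessary, but I'm offering it up for the most paranoid of the language lawyers out there.

Edit 2: The overlap with padding bytes is definitely not an issue, due to another part of the standard. C requires that if two structs agree in an initial subsequence of their elements, the common initial elements can be accessed via a pointer to either type. As a consequence, if a struct identical to struct T but with a larger final array were declared, its element s[0] would have to coincide with the element s[0] in struct T, and the presence of the additional elements could not affect, or be affected by, accessing the common elements of the larger struct using a pointer to struct T.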

score 8 · Accepted Answer

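Yes, it is technically undefined behavior.

Note that there are at least three ways to implement the "struct hack":

(1) Declaring the trailing array with size 0 (the most "popular" way in legacy code). This is obviously UB, since zero-size array declarations are always illegal in C. Even if it does compile, the language makes no guarantees about the behavior of any constraint-violating code.

(2) Declaring the array with the minimal legal size, 1 (your case). In this case, any attempt to take a pointer to p->s[0] and use it for pointer arithmetic that goes beyond p->s[1] is undefined behavior. For example, a debugging implementation is allowed to produce a special pointer with embedded range information, which will trap every time you attempt to create a pointer beyond p->s[1].

(3) Declaring the array with a "very large" size, such as 10000. The idea is that the declared size is supposed to be larger than anything you might need in actual practice. This method is free of UB with regard to array access range. However, in practice we will of course always allocate a smaller amount of memory (only as much as is really needed). I'm not sure about the legality of this, i.e. I wonder how legal it is to allocate less memory for the object than the declared size of the object (assuming we never access the "non-allocated" members).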
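The range-carrying pointer mentioned in case (2) can be pictured with a small model. This is only a sketch of the idea in ordinary C, with hypothetical names (struct checked_ptr, checked_add); it does not describe how any particular compiler or debugger implements bounds checking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "fat pointer": an address plus the bounds of its array. */
struct checked_ptr {
    char *base;     /* first element of the declared array */
    size_t count;   /* declared element count, e.g. 1 for char s[1] */
    size_t index;   /* current position, 0..count (one past the end is allowed) */
};

/* Models the pointer addition p + n.
   Returns 0 on success, or -1 where a checking implementation would trap. */
int checked_add(struct checked_ptr *p, size_t n)
{
    assert(p != NULL && p->index <= p->count);
    if (n > p->count - p->index)
        return -1;              /* result would pass one-past-the-end */
    p->index += n;
    return 0;
}
```

With count set to 1, advancing once reaches the one-past-the-end position and succeeds, while advancing again is rejected, which mirrors the limit at p->s[1] described above.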

score 4 · Accepted Answer

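The standard is quite clear that you cannot access things beyond the end of an array. (And going via pointers doesn't help, as you are not even allowed to increment a pointer past one after the end of an array.)

As for "working in practice": I've seen the gcc/g++ optimizer use this part of the standard and consequently generate wrong code when it meets this invalid C.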

score 1 · Accepted Answer

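If a compiler accepts something like

typedef struct {
  int whatever;
  char dat[];
};

I think it's pretty clear that it must be ready to accept a subscript on 'dat' beyond its length. On the other hand, if someone codes something like:

typedef struct {
  int whatever;
  char dat[1];
} MY_STRUCT;

and then later accesses somestruct->dat[x]; I would not think the compiler is under any obligation to use address-computation code which will work with large values of x. I think that if one wanted to be really safe, the proper paradigm would be more like:

#define LARGEST_DAT_SIZE 0xF000
typedef struct {
  int whatever;
  char dat[LARGEST_DAT_SIZE];
} MY_STRUCT;

and then do a malloc of (sizeof(MY_STRUCT)-LARGEST_DAT_SIZE + desired_array_length) bytes (bearing in mind that if desired_array_length is larger than LARGEST_DAT_SIZE, the results may be undefined).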
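The allocation described above can be sketched as follows. The helper my_struct_alloc is a hypothetical name, and the size check is an added assumption that refuses the case the answer warns about. Whether using an object that is only partially allocated is itself sanctioned by the standard is a separate point that this thread leaves open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LARGEST_DAT_SIZE 0xF000

typedef struct {
    int whatever;
    char dat[LARGEST_DAT_SIZE];
} MY_STRUCT;

/* Hypothetical helper: allocates only the part of MY_STRUCT needed for
   desired_array_length bytes of dat. Returns NULL if the request exceeds
   the declared array size or if malloc fails. */
MY_STRUCT *my_struct_alloc(size_t desired_array_length)
{
    /* Sanity check: the subtraction below cannot underflow. */
    assert(sizeof(MY_STRUCT) >= LARGEST_DAT_SIZE);
    if (desired_array_length > LARGEST_DAT_SIZE)
        return NULL;
    return malloc(sizeof(MY_STRUCT) - LARGEST_DAT_SIZE + desired_array_length);
}
```

Every subscript the caller uses must stay below desired_array_length; the declared bound of dat no longer protects against running off the end of the allocation.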

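Incidentally, I think the decision to forbid zero-length arrays was an unfortunate one (some older dialects like Turbo C support them), since a zero-length array could be regarded as a sign that the compiler must generate code that will work with larger indices.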
