c++ - Why are string literals parsed for trigraph sequences in GNU gcc/g++?

Question

Consider this innocuous C++ program:

#include <iostream>
int main() {
  std::cout << "(Is this a trigraph??)" << std::endl;
  return 0;
}

When I compile it with g++ version 5.4.0, I get the following diagnostic:

me@my-laptop:~/code/C++$ g++ -c test_trigraph.cpp
test_trigraph.cpp:4:36: warning: trigraph ??) ignored, use -trigraphs to enable [-Wtrigraphs]
   std::cout << "(Is this a trigraph??)" << std::endl;
                                     ^

The program runs, and its output is as expected:

(Is this a trigraph??)

Why are string literals parsed for trigraphs at all?

Do other compilers do this, too?

score 5 · Accepted Answer

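Trigraphs are handled in translation phase 1 (they were removed in C++17, however). Processing related to string literals happens in later phases. As the C++14 standard (n4140) specifies in [lex.phases]/1.1: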

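The precedence among the syntax rules of translation is specified by the following phases.

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences ([lex.trigraph]) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set ([lex.charset]) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)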

This happens first because, as you were told in the comments, the characters that trigraphs stand for needed to be printable as well.
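To make the ordering concrete, here is a minimal sketch of the trigraph part of phase 1. It is not how gcc implements it; `trigraph_replacement`, `replace_trigraphs` and `question_literal` are hypothetical names chosen for illustration. The point it shows is that the replacement is a plain left-to-right scan over characters, performed before tokens or string literals exist, so a `??)` inside quotes is treated like any other.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any compiler): given the third character
// of a candidate trigraph, return its replacement, or '\0' if there is none.
char trigraph_replacement(char third) {
    switch (third) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

// Sketch of the trigraph step of translation phase 1: scan the raw source
// text left to right and replace each trigraph. Nothing here knows about
// tokens, so text that will later become a string literal is not exempt.
std::string replace_trigraphs(const std::string& source) {
    std::string out;
    out.reserve(source.size());
    std::size_t i = 0;
    while (i < source.size()) {
        if (i + 2 < source.size() && source[i] == '?' && source[i + 1] == '?') {
            const char replacement = trigraph_replacement(source[i + 2]);
            if (replacement != '\0') {
                out += replacement;
                i += 3;  // consume all three characters of the trigraph
                continue;
            }
        }
        out += source[i];
        ++i;
    }
    return out;
}

// The text of the question's literal, spelled with the \? escape so that this
// sketch contains no trigraph itself, even if trigraph support is enabled.
std::string question_literal() {
    return "(Is this a trigraph?\?)";
}
```

Under these assumptions, `replace_trigraphs(question_literal())` yields the text ending in `trigraph]`, which is what a compiler with trigraph support enabled would print for the program in the question. Writing one of the question marks as `\?`, as `question_literal` does, is the usual way to keep such a literal intact.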

score 1 · Answer

This behaviour is inherited from C compilers and from the old days of serial terminals, when only 7 bits were used (the 8th being a parity bit). To support non-English languages with special characters (for example the accented àéèêîïôù in French or ñ in Spanish), the national variants of ISO/IEC 646 reassigned some of the 7-bit ASCII codes to those characters. In particular, the codes 0x23 and 0x24 (# and $ in ASCII), 0x40 (@), 0x5B to 0x5E ([\]^), 0x60 (`) and 0x7B to 0x7E ({|}~) could be replaced by national variants¹.

As several of these characters have special meaning in C, they could be written in source code as trigraphs, which use only the invariant part of ISO 646.

For compatibility reasons, trigraphs were kept up to C++14, by which time only dinosaurs still remembered the (not so good) days of ISO 646 and 7-bit-only code pages.
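As an illustration of how source code could be spelled with only invariant characters, the sketch below rewrites the nine affected characters as their trigraphs. `to_trigraphs` is a hypothetical helper written for this answer, not a tool that shipped with any compiler. Of the codes listed above, $, @ and the backquote have no trigraph because C source does not need them.

```cpp
#include <string>

// Hypothetical helper: rewrite source text so that the nine characters that
// national ISO 646 variants could reassign are spelled as trigraphs instead.
std::string to_trigraphs(const std::string& source) {
    std::string out;
    for (char c : source) {
        char third = '\0';  // third character of the trigraph, if c needs one
        switch (c) {
            case '#':  third = '=';  break;
            case '\\': third = '/';  break;
            case '^':  third = '\''; break;
            case '[':  third = '(';  break;
            case ']':  third = ')';  break;
            case '|':  third = '!';  break;
            case '{':  third = '<';  break;
            case '}':  third = '>';  break;
            case '~':  third = '-';  break;
            default:   break;
        }
        if (third != '\0') {
            // Emit the two question marks separately so that this sketch
            // contains no trigraph of its own.
            out += '?';
            out += '?';
            out += third;
        } else {
            out += c;
        }
    }
    return out;
}
```

With this helper, the text `a[0]` would come out as `a??(0??)`, which a trigraph-aware compiler maps back to the original in phase 1.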

¹ For example, the French variant used: 0x23 £, 0x40 à, 0x5B-0x5D °ç§, 0x60 µ, 0x7B-0x7E éùè¨
