c - "char*" with an unusual memory word size (Knuth's MIX architecture)

Question

The original MIX architecture features 6-bit bytes and memory is addressed as 31-bit words (5 bytes and a sign bit). As a thought exercise I'm wondering how the C language can function in this environment, given:

char has at least 8 bits (annex E of C99 spec)
C99 spec section 6.3.2.3 ("Pointers") paragraph 8 says "When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object." My interpretation of this requirement is that it underpins "memcpy(&dst_obj, &src_obj, sizeof(src_obj))".

Approaches that I can think of:

1. Make char 31 bits, so indirection through "char*" is simple memory access. But this makes strings wasteful (and means it isn't POSIX-compliant, as that apparently requires 8-bit chars).
2. Pack three 8-bit chars into one word, with 7 ignored bits: "char*" might be composed of a word address and a char index within it. However, this seems to violate 6.3.2.3, i.e. memcpy() would necessarily skip the ignored bits (which are probably meaningful for the real object type).
3. Fully pack chars into words, e.g. the fourth 8-bit char would have 7 bits in word 0 and one bit in word 1. However, this seems to require that all objects are sized in 8-bit chars, e.g. a "uint31_t" couldn't be declared to match the word length, since this again has the memcpy() problem.

So that seems to leave the first (wasteful) option of using 31-bit chars with all objects sized as multiples of char - am I correct in reading it this way?
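To make the fully packed layout concrete, here is a minimal sketch of the bit arithmetic, meant to be run on an ordinary host rather than on MIX. The names `locate_packed_char` and `struct char_span` are made up for illustration, and it assumes chars are laid out back to back starting at bit 0 of word 0.

```c
#include <assert.h>

enum { WORD_BITS = 31, CHAR_BITS = 8 };

/* Hypothetical helper type: where a packed char starts and how much of it
   does not fit in its starting word. */
struct char_span {
    unsigned word;   /* index of the word holding the first bit */
    unsigned bit;    /* bit position of the first bit within that word */
    unsigned spill;  /* number of bits that land in the following word */
};

/* Locate the k-th (0-based) 8-bit char in a stream of 31-bit words. */
static struct char_span locate_packed_char(unsigned k)
{
    unsigned first_bit = k * CHAR_BITS;
    struct char_span s;

    assert(WORD_BITS >= CHAR_BITS);
    s.word = first_bit / WORD_BITS;
    s.bit = first_bit % WORD_BITS;

    unsigned room = WORD_BITS - s.bit;  /* bits left in the starting word */
    s.spill = room >= CHAR_BITS ? 0 : CHAR_BITS - room;
    return s;
}
```

For the fourth char (k = 3) this gives word 0, bit 24 and a spill of 1, i.e. 7 bits in word 0 and one bit in word 1, as described in the list.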

score 5 · Accepted Answer

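I agree that implementing C on the MIX architecture could be a pain and, although I'm no language lawyer myself, it seems to me that you are right to point to your approach 1 as the only conforming one.

Anyway, the space wasted on strings is the least of your problems: you can get around it by resorting to a solution older than C itself, namely making each char represent multiple letters. For the MIX architecture you could, for example, devise a 7-bit encoding and pack 4 letters into each char: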

char hi[4];
hi[0] = 'hell';
hi[1] = 'o, w';
hi[2] = 'orld';
hi[3] = '\0';

printf("%s", hi);

// Whoops, we forgot the exclamation mark
putchar('!\n');

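This implementation looks odd but, according to Wikipedia, it was used in the first "Hello world" program. I looked through the Standard and found nothing that prevents it, even in C11. In particular, section 6.4.4.4 allows character constants and string literals to be encoded in an implementation-specific way.

EDIT:

This doesn't help with the other difficulties, the main one being that you can't use most of the machine's available instructions, since you can't address single bytes with native C types. You can, however, make use of bitfields this way: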
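As a rough illustration of the packing idea, here is a sketch that simulates it on an ordinary host, using a `uint32_t` to stand in for one 31-bit MIX char. The helpers `pack4` and `unpack4` are hypothetical names, and 7-bit ASCII with the first letter in the highest position is only an assumption standing in for whatever encoding the implementation would actually define.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 7-bit codes into the low 28 bits of one value,
   first letter in the most significant position (an assumed layout). */
static uint32_t pack4(const char letters[4])
{
    uint32_t word = 0;
    for (int i = 0; i < 4; i++) {
        unsigned code = (unsigned char)letters[i];
        assert(code < 128);           /* must fit in 7 bits */
        word = (word << 7) | code;
    }
    return word;
}

/* Recover the four letters from a value produced by pack4. */
static void unpack4(uint32_t word, char out[4])
{
    for (int i = 3; i >= 0; i--) {
        out[i] = (char)(word & 0x7F);
        word >>= 7;
    }
}
```

Four 7-bit letters need 28 bits, so the packed value always fits in the 31 bits available.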

typedef struct _bytes {
    unsigned int sign  : 1;
    unsigned int byte1 : 6; // EDIT: bitfields must be 
    unsigned int byte2 : 6; // declared as ints in standard C
    unsigned int byte3 : 6;
    unsigned int byte4 : 6;
    unsigned int byte5 : 6;
} bytes;

typedef union _native_type {
    char as_word;
    int as_int; // int = char; useful for standard library functions, etc.
    bytes as_bytes;
} native_type;

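Note that in C++, because of a clause in the strict aliasing rules, you must be careful to always access the char member between an access to the int member and one to the bytes member, since this snippet: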

native_type a, b;
a.as_int = 0xC11BABE;
b.as_bytes.byte4 = a.as_bytes.byte4; // Whoops

would produce undefined behavior: see here for details.
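If the union route feels too fragile, a different technique (plain shifts and masks instead of bitfields) avoids depending on bitfield layout, which is implementation-defined. The sketch below simulates it on a host with `uint32_t` holding the 30-bit magnitude of a word. The names `mix_byte` and `mix_set_byte` are hypothetical, and it assumes MIX's usual numbering where byte 1 is the most significant and byte 5 the least.

```c
#include <assert.h>
#include <stdint.h>

enum { BYTE_BITS = 6, BYTES_PER_WORD = 5 };

/* Return 6-bit byte n (1 = most significant, 5 = least significant)
   from the 30-bit magnitude of a word. */
static unsigned mix_byte(uint32_t magnitude, int n)
{
    assert(n >= 1 && n <= BYTES_PER_WORD);
    int shift = BYTE_BITS * (BYTES_PER_WORD - n);
    return (unsigned)((magnitude >> shift) & 0x3Fu);
}

/* Return a copy of magnitude with byte n replaced by value. */
static uint32_t mix_set_byte(uint32_t magnitude, int n, unsigned value)
{
    assert(n >= 1 && n <= BYTES_PER_WORD);
    assert(value < 64);               /* must fit in 6 bits */
    int shift = BYTE_BITS * (BYTES_PER_WORD - n);
    uint32_t mask = (uint32_t)0x3F << shift;
    return (magnitude & ~mask) | ((uint32_t)value << shift);
}
```

With this, copying byte 4 from one word to another is `b = mix_set_byte(b, 4, mix_byte(a, 4));`, with no type punning involved.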

score 2 · Accepted Answer

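The most practical approach would probably be to make int 30 bits, and have char be either 10 or 15 bits. Using a 10-bit char would allow ASCII text to be packed more tightly, but would increase the cost of indexing into char arrays because of the required divide-by-three. Storage of Unicode text could be fairly efficient with either 10-bit or 15-bit char. With a 15-bit char, about 30720 code points would take 15 bits, and the remainder would take 30. With a 10-bit char, 128 code points would take 10 bits, 65408 would take 20, and the remainder would take 30.

To mitigate the cost of the divide-by-three, it may be helpful to have each char* contain two words: one would identify the word containing the character, and the other would identify the offset, in characters, from the start of that word. Adding a constant offset to a pointer which is known to be normalized could use code like: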
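To show where the divide-by-three comes from, here is a small host-side sketch of indexing 10-bit chars stored three to a 30-bit word. The helpers `load_char10` and `store_char10` are hypothetical, `uint32_t` stands in for a MIX word, and putting slot 0 in the low-order bits is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHAR_BITS = 10, CHARS_PER_WORD = 3, CHAR_MASK = 0x3FF };

/* Read the i-th 10-bit char: the word index needs i / 3,
   and the slot within the word needs i % 3. */
static unsigned load_char10(const uint32_t *words, size_t i)
{
    size_t word = i / CHARS_PER_WORD;
    unsigned shift = (unsigned)(i % CHARS_PER_WORD) * CHAR_BITS;
    return (unsigned)((words[word] >> shift) & (uint32_t)CHAR_MASK);
}

/* Write the i-th 10-bit char, leaving its two neighbours untouched. */
static void store_char10(uint32_t *words, size_t i, unsigned value)
{
    assert(value <= (unsigned)CHAR_MASK);
    size_t word = i / CHARS_PER_WORD;
    unsigned shift = (unsigned)(i % CHARS_PER_WORD) * CHAR_BITS;
    uint32_t mask = (uint32_t)CHAR_MASK << shift;
    words[word] = (words[word] & ~mask) | ((uint32_t)value << shift);
}
```

Every access with a variable index pays for one division and one remainder by 3; with a 15-bit char the same two steps reduce to a shift and a mask.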

p += 5; // Becomes...
if (p.offset == 0) { p.offset=2; p.base+=1; } // 0+5 = 1 word + 2 chars
else { p.offset--; p.base+=2; }               // 1+5 or 2+5 = 2 words + 0 or 1 chars

Not pretty, but it would avoid the need for any "divide" step.
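For what it's worth, the constant-offset case can be checked against a straightforward division-based version. This is only a sketch: `struct char_ptr`, `add5` and `add_general` are names invented here, and it assumes three chars per word with the offset always kept in the range 0..2.

```c
#include <assert.h>

enum { CHARS_PER_WORD = 3 };

/* Hypothetical two-word char pointer: word address plus char offset. */
struct char_ptr {
    unsigned base;    /* which word */
    unsigned offset;  /* which char within the word, 0..2 */
};

/* p += 5 without any division: 5 chars is one word plus two chars. */
static struct char_ptr add5(struct char_ptr p)
{
    assert(p.offset < CHARS_PER_WORD);  /* pointer must be normalized */
    if (p.offset == 0) {
        p.offset = 2;   /* 0 + 5 = 1 word + 2 chars */
        p.base += 1;
    } else {
        p.offset--;     /* 1 + 5 or 2 + 5 = 2 words + 0 or 1 chars */
        p.base += 2;
    }
    return p;
}

/* Reference version for arbitrary n, using division and remainder. */
static struct char_ptr add_general(struct char_ptr p, unsigned n)
{
    unsigned linear = p.base * CHARS_PER_WORD + p.offset + n;
    struct char_ptr r;
    r.base = linear / CHARS_PER_WORD;
    r.offset = linear % CHARS_PER_WORD;
    return r;
}
```

A compiler could emit the branchy form whenever the added constant is known at compile time and fall back to the general form otherwise.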
