node.js - Reading a file with ascii encoding

Question

score 3 · Accepted Answer

The quick answer is that Node doesn't do any magic when converting from a Buffer to a string, whether it is ascii or utf8. Your utf8 string is totally invalid ascii, so I guess ideally it would throw an error, but obviously it doesn't. I would not expect the è\u0081µ since that is invalid ascii.

You can see in the Node source that the code for converting from a buffer to a string lives in the ...slice functions. The ascii and utf8 functions are identical, leading to the behavior you are seeing. These functions don't do anything fancy: they just take a sequence of bytes and convert it into a JS string, assuming that it is valid in that encoding.
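
To see the decoding side concretely, here is a small sketch that decodes the same three bytes both ways. It assumes a current Node release, where the documentation says 'ascii' decoding clears the high bit of every byte first, so the output will not match the older behavior described above. The `codes` helper is only there for illustration.

```javascript
// Bytes of the UTF-8 encoding of "聵" (U+8075).
const bytes = Buffer.from([0xe8, 0x81, 0xb5]);

const asUtf8 = bytes.toString('utf8');   // multi-byte sequence decoded as one character
const asAscii = bytes.toString('ascii'); // each byte has its high bit cleared first

// Print code points instead of raw strings, because the ascii result
// contains a control character that would not display well.
const codes = (s) => Array.from(s, (ch) => ch.codePointAt(0).toString(16));

console.log(codes(asUtf8));  // [ '8075' ]
console.log(codes(asAscii)); // [ '68', '1', '35' ]
```

Either way, no error is thrown for bytes outside the 7-bit range.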

The differences between the two encodings come from the AsciiWrite and Utf8Write functions in that file, which treat things differently.

For example:

new Buffer("聵", 'ascii') // <Buffer 75>
new Buffer("聵", 'utf8')  // <Buffer e8 81 b5>
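
`new Buffer(...)` is deprecated in current Node releases, so here is the same comparison as a sketch using `Buffer.from`. The variable names are mine, and I am assuming the write-side behavior is unchanged: 'ascii' keeps only the low byte of each code unit, while 'utf8' writes the full multi-byte sequence.

```javascript
const ch = '聵'; // U+8075

const asciiBytes = Buffer.from(ch, 'ascii'); // only the low byte survives: 0x75
const utf8Bytes = Buffer.from(ch, 'utf8');   // full three-byte UTF-8 sequence

console.log(asciiBytes); // <Buffer 75>
console.log(utf8Bytes);  // <Buffer e8 81 b5>
```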

As you saw from your tests, binary fits better with what you are looking for. binary goes through each individual byte in a buffer and returns a string where each code point has that byte value.

(new Buffer([0xe8, 0x81, 0xb5])).toString('binary').charCodeAt(0); // 0xe8
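
A slightly longer sketch of the same idea, assuming a current Node release where 'latin1' is the documented name for the 'binary' encoding. It shows that every byte maps to exactly one code unit, so the bytes can be recovered by encoding the string back with the same encoding.

```javascript
const original = Buffer.from([0xe8, 0x81, 0xb5]);

// 'latin1' is the current name for 'binary'; each byte becomes one code unit.
const str = original.toString('latin1');
console.log(str.length);                     // 3
console.log(str.charCodeAt(0).toString(16)); // e8

// Encoding the string back with the same encoding recovers the bytes.
const roundTrip = Buffer.from(str, 'latin1');
console.log(roundTrip.equals(original));     // true
```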

score 0 · Accepted Answer

It's a bug:

https://github.com/joyent/node/pull/4379

answered 2012-12-27T13:37:26.800

score -1 · Accepted Answer

Without knowing exactly what language that is, I am guessing Japanese (correct me if I am wrong). I believe it is purely coincidental that the characters you supplied happen to fall within the ascii standard. On Japanese character encodings:

However, Shift JIS has the unfortunate property that it often breaks any parser (software that reads the coded text) that is not specifically designed to handle it. For example, a text search method can get false hits if it is not designed for Shift JIS. EUC, on the other hand, is handled much better by parsers that have been written for 7-bit ASCII (and thus EUC encodings are used on UNIX, where much of the file-handling code was historically only written for English encodings). But EUC is not backwards compatible with JIS X 0201, the first main Japanese encoding. Further complications arise because the original Internet e-mail standards only support 7-bit transfer protocols. Thus JIS encoding was developed for sending and receiving e-mails.

In character set standards such as JIS, not all required characters are included, so gaiji (外字 "external characters") are sometimes used to supplement the character set. Gaiji may come in the form of external font packs, where normal characters have been replaced with new characters, or the new characters have been added to unused character positions. However, gaiji are not practical in Internet environments since the font set must be transferred with text to use the gaiji. As a result, such characters are written with similar or simpler characters in place, or the text may need to be written using a larger character set (such as Unicode) that supports the required character.

I would try with some more "exotic" characters, as your test will fail.
