1 Answer
It seems the message received via IMAP is provided with a combination of 2 different encodings:
- the actual string is encoded with the "quoted printable" encoding (https://en.wikipedia.org/wiki/Quoted-printable), I think because 8-bit characters have to be mapped onto a 7-bit representation when that information is transported via the IMAP channel (a TCP socket connection)
- the logical representation of the content (an email body), which is HTML with a <meta> tag declaring a Windows-1252 charset
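To make the two layers concrete, here is a minimal sketch that uses only Node.js built-ins. `qpToBytes` and `charsetFromMeta` are hypothetical helper names (not part of any library), the sample body is made up, and the regular expression is a simplification that only looks for a `charset=` value inside a `<meta>` tag. It also assumes a Node build with full ICU, so that `TextDecoder` knows `windows-1252`.

```javascript
// Layer 1: quoted-printable turns "=XX" escapes back into raw bytes.
// Hypothetical helper; assumes the input is 7-bit ASCII, as QP output should be.
function qpToBytes(qp) {
  const joined = qp.replace(/=\r?\n/g, ''); // drop soft line breaks
  const bytes = [];
  for (let i = 0; i < joined.length; i++) {
    const hex = joined.slice(i + 1, i + 3);
    if (joined[i] === '=' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // "=A3" -> 0xA3
      i += 2;
    } else {
      bytes.push(joined.charCodeAt(i));
    }
  }
  return Uint8Array.from(bytes);
}

// Layer 2: the HTML itself says which charset those bytes are in.
// Hypothetical helper; returns null when no charset is declared.
function charsetFromMeta(html) {
  const match = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// Made-up sample: note that "=" itself travels as "=3D" in quoted-printable.
const body =
  '<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dwindows-1252">' +
  '<p>=A3 =96</p>';

const bytes = qpToBytes(body);
// The markup is plain ASCII, so a byte-for-byte latin1 view is enough to find the tag.
const charset = charsetFromMeta(Buffer.from(bytes).toString('latin1')) || 'utf-8';
const text = new TextDecoder(charset).decode(bytes);

console.log(charset);
console.log(text);
```

The point of the sketch is the order: the quoted-printable step produces bytes, and only the charset step turns those bytes into characters.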
There is also an "issue" with these HTML chunks: they contain a lot of Windows-style line breaks (\r\n). I had to pre-process the string to deal with that, in my case by removing them.
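Removing every `\r\n` is the simplest option, but it also joins words that were separated only by a line break. A possible alternative, assuming you want to keep the real line breaks, is to strip only the quoted-printable soft line breaks (an `=` at the very end of a line) and normalise the rest to `\n`. `normaliseLineBreaks` is a hypothetical helper name and the sample string is made up; the `quoted-printable` package may already deal with soft line breaks in its `decode`, so check its documentation before adding this step.

```javascript
// Hypothetical helper: keep hard line breaks, drop soft ones.
function normaliseLineBreaks(qpText) {
  return qpText
    .replace(/=\r\n/g, '')   // soft line break: the encoded line simply continues
    .replace(/\r\n/g, '\n'); // hard line break: keep it, in Unix style
}

const sample = 'a long line that was wrap=\r\nped by the mailer\r\nsecond line\r\n';
const normalised = normaliseLineBreaks(sample);

console.log(JSON.stringify(normalised));
// prints "a long line that was wrapped by the mailer\nsecond line\n"
```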
The following MCVE should show the process of cleaning and validating the content of a string representing an email body:
const quotedPrintable = require('quoted-printable');
const windows1252 = require('windows-1252');
const inputStr = 'This should be a pound sign: =A3 \r\nand this should be a long dash: =96\r\n';
console.log(`The original string: "${inputStr}"`);
// 1. remove the Windows line breaks (\r\n)
const cleanedStr = inputStr.replace(/\r\n/g, '');
console.log(`The string without carriage returns: "${cleanedStr}"`);
// 2. decode the "quoted printable" encoding
const decodedQp = quotedPrintable.decode(cleanedStr);
console.log(`The decoded QP string: "${decodedQp}"`);
// 3. decode using the "windows-1252"
const windows1252DecodedQp = windows1252.decode(decodedQp);
console.log(`The windows1252 decoded QP string: "${windows1252DecodedQp}"`);
Which gives this output:
The original string: "This should be a pound sign: =A3
and this should be a long dash: =96
"
The string without carriage returns: "This should be a pound sign: =A3 and this should be a long dash: =96"
The decoded QP string: "This should be a pound sign: £ and this should be a long dash: "
The windows1252 decoded QP string: "This should be a pound sign: £ and this should be a long dash: –"
Notice the "long dash character" that is rendered differently before/after the Windows-1252 decoding phase.
Afaik, this had nothing to do with UTF-8 encoding/decoding. I was able to figure out the "decoding order" of the procedure from this: https://github.com/mathiasbynens/quoted-printable/issues/5
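A quick way to convince yourself that UTF-8 is not involved: the bytes 0xA3 and 0x96 are not valid UTF-8 on their own, so a UTF-8 decoder turns them into replacement characters, while a Windows-1252 decoder gives the expected symbols. This is only a sketch using Node's built-in `TextDecoder`, and it assumes a Node build with full ICU so that the `windows-1252` label is available.

```javascript
// The two bytes from the example above, separated by a space.
const rawBytes = Uint8Array.from([0xa3, 0x20, 0x96]);

const asUtf8 = new TextDecoder('utf-8').decode(rawBytes);
const asWindows1252 = new TextDecoder('windows-1252').decode(rawBytes);

console.log(JSON.stringify(asUtf8));        // two U+FFFD replacement characters around a space
console.log(JSON.stringify(asWindows1252)); // pound sign, space, en dash
```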
One thing I am not sure about is whether the operating system I am running this piece of code on has some sort of impact on the charsets/encodings of files or streams of strings.
The npm packages I have used are: