perl - Perl and reading files with different encodings

Question

I am using a Perl script to read in a file, but I'm not sure what encoding the file is in. My file is a list of book titles, and each book has other info associated with it (author, publication date, etc.), so each title sits inside a discrete chunk of data for that book. I iterate through the file line by line until a line matches the regular expression '/Book Title: (.*)/' and take what's in the parentheses. Then I create a separate .txt file named after the book title. However, on my Unix server, when I look at the name of the file, it's not, for example, 'LordOfTheFlies.txt' but rather 'LordOfTheFlies^M.txt'.

What is this '^M'? Is that a weird end-of-line encoding I'm not taking into account? I tried chomp, but it doesn't seem to work. What is the best file encoding for working with Perl?

score 5 · Accepted Answer

It is the extra carriage return that Windows inserts before the newline (M is the 13th letter of the alphabet, so ASCII 13 is displayed as ^M).

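It has nothing to do with the file encoding; it is just the line-ending convention biting you. Perl is usually good at handling line-ending characters properly, but if they show up somewhere other than the very end of the line, you have to deal with them yourself. Here the capture (.*) stops at the newline but still includes the carriage return in front of it. You can use s/\r// instead of chomp() to take it out.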
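As a minimal sketch of that fix, assuming the input has CRLF line endings: the sample lines and the @filenames array below are hypothetical, and the substitution is anchored with \z so that it only touches a carriage return at the end of the captured title.

```perl
use strict;
use warnings;

# Hypothetical sample input with Windows (CRLF) line endings.
my @lines = (
    "Author: William Golding\r\n",
    "Book Title: LordOfTheFlies\r\n",
);

my @filenames;
for my $line (@lines) {
    # "." never matches "\n", but it does match "\r",
    # so the capture ends with a carriage return.
    if ($line =~ /Book Title: (.*)/) {
        my $title = $1;
        $title =~ s/\r\z//;    # drop the trailing carriage return, if any
        push @filenames, "$title.txt";
    }
}

print "$_\n" for @filenames;
```

Without the substitution the file name would contain the carriage return, which is what the shell displays as ^M.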

score 0 · Accepted Answer

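Try chop instead of chomp. chomp only removes the trailing newline, whereas chop removes the last character whatever it is. s/\r// works too. As for your general question, you may want to use a module suited to the type of file you are working with; that tends to make life with Perl easier.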
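The difference between the two built-ins can be sketched as follows, assuming the default input record separator ("\n"). The variable names are illustrative only. Note that chop is unconditional, so it also eats a real character when no carriage return is present.

```perl
use strict;
use warnings;

my $raw = "LordOfTheFlies\r\n";

my $chomped = $raw;
chomp $chomped;    # removes the "\n" only; the "\r" is still there

my $chopped = $chomped;
chop $chopped;     # removes the last character, here the "\r"

my $clean = "LordOfTheFlies";
chop $clean;       # no "\r" to remove, so the final "s" is lost

printf "%d %d %d\n", length $chomped, length $chopped, length $clean;
```

That last case is why a substitution such as s/\r\z// is the safer choice when the input may or may not contain carriage returns.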

score 0 · Accepted Answer

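Before processing a file you need to know its encoding, which is decided by whoever produced the file.
"^M" is control-M, a carriage return, which is not needed in text files on Unix.
It looks like the file was created on Windows and transferred to Unix. The carriage returns also survive when a text file is transferred over FTP in binary mode rather than text (ASCII) mode.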
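If the file cannot be converted before it reaches the Unix machine, one option is to normalise the line endings while reading. The sketch below uses Perl's :crlf I/O layer, assuming a reasonably modern Perl with PerlIO. The temporary file and its contents are hypothetical stand-ins for the real book list.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small hypothetical sample with Windows (CRLF) line endings.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;    # write the bytes exactly as given
print {$out} "Book Title: LordOfTheFlies\r\n";
close $out or die "close failed: $!";

# The :crlf layer turns each CRLF pair into a plain "\n" while reading.
open my $in, '<:crlf', $path or die "cannot open $path: $!";
my @titles;
while (my $line = <$in>) {
    chomp $line;
    push @titles, $1 if $line =~ /Book Title: (.*)/;
}
close $in;

print "$_.txt\n" for @titles;
```

With the layer in place, chomp behaves as expected and the captured title no longer carries a carriage return.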
