linux - How to remove non UTF-8 characters from text file

Question

I have a bunch of Arabic, English, Russian files which are encoded in utf-8. Trying to process these files using a Perl script, I get this error:

Malformed UTF-8 character (fatal)

Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.

Is there any way to do it?

score 175 · Accepted Answer

This command:

iconv -f utf-8 -t utf-8 -c file.txt

will clean up your UTF-8 file, skipping all the invalid characters.

-f is the source format
-t the target format
-c skips any invalid sequence
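
Note that iconv writes the cleaned text to standard output and leaves file.txt itself untouched, so the result has to be redirected to a new file. A minimal sketch, assuming GNU iconv; the file names `dirty.txt` and `clean.txt` are made up for illustration:

```shell
# Build a sample file: valid text, one stray 0xFF byte (never valid in UTF-8), more valid text.
printf 'abc\377def\n' > dirty.txt

# -c drops the invalid byte. Output goes to stdout, so redirect it to a NEW file.
# iconv may exit with a non-zero status when it had to discard input,
# so the status is deliberately ignored here.
iconv -f utf-8 -t utf-8 -c dirty.txt > clean.txt || true

cat clean.txt
```

Do not redirect onto the input file (`> dirty.txt`): the shell truncates it before iconv gets to read it. Write to a temporary file and rename it afterwards instead.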

score 0 · Accepted Answer

iconv can do it:

iconv -f cp1252 foo.txt

answered 2012-12-08T04:50:33.283
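
This variant only helps if the "strange characters" are really bytes from a legacy single-byte encoding such as Windows-1252; running it on text that is already valid UTF-8 would double-encode the non-ASCII characters. A sketch under that assumption, with an explicit target encoding added and hypothetical file names (`legacy.txt`, `converted.txt`):

```shell
# Sample file containing "café" with é stored as the single cp1252 byte 0xE9.
printf 'caf\351\n' > legacy.txt

# Interpret the input as cp1252 and re-encode it as UTF-8.
# Without -t, iconv converts to the encoding of the current locale.
iconv -f cp1252 -t utf-8 legacy.txt > converted.txt

cat converted.txt
```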

score 0 · Accepted Answer

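Your method must read the file byte by byte and fully understand how characters are constructed from bytes. The simplest approach is to use an editor that can read anything but only outputs UTF-8 characters. TextPad is one choice.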
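
Before opening anything in an editor, it can help to find out which files actually contain malformed sequences. One way to sketch this is to let iconv do the byte-level validation: without -c it stops at the first invalid sequence and exits with a non-zero status. The file names `good.txt` and `bad.txt` are made up for illustration:

```shell
printf 'ok \303\251\n' > good.txt   # "é" as the valid two-byte sequence C3 A9
printf 'bad \351\n' > bad.txt       # lone E9 byte with no continuation bytes: not valid UTF-8

# Report each file as valid or malformed based on iconv's exit status.
for f in good.txt bad.txt; do
    if iconv -f utf-8 -t utf-8 "$f" > /dev/null 2>&1; then
        echo "$f: valid"
    else
        echo "$f: malformed"
    fi
done
```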

score 0 · Accepted Answer

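None of the methods here or in any other similar question worked for me. In the end, I just opened the file in Sublime Text 2, went to File > Reopen with Encoding > UTF-8, copied the entire content of the file into a new file and saved it.

It may not be the expected solution, but I am putting it here in case it helps anyone, since I struggled with this for hours.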

score -4 · Accepted Answer

cat foo.txt | strings -n 8 > bar.txt

will do the job.

answered 2013-10-29T15:32:06.533
