html - Perl encoding - saving a file as UTF-8

Question

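I have a script that downloads web pages, and I want to extract the text and store it in one uniform encoding (UTF-8 would be fine). Downloading (UserAgent), parsing (TreeBuilder), and text extraction all seem to work, but I am not sure I am saving the result correctly.

When I open the output files in, for example, Notepad++, they do not display correctly; the original HTML displays fine in a text editor.

The HTML files typically declare charset=windows-1256 or charset=UTF-8.

So I figured that if I can get UTF-8 working, the rest is just a matter of re-encoding. Here is some of what I have tried, assuming I have an HTML file saved to disk.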

my $tree = HTML::TreeBuilder->new;
$tree->parse_file("$inhtml");
$tree->dump;

The dump output, redirected from STDOUT into a .txt file, only displays correctly after I switch the text editor's encoding to UTF-8...

my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
if (utf8::is_utf8($formatter->format($tree))) {
    print "   Is UTF8\n";
}
else {
    print "   Not UTF8\n";
}

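This prints "Is UTF8" when the content declares UTF-8, and "Not UTF8" otherwise.

I have tried:

opening a file with ">" and ">:utf8"
binmode(MYFILE, ":utf8");
encode("utf8", $string); (where $string is the output of $formatter->format($tree))

But nothing seems to work properly.

Does any expert out there know what I am missing?

Thanks in advance!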

score 2 · Accepted Answer

This example may help you find what you need:

use strict;
use warnings;
use feature qw(say);
use HTML::TreeBuilder qw( );
use Object::Destroyer qw( );

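# Decode the input from its source charset; encode everything written as UTF-8.
open(my $fh_in,  "<:encoding(cp1252)", $ARGV[0]) or die $!;
open(my $fh_out, ">:encoding(UTF-8)",  $ARGV[1]) or die $!;

# Object::Destroyer calls $tree->delete when the wrapper goes out of scope.
my $tree = Object::Destroyer->new(HTML::TreeBuilder->new(), 'delete');
$tree->parse_file($fh_in);

# Write the trimmed text of the first <h1> element to the output file.
my $h1Element = $tree->look_down("_tag", "h1") or die "No <h1> element found";
my $h1TrimmedText = $h1Element->as_trimmed_text();
say($fh_out $h1TrimmedText);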
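
The same decode-on-input, encode-on-output idea can also be written out explicitly with the core Encode module, which may help with the windows-1256 pages mentioned in the question. The following is only a minimal sketch: the helper name to_utf8_bytes is hypothetical, and it assumes the source charset is already known (cp1256 is Encode's name for windows-1256).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes in a known charset into UTF-8 bytes.
sub to_utf8_bytes {
    my ($raw_bytes, $charset) = @_;
    # Bytes -> Perl character string; die on malformed input.
    my $chars = decode($charset, $raw_bytes, Encode::FB_CROAK);
    # Character string -> UTF-8 bytes, suitable for a ":raw" filehandle.
    return encode('UTF-8', $chars, Encode::FB_CROAK);
}

# 0xC8 is ARABIC LETTER BEH (U+0628) in windows-1256.
my $utf8 = to_utf8_bytes("\xC8", 'cp1256');
printf "%vX\n", $utf8;    # prints D8.A8
```

If the bytes are written through a handle that already has an :encoding(UTF-8) layer, as in the answer above, the encode step should be skipped; otherwise the text would be encoded twice.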

score -3 · Accepted Answer

I really like the module utf8::all (unfortunately it is not in core).

If you just use utf8::all and work only with UTF-8 files, you do not have to worry about I/O.
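
For readers who cannot install a non-core module, the core open pragma can approximate the I/O part of this. This is a rough sketch, not an equivalent of utf8::all (which, as far as I know, covers more than file handles); the temporary file is only there to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
# Default to UTF-8 for every two-argument-layer-free open() in this scope,
# and for STDIN, STDOUT, and STDERR.
use open qw(:std :encoding(UTF-8));

# Temporary file used only for this demonstration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

# No explicit layer here: the pragma supplies :encoding(UTF-8).
open(my $out, '>', $path) or die "Cannot write $path: $!";
print {$out} "\x{0628}";    # ARABIC LETTER BEH, one character
close($out) or die "Cannot close $path: $!";

# An explicit layer overrides the pragma, so this reads the raw bytes back.
open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close($in);

printf "%vX\n", $bytes;     # prints D8.A8
```

The single character is stored as the two bytes D8 A8, which is its UTF-8 encoding, without any explicit layer on the writing open call.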
