
I am picking up pieces of someone else's large project and trying to right the wrongs. The problem is, I'm just not sure what the correct ways are.

So, I am cURLing a bunch of HTML pages, then writing them to files with simple commands like:

$src = `curl http://google.com`;
open FILE, ">output.html";
print FILE $src;
close FILE;

Now I want those to be saved as UTF-8. What are they actually saved as? Then I read the HTML file back in using the same basic 'open' command, parse the HTML with regex calls, use string concatenation to build one big string, and write it to an XML file (using the same code as above). I have already started using XML::Writer instead, but now I must go through and fix the files that have inaccurate encoding.

So, I don't have the HTML anymore, but I still have the XML files, which have to display the proper characters. Here is an example: http://filevo.com/wkkixmebxlmh.html

The main problem is detecting and replacing the character in question with a "\x{2019}" that displays in editors properly. But I can't figure out a regex to actually capture the character in the wild.

UPDATE:

I still cannot detect the ALT-0146 character that's in the XML file I uploaded to Filevo above. I've tried opening it in UTF-8, and searching for /\x{2019}/, /chr(0x2019)/, and just /’/, nothing.


3 Answers


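Discovering the encoding of an HTML document is hard. See http://blog.whatwg.org/the-road-to-html-5-character-encoding, and note especially that it requires a "7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while."

This is what I used for my limited needs when parsing HTML files.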

my $CHARACTER_SET_CLASS = '\w:.()-';

     # X(HT)?ML: http://www.w3.org/International/O-charset
     /\<\?xml [^>]*(?<= )encoding=[\'\"]?([$CHARACTER_SET_CLASS]+)/ ||
     # X?HTML: http://blog.whatwg.org/the-road-to-html-5-character-encoding
     /\<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i ||
     # CSS: http://www.w3.org/International/questions/qa-css-charset
     /\@charset "([^\"]*)"/ ||
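The fragment above ends with a dangling `||`, so whatever followed it is not shown. As a minimal sketch of how the three visible patterns could be used on their own, the checks can be wrapped in a small subroutine. The name `sniff_charset` and the sample inputs are hypothetical, not part of the original answer, and this only finds a declared charset; it does not verify that the declaration is correct.

```perl
use strict;
use warnings;

# Characters allowed in a charset name (same class as in the answer above).
my $CHARACTER_SET_CLASS = '\w:.()-';

# Hypothetical helper: returns the declared charset name, or nothing if
# none of the three patterns matches.
sub sniff_charset {
    my ($head) = @_;
    for ($head) {
        # XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
        return $1 if /<\?xml [^>]*(?<= )encoding=['"]?([$CHARACTER_SET_CLASS]+)/;
        # HTML <meta> tag carrying a charset, in either the old or the new style
        return $1 if /<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i;
        # CSS @charset rule
        return $1 if /\@charset "([^"]*)"/;
    }
    return;
}

my $sample = '<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">';
print sniff_charset($sample), "\n";
```

It is probably enough to pass only the first kilobyte or so of the document, since these declarations are expected near the top.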
answered 2011-05-27T20:47:42.960

To make sure you are producing output in UTF-8, apply the utf8 layer to the output stream using binmode:

open FILE, '>output.html';
binmode FILE, ':utf8';

or do it in the 3-argument open call:

open FILE, '>:utf8', 'output.html';
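For illustration, here is a small self-contained sketch of the same idea with a lexical filehandle and error checking. It uses the `:encoding(UTF-8)` layer, which is a stricter relative of `:utf8`; that swap is my choice, not something the answer prescribes. The sample string and the read-back step are only there to show which bytes end up on disk.

```perl
use strict;
use warnings;

# Sample text containing U+2019 RIGHT SINGLE QUOTATION MARK.
my $text = "It\x{2019}s";

# 'output.html' is just the example file name used in this thread.
open my $out, '>:encoding(UTF-8)', 'output.html'
    or die "Cannot open output.html for writing: $!";
print {$out} $text;
close $out or die "Cannot close output.html: $!";

# Read the file back as raw bytes to see what was actually stored.
open my $in, '<:raw', 'output.html'
    or die "Cannot open output.html for reading: $!";
my $bytes = do { local $/; <$in> };
close $in;

# U+2019 should appear as the three bytes E2 80 99.
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```

Without an encoding layer, printing a string that contains characters above 0xFF makes Perl emit a "Wide character" warning, which is a useful hint that a layer is missing.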

Arbitrary input is trickier. If you are lucky, the HTML input will tell you its encoding early on:

wget http://www.google.com/ -O foo ; head -1 foo

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; 
charset=ISO-8859-1"><title>Google</title><script>window.google=
{kEI:"xgngTYnYIoPbgQevid3cCg",kEXPI:"23933,28505,29134,29229,29658,
29695,29795,29822,29892,30111,30174,30215,30275,30562",kCSI:
{e:"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562",ei:"xgngTYnYIoPbgQevid3cCg",expi:
"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562"},authuser:0,ml:function(){},kHL:"en",
time:function(){return(new Date).getTime()},

Ah, there it is: <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. Now you can go ahead and read the input as raw bytes, then find some way to decode those bytes using the known encoding. CPAN can help with this.
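As a rough sketch of that decoding step, the core Encode module can turn raw bytes into a Perl character string. The sample bytes below are made up, and treating them as Windows-1252 (`cp1252`) is an assumption: ALT-0146 corresponds to byte 0x92 in that code page, which maps to U+2019. Decoding the same byte as ISO-8859-1 would give U+0092 instead, which may be one reason a search for `\x{2019}` finds nothing.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Made-up raw bytes; \x92 is the curly apostrophe in Windows-1252.
my $bytes = "It\x92s a test";

# Assumption: the data is really Windows-1252, whatever the page declared.
my $text = decode('cp1252', $bytes);

# Show the code point of the third character after decoding.
printf "U+%04X\n", ord substr($text, 2, 1);

# Once decoded, the character can be matched as \x{2019}.
(my $ascii = $text) =~ s/\x{2019}/'/g;

# Encode back to UTF-8 bytes, e.g. for writing to a raw filehandle.
my $utf8 = encode('UTF-8', $text);
```

If the XML files were written after the bytes had already been misread, the stored sequence may differ from the raw 0x92 shown here, so it is worth inspecting the actual bytes in a hex dump before choosing a substitution.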

answered 2011-05-27T20:36:23.183
answered 2011-05-28T13:40:24.530