html - Strip most HTML tags, but format tables as ASCII

Question

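I pipe downloaded HTML to STDIN and then strip every tag except the table markup. I want to render tables from the remaining table, tr, and td instances, so that the tables end up "\t"- or "|"-delimited. ASCII-formatted tables would be fine too. Here is what I have so far, but it doesn't do the job: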

#!/usr/bin/perl -ws
use HTML::Scrubber;
use HTML::Entities qw(decode_entities);
use Text::Unidecode qw(unidecode);

my $HTMLinput = do {local $/; <STDIN>};

my $scrubber = HTML::Scrubber->new( allow => [ qw[ table tr td ] ] );

#this prints the text from the page, but without formatting tables in ASCII:
#print $scrubber->scrub($HTMLinput);

my $scrubber2 = $scrubber->scrub($HTMLinput);

#was hoping this would transform table, tr, and td-tagged content
#into ASCII-formatted tables, but it doesn't work:
print unidecode(decode_entities($scrubber2)), "\n";

#test page: http://www.w3schools.com/html/html_tables.asp
#curl http://www.w3schools.com/html/html_tables.asp | html.table.parser.pl

score 1 · Accepted Answer

Here is the solution I arrived at, thanks in part to user tjd:

#!/usr/bin/perl -ws
use HTML::Scrubber;

my $HTMLinput = do {local $/; <STDIN>};
my $scrubber = HTML::Scrubber->new( allow => [ qw[ table tr td ] ] );
print $scrubber->scrub($HTMLinput);

#test page: http://www.w3schools.com/html/html_tables.asp
#links -dump http://www.w3schools.com/html/html_tables.asp | html.table.parser.pl 

#needed: "links" program for bash (sudo yum install links)
#http://www.jikos.cz/~mikulas/links/

score 0 · Accepted Answer

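I wouldn't reinvent the wheel of creating tables in text. I would either pipe the output of a text browser such as links or w3m to a file/STDOUT, or use the Text::Table module to do the heavy lifting.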
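If neither a text browser nor Text::Table is available, the "\t"- or "|"-delimited output asked for in the question can be roughly approximated with core Perl alone. The following is only a minimal sketch, not what either answer above actually uses. It assumes the scrubbed markup contains nothing but bare <table>, <tr>, and <td> tags (no attributes, no nested tables), and the helper name table_rows_to_text is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Scrubber or any module):
# turn markup holding only bare <tr>/<td> tags into one delimited
# line of text per table row.
sub table_rows_to_text {
    my ($html, $sep) = @_;
    $sep = "\t" unless defined $sep;    # default to tab-delimited
    my @lines;
    # Each <tr>...</tr> pair is one row; /s lets . match newlines.
    while ($html =~ m{<tr>(.*?)</tr>}gis) {
        my $row = $1;
        my @cells;
        while ($row =~ m{<td>(.*?)</td>}gis) {
            my $cell = $1;
            $cell =~ s/\s+/ /g;         # collapse runs of whitespace
            $cell =~ s/^ | $//g;        # trim one leading/trailing space
            push @cells, $cell;
        }
        push @lines, join($sep, @cells) if @cells;
    }
    return join("\n", @lines);
}

# Sample input standing in for the output of $scrubber->scrub(...).
my $sample = "<table><tr><td>Jill</td><td>Smith</td><td>50</td></tr>\n"
           . "<tr><td> Eve </td><td>Jackson</td><td>94</td></tr></table>";
my $out = table_rows_to_text($sample, " | ");
print "$out\n";
```

Note that th is not in the allow list used above, so header-cell text would sit outside any td and be skipped by this sketch. For real-world pages a proper HTML parser is likely the safer choice than regular expressions.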
