ruby - Auto-repair unclosed HTML tags in Ruby

Question

I'm trying to convert an HTML page to Markdown using the reverse-markdown Ruby gem. Unfortunately, it fails:

/usr/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `rescue in parse': #<REXML::ParseException: Missing end tag for 'img' (got "td") (REXML::ParseException)

The source contains some IMG, INPUT, etc. tags that end with > instead of />.
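The failure can be reproduced without the full page. This is only a minimal sketch, assuming that reverse-markdown hands the markup to REXML as the backtrace suggests; the sample markup and the `parses?` helper are made up for illustration:

```ruby
require 'rexml/document'

# Hypothetical sample markup: the <img> is never closed, as in the failing page.
broken = '<table><tr><td><img src="a.png"></td></tr></table>'
# The same markup with the void element self-closed, XHTML style.
fixed  = '<table><tr><td><img src="a.png"/></td></tr></table>'

# Returns true if REXML accepts the markup as well-formed XML, false otherwise.
def parses?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts parses?(broken)
puts parses?(fixed)
```

Only the second string is well-formed XML, which is why the strict parser rejects the real page.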

I tried the tidy_ffi gem:

doc = Nokogiri::HTML(TidyFFI::Tidy.new(Nokogiri::HTML(page).to_html,
        :numeric_entities => 1,
        :output_html => 1,
        :merge_divs => 0,
        :merge_spans => 0,
        :join_styles => 0,
        :clean => 1,
        :indent => 1,
        :wrap => 0,
        :drop_empty_paras => 0,
        :literal_attributes => 1).clean)

But it made no difference. Any suggestions?

score 1 · Accepted Answer

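reverse-markdown actually assumes well-formed XHTML, the kind a Markdown processor generates. If that is not what you have, you might want to try the html2markdown gem. It uses Nokogiri for parsing and is probably more robust (disclaimer: I haven't used it).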

score -2 · Accepted Answer

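I made a gem that excerpts HTML: https://www.ruby-toolbox.com/gems/auto_excerpt Maybe you can use it, or look at the code it uses to do this? Not sure whether this answers the question here.

Actually, I just noticed that you call Nokogiri::HTML twice: Nokogiri::HTML(TidyFFI::Tidy.new(Nokogiri::HTML(page).to_html

I'm not sure whether the error you're getting comes from Nokogiri or from TidyFFI.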
