ruby-on-rails - Ruby regex help: using match to extract snippets of an HTML document

Question

I have an HTML document in this format:

<tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
    <tr>
        <td class="sectionHeader">Contact</td>
        <td class="sectionHeader">Phone</td>
        <td class="sectionHeader">Home</td>
        <td class="sectionHeader">Work</td>
    </tr>
    <tr valign="top">
        <td class="sectionContent"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>wmgussio@erols.com</span></td>
        <td class="sectionContent"><span>Mobile: </span><span>2404173223</span></td>
        <td class="sectionContent"><span>NY</span><br><span>New York</span><br><span>78642</span></td>
        <td class="sectionContent"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>
    </tr>

    <tr><td colspan="4"><hr class="contactSeparator"></td></tr>

    <tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
    <tr>
        <td class="sectionHeader">Contact</td>
        <td class="sectionHeader">Phone</td>
        <td class="sectionHeader">Home</td>
        <td class="sectionHeader">Work</td>
    </tr>
    <tr valign="top">
        <td class="sectionContent"><span>Screen Name:</span> <span>eddieOS</span><br><span>Email 1:</span> <span>osefo@wam.umd.edu</span></td>
        <td class="sectionContent"></td>
        <td class="sectionContent"><span></span></td>
        <td class="sectionContent"><span></span></td>
    </tr>

    <tr><td colspan="4"><hr class="contactSeparator"></td></tr>

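So it alternates: a block of contact information, then a "contact separator". I want to grab the contact information, so my first hurdle is getting the chunks between the contact separators. I have worked out the regular expression using Rubular. Here it is: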

/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/

You can check it on Rubular to verify that it isolates the chunks.

My bigger problem, however, is with the Ruby code. I use the built-in match function and print the results, but I don't get what I expect. Here is the code:

page = agent.get uri.to_s    
chunks = page.body.match(/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/).captures

chunks.each do |chunk|
   puts "new chunk: " + chunk.inspect
end

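Note that page.body is just the body of the HTML document fetched by Mechanize. The real HTML document is much larger but follows this format. The unexpected output is as follows: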

new chunk: "Bill Gussio</span></td></tr>\r\n\t<tr>\r\n\t\t<td class=\"sectionHeader\">Contact</td>\r\n\t\t<td class=\"sectionHeader\">Phone</td>\r\n\t\t<td class=\"sectionHeader\">Home</td>\r\n\t\t<td class=\"sectionHeader\">Work</td>\r\n\t</tr>\r\n\t<tr valign=\"top\">\r\n\t\t<td class=\"sectionContent\"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>wmgussio@erols.com</span></td>\r\n\t\t<td class=\"sectionContent\"><span>Mobile: </span><span>2404173223</span></td>\r\n\t\t<td class=\"sectionContent\"><span>NY</span><br><span>New York</span><br><span>78642</span></td>\r\n\t\t<td class=\"sectionContent\"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>\r\n\t</tr>\r\n\t\r\n\t<tr><td colspan=\"4\">"
new chunk: ">"

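Two things surprise me here:

1) There are not 2 matches containing blocks of contact information, even though I verified on Rubular that these chunks should be extracted.

2) All the \r\n\t (line feeds, tabs, etc.) show up in the matches.

Can anyone see the problem here?

Alternatively, if anyone knows of a good free AOL contacts importer, that would be great. I have been using Blackbook, but it keeps failing on AOL and I am trying to fix it. Unfortunately, AOL does not have a contacts API yet.

Thanks!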

score 4 · Accepted Answer

See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why this is a bad idea. Use an HTML parser instead.
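Separately from the parser advice, the two surprises in the question can be reproduced with plain Ruby. The sketch below is only an illustration on a shortened, made-up version of the sample markup (the names `html`, `CHUNK`, `first_only`, `all_chunks`, `names` and `squashed` are assumptions, not from the question). `String#match` stops at the first match, and `captures` lists the groups of that single match; in the original pattern the inner `(.|\s)` is itself a capture group holding the last character consumed, which is where the stray ">" comes from. `String#scan` returns every match instead. The `\r\n\t` sequences are real whitespace in the document that `inspect` displays in escaped form.

```ruby
# Shortened stand-in for page.body (assumption: same layout as the question's sample).
html = <<~HTML
  <tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
  <tr valign="top"><td class="sectionContent"><span>bhjiggy</span></td></tr>
  <tr><td colspan="4"><hr class="contactSeparator"></td></tr>
  <tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
  <tr valign="top"><td class="sectionContent"><span>eddieOS</span></td></tr>
  <tr><td colspan="4"><hr class="contactSeparator"></td></tr>
HTML

# The /m flag lets "." match newlines, so the extra (.|\s) group is not needed
# and the pattern has exactly one capture group.
CHUNK = /<tr><td colspan="4"><span class="fullName">(.*?)<hr class="contactSeparator">/m

first_only = html.match(CHUNK).captures  # groups of the first match only
all_chunks = html.scan(CHUNK).flatten    # the capture group of every match

# Each chunk starts with the contact's name, up to the first "<".
names = all_chunks.map { |chunk| chunk[/\A[^<]*/] }

# Collapse runs of whitespace (newlines, tabs) if they are unwanted.
squashed = all_chunks.map { |chunk| chunk.gsub(/\s+/, " ") }

puts first_only.size   # => 1
puts all_chunks.size   # => 2
puts names.inspect     # => ["Bill Gussio", "Eddie Osefo"]
```

This only explains the observed output; it does not make regex-based HTML extraction robust.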

score 3 · Accepted Answer

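If you are just extracting information from XML, it may be easier to use something other than regular expressions. XPath is a good tool for pulling information out of XML. I believe there are some libraries available for Ruby that support XPath; perhaps try REXML.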
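A minimal sketch of the XPath approach with REXML follows. It assumes the input is well-formed XML, which the page in the question is not (its `<br>` and `<hr>` tags are unclosed), so the fragment here is a hand-cleaned, hypothetical stand-in. The variable names are illustrative only.

```ruby
require "rexml/document"

# Hypothetical well-formed fragment; note the self-closed <hr/>.
xml = <<~XML
  <table>
    <tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
    <tr><td colspan="4"><hr class="contactSeparator"/></td></tr>
    <tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
  </table>
XML

doc = REXML::Document.new(xml)

# Select every span whose class attribute is "fullName" and read its text.
names = REXML::XPath.match(doc, "//span[@class='fullName']").map(&:text)

puts names.inspect  # => ["Bill Gussio", "Eddie Osefo"]
```

For real-world HTML that is not valid XML, a forgiving HTML parser is the safer choice, since REXML raises a parse error on unclosed tags.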

score 3 · Accepted Answer

Using an HTML parser such as hpricot will save you a lot of trouble :)

sudo gem install hpricot

It is mostly written in C, so it is fast too.

Here is how to use it:

http://wiki.github.com/why/hpricot/hpricot-basics

score 0 · Accepted Answer

Here is the code that parses that HTML. Feel free to suggest improvements:

    contacts = []
    email, mobile = "",""

    names = page.search("//span[@class='fullName']")

    # Every contact has a fullName node, so for each fullName node, we grab the chunk of contact info
    names.each do |n|

      # next_sibling.next_sibling skips:
      # <tr>
      #   <td class=\"sectionHeader\">Contact</td>
      #   <td class=\"sectionHeader\">Phone</td>
      #   <td class=\"sectionHeader\">Home</td>
      #   <td class=\"sectionHeader\">Work</td>
      # </tr>
      # to give us the actual chunk of contact information
      # then taking the children of that chunk gives us rows of contact info
      contact_info_rows = n.parent.parent.next_sibling.next_sibling.children

      # Iterate through the rows of contact info
      contact_info_rows.each do |row|

        # Iterate through the contact info in each row
        row.children.each do |info|
          # Get Email. There are two ".next_siblings" because space after "Email 1" element is processed as a sibling
          if info.content.strip == "Email 1:" then email = info.next_sibling.next_sibling.content.strip end

          # If the contact info has a screen name but no email, use screenname@aol.com
          if (info.content.strip == "Screen Name:" && email == "") then email = info.next_sibling.next_sibling.content.strip + "@aol.com" end

          # Get Mobile #'s
          if info.content.strip == "Mobile:" then mobile = info.next_sibling.content.strip end

          # Maybe we can try and get zips later.  Right now the zip field can look like the street address field
          # so we can not tell the difference.  There is no label node
          #zip_match = /\A\D*(\d{5})-?\d{4}\D*\z/i.match(info.content.strip) 
          #zip_match = /\A\D*(\d{5})[^\d-]*\z/i.match(info.content.strip)     
        end  

      end

      contacts << { :name => n.content, :email => email, :mobile => mobile }

      # clear variables
      email, mobile = "", ""
    end
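The email rule described in the comments above (use the "Email 1:" value when present, otherwise fall back to the screen name plus "@aol.com") can be isolated so it is easy to check on its own. This is only a sketch: `contact_email` and the hash layout are hypothetical and not part of the code above, and the second sample contact is made up to show the fallback.

```ruby
# Hypothetical helper: takes a hash of label => value pairs for one contact.
def contact_email(fields)
  email = fields["Email 1:"].to_s.strip
  return email unless email.empty?

  # No email on record: fall back to the AOL address derived from the screen name.
  screen_name = fields["Screen Name:"].to_s.strip
  screen_name.empty? ? "" : "#{screen_name}@aol.com"
end

with_email    = { "Screen Name:" => "bhjiggy", "Email 1:" => "wmgussio@erols.com" }
without_email = { "Screen Name:" => "eddieOS" }  # made-up case with no email

puts contact_email(with_email)     # => wmgussio@erols.com
puts contact_email(without_email)  # => eddieOS@aol.com
```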
