ruby - Using XPath to parse plain-text email addresses instead of //A[starts-with(@href, 'mailto:')]

Question

I want to extract email addresses from several different websites. If they are formatted as active links, I can use

//A[starts-with(@href, 'mailto:')]

But some of them are just plain text, like example@domain.com, rather than links, so I want to select a path to elements that contain @ inside

score 4 · Accepted Answer

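You probably want to use regular expressions. They let you extract email addresses regardless of their context in the document. Here is a small test-driven example to get you started: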

require "minitest/spec"
require "minitest/autorun"

module Extractor
  # Deliberately naive: word characters, "@", word characters, ".", word characters.
  EMAIL_REGEX = /\w+@\w+\.\w+/

  # Returns an array of matches, or false when the document contains none.
  def self.emails(document)
    (matches = document.scan(EMAIL_REGEX)).any? ? matches : false
  end
end

describe "Extractor" do
  it 'should extract an email address from plaintext' do
    emails = Extractor.emails("email@example.com")
    _(emails).must_include "email@example.com"
  end

  it 'should extract multiple email addresses from plaintext' do
    emails = Extractor.emails("email@example.com and email2@example2.com")
    # must_include checks one value per call; a second argument would be the failure message.
    _(emails).must_include "email@example.com"
    _(emails).must_include "email2@example2.com"
  end

  it 'should extract an email address from the href attribute of an anchor' do
    emails = Extractor.emails("<a href='mailto:email3@example3.com'>Email!</a>")
    _(emails).must_include "email3@example3.com"
  end

  it 'should extract multiple email addresses from both plaintext and within HTML' do
    emails = Extractor.emails("my@email.com OR <a href='mailto:email4@example4.com'>Email!</a>")
    _(emails).must_include "email4@example4.com"
    _(emails).must_include "my@email.com"
  end

  it 'should not extract an email address if there isn\'t one' do
    emails = Extractor.emails("email(at)address(dot)com")
    _(emails).must_equal false
  end

  # Fails on purpose: the naive regex only captures "address@domain.co".
  it "should extract email addresses" do
    emails = Extractor.emails("email.address@domain.co.uk")
    _(emails).must_include "email.address@domain.co.uk"
  end
end

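The last test fails because the regular expression does not account for many valid email addresses; here it stumbles on the dot in the local part and on the multi-label domain. See if you can use this as a starting point to come up with, or find, a better regular expression. To help build your regex, take a look at Rubular.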
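As one possible starting point for a better pattern, here is a minimal sketch that accepts dots, plus signs, and hyphens in the local part, and multi-label domains such as domain.co.uk. The BetterExtractor name and the character classes are assumptions made for illustration, not part of the original answer, and the pattern does not attempt to cover the full RFC 5322 address syntax. Unlike Extractor.emails, it returns an empty array rather than false when nothing matches, which keeps the return type uniform.

```ruby
# Hypothetical variant of the Extractor module; names and pattern are illustrative.
module BetterExtractor
  # Local part: letters, digits, and . _ % + -
  # Domain: one or more dot-terminated labels, then a final label of 2+ letters.
  # The group is non-capturing so String#scan returns whole matches.
  EMAIL_REGEX = /[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}/

  # Returns an array of unique address-like strings (empty when none are found).
  def self.emails(document)
    document.scan(EMAIL_REGEX).uniq
  end
end

puts BetterExtractor.emails("Write to email.address@domain.co.uk today.")
```

Swapping this regex into the test file should make the last test pass, but it is worth adding further cases (quoted local parts, internationalized domains) before relying on it.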

score 4 · Accepted Answer

I want to select a path to elements that contain @ inside

Use:

//*[contains(., '@')]

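Note that this also selects every ancestor of such an element, because . is the element's whole string value, including the text of all its descendants. It seems to me that what you really want is to select elements that have a text-node child containing "@". If so, use: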

//*[contains(text(), '@')]

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     <xsl:copy-of select=
        "//*[contains(text(), '@')] "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
 <body>
  <a href="xxx.com">xxx.com</a>
  <span>someone@xxx.com</span>
 </body>
</html>

the XPath expression is evaluated and the selected nodes are copied to the output:

<span>someone@xxx.com</span>
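Since the question is tagged ruby, the same XPath expression can also be evaluated from Ruby and combined with a regular expression to pull out just the addresses. The sketch below assumes REXML (shipped with standard Ruby installations as a bundled gem) and a well-formed XML/XHTML input; the XPathEmails module name and the simple regex are hypothetical and exist only for illustration. Other XPath-capable parsers should accept the same expression.

```ruby
require "rexml/document"

# Hypothetical helper; the module name and regex are illustrative assumptions.
module XPathEmails
  # Deliberately simple address pattern, not a full validator.
  EMAIL_REGEX = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

  # Expression from the answer: elements whose first text-node child contains "@".
  XPATH = "//*[contains(text(), '@')]"

  # Parses the XML string, selects matching elements with XPath, then
  # scans each element's first text node for address-like substrings.
  def self.extract(xml)
    doc = REXML::Document.new(xml)
    elements = REXML::XPath.match(doc, XPATH)
    elements.flat_map { |el| el.text.to_s.scan(EMAIL_REGEX) }.uniq
  end
end

sample = <<~DOC
  <html>
   <body>
    <a href="xxx.com">xxx.com</a>
    <span>someone@xxx.com</span>
   </body>
  </html>
DOC

puts XPathEmails.extract(sample)
```

In XPath 1.0, contains(text(), '@') only inspects the first text-node child of each element; to test every text child, the usual form is //*[text()[contains(., '@')]].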
