regex - Regular expression to extract the contents of a div tag

Question

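I'm having a bit of a brain freeze here, so I was hoping for some pointers. Essentially, I need to extract the contents of a specific div tag. Yes, I know regex is usually frowned upon for this, but it's a simple web-scraping application in which there are no nested divs.

I'm trying to match this: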

    <div class="entry">
      <span class="title">Some company</span>
      <span class="description">
      <strong>Address: </strong>Some address
        <br /><strong>Telephone: </strong> 01908 12345
      </span>
    </div>

The simple VB code is as follows:

    Dim myMatches As MatchCollection
    Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
    Dim wc As New WebClient
    Dim html As String = wc.DownloadString("http://somewebaddress.com")
    RichTextBox1.Text = html
    myMatches = myRegex.Matches(html)
    MsgBox(html)
    'Search for all the words in a string
    Dim successfulMatch As Match
    For Each successfulMatch In myMatches
        MsgBox(successfulMatch.Groups(1).ToString)
    Next

Any help would be greatly appreciated.

score 8 · Accepted Answer

Your regex works for your example. However, some improvements should be made:

    <div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>

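[^<>]* means "match any number of characters except angle brackets", which ensures that we don't accidentally break out of the tag we're in.

.*? (note the ?) means "match any number of characters, but as few as possible". This avoids a single match that runs from the first to the last <div class="entry"> tag in your page.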
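
To make the greedy/lazy difference concrete, here is a minimal sketch. The two-entry input string and the module name are made up for illustration; only the patterns come from this thread.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module GreedyVersusLazy
    Sub Main()
        ' Hypothetical input: two entry divs on one page.
        Dim html As String =
            "<div class=""entry"">first</div>" & vbLf &
            "<div class=""entry"">second</div>"

        ' Greedy .* runs on to the last </div>, so both entries collapse into one match.
        Dim greedyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>.*</div>", RegexOptions.Singleline)
        Console.WriteLine(greedyRegex.Matches(html).Count)   ' 1

        ' Lazy .*? stops at the first </div>, giving one match per entry.
        Dim lazyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
        For Each m As Match In lazyRegex.Matches(html)
            Console.WriteLine(m.Groups("content").Value)     ' first, then second
        Next
    End Sub
End Module
```

RegexOptions.Singleline is used in both cases so that . can also match the line feed between the two divs.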

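But your regex itself should still match something. Perhaps you're not using it correctly?

I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach: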

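    ' SubjectString is assumed to hold the downloaded HTML; ResultList is assumed to be a List(Of String).
    ' RegexOptions.Singleline lets . match newlines, so .*? can span a multi-line div body.
    Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
    Dim MatchResult As Match = RegexObj.Match(SubjectString)
    While MatchResult.Success
        ResultList.Add(MatchResult.Groups("content").Value)
        MatchResult = MatchResult.NextMatch()
    End While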

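I wouldn't recommend taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies: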

    <div[^<>]*class="entry"[^<>]*>\s*
    <span[^<>]*class="title"[^<>]*>\s*
    (?<title>.*?)
    \s*</span>\s*
    <span[^<>]*class="description"[^<>]*>\s*
    <strong>\s*Address:\s*</strong>\s*
    (?<address>.*?)
    \s*<strong>\s*Telephone:\s*</strong>\s*
    (?<phone>.*?)
    \s*</span>\s*</div>

Or (behold the joys of multiline strings in VB.NET):

    Dim RegexObj As New Regex(
        "<div[^<>]*class=""entry""[^<>]*>\s*" & Chr(10) & _
        "<span[^<>]*class=""title""[^<>]*>\s*" & Chr(10) & _
        "(?<title>.*?)" & Chr(10) & _
        "\s*</span>\s*" & Chr(10) & _
        "<span[^<>]*class=""description""[^<>]*>\s*" & Chr(10) & _
        "<strong>\s*Address:\s*</strong>\s*" & Chr(10) & _
        "(?<address>.*?)" & Chr(10) & _
        "\s*<strong>\s*Telephone:\s*</strong>\s*" & Chr(10) & _
        "(?<phone>.*?)" & Chr(10) & _
        "\s*</span>\s*</div>",
        RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

(Of course, now you need to store the results of MatchResult.Groups("title") and so on...)
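
As a rough sketch of reading those named groups, the following runs the same pattern against the sample markup from the question. The module and variable names are made up for illustration, and the pattern is written without extra whitespace so that only RegexOptions.Singleline is needed.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module ExtractEntryFields
    Sub Main()
        ' The sample markup from the question, joined with line feeds.
        Dim html As String =
            "<div class=""entry"">" & vbLf &
            "  <span class=""title"">Some company</span>" & vbLf &
            "  <span class=""description"">" & vbLf &
            "  <strong>Address: </strong>Some address" & vbLf &
            "    <br /><strong>Telephone: </strong> 01908 12345" & vbLf &
            "  </span>" & vbLf &
            "</div>"

        ' Same pieces as the long pattern above, concatenated into one string.
        Dim pattern As String =
            "<div[^<>]*class=""entry""[^<>]*>\s*" &
            "<span[^<>]*class=""title""[^<>]*>\s*(?<title>.*?)\s*</span>\s*" &
            "<span[^<>]*class=""description""[^<>]*>\s*" &
            "<strong>\s*Address:\s*</strong>\s*(?<address>.*?)" &
            "\s*<strong>\s*Telephone:\s*</strong>\s*(?<phone>.*?)" &
            "\s*</span>\s*</div>"

        Dim entryRegex As New Regex(pattern, RegexOptions.Singleline)
        Dim m As Match = entryRegex.Match(html)
        While m.Success
            ' Each named group is read by name from the current match.
            Console.WriteLine("Title:   " & m.Groups("title").Value)
            Console.WriteLine("Address: " & m.Groups("address").Value)
            Console.WriteLine("Phone:   " & m.Groups("phone").Value)
            m = m.NextMatch()
        End While
    End Sub
End Module
```

With this sample, the address group also picks up the trailing <br />, because that tag sits between the address text and the Telephone label. This is one more sign of how brittle the approach is.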

score 2 · Accepted Answer

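~~Try using RegexOptions.Multiline instead of RegexOptions.Singleline~~

Thanks to @Tim for pointing out that the above doesn't work... my bad.

@Tim's answer is a good one and should be the accepted answer, but the extra part that is stopping your code from working is that your pattern has no capturing parentheses, so there is no second group, Groups(1), to return.

Change...

    MsgBox(successfulMatch.Groups(1).ToString)

to...

    MsgBox(successfulMatch.Groups(0).ToString)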
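
A small sketch of why Groups(1) comes back empty, using a made-up one-line input and hypothetical variable names. Group 0 is always the whole match; numbered groups from 1 upward exist only when the pattern contains capturing parentheses.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupIndexDemo
    Sub Main()
        ' Hypothetical one-line input, just enough to produce a match.
        Dim html As String = "<div class=""entry"">Some company</div>"

        ' The question's pattern: no parentheses, so the only group is group 0.
        Dim noGroup As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
        Dim m As Match = noGroup.Match(html)
        Console.WriteLine(m.Groups.Count)        ' 1
        Console.WriteLine(m.Groups(0).Value)     ' the whole <div ...>...</div> text
        Console.WriteLine(m.Groups(1).Success)   ' False, and its Value is an empty string

        ' Adding parentheses creates group 1, which holds only the div's contents.
        Dim withGroup As New Regex("<div.*?class=""entry"".*?>(.*)</div>", RegexOptions.Singleline)
        Console.WriteLine(withGroup.Match(html).Groups(1).Value)   ' Some company
    End Sub
End Module
```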

score 0 · Accepted Answer

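Use this

    <div.*?class=""entry"".*?>(?<divBody>.*)</div>

and get the group named divBody.

Note, however, that this will not work if the string contains other div nodes (and it seems this can't be solved with a regex). For your solution, XSLT may be useful.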
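
To illustrate the caveat, here is a minimal sketch with a made-up input that contains a nested div and a following sibling div. Neither the greedy pattern nor a lazy variant captures exactly the entry's body.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OtherDivsDemo
    Sub Main()
        ' Hypothetical input: the entry holds an inner div and is followed by another div.
        Dim html As String =
            "<div class=""entry""><div>inner</div>tail</div><div class=""other"">x</div>"

        ' Greedy .* overshoots: it runs on to the last </div> in the string.
        Dim greedyRegex As New Regex("<div.*?class=""entry"".*?>(?<divBody>.*)</div>", RegexOptions.Singleline)
        Console.WriteLine(greedyRegex.Match(html).Groups("divBody").Value)
        ' prints: <div>inner</div>tail</div><div class="other">x

        ' Lazy .*? undershoots: it stops at the inner div's closing tag.
        Dim lazyRegex As New Regex("<div.*?class=""entry"".*?>(?<divBody>.*?)</div>", RegexOptions.Singleline)
        Console.WriteLine(lazyRegex.Match(html).Groups("divBody").Value)
        ' prints: <div>inner
    End Sub
End Module
```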

score 0 · Accepted Answer

Really good article. See the attached result from Eclipse below.

Answered 2015-10-13T03:51:44.640
