c# - Regex match not working

Question

I'm doing a very simple task: parsing a website, looking for

<tbody>this is what important for me</tbody>

and returning it, but I can't get it to work. When I do:

Regex.Matches(webData, @"<tbody>(.*?)</tbody>")

it gives me no results. However, this gives me two results:

Regex.Matches(webData, @"tbody")

But again, this

Regex.Matches(webData, @"tbody(.*?)tbody")

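gives me nothing (so I don't think escaping is the problem). I found (.*?) on this page and thought it would be easy to use, but I just can't get it to work.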

score 2 · Accepted Answer

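Regex is not recommended for parsing HTML

Regex is meant for patterns that occur regularly, and HTML is not regular in its format (XHTML excepted). For example, an HTML file is still valid even when some closing tags are omitted, and that can break your code.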
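As an aside on why the patterns in the question return nothing: a likely cause, assuming the tbody content on the real page spans several lines, is that `.` does not match a newline in .NET unless RegexOptions.Singleline is set. The sketch below illustrates this with a made-up webData string; the helper name ExtractTbodies and the sample markup are assumptions for illustration, not part of the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TbodyRegexDemo
{
    // Hypothetical helper: returns the text between each <tbody> and </tbody>.
    public static List<string> ExtractTbodies(string html)
    {
        var results = new List<string>();
        // Singleline lets '.' match '\n'; IgnoreCase also accepts <TBODY>.
        var matches = Regex.Matches(html, @"<tbody>(.*?)</tbody>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Assumed sample input: the tbody content contains line breaks.
        string webData = "<table><tbody>\n<tr><td>cell</td></tr>\n</tbody></table>";

        // Default options: '.' stops at the line break, so there is no match.
        Console.WriteLine(Regex.Matches(webData, @"<tbody>(.*?)</tbody>").Count); // prints 0

        // With Singleline, the multi-line body is captured.
        Console.WriteLine(ExtractTbodies(webData).Count); // prints 1
    }
}
```

Even with Singleline, this pattern still misses a tbody that carries attributes or has no closing tag, which is why a parser remains the safer choice.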

Use an HTML parser such as HtmlAgilityPack

With HtmlAgilityPack, you can use this code to retrieve the contents of every tbody

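// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// HtmlAgilityPack exposes element names in lowercase, so the XPath uses "tbody".
var tbodyList = doc.DocumentNode.SelectNodes("//tbody")
                   .Select(p => p.InnerText)
                   .ToList();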

tbodyList contains the text of every tbody in the whole document!
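One caveat with the snippet above: SelectNodes returns null rather than an empty collection when nothing matches, so chaining .Select directly onto it throws on a page without a tbody. A minimal null-safe sketch, assuming the HtmlAgilityPack NuGet package is referenced and the HTML is already in a string (the class and method names here are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TbodyParserDemo
{
    // Hypothetical helper: returns the inner text of every tbody element,
    // or an empty list when the document has none.
    public static List<string> ExtractTbodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes yields null (not an empty collection) when there is no match.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tbody");
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(n => n.InnerText).ToList();
    }

    public static void Main()
    {
        // Assumed sample input, not taken from a real page.
        string html = "<table><tbody>\n<tr><td>first</td></tr>\n</tbody></table>";
        foreach (string text in ExtractTbodyText(html))
        {
            Console.WriteLine(text.Trim());
        }
    }
}
```

Line breaks inside the tbody are not a problem here, since the parser works on the element tree rather than on lines of text.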

score 2 · Accepted Answer

To parse a web page, use a real HTML parser such as HtmlAgilityPack

string html = "<tbody>this is what important for me</tbody>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var text = doc.DocumentNode.Descendants("tbody").First().InnerText;

score 0 · Accepted Answer

I also recommend HtmlAgilityPack.

You can also use XPath ( http://www.w3schools.com/xpath/ )

In I4V's example:

var text = doc.DocumentNode.SelectSingleNode("//tbody").InnerText;
