c# - How to extract anchor tag links with a regular expression (REGEX - C#)

Question

So far I have this:

<a href="(http://www.imdb.com/title/tt\d{7}/)".*?>.*?</a>

C#:

ArrayList imdbUrls = matchAll(@"<a href=""(http://www.imdb.com/title/tt\d{7}/)"".*?>.*?</a>", html);
private ArrayList matchAll(string regex, string html, int i = 0)
{
  ArrayList list = new ArrayList();
  foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
    list.Add(m.Groups[i].Value.Trim());
  return list;
}

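I am trying to extract IMDb links from an HTML page. What is wrong with this regular expression?

The main idea is to search Google for a movie and then look through the results for a link to IMDb.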

score 1 · Accepted Answer

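Regular expressions are not a good choice for parsing HTML files. HTML is not strict, and its format is not regular.

Use Html Agility Pack. You can retrieve the links with HtmlAgilityPack using this code: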

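using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects every anchor tag that has an href attribute.
// SelectNodes returns null (not an empty collection) when nothing matches.
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
List<string> anchorImdbList = anchors == null
    ? new List<string>()
    : anchors.Select(a => a.Attributes["href"].Value)
             .Where(href => Regex.IsMatch(href, @"www\.imdb\.com"))
             .ToList();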

score 0 · Accepted Answer

You have to escape the forward slashes. Try:

<a href="(http:\/\/www.imdb.com\/title\/tt\d{7}\/)".*?>.*?<\/a>

If you need to parse HTML elements out of a complex page, regular expressions become very cumbersome. Try Html Agility Pack, as others have suggested.
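To see how the two patterns behave in .NET, here is a minimal sketch. The class name `ImdbLinkDemo`, the `MatchAll` helper, and the sample HTML string are made up for illustration; they are not from the question. It also escapes the dots in the host name, which the original pattern leaves unescaped. In this sketch, `Groups[0]` is the whole `<a ...>...</a>` match and `Groups[1]` is the captured URL, so the helper defaults to group 1. As far as .NET regex syntax goes, `\/` and `/` match the same character, so both patterns should capture the same URL here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImdbLinkDemo
{
    // The question's pattern, with the dots escaped (slashes left as-is).
    public const string Plain =
        @"<a href=""(http://www\.imdb\.com/title/tt\d{7}/)"".*?>.*?</a>";

    // The same pattern with every forward slash escaped, as in the answer.
    public const string Escaped =
        @"<a href=""(http:\/\/www\.imdb\.com\/title\/tt\d{7}\/)"".*?>.*?<\/a>";

    // Returns the value of capture group `group` for every match.
    // Group 0 is the whole match; group 1 is the first parenthesised part.
    public static List<string> MatchAll(string pattern, string html, int group = 1)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
            results.Add(m.Groups[group].Value.Trim());
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample input, not taken from a real results page.
        string html =
            "<p><a href=\"http://www.imdb.com/title/tt0111161/\" class=\"l\">The Shawshank Redemption</a></p>";

        foreach (string url in MatchAll(Plain, html))
            Console.WriteLine("plain:   " + url);
        foreach (string url in MatchAll(Escaped, html))
            Console.WriteLine("escaped: " + url);

        // Group 0 returns the entire anchor element rather than the URL.
        foreach (string whole in MatchAll(Plain, html, 0))
            Console.WriteLine("group 0: " + whole);
    }
}
```

If the helper is called with group 0, as the default argument in the question's `matchAll` does, the list holds whole anchor elements instead of URLs, which may be part of what looked wrong.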

score 0 · Accepted Answer

Try this:

string tag = "tag of the link";
string emptystring = Regex.Replace(tag, "<.*?>", string.Empty);

Update:

string emptystring = Regex.Replace(tag, @"<[^>]*>", string.Empty);
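A short sketch of what the updated replacement does, assuming a sample anchor string invented for illustration (`StripTagsDemo` and `StripTags` are hypothetical names). Note that removing the tags leaves the text between them, so this yields the link text rather than the `href` value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // Removes everything that looks like a tag: "<", any run of non-">" characters, ">".
    public static string StripTags(string input)
    {
        return Regex.Replace(input, @"<[^>]*>", string.Empty);
    }

    public static void Main()
    {
        // Hypothetical input standing in for "tag of the link".
        string tag = "<a href=\"http://www.imdb.com/title/tt0111161/\">The Shawshank Redemption</a>";
        Console.WriteLine(StripTags(tag)); // prints "The Shawshank Redemption"
    }
}
```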
