c# - Using Regex.Match in C# to retrieve a file name from a website's source

Question

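I am trying to use Regex.Match to retrieve a file name from a website's source. I already have something similar that retrieves the page title: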

string title = Regex.Match(f, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

The string f holds the source of the page I am reading.

So what I need is to retrieve the file name from this source:

<br><p><b>Download:</b> 24 hours<br><b>Time Left for Download:</b> <span id='cd'></span></p><p>Click on the file name to begin download.</p><div class='linkbox'><ul><li><a href="http://site.com/file/y8Qi2Bw8SXPX/51423">blabla.pdf</a></li></div></ul>
<a id="facebookbtn-link" title="send to Facebook" href="http://www.facebook.com/sharer.php?u=http://site.com/product/komM8k" onclick="return popup(this)" ><img src="http://site/img/facebook.png" alt="Facebook" />Post on Facebook</a>

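I need to retrieve blabla.pdf. The problem is that the page keeps updating the file name, so it will not be the same name every time. What I really need is the text of that link, wherever >blabla.pdf currently appears.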

score 2 · Accepted Answer

To elaborate on SLaks's answer: there is a package called HTML Agility Pack, and it is available as a NuGet package.

There is an example here: http://htmlagilitypack.codeplex.com/wikipage?title=Examples
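A minimal sketch of how this could look for the markup in the question, assuming the HtmlAgilityPack NuGet package is installed. The class name `LinkBoxReader`, the method name `GetFileName`, and the XPath expression are illustrative choices made here, not something taken from the linked examples page.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class LinkBoxReader
{
    // Returns the text of the first link inside <div class='linkbox'>,
    // or null when no such link is found.
    // Assumption: the download link always lives inside that div.
    public static string GetFileName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//div[@class='linkbox']//a");
        return link == null ? null : link.InnerText.Trim();
    }

    public static void Main()
    {
        // Simplified, well-formed version of the markup from the question.
        string html = "<div class='linkbox'><ul><li>" +
                      "<a href=\"http://site.com/file/y8Qi2Bw8SXPX/51423\">blabla.pdf</a>" +
                      "</li></ul></div>";

        Console.WriteLine(GetFileName(html));
    }
}
```

Because the lookup is keyed on the `linkbox` div rather than on the file name, it should keep working when the name changes, as long as the page layout stays the same.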

score 0 · Accepted Answer

Try this pattern:

<a href="[^>]+>(.+?)</a>

The captured group ($1) should contain the file name.
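As a rough sketch, the pattern above could be used from C# as follows. The class and method names are hypothetical, and the sample HTML is a shortened copy of the source in the question. Note that the double quote in the pattern has to be doubled inside a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorTextExample
{
    // Returns the text of the first <a href="..."> link, or null if none matches.
    public static string FirstLinkText(string html)
    {
        Match m = Regex.Match(html, @"<a href=""[^>]+>(.+?)</a>");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = @"<div class='linkbox'><ul><li><a href=""http://site.com/file/y8Qi2Bw8SXPX/51423"">blabla.pdf</a></li></div></ul>
<a id=""facebookbtn-link"" href=""http://www.facebook.com/sharer.php"">Post on Facebook</a>";

        Console.WriteLine(FirstLinkText(html)); // prints blabla.pdf
    }
}
```

The pattern only matches anchors whose first attribute is `href`, and `Regex.Match` returns the first such anchor on the page, so this relies on the download link appearing before any other link of that shape.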

answered 2012-10-24T20:46:25.353

score 0 · Accepted Answer

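Since you are not doing markup processing but looking for a specific anchored pattern, I believe Regex is a good tool to use in this situation. Here is a pattern that does the job.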

string data = @"<br><p><b>Download:</b> 24 hours<br><b>Time Left for Download:</b>
<span id='cd'></span></p><p>Click on the file name to begin download.</p><div class='linkbox'><ul><li>
<a href=""http://site.com/file/y8Qi2Bw8SXPX/51423"">blabla.pdf</a></li></div></ul>
<a id=""facebookbtn-link"" title=""send to Facebook""
href=""http://www.facebook.com/sharer.php?u=http://site.com/product/komM8k""
onclick=""return popup(this)"" ><img src=""http://site/img/facebook.png"" alt=""Facebook"" />Post on Facebook</a>";


// Requires: using System; using System.Text.RegularExpressions;
Console.WriteLine(Regex.Match(data, @"(?:\>)(?<PDF>[^\.]+\.pdf)(?:\<)").Groups["PDF"].Value);

// Prints: blabla.pdf

Edit: to match any file, use the following (note that the named group is renamed from PDF to File):

Regex.Match(data, @"(?:\>)(?<File>[^\.]+\.[a-z]{3})(?:\</a\>)").Groups["File"].Value
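The pattern above expects exactly three lowercase letters in the extension. The following is a hedged variation, not the answerer's pattern: it accepts extensions of any length, keeps the match within a single text node by excluding `<` and `>`, and checks `Success` before reading the group. The names `FileNameExtractor` and `Extract` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNameExtractor
{
    // Assumption: the file name is link text of the form "name.ext",
    // followed directly by the closing </a> tag.
    private static readonly Regex FilePattern =
        new Regex(@">(?<File>[^<>]+\.[A-Za-z0-9]+)</a>");

    // Returns the first file name found, or null when nothing matches.
    public static string Extract(string html)
    {
        Match m = FilePattern.Match(html);
        return m.Success ? m.Groups["File"].Value : null;
    }

    public static void Main()
    {
        string html = @"<li><a href=""http://site.com/file/y8Qi2Bw8SXPX/51423"">blabla.pdf</a></li>
<a id=""facebookbtn-link"" href=""http://www.facebook.com/sharer.php"">Post on Facebook</a>";

        Console.WriteLine(Extract(html)); // prints blabla.pdf
    }
}
```

Link text without a dot, such as "Post on Facebook", is skipped because the pattern requires an extension.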
