c# - How to extract all URLs from HTML text using Html Agility Pack

Question

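I often use regular expressions to extract file names from HTML text, but I have heard that Html Agility Pack is well suited to parsing HTML. How can I use Html Agility Pack to extract all URLs from HTML data? Can anyone guide me with sample code? Thanks.

Here is my sample code, which works fine.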

using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

// Returns the file names of all src values that are not absolute http:// URLs.
private ArrayList GetFilesName(string Source)
{
    ArrayList arrayList = new ArrayList();
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(Source))
    {
        if (!match.Value.StartsWith("http://"))
        {
            arrayList.Add(Path.GetFileName(match.Value));
        }
    }
    return arrayList;
}

// Rewrites every src value so that it points at images/<file name>.
private string ReplaceSrc(string Source)
{
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(Source))
    {
        string value = match.Value;
        string str = string.Concat("images/", Path.GetFileName(value));
        Source = Source.Replace(value, str);
    }
    return Source;
}

score 2 · Accepted Answer

Something like:

var doc = new HtmlDocument();
doc.LoadHtml(html);

var images = doc.DocumentNode.Descendants("img")
    .Where(i => i.GetAttributeValue("src", null) != null)
    .Select(i => i.Attributes["src"].Value);

This selects all <img> elements in the document that have a src attribute set and returns those URLs.
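
The question asks for all URLs, while the snippet above only covers img elements. A minimal sketch of one way to widen it is shown here, assuming that "URL" means the value of any href or src attribute. The class name UrlExtractor, the method ExtractUrls, and the attribute list are illustrative choices, not part of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class UrlExtractor
{
    // Assumption: only these two attributes are treated as URL carriers.
    // Extend the list (for example with "action" or "poster") if needed.
    private static readonly string[] UrlAttributes = { "href", "src" };

    public static List<string> ExtractUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()
            // Text and comment nodes carry no attributes, so skip them.
            .Where(node => node.NodeType == HtmlNodeType.Element)
            // Read each candidate attribute; a missing one yields "".
            .SelectMany(node => UrlAttributes
                .Select(name => node.GetAttributeValue(name, string.Empty)))
            .Where(value => !string.IsNullOrWhiteSpace(value))
            .Distinct()
            .ToList();
    }
}
```

Descendants() never returns null, so no null check is needed before the LINQ chain. The values come back exactly as written in the markup, so relative URLs stay relative.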

score 0 · Accepted Answer

Select all img tags that have a non-empty src attribute (otherwise you would get a NullReferenceException when reading the attribute value):

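HtmlDocument html = new HtmlDocument();
html.Load(path_to_file);
// SelectNodes returns null when nothing matches, so guard before using LINQ.
var nodes = html.DocumentNode.SelectNodes("//img[@src!='']");
var urls = nodes == null
    ? Enumerable.Empty<string>()
    : nodes.Select(i => i.Attributes["src"].Value);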
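
For comparison with the regex helpers in the question, here is a rough sketch of how the same two tasks might look with Html Agility Pack. The class and method names (ImageHelpers, GetLocalFileNames, RewriteSrc) are hypothetical, and the sketch also treats https:// as absolute, which the original code did not.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class ImageHelpers
{
    // Counterpart of GetFilesName: file names of img src values
    // that are not absolute http/https URLs.
    public static List<string> GetLocalFileNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("img")
            .Select(img => img.GetAttributeValue("src", string.Empty))
            .Where(src => src.Length > 0)
            .Where(src => !src.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
                       && !src.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            .Select(src => Path.GetFileName(src))
            .ToList();
    }

    // Counterpart of ReplaceSrc: point every img src at images/<file name>
    // and return the modified markup.
    public static string RewriteSrc(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Materialize the list first so the document is not enumerated
        // while it is being modified.
        foreach (var img in doc.DocumentNode.Descendants("img").ToList())
        {
            var src = img.GetAttributeValue("src", string.Empty);
            if (src.Length == 0) continue;
            img.SetAttributeValue("src", "images/" + Path.GetFileName(src));
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Unlike the string-based Replace in the question, RewriteSrc only touches src attributes on img elements, so identical text elsewhere in the page is left alone. The serialized output may differ from the input in minor ways such as quoting.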
