c# - How can I use a regular expression to parse an HTML document and find the og:image tag?

Question

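I have programmatically downloaded the contents of a web page and stored it in a string variable. What is the best way to find the content URL of the "og:image" meta tag?

For example, suppose a snippet from the page's view-source looks like this: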

<meta property="og:site_name" content="The Christian Science Monitor"  />
<meta property="og:type" content="article"  />
<meta property="og:url" content="http://www.csmonitor.com/Business/2013/0729/Cannes-jewel-heist-53-million-in-diamonds-jewels-stolen-from-hotel"  />
<meta property="og:description" content="Cannes jewel heist saw $53 million in diamonds and other precious gems stolen from a hotel on the French Riviera. The Cannes jewel heist is the latest in a series of several brazen jewelry thefts in Europe in recent years."  />
<meta property="og:image" content="http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg"  />
<meta property="og:title" content="Cannes jewel heist: $53 million in diamonds, jewels stolen from hotel"  />
<meta name="sailthru.author" content="Thomas Adamson"  />

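I want to extract the string "http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg", which is the target of the "og:image" tag.

I could write some logic to find the substring and take it from there, but I would like to do this with regular-expression syntax similar to this: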

List<Uri> links = new List<Uri>();
string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";

MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);

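That last example scans the page source and extracts all image tags. I want to do the same for the og:image tag, but I am not very proficient with regular expressions.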

score 0 · Accepted Answer

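I don't think you should use a regular expression here; it can get a bit wonky depending on how the tag is written in the HTML. For example, content= might come before property=. I used some plain string-handling code instead, because I didn't want to pull in an HTML or XML parser add-on. This is what I ended up doing.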

// strIn holds the downloaded HTML source as a string
Dictionary<string, string> metatags = new Dictionary<string, string>();
int TagStart,TagEnd;
string element;
int AttrStart, AttrEnd;
string PropVal,ContentVal;
TagStart = strIn.IndexOf("<meta", StringComparison.OrdinalIgnoreCase);
while(TagStart != -1) {
    TagEnd = strIn.IndexOf(">", TagStart + 1, StringComparison.OrdinalIgnoreCase);
    if (TagEnd != -1) {
        element = strIn.Substring(TagStart, TagEnd - TagStart + 1);
        //Console.WriteLine("\nPROCESSING META TAG: {0}",element);
        PropVal = null;
        ContentVal = null;

        // Get "property" attribute
        AttrStart = element.IndexOf("property=\"", StringComparison.OrdinalIgnoreCase);
        if (AttrStart != -1) {
            AttrStart = AttrStart + 10;
            AttrEnd = element.IndexOf("\"", AttrStart, StringComparison.OrdinalIgnoreCase);
            if(AttrEnd != -1) {
                PropVal = element.Substring(AttrStart, AttrEnd - AttrStart);
            }
        }
        // Get "content" attribute
        AttrStart = element.IndexOf("content=\"", StringComparison.OrdinalIgnoreCase);
        if(AttrStart != -1) {
            AttrStart = AttrStart + 9;
            AttrEnd = element.IndexOf("\"", AttrStart, StringComparison.OrdinalIgnoreCase);
            if(AttrEnd != -1) {
                ContentVal = element.Substring(AttrStart, AttrEnd - AttrStart);
            }
        }
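        // Keep the first value for a repeated property; Dictionary.Add throws on a duplicate key
        if (PropVal != null && ContentVal != null && !metatags.ContainsKey(PropVal))
            metatags.Add(PropVal, ContentVal);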

    }
    // go to next meta tag
    TagStart = strIn.IndexOf("<meta", TagStart + 1, StringComparison.OrdinalIgnoreCase);
}
Console.WriteLine("\nOG meta tags");
foreach(var item in metatags) {
    Console.WriteLine("KEY={0} VALUE={1}",item.Key,item.Value);
}
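
If a regular expression is still preferred, the attribute-order problem mentioned above can be worked around with a lookahead. The following is only a rough sketch, not a robust HTML parser: the class name OgMeta and the method FindOgImage are made up for illustration, and the pattern assumes quoted attribute values and no ">" character inside any attribute of the tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class OgMeta
{
    // The lookahead checks that property="og:image" occurs somewhere inside
    // the same <meta ...> tag, so content= may come before or after it.
    // Assumption: attribute values are quoted and contain no '>' character.
    private static readonly Regex OgImageTag = new Regex(
        @"<meta\b(?=[^>]*\bproperty\s*=\s*[""']og:image[""'])" +
        @"[^>]*\bcontent\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the content value of the first og:image meta tag, or null if none is found.
    public static string FindOgImage(string html)
    {
        Match m = OgImageTag.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```

Calling OgMeta.FindOgImage(strIn) on the snippet from the question should return the 0729-jewels.jpg URL. The returned value is the raw attribute text, so entities such as &amp; in a URL would still need decoding.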
