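I have programmatically downloaded the contents of a web page and stored them in a string variable. What is the best way to find the content URL of the "og:image" meta tag?
For example, suppose a snippet from the page's view-source output looks like this: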
<meta property="og:site_name" content="The Christian Science Monitor" />
<meta property="og:type" content="article" />
<meta property="og:url" content="http://www.csmonitor.com/Business/2013/0729/Cannes-jewel-heist-53-million-in-diamonds-jewels-stolen-from-hotel" />
<meta property="og:description" content="Cannes jewel heist saw $53 million in diamonds and other precious gems stolen from a hotel on the French Riviera. The Cannes jewel heist is the latest in a series of several brazen jewelry thefts in Europe in recent years." />
<meta property="og:image" content="http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg" />
<meta property="og:title" content="Cannes jewel heist: $53 million in diamonds, jewels stolen from hotel" />
<meta name="sailthru.author" content="Thomas Adamson" />
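I want to extract the string "http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg", which is the target of the "og:image" tag.
I could write some logic in code to find the substring and pull the value out from there, but I would like to do this with regular expression syntax similar to the following: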
List<Uri> links = new List<Uri>();
string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
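That last example takes the web page source and matches all of the image tags, capturing each src value. I would like to do the same for the og:image tag, but I am not very proficient with regular expressions.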
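To make the goal concrete, here is a rough sketch of the kind of thing I imagine, adapted from the img pattern above. I am not sure it is right. It assumes the property attribute comes before content inside the tag and that both values are quoted, which matches the snippet above but may not hold for every page. The names OgImageSketch and FindOgImage are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OgImageSketch
{
    // Assumption: property="og:image" appears before content="..." in the same tag,
    // and both attribute values are wrapped in single or double quotes.
    private const string Pattern =
        @"<meta\b[^>]*?\bproperty\s*=\s*[""']og:image[""'][^>]*?\bcontent\s*=\s*[""']([^""']+)[""'][^>]*>";

    // Returns the content value of the first og:image meta tag, or null if none matches.
    public static string FindOgImage(string htmlSource)
    {
        Match m = Regex.Match(htmlSource, Pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Two tags taken from the snippet in the question.
        string htmlSource =
            "<meta property=\"og:type\" content=\"article\" />\n" +
            "<meta property=\"og:image\" content=\"http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg\" />";

        // Prints the captured og:image URL.
        Console.WriteLine(FindOgImage(htmlSource));
    }
}
```

Because `[^>]*?` cannot cross a closing `>`, the match should stay inside a single meta tag, and the quote right after `og:image` should keep it from picking up properties such as og:image:width. A tag that lists content before property would be missed by this pattern.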