c# - Regex - get a specific part of the title

Question

My title is structured like this:

<title>WebsiteName | Page title | Slogan</title>

Currently, in C#, I use this to get the title:

Regex.Match(pageSource,
                @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>",
                RegexOptions.IgnoreCase).Groups["Title"].Value;

However, I only want the page title part.

score 3 · Accepted Answer

Avoid using regex to parse the HTML.

You can use HtmlAgilityPack instead.

This gets the title from the HTML:

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);    
string title=doc.DocumentNode.SelectSingleNode("//title").InnerText;

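Once you have the full title, you can use a regex to extract the part you need.

Assuming your title always has the same form as in your example, you can use: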

(?<=\|).+?(?=\|)
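
As a minimal sketch of how that lookaround pattern could be applied to a title string that has already been extracted: the class and method names below (`TitleLookaround`, `ExtractPageTitle`) are made up for illustration, and the `Trim()` call is an addition, since the pattern itself also captures the spaces around the pipes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleLookaround
{
    // Hypothetical helper: returns the text between the first two pipes.
    // The lookbehind and lookahead keep the pipes out of the match itself.
    public static string ExtractPageTitle(string title)
    {
        Match m = Regex.Match(title, @"(?<=\|).+?(?=\|)");

        // The raw match still includes the spaces next to the pipes,
        // so trim it. An empty string signals "no match" in this sketch.
        return m.Success ? m.Value.Trim() : string.Empty;
    }
}
```

Calling `TitleLookaround.ExtractPageTitle("WebsiteName | Page title | Slogan")` should give `Page title`; a title with fewer than two pipes gives an empty string.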

score 2 · Accepted Answer

If you just want to get the page title, then try this:

\|(.*)\|

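If you pass in the string you provided, the second entry in the match's groups (Groups[1], since Groups[0] is the whole match) will contain the title, including the spaces around it. If you find yourself doing anything more complicated than this, regex probably isn't the right tool. There are better ways to parse HTML.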
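
A short sketch of reading that capture group in C#, assuming the title string has already been pulled out of the page. The names `TitleCaptureGroup` and `ExtractBetweenPipes` are hypothetical, and the `Trim()` is an addition to drop the surrounding spaces.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleCaptureGroup
{
    // Hypothetical helper: returns whatever the pattern \|(.*)\| captures.
    public static string ExtractBetweenPipes(string title)
    {
        Match m = Regex.Match(title, @"\|(.*)\|");

        // Groups[0] is the whole match (pipes included);
        // Groups[1] is the text captured by the parentheses.
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }
}
```

For `"WebsiteName | Page title | Slogan"` this should return `Page title`. Because `.*` is greedy, a title with more than two pipes, such as `"A | B | C | D"`, would return `B | C`, so this pattern only fits titles with exactly two pipes.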

score 1 · Accepted Answer

Try this:

@"\<title[^>]*\>[^|]*\|\s*(?<Title>[^|]*?)\|[^<]*\</title\>"

"\<title[^>]*\>"   //Title tag
"[^|]*"            //Everything up to the first pipe
"\|\s*"            //First pipe and any leading white space
"(?<Title>[^|]*?)" //The page title section between the pipes
"\|"               //Second pipe
"[^<]*"            //Everything after the second pipe up to the closing title tag
"\</title\>"       //Closing title tag
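
A minimal sketch of running this pattern against the raw page source, mirroring the call in the question. The class and method names (`TitleFromSource`, `ExtractPageTitle`) are hypothetical. The sketch writes `<` and `>` without backslashes, which match the same literal characters, and adds `Trim()` because the lazy group can still end with a space before the second pipe.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleFromSource
{
    // Same structure as the pattern above: title tag, text up to the first
    // pipe, the named group between the pipes, then the rest of the title.
    private static readonly Regex PageTitlePattern = new Regex(
        @"<title[^>]*>[^|]*\|\s*(?<Title>[^|]*?)\|[^<]*</title>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the middle section of the <title> text,
    // or an empty string when the page source does not match.
    public static string ExtractPageTitle(string pageSource)
    {
        Match m = PageTitlePattern.Match(pageSource);
        return m.Success ? m.Groups["Title"].Value.Trim() : string.Empty;
    }
}
```

Given `"<html><head><title>WebsiteName | Page title | Slogan</title></head></html>"`, this should return `Page title`.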
