c# - Regex only captures the last few values

Question

I have a large file that uses

 <title>words words </title>

as a descriptor, and I am trying to find a regex that returns the data between these tags:

<title.*?>(\w+)</title>

This works, but I think it only picks up a few of the matches, because the tags usually look like this:

adaddad<title>Word word word</title>sdfdsfdsfs

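There is usually random garbage on both sides. I am really not good at regex, but I am still trying to learn it. I have found a lot of posts that come very close, but nothing that quite solves my problem.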

:origLink></item>\r\n<item><title>words word word</title><guid is

That is a better example of one of my strings.

score 1 · Accepted Answer

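I think the problem is that you are using \w, which matches only word characters, to capture text that contains both word characters and spaces. It should be: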

<title.*?>([\w\s]+?)</title>

This lets text such as

adaddad<title>Word word word</title>sdfdsfdsfs

be captured as well, words and spaces together.
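As a rough check of the point above, here is a minimal C# sketch that runs both patterns against the sample string. The class and method names (`TitleCapture`, `TryGetTitle`) are made up for this illustration and do not come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleCapture
{
    // Hypothetical helper: returns true and the first capture group
    // when the pattern matches, false and an empty string otherwise.
    public static bool TryGetTitle(string input, string pattern, out string title)
    {
        Match m = Regex.Match(input, pattern);
        title = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        const string sample = "adaddad<title>Word word word</title>sdfdsfdsfs";
        string title;

        // \w+ cannot cross the spaces, so </title> is never reached.
        bool original = TryGetTitle(sample, @"<title.*?>(\w+)</title>", out title);
        Console.WriteLine(original ? title : "(no match)");

        // [\w\s]+? accepts word characters and whitespace.
        bool corrected = TryGetTitle(sample, @"<title.*?>([\w\s]+?)</title>", out title);
        Console.WriteLine(corrected ? title : "(no match)");
    }
}
```

Note that `[\w\s]` still rejects punctuation, so a title such as `Hello, world!` would not match with this character class.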

score 1 · Accepted Answer

Try making your regex non-greedy (lazy):

 <title.*?>.+?</title>

Also, \w+ will not match a space " ".

Try using Expresso to fine-tune your regex: http://www.ultrapico.com/Expresso.htm
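The `.+?` in the pattern above is a lazy quantifier: it stops at the first `</title>` it can reach, whereas a greedy `.+` runs on to the last one. The sketch below compares the two on a string with two titles. It assumes a capture group is added around `.+?` (the pattern as posted has none), and the names `LazyVersusGreedy` and `Titles` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyVersusGreedy
{
    // Hypothetical helper: collects capture group 1 from every match.
    public static List<string> Titles(string input, string pattern)
    {
        var titles = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            titles.Add(m.Groups[1].Value);
        }
        return titles;
    }

    public static void Main()
    {
        const string sample =
            "<item><title>first title</title></item>" +
            "<item><title>second title</title></item>";

        // Greedy: one match that swallows everything up to the last </title>.
        foreach (string t in Titles(sample, @"<title.*?>(.+)</title>"))
        {
            Console.WriteLine("greedy: " + t);
        }

        // Lazy: one match per title.
        foreach (string t in Titles(sample, @"<title.*?>(.+?)</title>"))
        {
            Console.WriteLine("lazy:   " + t);
        }
    }
}
```

By default `.` does not match a newline in .NET, so a title that spans several lines would need `RegexOptions.Singleline` with either variant.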

score -1 · Accepted Answer

Use this instead:

^[^<]*<title.*?>([^<]*)</title>.*$

Explanation

^ at the beginning means beginning of line
[^<] any character but '<'
.*$ any garbage after the tag is closed

This captures empty titles as well as any odd strings that may appear between the tags.

<title>Normal title</title>
<title></title>
<title>Weird #@!@#%@%^[]{}""///? title ≥╙♥</title>
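A minimal C# sketch of how this pattern behaves on the three titles above and on the asker's longer string. The unanchored variant `<title[^>]*>([^<]*)</title>` is an assumption added here for comparison and is not part of the answer. The sketch also assumes the `\r\n` in the asker's string is a real line break. The names `AnchoredVersusUnanchored` and `Capture` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredVersusUnanchored
{
    // Pattern from the answer: anchored to the start and end of the input.
    public const string Anchored = @"^[^<]*<title.*?>([^<]*)</title>.*$";

    // Assumed alternative (not from the thread): same capture, no anchors.
    public const string Unanchored = @"<title[^>]*>([^<]*)</title>";

    // Returns the capture wrapped in brackets so an empty title stays visible,
    // or "(no match)" when the pattern fails.
    public static string Capture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? "[" + m.Groups[1].Value + "]" : "(no match)";
    }

    public static void Main()
    {
        string[] inputs =
        {
            "<title>Normal title</title>",
            "<title></title>",
            "<title>Weird #@!@#%@%^[]{}\"\"///? title ≥╙♥</title>",
            // The asker's example: other tags come before <title>.
            ":origLink></item>\r\n<item><title>words word word</title><guid is"
        };

        foreach (string input in inputs)
        {
            Console.WriteLine("anchored:   " + Capture(input, Anchored));
            Console.WriteLine("unanchored: " + Capture(input, Unanchored));
        }
    }
}
```

On the last input the anchored pattern finds nothing, because `[^<]*` cannot step over the `</item>` and `<item>` tags that precede `<title>`. The unanchored variant still captures the title text there.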
