I'm working on an HTML parser for a client, and I have just started experimenting with regular expressions. I'm quite new to them but am learning quickly! In this part, I need to extract all of the text in the document that is set at 18.0pt. Here is the first regex I tried (using a real-time regex tester):

<p.*?><span.*?style='.*?font-size:1

Here is my test text:

<p class=MsoNormal><span style='font-size:14.0pt;font-family:"Comic Sans MS"'>3<sup>rd</sup>
Sunday in Lent - 2013c<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:14.0pt;font-family:"Comic Sans MS"'>Old
Testament – Isaiah 55:1-9<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:14.0pt;font-family:"Comic Sans MS"'>New
Testament – Luke 13:1-9<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:18.0pt;font-family:"Comic Sans MS"'><o:p>&nbsp;</o:p>
</span></p>

It works correctly and highlights each paragraph separately as long as the pattern ends at the 1. The problem is that as soon as I change 1 to 18, instead of highlighting just the line with font-size:18, it highlights everything from the first line all the way to the 18. I would like to grab only the line with the 18pt font. Thank you, and any help is appreciated! :)

2 Answers

Here's a better regexp:

<p[^>]*>[ \t\r\n]*<span[^>]* style='[^']*font-size:18

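Yours is doing exactly what you told it to: finding <p, then any number of arbitrary characters, then ><span, then more arbitrary characters, then font-size:18. So it finds the first <p and then consumes arbitrary characters until it reaches font-size:18. Making the quantifiers lazy doesn't help, because a lazy match still starts at the first <p and simply extends until the rest of the pattern can match. You were just lucky in the first example that every span had a font-size starting with 1.

This version is more restrictive: [^>]* stops at the first >, so a match cannot run past the end of a tag, and [^']* keeps the font-size search inside the style attribute. To make it more robust, I also allowed whitespace between the <p> and the <span>.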
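As a rough illustration of the difference, here is a minimal Python sketch that runs both patterns over a trimmed version of the sample text. The re.DOTALL flag on the original pattern is an assumption about how the regex tester was configured, and the HTML string and variable names are made up for the example.

```python
import re

# Trimmed sample text (hypothetical): two 14pt paragraphs, then one 18pt.
HTML = (
    "<p class=MsoNormal><span style='font-size:14.0pt'>Old\n"
    "Testament</span></p>\n"
    "\n"
    "<p class=MsoNormal><span style='font-size:14.0pt'>New\n"
    "Testament</span></p>\n"
    "\n"
    "<p class=MsoNormal><span style='font-size:18.0pt'>Heading\n"
    "</span></p>\n"
)

# Pattern from the question. DOTALL is assumed here so '.' can cross newlines.
lazy = re.compile(r"<p.*?><span.*?style='.*?font-size:18", re.DOTALL)

# Pattern from this answer: the negated classes cannot cross '>' or "'".
bounded = re.compile(r"<p[^>]*>[ \t\r\n]*<span[^>]* style='[^']*font-size:18")

lazy_match = lazy.search(HTML)
bounded_match = bounded.search(HTML)

print(lazy_match.start())      # 0: the match begins at the very first <p
print(bounded_match.group(0))  # only the opening of the 18pt paragraph
```

The lazy pattern still begins at offset 0 because the regex engine reports the leftmost possible match; laziness only affects how far each piece extends once a starting point is fixed.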

Answered 2013-03-09T00:35:46.500

If you match "any character except a newline" instead of "any character" (the dot), you make sure the match cannot run past the end of the line:

<p[^\n]*?><span[^\n]*?style='[^\n]*?font-size:18

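Normally . does not match a newline unless a particular flag is set (which one depends on your environment), usually the s flag. Could that flag be on by default in your regex tester?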
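To see the effect of that flag, here is a small Python sketch; re.S (also spelled re.DOTALL) is Python's name for the s flag, and the two-line TEXT string is invented for the example.

```python
import re

# Two hypothetical lines: a 14pt paragraph followed by an 18pt paragraph.
TEXT = (
    "<p><span style='font-size:14.0pt'>a\n"
    "<p><span style='font-size:18.0pt'>b\n"
)
PATTERN = r"<p.*?><span.*?style='.*?font-size:18"

# Default: '.' stops at '\n', so a match cannot span two lines.
default_match = re.search(PATTERN, TEXT)

# With the s flag: '.' also matches '\n', so the match starts on line one.
dotall_match = re.search(PATTERN, TEXT, re.S)

second_line = TEXT.index("\n") + 1
print(default_match.start() == second_line)  # True
print(dotall_match.start())                  # 0
```

If the tester behaves like the second call, turning the s flag off (where the tool allows it) should be enough to make the original pattern stay on one line.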

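Another idea is to limit the number of characters you are willing to match, using {}. For example (written with an explicit lower bound, since some regex flavors do not accept {,20}):

<p.{0,20}>

This will work as long as your opening <p> tag contains no more than 20 characters after the <p.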
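A minimal Python sketch of the length-cap idea follows. Applying the same cap of 20 to the other two gaps is an assumption added for the example, as are the HTML string and the variable names.

```python
import re

# Hypothetical sample: one 14pt paragraph, then one 18pt paragraph.
HTML = (
    "<p class=MsoNormal><span style='font-size:14.0pt'>Old\n"
    "Testament</span></p>\n"
    "\n"
    "<p class=MsoNormal><span style='font-size:18.0pt'>Heading\n"
    "</span></p>\n"
)

# Every gap is capped at 20 characters, so even with re.DOTALL the match
# cannot stretch from the first <p to a font-size:18 several lines later.
capped = re.compile(
    r"<p.{0,20}><span.{0,20}style='.{0,20}font-size:18", re.DOTALL
)

match = capped.search(HTML)
print(match.group(0))
```

The cap is a heuristic: a tag with more attributes than the limit allows would be missed, so the number has to be chosen with the real documents in mind.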

Answered 2013-03-09T00:27:24.713