I have two almost identical expressions. One gives the correct output; the other gives output that is way off.

data/holidays/photos-2012-2013/word-another-more-more-5443/"><span class="bold">word another</span> - word</a>    

regex = 'data/holidays/photos-2012-2013/.+?(\d{4})/"><span class="bold">(.+?)</span>(.+?)</a>'

The parts word-another-more-more, word another and word vary from link to link. The regex above prints the correct result, a list of tuples like this: ('6642', 'word another', ' - word')

data/holidays/photos-2012-2013/word-another-more-more-5443/">word- another - <span class="bold">word another</span></a>

regex1 = 'data/holidays/photos-2012-2013/.+?(\d{4})/">(.+?)<span class="bold">(.+?)</span></a>'

This one prints garbage, even though the syntax is identical. The output is also a list of tuples, but it is full of unwanted markup.

Can you see what's wrong about the second regex?

1 Answer

Works for me:

>>> import re
>>> text = 'data/holidays/photos-2012-2013/word-another-more-more-5443/">word- another - <span class="bold">word another</span></a>'
>>> re.findall(r'data/holidays/photos-2012-2013/.+?(\d{4})/">(.+?)<span class="bold">(.+?)</span></a>', text)
[('5443', 'word- another - ', 'word another')]
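
The question does not show the full input, so the following is only a guess at the cause. Assume the real page has several links on one line and that some of them start with the bold span. In that case the lazy `(.+?)` after `/">` keeps growing until it reaches the next `<span class="bold">`, which may belong to a later link. The `page` string and the `strict` pattern below are made up for illustration:

```python
import re

# Hypothetical fragment: two links on one line, one of each shape.
link_a = '<a href="data/holidays/photos-2012-2013/aaa-1111/"><span class="bold">one</span> - x</a>'
link_b = '<a href="data/holidays/photos-2012-2013/bbb-2222/">y - <span class="bold">two</span></a>'
page = link_a + link_b

# The second regex from the question, unchanged.
loose = r'data/holidays/photos-2012-2013/.+?(\d{4})/">(.+?)<span class="bold">(.+?)</span></a>'

# Illustrative variant: [^"] keeps the path inside the href, and
# [^<] stops the text groups from running past a tag.
strict = r'data/holidays/photos-2012-2013/[^"]+?(\d{4})/">([^<]+)<span class="bold">([^<]+)</span></a>'

loose_matches = re.findall(loose, page)    # one tuple; its middle group spans both links
strict_matches = re.findall(strict, page)  # only the link that has plain text first

print(loose_matches)
print(strict_matches)
```

If this matches your situation, restricting the groups with negated character classes is one way to keep each match inside a single link.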

Note: don't parse HTML with regular expressions. BeautifulSoup exists for exactly this reason.
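
As a rough sketch of the parser-based approach, here is a version using the standard library's `html.parser`, chosen only so that it runs without installing anything; BeautifulSoup would be shorter. The `LinkText` class and the `<a href="...">` wrapper around the sample are assumptions, since the question shows only part of the markup:

```python
from html.parser import HTMLParser


class LinkText(HTMLParser):
    """Collect (href, bold_text, plain_text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None      # href of the <a> currently open, if any
        self._in_bold = False  # True while inside <span class="bold">
        self._bold = []
        self._plain = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href, self._bold, self._plain = attrs.get("href"), [], []
        elif tag == "span" and attrs.get("class") == "bold":
            self._in_bold = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_bold = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._bold), "".join(self._plain)))
            self._href = None

    def handle_data(self, data):
        # Text inside a link goes to the bold or the plain bucket.
        if self._href is not None:
            (self._bold if self._in_bold else self._plain).append(data)


# Sample from the question, wrapped in an assumed <a href="..."> opening tag.
html = ('<a href="data/holidays/photos-2012-2013/word-another-more-more-5443/">'
        'word- another - <span class="bold">word another</span></a>')

parser = LinkText()
parser.feed(html)
parser.close()
links = parser.links
print(links)
```

The trailing number can then be taken from the href with a small, anchored regex or `rstrip("/").rsplit("-", 1)`, which is far less fragile than matching the whole tag.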

Answered 2013-03-08T22:22:10.320