python - Using a regular expression to get part of a string

Question

I have a string:

 <a class="x3-large" href="_ylt=Ats3LonepB5YtO8vbPyjYAWbvZx4;_ylu=X3oDMTVlanQ4dDV1BGEDMTIwOTI4IG5ld3MgZGFkIHNob290cyBzb24gdARjY29kZQNwemJ1ZmNhaDUEY3BvcwMxBGVkAzEEZwNpZC0yNjcyMDgwBGludGwDdXMEaXRjAzAEbWNvZGUDcHpidWFsbGNhaDUEbXBvcwMxBHBrZ3QDMQRwa2d2AzI1BHBvcwMyBHNlYwN0ZC1mZWEEc2xrA3RpdGxlBHRlc3QDNzAxBHdvZQMxMjc1ODg0Nw--/SIG=12uht5d19/EXP=1348942343/**http%3A//news.yahoo.com/conn-man-kills-masked-teen-learns-son-063653076.html"  style="font-family: inherit;">Man kills masked teen, learns it&#39;s his son</a>

I only want to get the last part of it, the actual message:

Man kills masked teen, learns it&#39;s his son

So far, I have done something like this:

pattern = '''<a class="x3-large" (.*)">(.*)</a>'''

But it doesn't do what I want: the first (.*) matches all the junk inside the link, while the second one is the actual message I want to get.

score 2 · Accepted Answer

In the spirit of answering the question you should have asked ;^), yes, you should use BeautifulSoup or lxml or a real parser to handle HTML. For example:

>>> s = '<a class="x3-large" href="_stuff--/SIG**morestuff" style="font-family: inherit;">Man learns not to give himself headaches using regex to deal with HTML</a>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.get_text()
'Man learns not to give himself headaches using regex to deal with HTML'

Or, if you want to capture the text of several links:

>>> s = '<a class="test" href="ignore1">First sentence</a><a class="test" href="ignore1">Second sentence</a>'
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find_all("a")
[<a class="test" href="ignore1">First sentence</a>, <a class="test" href="ignore1">Second sentence</a>]
>>> [a.get_text() for a in soup.find_all("a")]
['First sentence', 'Second sentence']

Or, if you only want the links with a certain class value:

>>> s = '<a class="test" href="ignore1">First sentence</a><a class="x3-large" href="ignore1">Second sentence</a>'
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find_all("a", {"class": "x3-large"})
[<a class="x3-large" href="ignore1">Second sentence</a>]
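If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only an illustration under that assumption, not part of the original answer; the names AnchorText and anchor_texts are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text of every <a> whose class attribute includes wanted_class."""

    def __init__(self, wanted_class):
        # convert_charrefs=True turns references such as &#39; into characters.
        super().__init__(convert_charrefs=True)
        self.wanted_class = wanted_class
        self.texts = []
        self._buffer = None  # list of text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append("".join(self._buffer))
            self._buffer = None


def anchor_texts(markup, wanted_class):
    """Return the text of each matching link found in markup."""
    parser = AnchorText(wanted_class)
    parser.feed(markup)
    parser.close()
    return parser.texts


s = ('<a class="test" href="ignore1">First sentence</a>'
     '<a class="x3-large" href="ignore1">it&#39;s the second sentence</a>')
print(anchor_texts(s, "x3-large"))
```

Unlike the find_all call above, this returns the decoded link text directly rather than the tag objects.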

score 1 · Accepted Answer

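Use ([^>]*) instead of the first (.*), and ([^<]*) instead of the second, so that neither group can run past the end of its tag. Or use a non-greedy quantifier, such as (.*?).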
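A minimal sketch of both variants, compared with the greedy pattern from the question, might look like this. The sample strings are shortened stand-ins for the real link, and link_text is a hypothetical helper. The attribute group is written as [^>]* here because the attribute values themselves contain quotes, so a class that excludes " would never reach the closing ">. Decoding &#39; with html.unescape is an extra step not mentioned in the answer.

```python
import re
from html import unescape

s = ('<a class="x3-large" href="_stuff--/SIG**morestuff" '
     'style="font-family: inherit;">'
     'Man kills masked teen, learns it&#39;s his son</a>')

# Pattern from the question: both groups are greedy.
greedy = re.compile(r'<a class="x3-large" (.*)">(.*)</a>')
# Non-greedy quantifiers: each group stops at the first place the rest can match.
lazy = re.compile(r'<a class="x3-large" (.*?)">(.*?)</a>')
# Negated classes: group 1 cannot cross '>', group 2 cannot cross '<'.
negated = re.compile(r'<a class="x3-large" ([^>]*)">([^<]*)</a>')


def link_text(pattern, markup):
    """Return the decoded link text (group 2), or None if nothing matches."""
    match = pattern.search(markup)
    if match is None:
        return None
    return unescape(match.group(2))


print(link_text(lazy, s))     # Man kills masked teen, learns it's his son
print(link_text(negated, s))  # Man kills masked teen, learns it's his son

# With two links in one string, the greedy pattern swallows the first link.
two_links = ('<a class="x3-large" href="a">First</a>'
             '<a class="x3-large" href="b">Second</a>')
print([m[1] for m in greedy.findall(two_links)])  # ['Second']
print([m[1] for m in lazy.findall(two_links)])    # ['First', 'Second']
```

On a single link all three patterns give the same second group; the difference only shows up once several links share one string, which is where the greedy version goes wrong.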
