regex - Python regular expression findall statement

Question

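I'm a bit of an amateur programmer and new to this site. I have searched for this question but haven't found it anywhere else on the internet or on this site.

I'm trying to get all the words between the opening and closing paragraph HTML tags (<p> and </p>). My findall statement works for all the words in all the paragraphs, specifically in online articles, except where there are single or double quotes. It's entirely possible that there is a better way to do what I'm trying to do, or that this statement can easily be tweaked to include paragraphs with quotes. Any advice would be appreciated!

The findall statement: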

aText = findall("<p>[A-Za-z0-9<>=\"\:/\.\-,\+\?#@'<>;%&\$\*\^\(\)\[\]\{\}\|\\!_`~ ]+</p>",text)

score 1 · Accepted Answer

To do this, you can use an HTML parsing engine such as Beautiful Soup:

# Beautiful Soup 4: pip install beautifulsoup4
from bs4 import BeautifulSoup

html_doc= """
<p>
paragraph 1
</p>

<p>
paragraph 2
</p>

<p>
paragraph 3
</p>
"""

soup = BeautifulSoup(html_doc)

soup.findAll('p')
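If installing a third-party package isn't an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a minimal sketch, not part of the answer above: the class name ParagraphCollector and its attributes are made up for illustration, and it assumes <p> elements are not nested inside each other.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <p>...</p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts
        self._parts = None    # text pieces of the open <p>, or None outside one

    def handle_starttag(self, tag, attrs):
        # Start collecting when a paragraph opens (assumes no nested <p>).
        if tag == "p":
            self._parts = []

    def handle_data(self, data):
        # Keep text only while we are inside a paragraph.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # Close the paragraph and store its text, stripped of outer whitespace.
        if tag == "p" and self._parts is not None:
            self.paragraphs.append("".join(self._parts).strip())
            self._parts = None


collector = ParagraphCollector()
collector.feed("<p>there isn't much \"to go by\" here</p><p>second <b>bold</b> one</p>")
collector.close()
print(collector.paragraphs)
```

Because the parser tracks tags rather than matching characters, quotes inside a paragraph need no special handling, and inline tags such as <b> are skipped while their text is kept.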

score 1 · Accepted Answer

>>> import re
>>> t = "<p>there isn't much here</p>"
>>> re.findall(r'<p>(.+?)</p>',t)
["there isn't much here"]

Example with an embedded ":

>>> t = r"<p>there isn't much \"to go by\" here</p>"
>>> re.findall(r'<p>(.+?)</p>',t)
['there isn\'t much \\"to go by\\" here']

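Normally + is a greedy quantifier; by adding ? after it we make it non-greedy, so it attempts the smallest possible match. It therefore consumes only as much of the string as it needs until </p> can match.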
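To see the difference side by side, here is a small sketch that runs the greedy and non-greedy versions on a made-up string with two paragraphs. The sample text and variable names are just for illustration. The last line also shows re.DOTALL, which is needed if a paragraph spans several lines, because by default . does not match a newline.

```python
import re

text = "<p>first</p><p>second</p>"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST </p>.
greedy = re.findall(r"<p>(.+)</p>", text)

# Non-greedy: .+? stops at the FIRST </p> that lets the match succeed.
lazy = re.findall(r"<p>(.+?)</p>", text)

# With re.DOTALL, . also matches newlines, so multi-line paragraphs are found.
multiline = re.findall(r"<p>(.+?)</p>", "<p>line one\nline two</p>", re.DOTALL)

print(greedy)     # ['first</p><p>second']
print(lazy)       # ['first', 'second']
print(multiline)  # ['line one\nline two']
```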
