python - Finding a random sentence in HTML with Python regular expressions

Question

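I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1".
Essentially, I need it to pull whatever sentence is between the <br> tags.

I've been trying with regular expressions, but I never really got the hang of them.
Everything my searches turned up extracts a specific sentence or a single word.
This, however, needs to pull out an arbitrary string between the <br> tags.

Can anyone help me? Thanks.

The best I could come up with: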

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)

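Edit: I ended up going with a different approach: simply splitting the HTML into a list delimited by <br> and pulling out [3], for cleaner code and fewer string operations. Keeping this question up for future reference and for others with similar problems.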
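A minimal sketch of that split-and-index idea, for anyone landing here later. The `sample_html` string is a made-up stand-in for the real page (its layout is an assumption, arranged so the quote lands at index 3 as described in the edit), and the names `parts` and `quote` are just illustrative.

```python
# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = (
    "<html><body>Random quote<br>\n"
    "Category: humorists<br>\n"
    "<hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body></html>"
)

# Split on the literal tag; each list item is the text between two <br> tags.
parts = sample_html.split("<br>")

# In this sample the quote is the fourth piece (index 3).
# split() + join() collapses the newlines and tabs into single spaces.
quote = " ".join(parts[3].split())
print(quote)
```

The index is tied to the page layout, so it would need adjusting if the site ever adds or removes a <br> before the quote.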

score 1 · Accepted Answer

You need to use the DOTALL flag, because the text you need to match contains newlines. I would use

re.findall('<br>(.*?)<br>', html, re.S)

But that will return multiple results, because there are many <br> tags on that page. You may want to use something more specific:

re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
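To see what the flag changes, here is a small sketch run against a made-up snippet rather than the live page. The `sample_html` string is an assumption modelled on the quote shown elsewhere in this thread; the pattern itself is the one from this answer.

```python
import re

# Hypothetical snippet shaped like the quote block; the real page may differ.
sample_html = (
    "<hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr>"
)

pattern = r"<hr><br>(.*?)<br><hr>"

# Without re.S the dot stops at the first newline, so nothing matches.
without_flag = re.findall(pattern, sample_html)

# With re.S (alias of re.DOTALL) the dot also matches newlines.
with_flag = re.findall(pattern, sample_html, re.S)

print(without_flag)
print(with_flag)
```

The captured group still carries the surrounding newlines and tabs, so a `.strip()` or similar cleanup is usually wanted afterwards.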

score 1 · Accepted Answer

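# Python 2 import; on Python 3 use `from urllib.request import urlopen` and decode the bytes from read()
from urllib import urlopen
import re

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
# Capture the first text run of 5+ characters (containing no '<') that follows a tag inside <body>
output = re.findall(r'<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)

if output:
    print(output)
    # Flatten the quote onto one line: newlines become spaces, tabs are dropped
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)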

Terminal

imac2011:Desktop allendar$ python test.py 
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']

A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx

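You could also strip the trailing \n's and replace the ones inside the text (on longer quotes) with <br> if you are displaying it in HTML again, so that the original line breaks are preserved visually.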
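A quick sketch of that cleanup, assuming the replacement meant here is a `<br>` tag. `raw_quote` is the string from the terminal output above; `html_ready` is just an illustrative name.

```python
# The captured quote as it appears in the terminal output above.
raw_quote = (
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
)

# Drop the trailing newlines, remove the indentation tabs,
# then turn the remaining inner newlines into <br> tags.
html_ready = raw_quote.rstrip("\n").replace("\t", "").replace("\n", "<br>")
print(html_ready)
```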

score 0 · Accepted Answer

All the jokes on that page follow the same pattern, with nothing ambiguous, so you can use this:

output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)

There is no need for the DOTALL flag, because the pattern contains no dot.
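Here is a small sketch of how the lookbehind and lookahead behave, run on a made-up snippet instead of the live page. `sample_html` is an assumption modelled on the quote shown in the other answer; the pattern is the one above.

```python
import re

# Hypothetical snippet shaped like the quote block; the real page may differ.
sample_html = (
    "<hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr>"
)

# (?<=<br>\s)   : the match must start right after "<br>" plus one whitespace char
# [^<]+         : any run of characters that are not "<" (newlines included)
# (?=\s{2}<br)  : the match must be followed by two whitespace chars and "<br"
pattern = r"(?<=<br>\s)[^<]+(?=\s{2}<br)"

output = re.findall(pattern, sample_html)
print(output)
```

Because `[^<]` is a negated character class, it already matches newlines, which is why no flag is needed. The lookarounds also keep the tags and the trailing blank lines out of the result.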

score 0 · Accepted Answer

This is, uh, 7 years later, but for future reference:

As Floris suggested in the comments, use the BeautifulSoup library for this kind of task.
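A rough sketch of what that could look like, assuming the third-party `beautifulsoup4` package is installed. `sample_html` is a made-up stand-in for the page, so the real markup may need a more specific lookup than taking the first text node.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = (
    "<html><body><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body></html>"
)

# Parse with the parser bundled with Python, so no extra dependency is needed.
soup = BeautifulSoup(sample_html, "html.parser")

# stripped_strings yields every non-empty text node with outer whitespace removed.
texts = list(soup.body.stripped_strings)

# Collapse the inner newline and tabs into single spaces.
quote = " ".join(texts[0].split())
print(quote)
```

The parser deals with the tags, so no pattern has to be kept in sync with the exact markup around the quote.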
