python - Python regex not capturing pattern

Question

I am basically scraping data from a specific page. I have this code:

regex = '<ul class="w462">(.*?)</ul>'

opener.open(baseurl)
urllib2.install_opener(opener)

... rest of code omitted ...

requestData = urllib2.urlopen(request)
htmlText = requestData.read()

pattern = re.compile(regex)
movies = re.findall(pattern, htmlText)

# The match list below always comes back empty.
if not movies:
    print "List is empty. Printing source instead...", "\n\n"
    print htmlText
else:
    print movies

Contents of htmlText:

<ul class="w462">

... bunch of <li>s (the content i want to retrieve).

</ul>

htmlText contains the correct source (I searched it with Ctrl+F and verified that it contains the desired ul element). My regex just fails to capture the content I want.

I also tried using this instead:

movies = re.findall(r'<ul class="w462">(.*?)</ul>', htmlText)

Does anyone know what is going wrong?

score 2 · Accepted Answer

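By default, the dot (.) in a regular expression matches any character except a newline. As a result, your regex cannot match anything that spans multiple lines (that is, anything containing at least one newline), and the ul element in your source does.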
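To make the default behaviour concrete, here is a minimal sketch. The two sample strings are hypothetical stand-ins for the page source (the real li items are elided in the question), and the sketch is written in Python 3 syntax, whereas the question uses Python 2.

```python
import re

# Hypothetical stand-ins for the page source; not the asker's real HTML.
multi_line_html = '<ul class="w462">\n<li>first</li>\n<li>second</li>\n</ul>'
single_line_html = '<ul class="w462"><li>first</li></ul>'

# Same pattern as in the question, compiled without any flags.
pattern = re.compile(r'<ul class="w462">(.*?)</ul>')

# The dot cannot cross the newline right after the opening tag,
# so the multi-line sample yields no match at all.
multi_line_matches = pattern.findall(multi_line_html)

# With no newline between the tags, the very same pattern matches.
single_line_matches = pattern.findall(single_line_html)

print(multi_line_matches)
print(single_line_matches)
```

The only difference between the two inputs is the presence of newlines, which is why the pattern looks right but still returns an empty list on the real page.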

Change the compile line to:

pattern = re.compile(regex, re.DOTALL)

This changes the meaning of the dot. With re.DOTALL, the dot matches any character, including a newline.
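A small self-contained sketch of the fix, assuming a hypothetical sample string in place of the real page source (the actual li contents are not shown in the question). It uses Python 3 syntax; the flag behaves the same way in the Python 2 code from the question.

```python
import re

# Hypothetical stand-in for htmlText; the real page content is not shown.
html_text = '<ul class="w462">\n<li>first</li>\n<li>second</li>\n</ul>'

regex = '<ul class="w462">(.*?)</ul>'

# re.DOTALL lets the dot match newlines, so the group can span lines.
pattern = re.compile(regex, re.DOTALL)
movies = pattern.findall(html_text)

# The captured block can then be split into individual items.
items = re.findall(r'<li>(.*?)</li>', movies[0], re.DOTALL)

# Equivalent spelling: the inline (?s) flag at the start of the pattern.
inline_movies = re.findall(r'(?s)<ul class="w462">(.*?)</ul>', html_text)

print(movies)
print(items)
```

Note that the captured group keeps the surrounding newlines, so the result may need a strip() or a second pass like the one that builds items.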
