python - Regular expressions in Python - scraping data from a website

Question

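I am new to Python and am trying to extract XML files from a website and load them into a database. I have been using the Beautiful Soup module in Python, but I cannot extract the specific XML files I want. In the website's source code they look like this: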

<a href="ReportName I want 20130101.XML">ReportName.XML</a>
<a href="ReportName I want 20120101.XML">ReportName.XML</a>
<a href="ReportName I dont want 123.XML">ReportName.XML</a>

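My Python code is shown below. It brings back everything with an "href" attribute, whereas I want to filter for the files named "ReportName I want dddddddd". I have tried regular expressions such as "href=\s\w+", for example, but to no avail, because it returns None.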

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

webpage = "http://www.example.com"
response = urlopen(webpage).read()
soup = BeautifulSoup(response, "html.parser")

# Print the href attribute of every anchor tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))

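When I run this in Python, findall('href') pulls back the entire string, but I only want to filter out the XML part. I have tried variations of the code, such as findall('href\MarketReports') and findall('href\w+'), but these return "None" when I run the code.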

Any help is appreciated.

score 2 · Accepted Answer

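It is not entirely clear to me what you are looking for, but if I understand correctly, you only want to get ReportName.XML, in which case it would be:

soup.find('a').text

If you are looking for "/MarketReports/ReportName.XML", then it would be:

soup.find('a').attrs['href']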
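
Once the href values are in hand, they can be filtered with a regular expression to keep only the wanted reports. The following is a minimal sketch, assuming the wanted files are named "ReportName I want " followed by an eight-digit date. The hrefs list stands in for the values that link.get('href') would return, and the names WANTED and wanted_reports are hypothetical, chosen only for this illustration.

```python
import re

# Stand-in for the values collected with link.get('href'),
# taken from the sample markup in the question.
hrefs = [
    "ReportName I want 20130101.XML",
    "ReportName I want 20120101.XML",
    "ReportName I dont want 123.XML",
    None,  # link.get('href') returns None for an anchor without an href
]

# Assumed naming scheme: fixed prefix, eight-digit date, .XML extension.
WANTED = re.compile(r"ReportName I want \d{8}\.XML")


def wanted_reports(values):
    """Return only the hrefs that match the assumed report pattern."""
    return [v for v in values if v and WANTED.fullmatch(v)]


print(wanted_reports(hrefs))
```

Using fullmatch rather than search means a name is kept only when the whole href has the expected shape, so a partial match inside a longer, unrelated file name is rejected.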

score 0 · Accepted Answer

I used the following code, and it was able to find the reports as needed. The Google presentation, together with jdotjdot's input, helped a lot.

http://www.youtube.com/watch?v=kWyoYtvJpe4

The code I used to find the XML files is:

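import re
from urllib.request import urlopen

webpage = "http://www.example.com"
# Decode the downloaded bytes so the str pattern below can be applied
response = urlopen(webpage).read().decode("utf-8", errors="replace")

# [.] matches a literal dot before the XML extension
print(re.findall(r"Report I want\w+[.]XML", response))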
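
For comparison, here is a sketch of the same re.findall approach run against the sample markup from the question. It assumes the wanted names are "ReportName I want " plus an eight-digit date. Because \w+ does not match a space, the space before the date is written out explicitly. The html string is an inline stand-in for the downloaded page, and the pattern is illustrative rather than the one used on the real site.

```python
import re

# Inline stand-in for the downloaded page source (sample from the question).
html = """
<a href="ReportName I want 20130101.XML">ReportName.XML</a>
<a href="ReportName I want 20120101.XML">ReportName.XML</a>
<a href="ReportName I dont want 123.XML">ReportName.XML</a>
"""

# Capture the href value only when it has the assumed shape:
# fixed prefix, a space, eight digits, then a literal ".XML".
pattern = re.compile(r'href="(ReportName I want \d{8}\.XML)"')

# With one capture group, findall returns a list of the captured strings.
matches = pattern.findall(html)
print(matches)
```

Note that re.findall returns an empty list when nothing matches, whereas re.search and re.match return None.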
