python - Extracting part of a web page with Python

Question

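I have a data retrieval/entry project in which I want to extract a certain part of a web page and store it in a text file. I have a text file of urls, and the program is supposed to extract the same part of the page for each url.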

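Specifically, the program copies the legal statute(s) that follow "Legal Authority:" on pages such as this one. As you can see, only one statute is listed there. However, some of the urls look like this instead, meaning that there are multiple separate statutes.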

My code works for the first kind of page:

from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    # strip the trailing newline so urlopen gets a clean url
    url = line.strip()
    if not url:
        continue
    pg = urlopen(url).read()
    statute = get_legal(pg)
    output.write(statute + "\n")

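This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I have tried something like this: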

def get_legal(page):
# this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a>&nbsp;'):
        start_legal = page.find('">', start_link+1)

        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a>&nbsp;', end_link+1)
        legal += page[start_legal+2: end_link] 
        if 
        break
    return legal

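Since every list of statutes ends with '</a>&nbsp;' (inspect the source of either of the two links), I thought I could use that fact (treating it as the end index) to loop through and collect all the statutes in one string. Any ideas?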
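The looping idea described above can be sketched with plain string searches. This is only a minimal sketch under assumptions: it assumes the statute links all sit between the "Legal Authority:" label and the next closing `</td>`, which has not been checked against the live pages, and the sample markup in the usage lines is invented for illustration. It returns a list rather than one concatenated string so the caller can decide how to join the names.

```python
def get_legal(page):
    """Return the text of every link that follows 'Legal Authority:'."""
    start = page.find('Legal Authority:')
    if start == -1:
        return []
    # Assumption: the statute links end at the closing </td> of the cell.
    end = page.find('</td>', start)
    if end == -1:
        end = len(page)
    cell = page[start:end]

    statutes = []
    pos = 0
    while True:
        # Look for the next opening <a ...> tag inside the cell.
        open_a = cell.find('<a', pos)
        if open_a == -1:
            break
        text_start = cell.find('>', open_a)
        if text_start == -1:
            break
        text_end = cell.find('<', text_start + 1)
        if text_end == -1:
            break
        statutes.append(cell[text_start + 1:text_end].strip())
        # Continue searching after the link text just collected.
        pos = text_end
    return statutes


if __name__ == "__main__":
    # Made-up markup that imitates the structure described in the question.
    sample = ('<td><b>Legal Authority:</b>&nbsp;'
              '<a class="pageSubNavTxt" href="#">29 USC 49</a>&nbsp;'
              '<a class="pageSubNavTxt" href="#">29 USC 2801</a>&nbsp;</td>')
    print(", ".join(get_legal(sample)))
```

String searching like this is fragile: any change in the page markup (for example another tag whose name starts with "a") can break it.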

score 2 · Accepted Answer

I would suggest using BeautifulSoup to parse and search your html. It will be much easier than doing basic string searches.

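Here is an example that pulls all the <a> tags found in the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch the page content here - this is just a recommended and very easy to use alternative to urlopen.)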

import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})


def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()

If this isn't exactly what you need to do, BeautifulSoup is still the tool you will likely want to use for sifting through html.
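The code above is written for Python 2 and the older `BeautifulSoup` import. For readers on Python 3, the same approach might look roughly like the sketch below, which uses the `bs4` package. The function name `get_statutes` and the sample markup are made up for illustration, and the sketch assumes the statute links share a parent element with the `<b>Legal Authority:</b>` label, as in the example above.

```python
from bs4 import BeautifulSoup


def get_statutes(html):
    """Return the text of each <a> in the cell labelled 'Legal Authority:'."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the <b> label by its text, ignoring surrounding whitespace.
    label = soup.find(
        lambda tag: tag.name == "b"
        and tag.get_text(strip=True) == "Legal Authority:"
    )
    if label is None:
        return []
    # Assumption: the statute links are children of the label's parent
    # element (a <td> on the pages discussed here).
    return [a.get_text(strip=True) for a in label.parent.find_all("a")]


if __name__ == "__main__":
    # Made-up markup imitating the structure described in the answer.
    sample = """
    <table><tr>
      <td><b>Legal Authority:</b>
        <a class="pageSubNavTxt" href="#">29 USC 49</a>&nbsp;
        <a class="pageSubNavTxt" href="#">29 USC 2801</a>&nbsp;
      </td>
      <td><a href="#">unrelated link</a></td>
    </tr></table>
    """
    for statute in get_statutes(sample):
        print("statute: " + statute)
```

To process a whole file of urls, one option is to fetch each page (for example with `requests.get(url).content`) and pass the result to `get_statutes`, writing one line per statute to the output file.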

score 0 · Accepted Answer

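They offer XML data over there; see my comment. If you think you cannot download that many files (or the other end might dislike that many HTTP GET requests), I would recommend asking their admin whether they would kindly provide you with a different way of accessing the data.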

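I have done so twice in the past (with scientific databases). In one instance, the sheer size of the dataset prohibited a download; they ran an SQL query of mine and e-mailed me the results (having previously offered to mail a DVD or hard disk). In another case, I could have made several million HTTP requests to a web service (and they were fine with that), each fetching about 1k bytes. This would have taken a long time and would have been quite inconvenient: it would have required some error handling, since some of these requests would always time out, and it would not have been atomic because of paging. I was mailed a DVD.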

I imagine that the Office of Management and Budget could be similarly accommodating.
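If bulk downloading is the only option, the error handling mentioned above for requests that time out might be sketched as follows. This is an illustration only, using the Python 3 standard library; the helper names `fetch` and `fetch_with_retries`, the retry count, and the delays are assumptions and not something the site or either answer prescribes.

```python
import time
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one url and return the raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retries(url, fetcher=fetch, attempts=3, delay=1.0):
    """Call fetcher(url), retrying when a network error or timeout occurs."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetcher(url)
        except OSError as error:
            # URLError and timeout errors are both subclasses of OSError.
            last_error = error
            if attempt < attempts - 1:
                # Wait a little longer after each failed attempt.
                time.sleep(delay * (attempt + 1))
    raise last_error
```

Passing the download function in as `fetcher` keeps the retry logic independent of the HTTP library, so the same helper could wrap a `requests`-based function instead.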
