python - Need to parse a specific website using Python 2.x

Question

I am trying to retrieve information from a health inspection website, parse the data into variables, and possibly save the records to a file. I suppose I could use dictionaries to store the information for each business.

The website in question is: http://www.swordsolutions.com/Inspections.

Clicking [Search] on the website will start displaying information.

I need to be able to pass some search data to the website, and then parse the information that is returned into variables and then to files.

I am fetching the website to a file using:

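import urllib

# Fetch the landing page (Python 2.x) and keep the raw HTML in memory.
u = urllib.urlopen('http://www.swordsolutions.com/Inspections')
data = u.read()
u.close()

# Save the HTML to disk; the with-block closes the file automatically.
with open('data.html', 'wb') as f:
    f.write(data)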

This is the data that urllib retrieves: http://bpaste.net/show/126433/. It currently does not show anything useful.

Any ideas?

score 0 · Accepted Answer

I'll just point you in a direction.

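You want to submit a form with several predefined field values and then parse the data that comes back. The next steps depend on how easy it is to automate that form POST request.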

You have plenty of options here:

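Use the browser developer tools to analyze what happens when you click "submit". If it turns out to be a simple POST request, simulate it with urllib2, requests, mechanize, or whatever you like.
Give Scrapy and its FormRequest class a try.
Use a real automated browser with the help of selenium: fill the data into the fields, click submit, then get and parse the data with the same tool (selenium).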
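
As a rough sketch of the first option, assuming the search turns out to be a plain form POST: the field names (business_name, city) and the reuse of the page URL as the form target are placeholders, not taken from the real site, so read the actual ones from the request shown in the browser developer tools. The imports fall back so the same code runs on Python 2.x and 3.

```python
try:  # Python 2.x
    from urllib import urlencode
    from urllib2 import Request, urlopen
except ImportError:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

# Assumption: the form posts back to the page URL; check the real target
# in the browser developer tools.
SEARCH_URL = 'http://www.swordsolutions.com/Inspections'


def build_search_request(fields, url=SEARCH_URL):
    """Return a POST request carrying the url-encoded form fields."""
    body = urlencode(sorted(fields.items())).encode('ascii')
    # Supplying a data argument makes the request a POST instead of a GET.
    return Request(url, data=body)


def fetch_results(fields):
    """Submit the form and return the raw HTML of the response."""
    response = urlopen(build_search_request(fields))
    try:
        return response.read()
    finally:
        response.close()


# Hypothetical field names, for illustration only. Nothing is sent yet.
request = build_search_request({'business_name': 'cafe', 'city': 'Springfield'})
```

Calling fetch_results(...) with the real field names would return the HTML to hand to a parser. If the form also relies on hidden fields or cookies set by the page, those have to be sent as well, which is where mechanize or requests tend to be more convenient.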

Basically, if the form submission involves a lot of JavaScript logic, you will have to go with an automated browser tool such as selenium.

Also, note that there are several tools for parsing HTML, such as BeautifulSoup and lxml.
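
To illustrate the parsing step, here is a minimal sketch that turns table rows into one dictionary per business and can save them as JSON. It uses the standard library's HTMLParser instead of BeautifulSoup or lxml only so that it runs without installing anything. The sample HTML and the column names are invented, since the real markup of the results page is not shown here.

```python
import json

try:  # Python 2.x
    from HTMLParser import HTMLParser
except ImportError:  # Python 3
    from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_records(html):
    """Return one dict per data row, keyed by the header row's cells."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, body = parser.rows[0], parser.rows[1:]
    return [dict(zip(header, row)) for row in body]


def save_records(records, path):
    """Write the records to a JSON file."""
    with open(path, 'w') as f:
        json.dump(records, f, indent=2)


# Made-up sample standing in for the real search results.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>City</th><th>Score</th></tr>
  <tr><td>Joe's Diner</td><td>Springfield</td><td>92</td></tr>
  <tr><td>Cafe Uno</td><td>Shelbyville</td><td>88</td></tr>
</table>
"""

records = parse_records(SAMPLE_HTML)
```

The dictionaries can then be written out with something like save_records(records, 'inspections.json'), where the file name is just an example. With BeautifulSoup the same extraction is shorter, but the idea is the same: locate the rows, read the cells, and build one dictionary per business.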

See also:

Web scraping with Python

Hope that helps.
