
I am trying to retrieve information from a health inspection website, parse the data into variables, and possibly save the records to a file. I suppose I could use a dictionary to store the information for each business.

The website in question is: http://www.swordsolutions.com/Inspections.

Clicking [Search] on the website will start displaying information.

I need to be able to pass some search data to the website, and then parse the information that is returned into variables and then to files.

I am fetching the website to a file using:

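import urllib.request

# Fetch the page and save the raw HTML to a file.
# (Python 3; in Python 2 the call was urllib.urlopen.)
url = 'http://www.swordsolutions.com/Inspections'
with urllib.request.urlopen(url) as u:
    data = u.read()
with open('data.html', 'wb') as f:
    f.write(data)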

This is the data retrieved by urllib: http://bpaste.net/show/126433/. It currently does not contain anything useful.

Any ideas?


1 Answer


I'll just give you an overview.

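You want to submit a form with several predefined field values, and then parse the data that comes back. The next steps depend on how easy it is to automate that form's POST request.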

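You have several options here:

  • Use the browser's developer tools to analyze what happens when you click [Search]. If it turns out to be a simple POST request, simulate it with urllib2, requests, mechanize, or whatever you prefer
  • Try Scrapy and its FormRequest
  • Use a real automated browser with the help of selenium: fill in the fields, click submit, then fetch and parse the data with the same tool (selenium)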
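As a minimal sketch of the first option, assuming the search turns out to be a plain form-encoded POST: the code below builds such a request with the standard library only. The field names (`business_name`, `city`) and the assumption that the form posts back to the same URL are hypothetical; the real values have to be read from the developer tools.

```python
import urllib.parse
import urllib.request

# Hypothetical values: the real target URL and field names must be read
# from the browser's developer tools after clicking [Search].
SEARCH_URL = "http://www.swordsolutions.com/Inspections"  # assumed form target
FORM_FIELDS = {"business_name": "", "city": ""}           # hypothetical field names


def build_search_request(url, fields):
    """Build (but do not send) a POST request carrying the form fields."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch(request):
    """Send the request and return the response body as bytes."""
    with urllib.request.urlopen(request) as response:
        return response.read()


request = build_search_request(SEARCH_URL, FORM_FIELDS)
# html = fetch(request)  # enable once the real field names are known
```

Building the request separately from sending it makes it easy to compare the encoded body against what the browser actually sends before hitting the site.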

Basically, if the form submission process involves a lot of javascript logic, you will have to use a browser automation tool such as selenium.

Also, note that there are several tools for parsing HTML: BeautifulSoup, lxml.
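To illustrate the parsing step, here is a rough sketch that turns an HTML table into one dictionary per business. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup or lxml, and it assumes the results come back as a simple table with a header row; the sample markup and column names are invented for the example.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Return one dict per data row, keyed by the header row's cells."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample markup; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Business</th><th>Score</th></tr>
  <tr><td>Example Diner</td><td>95</td></tr>
  <tr><td>Sample Cafe</td><td>88</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
```

Each entry in `records` is a plain dictionary, so the list can later be written out with the `csv` or `json` module.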


Hope that helps.

Answered 2013-08-26T19:20:50.090