python - Is it possible to read HTML table data from a web page?

Question

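I am currently looking into some automation for reading data from web pages. Is it possible to read a table like the one below from a web page and load it into Excel? The Excel columns should be Name of Condition, Operator, and Expressions.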

Edit

>>> from urllib import urlopen
>>> from bs4 import BeautifulSoup
>>> source = BeautifulSoup(urlopen(url))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'url' is not defined
>>> source = BeautifulSoup(urlopen(https://demo.aravo.com))
  File "<stdin>", line 1
    source = BeautifulSoup(urlopen(https://demo.aravo.com))
                                        ^
SyntaxError: invalid syntax
>>> from urllib import urlopen
>>> from bs4 import BeautifulSoup
>>> source = BeautifulSoup(urlopen(https://demo.aravo.com/))
  File "<stdin>", line 1
    source = BeautifulSoup(urlopen(https://demo.aravo.com/))
                                        ^
SyntaxError: invalid syntax
>>> source = BeautifulSoup(urlopen(demo.aravo.com/))
  File "<stdin>", line 1
    source = BeautifulSoup(urlopen(demo.aravo.com/))
                                                  ^
SyntaxError: invalid syntax
>>> source = BeautifulSoup(urlopen(demo.aravo.com))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'demo' is not defined
>>>

Edit 2

C:\Users>cd..

C:\>cd cd C:\Python27\selenv\Scripts
The filename, directory name, or volume label syntax is incorrect.

C:\>cd C:\Python27\selenv\Scripts

C:\Python27\selenv\Scripts>python
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import urlopen
>>> from bs4 import BeautifulSoup
>>> source = BeautifulSoup(urlopen("https://demo.aravo.com/"))
>>> tables = source.findAll('td')
>>> import csv
>>> writer = csv.writer(open('filename.csv','w'))
>>> writer.writerow(rows)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'rows' is not defined
>>>

Thanks

score 1 · Accepted Answer

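Yes, it is possible. Take a look at the library called Beautiful Soup; it simplifies extracting the right information once you have scraped the page.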

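#!/usr/bin/env python
from selenium import webdriver

# Start Firefox, load the page, and grab the rendered HTML
browser = webdriver.Firefox()
url = 'http://python.org'
browser.get(url)
page_source = browser.page_source
print(page_source)  # the parenthesized form works on Python 2 and 3
browser.quit()  # close the browser when done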

score 1 · Accepted Answer

You can also use urlopen from the urllib library to fetch the page source, and then use BeautifulSoup to parse the HTML.

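# Python 2 import; on Python 3 use: from urllib.request import urlopen
from urllib import urlopen

from bs4 import BeautifulSoup

# The URL must be a quoted string
url = 'https://demo.aravo.com/'

# Get a BeautifulSoup object for the page
source = BeautifulSoup(urlopen(url))

# Get a list of all table cells (<td> elements) from the source
tables = source.findAll('td')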
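The findAll('td') call above returns one flat list of cells, so the row boundaries are lost. Below is a minimal sketch of grouping cells per row. It is written for Python 3 and deliberately uses the standard library's html.parser instead of Beautiful Soup so that it runs with no extra install; the sample markup, the class name TableRowParser, and the helper extract_rows are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the real page source
SAMPLE_HTML = """
<table>
  <tr><th>Name of Condition</th><th>Operator</th><th>Expressions</th></tr>
  <tr><td>Condition A</td><td>AND</td><td>amount = 100</td></tr>
  <tr><td>Condition B</td><td>OR</td><td>status = open</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table contents as a list of rows (lists of strings)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

With Beautiful Soup itself, the same grouping is usually done by looping over the tr elements and collecting the td text of each one; the result in both cases is a list of lists that the csv module can write directly.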

The easiest way to save the information for use in Excel is probably to write it to a .csv file.

You can do this with the csv module.

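import csv

# rows must be defined first: a list of rows, each a list of cell strings
rows = [['Name of Condition', 'Operator', 'Expressions']]  # header; append the parsed data rows
with open('filename.csv', 'wb') as f:  # Python 2; on Python 3 use open('filename.csv', 'w', newline='')
    csv.writer(f).writerows(rows)  # writes one CSV line per inner list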
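One detail worth seeing in action is how the csv module quotes cells that themselves contain commas, so Excel still splits the columns correctly. The following is a small Python 3 sketch under stated assumptions: the data rows are invented, write_rows is a hypothetical helper name, and an in-memory buffer is used in place of a real file so nothing is written to disk.

```python
import csv
import io

# Hypothetical rows: the header from the question plus two made-up data rows
rows = [
    ["Name of Condition", "Operator", "Expressions"],
    ["Condition A", "AND", "amount = 100"],
    ["Condition B", "OR", "status = open, pending"],  # cell contains a comma
]


def write_rows(fileobj, rows):
    """Write each inner list as one CSV line; the writer quotes where needed."""
    csv.writer(fileobj).writerows(rows)


# In-memory demonstration. For a real file on Python 3, pass
# open("filename.csv", "w", newline="") instead of the buffer.
buffer = io.StringIO()
write_rows(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that writerow takes a single row, while writerows takes a list of rows; passing a whole table to writerow would put every row into one line.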

All of these modules are well documented, so you should be able to fill in the blanks.

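To make sure the libraries are installed, check that you have easy_install, which comes with setuptools. The csv and urllib modules are part of Python's standard library and do not need to be installed. Once easy_install is working, type the following in a shell (on newer Python installations, pip install is the usual equivalent):

easy_install beautifulsoup4
easy_install ipython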
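To confirm that the modules are actually importable before going further, a quick check along these lines may help. It is a Python 3 sketch; is_installed is a hypothetical helper name, and bs4 and IPython are the import names of the Beautiful Soup 4 and IPython packages.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# csv and urllib ship with Python; bs4 and IPython are third-party packages
for name in ("csv", "urllib", "bs4", "IPython"):
    print(name, "installed" if is_installed(name) else "missing")
```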

Then run ipython to enter a live Python environment:

ipython

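This opens a Python shell in which you can test the code above. I hope this helps. If you need more help with the basics, search the web for Python tutorials. [scraperwiki][3] has some good examples of web parsing in Python.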
