python - Parse data from html page to table

Question

I would like to make a table of chosen physical properties of the elements (for example atomization enthalpy, vaporization enthalpy, heat of vaporization, boiling point), which are accessible on this page.

It is a huge pain to do it by hand, and I didn't find any other machine-processing-friendly source of such data on the internet.

I was trying to learn how to do it in Python (because I want to use this data in my other code written in Python / NumPy / Pandas).

I was able to download the webpage HTML code with urllib2, and I was trying to learn how to use an HTML/XML parser like ElementTree or MiniDom. However, I have no experience with web programming or HTML/XML processing.

score 0 · Accepted Answer

Thank you, Lei Feng.

I had to modify your code slightly to make it work, but thanks for the kickstart. This code works:

import lxml.html
import lxml.etree
import urllib2

# Send a browser-like User-agent header with the request
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://environmentalchemistry.com/yogi/periodic/W.html')
html = infile.read()

# Parse the page and select the list item by its absolute position
doc = lxml.html.document_fromstring(html)
result = doc.xpath("/html/body/div[2]/div[1]/div[1]/div[1]/ul[7]/li[8]")
print lxml.etree.tostring(result[0])

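But it is probably not the best approach anyway. Because the page structure is not exactly the same for different elements, I will probably just use a simple string.find() and a regular expression. Like this: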

import re
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://environmentalchemistry.com/yogi/periodic/W.html')
page = infile.read()

# Take the 50 characters starting at the property label
i = page.find("Heat of Vaporization")
substr = page[i:i+50]
print substr

# Strip everything except digits and dots from that snippet
non_decimal = re.compile(r'[^\d.]+')
print non_decimal.sub('', substr)
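
A slightly more defensive variant of the find-and-regex idea might look like the sketch below. It is written for Python 3 and runs on an embedded sample string instead of downloading anything. The markup in SAMPLE_PAGE, the numbers in it, and the helper name extract_value are all made up for illustration; the real pages may be laid out differently. Taking only the first number after the label avoids gluing several numbers together when the window happens to contain more than one.

```python
import re

# Hypothetical page fragment with placeholder numbers (not real data);
# the actual markup on the site may differ.
SAMPLE_PAGE = """
<li><a href="#">Heat of Vaporization</a>: 123.4 kJ/mol</li>
<li><a href="#">Boiling Point</a>: 4567 K</li>
<li><a href="#">Enthalpy of Atomization</a>: 890 kJ/mol @ 25 C</li>
"""

NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")  # first integer or decimal number
TAG = re.compile(r"<[^>]+>")                # any complete HTML tag


def extract_value(page, label, window=50):
    """Return the first number within `window` characters after `label`, or None."""
    start = page.find(label)
    if start == -1:
        return None  # label not present on this page
    begin = start + len(label)
    # Replace tags with spaces so digits inside attributes are not picked up
    snippet = TAG.sub(" ", page[begin:begin + window])
    match = NUMBER.search(snippet)
    return float(match.group()) if match else None


# One table row per element: property label -> numeric value
FIELDS = ["Heat of Vaporization", "Boiling Point", "Enthalpy of Atomization"]
row = {label: extract_value(SAMPLE_PAGE, label) for label in FIELDS}
print(row)
```

Rows built this way for several elements could then be collected into a list and handed to whatever table structure the rest of the code uses.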

score 0 · Accepted Answer

With lxml's xpath support you can parse the data easily. Here is an example that parses the atomization enthalpy:

import lxml.html
import urllib2

html = urllib2.urlopen("http://environmentalchemistry.com/yogi/periodic/W.html").read()
doc = lxml.html.document_fromstring(html)
result = doc.xpath("/html/body/div[2]/div[2]/div[1]/div[1]/ul[7]/li[8]")

You can generate the xpath strings dynamically for different elements, and use a dict to parse the fields you need.
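
As a rough sketch of that idea, a dict can map each field name to a path, and the URL can be built from the element symbol. To keep the example runnable without third-party packages (Python 3, no network access), it swaps in the standard library's xml.etree.ElementTree for lxml and parses a small, well-formed stand-in page. SAMPLE_PAGE, its placeholder values, FIELD_PATHS, page_url and parse_fields are all hypothetical names and data, and the URL pattern for symbols other than W is an assumption. Real HTML is often not well-formed XML, so for the actual site the same dict would more likely be passed to doc.xpath() on an lxml document.

```python
import xml.etree.ElementTree as ET

# Assumed URL pattern: only the W.html page is known from the question
URL_TEMPLATE = "http://environmentalchemistry.com/yogi/periodic/{symbol}.html"

# Hypothetical, well-formed stand-in for a downloaded element page
# (placeholder values, not real data)
SAMPLE_PAGE = """<html><body>
<div><h1>Tungsten</h1></div>
<div>
  <ul>
    <li>Boiling Point: 1000 K</li>
    <li>Enthalpy of Atomization: 200 kJ/mol</li>
  </ul>
</div>
</body></html>"""

# Field name -> path relative to the <html> root (assumed layout)
FIELD_PATHS = {
    "boiling_point": "./body/div[2]/ul[1]/li[1]",
    "atomization_enthalpy": "./body/div[2]/ul[1]/li[2]",
}


def page_url(symbol):
    """Build the page URL for an element symbol such as 'W'."""
    return URL_TEMPLATE.format(symbol=symbol)


def parse_fields(page_source, field_paths):
    """Return {field name: text of the matching node, or None if absent}."""
    root = ET.fromstring(page_source)
    row = {}
    for name, path in field_paths.items():
        node = root.find(path)
        row[name] = "".join(node.itertext()).strip() if node is not None else None
    return row


print(page_url("W"))
print(parse_fields(SAMPLE_PAGE, FIELD_PATHS))
```

Returning None for a missing node, instead of indexing the result directly, keeps one oddly structured page from aborting a loop over many elements.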
