How can I scrape the fund prices from this page:

http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U

This is wrong, but how should I fix it:

import pandas as pd
import requests
import re
url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
tables = pd.read_html(requests.get(url).text, attrs={"class": re.compile(r"fundPriceCell\d+")})

2 Answers

I like to use lxml to parse and query HTML. Here is what I came up with:

import requests
from lxml import etree

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
doc = requests.get(url)
tree = etree.HTML(doc.content)

row_xpath = '//tr[contains(td[1]/@class, "fundPriceCell")]'

rows = tree.xpath(row_xpath)

for row in rows:
    # Each matching row holds three cells: the date and two price columns.
    date_string, v1, v2 = (td.text for td in row)
    print("%s - %s - %s" % (date_string, v1, v2))
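
To try the XPath above without hitting the site, here is a minimal sketch that runs it against an inline snippet. The sample markup and its values are made up for illustration (an assumption about the page layout, not a copy of it); only the `fundPriceCell1`/`fundPriceCell2` class names come from the code in this thread.

```python
from lxml import etree

# Made-up sample markup; the real page layout may differ.
SAMPLE_HTML = """
<html><body><table>
  <tr><td class="header">Date</td><td class="header">Bid</td><td class="header">Offer</td></tr>
  <tr><td class="fundPriceCell1">05/12/2013</td><td class="fundPriceCell1">1.234</td><td class="fundPriceCell1">1.299</td></tr>
  <tr><td class="fundPriceCell2">04/12/2013</td><td class="fundPriceCell2">1.230</td><td class="fundPriceCell2">1.295</td></tr>
</table></body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Keep only rows whose first cell has a class containing "fundPriceCell".
row_xpath = '//tr[contains(td[1]/@class, "fundPriceCell")]'

# Iterating an element yields its child elements (here, the <td> cells).
prices = [tuple(td.text for td in row) for row in tree.xpath(row_xpath)]

for date_string, v1, v2 in prices:
    print("%s - %s - %s" % (date_string, v1, v2))
```

The header row is skipped because its first cell's class does not contain "fundPriceCell".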
answered 2013-12-06T17:00:26.560

My solution is similar to yours:

import pandas as pd
import requests
from lxml import etree

url = "http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U"
r = requests.get(url)
html = etree.HTML(r.content)
data = html.xpath('//table//table//table//table//td[@class="fundPriceCell1" or @class="fundPriceCell2"]//text()')

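# Each price row contributes three cells (date, bid, ask); skip if the count is off.
if len(data) % 3 == 0:
    df = pd.DataFrame([data[i:i+3] for i in range(0, len(data), 3)], columns=['date', 'bid', 'ask'])
    df = df.set_index('date')
    # Dates on the page are day/month/year.
    df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
    df.sort_index(inplace=True)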
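
The chunk-into-threes step can also be checked on its own, without pandas or the network. This is a rough sketch using only the standard library; `rows_of_three` is a hypothetical helper name, the sample values are made up, and converting the prices to `float` is an extra step (the DataFrame above keeps them as strings).

```python
from datetime import datetime


def rows_of_three(cells):
    """Group a flat list of cell texts into (date, bid, ask) rows, sorted by date."""
    if len(cells) % 3 != 0:
        raise ValueError("expected a multiple of 3 cells, got %d" % len(cells))
    rows = []
    for i in range(0, len(cells), 3):
        date_text, bid, ask = cells[i:i + 3]
        # Dates are assumed to be day/month/year, as in the format string above.
        rows.append((datetime.strptime(date_text, "%d/%m/%Y"), float(bid), float(ask)))
    return sorted(rows)


# Made-up sample values standing in for the XPath result.
sample = ["05/12/2013", "1.234", "1.299", "04/12/2013", "1.230", "1.295"]
rows = rows_of_three(sample)
```

Unlike the `if` check above, this version raises an error when the cell count is not a multiple of three, so a change in page layout is noticed instead of silently skipped. The resulting tuples could then be passed to `pd.DataFrame(rows, columns=['date', 'bid', 'ask'])` if a DataFrame is still wanted.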
answered 2013-12-13T02:58:36.613