
The dates on the website look like "14 Aug 1899", "13 Dec 1901", and so on. "14 Aug 1899" is written as is, but when scraped from the site and written to the CSV, "13 Dec 1901" becomes "13 Dec 2001". The sample code is shown below:

import csv
import lxml.html

url = ['www.example1.com','www.example2.com','www.example3.com' ... 'www.example4.com']
output = csv.writer(open('output_demo.csv','wb',))
output.writerow(['Name', 'Start Date'])  # writerow takes a single sequence
for page in url:
    startdate = []
    name = []
    content = lxml.html.parse(page)
    name_n = content.xpath('//tr[@class="data1"]/td[1]')
    start_d = content.xpath('//tr[@class="data1"]/td[2]') # extracting the date
    sdate = [sd.text for sd in start_d]
    name_list = [na.text for na in name_n]
    startdate.append(sdate)
    name.append(name_list)
    zipped = zip(name,startdate)
    for row in zipped:
        output.writerow(row) # writing 'date' and 'name'
        zipped = None

This is the website.


1 Answer


I can't see any problem with the dates. FYI, I've made some improvements to the code:

import csv
from lxml import html


url = ['http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;orderby=start;page=52;template=results;type=batting;view=innings']

output = csv.writer(open('output_demo.csv', 'wb'))
output.writerow(['Name', 'Start Date'])
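for page in url:
    content = html.parse(page)
    # each <tr class="data1"> is one row of the results table
    rows = content.xpath('//tr[@class="data1"]')
    for row in rows:
        cells = row.getchildren()
        name = cells[0].find('a').text  # player name is the link text in the first cell
        start_date = cells[11].find('b').text  # start date is the bold text in the twelfth cell
        output.writerow([name, start_date])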

Contents of output_demo.csv after running the code:

Name,Start Date
WM Bradley,17 Jul 1899
W Brockwell,17 Jul 1899
Hon.FS Jackson,14 Aug 1899
TW Hayward,14 Aug 1899
KS Ranjitsinhji,14 Aug 1899
CB Fry,14 Aug 1899
AC MacLaren,14 Aug 1899
CL Townsend,14 Aug 1899
WM Bradley,14 Aug 1899
WH Lockwood,14 Aug 1899
AO Jones,14 Aug 1899
AFA Lilley,14 Aug 1899
W Rhodes,14 Aug 1899
J Worrall,14 Aug 1899
H Trumble,14 Aug 1899
VT Trumper,14 Aug 1899
MA Noble,14 Aug 1899
J Darling,14 Aug 1899
SE Gregory,14 Aug 1899
...
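
One way to check whether the csv module itself alters the date strings is to write them out and read them straight back. The following is only a minimal sketch, assuming Python 3 (the code above is written in Python 2 style, where the file is opened with 'wb'; under Python 3 a real file would instead be opened with open(path, 'w', newline='')). The helper names write_rows and read_dates and the sample player names are hypothetical; only the two date strings come from the question.

```python
import csv
import io


def write_rows(fileobj, rows):
    """Write a header plus (name, start_date) rows to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(['Name', 'Start Date'])
    writer.writerows(rows)


def read_dates(text):
    """Return the 'Start Date' column from CSV text, as plain strings."""
    reader = csv.DictReader(io.StringIO(text, newline=''))
    return [row['Start Date'] for row in reader]


# Hypothetical sample rows; the dates are the two examples from the question.
sample_rows = [
    ('Player A', '14 Aug 1899'),
    ('Player B', '13 Dec 1901'),
]

# An in-memory buffer stands in for output_demo.csv.
buffer = io.StringIO(newline='')
write_rows(buffer, sample_rows)

dates_read_back = read_dates(buffer.getvalue())
print(dates_read_back)  # ['14 Aug 1899', '13 Dec 1901']
```

The csv module treats both values as ordinary text, so they come back unchanged. If the same holds for the real output file when it is viewed in a plain text editor, the change to 2001 would have to come from somewhere other than the csv writer.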

Hope that helps.

answered 2013-07-03T11:29:39.637