web-scraping - Data missing when scraping a website

Question

I am trying to scrape a website (see the URL in the code). I want to scrape all of the information on the site and write the data to a JSON file.

scrapy shell http://www.narakkalkuries.com/intimation.html

Extracting information from the website:

response.xpath('//table[@class="MsoTableGrid"]/tr/td[1]//text()').re(r'[0-9,-/]+|[0-9]+')

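I am able to retrieve most of the information from the website.

Concern: I can scrape the data under every "Intimation" tab except "Intimation For September 2017"; the information under that tab is not returned.

Findings:

For "Intimation For September 2017", the value is stored in a span tag: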

/html/body/div[4]/div[2]/div/table/tbody/tr[32]/td[1]/table/tbody/tr[1]/td[1]/p/b/span

For the remaining months, the values are stored in a font tag:

/html/body/div[4]/div[2]/div/table/tbody/tr[35]/td[1]/table/tbody/tr[2]/td[1]/p/b/span/font

How can I extract the information for "Intimation For September 2017"?

score 1 · Accepted Answer

Your tables use different @class values (MsoTableGrid and MsoNormalTable), so you need a way to handle both of them:

for table in response.xpath('//table[@width="519"]'):
    # skip the first row of each table
    for row in table.xpath('./tr[position() > 1]'):
        for cell in row.xpath('./td'):
            # string(.) joins all descendant text, whether it sits in span or font
            cell_value = cell.xpath('string(.)').extract_first()
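
To see why taking the whole cell's text sidesteps the span/font difference, here is a minimal offline sketch. It uses only Python's standard library (xml.etree.ElementTree) rather than Scrapy, with itertext() standing in for XPath string(.). The markup, the dates in it, and the cell_values helper are hypothetical and only mimic the two nesting styles described in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one table keeps its text directly in
# <span>, the other nests it in <span><font>, as described in the question.
SAMPLE = """
<div>
  <table class="MsoNormalTable" width="519">
    <tr><td>Header</td></tr>
    <tr><td><p><b><span>12-09-2017</span></b></p></td></tr>
  </table>
  <table class="MsoTableGrid" width="519">
    <tr><td>Header</td></tr>
    <tr><td><p><b><span><font>15-08-2017</font></span></b></p></td></tr>
  </table>
</div>
"""


def cell_values(markup):
    """Collect the full text of every cell, skipping each table's first row."""
    root = ET.fromstring(markup)
    values = []
    for table in root.iter("table"):
        # Select tables by width instead of class, as the answer does.
        if table.get("width") != "519":
            continue
        # Mirrors ./tr[position() > 1]: drop the first row of each table.
        for row in table.findall("./tr")[1:]:
            for cell in row.findall("./td"):
                # itertext() yields every descendant text node, so the
                # depth of span/font nesting does not matter.
                values.append("".join(cell.itertext()).strip())
    return values


values = cell_values(SAMPLE)
print(values)
```

Both cells come back as plain strings even though one has an extra font level. In the Scrapy shell, cell.xpath('string(.)') plays the same role.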
