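import.io looks great: it speeds up web scraping by training the tool on a few pages to learn what to extract from the crawled site. But I can't figure out what is wrong with my current crawler. I trained it to go through precinct reports from Hungary (for voting records). During training, the first two text fields are recognized correctly, even when I train on the very same pages that come up in the stream during the crawl. Yet the crawl ends up leaving those two columns empty. What is going on, and what went wrong? Thanks!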

The crawler is at https://import.io/data/mine/?id=772c725f-6048-4861-9f73-03ae30d8f7cc

An example page, taken from the first row of the stream, is http://valasztas.hu/dyn/pv14/szavossz/hu/M08/T150/szkjkv_029.html

The first two lines of the saved stream are:

_url,_position,szavazokor,valasztokerulet,valasztok_szama,megjelentek_szama,megjelentek_szama/_source,ervenyes_lapok_szama,ervenyes_lapok_szama/_source,mcp,mcp/_source,haza_nem_elado,haza_nem_elado/_source,sms,sms/_source,fkgp,fkgp/_source,udp,udp/_source,fidesz,fidesz/_source,sem,sem/_source,lmp,lmp/_source,jesz,jesz/_source,ump,ump/_source,munkaspart,munkaspart/_source,szocialdemokratak,szocialdemokratak/_source,kti,kti/_source,egyutt2014,egyutt2014/_source,zoldek,zoldek/_source,osszefogas,osszefogas/_source,kormanyvaltok,kormanyvaltok/_source,jobbik,jobbik/_source,osszes_ervenyes_listas,osszes_ervenyes_listas/_source
"http://valasztas.hu/dyn/pv14/szavossz/hu/M08/T150/szkjkv_029.html","1","","","825","478","478","478","478","0","0","1","1","2","2","1","1","0","0","221","221","1","1","34","34","0","0","0","0","0","0","0","0","2","2","1","1","3","3","0","0","129","129","80","80","475","475"

Instead, szavazokor should say Sopron 029 (taken from the page), and valasztokerulet should say GYŐR–MOSON–SOPRON 04.

I did not find an option to inspect which patterns the crawler looks for after training.

1 Answer

I just had a look at your crawler, and it is indeed strange that it is not functioning as you would expect, given that it matches all the training data provided - I have asked the team to look into it.

There is a potential workaround in that you can specify a manual regex override for columns, which you may have more luck with.

When you create your first column (or click the "text" link in the column header to edit an existing column), you can check the "Advanced" box and provide a "Manual Regex override". For the first column I put (.+?).számú szavazókör in that field. For the second column I used (.+?).számú egyéni választókerületi szavazás.
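If you want to try these two patterns outside the tool first, here is a minimal Python sketch. It is only an approximation: it uses Python's `re` module, which may not behave exactly like import.io's regex engine, and the two sample strings are hypothetical stand-ins for the header text on the page, not copied from it. The `extract` helper is made up for this illustration. Note that the `.` before `számú` is unescaped, so it matches any single character; depending on the real page text the capture can end with a trailing period, which the sketch strips.

```python
import re

# The two overrides from this answer, copied verbatim.
SZAVAZOKOR_RE = re.compile(r"(.+?).számú szavazókör")
VALASZTOKERULET_RE = re.compile(r"(.+?).számú egyéni választókerületi szavazás")


def extract(pattern, text):
    """Return the first capture group, trimmed, or None if nothing matches."""
    match = pattern.search(text)
    if match is None:
        return None
    # Drop surrounding whitespace and a trailing ordinal period, if present.
    return match.group(1).strip().rstrip(".").strip()


# Hypothetical header lines (assumption: the real page text may differ).
sample_szavazokor = "Sopron 029. számú szavazókör"
sample_kerulet = "GYŐR–MOSON–SOPRON 04. számú egyéni választókerületi szavazás"

print(extract(SZAVAZOKOR_RE, sample_szavazokor))
print(extract(VALASZTOKERULET_RE, sample_kerulet))
```

If the printed values are not what you expect for the real page text, adjusting the pattern here is quicker than re-running the crawl each time.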

Does that resolve your issue?

P.S. If you hadn't already guessed, I work at import.io.

answered 2014-04-14T09:36:51.377