regex - import.io crawler does not populate text columns that were populated during training (on the same site as in the stream)

Question

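import.io looks great for speeding up web scraping: you train the tool on a few pages to show it what to extract from the crawled site. But I don't understand what is wrong with my current crawler. I trained it to go through precinct reports from Hungary (for voting records). During training, the first two text fields were recognized correctly, even when I trained on exactly the same page that later shows up in the stream during the crawl. Yet during the crawl those two columns end up empty. What is going on, and what went wrong? Thanks!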

The crawler is at https://import.io/data/mine/?id=772c725f-6048-4861-9f73-03ae30d8f7cc

The example page for the first row of the stream is http://valasztas.hu/dyn/pv14/szavossz/hu/M08/T150/szkjkv_029.html

The first two rows of the saved stream are:

_url,_position,szavazokor,valasztokerulet,valasztok_szama,megjelentek_szama,megjelentek_szama/_source,ervenyes_lapok_szama,ervenyes_lapok_szama/_source,mcp,mcp/_source,haza_nem_elado,haza_nem_elado/_source,sms,sms/_source,fkgp,fkgp/_source,udp,udp/_source,fidesz,fidesz/_source,sem,sem/_source,lmp,lmp/_source,jesz,jesz/_source,ump,ump/_source,munkaspart,munkaspart/_source,szocialdemokratak,szocialdemokratak/_source,kti,kti/_source,egyutt2014,egyutt2014/_source,zoldek,zoldek/_source,osszefogas,osszefogas/_source,kormanyvaltok,kormanyvaltok/_source,jobbik,jobbik/_source,osszes_ervenyes_listas,osszes_ervenyes_listas/_source
"http://valasztas.hu/dyn/pv14/szavossz/hu/M08/T150/szkjkv_029.html","1","","","825","478","478","478","478","0","0","1","1","2","2","1","1","0","0","221","221","1","1","34","34","0","0","0","0","0","0","0","0","2","2","1","1","3","3","0","0","129","129","80","80","475","475"

Instead, szavazokor should say Sopron 029 (taken from the page), and valasztokerulet should say GYŐR–MOSON–SOPRON 04.
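A quick way to confirm which columns came back blank across a saved stream is to scan the exported CSV. The following is only a minimal sketch: the helper name `empty_columns` is hypothetical, and the embedded text is a shortened excerpt (first six columns) of the two rows shown above rather than the full export.

```python
import csv
import io

# Shortened excerpt of the saved stream above: first six columns only.
STREAM_EXCERPT = '''_url,_position,szavazokor,valasztokerulet,valasztok_szama,megjelentek_szama
"http://valasztas.hu/dyn/pv14/szavossz/hu/M08/T150/szkjkv_029.html","1","","","825","478"
'''


def empty_columns(csv_text):
    """Map each row's _url to the names of the columns whose value is blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["_url"]: [name for name, value in row.items() if not (value or "").strip()]
        for row in reader
    }


report = empty_columns(STREAM_EXCERPT)
for url, columns in report.items():
    print(url, columns)
```

For the excerpt, only szavazokor and valasztokerulet are reported as blank, which matches the problem described here.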

I could not find an option for digging into what patterns the crawler looks for after training.

score 1 · Accepted Answer

I just had a look at your crawler, and it is indeed strange that it is not functioning as you would expect, given that it matches all the training data provided - I have asked the team to look into it.

There is a potential workaround in that you can specify a manual regex override for columns, which you may have more luck with.

When you create your first column (or click the "text" link in the column header to edit an existing column), you can check the "Advanced" box and provide a "Manual Regex override". For the first column I put (.+?).számú szavazókör. For the second column I used (.+?).számú egyéni választókerületi szavazás.
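To see what these two patterns capture before pasting them into the override box, they can be tried out locally. This is only a sketch using Python's `re` module, which may not behave identically to the regex engine import.io uses. The two sample lines are made up for illustration (an assumption about how the page words these headings, not text copied from it), and the helper name `extract` is hypothetical.

```python
import re

# Patterns quoted from the answer. The unescaped "." before "számú"
# matches any single character (for example the full stop after the number).
SZAVAZOKOR_RE = re.compile(r"(.+?).számú szavazókör")
VALASZTOKERULET_RE = re.compile(r"(.+?).számú egyéni választókerületi szavazás")


def extract(pattern, text):
    """Return the first capture group, or an empty string when nothing matches."""
    match = pattern.search(text)
    return match.group(1).strip() if match else ""


# Hypothetical sample lines, invented for illustration only.
sample_szavazokor = "Sopron 029.számú szavazókör"
sample_valasztokerulet = "GYŐR–MOSON–SOPRON 04.számú egyéni választókerületi szavazás"

szavazokor = extract(SZAVAZOKOR_RE, sample_szavazokor)
valasztokerulet = extract(VALASZTOKERULET_RE, sample_valasztokerulet)
print(szavazokor)
print(valasztokerulet)
```

The lazy group `(.+?)` stops at the first point where the rest of the pattern can match, so the captured value is everything before the character that precedes "számú". If the real page has extra whitespace there, the capture may carry a trailing character that needs trimming.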

Does that resolve your issue?

P.S. If you hadn't already guessed, I work at import.io.
