
On Windows 7, using a batch file and xidel, I'm testing against a website with pagination, like this example:

Link 1

Link 2

Link 3

1 2 3 4 5 6 7 8 9 10 Next

I found a way to get the first 10 links:

xidel.exe https://www.website.es/search?q=xidel+follow+pagination^&start=0 --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
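xidel's extract() applies a regular expression to each @href and returns the requested capture group; the trailing [. != ''] predicate drops the hrefs that didn't match. The same filtering can be sketched in Python (the sample hrefs below are hypothetical, just illustrating the pattern):

```python
import re

# Hypothetical hrefs as they might appear on a Google results page.
hrefs = ["/url?q=https://example.es/page1&sa=U", "/intl/es/about"]

# Same regex as the xidel command; keep only hrefs where group 1 matched,
# mirroring the [. != ''] predicate.
links = [m.group(1) for h in hrefs if (m := re.search(r'url[?]q=([^&]+)&', h))]
print(links)
```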

But when I try to follow page 2 or page (n) with

-f "<A class="fl">{.}</A>"

or

--follow "//a/[@class='nav']"

nothing works. Can you give me some help or some examples?

Thanks.


2 Answers


Reino is right. But querying Google can also be done like this:

xidel -s "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination','num':'25'})" ^
      -e "//a/extract(@href,'url\?q=(.+?)&',1)[.]"
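Here form() serializes the form's fields (merged with the values you override) into the request xidel then fetches. Conceptually this is just building the GET query string yourself; a minimal Python sketch of that idea, assuming Google's form action is /search (the variable names are illustrative, not xidel internals):

```python
from urllib.parse import urlencode

# Fields we override, as passed to xidel's form() above.
fields = {"q": "xidel follow pagination", "num": "25"}

# Assumed form action URL for Google's search form.
action = "https://www.google.com/search"

url = action + "?" + urlencode(fields)
print(url)
```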
Answered 2016-07-11T17:48:19.747
xidel -s^
  "https://www.google.com/search?q=xidel+follow+pagination&start=0"^
  -e "//a/extract(@href,'url\?q=(.+?)&',1)[.]"^
  -f "(//td/a/@href)[last()]"^
  -e "//a/extract(@href,'url\?q=(.+?)&',1)[.]"

Update 2021:

xidel -s^
  --user-agent "Mozilla/5.0 Firefox/94.0.1"^
  -H "Cookie: CONSENT=YES+cb.20210518-05-p0.nl+F+224"^
  "https://www.google.com/search?q=xidel+follow+pagination"^
  -e "//div[@class='yuRUbf']/a/@href"^
  -f "//a[@id='pnnext']/@href"

("https://www.google.com" -f "form(//form,{'q':'xidel follow pagination'})" also works)

Five years ago, querying Google without a user-agent or cookie header would work just fine. Nowadays it won't work without them.

My original query (me being a xidel rookie at the time) would just extract the URLs from pages 1 and 2. With -f "//a[@id='pnnext']/@href" at the end, xidel now recursively follows all result pages.
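The loop that -f drives (extract on the current page, then follow the "next" link until there isn't one) can be sketched in Python against static HTML. The page contents and URLs below are hypothetical stand-ins for live result pages, and the stdlib html.parser stands in for xidel's XPath engine:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect result links and the next-page link, mimicking the -e and -f queries."""
    def __init__(self):
        super().__init__()
        self.results = []
        self.next_page = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("id") == "pnnext":      # like -f "//a[@id='pnnext']/@href"
            self.next_page = a.get("href")
        elif tag == "a" and a.get("class") == "result": # stand-in for the -e extraction
            self.results.append(a.get("href"))

# Hypothetical pages keyed by URL, standing in for fetched result pages.
pages = {
    "/search?start=0": '<a class="result" href="https://x.es/1"></a>'
                       '<a id="pnnext" href="/search?start=10"></a>',
    "/search?start=10": '<a class="result" href="https://x.es/2"></a>',
}

url, links = "/search?start=0", []
while url:                    # keep following until no next-page link is found
    p = LinkParser()
    p.feed(pages[url])
    links.extend(p.results)   # -e runs on every page visited
    url = p.next_page         # -f supplies the next URL (None stops the loop)
print(links)
```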

Be warned: although extracting the URLs with -e "//div[@class='yuRUbf']/a/@href" worked for me, it may not work for you, because @class might have a different value and, above all, changes over time. The same goes for -f "//a[@id='pnnext']/@href".

Answered 2016-07-11T12:32:43.680