shell - Get a random link from a page using the shell

Question

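I'm trying to write a very basic benchmark script that loads random pages from a website, starting with the home page.

I'll be using curl to fetch the contents of each page, but I also want to load a random next page from it. Could someone give me some shell code that extracts the URL from a random a href in the output of the curl command?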

score 1 · Accepted Answer

Using lynx together with a bash array:

hrefs=($(lynx -dump http://www.google.com |
sed -e '0,/^References/{d;n};s/.* \(http\)/\1/'))
echo "${hrefs[RANDOM % ${#hrefs[@]}]}"
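One thing to watch with the array lookup: if lynx finds no links, the array is empty and the modulo divides by zero. Here is a minimal sketch of the same random pick with a guard. `pick_random` is a made-up helper name and the URLs are placeholders, not real lynx output.

```shell
#!/usr/bin/env bash
# pick_random is a hypothetical helper, not part of the answer above.
# It prints one of its arguments, chosen with bash's $RANDOM.
pick_random() {
  local items=("$@")
  local count=${#items[@]}
  # An empty list would make the modulo below divide by zero.
  if (( count == 0 )); then
    return 1
  fi
  printf '%s\n' "${items[RANDOM % count]}"
}

# Example with placeholder URLs standing in for the lynx output.
hrefs=("http://example.com/a" "http://example.com/b" "http://example.com/c")
pick_random "${hrefs[@]}"
```

This needs bash, since both arrays and $RANDOM are bash features.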

score 1 · Accepted Answer

Here's what I came up with:

curl <url> 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1-1 | while read i; do echo "`expr $RANDOM % 1000`:$i"; done | \
sort -n | sed 's/[0-9]*://' | head -1

Replace <url> with the URL you're trying to get a link from.
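Note that `sed 's/.*<a href="//'` is greedy, so the pipeline keeps only the last link on each line of HTML, and only when href is the first attribute of the tag. A variant sketch that pulls out every link could look like the following. It assumes a grep with `-o` (GNU and BSD grep have it) and that `shuf` is installed; the function names are made up, and regex matching on HTML is only approximate (double-quoted attributes only).

```shell
# extract_hrefs: read HTML on stdin and print the href value of every <a> tag,
# one per line. Only handles double-quoted href attributes.
extract_hrefs() {
  grep -o '<a [^>]*href="[^"]*"' | sed 's/.*href="//; s/"$//'
}

# random_href: print one randomly chosen href from the HTML on stdin.
random_href() {
  extract_hrefs | shuf -n 1
}

# Usage (the URL is a placeholder):
#   curl -s http://example.com/ | random_href
```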

Edit: it might be easier to make a script called getrandomurl.sh containing:

#!/bin/bash
# bash rather than sh: $RANDOM is not available in every /bin/sh (e.g. dash).

curl "$1" 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1-1 | while read i; do echo "`expr $RANDOM % 1000`:$i"; done | \
sort -n | sed 's/[0-9]*://' | head -1

and run it like ./getrandomurl.sh http://stackoverflow.com.
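For the benchmark in the question, the script could be chained so each result feeds the next request. This is a rough sketch only: `get_random_url` and `random_walk` are made-up names, and it assumes getrandomurl.sh is in the current directory and prints an absolute URL (relative hrefs would need the domain prepended first).

```shell
# get_random_url: hypothetical wrapper around the script above; assumes it
# prints one absolute URL for the page given as its argument.
get_random_url() {
  ./getrandomurl.sh "$1"
}

# random_walk START HOPS: load up to HOPS pages, beginning at START, printing
# each page visited and following one random link from it to the next.
random_walk() {
  url=$1
  hops=$2
  i=0
  while [ "$i" -lt "$hops" ]; do
    printf '%s\n' "$url"
    next=$(get_random_url "$url")
    # Stop early on a dead end (no link found on the page).
    [ -n "$next" ] || break
    url=$next
    i=$((i + 1))
  done
}

# Usage (placeholder URL):
#   time random_walk http://example.com/ 10
```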

score 1 · Accepted Answer

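Not a curl solution, but I think it's more effective given the task.

I'd suggest using the Perl WWW::Mechanize module for this. For example, to dump all the links in a page, use something like this: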

use WWW::Mechanize;

$mech = WWW::Mechanize->new();
$mech->get("URL");
$mech->dump_links(undef, 'absolute' => 1);

Note that URL should be replaced with the page you want.

Then, continuing in Perl, the following follows a random link on the URL page:

$number_of_links = "" . @{$mech->links()};
$mech->follow_link( n => 1 + int(rand($number_of_links)) );  # n is 1-based

Or use the dump_links version above to get the URLs and process them further in the shell, e.g. to get a random URL (if the script above is called get_urls.pl):

./get_urls.pl | shuf | while read; do
  # Url is now in the $REPLY variable
  echo "$REPLY"
done

score 0 · Accepted Answer

Using pup

A flexible way to get all the links on a page is to use pup with a CSS selector. For example, I can grab all the links (<a> tags) from my blog using:

curl https://jlericson.com/ 2>/dev/null \
| pup 'a attr{href}'

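The attr{href} at the end outputs just the href attribute. If you run the command, you'll notice that several links don't point to posts on my site, but to my email address and Twitter account.

If I want just the blog post links, I can be a bit more choosy: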

curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}'

That grabs only the links with class='post-link', which are the links to my posts.

Now we can pick a random line of the output:

curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| shuf | head -1

The shuf command shuffles the lines like a deck of cards, and head -1 draws the top card off the deck. (Or the first line, if you prefer.)
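shuf comes from GNU coreutils and isn't installed everywhere. Where it is available, `shuf -n 1` does the same job as `shuf | head -1` in one step. Where it isn't, a one-pass awk pick is a possible fallback. This is a sketch using reservoir sampling rather than a shuffle, and `random_line` is a made-up name.

```shell
# random_line: print one line of stdin chosen uniformly at random, using
# reservoir sampling with a reservoir of one line.
random_line() {
  awk 'BEGIN { srand() }
       # Keep line NR with probability 1/NR; the first line is always kept.
       rand() * NR < 1 { pick = $0 }
       END { if (NR > 0) print pick }'
}

# Usage:
#   printf '%s\n' /post/a /post/b /post/c | random_line
```

awk's srand() seeds from the clock, so two calls within the same second will usually make the same choice.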

My links are all relative, so I want to prepend the domain using sed:

curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| sed -e 's|/|https://jlericson.com/|' \
| shuf | head -1

The sed command replaces the first / with the rest of the URL.
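Because that substitution replaces the first / anywhere on the line, it would mangle a link that is already absolute (such as the Twitter one). A slightly more defensive sketch anchors the match at the start of the line. `absolutize` is a made-up name, and it assumes the base URL contains no `|` or `&` characters, which sed would treat specially.

```shell
# absolutize BASE: read hrefs on stdin and prefix root-relative ones (those
# starting with a single "/") with BASE. Everything else passes through.
absolutize() {
  base=${1%/}   # drop one trailing slash so the join never doubles it
  # First expression: "/path" becomes "BASE/path"; "//host/..." is left alone.
  # Second expression: a bare "/" becomes "BASE/".
  sed -e "s|^/\([^/]\)|$base/\1|" -e "s|^/\$|$base/|"
}

# Usage:
#   curl -s https://jlericson.com/ | pup 'a attr{href}' \
#     | absolutize https://jlericson.com/ | shuf | head -1
```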

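I might also want to include the link text. That's a bit trickier, because pup doesn't support two output functions. It does support output to JSON, though, which jq can read: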

curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link json{}' \
| jq -r '.[] | [.text,.href] | @tsv' \
| sed -e 's|/|https://jlericson.com/|' \
| shuf | head -1

The output is tab-separated values, which may or may not be what you want.
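If you'd rather have the two fields separately, the tab-separated line can be split again in the shell. A small sketch, where `print_links` and the output labels are made up for illustration:

```shell
# print_links: read "text<TAB>href" lines on stdin and print each pair with
# labels, splitting on the tab character only.
print_links() {
  tab=$(printf '\t')
  while IFS=$tab read -r text href; do
    printf 'title: %s\nurl:   %s\n' "$text" "$href"
  done
}

# Usage: append "| print_links" to the jq pipeline above.
```

One caveat: the shell treats tab as whitespace when splitting, so a link with empty text would shift the href into the first field.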
