bash - Finding bad links

Question

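I have a list of about 6k links. I need to go through each one and see whether the page it points to contains any of a set of specific words.

What is the simplest way to do this?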

score 3 · Accepted Answer

A quick and dirty solution:

#! /bin/bash
while IFS= read -r link ; do
    wget -qO- "$link" | grep -qiFf words.lst - && echo "$link"
done < links.lst > found.lst

Keep the links in links.lst, one link per line. Keep the words in words.lst, one word per line.
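
Because `-F` matches fixed substrings, a word such as `cat` would also flag a page that only contains `concatenate`. If whole-word matches are wanted, `-w` can be added. The sketch below wraps the check in a helper that reads the page text on standard input, so it can be tried on local text before running it against all 6k URLs. The function names `page_has_word` and `find_links` are made up for illustration and are not part of the answer.

```shell
#!/bin/bash
# page_has_word WORDS_FILE
# Hypothetical helper: reads page text on stdin and succeeds if any line of
# WORDS_FILE occurs in it as a whole word (-w), ignoring case (-i), taken as
# a fixed string rather than a regular expression (-F).
page_has_word() {
    grep -qiwFf "$1" -
}

# find_links LINKS_FILE WORDS_FILE
# Prints every link whose downloaded page contains one of the words.
find_links() {
    while IFS= read -r link || [ -n "$link" ]; do
        [ -n "$link" ] || continue          # skip blank lines
        wget -qO- "$link" | page_has_word "$2" && echo "$link"
    done < "$1"
}
```

With the same file layout as above, this could be run as `find_links links.lst words.lst > found.lst`.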

score 1 · Accepted Answer

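I made one for you:

Create a file named words.txt containing the words to check, separated by spaces.

Create a file named links.url containing the list of URLs to check, one per line.

Create a file named crawler.sh containing the following script: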

#!/bin/bash

# A file with a list of urls one per line
LINKS_FILE="links.url"
# A file with a list of words separated by spaces
WORDS_FILE="words.txt"

HTTP_CLIENT="/usr/bin/wget -O - "

rm -f /tmp/temp.html
for link in $(cat "$LINKS_FILE")
do
        # Downloading page
        echo "--"
        echo "Scanning link: $link"
        $HTTP_CLIENT "$link" > /tmp/temp.html
        if [ $? -ne 0 ]
        then
                echo "## Problem downloading resource $link" 1>&2
                continue
        fi

        # Checking words
        for word in $(cat "$WORDS_FILE")
        do
                echo "Checking for the word \"$word\"..."
                # Quote the word and let grep's exit status drive the test
                if grep -qi -e "$word" /tmp/temp.html
                then
                        echo "** The word \"$word\" was found at the URI \"$link\""
                        continue 2
                fi
        done
        echo "** No words found at \"$link\""
        echo "--"
        echo
done
rm -f /tmp/temp.html

Run the script.
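
The script above writes every page to the fixed path /tmp/temp.html, so two copies running at the same time would overwrite each other's download. A minimal sketch of the same loop using `mktemp` for a private temporary file follows. The function names `first_word_in` and `scan_links` are illustrative only, and the sketch assumes the same links.url and words.txt layout described above.

```shell
#!/bin/bash
# first_word_in PAGE_FILE WORDS_FILE
# Prints the first whitespace-separated word from WORDS_FILE that occurs in
# PAGE_FILE (case-insensitive, literal match); returns 1 if none occurs.
first_word_in() {
    for word in $(cat "$2"); do
        if grep -qiF -e "$word" "$1"; then
            printf '%s\n' "$word"
            return 0
        fi
    done
    return 1
}

# scan_links LINKS_FILE WORDS_FILE
# Downloads each link into a private temporary file and reports the result.
scan_links() {
    page=$(mktemp) || return 1
    while IFS= read -r link || [ -n "$link" ]; do
        [ -n "$link" ] || continue          # skip blank lines
        if ! wget -q -O "$page" "$link"; then
            echo "## Problem downloading resource $link" >&2
            continue
        fi
        if word=$(first_word_in "$page" "$2"); then
            echo "** \"$word\" found at $link"
        else
            echo "** No words found at $link"
        fi
    done < "$1"
    rm -f "$page"
}
```

It could then be invoked as `scan_links links.url words.txt`.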

score 0 · Accepted Answer

You could write a Selenium script that visits each URL and then checks whether the words appear on those pages.

score 0 · Accepted Answer

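Not the fastest way, but the first one that comes to mind:

#!/bin/bash

while IFS= read -r url
do
    # Download the page quietly and capture its body
    content=$(wget -q -O - "$url")

    # and here you can check
    # if there are matches in $content

done < "links.txt"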
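
The placeholder comment in that loop could be filled in along these lines. This is only a sketch: `content_matches` is a hypothetical helper name, and the word being searched for is whatever the caller passes in.

```shell
#!/bin/bash
# content_matches CONTENT WORD
# Succeeds if WORD occurs in CONTENT, ignoring case. WORD is taken
# literally (-F), so characters such as '.' are not regex operators.
content_matches() {
    printf '%s\n' "$1" | grep -qiF -e "$2"
}
```

Inside the loop it might be used as `content_matches "$content" "error" && echo "$url"`, where `error` is just an example word.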
