
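I am trying to build a shell script that reads a file (scope.txt) with a while loop. The scope file contains website domains. The loop should go through scope.txt and search for each domain in another file named urls.txt. I need to grep urls.txt for the pattern, and I need the result shown further down.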

Contents of scope.txt:

google.com
facebook.com

Contents of urls.txt:

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://test.com/sdvs?url=google.com
https://abcd.com/jhhhh/hghv?proxy=https://google.com
https://a.b.c.d.facebook.com/ss/sdfsdf
http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com

The output I need:

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf

That is, the output should contain every URL whose host is one of the domains listed in scope.txt, or a subdomain of one of them.

I tried to write a shell script for this, but it does not produce the desired output. The script:

while read -r line; do
cat urls.txt | grep -e "^https\:\/\/$line\|^http\:\/\/$line"
done < scope.txt
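
One likely reason the loop above misses subdomains is that its pattern requires the domain to come immediately after the scheme, so a host such as abcs.google.com never matches. Below is a minimal sketch of an adjusted loop, assuming a POSIX shell and a grep that supports -E. The function name filter_in_scope is hypothetical and not part of the original post.

```shell
#!/bin/sh
# Sketch: print URLs whose host is a scope domain or a subdomain of one.
# filter_in_scope is a hypothetical helper name used for illustration.
filter_in_scope() {
    scope_file=$1
    urls_file=$2
    while IFS= read -r domain; do
        # Skip blank lines so an empty domain cannot produce a loose pattern.
        [ -n "$domain" ] || continue
        # Escape dots so "google.com" does not also match "googleXcom".
        escaped=$(printf '%s\n' "$domain" | sed 's/\./\\./g')
        # Scheme, an optional subdomain prefix ending in ".", the domain,
        # then a host boundary: "/", ":", "?", "#", or end of line.
        # grep exits 1 when a domain has no match; that is not an error here.
        grep -E "^https?://([^/?#]+\.)?${escaped}([/:?#]|\$)" "$urls_file" || true
    done < "$scope_file"
}

# Usage: filter_in_scope scope.txt urls.txt
```

Because grep runs once per domain, the output is grouped by the order of scope.txt rather than by the order of urls.txt.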

2 Answers


You can use this grep + sed solution:

grep -Ef <(sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt) urls.txt

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf

The sed command builds the regular expressions that grep then uses. Its output is:

sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt

^https?://([^.?]+\.)*google\.com
^https?://([^.?]+\.)*facebook\.com
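
The generated patterns above work for the sample data, but as far as I can tell they would also accept a host such as google.com.evil.com (nothing is required after the domain) and a URL such as https://evil.com/x.google.com (the label class allows "/"). A stricter sketch follows, assuming a POSIX shell with mktemp available. It writes the patterns to a temporary file instead of using process substitution, which is a bash feature. The name in_scope_urls is hypothetical.

```shell
#!/bin/sh
# Sketch: build one anchored ERE per scope domain, then filter urls with grep.
# in_scope_urls is a hypothetical helper name used for illustration.
in_scope_urls() {
    scope_file=$1
    urls_file=$2
    patterns=$(mktemp) || return 1
    # 1. Drop blank lines.
    # 2. Escape dots in each domain.
    # 3. Wrap the domain: scheme, zero or more subdomain labels that cannot
    #    contain "/", "?", "#" or ".", the domain, then a host boundary.
    sed -e '/^$/d' \
        -e 's/\./\\./g' \
        -e 's~.*~^https?://([^/?#.]+\\.)*&([/:?#]|$)~' \
        "$scope_file" > "$patterns"
    status=0
    grep -E -f "$patterns" "$urls_file" || status=$?
    rm -f "$patterns"
    return "$status"
}

# Usage: in_scope_urls scope.txt urls.txt
```

For google.com the pattern file contains `^https?://([^/?#.]+\.)*google\.com([/:?#]|$)`. A single grep pass keeps the matches in the order they appear in urls.txt.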
answered 2021-06-05T17:58:39.903

With the samples you have shown, please try the following.

awk '
FNR==NR{
  arr[$0]
  next
}
{
  for(key in arr){
    if($0~/^https?:\/\// && $0 ~ key"/"){
      print
      next
    }
  }
}
' scope urlfile

Explanation: a detailed walkthrough of the code above.

awk '                  ## Start of the awk program.
FNR==NR{               ## True only while reading the first file (scope).
  arr[$0]              ## Store the current line (a domain) as an index of arr.
  next                 ## Skip the remaining statements for this line.
}
{
  for(key in arr){     ## Loop over every domain stored in arr.
    if($0~/^https?:\/\// && $0 ~ key"/"){  ## If the line starts with http:// or https:// and contains key followed by "/":
      print            ## Print the current line.
      next             ## Move on to the next input line.
    }
  }
}
' scope urlfile        ## Input file names.
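
The `$0 ~ key"/"` test above treats the domain as a regular expression and looks for it, followed by a slash, anywhere in the line. A URL such as https://abcd.com/x?proxy=https://google.com/ would therefore also be printed. Below is a hedged sketch of a variant that extracts the host first and compares it as a plain string. It assumes URLs carry no user:password@ prefix, and the function name host_in_scope is hypothetical.

```shell
#!/bin/sh
# Sketch: compare the URL host against scope domains as literal strings.
# host_in_scope is a hypothetical helper name used for illustration.
host_in_scope() {
    awk '
    FNR == NR {                      # first file: remember each domain
        if ($0 != "") scope[$0]
        next
    }
    {
        # Take scheme plus host: everything up to the first "/", "?" or "#".
        if (!match($0, /^https?:\/\/[^\/?#]+/)) next
        host = substr($0, 1, RLENGTH)
        sub(/^https?:\/\//, "", host)   # drop the scheme
        sub(/:[0-9]+$/, "", host)       # drop an optional port
        for (d in scope) {
            suffix = "." d
            # In scope if the host equals the domain or ends with ".domain".
            if (host == d ||
                (length(host) > length(suffix) &&
                 substr(host, length(host) - length(suffix) + 1) == suffix)) {
                print
                next
            }
        }
    }' "$1" "$2"
}

# Usage: host_in_scope scope.txt urls.txt
```

String comparison avoids having to escape the dots in each domain, at the cost of a slightly longer program.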
answered 2021-06-05T18:38:57.460