linux - Bash script to return domains instead of URLs

Question

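I have this bash script, which I wrote to analyse the HTML of any given web page. What it is actually supposed to do is return the domains found on that page. Currently it returns the number of URLs on that web page.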

#!/bin/sh

echo "Enter a url eg www.bbc.com:"
read url
content=$(wget "$url" -q -O -)
echo "Enter file name to store URL output"
read file
echo $content > $file
echo "Enter file name to store filtered links:"
read links
found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
cat out

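How can I get it to return the domains instead of the URLs? From my programming knowledge I know it should parse from the right side, but I am new to bash scripting. Can someone please help me? This is as far as I have got.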

score 2 · Accepted Answer

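Edit 2: Note that you may want to adjust the search pattern in the sed expression to suit your needs. This solution only considers the http:// and https:// protocols and servers with a www. prefix...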

Edit:
If you want both the counts and the domains:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*https\{0,1\}://\([^/]*\).*$@\1@p' | \
   sort | \
     uniq -c | \
       sed 's/www\.//'

which gives

2 wordpress.org
10 zelleke.com

Original answer:

You may want to use lynx to extract the links from a URL:

lynx -dump -listonly http://zelleke.com

which gives

# blank line at the top of the output
References

   1. http://www.zelleke.com/feed/
   2. http://www.zelleke.com/comments/feed/
   3. http://www.zelleke.com/
   4. http://www.zelleke.com/#content
   5. http://www.zelleke.com/#secondary
   6. http://www.zelleke.com/
   7. http://www.zelleke.com/wp-login.php
   8. http://www.zelleke.com/feed/
   9. http://www.zelleke.com/comments/feed/
  10. http://wordpress.org/
  11. http://www.zelleke.com/
  12. http://wordpress.org/

Based on this output, you can get the desired result with:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*http://\([^/]*\).*$@\1@p' | \
   sort -u | \
     sed 's/^www\.//'

which gives

wordpress.org
zelleke.com
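
As a rough sketch of the same idea that does not depend on the `4,$` line offset: the filter below only looks at lines shaped like the numbered entries in the lynx listing above, and strips a leading www. before counting, so that www.example.com and example.com are tallied together. The function name count_link_hosts is made up for illustration, and the input is assumed to have the same layout as the listing shown above.

```shell
#!/bin/sh
# Sketch: count hosts in "lynx -dump -listonly" style output read from stdin.
# count_link_hosts is a hypothetical helper name, not a lynx feature.
count_link_hosts() {
    # Keep only numbered entries such as "   1. http://host/path" and
    # reduce each one to its host.
    sed -n -E 's@^ *[0-9]+\. https?://([^/]*).*$@\1@p' |
        sed 's/^www\.//' |   # fold www.example.com into example.com
        sort |
        uniq -c              # prefix each host with its number of occurrences
}
```

It would be used as `lynx -dump -listonly http://zelleke.com | count_link_hosts`. Stripping www. before `uniq -c` rather than after is a design choice: done afterwards, the two spellings of a host would be counted on separate lines.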

score 2 · Accepted Answer

I know there are better ways to do this in awk, but you can do it with sed by appending this after your awk '/http/':

| sed -e 's;https\?://;;' | sed -e 's;/.*$;;'

Then you will want to move your sort and uniq to the end.

So the whole line will look like this:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | awk   '/http/' | sed -e 's;https\?://;;' | sed -e 's;/.*$;;' | sort | uniq -c > out)

And you can get rid of this line:

output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
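
For illustration, the pipeline described above could be wrapped in a small function that reads HTML on stdin. This is only a sketch: count_domains is a hypothetical name, it assumes double-quoted href attributes, and it keeps only absolute http/https links, replacing the awk '/http/' filter with an anchored grep.

```shell
#!/bin/sh
# Sketch: read HTML on stdin, print "count host" for each absolute http(s) link.
# count_domains is a hypothetical helper name, not part of the original script.
count_domains() {
    grep -o -E 'href="[^"#]+"' |                      # keep only href="..." attributes
        cut -d '"' -f2 |                              # take the URL between the quotes
        grep -E '^https?://' |                        # absolute http/https links only
        sed -E 's@^https?://@@; s@[/?].*$@@' |        # drop the scheme, then path/query
        sort |
        uniq -c                                       # count each remaining host
}
```

It could be fed straight from wget, for example `wget -q -O - "$url" | count_domains`, which also avoids the intermediate files.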

score 0 · Accepted Answer

You may be interested in this:

https://www.rfc-editor.org/rfc/rfc3986#appendix-B

It explains a way to parse a URI with a regular expression.

So you can parse a URI from the left in this way and extract the "authority", which contains the domain and subdomain names.

sed -r 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_g';
grep -Eo '[^.]+\.[^.]+$' # pipe the output of the first command into this to get what you need

This is also interesting:

http://www.scribd.com/doc/78502575/124/Extracting-the-Host-from-a-URL

Assuming that a URL always starts this way

https?://(www\.)?

is really dangerous.
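
A minimal sketch of the left-to-right approach, under some assumptions: uri_host is a hypothetical name, the first expression is the reduced Appendix B pattern quoted above, and the two extra expressions (not part of the RFC pattern) strip an optional user:password@ prefix and an optional :port suffix from the authority. Input without a `//` authority, such as www.bbc.com or mailto: addresses, yields an empty line.

```shell
#!/bin/sh
# Sketch: print the host of each URI read from stdin, one per line.
# uri_host is a hypothetical helper name.
uri_host() {
    # 1st expression: reduce the URI to its authority (group 3).
    # 2nd expression: drop userinfo up to and including "@".
    # 3rd expression: drop a trailing ":port".
    sed -E \
        -e 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_' \
        -e 's_^[^@]*@__' \
        -e 's_:[0-9]*$__'
}
```

Note that keeping only the last two dot-separated labels, as the grep above does, turns sub.example.co.uk into co.uk, so whether that step is wanted depends on the domains involved.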

score 0 · Accepted Answer

You can use sed to strip the path from the URL:

sed 's@http://@@; s@/.*@@'

I would also like to point out that these two lines are wrong:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)

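You should use either a redirection (> out) or a command substitution $(), but not both at the same time, because in that case the variable will be empty.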
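
A small demonstration of that point, as a sketch: the variable names and the temporary file are made up for the example. The last assignment shows tee as one possible way to get both a file and a captured value, which is an addition here and not something the script above does.

```shell
#!/bin/sh
# Sketch: stdout redirected to a file is no longer available to $(...).
tmp=$(mktemp)                                  # scratch file for the demo
both=$(printf 'example.com\n' > "$tmp")        # output goes to the file, nothing is captured
only=$(printf 'example.com\n')                 # no redirection, so the output is captured
saved=$(cat "$tmp")                            # the redirected text did land in the file
kept=$(printf 'example.com\n' | tee "$tmp")    # tee writes the file and passes the text on
rm -f "$tmp"
printf 'both=[%s] only=[%s] saved=[%s] kept=[%s]\n' "$both" "$only" "$saved" "$kept"
# prints: both=[] only=[example.com] saved=[example.com] kept=[example.com]
```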

This part

content=$(wget "$url" -q -O -)
echo $content > $file

would also be better written like this:

wget "$url" -q -O - > $file
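
One likely reason the direct redirection is preferable, shown as a sketch with made-up sample content: an unquoted `echo $content` goes through word splitting, so the page's line breaks are replaced by single spaces (and any glob characters in the page could be expanded), whereas writing wget's output directly to the file leaves it untouched.

```shell
#!/bin/sh
# Sketch: compare unquoted and quoted expansion of captured multi-line text.
content=$(printf '<a href="http://example.com/">\n<a href="http://example.org/">\n')
unquoted=$(echo $content)              # word splitting turns the newline into a space
quoted=$(printf '%s\n' "$content")     # quoting keeps the line structure
printf '%s\n' "$unquoted" | wc -l      # counts a single line
printf '%s\n' "$quoted" | wc -l        # counts two lines
```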
