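My IP address extraction works fine when I run it as a standalone script. But inside my complete web-scanning program it breaks and gives inconsistent results.

Here is my IP address code snippet: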

    #!/usr/bin/python3

    import os
    import re 

    def get_ip_address(url):
        command = "host " + url
        process = os.popen(command)
        results = str(process.read())
        marker = results.find("has address") + 12
        n = (results[marker:].splitlines()[0])
        m = re.search('\w+ \w+: \d\([A-Z]+\)', n)
        if m is not None:
            url_new = url[8:]
            command = "host " + url_new
            process = os.popen(command)
            results = str(process.read())
            marker = results.find("has address") + 12
            return results[marker:].splitlines()[0]

    print(get_ip_address("https://www.yahoo.com"))

The complete web-scanning program is as follows:

    #!/usr/bin/python3

    from general import *
    from domain_name import *
    from ip_address import *
    from nmap import * 
    from robots_txt import *
    from whois import *

    ROOT_DIR = "companies"
    create_dir(ROOT_DIR)

    def gather_info(name, url):
        domain_name = get_domain_name(url)
        ip_address = get_ip_address(url)
        nmap = get_nmap('-F', ip_address)
        robots_txt = get_robots_txt(url)
        whois = get_whois(domain_name)
        create_report(name, url, domain_name, nmap, robots_txt, whois, ip_address)

    def create_report(name, full_url, domain_name, nmap, robots_txt, whois, ip_address):
        project_dir = ROOT_DIR + '/' + name
        create_dir(project_dir)
        write_file(project_dir + '/full_url.txt', full_url)
        write_file(project_dir + '/domain_name.txt', domain_name)
        write_file(project_dir + '/nmap.txt', nmap)
        write_file(project_dir + '/robots_txt.txt', robots_txt)
        write_file(project_dir + '/whois.txt', whois)
        write_file(project_dir + '/ip_address.txt', ip_address)

    x = input("Enter the Company Name: ")
    y = input("Enter the complete url of the company: ")    
    gather_info( x , y )

The input and output look like this:

    root@nitin-Lenovo-G580:~/Desktop/web_scanning# python3 main.py 
    106.10.138.240
    Enter the Company Name: Yahoo
    Enter the complete url of the company: https://www.yahoo.com/
    /bin/sh: 1: Syntax error: "(" unexpected

The content of `ip_address.txt` is:

    hoo.com/ not found: 3(NXDOMAIN)

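As you can see, the standalone snippet runs fine and prints the IP as 106.10.138.240, yet something different is saved in `ip_address.txt`. I also could not figure out where the `/bin/sh` syntax error comes from. Please help.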

2 Answers

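Sorry, I don't have enough reputation to add a comment, so I'll post my suggestions here.

I think the problem is at `process = os.popen(command)` in `def get_ip_address(url)`. You can print `command` to see whether it is valid.

Apart from the problem itself, a few suggestions: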

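  1. Try not to use `*` in imports, because it makes the code harder for readers to follow.

  2. Learn pdb, the Python debugger. It is simple but powerful, and well suited to small and medium projects. The easiest way to use it is to add `import pdb; pdb.set_trace()` before the line where you want the program to stop, so that you can step through the code line by line.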
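
Both suggestions can be combined in a small sketch. The helper name `build_host_command` and the `debug` flag are hypothetical and only serve the illustration; the command string is built the same way as in the question.

```python
import pdb

def build_host_command(url, debug=False):
    """Build the shell command the same way the question does."""
    command = "host " + url
    # repr() makes stray characters such as a trailing slash easy to spot.
    print("command:", repr(command))
    if debug:
        # Execution pauses here; inspect `command`, then type `c` to continue.
        pdb.set_trace()
    return command

command = build_host_command("https://www.yahoo.com/")
```

Calling it with `debug=True` drops into the debugger right after the command has been built.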

answered 2016-06-19T18:27:34.577

I second Joe Lin's advice not to use wildcards in your import statements. They heavily pollute your namespace and can produce strange behavior.

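Python is "batteries included", so you should probably leverage the `requests` or `urllib3` packages to handle HTTP requests, use `subprocess` judiciously to execute commands, and check out the `scrapy` package for web scraping. The data returned by their respective objects and methods may already contain what you are trying to extract.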
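
As a rough sketch of the `subprocess` suggestion: passing the command as a list of arguments means no shell ever parses it, so characters such as `(` cannot cause a `/bin/sh` syntax error. The helper name `run_command` is hypothetical, and the Python interpreter stands in for the real command so the example runs anywhere.

```python
import subprocess
import sys

def run_command(args):
    """Run a command given as a list of arguments, without a shell."""
    completed = subprocess.run(args, capture_output=True, text=True, check=False)
    return completed.stdout

# The parentheses reach the child process literally because no shell is involved.
output = run_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "3(NXDOMAIN)"]
)
print(output)
```

For the lookup in the question, the call would presumably look like `run_command(["host", hostname])`, assuming the `host` utility is installed.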

Be as lazy as possible and rely on "prior art".

In the first few lines of `get_ip_address` I noticed the following:

    def get_ip_address(url):
        command = "host " + url
        process = os.popen(command)
        ....

If I executed this command through a shell, it would actually look like this:

    host http://www.foo.com

Do a `man host` and read the man page:

    host is a simple utility for performing DNS lookups. It is normally
    used to convert names to IP addresses and vice versa. When no arguments
    or options are given, host prints a short summary of its command line
    arguments and options.

    name is the domain name that is to be looked up. It can also be a
    dotted-decimal IPv4 address or a colon-delimited IPv6 address, in which
    case host will by default perform a reverse lookup for that address.
    server is an optional argument which is either the name or IP address
    of the name server that host should query instead of the server or
    servers listed in /etc/resolv.conf.

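You are giving `host` a URL, while it only expects an IP address or a hostname. A URL includes a scheme, a hostname, and a path. You will have to extract the hostname explicitly for `host` to work the way you intend. Given that a URL may or may not contain detailed path information, you will have to take it apart: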

    url = "http://www.yahoo.com/some_random/path"

    # Split on "//" to drop the scheme
    _, host_and_path = url.split("//", 1)

    # partition() splits at the first "/" and also works when there is no path
    hostname, _, path = host_and_path.partition("/")

    # Use 'hostname' as input to the command
    command = "host " + hostname
    ...
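
An alternative sketch, swapping the manual string splitting for the standard library: `urllib.parse.urlsplit` extracts the hostname and `socket.gethostbyname` resolves it, so the external `host` command is not needed at all. The function names `extract_hostname` and `resolve_ip` are hypothetical, and `resolve_ip` needs network access.

```python
import socket
from urllib.parse import urlsplit

def extract_hostname(url):
    """Return the hostname part of a URL, or None if there is none."""
    return urlsplit(url).hostname

def resolve_ip(url):
    """Resolve the URL's hostname to an IPv4 address (needs network access)."""
    hostname = extract_hostname(url)
    if hostname is None:
        raise ValueError("no hostname found in " + repr(url))
    return socket.gethostbyname(hostname)

print(extract_hostname("https://www.yahoo.com/"))
print(extract_hostname("http://www.yahoo.com/some_random/path"))
```

Note that `urlsplit` only recognizes the hostname when the URL carries a scheme such as `https://`; a bare `www.yahoo.com` yields `None`.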

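I don't think the question provides all the code relevant to this problem. The error output appears to come from the shell rather than from a traditional Python stack trace, so it probably originates in one of the `get_something` functions that uses `Popen` to execute the shell commands you want.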
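
The following sketch shows how the string in `ip_address.txt` could arise, and why it would upset the shell later. It rests on two assumptions: that `host www.yahoo.com/` printed the line below (reconstructed from the saved file, not shown in the question), and that `get_nmap` builds its command by string concatenation (its source is not shown either).

```python
# Assumed output of `host www.yahoo.com/`, reconstructed from ip_address.txt.
results = "Host www.yahoo.com/ not found: 3(NXDOMAIN)\n"

# find() returns -1 when "has address" is absent, so marker becomes 11.
marker = results.find("has address") + 12
bogus_ip = results[marker:].splitlines()[0]
print(bogus_ip)

# Hypothetical: get_nmap is not shown, so this concatenation is a guess.
nmap_command = "nmap -F " + bogus_ip
print(nmap_command)
```

If a string like `nmap_command` were handed to `os.popen`, `/bin/sh` would see the unquoted `(` and could report the syntax error shown in the question.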

answered 2016-06-19T19:03:43.623