python - phemail.py - has no attribute 'encode'

Question

root@bt:~# ./phemail.py -g0@*******.com
Gathering emails from domain: ******.com
Traceback (most recent call last):
  File "./phemail.py", line 206, in <module>
  gatherEmails(domain[0],domain[1],p)
  File "./phemail.py", line 51, in gatherEmails
  namesurname = re.sub(' -.*','',a.text.encode('utf8'))
AttributeError: 'NoneType' object has no attribute 'encode'

Why is a.text of type NoneType?

score 0 · Accepted Answer

a.text has no value (None). There is probably something wrong with the line that initializes the variable a.

By the way, I don't recommend doing things as root.

score 0 · Accepted Answer

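By way of explanation, the script uses Google to search LinkedIn's indexed pages, specifically pages where a user's name appears (as opposed to company profiles, jobs, discussions and so on). Since the target company name and what is presumably that company's standard email format are known (and specified in the script's args), the search appears to look for all LI profile-page results that mention the company, extract the names, and generate email addresses from those names. It is not scraping email addresses, or even domain names; it is scraping names.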
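
The name handling described above can be sketched in isolation. This is only an illustration, assuming result titles of the form "First Last - something"; name_parts and make_email are hypothetical helper names. The regexes are copied from the script, and the firstname.surname@domain format is taken from the comment in its code, not from anything the script is confirmed to output.

```python
import re

def name_parts(title):
    """Return (firstname, surname) for a result title, or None if it is rejected."""
    # Drop everything from " -" onwards: "John Smith - LinkedIn" -> "John Smith".
    namesurname = re.sub(r' -.*', '', title)
    # Same regexes as the script: remove the trailing words / the leading words.
    firstname = re.sub(r' ([a-zA-Z])+', '', namesurname).lower()
    surname = re.sub(r'([a-zA-Z])+ ', '', namesurname).lower()
    # Reject single-word titles (both parts identical) and non-word characters.
    if firstname == surname:
        return None
    if re.search(r'\W', firstname) or re.search(r'\W', surname):
        return None
    return firstname, surname

def make_email(firstname, surname, domain):
    # Assumed format for illustration: firstname.surname@domain
    return "%s.%s@%s" % (firstname, surname, domain)

parts = name_parts("John Smith - LinkedIn")
if parts is not None:
    print(make_email(parts[0], parts[1], "example.com"))  # john.smith@example.com
```

A single-word title such as "Directory" is rejected by the first check, because both regexes leave it unchanged and the two parts come out identical.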

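It actually shows a lack of understanding of how LI makes public profiles visible to search engines (or a tolerance for a lot of garbage results), because your results will be full of "directory" pages rather than profiles.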

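But aside from that strategic error, you are also using the script incorrectly. Google does not support per-character wildcards; a wildcard mostly means that one or more words may appear in between (or after/before, but it works best in between) other words. Wildcard behaviour is a bit tricky, though, and is not fully documented for all cases. So even if the script had not failed later on, your output would have been the first hundred names that turn up in a very generic "site:" LinkedIn search (without any company/domain information). Not sure what use that would be to anyone?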
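
To see what the script actually sends to Google, the URL construction from gatherEmails can be reproduced on its own. This is a sketch: build_search_url is a hypothetical name, and example.com is a placeholder domain. If the domain argument really contained asterisks (rather than being masked for posting), they would go into the query literally.

```python
import re

def build_search_url(domain, page):
    # Everything from the first dot onwards is dropped, so only the
    # leading label of the domain ends up in the search query.
    term = re.sub(r'\..*', '', domain)
    # The script appends "0" to the page index: 0 -> "00", 1 -> "10", ...
    return ("http://www.google.co.uk/search?hl=en&safe=off"
            "&q=site:linkedin.com/pub+" + term + "&start=" + str(page) + "0")

print(build_search_url("example.com", 0))
print(build_search_url("*******.com", 0))
```

The second call puts nothing but asterisks after site:linkedin.com/pub, which is the generic search the paragraph above describes.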

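As for why the script fails on that particular line: you are iterating over the output of a BeautifulSoup findAll call to get the a-tags of the search result items. In this case a.text has the value and type None, which causes the error because None has no encode() method. BeautifulSoup has a lot of great shortcuts, but they can make it hard to trace back and find errors. The result of findAll is a set of tags, and a tag's default behaviour is to act like findAll, so I think a.text is like calling findAll('text') on the individual tag inside the loop. I can't say for sure why that isn't working (I don't have BeautifulSoup on this machine), but you should be able to play around with it and see what is going wrong.

The relevant part: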
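
One way to play around with it is to stop calling encode() blindly and instead collect the tags whose text comes back as None, so they can be inspected. This is a minimal sketch, not a fix taken from the script: collect_names and FakeLink are hypothetical, and FakeLink only stands in for the tags returned by findAll so the example runs without BeautifulSoup. The text is kept as a unicode string rather than encoded.

```python
import re

def collect_names(links):
    """Split links into extracted names and tags whose .text is None."""
    names, skipped = [], []
    for a in links:
        text = getattr(a, "text", None)
        if text is None:
            # This is the case that raised AttributeError in the traceback.
            skipped.append(a)
            continue
        names.append(re.sub(r' -.*', '', text))
    return names, skipped

class FakeLink(object):
    """Hypothetical stand-in for a tag returned by findAll."""
    def __init__(self, text):
        self.text = text

names, skipped = collect_names([FakeLink(u"John Smith - LinkedIn"), FakeLink(None)])
print(names)         # the names that could be extracted
print(len(skipped))  # how many tags had no text
```

If every tag ends up in skipped, the problem is with how the text is being read, not with individual search results.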

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers={'User-Agent':user_agent,}
p = 10

def gatherEmails(l,domain,p):
    print "Gathering emails from domain: "+domain
    emails = []
    for i in range(0,p):
        url = "http://www.google.co.uk/search?hl=en&safe=off&q=site:linkedin.com/pub+"+re.sub('\..*','',domain)+"&start="+str(i)+"0"
        request=urllib2.Request(url,None,headers)
        response = urllib2.urlopen(request)
        data = response.read()
        html = BeautifulSoup(data)
        for a in html.findAll('a',attrs={'class':'l'}):
            namesurname = re.sub(' -.*','',a.text.encode('utf8'))
            firstname = re.sub(' ([a-zA-Z])+','',namesurname).lower()
            surname = re.sub('([a-zA-Z])+ ','',namesurname).lower()
            sys.stdout.write("\r%d%%" %((100*(i+1))/p))
            sys.stdout.flush()
            if firstname != surname and not re.search('\W',firstname) and not re.search('\W',surname):                
                if l == '0' : # 1- firstname.surname@example.com
                    emails.append(firstname+" "+surname)

score 0 · Accepted Answer

You are using a version of Beautiful Soup older than 3.0.8. Upgrade to get .text, .getText(separator) and (in Beautiful Soup 4) .get_text(separator).
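
If upgrading is not possible right away, a small helper can try the newer methods first and fall back to joining the tag's text nodes. This is a sketch under assumptions: tag_text and OldStyleTag are hypothetical names, and the stand-in class assumes that on old versions an unknown attribute resolves to None, as a.text does in the traceback, instead of raising.

```python
def tag_text(tag, separator=u""):
    """Return the text of a tag using whichever accessor is available."""
    # get_text is the Beautiful Soup 4 name, getText the 3.0.8+ name.
    for name in ("get_text", "getText"):
        method = getattr(tag, name, None)
        if callable(method):
            return method(separator)
    # Fallback: join the text nodes found under the tag.
    return separator.join(tag.findAll(text=True))

class OldStyleTag(object):
    """Hypothetical stand-in for a tag from an old Beautiful Soup version."""
    def __init__(self, strings):
        self._strings = strings
    def __getattr__(self, name):
        # Only reached for attributes that do not exist; mimics the
        # None result seen for a.text in the traceback.
        return None
    def findAll(self, text=None):
        return list(self._strings)

print(tag_text(OldStyleTag([u"John Smith", u" - LinkedIn"])))
```

In the script, the a.text.encode('utf8') expression could then become tag_text(a).encode('utf8'), though upgrading as suggested above remains the simpler route.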
