You're off to a good start, but right now you're only retrieving the index page and loading it into the BeautifulSoup parser. Now that you have the hrefs from the links, you essentially need to open all of those links and load their contents into data structures you can then use for your analysis.
This essentially amounts to a very simple web crawler. If you can use other people's code, you may find something that fits by googling "python web crawler." I've looked at a few of those, and they are simple enough but may be overkill for this task. Most web crawlers use recursion to traverse the full tree of a given site; it looks like something simpler could suffice for your case.
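For reference, the recursive pattern those crawlers use looks roughly like the sketch below. This is a minimal illustration under my own assumptions (the function name, the visited set, and the max_depth cap are mine, not from any particular project), not something you need for this task:

import urllib2, urlparse
from BeautifulSoup import BeautifulSoup

def crawl(url, visited=None, max_depth=2):
    # stop if we've already seen this page or have recursed too deep
    if visited is None:
        visited = set()
    if url in visited or max_depth < 0:
        return
    visited.add(url)
    print "visiting " + url
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    for a in soup.findAll('a', href=True):
        # resolve relative hrefs against the current page before recursing;
        # a real crawler would also restrict itself to a single domain here
        crawl(urlparse.urljoin(url, a['href']), visited, max_depth - 1)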
Given my limited familiarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or at least give you a sense of how the crawling would be done:
from BeautifulSoup import BeautifulSoup
import urllib2, urlparse, re

BASE_URL = 'http://www.419scam.org/emails/'
emailContents = []

def analyze_emails():
    # this function and any sub-routines would analyze the emails after
    # they are loaded into a data structure, e.g. emailContents
    pass

def parse_email_page(link):
    print "opening " + link
    # open, soup, and parse the page. It looks like the email itself is in a
    # "blockquote" tag, so that may be the starting place. From there you'll
    # need to create arrays and/or dictionaries of the emails' contents to do
    # your analysis on, e.g. emailContents
    pass

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    # add your own code here to filter the list page soup down to the relevant
    # links to actual email pages; findAll('a', href=True) is just a placeholder
    # that grabs every link on the page
    email_page_links = soup.findAll('a', href=True)
    for email_link in email_page_links:
        # hrefs on these pages are relative, so resolve them against the list page URL
        parse_email_page(urlparse.urljoin(link, email_link['href']))

def main():
    html = urllib2.urlopen(BASE_URL).read()
    soup = BeautifulSoup(html)
    # I use '20' to filter the links, since all the relevant links seem to
    # have a 20XX year in them. Seemed to work.
    links = soup.findAll(href=re.compile("20"))
    for link in links:
        parse_list_page(urlparse.urljoin(BASE_URL, link['href']))
    analyze_emails()

if __name__ == "__main__":
    main()
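If it helps, here is one way parse_email_page could be filled in. It rests on the assumption from the comment above that each email sits inside a blockquote tag; verify that against the actual pages before relying on it:

def parse_email_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    # assumption: each email body lives in a <blockquote> tag, per the comment above
    for quote in soup.findAll('blockquote'):
        # join the tag's text nodes into one string and store it for later analysis
        emailContents.append(''.join(quote.findAll(text=True)))

From there, analyze_emails() can work directly over whatever ends up in emailContents.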