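I am trying to extract information from the HTML file at http://dl.acm.org/results.cfm?CFID=376026650&CFTOKEN=88529867. For each paper title, I need the authors, the journal name, and the abstract. However, I get repeated versions of each field before I get them together. Please help. That is, I first get a list of all titles, then all authors, then all journals, then all abstracts, and only after that do I get them grouped by title, i.e. each title followed by its own authors, journal name, and abstract. I only need the grouped output, not the separate lists.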
from bs4 import BeautifulSoup

# Open the locally stored HTML file and read its content into 'html'
with open('acmpage.html', 'r') as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')
pret = soup.prettify()
soup1 = BeautifulSoup(pret, 'html.parser')

# Every <table> is re-parsed and searched on its own, so nested tables
# are visited more than once
for content in soup1.find_all("table"):
    soup2 = BeautifulSoup(str(content), 'html.parser')
    pret2 = soup2.prettify()
    soup3 = BeautifulSoup(pret2, 'html.parser')
    for title in soup3.find_all('a', target='_self'):  # print titles
        print("Title:", title.get_text())
    for auth in soup3.find_all('div', class_='authors'):  # print authors
        print("Authors:", auth.get_text())
    for journ in soup3.find_all('div', class_='addinfo'):  # print journal names
        print("Journal:", journ.get_text())
    for abstract in soup3.find_all('div', class_='abstract2'):  # print abstracts
        print("Abstract:", abstract.get_text())
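
To make the desired output concrete, here is a minimal sketch of the grouping step, using only the standard library. It assumes the four lists (titles, authors, journals, abstracts) have already been extracted as plain strings and that the i-th entry of each list belongs to the same paper, which may not hold on the real page. The function name `group_records` and the sample strings are hypothetical placeholders, not data from the ACM results.

```python
from itertools import zip_longest


def group_records(titles, authors, journals, abstracts):
    """Combine four parallel lists into one dict per paper.

    Assumption: entries at the same index describe the same paper.
    If a list is shorter than the others, its missing entries are
    filled with an empty string.
    """
    records = []
    for title, author, journal, abstract in zip_longest(
            titles, authors, journals, abstracts, fillvalue=""):
        records.append({
            "title": title.strip(),
            "authors": author.strip(),
            "journal": journal.strip(),
            "abstract": abstract.strip(),
        })
    return records


# Placeholder strings for illustration only (not real page content)
records = group_records(
    ["  Paper A ", "Paper B"],
    ["Author 1, Author 2", "Author 3"],
    ["Journal X", "Journal Y"],
    ["Abstract of A", "Abstract of B"],
)

# Print each paper as one block: title, then its own fields
for rec in records:
    print("Title:", rec["title"])
    print("Authors:", rec["authors"])
    print("Journal:", rec["journal"])
    print("Abstract:", rec["abstract"])
    print()
```

With real data, the lists would come from the `get_text()` calls in the code above. Index-based pairing is fragile if some paper lacks an abstract or an author block, so searching for the fields inside each result's own container element is likely more robust.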