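python - Why do I get the authors, titles, abstracts, and journals listed separately first, and only then together? They should be grouped under each title.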

Question

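I am trying to extract information from the HTML file at http://dl.acm.org/results.cfm?CFID=376026650&CFTOKEN=88529867. For each paper title I need the authors, the journal name, and the abstract. But before I get them together, I first get a separate, repeated listing of each. That is, I first get a list of titles, then the authors, then the journals, then the abstracts, and only after that do I get them grouped by title: a title followed by its own authors, journal name, and abstract. I only need them grouped together, not listed separately. Please help.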

from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests
import re

f = open('acmpage.html', 'r') # open the HTML file stored locally
html = f.read() #read from the html file and store the content in 'html'
soup = BeautifulSoup(html)
pret = soup.prettify()
soup1 = BeautifulSoup(pret)
for content in soup1.find_all("table"):
    soup2 = BeautifulSoup(str(content))
    pret2 = soup2.prettify()
    soup3 = BeautifulSoup(pret2)

    for titles in soup3.find_all('a', target = '_self'): #to print title
        print "Title: ", 
        print titles.get_text()
    for auth in soup3.find_all('div', class_ = 'authors'): #to print authors
        print "Authors: ", 
        print auth.get_text()
    for journ in soup3.find_all('div', class_ = 'addinfo'): #to print name of journal
        print "Journal: ", 
        print journ.get_text()
    for abs in soup3.find_all('div', class_ = 'abstract2'): # to print abstract
        print "Abstract: ", 
        print abs.get_text()

score 1 · Accepted Answer

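You are searching for each list of information separately, so it is no surprise that you see each type of information listed on its own.

Your code is also full of redundancy: you only need to import one version of BeautifulSoup (the first import is shadowed by the second), and you do not need to re-parse the elements two more times. You also import two different URL-loading libraries and then ignore both by loading a local file.

Instead, search for the table rows that contain the title information, then parse out the information contained in each row.

For this page, which has a more complex (frankly, disorganised) layout with multiple tables, the easiest approach is to go from each title link up to its table row: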

from bs4 import BeautifulSoup
import requests

resp = requests.get('http://dl.acm.org/results.cfm', 
                    params={'CFID': '376026650', 'CFTOKEN': '88529867'})
soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)

for title_link in soup.find_all('a', target='_self'):
    # find the parent row to base the rest of the search on
    row = next(p for p in title_link.parents if p.name == 'tr')
    title = title_link.get_text()
    authors = row.find('div', class_='authors').get_text()
    journal = row.find('div', class_='addinfo').get_text()
    abstract = row.find('div', class_='abstract2').get_text()

The next() call loops over a generator expression that walks through each parent of the title link until a <tr> element is found.
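To see the next() idiom in isolation, here is a minimal sketch that does not need BeautifulSoup at all. The Node class is a hypothetical stand-in for a parsed element (it is not part of any library); it only mimics a tag name and a .parents generator that yields ancestors nearest first.

```python
class Node(object):
    """Hypothetical stand-in for a parsed element: a tag name plus a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def parents(self):
        # Yield each ancestor in turn, nearest first.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


# Build a tiny tree: table > tr > td > a
table = Node('table')
row = Node('tr', parent=table)
cell = Node('td', parent=row)
link = Node('a', parent=cell)

# next() pulls items from the generator expression and stops at the first match.
found = next(p for p in link.parents if p.name == 'tr')

# With a default argument, next() returns that default instead of raising
# StopIteration when no ancestor matches.
missing = next((p for p in link.parents if p.name == 'form'), None)
```

Passing a default is worth considering on a messy page, since a title link that sits outside any row would otherwise abort the loop with StopIteration.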

You now have all the information grouped by title.
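As a rough end-to-end sketch of the same idea, the snippet below collects one record per title and prints each record as a unit. It runs against a small inline sample rather than the live page; SAMPLE_HTML, text_of, and extract_records are hypothetical names, and the sample markup only mimics the class names used above, so the real page may need adjustments. It uses find_parent('tr') as a shorthand for the next() expression and tolerates rows where a div is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used in the answer.
SAMPLE_HTML = """
<table>
  <tr><td>
    <a href="#1" target="_self">First paper</a>
    <div class="authors">A. Author, B. Author</div>
    <div class="addinfo">Journal One</div>
    <div class="abstract2">Abstract of the first paper.</div>
  </td></tr>
  <tr><td>
    <a href="#2" target="_self">Second paper</a>
    <div class="authors">C. Author</div>
    <div class="addinfo">Journal Two</div>
  </td></tr>
</table>
"""


def text_of(row, css_class):
    """Return the stripped text of the first matching div in row, or '' if absent."""
    div = row.find('div', class_=css_class)
    return div.get_text(strip=True) if div is not None else ''


def extract_records(html):
    """Return one dict per title link, built from that link's enclosing table row."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for title_link in soup.find_all('a', target='_self'):
        row = title_link.find_parent('tr')  # nearest enclosing <tr>, or None
        if row is None:
            continue
        records.append({
            'title': title_link.get_text(strip=True),
            'authors': text_of(row, 'authors'),
            'journal': text_of(row, 'addinfo'),
            'abstract': text_of(row, 'abstract2'),
        })
    return records


records = extract_records(SAMPLE_HTML)
for record in records:
    print('Title: {title}\nAuthors: {authors}\nJournal: {journal}\n'
          'Abstract: {abstract}\n'.format(**record))
```

Collecting dictionaries first keeps the parsing separate from the output, so the same records could be written to a file instead of printed.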

score 0 · Accepted Answer

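You need to find the first addinfo div and then look further down the document for the publisher in a div. You will need to go up the tree to the enclosing tr, then get the next sibling for the following tr. Then search within that tr for the next data item (the publisher).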

Once you have done this for all the items you need to display, issue a single print command for all the items found.
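The navigation described here might look roughly like the following sketch. It assumes a layout in which the publisher sits in the row after the addinfo row; SAMPLE_HTML and the 'publisher' class name are hypothetical and chosen only for illustration, so the real page will likely use different markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each data item lives in its own consecutive row.
SAMPLE_HTML = """
<table>
  <tr><td><a href="#1" target="_self">First paper</a></td></tr>
  <tr><td><div class="addinfo">Journal One</div></td></tr>
  <tr><td><div class="publisher">Publisher One</div></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Find the first addinfo div, then go up the tree to its enclosing row.
addinfo = soup.find('div', class_='addinfo')
row = addinfo.find_parent('tr')

# Step sideways to the following row; filtering by tag name skips the
# whitespace text nodes that sit between rows.
next_row = row.find_next_sibling('tr')

# Search only inside that row for the next data item.
publisher_div = next_row.find('div', class_='publisher')

journal = addinfo.get_text(strip=True)
publisher = publisher_div.get_text(strip=True)

# Issue a single print once every item has been found.
line = 'Journal: {0} | Publisher: {1}'.format(journal, publisher)
print(line)
```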
