
I have a question about parsing HTML with BeautifulSoup. The site I want to parse is: http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html?page=1&pageSize=40

First, I needed to write a function that returns all the h3 tags and all the p tags. I did it like this:

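    from bs4 import BeautifulSoup
    import urllib2

    # Fetch the events page with a plain GET request
    website = urllib2.urlopen("http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html")

    def parseUsingSoup2(content):
        # Parse the HTML, then collect every <h3> and <p> tag
        soup = BeautifulSoup(content, "html.parser")
        list1 = soup.findAll('h3')
        list2 = soup.findAll('p')
        return list1 + list2

    parseUsingSoup2(website)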

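The next part of the assignment asks for a list of events, each a 4-tuple (even though the site currently lists only one event): time slot, title, type, and description.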

I don't really know how to start. My first attempt was this:

    def GeneratingListofEvents(content):
        event={}
        list=['time', 'title', 'feature', 'description']
        for item in list: 

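However, I don't know whether this is heading in the right direction, and I haven't managed to retrieve the time from the HTML document without typing it in by hand. Thanks in advance.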


1 Answer


Note that all the information you need is inside `<div class="agendaright">`:

    from bs4 import BeautifulSoup
    import urllib2

    # Fetch the events page with a plain GET request
    html = urllib2.urlopen("http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html")
    soup = BeautifulSoup(html, "html.parser")

    # The container that holds the event's details
    agenda = soup.find('div', class_="agendaright")
    time = agenda.find('span', class_="event-time").text
    # u'18:00 - 20:00'
    title = agenda.h3.text
    # u'Images Without Borders Violence, Visuality, and Landscape in Postwar Ambon, Indonesia'
    feature = agenda.find('span', class_="feature").text
    # u' | Lecture'
    description = agenda.find('p', class_="event-description").text
    # u'This lecture explores the thematization of the visual and expansion of\nits terrain exemplified by the gigantic hijacked billboards with Jesus\nfaces and the painted murals with Christian themes which arose during\nthe ...'

    l = [time, title, feature, description]

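
The snippet above handles a single event. If the page ever lists several, the same lookups can be run once per `agendaright` block to build the list of 4-tuples the question asks for. What follows is only a minimal sketch: the sample markup, the helper `text_or_empty`, and the function name `generate_list_of_events` are all made up for illustration, and it assumes every event sits in its own `<div class="agendaright">`, which may not match the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the class names used above;
# the real page's structure may differ.
SAMPLE_HTML = """
<div class="agendaright">
  <span class="event-time">18:00 - 20:00</span>
  <h3>First lecture</h3>
  <span class="feature"> | Lecture</span>
  <p class="event-description">About the first lecture.</p>
</div>
<div class="agendaright">
  <span class="event-time">12:00 - 13:00</span>
  <h3>Second event</h3>
  <p class="event-description">No feature span here.</p>
</div>
"""


def text_or_empty(tag):
    """Return the stripped text of a tag, or '' if the tag was not found."""
    return tag.get_text().strip() if tag is not None else ""


def generate_list_of_events(html):
    """Return one (time, title, feature, description) tuple per event block."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for block in soup.find_all("div", class_="agendaright"):
        time = text_or_empty(block.find("span", class_="event-time"))
        title = text_or_empty(block.find("h3"))
        # Drop the leading " | " separator seen in the feature text
        feature = text_or_empty(block.find("span", class_="feature")).lstrip("| ").strip()
        description = text_or_empty(block.find("p", class_="event-description"))
        events.append((time, title, feature, description))
    return events


events = generate_list_of_events(SAMPLE_HTML)
for event in events:
    print(event)
```

Because `find` returns `None` when an element is missing, the helper falls back to an empty string instead of raising `AttributeError`. To try it on the live page, pass the downloaded HTML to `generate_list_of_events` in place of `SAMPLE_HTML`.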
Answered 2013-03-31T11:40:34.120