python - Basic Beautiful Soup help - Glastonbury lineup

Question

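I'm new to Python. What I'm trying to do is use Python and Beautiful Soup to extract all the bands announced for this year's Glastonbury Festival. I want to dump all the bands into a text file and eventually create a Spotify playlist based on each artist's top tracks.

The list of artists I want to extract is at www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml# (I actually want the A to Z tab rather than the Friday tab).

I tried printing the bands to the terminal first, but I get blank results. This is what I tried: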

from bs4 import BeautifulSoup
import urllib2

#efestivals page with all glastonbury acts
url = "http://www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml#"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

bands = soup.findAll('a')
for eachband in bands:
   print eachband.string

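Basically, I need help getting into the A to Z tab and extracting each band. I also only want the confirmed bands (those that have img src="/img2009/lineup_confirmed.gif"). I'm not very familiar with HTML, but this seemed like a reasonable place to start.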

score 1 · Accepted Answer

There are many ways to approach this. This is just one example that seems to work:

from bs4 import BeautifulSoup
import urllib2 as ul

url = "http://www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml#"
page = ul.urlopen(url)
soup = BeautifulSoup(page.read())

elements = soup.findAll('img', {'src': '/img2009/lineup_confirmed.gif'})

bands = [e.next_element.next_element.text for e in elements]

print bands[1:11]

Output:

[u'Arctic Monkeys', u'Dizzee Rascal', u'The Vaccines', u'Kenny Rogers']

score 1 · Accepted Answer

To extract the links for the confirmed bands from the A to Z table:

#!/usr/bin/env python
import re

try:
    from urllib2 import urlopen
except ImportError: # Python 3
    from urllib.request import urlopen

from bs4 import BeautifulSoup, NavigableString

def table_after_atoz(tag):
    '''Whether tag is a <table> after an element with id="LUA to Z".'''
    if tag.name == 'table' and 'TableLineupBox' in tag.get('class', ''):
        for tag in tag.previous_elements: # go back
            if not isinstance(tag, NavigableString): # skip strings
                return tag.get('id') == "LUA to Z"

def confirmed_band_links(soup):
    table = soup.find(table_after_atoz) # find A to Z table
    for tr in table.find_all('tr'): # find all rows (including nested tables)
        if tr.find('img', alt="confirmed"): # row with a confirmed band?
            yield tr.find('a', href=re.compile(r'^/festivals/bands')) # a link

def main():
    url = "http://www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml"
    soup = BeautifulSoup(urlopen(url))
    for link in confirmed_band_links(soup):
        print("%s\t%s" % (link['href'], link.string))

main()

score 0 · Accepted Answer

The following works:

from bs4 import BeautifulSoup
import urllib2

#efestivals page with all glasto acts
url = "http://www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml#"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

bands = soup.findAll('a', href=True)
for band in bands:
    if band['href'].startswith("/festivals/bands"):
        print band.string
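
The question also asks for the bands to be dumped into a text file, which the snippets here only print. A minimal sketch of that last step, written for Python 3 and assuming the names have already been collected into a list: the helper `save_bands`, the file name `glastonbury_bands.txt`, and the sample data are all hypothetical and not part of the original answers. It skips `None` entries because Beautiful Soup's `.string` returns `None` for a tag that has more than one child.

```python
import os
import tempfile


def save_bands(bands, path):
    """Write one band name per line, skipping blanks and duplicates.

    Returns the list of names actually written, in their original order.
    """
    seen = set()
    written = []
    with open(path, "w", encoding="utf-8") as handle:
        for name in bands:
            # .string is None for tags with several children, so skip those.
            if name is None:
                continue
            name = name.strip()
            if not name or name in seen:
                continue
            seen.add(name)
            handle.write(name + "\n")
            written.append(name)
    return written


# Hypothetical sample data standing in for the scraped names.
sample = ["Arctic Monkeys", None, " Dizzee Rascal ", "Arctic Monkeys", ""]
out_path = os.path.join(tempfile.mkdtemp(), "glastonbury_bands.txt")
saved = save_bands(sample, out_path)
print(saved)
```

In the loop above, the same idea would mean appending `band.string` to a list instead of printing it, then passing that list to the helper once the loop finishes.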
