I'm using Python 2.7 with Beautiful Soup 3.2, and I have the following scraper to fetch stream URLs:

# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup

# URL to scrape and open it with the urllib2
url = 'http://www.wiziwig.tv/broadcast.php?matchid=219751&part=sports'
source = urllib2.urlopen(url)

# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)

for tr in soup.findAll('tr', {'class': ['broadcast']}):
    stationName = tr.findAll('td')[1].text

    for trBelow in tr.findAllNext('tr'):
        curClass = trBelow['class']
        if curClass == 'broadcast':
            break

        kindStream = trBelow.findAll('td')[0].text
        streamUrl = trBelow.find('a', {'class': 'broadcast go'})['href']
        streamQuality = trBelow.findAll('td')[2].text
        streamRating = trBelow.find('div', {'class': 'rating'})['rel']

        print stationName, kindStream, streamQuality, streamRating, streamUrl

This works perfectly and produces the following output:

BWIN Flash 650 Kbps 100 http://forum.wiziwig.eu/threads/1847-BWIN-Info
BWIN Flash 675 Kbps 100 https://sports.bwin.com/en/sports?wm=3448325&zoneId=1068792
Bet365 Flash 650 Kbps 100 http://forum.wiziwig.eu/threads/6258-Bet365
Bet365 Flash 675 Kbps 100 http://www.bet365.com/?affiliate=365_014110
TRK Ukraine+ AceStream 1250 Kbps 100 acestream://94879770520f2e9db2146d0eca59204bfbd72cbe
TRK Ukraine+ AceStream 1251 Kbps 75 http://aviatortv.org/football_ua_plus/
Arenavision1 Sopcast 2000 Kbps 75 sop://broker.sopcast.com:3912/143876
Arenavision3 AceStream 2000 Kbps 75 acestream://a53a380706846bfc6667e21a1485dedb78b9674b
Arenavision3 AceStream 2001 Kbps 75 http://avod.me/play/a53a380706846bfc6667e21a1485dedb78b9674b
Dazsports Ace2 AceStream 850 Kbps 100 acestream://d293c82146aa6c2904e45ff305ae0f38dc5b329d
Dazsports Ace2 AceStream 851 Kbps 75 http://dazsports.org/ace2.html
Digi Sport1 [RO] Sopcast 1500 Kbps 100 sop://broker.sopcast.com:3912/146141
Digi Sport1 [RO] Sopcast 1500 Kbps 100 sop://broker.sopcast.com:3912/124992
Digi Sport1 [RO] Sopcast 1501 Kbps 100 sop://broker.sopcast.com:3912/139777
Digi Sport1 [RO] Sopcast 1501 Kbps 100 sop://broker.sopcast.com:3912/110152
Pole Position1 [NL] AceStream 1000 Kbps 100 acestream://86fd521d30e9319198b75121761eccf260fef0cb
Pole Position1 [NL] AceStream 1001 Kbps 75 http://polepositionweb.org/?page_id=6 popup
Solodeportes Veetle Veetle 850 Kbps 100 http://veetle.com/index.php/widget/index/E47CFF6CB6A770852515B8B30C2E30F6/0/true/default/false
Livesports4u4 Flash 225 Kbps 75 http://livesport4u.com/stream4.html
Cricfree Flash2 Flash 175 Kbps 75 http://cricfree.tv/live-golf-streaming-ch2.php
Njtvx9 Flash 175 Kbps 75 http://nutjob.eu/njtvx9.html
Igoal C+ Liga Flash 175 Kbps 75 http://ana1.me/liga+.html
Soccertoall2 [PT] Flash 175 Kbps 75 http://soccertoall.net/index.php?channel=2
Tugalive1 Flash 175 Kbps 75 http://www.tugalive.eu/p/live-1.html
Diresport1 Flash 175 Kbps 75 http://diresportt.blogspot.com.es/
Footstream11 Flash 175 Kbps 75 http://www.footstream.tv/channel11.html
Lag10 (8) Flash 150 Kbps 50 http://lag10.com/channel8
ANA STV2 Flash 400 Kbps 75 http://ana1.me/STV2.html
ANA STV2 Flash 400 Kbps 75 http://bliner.tv/sporttv2pt.html
Livesoccerhd4 Flash 225 Kbps 75 http://livesoccerhd.tv/l4.html
Stvstreams Ace HD1 AceStream 1500 Kbps 100 acestream://750acfc788e12220dbd57188505eae08f566281e
Stvstreams Ace HD1 AceStream 1500 Kbps 100 http://stvstreams.com/acestreams/stv-hd/
Btsportshd12 Flash 200 Kbps 75 http://www.btsportshd.com/stream12.php
Ana Stream1 Flash 175 Kbps 75 http://ana3.me/STREAM1.html
Onlinesoccer2all (13) Flash 175 Kbps 75 http://online--soccer.eu/channel13.html
Hdfoots6 Flash 175 Kbps 75 http://hdfoots.com/stream6.html

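But I'm wondering whether this is the right way to do it, or whether there is a better approach that avoids the inner loop `for trBelow in tr.findAllNext('tr'):` and having to break out of it when it reaches a specific class?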

2 Answers

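I think your implementation is already great. Just one question: what if I want to reuse some of the content I retrieved? As far as I can tell, Beautiful Soup has no built-in cache for this, so re-running the loop would traverse the nodes all over again.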

Here's my take:

# Find the station rows once so the results can be reused later
tr_elements = soup.findAll('tr', {'class': 'broadcast'})
# For each station row, every <tr> that follows it in the document
tr_belows = [tr.findAllNext('tr') for tr in tr_elements]
collection = {}
collection['station_names'] = [tr.findAll('td')[1].text for tr in tr_elements]
collection['kind_streams'] = [[trb.findAll('td')[0].text for trb in below]
                              for below in tr_belows]
# and so forth
print collection

This still needs some work, because it cannot detect the 'broadcast' rows among the other rows. The complexity of my approach could also be improved.
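The reuse idea can be sketched without touching the soup more than once: walk the rows a single time, store plain records, and run every later query against that list. This is only a minimal sketch under assumptions. `ROWS`, `build_records` and the `'streamrow'` class name are hypothetical stand-ins for the values that would be read from each `<tr>`; only the `'broadcast'` class and the cell positions come from the question.

```python
# Hypothetical stand-in for the parsed table: one (css_class, cells) tuple
# per <tr>, in document order. With Beautiful Soup these values would come
# from tr['class'] and the text of tr.findAll('td').
ROWS = [
    ('broadcast', ['', 'BWIN']),
    ('streamrow', ['Flash', '', '650 Kbps']),
    ('streamrow', ['Flash', '', '675 Kbps']),
    ('broadcast', ['', 'Arenavision1']),
    ('streamrow', ['Sopcast', '', '2000 Kbps']),
]


def build_records(rows):
    """Walk the rows once and return one dict per stream row."""
    records = []
    station = None
    for css_class, cells in rows:
        if css_class == 'broadcast':
            # Station header row: remember the name for the rows below it
            station = cells[1]
        elif station is not None:
            # Stream row: attach it to the most recent station
            records.append({
                'station': station,
                'kind': cells[0],
                'quality': cells[2],
            })
    return records


# Built once; later queries read this list instead of re-walking the tree
records = build_records(ROWS)
flash_only = [r for r in records if r['kind'] == 'Flash']
stations = sorted(set(r['station'] for r in records))
```

Because the records are ordinary dicts, filtering or regrouping them afterwards costs nothing in terms of HTML traversal.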

answered 2013-09-25T00:22:13.867

I would probably just iterate over the <tr> items:

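station_name = ''
for tr in soup.findAll('tr'):
    if tr['class'] == 'broadcast':
        # Station row: remember its name for the stream rows that follow
        station_name = tr.findAll('td')[1].text
    else:
        # Your current extraction code
        kindStream = tr.findAll('td')[0].text
        streamUrl = tr.find('a', {'class': 'broadcast go'})['href']
        streamQuality = tr.findAll('td')[2].text
        streamRating = tr.find('div', {'class': 'rating'})['rel']
        print station_name, kindStream, streamQuality, streamRating, streamUrl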

I guess the code is a bit clearer this way.

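On the other hand... it looks like you have a quick script that works. It is far more likely to break because the page's actual HTML output changes than because of a bug in the code. So if it works, it works, I'd say.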
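If breakage from page changes is the main risk, one option is to make the per-row extraction skip rows it does not recognise instead of raising. The sketch below assumes the cell layout from the question (stream kind in cell 0, quality in cell 2). The function `extract_stream` and its arguments are hypothetical stand-ins for the values pulled out of each row. Beautiful Soup's `find` returns `None` when nothing matches, which is the case guarded against here.

```python
def extract_stream(cells, link, rating):
    """Return a dict for one stream row, or None if the row looks different.

    cells  -- list of cell texts (stand-in for the <td> texts of a row)
    link   -- href of the stream link, or None if no link was found
    rating -- rating value, or None if the rating element was missing
    """
    if len(cells) < 3 or link is None:
        # The layout is not what the scraper expects: skip, do not raise
        return None
    return {
        'kind': cells[0],
        'quality': cells[2],
        'rating': rating,
        'url': link,
    }


# A row shaped like the first line of the sample output
good = extract_stream(['Flash', '', '650 Kbps'],
                      'http://forum.wiziwig.eu/threads/1847-BWIN-Info',
                      '100')
# A row with too few cells and no link is skipped
bad = extract_stream(['Flash'], None, None)
```

Whether this is worth the extra code for a quick script is a judgment call; it mainly turns a crash into a missing row.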

answered 2013-09-24T21:56:14.110