I am trying to use lxml to parse the search results returned by this search URL:
http://www.rte.ie/player/ie/search/?q=news
The article tags returned in the HTML look like this:
<article class="search-result clearfix"><a
href="/player/ie/show/10117771/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/0005d4bf-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117771/">elev8</a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117771/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">Ivan and Sean talk to future basketball sensation Julian Newman and the <span class="search-highlight">News</span> Dudes are in the loft with some crazy <span class="search-highlight">news</span> stories.</p>
<span
class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10118015/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000716b2-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10118015/">One <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10118015/">Wed 06 Mar 2013</a></p>
<!-- p class="search-programme-date">06/03/2013</p -->
<p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117836/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/00071614-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117836/"><span class="search-highlight">News</span> on Two and World Forecast</a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117836/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">All the <span class="search-highlight">news</span> and sport from home and abroad.</p>
<span
class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117816/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000715f2-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117816/">Nine <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117816/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">The Nine <span class="search-highlight">News</span> followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117789/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000715ae-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117789/">Six One <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117789/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">The Six One <span class="search-highlight">News</span> and Sport followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117784/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000715a0-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117784/">Nuacht and <span class="search-highlight">News</span> with Signing</a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117784/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">Nuacht and <span class="search-highlight">News</span> with Signing.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117770/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/0007158d-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117770/"><span class="search-highlight">News</span>2Day</a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117770/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">Domestic and international <span class="search-highlight">news</span> items of interest to younger viewers.</p>
<span
class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117728/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/0007154e-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117728/">One <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117728/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
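I added the following code to try to parse the returned results, but the values I get back are inconsistent. The part I am interested in is the repeated article tags. The problem is that wherever the search text is found in a result, the site wraps it in a span tag with class="search-highlight", and this throws off my parsing.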
import urllib2
from lxml import etree

url = "http://www.rte.ie/player/ie/search/?q=news"
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = str(response.read())
response.close()

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.fromstring(html, parser)
for elem in tree.xpath('//article[@class="search-result clearfix"]'):
    # Children are addressed by position within each <article>.
    icon_url = str(elem[0][1].attrib.get('src'))
    print 'icon_url ', icon_url
    # Title link text; comes back truncated when a highlight span is present.
    name_tmp = str(elem[1][0].text)
    print 'name_tmp ', name_tmp
    stream = str(elem[1][0].attrib.get('href'))
    print 'stream ', stream
    date_tmp = str(elem[2][0].text)
    print 'date_tmp ', date_tmp
    # Description text; also truncated when a highlight span is present.
    short_tmp = elem[4].text
    print 'short_tmp ', short_tmp
    channel = elem[5].text
    print 'channel ', channel
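To illustrate the symptom in isolation, here is a minimal sketch using two of the title fragments from the HTML above. It relies on how lxml stores mixed content: an element's `.text` holds only the text before its first child element, while `itertext()` walks all descendant text in document order. The variable names are made up for the example, and the snippet avoids `print` so that it should run on both Python 2 and Python 3.

```python
from lxml import etree

# Two title fragments copied from the search results above.
title_a = '<h3 class="search-programme-title"><a href="/player/ie/show/10118015/">One <span class="search-highlight">News</span></a></h3>'
title_b = '<h3 class="search-programme-title"><a href="/player/ie/show/10117836/"><span class="search-highlight">News</span> on Two and World Forecast</a></h3>'

parser = etree.HTMLParser()
link_a = etree.fromstring(title_a, parser).xpath('//h3/a')[0]
link_b = etree.fromstring(title_b, parser).xpath('//h3/a')[0]

# .text stops at the first child element (the highlight span).
text_a = link_a.text                    # 'One '
text_b = link_b.text                    # None, because the span comes first

# itertext() yields every text chunk, including text inside and after the span.
full_a = ''.join(link_a.itertext())     # 'One News'
full_b = ''.join(link_b.itertext())     # 'News on Two and World Forecast'
```

This would explain why `str(elem[1][0].text)` gives `'One '` for some results and the string `'None'` for others.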
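The problem fields are name_tmp and short_tmp: because of the search-highlight span tags, the full text gets cut off. Can anyone think of a way to get the full text, or to ignore the span tags?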
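One possible approach, sketched here under some assumptions, is to select each field by its class name instead of by child position, and to read the text with the XPath `string()` function, which concatenates all descendant text and so includes the words inside the highlight spans. The function name `parse_results`, the dictionary keys, and the trimmed `sample` string are hypothetical and only serve the illustration; the class names are taken from the HTML above. The `contains(@class, ...)` test is a loose substring match and might need tightening on a real page.

```python
from lxml import etree


def parse_results(html):
    """Return one dict per search-result <article>, selecting fields by class."""
    tree = etree.fromstring(html, etree.HTMLParser())
    results = []
    for article in tree.xpath('//article[contains(@class, "search-result")]'):
        # Attribute queries return lists; fall back to None when nothing matches.
        src = article.xpath('.//img[@class="thumbnail"]/@src')
        href = article.xpath('.//h3[@class="search-programme-title"]/a/@href')
        results.append({
            'icon_url': src[0] if src else None,
            'stream': href[0] if href else None,
            # string() joins all descendant text (highlight spans included);
            # normalize-space() trims and collapses whitespace.
            'name': article.xpath(
                'normalize-space(string(.//h3[@class="search-programme-title"]))'),
            'date': article.xpath(
                'normalize-space(string(.//p[@class="search-programme-episodes"]))'),
            'description': article.xpath(
                'normalize-space(string(.//p[@class="search-programme-description"]))'),
            'channel': article.xpath(
                'normalize-space(string(.//span[contains(@class, "search-channel-icon")]))'),
        })
    return results


# A single article from the results above, trimmed for the example.
sample = '''
<article class="search-result clearfix"><a
href="/player/ie/show/10118015/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000716b2-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10118015/">One <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10118015/">Wed 06 Mar 2013</a></p>
<p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTE 1</span>
</article>
'''
results = parse_results(sample)
```

Selecting by class rather than by index also means the commented-out date paragraph and any future reordering of the children should not shift the fields. If lxml's `lxml.html` module is used instead of `etree`, `element.text_content()` is another way to get the same concatenated text.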
Sorry for the long post...