I have a very simple task: print the text inside every anchor on the page http://subscribe.ru/catalog?rss. Here is my code:
# encoding: utf-8
from lxml import etree
import urllib2
from lxml.html import document_fromstring

data = urllib2.urlopen('http://subscribe.ru/catalog?rss')
S = data.read()
oHTML = document_fromstring(S)
loLinks = oHTML.xpath("//a")
for oLink in loLinks:
    print etree.tostring(oLink)
    sLink = oLink.xpath('string()')[0]
The output is:
C:\Development\Python27\python.exe "D:/Topic Modeling/Playground/delme3.py"
Traceback (most recent call last):
File "D:/Topic Modeling/Playground/delme3.py", line 15, in <module>
<a onclick="rgNav('js_tab_auth');return false;" href="">÷ÈÏÄ ÎÁ ÓÁÊÔ</a>
sLink = oLink.xpath('string()')[0]
<a onclick="rgNav('js_tab_reg');return false;" href="">òÅÇÉÓÔÒÁÃÉÑ </a>
IndexError: string index out of range
<a class="forgot_pass" href="/member/totalrecall">úÁÂÙÌÉ ÐÁÒÏÌØ?</a>
<a class="button_blue_2" id="js_loginFormBut" href="#">÷ÏÊÔÉ</a>
<a class="font_gray link_txd" href="/faq/vereinbarung.html">ÕÓÌÏ×ÉÑ ÐÏÌØÚÏ×ÁÎÉÑ ÓÅÒ×ÉÓÏÍ Subscribe.ru</a>
<a class="button_blue_2" id="js_regFormBut" href="#">îÁÞÁÔØ ÒÅÇÉÓÔÒÁÃÉÀ</a>
<a class="rg_btn_soc rg_bs_01 js_tap_panel_selector" action="auth_email" href="#"><span><i/>Email</span></a>
<a class="rg_btn_soc rg_bs_01 js_tap_panel_selector" action="auth_openid" href="#"><span><i/>OpenID</span></a>
<a class="rg_btn_soc rg_bs_02 js_tap_panel_selector" action="auth_vkontakte" href="#"><span><i/>÷ËÏÎÔÁËÔÅ</span></a>
<a class="rg_btn_soc rg_bs_02 js_tap_panel_selector" action="auth_mailru" href="#"><span><i/>Mail.Ru</span></a>
{#/if}
{#if $P.login_register_tab == 2}
<a class="rg_btn_soc rg_bs_01 js_tap_panel_selector" action="reg_email" href="#"><span><i/>Email</span></a>
<a class="rg_btn_soc rg_bs_01 js_tap_panel_selector" action="reg_openid" href="#"><span><i/>OpenID</span></a>
<a class="rg_btn_soc rg_bs_02 js_tap_panel_selector" action="reg_vkontakte" href="#"><span><i/>÷ËÏÎÔÁËÔÅ</span></a>
<a class="rg_btn_soc rg_bs_02 js_tap_panel_selector" action="reg_mailru" href="#"><span><i/>Mail.Ru</span></a>
{#/if}
<a href="" onclick="return false;">òÅÇÉÓÔÒÁÃÉÑ</a>
<a href="" onclick="ajax_recall_code();return false">÷ÙÓÌÁÔØ ÅÝÅ ÒÁÚ</a>
<a href="#" class="button_blue_2" id="js_confirmFormBut">çÏÔÏ×Ï</a>
<a class="green" href="http://subs.link.subscribe.ru/422433"><strong>òÅÚÕÌØÔÁÔÙ ÏÎÌÁÊÎ ÏÐÒÏÓÁ: "óÐÁÍ ÉÌÉ ÎÅ ÓÐÁÍ? ÷ÏÔ × ÞÅÍ ×ÏÐÒÏÓ!"</strong></a>
<a title="Subscribe.Ru" href="/" class="logo"><dfn class="logokanal"/></a>
Process finished with exit code 1
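So the links are extracted, but for some reason the link text cannot be. The output also hints at an encoding problem: only the human-readable text comes out garbled, while the markup is intact. How can I solve this?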
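A side note on the garbled text: the output above looks like what appears when KOI8-R bytes are displayed as Latin-1. That is an assumption inferred from the output, not something the page has been confirmed to declare. A minimal sketch (written for Python 3; the helper name `repair` is hypothetical) that reverses the misreading for two of the strings printed above:

```python
def repair(garbled, real_encoding="koi8_r"):
    """Undo a Latin-1 misreading of bytes that were really in real_encoding.

    real_encoding="koi8_r" is an assumption based on how the output looks.
    """
    # Recover the original bytes, then decode them with the assumed encoding.
    return garbled.encode("latin-1").decode(real_encoding)


# Strings copied from the console output above.
print(repair("÷ÈÏÄ ÎÁ ÓÁÊÔ"))
print(repair("÷ÏÊÔÉ"))
```

If the assumption holds, both lines come out as readable Russian, which suggests the bytes themselves are fine and only the console's interpretation of them is off.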
Trying to convert the data to utf-8 did not help either:
# encoding: utf-8
from lxml import etree
import urllib2
import chardet
from lxml import html

data = urllib2.urlopen('http://subscribe.ru/catalog?rss')
S = data.read()
encoding = chardet.detect(S)['encoding']
print encoding
if encoding != 'utf-8':
    S = S.decode(encoding, 'replace').encode('utf-8')
oHTML = html.fromstring(S)
loLinks = oHTML.xpath("//a")
for oLink in loLinks:
    print etree.tostring(oLink)
    sLink = oLink.xpath('string()')[0]
It fails with the same error.
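One thing worth noting: XPath `string()` yields a single string rather than a list, so `[0]` picks its first character, and that raises `IndexError` when a link has no text at all (as with the logo link, which only contains an empty `<dfn>`). That would explain why re-encoding changes nothing. A minimal sketch of the effect, written for Python 3 and using the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra packages; the sample markup and the helper names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup modelled on two links from the output above.
SAMPLE = (
    '<div>'
    '<a href="#"><span><i/>Email</span></a>'
    '<a title="Subscribe.Ru" href="/" class="logo"><dfn class="logokanal"/></a>'
    '</div>'
)


def link_texts(markup):
    """Return the concatenated text of every <a> element ('' if it has none)."""
    root = ET.fromstring(markup)
    # "".join(itertext()) stands in for XPath string() in this sketch.
    return ["".join(a.itertext()) for a in root.iter("a")]


def first_char_fails(text):
    """Return True if text[0] raises IndexError, i.e. if text is empty."""
    try:
        text[0]
    except IndexError:
        return True
    return False


texts = link_texts(SAMPLE)
print(texts)
print([first_char_fails(t) for t in texts])
```

With lxml itself, dropping the `[0]` (that is, `sLink = oLink.xpath('string()')`) or calling `oLink.text_content()` should give the full link text, including an empty string for links that have none.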
Thanks in advance for your help!