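When I fetch HTML data with Mechanize, I store it in a variable, which we'll call "HTML_RESPONSE". I then parse it and extract three things: the title, a short description, and a long description.
The problem I'm facing is that the short or long description may contain characters such as &, £, and $.
The trouble starts when I try to put these values into XML and save it, because Python chokes when I try to decode them.
For example, here is the short description from one page: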
S_DESC = ("Senior VP of Treasury and Corporate Finance & ERM, "
          "RTL Group, has been invited to the above conference to present a Case Study "
          "on Integrating Strategy and Risk into Enterprise Risk Management")
This is how I decode it:
#!/usr/bin/python
# -*- coding: ISO-8859-1 -*-
print S_DESC.decode('UTF-8').encode('ascii','xmlcharrefreplace')
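This works for the ampersand. But if I then get an S_DESC that contains a pound sign, my script breaks with the following output:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'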
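One possible explanation, offered as an assumption rather than a confirmed diagnosis: if S_DESC is already a unicode object instead of a UTF-8 byte string (HTML parsers often return unicode), then in Python 2 calling .decode() on it first encodes it to ASCII implicitly, which would explain an encode error coming out of a decode call. The sketch below performs that ASCII step explicitly and shows a guard that decodes only byte strings. safe_decode is a hypothetical helper, not part of any library, and the code is written to run on both Python 2 and Python 3.

```python
def safe_decode(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through untouched."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value


pound_text = u"\xa316 Billion"            # already text (unicode)
pound_bytes = pound_text.encode("utf-8")  # the same value as raw UTF-8 bytes

# Python 2 does the equivalent of this ASCII encode before it decodes
# a unicode object, and this is the step that cannot represent the pound sign.
try:
    pound_text.encode("ascii")
except UnicodeEncodeError as exc:
    print("implicit ASCII step fails: %s" % exc)

# With the guard, both inputs can be converted the same way.
for value in (pound_bytes, pound_text):
    print(safe_decode(value).encode("ascii", "xmlcharrefreplace"))
```

Both inputs end up as the ASCII bytes `&#163;16 Billion`, so the pound sign survives as a numeric character reference.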
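Below is the part of my script where this fails (the exception above is raised on the last line every time I get a pound sign). I would like to know whether there is a generic way to tell Python to handle these characters by itself. Writing 100 functions, one for each possible incompatible character, is not an option. Likewise, I am not prepared to sift through the entire site (2k+ articles) to identify every special character that "might" break my code.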
XML = """
<MAIN>
<ITEM>
<Author>{0}</Author>
<Author_UN>{1}</Author_UN>
<Date_Modified>{2}</Date_Modified>
<Date_Published>{3}</Date_Published>
<Default_Group_Rights>
{4}
</Default_Group_Rights>
<attachment>
<file_name>{5}</file_name>
<file_extension>{6}</file_extension>
<file_stored_local>{7}</file_stored_local>
</attachment>
<title>{8}</title>
<sm_desc>{9}</sm_desc>
<lg_desc>
<![CDATA[
{10}
]]>
</lg_desc>
</ITEM>
</MAIN>""".format(author_soup, username, date_modified, published_date, xrights, attachment_text, file_extension, localstore, item_title.decode('UTF-8').encode('ascii','xmlcharrefreplace'), short_description.decode('UTF-8').encode('ascii','xmlcharrefreplace'), long_description.decode('UTF-8').encode('ascii','xmlcharrefreplace'))
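One generic way to avoid per-character handling is to let an XML library do the escaping and the encoding. The sketch below uses the standard library's xml.etree.ElementTree on a cut-down version of the template above. build_item and the three fields it covers are illustrative choices, and the inputs are assumed to be text (unicode) already. ElementTree does not emit CDATA sections, so the long description is escaped as ordinary text here, which is a change from the original layout.

```python
import xml.etree.ElementTree as ET


def build_item(title, short_desc, long_desc):
    """Build a reduced <MAIN><ITEM> document (illustrative, not the full template)."""
    main = ET.Element("MAIN")
    item = ET.SubElement(main, "ITEM")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "sm_desc").text = short_desc
    ET.SubElement(item, "lg_desc").text = long_desc
    # Serialising as us-ascii turns every non-ASCII character into a
    # numeric character reference; &, < and > are escaped automatically.
    return ET.tostring(main, encoding="us-ascii")


xml_bytes = build_item(u"Finance & ERM",
                       u"\xa316 Billion spent",
                       u"<p>fullstory</p>")
print(xml_bytes.decode("ascii"))

# Parsing the result back recovers the original characters.
parsed = ET.fromstring(xml_bytes)
print(parsed.find("ITEM/sm_desc").text == u"\xa316 Billion spent")
```

Because the serialiser handles every character the same way, no list of "dangerous" characters has to be maintained.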
[Edit]
Here is sample code I created that reproduces the error exactly, in case anyone is interested:
#TESTING GROUND
# -*- coding: UTF-8 -*-
author_soup = "John Smith"
username = "jsmith"
date_modified = "25 December 2012, 15:42 PM"
published_date = "25 December 2012, 15:42 PM"
xrights = "r-w-x-x"
attachment_text = "Random Attachment"
file_extension = "txt"
localstore = "../Local"
item_title = "The NEw Financial Reforms of 2012"
short_description = " £16 Billion Spent on new reforms backfire."
long_description = '[<p>fullstory</p>, <p><a class="external-link" href="http://business.timesonline.co.uk/tol/business/industry_sectors/banking_and_finance/article4526065.ece">http://business.timesonline.co.uk/tol/business/industry_sectors/banking_and_finance/article4526065.ece</a></p>]'
XML = """
<MAIN>
<ITEM>
<Author>{0}</Author>
<Author_UN>{1}</Author_UN>
<Date_Modified>{2}</Date_Modified>
<Date_Published>{3}</Date_Published>
<Default_Group_Rights>
{4}
</Default_Group_Rights>
<attachment>
<file_name>{5}</file_name>
<file_extension>{6}</file_extension>
<file_stored_local>{7}</file_stored_local>
</attachment>
<title>{8}</title>
<sm_desc>{9}</sm_desc>
<lg_desc>
<![CDATA[
{10}
]]>
</lg_desc>
</ITEM>
</MAIN>""".format(author_soup, username, date_modified, published_date, xrights, attachment_text, file_extension, localstore, item_title.decode('UTF-8'), short_description.decode('UTF-8'), long_description.decode('UTF-8'))
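A note on why this sample may fail, with a sketch of one way around it. In Python 2, a byte-string template's format() tries to convert unicode arguments to ASCII, which raises the same UnicodeEncodeError for a pound sign. The sketch below keeps everything as text until the final step: it escapes markup characters with xml.sax.saxutils.escape, formats into a unicode template, and encodes once at the end. TEMPLATE and render_item are hypothetical names, the template is reduced to two fields, and the arguments are assumed to be text already.

```python
from xml.sax.saxutils import escape

# A unicode template, so format() never has to squeeze text into ASCII.
TEMPLATE = u"""<ITEM>
<title>{0}</title>
<sm_desc>{1}</sm_desc>
</ITEM>"""


def render_item(title, short_desc):
    """Format text values into the template, then encode exactly once."""
    # escape() replaces &, < and > with XML entities.
    xml_text = TEMPLATE.format(escape(title), escape(short_desc))
    # Remaining non-ASCII characters become numeric character references.
    return xml_text.encode("ascii", "xmlcharrefreplace")


print(render_item(u"Finance & ERM", u"\xa316 Billion").decode("ascii"))
```

Values placed inside a CDATA section, as the long description is in the template above, would not go through escape(), since CDATA content is taken literally.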