python - I want to save my parsed HTML file to a TXT file

Question

I have parsed a web page that displays an article. I want to save the parsed data to a text file, but my Python shell shows the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 107: ordinal not in range(128)

Here is part of my code:

search_result = urllib.urlopen(url)
f = search_result.read()
#xml parsing
parsedResult = xml.dom.minidom.parseString(f)
linklist = parsedResult.getElementsByTagName('link') #extracting links
extractedURL = linklist[3].firstChild.nodeValue #pick one link
page = urllib.urlopen(extractedURL).read()
#making html file
g= open('yyyy.html', 'w') 
g.write(page)
g.close()
#reading html file and parsing html to get pure text of article
g= open('yyyy.html', 'r')
bs = BeautifulSoup(g,fromEncoding="utf-8")
g.close()
article = bs.find(id="articleBody")
content = article.get_text()
#save as a text file
h= open('yyyy.txt', 'w')
h.write(content)
h.close()

What should I add to make this work?

score 1 · Accepted Answer


Try:

import codecs
h = codecs.open('yyyy.txt', 'w', 'utf-8')

Or use Python 3, where strings are Unicode by default and open() accepts an encoding argument.
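To illustrate why an explicit encoding helps, here is a minimal sketch. It assumes the article text is a Unicode string containing a character such as u'\u2019', as in the error message. The helper name save_text, the sample string, and the temporary output path are made up for illustration and are not part of the original code. It uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def save_text(path, text):
    """Write a Unicode string to `path` as UTF-8 (hypothetical helper)."""
    # An explicit encoding means the text never goes through the
    # default ASCII codec, which is what raised the error in the question.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Stand-in for the text returned by article.get_text() in the question.
content = u"It\u2019s a sample article."

# Encoding to ASCII fails for the right single quotation mark (U+2019).
try:
    content.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Writing with UTF-8 succeeds, and the text reads back unchanged.
out_path = os.path.join(tempfile.mkdtemp(), "yyyy.txt")
save_text(out_path, content)

with io.open(out_path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()
```

In the question's script, the equivalent change would be to open 'yyyy.txt' with an explicit UTF-8 encoding before calling h.write(content).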

Answered 2013-05-08T16:44:30.547

score 0 · Accepted Answer


Try using unidecode, which transliterates non-ASCII characters to their closest ASCII equivalents:

from unidecode import unidecode

unidecode(page)

Answered 2013-05-08T16:21:33.933
