python - Handling Indian languages in BeautifulSoup

Question

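I'm trying to scrape news headlines from the NDTV website. This is the page I'm using as the HTML source. I'm using BeautifulSoup (bs4) to parse the HTML, and everything works fine except that my code breaks when it encounters a Hindi headline on the linked page.

My code so far is: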

import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll( 'li')
for link_tag in li:
   hypref = link_tag.find('a').contents[0]
   strhyp = str(hypref)
   fptr.write(strhyp)
   fptr.write("\n")

The error I get is:

Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 30, in <module>
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

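I get the same error even if I omit the from_encoding parameter. I originally passed it as fromEncoding, but Python warned me that this spelling is deprecated.

How do I fix this? From what I've read, I need to either skip the Hindi headlines or explicitly encode them as non-ASCII text, but I don't know how to do that. Any help would be greatly appreciated!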

score 3 · Accepted Answer

What you are seeing is a NavigableString instance (which derives from Python's unicode type):

(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
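
Because hypref is a unicode subclass, calling str() on it in Python 2 implicitly encodes it with the ASCII codec, which is where the UnicodeEncodeError comes from. The following is a minimal sketch of that behaviour, not the asker's actual data: the sample string (the word "Hindi" in Devanagari) and the names title, ascii_ok and utf8_bytes are made up for illustration, and the ASCII encoding is done explicitly so the snippet behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample headline: "Hindi" in Devanagari, written with
# escapes so this source file itself stays pure ASCII.
title = u"\u0939\u093f\u0902\u0926\u0940"

try:
    # Explicit equivalent of what Python 2's str(title) does implicitly.
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # Devanagari code points are outside the ASCII range (0-127).
    ascii_ok = False

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_bytes = title.encode("utf-8")

print(ascii_ok)         # False
print(len(utf8_bytes))  # 15 (five code points, three bytes each)
```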

You need to encode it to UTF-8 explicitly:

hypref.encode('utf-8')
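
As an alternative to encoding each string by hand, the output file can be opened with an explicit encoding so that unicode text is encoded on write. This is only a sketch under assumptions: write_titles, titles and out_path are hypothetical names, and the list of strings stands in for the NavigableString values taken from the page. io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def write_titles(titles, path):
    """Write each text title on its own line, encoded as UTF-8.

    Hypothetical helper: `titles` is any iterable of unicode strings.
    """
    with io.open(path, "w", encoding="utf-8") as fptr:
        for title in titles:
            # The file object does the encoding, so no str() call is needed.
            fptr.write(title + u"\n")


# Stand-in data: one ASCII title and one Hindi title (escaped Devanagari).
titles = [u"NDTV", u"\u0939\u093f\u0902\u0926\u0940"]
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
write_titles(titles, out_path)
```

In the question's loop this would amount to opening the file with io.open and writing hypref directly, without the str() call.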

score 1 · Accepted Answer

strhyp = hypref.encode('utf-8')

http://joelonsoftware.com/articles/Unicode.html

Answered 2013-01-19T09:28:31.730
