python - Python: Searching for Unicode string in HTML with index/find returns wrong position

Question

I am trying to parse the number of results from the HTML returned by a search query. However, when I use find()/index(), it seems to return the wrong position. The string I am searching for contains an accented character, so I search for it as a Unicode string.

A snippet of the HTML code being parsed:

<div id="WPaging_total">
  Aproximádamente 37 resultados.
</div>

and I search for it like this:

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)#len('Aproxim\xe1damente ')==16
print html[str_start+16:str_end] #works by changing 16 to 24

The print statement returns:

damente 37

When the expected result is:

37

It seems str_start isn't pointing at the beginning of the string I am searching for, but 8 positions before it.

print html[str_start:str_start+5]

Outputs:

l">

The problem is hard to replicate though because it doesn't happen when using the code snippet, only when searching inside the entire HTML string. I could simply change str_start+16 to str_start+24 to get it working as intended, however that doesn't help me understand the problem. Is it a Unicode issue? Hopefully someone can shed some light on the issue.

Thank you.

LINK: http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1

SAMPLE CODE:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read()

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]

score 3 · Accepted Answer

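Your problem ultimately boils down to the fact that in Python 2.x, the str type represents a sequence of bytes, while the unicode type represents a sequence of characters. Because one character can be encoded as several bytes, the unicode representation of a string may have a different length than the str representation of the same string. In the same way, an index into the unicode representation may point to a different part of the text than the same index into the str representation.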
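A minimal sketch of that length difference. It is written for Python 3 (an assumption; the question uses Python 2), where bytes plays the role of Python 2's str and str plays the role of unicode. The sample text comes from the snippet in the question.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
text = 'Aproxim\xe1damente 37 resultados.'  # sequence of characters
raw = text.encode('utf-8')                   # sequence of bytes

print(len(text))  # number of characters
print(len(raw))   # number of bytes: one more, since the accented a takes 2 bytes in UTF-8

# The same index points at different places in the two representations.
print(text[10:])                 # everything after the first 10 characters
print(raw[10:].decode('utf-8'))  # everything after the first 10 bytes
```

The two slices differ by one letter because the single accented character before index 10 occupies two bytes.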

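What happens is that when you do str_start = html.index(u'Aproxim\xe1damente '), Python automatically decodes the html variable, assuming it is encoded in utf-8. (Well, actually, on my PC I simply get a UnicodeDecodeError when I try to execute that line. Some of our system settings related to text encoding must be different.) Consequently, if str_start is n, that means u'Aproxim\xe1damente ' appears at the nth character of html. However, when you later use it as a slice index to get the content after the (n+16)th character, what you actually get is the content after the (n+16)th byte. In this case the two are not equivalent, because earlier parts of the page contain accented characters that take up 2 bytes each when encoded in utf-8.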
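The mismatch can be reproduced offline. The sketch below assumes Python 3 and uses a made-up page string (not the real site's HTML) with two accented characters ahead of the marker. It shows how a character position, applied to the byte string, lands too early.

```python
# Python 3 sketch; `page` is a made-up stand-in for the real HTML.
page = ('Gu\xedas Amarillas de M\xe9xico <div id="WPaging_total">'
        'Aproxim\xe1damente 37 resultados.</div>')
raw = page.encode('utf-8')
needle = 'Aproxim\xe1damente '

char_pos = page.index(needle)                 # position counted in characters
byte_pos = raw.index(needle.encode('utf-8'))  # position counted in bytes

# Each accented character before the marker adds one extra byte in UTF-8.
print(byte_pos - char_pos)

# Consistent: character index used on the character string.
right = page[char_pos + len(needle):page.find(' resultados', char_pos)]

# Mixed: character index used on the byte string, so the slice starts too early.
wrong = raw[char_pos + len(needle):raw.find(b' resultados', char_pos)]

print(right)
print(wrong.decode('utf-8'))
```

The mixed slice picks up trailing letters of the marker, which is the same symptom as the damente 37 output in the question.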

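The best solution is simply to convert html to unicode as soon as you receive it. This small modification to your sample code will do what you want, with no errors or weird behaviour: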

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read().decode('utf-8')

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]
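The decode-first approach can also be wrapped in a small function and checked without a network request. This is only a sketch for Python 3: extract_result_count is a hypothetical helper name, the sample bytes are made up, and UTF-8 is assumed as the page encoding.

```python
def extract_result_count(raw_html, encoding='utf-8'):
    """Hypothetical helper: return the text between the marker and ' resultados'."""
    text = raw_html.decode(encoding)  # decode once, then work only in characters
    marker = 'Aproxim\xe1damente '
    start = text.index(marker) + len(marker)  # len() replaces the hard-coded 16
    end = text.index(' resultados', start)
    return text[start:end]


# Made-up sample with an accented character ahead of the marker.
sample = ('B\xfasqueda: <div id="WPaging_total">\n'
          '  Aproxim\xe1damente 37 resultados.\n</div>').encode('utf-8')
print(extract_result_count(sample))
```

Using len(marker) instead of a literal offset keeps the slice correct if the marker text ever changes.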

score 0 · Accepted Answer

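It's a bit unclear what you're trying to do, but if I'm guessing correctly, you want to get the approximate number of results from the HTML. In that case you might be better off using the re module with a regular expression.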

import re
# s is assumed to be the page's HTML, already decoded to unicode
re.search(ur'(?<=Aproxim\xe1damente )\d+', s).group(0)

# returns:
#   u'37'
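For readers on Python 3 (an assumption; the answer targets Python 2, and the ur'' prefix is not valid syntax in Python 3), a roughly equivalent sketch follows. The sample text is taken from the snippet in the question, and the None fallback is an addition for the case where the marker is missing.

```python
import re

text = '<div id="WPaging_total">\n  Aproxim\xe1damente 37 resultados.\n</div>'

# Digits that directly follow the marker; the lookbehind is fixed-width,
# which the re module requires. The \xe1 escape is interpreted by re itself.
match = re.search(r'(?<=Aproxim\xe1damente )\d+', text)
count = match.group(0) if match else None
print(count)
```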

Ultimately, your best bet is really a package like lxml or BeautifulSoup, but without more context I can't give you more specific help.
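As an illustration of the parser-based route, here is a sketch that uses the standard library's html.parser instead of lxml or BeautifulSoup, so it runs without extra packages. It assumes Python 3, TotalDivParser is a hypothetical class name, and the handling is deliberately simple: it does not cope with div elements nested inside the target.

```python
from html.parser import HTMLParser


class TotalDivParser(HTMLParser):
    """Hypothetical helper: collect the text inside <div id="WPaging_total">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.text = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('id') == 'WPaging_total':
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


html = '<p>Otros</p><div id="WPaging_total">\n  Aproxim\xe1damente 37 resultados.\n</div>'
parser = TotalDivParser()
parser.feed(html)
parser.close()

# The count is the second whitespace-separated token of the div's text.
print(parser.text.split()[1])
```

Selecting by the element's id avoids searching for the accented marker at all, so the result no longer depends on byte or character offsets.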
