
I am trying to parse the number of results from the HTML returned by a search query, but when I use find()/index() it seems to return the wrong position. The string I am searching for contains an accented character, so I search for it in Unicode form.

A snippet of the HTML code being parsed:

<div id="WPaging_total">
  Aproximádamente 37 resultados.
</div>

and I search for it like this:

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)#len('Aproxim\xe1damente ')==16
print html[str_start+16:str_end] #works by changing 16 to 24

The print statement returns:

damente 37

When the expected result is:

37

It seems str_start doesn't point at the beginning of the string I am searching for, but 8 positions before it.

print html[str_start:str_start+5]

Outputs:

l">

The problem is hard to replicate, though, because it doesn't happen when searching the HTML snippet above, only when searching the entire HTML string. I could simply change str_start+16 to str_start+24 to get it working as intended, but that doesn't help me understand the problem. Is it a Unicode issue? Hopefully someone can shed some light on this.

Thank you.

LINK: http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1

SAMPLE CODE:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read()

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]

2 Answers

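Your problem ultimately boils down to the fact that in Python 2.x the str type represents a sequence of bytes, while the unicode type represents a sequence of characters. Because one character can be encoded as several bytes, the length of the unicode representation of a string may differ from the length of the str representation of the same string. In the same way, an index into the unicode representation may point to a different part of the text than the same index into the str representation.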
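To make the difference concrete, here is a minimal sketch (not part of the original answer) that compares the character count and the UTF-8 byte count of the search string. It is written to run on Python 3, where the text type is `str` and the byte type is `bytes`. In Python 2 the same two roles are played by `unicode` and `str`.

```python
# Character count versus UTF-8 byte count for the same text.
marker = u'Aproxim\xe1damente '      # text: a sequence of characters
encoded = marker.encode('utf-8')     # bytes: the same text encoded as UTF-8

char_len = len(marker)    # counts characters
byte_len = len(encoded)   # counts bytes; the accented character takes two bytes

print(char_len, byte_len)
```

The two lengths differ by one because the accented character is the only one that needs more than a single byte in UTF-8.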

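What's happening is that when you do str_start = html.index(u'Aproxim\xe1damente '), Python implicitly decodes the html variable to unicode, apparently as UTF-8 on your system. (On my PC I simply get a UnicodeDecodeError when I try to execute that line, because Python 2 uses ASCII for implicit conversions by default. Some of our system settings relating to text encoding must be different.) Consequently, if str_start is n, then u'Aproxim\xe1damente ' appears at the nth character of html. However, when you later use it as a slice index to get the content after the (n+16)th character, what you actually get is the content after the (n+16)th byte. In this case the two are not equivalent, because earlier parts of the page contain accented characters that take up 2 bytes each when encoded in UTF-8.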
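The mismatch can be reproduced on a small scale. The sketch below is my own illustration, not the real page: `page_text` is a made-up fragment with two accented characters ahead of the search string, and the variable names are hypothetical. It targets Python 3.

```python
# A made-up page fragment (an assumption for illustration, not the real page):
# two accented characters appear before the text we are looking for.
page_text = u'<h1>Gu\xedas de M\xe9xico</h1><div>Aproxim\xe1damente 37 resultados.</div>'
page_bytes = page_text.encode('utf-8')

marker = u'Aproxim\xe1damente '

char_index = page_text.index(marker)                    # position counted in characters
byte_index = page_bytes.index(marker.encode('utf-8'))   # position counted in bytes

# Each earlier accented character adds one extra byte, so the byte index is larger.
offset = byte_index - char_index

# Mixing the two: a character index applied to the byte string lands too early.
wrong = page_bytes[char_index + len(marker):].decode('utf-8', 'replace')
right = page_text[char_index + len(marker):]

print(offset)
print(wrong)
print(right)
```

The slice taken from the byte string starts inside the search string itself, which is the same symptom as the `damente 37` output in the question, only with a smaller offset.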

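The best solution is simply to convert html to unicode as soon as you receive it. This small modification to your sample code does what you want, with no errors or weird behaviour: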

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read().decode('utf-8')

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end] 
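As a small variation on the code above (my own sketch, not part of the original answer), the hard-coded 16 can be replaced by `len()` of the search string, so the offset stays correct if the search text ever changes. Here `sample_html` is an assumed stand-in for the decoded page, and the sketch targets Python 3.

```python
# Stand-in for conn.read().decode('utf-8'); the real page is much longer.
sample_html = u'<div id="WPaging_total">\n  Aproxim\xe1damente 37 resultados.\n</div>'

marker = u'Aproxim\xe1damente '
start = sample_html.index(marker) + len(marker)   # first character after the marker
end = sample_html.find(u' resultados', start)     # where the trailing text begins
count = int(sample_html[start:end])

print(count)
```

Because both the search and the slice operate on decoded text, the indices always refer to characters and the arithmetic stays consistent.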
answered 2012-12-01T21:40:19.963

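It's not entirely clear what you're trying to do, but if I'm guessing right and you're trying to get the approximate number of results from an HTML file, you would probably be better off using regular expressions via the re module.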

import re
# s is the page content as a unicode string (e.g. the decoded html)
re.search(ur'(?<=Aproxim\xe1damente )\d+', s).group(0)

# returns:
#   u'37'

Ultimately, your best bet is really a package like lxml or BeautifulSoup, but without more context I can't give you more specific help.
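For a rough idea of what the parser-based approach looks like, here is a sketch that uses the standard library's `html.parser` as a stand-in for lxml or BeautifulSoup, so that it runs without installing anything. The `TotalExtractor` class and `sample_html` are hypothetical names made up for this illustration, the code targets Python 3, and it assumes the target div contains no nested div elements.

```python
import re
from html.parser import HTMLParser  # Python 3; the module is named HTMLParser in Python 2


class TotalExtractor(HTMLParser):
    """Collects the text inside the element whose id is 'WPaging_total'."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if dict(attrs).get('id') == 'WPaging_total':
            self.inside = True

    def handle_endtag(self, tag):
        # Simplification: the first closing div ends the capture.
        if tag == 'div':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


# Stand-in for the decoded page content.
sample_html = u'<div id="WPaging_total">\n  Aproxim\xe1damente 37 resultados.\n</div>'

parser = TotalExtractor()
parser.feed(sample_html)
parser.close()

text = u''.join(parser.chunks).strip()
match = re.search(r'\d+', text)
count = int(match.group(0)) if match else None

print(count)
```

Locating the element by its id avoids depending on the exact wording around the number, at the cost of a little more code than the regular expression.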

answered 2012-12-01T20:45:03.720