I am trying to parse the number of results out of the HTML returned by a search query. However, when I use find()/index(), the position it returns seems to be wrong. The string I am searching for contains an accented character, so I search for it as a Unicode literal.
A snippet of the HTML code being parsed:
<div id="WPaging_total">
Aproximádamente 37 resultados.
</div>
I search for it like this:
str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)  # len(u'Aproxim\xe1damente ') == 16
print html[str_start+16:str_end]  # prints the expected value if 16 is changed to 24
The print statement outputs:
damente 37
whereas the expected output is:
37
It seems str_start does not point at the beginning of the string I am searching for, but 8 positions before it.
print html[str_start:str_start+5]
Outputs:
l">
The problem is hard to reproduce, though, because it does not happen with the HTML snippet above, only when searching inside the entire HTML string. I could simply change str_start+16 to str_start+24 to get the intended result, but that doesn't help me understand the problem. Is it a Unicode issue? Hopefully someone can shed some light on this.
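To make the question concrete, here is a minimal sketch of one way an 8-position shift like this could arise. It is not a confirmed diagnosis of the real page. It assumes the page is UTF-8 encoded, that eight two-byte characters appear somewhere before the match, and that the search ends up counting characters while the slice counts bytes. The fragment in `text` is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical page fragment. The eight accented characters on the first
# line are an assumption, chosen so that the two offsets differ by 8.
text = (u'\xe1\xe9\xed\xf3\xfa\xf1\xfc\xe7\n'
        u'<div id="WPaging_total">\n'
        u'Aproxim\xe1damente 37 resultados.\n'
        u'</div>')
needle = u'Aproxim\xe1damente '

raw = text.encode('utf-8')  # the bytes a UTF-8 server would send

char_pos = text.index(needle)                 # position counted in characters
byte_pos = raw.index(needle.encode('utf-8'))  # position counted in bytes

# Each accented character is one character but two bytes in UTF-8,
# so the byte position runs ahead of the character position.
shift = byte_pos - char_pos
print(shift)  # -> 8

# Slicing the bytes with the character position starts too early and
# lands in the middle of the marker instead of right after it.
wrong = raw[char_pos + 16:raw.find(b' resultados', char_pos + 16)]
print(wrong.endswith(b'damente 37'))  # -> True
```

If this is what is happening, it would also explain why the short snippet behaves correctly: it has no multi-byte characters before the match, so both ways of counting agree.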
Thank you.
LINK: http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1
SAMPLE CODE:
from urllib2 import Request, urlopen
url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}
req = Request(url, post, headers)
conn = urlopen(req)
html = conn.read()
str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]
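
For comparison, a minimal sketch of the same extraction done entirely on decoded text, so that the search and the slice count in the same units. The helper name `extract_between` and the `sample` bytes are hypothetical, and the 'utf-8' default is an assumption; the real charset should be taken from the response rather than guessed.

```python
# -*- coding: utf-8 -*-
def extract_between(raw_html, start_marker, end_marker, encoding='utf-8'):
    """Return the text between two markers, indexing only the decoded text.

    `encoding` defaults to UTF-8, which is an assumption about the page.
    """
    text = raw_html.decode(encoding)  # bytes -> unicode, done once up front
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    return text[start:end]


# Made-up stand-in for conn.read(), with non-ASCII characters before the match.
sample = (u'\xe1\xe9\xed\xf3 <div id="WPaging_total">\n'
          u'Aproxim\xe1damente 37 resultados.\n</div>').encode('utf-8')

count = extract_between(sample, u'Aproxim\xe1damente ', u' resultados')
print(count)  # -> 37
```

Using len(start_marker) instead of a hard-coded 16 also avoids having to recount the marker by hand if it changes.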