
I'm trying to GET a webpage, parse part of it, and then POST that part as a value. The problem: when the page contains a character such as ó, I retrieve Ã³ instead, so when I post it, urlencode converts those characters to something completely different, which doesn't work.

More precisely, Ã³ is what you get when an ó encoded in UTF-8 is interpreted as if it were ISO-8859-1, or at least that's what my browser does: if I set it to view the page as UTF-8 I see ó, if I set it to ISO-8859-1 I see Ã³, and other encodings produce different symbols.
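To make the byte-level picture concrete, here is a minimal sketch of that misreading, assuming the mbstring extension is available. The variable names are only illustrative. It takes the two UTF-8 bytes of ó, treats them as ISO-8859-1, and shows what urlencode produces in each case.

```php
<?php
// "ó" encoded as UTF-8 is the two bytes 0xC3 0xB3.
$utf8 = "\xC3\xB3";

// Misreading those bytes as ISO-8859-1 and re-encoding them as UTF-8
// turns each byte into its own character: "Ã" (0xC3) and "³" (0xB3).
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";      // c3b3
echo bin2hex($misread), "\n";   // c383c2b3
echo urlencode($utf8), "\n";    // %C3%B3
echo urlencode($misread), "\n"; // %C3%83%C2%B3
```

The last two lines are why the POST fails: the server expects %C3%B3 but receives the four-byte sequence for Ã³.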

I tried converting the page contents, and also that specific string, to UTF-8. I also tried setting the headers to accept only UTF-8, but that doesn't work either. I'm quite certain the encoding is the problem, but I'm running out of ideas. I changed the configuration in php.ini, but maybe I have not restarted yet. Basically this is like shooting in the dark, and some help would be greatly appreciated.

If this helps: The specific code is here: https://github.com/trylks/golem/blob/master/php/copperGolem.php

The relevant method is "form"; the problem shows up when it obtains one of the parameter values from a page previously fetched with GET.

Thank you.

PS, solved: I've been working on this for the last few hours, so I can't tell whether some of the many other things I changed are also necessary. In any case, the last change, the one that made it work, was changing line 60 to this: $dom->loadHTML(mb_convert_encoding($p, 'html-entities', mb_detect_encoding($p))); That did it. The problem is not libcurl but DomDocument, as explained here: PHP DomDocument failing to handle utf-8 characters


1 Answer


The problem is in DomDocument: it doesn't handle UTF-8 input properly. Converting to HTML entities is the safest option, and it works like magic when outputting these characters back with echo (even from the CLI) or when URL-encoding them. Basically, DomDocument doesn't seem to accept UTF-8 as input, yet it outputs UTF-8. So an odd-looking conversion has to be made first; DomDocument then undoes it, and everything is back to normal.

To do this, assuming $dom is a DomDocument, it is enough to replace every call to $dom->loadHTML($p) with:

$dom->loadHTML(mb_convert_encoding($p, 'html-entities', mb_detect_encoding($p)));

This is explained better in this other question: PHP DomDocument failing to handle utf-8 characters
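As a self-contained illustration of the round trip, here is a sketch under a few assumptions. The sample string in $p is made up. It also swaps in mb_encode_numericentity for the 'html-entities' conversion, because the 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding is deprecated as of PHP 8.2. The idea is the same: hand DomDocument pure ASCII with entities and let it decode them back to UTF-8.

```php
<?php
// Hypothetical sample input: a UTF-8 fragment with no charset declaration.
$p = "<p>canci\xC3\xB3n</p>";

// Rewrite every non-ASCII code point as a numeric entity (ó becomes &#243;),
// leaving plain ASCII bytes for DOMDocument to parse.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$safe = mb_encode_numericentity($p, $map, 'UTF-8');

libxml_use_internal_errors(true); // keep parser warnings out of the output

$dom = new DOMDocument();
$dom->loadHTML($safe);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo $safe, "\n";            // <p>canci&#243;n</p>
echo bin2hex($text), "\n";   // 63616e6369c3b36e
echo urlencode($text), "\n"; // canci%C3%B3n
```

The value read back from the DOM is UTF-8 again, so urlencode yields %C3%B3 for ó, which is what the POST needs.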

answered 2013-03-30T02:30:46.587