The page I am requesting with cURL declares its charset as Shift_JIS and its language as jp.

function jp_new($jp_text)
{
    // Begin Curl
    $session = curl_init();
    //$url1 = "http://nihongo.j-talk.com/index.php";
    $url1 = "http://www.romaji.org/index.php";
    $parameters = '&text='.urlencode($jp_text).'&save=convert+text+to+Romaji';
    $header = array(
        "Accept-Language: jp",
        "Accept-Charset: Shift_JIS");
    // $header[] = "Accept-Language: ja";
    //$parameters = 'kanji='.urlencode($jp_text).'&converter=spaced&Submit=Translate+Now';
    curl_setopt($session, CURLOPT_HTTPHEADER, $header);
    curl_setopt($session, CURLOPT_POSTFIELDS, $parameters);
    curl_setopt($session, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($session, CURLOPT_POST, true);
    curl_setopt($session, CURLOPT_URL, $url1);
    $jp_page = curl_exec($session);
    curl_close($session);

    //$pattern = "/romaji'>(.+?)</s";
    $pattern = "/color=\"red\">(.+?)</s";
    preg_match_all($pattern, $jp_page, $result_ro);
    return $result_ro[1];
}

I get a result, but it is garbled and not the same result I get when I submit the form on romaji.org manually. The result I get when $jp_text = "犬猫" is "kou (kigou)(kigou) shin i".

I'm fairly sure the preg_match_all finds only one match and finds it in the right place, so this looks like some sort of encoding problem, but I don't really know.

A similar cURL request worked for "http://nihongo.j-talk.com/index.php" (the commented-out variables), but it seems they have banned me, so I need to adapt it to work with the new URL on romaji.org.

UPDATE: The charset of the romaji.org page is Shift_JIS and my page is UTF-8, so I tried adding the CURLOPT_HTTPHEADER headers to the request, as the code example now shows. The output changed only slightly: one of the words in brackets was removed, and the result is still garbled.

2 Answers


If you get different results from the browser and from the script, your aim is to send exactly the same data the browser sends. What usually differs are the headers, including cookies.

Part 1: trapping the headers

  1. You can use the network monitoring tools of various browsers (press F12 in Chrome / IE, or install Firebug and find the network monitor). Start logging and check the requests sent. Specifically, look at the headers.
  2. You may find this easier with "Fiddler".
  3. Check for any headers, and any cookies sent with the request. If there is a cookie, you may need to go backwards and work out how the cookie got there, but for now, you can hard-code it.
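Once you have the browser's headers, it helps to see what your script really sends. The sketch below is only an illustration under my own assumptions: `dump_request_headers()`, `header_names()` and `missing_headers()` are hypothetical helper names, not part of cURL. It relies on `CURLINFO_HEADER_OUT`, which makes `curl_getinfo()` return the request cURL sent.

```php
<?php
// Hypothetical helper: performs the request and returns the raw request
// (request line plus headers) that cURL actually sent.
function dump_request_headers(string $url, array $headers, string $postFields): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postFields);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLINFO_HEADER_OUT, true); // record the outgoing request
    curl_exec($ch);
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    curl_close($ch);
    return is_string($sent) ? $sent : '';
}

// Hypothetical helper: extracts lower-cased header names from "Name: value" lines.
// Lines without a colon (such as the request line) are skipped.
function header_names(array $lines): array
{
    $names = [];
    foreach ($lines as $line) {
        $pos = strpos($line, ':');
        if ($pos !== false && $pos > 0) {
            $names[] = strtolower(trim(substr($line, 0, $pos)));
        }
    }
    return $names;
}

// Hypothetical helper: header names the browser sent that the script did not.
function missing_headers(array $browser, array $script): array
{
    return array_values(array_diff(header_names($browser), header_names($script)));
}
```

Paste the header lines copied from the network monitor as the first argument of `missing_headers()` and `explode("\r\n", dump_request_headers(...))` as the second; whatever comes back is a candidate for the `CURLOPT_HTTPHEADER` array.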

Part 2: sending headers

In your case, it's probably the "Accept-Charset" and "Accept-Language" headers. You can specify these with:

$headers = array(
    "Accept-Language: en-us",
    "Accept-Charset: utf-8");
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

If you need other headers, add them to the array. If you need cookies, follow the cURL documentation or search for other examples.
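For cookies, a rough sketch, assuming you either hard-code a cookie string copied from the browser or let cURL keep its own cookie file. `cookie_options()` is a hypothetical helper of mine, and the cookie value and file path in the usage note are placeholders; `CURLOPT_COOKIE`, `CURLOPT_COOKIEFILE` and `CURLOPT_COOKIEJAR` are the real cURL options.

```php
<?php
// Hypothetical helper: builds the cURL options for sending cookies.
function cookie_options(?string $hardCoded, ?string $jarPath): array
{
    $opts = [];
    if ($hardCoded !== null) {
        // Sent as a literal "Cookie:" header, e.g. "name1=value1; name2=value2".
        $opts[CURLOPT_COOKIE] = $hardCoded;
    }
    if ($jarPath !== null) {
        $opts[CURLOPT_COOKIEFILE] = $jarPath; // cookies are read from this file
        $opts[CURLOPT_COOKIEJAR]  = $jarPath; // cookies are written here when the handle is closed
    }
    return $opts;
}
```

Apply the result with `curl_setopt_array($ch, cookie_options('name=value', null));` for a hard-coded cookie, or pass a writable file path as the second argument to let cookies persist between requests.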

NOTE: I notice that you also set CURLOPT_POSTFIELDS but don't actually specify that you're sending a POST. Add the following if you want to send the request as a POST:

curl_setopt($ch, CURLOPT_POST, true);
answered 2012-12-04T03:53:20.427

A simple conversion would do the trick:

$jp_text = iconv('UTF-8','Shift_JIS' , $jp_text);

Or

$jp_text = mb_convert_encoding($jp_text, "Shift_JIS", "UTF-8");
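To see why the conversion matters: the same two characters are different bytes in UTF-8 and in Shift_JIS, so `urlencode()` produces a different POST body for each. A small sketch, assuming the mbstring extension is loaded and the source file is saved as UTF-8; `build_post_body()` is a hypothetical helper, and the field names are the ones from the question.

```php
<?php
// Hypothetical helper: converts UTF-8 input to Shift_JIS and builds the
// URL-encoded POST body for the form fields used in the question.
function build_post_body(string $utf8Text): string
{
    $sjis = mb_convert_encoding($utf8Text, 'Shift_JIS', 'UTF-8');
    // http_build_query() percent-encodes each value and encodes spaces as "+".
    return http_build_query(
        ['text' => $sjis, 'save' => 'convert text to Romaji'],
        '',
        '&'
    );
}

$utf8 = "犬猫";
$sjis = mb_convert_encoding($utf8, 'Shift_JIS', 'UTF-8');

// Each kanji is 3 bytes in UTF-8 but 2 bytes in Shift_JIS.
echo strlen($utf8), " bytes as UTF-8, ", strlen($sjis), " bytes as Shift_JIS\n";
echo bin2hex($utf8), "\n";
echo bin2hex($sjis), "\n";
echo build_post_body($utf8), "\n";
```

If the server decodes the UTF-8 bytes as Shift_JIS, it sees different characters than the ones you typed, which would explain a garbled reading like the one in the question.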

If you run

header('Content-Type: text/html; charset=utf-8');
$str = "犬猫";
var_dump(jp_new($str));

Output

array (size=1)
  0 => string 'inuneko' (length=7)

Modified Function

header('Content-Type: text/html; charset=utf-8');
$str = "犬猫";
var_dump(jp_new($str));

function jp_new($jp_text) {
    $session = curl_init();
    $url1 = "http://www.romaji.org/index.php";

    // The target page expects Shift_JIS, so convert the UTF-8 input before encoding it.
    //$jp_text = iconv('UTF-8','Shift_JIS' , $jp_text);
    $jp_text = mb_convert_encoding($jp_text, "Shift_JIS", "UTF-8");
    $parameters = '&text=' . urlencode($jp_text) . '&save=convert+text+to+Romaji';

    // Headers copied from a real browser request.
    $header = array(
            "Accept-Language: en-US,en;q=0.8",
            "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3",
            "Referer: http://www.romaji.org/index.php",
            "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    curl_setopt($session, CURLOPT_HTTPHEADER, $header);
    curl_setopt($session, CURLOPT_POSTFIELDS, $parameters);
    curl_setopt($session, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($session, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
    curl_setopt($session, CURLOPT_POST, true);
    curl_setopt($session, CURLOPT_URL, $url1);
    $jp_page = curl_exec($session);
    curl_close($session);

    // Capture the text that follows color="red"> up to the next "<".
    // $pattern = "/romaji'>(.+?)</s";
    $pattern = "/color=\"red\">(.+?)</s";
    preg_match_all($pattern, $jp_page, $result_ro);
    return $result_ro[1];
}
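
If you want to test the parsing step without hitting the site, it can be pulled into its own function. This is only a sketch: `extract_romaji()` is a hypothetical helper, the sample markup in the usage line is made up to fit the pattern above, and converting the response from Shift_JIS to UTF-8 first is my assumption about the page's charset, based on what the question reports.

```php
<?php
// Hypothetical helper: converts the fetched page to UTF-8 and returns every
// fragment that follows color="red"> up to the next "<".
function extract_romaji(string $page, string $pageCharset = 'Shift_JIS'): array
{
    // Normalise to UTF-8 so any Japanese text left in the result is readable
    // on a UTF-8 page.
    $utf8 = mb_convert_encoding($page, 'UTF-8', $pageCharset);
    $matches = [];
    if (preg_match_all('/color="red">(.+?)</s', $utf8, $matches) === false) {
        return []; // regex engine error
    }
    return $matches[1];
}

// Made-up sample markup, for illustration only.
var_dump(extract_romaji('<font color="red">inuneko</font>'));
```

In `jp_new()` you would check that `curl_exec()` did not return `false` and then return `extract_romaji($jp_page)`.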
answered 2012-12-07T11:45:25.373