I'm not sure which part of my script is actually at fault, but I'm having trouble parsing tweet text that contains Unicode characters.

Example tweet:

Landsliðsmaður með viti. #rafhlaða #hræddur http://t.co/ci03F3vUNM

When I fetch it with twitteroauth and save it to a .txt file, the string ends up in the file as:

Landsli\u00f0sma\u00f0ur me\u00f0 viti. #rafhla\u00f0a #hr\u00e6ddur http:\/\/t.co\/ci03F3vUNM

I'm using a few simple preg_replace calls to turn URLs, mentions, and hashtags in the text into hyperlinks:

function twitterify($ret) {
  $ret = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $ret);
  $ret = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $ret);
  $ret = preg_replace("/@(\w+)/", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $ret);
  $ret = preg_replace("/#(\w+)/", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $ret);
  return $ret;
}

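But this fails as soon as it hits one of the Unicode characters:
#rafhlaða becomes <a href="#">#rafhla</a>ða
#hræddur becomes <a href="#">#hr</a>æddur
and so on.

What am I doing wrong here? Is the problem in how I save/open the text file with PHP, or in how I parse the Unicode-escaped string?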

1 Answer

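Look at the code below: I added the u modifier to the end of every regular expression, and it works. With the u modifier the pattern and subject are treated as UTF-8, so \w matches letters such as ð and æ instead of stopping at their first byte. Save the file as UTF-8. If you have a JSON-encoded string, you can decode it first, using this solution: Php/json: decode utf8?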

<?php
// Convert one \uXXXX escape (a single 16-bit code unit) to its UTF-8 bytes.
function ewchar_to_utf8($matches) {
    $ewchar = $matches[1];
    $binwchar = hexdec($ewchar);
    // Pack the code unit as two big-endian bytes.
    $wchar = chr(($binwchar >> 8) & 0xFF) . chr(($binwchar) & 0xFF);
    // "UTF-16BE" is the portable name for the big-endian encoding.
    return iconv("UTF-16BE", "UTF-8", $wchar);
}

// Replace every \uXXXX escape in $str with the character it stands for.
function special_unicode_to_utf8($str) {
    return preg_replace_callback("/\\\u([[:xdigit:]]{4})/i", "ewchar_to_utf8", $str);
}

$text = 'Landsli\u00f0sma\u00f0ur me\u00f0 viti. #rafhla\u00f0a #hr\u00e6ddur http:\/\/t.co\/ci03F3vUNM';
$text = special_unicode_to_utf8($text);

// Same patterns as in the question, each with the u (UTF-8) modifier added.
function twitterify($ret) {
  $ret = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#u", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $ret);
  $ret = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#u", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $ret);
  $ret = preg_replace("/@(\w+)/u", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $ret);
  $ret = preg_replace("/#(\w+)/u", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $ret);
  return $ret;
}

$text = twitterify($text);
print $text;

Prints:

Landsliðsmaður með viti. <a href="http://search.twitter.com/search?q=rafhlaða" target="_blank">#rafhlaða</a> <a href="http://search.twitter.com/search?q=hræddur" target="_blank">#hræddur</a> http:\/\/t.co\/ci03F3vUNM
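
Note that the URL in that output still contains the escaped slashes (\/), so it is not picked up by the link pattern. If the saved text really is a JSON string fragment, one possible alternative is to let json_decode() undo all the escapes at once. The following is only a sketch: decode_json_fragment is a hypothetical helper name, and wrapping the text in double quotes assumes it contains no unescaped double quotes of its own. The two preg_match calls at the end show the effect of the u modifier, assuming PHP's default "C" locale.

```php
<?php
// Hypothetical helper (not part of the original answer): treat the saved text
// as the body of a JSON string and let json_decode() resolve \uXXXX and \/.
function decode_json_fragment(string $escaped): string {
    $decoded = json_decode('"' . $escaped . '"');
    // json_decode() returns null on malformed input; fall back to the raw text.
    return is_string($decoded) ? $decoded : $escaped;
}

$raw = 'Landsli\u00f0sma\u00f0ur me\u00f0 viti. #rafhla\u00f0a #hr\u00e6ddur http:\/\/t.co\/ci03F3vUNM';
$text = decode_json_fragment($raw);

// Without the u modifier, \w works on single bytes and stops at the first
// byte of a multibyte UTF-8 character (assuming the default "C" locale).
preg_match('/#(\w+)/', $text, $bytes);

// With the u modifier, the subject is read as UTF-8 and \w matches ð as a letter.
preg_match('/#(\w+)/u', $text, $chars);

print $text . "\n";
print $bytes[1] . " vs " . $chars[1] . "\n";
```

Because the slashes are unescaped here, passing $text to twitterify() would also turn the t.co URL into a link.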

answered 2013-07-20T17:21:12.090