I am trying to parse the text of tweets in PHP 5.3, but I run into problems when the user mentions, hashtags, and links contain Unicode characters.

First, I fetch the tweets and store them in a txt file:

$tweets_file = createFile('cache/'.$twitteruser.'-tweets.txt', json_encode($tweets));

Afterwards, the text file contains a lot of Unicode escape sequences (for example, Landsli\u00f0sma\u00f0ur).
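The escapes in the cache file are expected: json_encode() writes non-ASCII characters as \uXXXX by default, and json_decode() turns them back into UTF-8. A minimal sketch of that round trip, assuming a hypothetical sample string rather than real tweet data:

```php
<?php
// "Landsliðsmaður" written with explicit UTF-8 bytes (ð = 0xC3 0xB0),
// so the example does not depend on the encoding of this file.
$original = "Landsli\xC3\xB0sma\xC3\xB0ur";

// json_encode() escapes non-ASCII characters as \uXXXX by default.
$encoded = json_encode(array('text' => $original));

// json_decode() converts the escapes back into UTF-8 bytes.
$decoded = json_decode($encoded);

echo $encoded, "\n";
echo $decoded->text === $original ? "round trip ok\n" : "round trip failed\n";
```

If this holds on your setup, the escapes in the file are not the problem by themselves; the decoded text is ordinary UTF-8.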

When I display all the tweets, I do this:

function twitterify($text) {
  $text = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#u", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $text);
  $text = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#u", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $text);
  $text = preg_replace("/@(\w+)/u", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $text);
  $text = preg_replace("/#(\w+)/u", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $text);
  return $text;
}

$tweets_file = file_get_contents('cache/'.$queried_user.'-tweets.txt');
$tweets = json_decode($tweets_file);
foreach($tweets as $tweet) {
  echo twitterify($tweet->text);
  // do other stuff...
}

This works fine until, for example, a hashtag contains a Unicode character. preg_replace stops at that character, so a hashtag like #rafhlaða renders to <a href="#">#rafhla</a>ða

What should I do to render text with Unicode characters correctly?

2 Answers

I cannot reproduce your error. I took the JSON data from the pastebin and reduced it to the simplest case:

[{"text":"#rafhla\u00f0a"}]

So the text is a single word: rafhlaða

Then I ran the following script:

<?php
function twitterify($ret) {
    $ret = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#u", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $ret);
    $ret = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#u", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $ret);
    $ret = preg_replace("/@(\w+)/u", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $ret);
    $ret = preg_replace("/#(\w+)/u", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $ret);
    return $ret;
}


$tweets_file = file_get_contents('file.txt');
$tweets = json_decode($tweets_file);
foreach($tweets as $tweet) {
    print $tweet->text;
    print "\n";
    echo twitterify($tweet->text);
    exit;
}

It prints:

#rafhlaða
<a href="http://search.twitter.com/search?q=rafhlaða" target="_blank">#rafhlaða</a>

This contradicts your statement:

#rafhlaða renders to <a href="#">#rafhla</a>ða
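When the same regex behaves differently on two machines, it can help to check which bytes actually reach preg_replace. A small diagnostic sketch, assuming a hypothetical helper named describeBytes (not part of the original script); it relies on preg_match() returning false for invalid UTF-8 when the /u modifier is set:

```php
<?php
// Hypothetical helper: report whether $text is valid UTF-8 and show its bytes.
function describeBytes($text) {
    // An empty pattern matches any valid string; with /u, invalid UTF-8
    // makes preg_match() return false instead of 1.
    $valid = (preg_match('//u', $text) === 1);
    return ($valid ? 'valid UTF-8: ' : 'NOT valid UTF-8: ') . bin2hex($text);
}

$utf8   = "#rafhla\xC3\xB0a"; // ð encoded as UTF-8 (two bytes)
$latin1 = "#rafhla\xF0a";     // ð encoded as ISO-8859-1 (one byte)

echo describeBytes($utf8), "\n";
echo describeBytes($latin1), "\n";
```

Calling describeBytes($tweet->text) just before twitterify() would show whether the text is still UTF-8 at that point.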

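Update: a version without the /u modifier that matches everything up to the next whitespace instead of \w+: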

<?php
function twitterify($ret) {
    $ret = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $ret);
    $ret = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $ret);
    $ret = preg_replace("/@(.+?)(?=\s|$)/", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $ret);
    $ret = preg_replace("/#(.+?)(?=\s|$)/", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $ret);
    return $ret;
}


$tweet = '[{"text":"#rafhla\u00f0a #rafhla\u00f0a"}]';
$tweet = json_decode($tweet);
print $tweet[0]->text;
print "\n";
echo twitterify($tweet[0]->text);

Prints:

#rafhlaða #rafhlaða

<a href="http://search.twitter.com/search?q=rafhlaða" target="_blank">#rafhlaða</a> <a href="http://search.twitter.com/search?q=rafhlaða" target="_blank">#rafhlaða</a>
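Note that `.+?` up to the next whitespace also pulls trailing punctuation into the link. One possible alternative, sketched here under the assumption that the input is valid UTF-8, is to spell out the allowed characters with Unicode property classes (\p{L} for letters, \p{N} for digits) and the /u modifier. The function name linkHashtags is hypothetical, and URL-encoding the query value is an addition that the original code does not make:

```php
<?php
// Hypothetical helper: link hashtags made of Unicode letters, digits and underscores.
function linkHashtags($text) {
    return preg_replace_callback('/#([\p{L}\p{N}_]+)/u', function ($m) {
        // Percent-encode the tag for use in the query string.
        $query = rawurlencode($m[1]);
        return '<a href="http://search.twitter.com/search?q=' . $query
             . '" target="_blank">#' . $m[1] . '</a>';
    }, $text);
}

// ð is written as explicit UTF-8 bytes so the file encoding does not matter.
echo linkHashtags("#rafhla\xC3\xB0a, takk!"), "\n";
```

Here the comma after the hashtag stays outside the link. Like the original \w+ pattern, this sketch would still match a # that appears inside a URL fragment, so it is meant for plain tweet text that has not been linkified yet.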

answered 2013-07-20T21:01:52.830

Try adding this to your script (and leave out preg_replace):

header('Content-Type: application/json; Charset=UTF-8');

Solution 2:

$tweets_file = file_get_contents('cache/'.$queried_user.'-tweets.txt', FILE_TEXT);
answered 2013-07-20T20:45:42.813