
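I am using librets to retrieve data from my RETS server. For some reason the librets encoding method is not working, and I am getting some weird characters in my output. I noticed that characters like '’' are replaced with â€™. I could not find a fix for librets, so I decided to replace these garbage characters with their actual values after downloading the data. What I need is a list of such garbage strings and their equivalent characters. I searched for this but found no resources. Can anyone point me to a list of these garbage strings and their actual values, or to a piece of code that can generate them?

Thanks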


2 Answers

11
Answered 2012-08-19T06:48:15.863
0

Question reminder:

"...I noticed characters like '’' is replaced with â€™... I decided to replace such garbage characters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters."

Strictly dealing with this part:

"What I need is a list of such garbage string and their equivalent characters."

Using PHP, you can generate these characters and their equivalents. Working with all 1,111,998 Unicode code points or 109,449 UTF-8 symbols is impractical. Instead, the following loop covers the range just above ASCII, &#128 to &#257; substitute another range if it is more relevant to your context.

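<?php
  // Build one table row per code point: the entity text, the decoded UTF-8
  // bytes (shown as "garbage" on a non-UTF-8 page), and the rendered symbol.
  $tmp1 = ""; // initialize before appending to avoid an undefined-variable notice
  for ($i = 128; $i < 258; $i++) {
    $tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";", ENT_NOQUOTES, "UTF-8")."</td><td>&#".$i.";</td></tr>";
  }

  echo "<table border=1>
    <tr><td>&#</td><td>&quot;Garbage&quot;</td><td>symbol</td></tr>";
  echo $tmp1;
  echo "</table>";
?>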

From experience, in a single-byte (non-UTF-8) context, most "garbage" symbols originate in the range &#128 to &#257 and, more rarely, &#8129 to &#8246.

In order for the "garbage" symbols to display, the HTML page charset must be set to ISO-8859-1, or to whichever other charset caused the problem in the first place. They will not show if the charset is set to UTF-8.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
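
The byte-level effect behind this can be sketched in a few lines. This is only an illustration under an assumption: that the "garbage" comes from UTF-8 bytes being read as Windows-1252, which matches the â€™ example in the question. The function name mojibake_of and the sample characters are hypothetical, and the mbstring extension is assumed to be available.

```php
<?php
// Sketch: reproduce the "garbage" for a few characters by treating their
// UTF-8 bytes as Windows-1252 text (assumed cause; requires mbstring).
function mojibake_of(string $char): string
{
    // Declare the source bytes as Windows-1252 and re-encode them as UTF-8.
    return mb_convert_encoding($char, 'UTF-8', 'Windows-1252');
}

// Hypothetical sample set: right single quote, e-acute, u-umlaut.
$samples = ["\u{2019}", "\u{00E9}", "\u{00FC}"];

$map = [];
foreach ($samples as $char) {
    $map[mojibake_of($char)] = $char; // garbage => real character
}

foreach ($map as $garbage => $real) {
    echo $garbage, " => ", $real, "\n";
}
```

Each key of $map is a garbage string and each value is the character it stands for, which is the kind of list the question asks about.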


"I decided to replace such garbage characters with actual values after downloading data"

You CANNOT undo the "garbage" with PHP's utf8_decode(), which would only add more "garbage" on top of the existing "garbage". You can, however, use PHP's simple and fast search-and-replace function, str_replace().

First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array holds the search terms:

<?php
  // Code points 128 to 257 cover the Latin-1 Supplement block (128-255)
  // plus the first two Latin Extended-A characters (256-257).
  $tmp1 = "\$SearchArr = array(";
  for ($i = 128; $i < 258; $i++) {
    $tmp1 .= "\"".html_entity_decode("&#".$i.";", ENT_NOQUOTES, "UTF-8")."\", ";
  }
  $tmp1 = substr($tmp1, 0, -2); // drop the trailing ", "
  $tmp1 .= ");";
  // Escape the generated source so it can be echoed into an HTML page.
  $tmp1 = htmlentities($tmp1, ENT_NOQUOTES, "UTF-8");
?>

The second array holds the replacement terms:

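<?php
  // Adapt for your relevant range; it must match the range used for $SearchArr.
  $tmp2 = "\$ReplaceArr = array(\n";
  for ($i = 128; $i < 258; $i++) {
    $tmp2 .= "\"&#".$i.";\", ";
  }
  $tmp2 = substr($tmp2, 0, -2); // drop the trailing ", "
  $tmp2 .= ");";

  // Print both generated array definitions ($tmp1 comes from the previous block).
  echo $tmp1."\n<br><br>\n";
  echo $tmp2."\n";
?>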

Now you have 2 arrays that you can copy, paste, and reuse to clean any of your infected strings, like this:

$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
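
As a small worked example of that call, the sketch below uses a two-entry excerpt in place of the full generated arrays. The entries and the sample sentence are hypothetical, chosen to match the â€™ case from the question, and they assume the data is now held as UTF-8.

```php
<?php
// Two-entry excerpt standing in for the generated arrays (hypothetical data).
// "\u{...}" escapes keep the example independent of this file's own encoding.
$SearchArr  = [
    "\u{00E2}\u{20AC}\u{2122}", // garbage shown for the right single quote
    "\u{00C3}\u{00A9}",         // garbage shown for e-acute
];
$ReplaceArr = ["&#8217;", "&#233;"]; // numeric entities for the real characters

$InfectedString = "It\u{00E2}\u{20AC}\u{2122}s a caf\u{00C3}\u{00A9}";
$CleanString = str_replace($SearchArr, $ReplaceArr, $InfectedString);

echo $CleanString, "\n"; // It&#8217;s a caf&#233;
```

str_replace() applies the pairs in array order, so when one garbage sequence is a prefix of another, the longer one should come first.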

Note: utf8_decode() is of no help for cleaning up "garbage" symbols, but it can be used to prevent further contamination. Alternatively, an mb_ function such as mb_convert_encoding() can be useful.
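
One possible reading of this note, sketched with the mb_ functions: normalize incoming text to UTF-8 once, before it is stored or printed. The helper name to_utf8 is hypothetical, and the code assumes that anything which is not valid UTF-8 is Windows-1252, which may not hold for your data.

```php
<?php
// Sketch: convert incoming text to UTF-8 exactly once (requires mbstring).
// Assumption: input that is not valid UTF-8 is Windows-1252.
function to_utf8(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8; converting again would double-encode
    }
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}

echo to_utf8("caf\xE9"), "\n";     // single Windows-1252 byte E9 gets converted
echo to_utf8("caf\u{00E9}"), "\n"; // valid UTF-8 passes through unchanged
```

Checking validity first is what keeps already-correct strings from being converted a second time.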

Answered 2013-11-01T02:02:03.490