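I have a set of lines in a file, each of which may represent a multi-line comment. The original developer chose the pilcrow (¶) as the line separator, reasoning that it would never appear in anyone's comment. I am now moving these into a database and want to use a more typical line separator (though possibly one set by the application installer).
The problem is that some lines encode the pilcrow as ISO-8859-1 (hex b6) while others encode it as UTF-8 (hex c2b6). I'm looking for an elegant way to handle this that is better supported than what I'm doing now.
Here is how I handle it so far, but I would prefer a more elegant solution: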
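To make the two byte sequences concrete, here is a small sketch. It assumes the mbstring extension is available; the variable names are only illustrative.

```php
<?php
// Sketch only; assumes the mbstring extension is loaded.
$latin1 = "\xB6";                                              // pilcrow as ISO-8859-1 (one byte)
$utf8   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'); // same character as UTF-8 (two bytes)

echo bin2hex($latin1), "\n"; // b6
echo bin2hex($utf8), "\n";   // c2b6

// A lone B6 byte is not valid UTF-8, which is one way to tell the two kinds of line apart.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```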
// Due to the way the quote file is stored, the pilcrow used as a line
// break can be either a 2-byte (UTF-8) or a 1-byte (ISO-8859-1) sequence.
// Since we're dealing with the data on a Unix system, it makes more sense
// to replace these characters with a standard newline.
//
// The pattern works on raw bytes (no /u modifier) and tries the 2-byte
// form first, so C2 B6 is replaced as a unit rather than byte by byte.
// Caveat: a lone B6 is also the trailing byte of other UTF-8 characters
// (e.g. "ö" is C3 B6), so lines containing such characters would be damaged.
//
// First, some constants:
define('PILCROW', "\xC2\xB6");   // UTF-8 pilcrow (two bytes)
define('SHORT_PILCROW', "\xB6"); // ISO-8859-1 pilcrow (one byte), used in some places in the source data
define('NEEDLE', '/(?:' . PILCROW . '|' . SHORT_PILCROW . ')/'); // matches either form
define('REPLACEMENT', $GLOBALS['linesep']); // line separator configured elsewhere

function fix_line_breaks($quote)
{
    // Convert either the long or the short pilcrow to the configured line separator.
    return preg_replace(NEEDLE, REPLACEMENT, $quote);
}
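
One possible alternative, shown only as a sketch: decide the encoding per line, convert anything that is not valid UTF-8 from ISO-8859-1, and then replace the single remaining form of the pilcrow. The function name `normalize_pilcrows` and its parameters are hypothetical. The sketch assumes the mbstring extension is loaded and that each line is entirely in one encoding or the other. An ISO-8859-1 line whose bytes happen to form valid UTF-8 would be misclassified.

```php
<?php
// Sketch only: normalize_pilcrows() is a hypothetical helper, not part of the original code.
// Assumes mbstring is available and each line is wholly UTF-8 or wholly ISO-8859-1.
function normalize_pilcrows(string $line, string $linesep = "\n"): string
{
    if (!mb_check_encoding($line, 'UTF-8')) {
        // Not valid UTF-8, so treat the line as ISO-8859-1 and convert it.
        // The one-byte pilcrow (B6) becomes the two-byte form (C2 B6).
        $line = mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1');
    }

    // Every pilcrow is now the UTF-8 sequence C2 B6, so a plain byte-string
    // replace is enough and other multi-byte characters are left intact.
    return str_replace("\xC2\xB6", $linesep, $line);
}
```

With this approach an ISO-8859-1 line such as `"caf\xE9\xB6bar"` comes back as UTF-8 with a newline in place of the pilcrow, and a UTF-8 line containing "ö" (C3 B6) keeps that character unchanged, which a byte-level regular expression cannot guarantee.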