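I wrote some code to convert a UCS-2LE input file to plain 8-bit ISO-8859-1 text. After the conversion, I use the strtok function to split the whole text into words. I then call strlen on each word, but the lengths I get are strange and I cannot make sense of them.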
<?php
$fileData = file('input.txt');
foreach ($fileData as $txt) {
    $txt = iconv('ISO-8859-1', 'UCS-2LE', $txt);
    $tok = strtok($txt, " \n\t");
    while ($tok !== false) {
        echo 'Word = '.$tok.', Length = '.strlen($tok).'<br />';
        $tok = strtok(" \n\t");
    }
}
?>
The input file, input.txt (encoded in UCS-2LE), contains:
Slot# NumJobs ActiveJobID ActiveBatchJob ActiveProcStartTime
0 0 1 input3.dat 7:20 PM
1 0 2 input3.dat 7:20 PM
The output is:
Word = ÿþSlot#, Length = 24
Word = NumJobs, Length = 31
Word = ActiveJobID, Length = 47
Word = ActiveBatchJob, Length = 59
Word = ActiveProcStartTime , Length = 83
Word = , Length = 1
Word = 0, Length = 6
Word = 0, Length = 7
Word = 1, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = 1, Length = 6
Word = 0, Length = 7
Word = 2, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = , Length = 2
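1) Why are the lengths reported incorrectly?
2) Line 6 of the output is a newline character that strtok did not tokenize correctly. Why?
3) I have read a little about the BOM, and I understand that the first two bytes of the file identify the encoding in use. Is there a way to skip them? In the first line of the output they show up as two extra characters.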
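For reference, here is a minimal sketch of the pipeline I am aiming for, not a confirmed fix. It assumes that iconv takes the source encoding first and the target encoding second, that the BOM is the two bytes FF FE at the start of the file, and that lines may end in CRLF. The helper names convertUcs2leToLatin1 and splitWords are hypothetical and used only for illustration.

```php
<?php
// Hypothetical helper: convert raw UCS-2LE bytes to ISO-8859-1.
function convertUcs2leToLatin1(string $raw): string
{
    // Drop the little-endian byte order mark (bytes FF FE) if present.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }
    // Source encoding first, target encoding second.
    $converted = iconv('UCS-2LE', 'ISO-8859-1', $raw);
    return $converted === false ? '' : $converted;
}

// Hypothetical helper: split text on spaces, tabs, CR and LF using strtok.
function splitWords(string $text): array
{
    $delimiters = " \r\n\t"; // "\r" is included on the assumption of CRLF line endings
    $words = [];
    $tok = strtok($text, $delimiters);
    while ($tok !== false) {
        $words[] = $tok;
        $tok = strtok($delimiters);
    }
    return $words;
}

// Read the whole file as raw bytes instead of line by line: file() splits on the
// single byte 0x0A, which cuts a two-byte UCS-2LE newline (0A 00) in half.
$raw = @file_get_contents('input.txt');
if ($raw !== false) {
    foreach (splitWords(convertUcs2leToLatin1($raw)) as $word) {
        echo 'Word = ' . $word . ', Length = ' . strlen($word) . "<br />\n";
    }
}
```

With the conversion done once on the whole file, before any splitting, each ASCII character should occupy a single byte, so strlen on a word such as Slot# would be expected to report 5.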
Thanks in advance for your help.