
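I wrote some code to convert a UCS-2LE input file to plain 8-bit ISO-8859-1 text. After the conversion, I use the strtok function to split the whole text into words. I then call strlen on each word, but the lengths I get are strange and I cannot make sense of them.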

<?php
$fileData = file('input.txt');

foreach( $fileData as $txt ){

    $txt = iconv( 'ISO-8859-1', 'UCS-2LE', $txt );
    $tok = strtok($txt, " \n\t");
    while ($tok !== false) {
        echo 'Word = '.$tok.', Length = '.strlen($tok).'<br />';
        $tok = strtok(" \n\t");
    }
}
?>

The input file, input.txt (encoded in UCS-2LE), contains:

 Slot#  NumJobs ActiveJobID ActiveBatchJob  ActiveProcStartTime
 0  0   1   input3.dat  7:20 PM
 1  0   2   input3.dat  7:20 PM

The output is:

Word = ÿþSlot#, Length = 24
Word = NumJobs, Length = 31
Word = ActiveJobID, Length = 47
Word = ActiveBatchJob, Length = 59
Word = ActiveProcStartTime , Length = 83
Word = , Length = 1
Word = 0, Length = 6
Word = 0, Length = 7
Word = 1, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = 1, Length = 6
Word = 0, Length = 7
Word = 2, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = , Length = 2

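1) Why are the lengths reported incorrectly?

2) Line 6 of the output is a newline character that strtok did not tokenize correctly. Why?

3) I have read a little about the BOM, and I understand that the first two bytes of the file identify the encoding in use. Is there a way to skip them? The first line of the output shows two extra characters.

Thanks in advance for your help.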
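A possible reading of the output above, offered as a sketch rather than a definitive diagnosis: the arguments to iconv() appear to be swapped (its signature is source encoding first, then target), so every byte of the already 2-byte text is widened again to 2 bytes. That would give 4 bytes per character, plus 3 leftover NUL bytes after each tab delimiter, which matches the reported lengths (NumJobs: 7 × 4 + 3 = 31). In addition, file() splits on the single byte 0x0A, which cuts the 2-byte UCS-2LE newline in half and leaves a stray NUL at the start of the next line. The example below assumes the whole file is processed as one string; `ucs2leToWords` and `$sample` are hypothetical names introduced only for illustration.

```php
<?php
// Hypothetical helper (not part of the original script): decode a UCS-2LE
// byte string into ISO-8859-1 and split it into whitespace-separated words.
function ucs2leToWords(string $raw): array
{
    // Skip the little-endian BOM (bytes 0xFF 0xFE) if the data starts with it.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }

    // iconv() takes the source encoding first, then the target encoding.
    $text = iconv('UCS-2LE', 'ISO-8859-1', $raw);
    if ($text === false) {
        throw new RuntimeException('Input is not valid UCS-2LE or does not fit ISO-8859-1');
    }

    // Split on runs of whitespace (space, tab, CR, LF) and drop empty tokens.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// In-memory sample standing in for file_get_contents('input.txt'):
// a BOM, then each character of the text followed by a 0x00 byte.
$sample = "\xFF\xFE" . implode("\x00", str_split("Slot#\tNumJobs\r\n0\t0\r\n")) . "\x00";

foreach (ucs2leToWords($sample) as $word) {
    echo 'Word = ' . $word . ', Length = ' . strlen($word) . "\n";
}
```

With the real file, the same function could be called on file_get_contents('input.txt') instead of on `$sample`. Reading the file in one piece avoids the line-splitting problem, and converting before tokenizing means strlen counts one byte per character.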
