I need to convert text files' character encodings without hogging the server's memory; the input file is user-configured and its size isn't limited.

Would it be more efficient to wrap the Unix iconv command using exec() (which I'd rather avoid, although I already use it in the application for other file operations), or should I read the file line by line and write it to another file?

I'm thinking of working this way:

$in = fopen("in.txt", "r");
$out = fopen("out.txt", "w+");
while(($line = fgets($in, 4096)) !== false) {
    $converted = iconv($charset["in"], $charset["out"], $line);
    fwrite($out, $converted);
}
rename("out.txt", "in.txt");

Is there a better approach to convert the file quickly and efficiently? I'm thinking this might be rather CPU-intensive, but then I believe iconv itself is an expensive task, so I'm not sure whether I can actually keep it from eating much of the server's resources.

Thanks!

4 Answers

Alright, thanks for the input. I did "my homework" based on it and got these results, working with a 50 MB sample of actual CSV data:

First, iterating over the file using PHP:

$in = fopen("a.txt", "r");
$out = fopen("p.txt", "w+");

$start = microtime(true);

while(($line = fgets($in)) !== false) {
    $converted = iconv("UTF-8", "EUC-JP//TRANSLIT", $line);
    fwrite($out, $converted);
}

$elapsed = microtime(true) - $start;
echo "<br>Iconv took $elapsed seconds\r\n";


Iconv took 2.2817220687866 seconds

That's not so bad, I guess, so I tried the exact same approach in bash, so that it wouldn't have to load the whole file but would iterate over each line instead (which might not be exactly what happens, as I understand Lajos Veres's answer). Indeed, this method wasn't efficient (the CPU was under heavy load the whole time). Also, the output file is smaller than the other two, although at a quick glance it looks the same, so I must have made a mistake in the bash script; that shouldn't have such an effect on performance anyway:

#!/bin/bash
echo "" > b.txt
time echo $(
    while read line
    do
        echo $line |iconv -f utf-8 -t EUC-JP//TRANSLIT >> b.txt
    done < a.txt
)

real 9m40.535s
user 2m2.191s
sys 3m18.993s
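
The loop above starts a separate iconv process for every line, which likely accounts for most of the run time. The smaller output is probably explained by `read` without `-r` and the unquoted `$line`, which drop backslashes and collapse runs of whitespace. A minimal sketch of the same line-by-line idea with a single iconv process, assuming a small hypothetical sample file (`sample_utf8.txt`) in place of `a.txt`:

```shell
#!/bin/bash
# Hypothetical sample input, created here only so the sketch runs on its own.
printf '%s\n' 'id,  name' '  1,テスト' > sample_utf8.txt

# IFS= and -r stop read from trimming whitespace or interpreting backslashes;
# printf with a quoted "$line" avoids the word splitting of an unquoted echo.
# The whole loop is piped into one iconv process instead of one per line.
while IFS= read -r line; do
    printf '%s\n' "$line"
done < sample_utf8.txt | iconv -f UTF-8 -t EUC-JP > sample_eucjp.txt
```

Converting the result back with `iconv -f EUC-JP -t UTF-8 sample_eucjp.txt` should reproduce the input byte for byte, which is a quick way to check that nothing was lost.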

And then the classic approach, which I expected to hog memory. Checking the CPU/memory usage, however, it didn't seem to take any more memory than the other approaches, which makes it the winner:

#!/bin/bash
time echo $(
    iconv -f utf-8 -t EUC-JP//TRANSLIT a.txt -o b2.txt
)

real 0m0.256s
user 0m0.195s
sys 0m0.060s

I'll try to get a bigger sample file to test the two more efficient methods and make sure memory usage doesn't become significant. Still, the result seems obvious enough to assume that a single pass through the whole file in bash is the most efficient (I didn't try that in PHP, as I believe loading an entire file into an array/string in PHP is never a good idea).
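
For the single-pass variant, here is a minimal sketch of converting through standard input and replacing the original only when iconv succeeds, similar in spirit to the rename() in the question. The file name `data.csv` and its contents are hypothetical, and the sketch makes no claim about iconv's memory usage:

```shell
#!/bin/bash
# Hypothetical input file, created here only so the sketch runs on its own.
printf '%s\n' 'id,name' '1,テスト' > data.csv

# Write to a temporary file next to the original.
tmp=$(mktemp data.csv.XXXXXX)

# Replace the original only if iconv exits successfully.
if iconv -f UTF-8 -t EUC-JP < data.csv > "$tmp"; then
    mv "$tmp" data.csv
else
    rm -f "$tmp"
    echo "conversion failed, data.csv left unchanged" >&2
fi
```

Writing to a temporary file first means a failed conversion (for example, on invalid input bytes) does not leave a half-written file in place of the original.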

answered 2013-09-29T08:37:46.663

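FYI: http://sourceware.org/bugzilla/show_bug.cgi?id=6050

In any case, the OS needs to read the whole file sooner or later. That means the cache is cleared as it reads: LRU-like logic will free up the memory, so older pages will probably be dropped.

You can't be 100% sure how your system will tolerate this. You would have to move this process to separate hardware or a virtualized environment, but those solutions can create bottlenecks too.

Careful testing is probably the most cost-effective way. But it is the expected workload, not the implementation, that causes most of the headaches.

I mean that processing lots of gigabyte-sized files in a hundred parallel threads is completely different from processing a few files a day.

answered 2013-09-28T23:20:32.183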

Here is a benchmark of iconv called from PHP versus iconv called from Unix bash.

For PHP ->

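<?php
// file() returns the file as an array of lines; only the first line is benchmarked.
$text = file('a.txt');
$text = $text[0];
$start = microtime(true);
// Convert the same line 1000 times.
for ($i = 0; $i < 1000; $i++) {
    $str = iconv("UTF-8", "EUC-JP", $text);
}
$elapsed = microtime(true) - $start;
echo "<br>Iconv took $elapsed seconds\r\n";
?>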

Result on my server:

root@ubuntu:/var/www# php benc.php
<br>Iconv took 0.0012350082397461 seconds

For Unix Bash ->

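#!/bin/bash
# Caution: begin_time uses only the nanosecond field (%N) while end_time uses
# seconds plus nanoseconds (%s%N), so total_time is not an elapsed time.
begin_time=$(($(date +%N)/10000000))
# Each iteration converts the whole file in a new iconv process ({0..1000} is 1001 runs).
for i in {0..1000}
do
    iconv -f utf-8 -t EUC-JP a.txt -o b.txt
done
end_time=$(($(date +%s%N)/1000000))
total_time=$((end_time-begin_time))
echo ${total_time}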

Result on my server:

root@ubuntu:/var/www# bash test.sh
1380410308211

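The results clearly show that, in terms of CPU usage, you get better performance from iconv with PHP. It should be pointed out that the winner also uses less memory.

Note: to run these, create the a.txt file in the same directory as the *.sh and *.php files.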
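
In the bash script above, begin_time is taken from `date +%N` alone while end_time uses `date +%s%N`, so the number it prints is close to an epoch timestamp in milliseconds rather than an elapsed time. A minimal sketch of timing the loop with the same clock for both readings, assuming GNU date (for `%N`) and a small hypothetical input file `bench_in.txt`:

```shell
#!/bin/bash
# Hypothetical input file, created here only so the sketch runs on its own.
printf '%s\n' 'id,name' '1,テスト' > bench_in.txt

runs=20   # arbitrary run count chosen for illustration

# Both timestamps use seconds plus nanoseconds, converted to milliseconds.
begin_ms=$(( $(date +%s%N) / 1000000 ))
for ((i = 0; i < runs; i++)); do
    iconv -f UTF-8 -t EUC-JP bench_in.txt > bench_out.txt
done
end_ms=$(( $(date +%s%N) / 1000000 ))

total_ms=$(( end_ms - begin_ms ))
echo "${runs} runs took ${total_ms} ms"
```

Even with the timing fixed, this measures whole-file conversions including process startup, whereas the PHP script converts only the first line in-process, so the two numbers are not directly comparable.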

answered 2013-09-28T23:32:05.427

Why not just run it on the system instead of reading the file chunk by chunk, given that iconv exists on your system:

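// escapeshellarg() quotes each value so user-supplied charsets or file names
// cannot inject extra shell commands.
system(sprintf('iconv -f %s -t %s %s > %s',
                escapeshellarg($charset['in']), escapeshellarg($charset['out']),
                escapeshellarg("in.txt"), escapeshellarg("out.txt")));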
answered 2015-11-05T18:20:19.720