
I have a 1.3 GB text file that I need to extract some information from in PHP. I have researched it and come up with a few ways to do what I need to do, but as always I am after a little clarification on which method would be best, or whether a better one exists that I don't know about.

The information I need from the text file is only the first 40 characters of each line, and there are around 17 million lines in the file. The 40 characters from each line will be inserted into a database.

The methods I have are below:

// REMOVE TIME LIMIT
set_time_limit(0);
// REMOVE MEMORY LIMIT
ini_set('memory_limit', '-1');
// OPEN FILE
$handle = @fopen('C:\Users\Carl\Downloads\test.txt', 'r');
if($handle) {
    while(($buffer = fgets($handle)) !== false) {
        $insert[] = substr($buffer, 0, 40);
    }
    if(!feof($handle)) {
        // fgets() failed before reaching the end of the file
    }
    fclose($handle);
}

The above reads one line at a time and grabs the data. I have all the database inserts sorted, doing 50 inserts at a time, ten times over, in a transaction.

The next method is really the same as above, but calls file() to store all the lines in an array before doing a foreach to get the data. I am not sure about this method though, as the array would essentially hold over 17 million values.

Another method would be to extract only part of the file, rewrite the file with the unprocessed data, and after that part has executed, call the script again using a header() redirect.

What would be the best way to get this done in the quickest and most efficient manner? Or is there a better way to approach this that I have not thought of?

Also, I plan to use this script with WAMP, but running it in a browser while testing has caused timeout problems, even with the script timeout set to 0. Is there a way I can execute the script without accessing the page through a browser?


3 Answers


You're doing well so far. Don't use file(), since it will most likely hit the RAM usage limit and kill your script.

I wouldn't even accumulate things into an $insert[] array, as that wastes RAM as well. If you can, insert into the database straight away.

By the way, there is a nice tool called cut that you could use to process the file:

cut -c1-40 file.txt
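For example, given a made-up 45-character line, cut trims it to the first 40 characters:

```shell
# cut -c1-40 keeps only the first 40 characters of each input line
printf '%s\n' "0123456789012345678901234567890123456789EXTRA" | cut -c1-40
# → 0123456789012345678901234567890123456789
```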

You could even redirect cut's standard output to some PHP script that inserts into the database:

cut -c1-40 file.txt | php -f inserter.php

inserter.php could then read lines from php://stdin and insert them into the DB.
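A minimal sketch of what inserter.php could look like (the connection details, table name `lines`, and column name `prefix` are assumptions for illustration, not part of the answer):

```php
<?php
// Read lines from stdin and insert each one immediately.
// Assumes a table `lines` with a column `prefix`; adjust to your schema.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO `lines` (`prefix`) VALUES (?)');

$in = fopen('php://stdin', 'r');
while (($line = fgets($in)) !== false) {
    // Strip the trailing newline before inserting.
    $stmt->execute([rtrim($line, "\r\n")]);
}
fclose($in);
```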

cut is a standard tool available on every Linux. If you are on Windows, you can get it with a MinGW shell, as part of msystools (if you use git), or as a native win32 application installed via gnuWin32.

answered 2012-06-06T23:54:44.923

Why would you do this in PHP when your RDBMS almost certainly has bulk import functionality built in? MySQL, for example, has LOAD DATA INFILE:

LOAD DATA INFILE 'data.txt'
INTO TABLE `some_table`
  FIELDS TERMINATED BY ''
  LINES TERMINATED BY '\n'
  ( @line )
SET `some_column` = LEFT( @line, 40 );

All in one query.

MySQL also has the mysqlimport utility that wraps this functionality from the command line.

answered 2012-06-07T00:05:21.103

None of the above. The problem with using fgets() is that it does not work the way you might expect: when the maximum number of characters is reached, the next call to fgets() continues on the same line. You have correctly identified the problem with using file(). The third method is an interesting idea, and you could implement it alongside the other solutions as well.

That said, your first idea using fgets() is very close; we just need to modify its behaviour slightly. Here is a customized version that works as you would expect:

// Read one line from $fp, returning at most $len characters of it;
// any excess characters on the line are consumed and discarded.
function fgetl($fp, $len) {
    $l = 0;
    $buffer = '';
    // Read character by character until end of line or end of file.
    while (false !== ($c = fgetc($fp)) && PHP_EOL !== $c) {
        if ($l < $len)
            $buffer .= $c;
        ++$l;
    }
    // Nothing read and EOF reached: signal end of input.
    if (0 === $l && false === $c) {
        return false;
    }
    return $buffer;
}

Perform the inserts immediately, otherwise you will waste memory. Make sure you use prepared statements to insert this many rows; that will drastically reduce execution time. You don't want to submit the full query text with every insert when you can submit only the data.
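The prepared-statement advice can be sketched like this (the connection details, table name, and column name are hypothetical; it reuses the fgetl() function from above):

```php
<?php
// Prepare once, execute many: the query text is parsed a single time
// and only the data travels to the server on each execute().
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO `some_table` (`some_column`) VALUES (?)');

$fp = fopen('test.txt', 'r');
$pdo->beginTransaction();
while (false !== ($line = fgetl($fp, 40))) {
    $stmt->execute([$line]);
}
$pdo->commit();
fclose($fp);
```

Wrapping the loop in a single transaction avoids a disk flush per row, which matters at 17 million inserts.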

answered 2012-06-07T00:11:29.927