php - How do I remove duplicate rows from a CSV file?

Question

Is there a simple way to find and remove duplicate rows from a CSV file?

Sample test.csv file:

row1 test tyy......
row2 tesg ghh
row2 tesg ghh
row2 tesg ghh
....
row3 tesg ghh
row3 tesg ghh
...
row4 tesg ghh

Expected result:

row1 test tyy......
row2 tesg ghh
....
row3 tesg ghh
...
row4 tesg ghh

Where can I start to accomplish this in PHP?

score 12 · Accepted Answer

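The straightforward approach is to read the file line by line and keep track of every line you have already seen. If the current line has been seen before, skip it.

Something like the following (untested) code may work: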

<?php
// array to hold all "seen" lines
$lines = array();

// open the csv file
if (($handle = fopen("test.csv", "r")) !== false) {
    // read each line into an array
    while (($data = fgetcsv($handle, 8192, ",")) !== false) {
        // build a "line" from the parsed data
        $line = join(",", $data);

        // if the line has been seen, skip it
        if (isset($lines[$line])) continue;

        // save the line
        $lines[$line] = true;
    }
    fclose($handle);
}

// build the new content-data
$contents = '';
foreach ($lines as $line => $bool) $contents .= $line . "\r\n";

// save it to a new file
file_put_contents("test_unique.csv", $contents);
?>

This code uses fgetcsv() with a comma as the column delimiter (based on the sample data in the question comments).
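One caveat with the code above: rebuilding each row with join(",", $data) drops any quoting, so a field that itself contains a comma would come out as two columns. A minimal sketch of a variant that writes rows back with fputcsv() is below. The function name dedupeCsvRows and the use of serialize() as the lookup key are my own assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: full-row de-duplication that preserves CSV quoting.
function dedupeCsvRows(string $inPath, string $outPath): int
{
    $in = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Cannot open input or output file');
    }

    $seen = [];   // serialized row => true
    $written = 0; // number of rows written to the output
    while (($row = fgetcsv($in, 0, ',', '"', '\\')) !== false) {
        // fgetcsv() returns an array holding a single null for a blank line
        if ($row === [null]) {
            continue;
        }
        // serialize() gives a key that distinguishes ["a,b"] from ["a", "b"]
        $key = serialize($row);
        if (isset($seen[$key])) {
            continue;
        }
        $seen[$key] = true;
        // fputcsv() re-adds quotes where a field needs them
        fputcsv($out, $row, ',', '"', '\\');
        $written++;
    }
    fclose($in);
    fclose($out);
    return $written;
}
```

Calling dedupeCsvRows("test.csv", "test_unique.csv") would then play the same role as the script above. Memory use is still proportional to the number of distinct rows.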

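Storing every line seen, as above, ensures that all duplicate lines in the file are removed, whether or not they directly follow one another. If duplicates always appear back to back, a simpler (and more memory-efficient) approach is to store only the last line seen and compare it with the current line.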
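The back-to-back case could look roughly like this. It is only a sketch: the function name dedupeAdjacentLines is made up for illustration, and it compares raw text lines rather than parsed CSV fields, so it assumes duplicates are byte-for-byte identical apart from the line ending.

```php
<?php
// Hypothetical helper: removes duplicates only when they are adjacent.
function dedupeAdjacentLines(string $inPath, string $outPath): int
{
    $in = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Cannot open input or output file');
    }

    $previous = null; // the only line kept in memory
    $written = 0;     // number of lines written to the output
    while (($line = fgets($in)) !== false) {
        // drop the line ending so a final line without "\n" still matches
        $current = rtrim($line, "\r\n");
        if ($current === $previous) {
            continue; // same as the line just before it: skip
        }
        fwrite($out, $current . "\n");
        $previous = $current;
        $written++;
    }
    fclose($in);
    fclose($out);
    return $written;
}
```

Because only one line is held at a time, memory use stays constant regardless of file size. The trade-off is that a duplicate separated from its twin by any other line is not removed.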

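Update (rows duplicated by SKU column rather than by the full row)
Based on the sample data provided in the comments, the "duplicate rows" are not actually equal (they are similar, but they have different numbers of columns). What they have in common can be tied to a single column, sku.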

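Below is an extended version of the code above. This block parses the first line of the CSV file (the column list) to determine which column holds the sku code. From there it keeps a list of unique SKU codes, and if the current row has a "new" code, it writes that row to the new "unique" file using fputcsv():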

<?php
// array to hold all unique lines
$lines = array();

// array to hold all unique SKU codes
$skus = array();

// index of the `sku` column
$skuIndex = -1;

// open the "save-file"
if (($saveHandle = fopen("test_unique.csv", "w")) !== false) {
    // open the csv file
    if (($readHandle = fopen("test.csv", "r")) !== false) {
        // read each line into an array
        while (($data = fgetcsv($readHandle, 8192, ",")) !== false) {
            if ($skuIndex == -1) {
                // we need to determine what column the "sku" is; this will identify
                // the "unique" rows
                foreach ($data as $index => $column) {
                    if ($column == 'sku') {
                        $skuIndex = $index;
                        break;
                    }
                }
                if ($skuIndex == -1) {
                    echo "Couldn't determine the SKU-column.";
                    die();
                }
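                // write the header row to the file once, then move on to the
                // data rows (without this, the header would be written twice)
                fputcsv($saveHandle, $data);
                continue;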
            }

            // if the sku has been seen, skip it
            if (isset($skus[$data[$skuIndex]])) continue;
            $skus[$data[$skuIndex]] = true;

            // write this line to the file
            fputcsv($saveHandle, $data);
        }
        fclose($readHandle);
    }
    fclose($saveHandle);
}
?>

Overall, this approach is more memory-friendly because it does not need to keep a copy of every line in memory (only the SKU codes).

score 0 · Answer

One-line solution:

file_put_contents('newdata.csv', array_unique(file('data.csv')));
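Two things to keep in mind with the one-liner: file() keeps the line ending on every element, so a final line without a trailing newline will not match an otherwise identical earlier line, and the whole file is loaded into memory. A small sketch that handles the first point is below; the wrapper name dedupeWholeFile is an assumption for illustration.

```php
<?php
// Hypothetical wrapper around the one-liner that ignores line endings.
function dedupeWholeFile(string $inPath, string $outPath): void
{
    // FILE_IGNORE_NEW_LINES strips the trailing newline from every element
    $lines = file($inPath, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read $inPath");
    }
    // array_unique() keeps the first occurrence of each distinct line
    $unique = array_unique($lines);
    // re-add one newline per line; write an empty file for empty input
    file_put_contents($outPath, $unique ? implode("\n", $unique) . "\n" : '');
}
```

Like the original one-liner, this compares whole lines as plain strings, so it does not cover the SKU-column case.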

answered 2020-09-04T10:25:01.713
