java - Split an input file into multiple files based on one of its columns

Question

I have a semicolon-delimited input file in which the first column is a fixed-width, 3-character code and the remaining columns are string data.

001;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str

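I want to split the file above into multiple files based on the distinct values of the first column.

In the example above, the first column has three distinct values, so the file would be split into three files: 001.txt, 002.txt and 003.txt.

Each output file should contain the item count as its first line and the data as the remaining lines.

There are five 001 rows, so 001.txt would be: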

5
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str

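Similarly, the 002 file would have 4 as its first line followed by 4 data lines, and the 003 file would have 5 as its first line followed by 5 data lines.

Given a very large input file of more than 100,000 lines, what is the most efficient way to achieve this?

I have written the code below to read the lines from the file: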

// Needs java.io.BufferedReader, FileInputStream, InputStreamReader, IOException.
// try-with-resources closes the reader even if reading fails.
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(this.inputFilePath)))) {
    String strLine;

    // Read the input one line at a time.
    while ((strLine = br.readLine()) != null) {
        // tokens[0] is the 3-character code; the rest are the data columns.
        String[] tokens = strLine.split(";");
    }
} catch (IOException e) {
    e.printStackTrace();
}

score 1 · Accepted Answer

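For each line:
Extract the chunk name, e.g. 001
Look for a file named "001-tmp.txt"
If it exists, read the first line, which gives you the number of lines so far. Increment that value, use the seek function with argument 0 to go back to the start of the same file, and overwrite the string with writeUTF. Some string length calculation probably has to be applied here, e.g. leaving a placeholder of 10 spaces.
If it does not exist, create it and write 1 as the first line, padded to 10 spaces
Append the current line to the file
Close the current file
Continue with the next line of the source file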
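
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `CountedAppender`, the `-tmp.txt` suffix handling and the 10-character header width are made up for the example. It also uses `writeBytes` instead of `writeUTF`, because `RandomAccessFile.writeUTF` prepends a two-byte length that would end up as stray bytes in a text file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper: keeps a fixed-width line count at the start of each output file.
class CountedAppender {
    static final int HEADER_WIDTH = 10; // assumed placeholder width, as the answer suggests

    // Appends one data line to <dir>/<code>-tmp.txt and bumps the count in the header.
    static void append(Path dir, String code, String data) throws IOException {
        Path file = dir.resolve(code + "-tmp.txt");
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long count = 0;
            if (raf.length() > 0) {
                // Existing file: the first line holds the space-padded count.
                raf.seek(0);
                count = Long.parseLong(raf.readLine().trim());
            }
            count++;
            // Overwrite the header in place; its fixed width keeps the data lines where they are.
            raf.seek(0);
            raf.writeBytes(String.format("%-" + HEADER_WIDTH + "d%n", count));
            // Append the data line at the end of the file.
            raf.seek(raf.length());
            raf.write((data + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Note that the count line stays padded with spaces, so a final clean-up pass would still be needed if the first line must contain the bare number. Opening and closing the file for every input line is also likely to be slow on large inputs.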

score 1 · Accepted Answer

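One solution that comes to mind is to keep a map and open each file only once. But you can't do that: you have around 1 lakh (100,000) lines, so no operating system will allow you to open that many file descriptors.

So one approach is to open the file in append mode, write to it, and close it again. But the large number of open and close calls may slow the process down. You can test that yourself, though.

If the above does not give satisfactory results, you could try a hybrid of approaches 1 and 2, in which you keep at most 100 files open at any time and close one only when you need to write to a file that is not already open...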
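
A rough sketch of that hybrid, assuming a least-recently-used eviction policy (the answer does not say which file to close, so that choice is mine). The class name `WriterPool` and its methods are hypothetical. It relies on `LinkedHashMap` in access order with `removeEldestEntry`, and reopens evicted files in append mode.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical pool: at most maxOpen append-mode writers are open at any time.
class WriterPool implements AutoCloseable {
    private final Path dir;
    private final LinkedHashMap<String, BufferedWriter> open;

    WriterPool(Path dir, int maxOpen) {
        this.dir = dir;
        // accessOrder = true keeps the least recently used writer first.
        this.open = new LinkedHashMap<String, BufferedWriter>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, BufferedWriter> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                try {
                    eldest.getValue().close(); // flushes buffered lines before eviction
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return true;
            }
        };
    }

    // Writes one line to <dir>/<code>.txt, reopening the file in append mode if needed.
    void write(String code, String data) throws IOException {
        BufferedWriter w = open.get(code);
        if (w == null) {
            w = Files.newBufferedWriter(dir.resolve(code + ".txt"),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            open.put(code, w);
        }
        w.write(data);
        w.newLine();
    }

    // Number of writers currently open.
    int openCount() {
        return open.size();
    }

    @Override
    public void close() throws IOException {
        for (BufferedWriter w : open.values()) {
            w.close();
        }
        open.clear();
    }
}
```

The sketch only handles the data lines. The per-file counts would still have to be tracked separately and written as the first line in a later step.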

score 0 · Accepted Answer

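First, create a HashMap<String, ArrayList<String>> map to collect all the data from the file. Second, use strLine.split(";", 2) instead of strLine.split(";"). The result is an array of length 2: the first element is the code and the second is the data. Then add the parsed string to the map: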

// Look up the list for this code, creating it on first use.
ArrayList<String> list = map.get(tokens[0]);
if (list == null) {
    map.put(tokens[0], list = new ArrayList<String>());
}
list.add(tokens[1]);

Finally, iterate over map.keySet() and, for each key, create a file named after that key and write the size of the list followed by its contents.
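
Put together, the approach might look like the sketch below. It assumes the whole input fits in memory. The class name `InMemorySplitter`, the method names and the decision to skip lines without a semicolon are assumptions for the example, not part of the answer.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: collects everything in memory, then writes one file per code.
class InMemorySplitter {
    // Groups the data part of each line by its leading code.
    static Map<String, List<String>> group(Path input) throws IOException {
        Map<String, List<String>> map = new HashMap<>();
        try (BufferedReader br = Files.newBufferedReader(input)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.split(";", 2); // [code, rest of the line]
                if (tokens.length < 2) {
                    continue; // assumption: ignore lines without a separator
                }
                map.computeIfAbsent(tokens[0], k -> new ArrayList<>()).add(tokens[1]);
            }
        }
        return map;
    }

    // Writes <code>.txt for each key: the item count first, then the data lines.
    static void writeAll(Map<String, List<String>> map, Path outDir) throws IOException {
        for (Map.Entry<String, List<String>> e : map.entrySet()) {
            Path out = outDir.resolve(e.getKey() + ".txt");
            try (BufferedWriter w = Files.newBufferedWriter(out)) {
                w.write(Integer.toString(e.getValue().size()));
                w.newLine();
                for (String data : e.getValue()) {
                    w.write(data);
                    w.newLine();
                }
            }
        }
    }
}
```

A call such as `InMemorySplitter.writeAll(InMemorySplitter.group(input), outDir)` would then produce the files. Because each output file is opened exactly once, only one file descriptor is in use at a time during the write phase.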

score 0 · Accepted Answer

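For each three-character code you will have a list of input lines. To me, the obvious solution is to use a Map with String keys (your three-character codes), each pointing to the List that contains all the corresponding lines.

For each of those keys, you create a file with the corresponding name. The first line is the size of the list, and you then iterate over the list to write the remaining lines.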

score 0 · Accepted Answer

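I guess you are not limited to three files, so I suggest you create a map of writers, with your three-character code as the key and the writer as the value.

For each line you read, you select or create the required writer and write the line to it. You also need a second map to keep track of the line count for every file.

Once you have finished reading the source file, flush and close all writers and read the output files again. This time you simply add the line count at the front of each file. As far as I know, there is no way around rewriting the entire file, because nothing can be inserted at the beginning of a file without buffering and rewriting all of it. I suggest you use a temporary file for this.

This answer applies only if your file is too large to be held entirely in memory. If it fits, there are faster solutions, such as collecting each file's content in a StringBuffer object before writing it out.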
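
One way the two-map idea might be sketched is shown below. As a small variation on the answer, the data lines go to a temporary `.tmp` file first, and the final `.txt` file is then created with the count followed by a copy of the temporary file. The class name `StreamingSplitter` and the file extensions are assumptions. The sketch also keeps one writer open per distinct code, so it presumes the number of codes stays below the operating system's open-file limit.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: streams lines to per-code temp files, then prepends the counts.
class StreamingSplitter {
    static void split(Path input, Path outDir) throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>(); // code -> open writer
        Map<String, Integer> counts = new HashMap<>();         // code -> lines written
        try (BufferedReader br = Files.newBufferedReader(input)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.split(";", 2);
                if (tokens.length < 2) {
                    continue; // assumption: ignore lines without a separator
                }
                BufferedWriter w = writers.get(tokens[0]);
                if (w == null) {
                    w = Files.newBufferedWriter(outDir.resolve(tokens[0] + ".tmp"));
                    writers.put(tokens[0], w);
                }
                w.write(tokens[1]);
                w.newLine();
                counts.merge(tokens[0], 1, Integer::sum);
            }
        } finally {
            // Flush and close every writer before the temp files are read again.
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
        // Second pass: write the count, then copy the temp file behind it.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            Path tmp = outDir.resolve(e.getKey() + ".tmp");
            Path out = outDir.resolve(e.getKey() + ".txt");
            try (OutputStream os = Files.newOutputStream(out)) {
                os.write((e.getValue() + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
                Files.copy(tmp, os);
            }
            Files.delete(tmp);
        }
    }
}
```

Memory use stays roughly constant regardless of input size, at the cost of writing every data line twice.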
