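I need to process a large text file (about 600 MB) so that it is correctly formatted, writing the formatted output to a new text file. The problem is that writing to the new file stops at about 6.2 MB. Here is the code: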

/* Analysis of the text in fileName to see if the lines are in the correct format 
     * (Theme\tDate\tTitle\tDescription). If there are lines that are in the incorrect format,
     * the method corrects them.
     */
    public static void cleanTextFile(String fileName, String destFile) throws IOException {
        OutputStreamWriter writer = null;
        BufferedReader reader = null;

        try {
            writer = new OutputStreamWriter(new FileOutputStream(destFile), "UTF8");
        } catch (IOException e) {
            System.out.println("Could not open or create the file " + destFile);
        }

        try {
            reader = new BufferedReader(new FileReader(fileName));
        } catch (FileNotFoundException e) {
            System.out.println("The file " + fileName + " doesn't exist in the folder.");
        }

        String line;
        String[] splitLine;
        StringBuilder stringBuilder = new StringBuilder("");

        while ((line = reader.readLine()) != null) {
            splitLine = line.split("\t");
            stringBuilder.append(line);

            /* If the String array resulting of the split operation doesn't have size 4,
             * then it means that there are elements of the news item missing in the line
             */
            while (splitLine.length != 4) {
                line = reader.readLine();
                stringBuilder.append(line);

                splitLine = stringBuilder.toString().split("\t");
            }
            stringBuilder.append("\n");
            writer.write(stringBuilder.toString());
            stringBuilder = new StringBuilder("");

            writer.flush();
        }

        writer.close();
        reader.close();

    }

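I have already searched for answers, but the problem is usually that the writer is not closed or that flush() is never called. So I think the problem is in the BufferedReader. What am I missing?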

3 Answers

Look at this loop:

while (splitLine.length != 4) {
    line = reader.readLine();
    stringBuilder.append(line);

    splitLine = stringBuilder.toString().split("\t");
}

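If you end up with more than 4 items in splitLine, the loop will keep reading until it runs out of data, and will then keep appending null to the StringBuilder forever. I don't know whether that is what is happening (we don't know what your data looks like), but it is certainly feasible, and you should guard against it.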

(You should also use try/finally blocks to close your resources, but that is a separate matter.)
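A minimal sketch of both points, not the asker's final code: the inner loop stops at end of input instead of appending null, and try-with-resources closes the reader and writer. The class name RecordJoiner, the helper cleanString, the choice to pass records with more than four fields through unchanged, and the use of split with a limit of -1 are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class RecordJoiner {
    static final int FIELDS = 4; // Theme, Date, Title, Description

    /** Joins physical lines until a record has at least FIELDS tab-separated fields. */
    static void clean(Reader in, Writer out) throws IOException {
        // try-with-resources closes both streams even if an exception is thrown
        try (BufferedReader reader = new BufferedReader(in);
             BufferedWriter writer = new BufferedWriter(out)) {
            StringBuilder record = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                record.append(line);
                // "< FIELDS" rather than "!= 4": a record with too many fields
                // ends the loop instead of swallowing the rest of the file.
                // Limit -1 keeps trailing empty fields when counting.
                while (record.toString().split("\t", -1).length < FIELDS) {
                    line = reader.readLine();
                    if (line == null) {
                        break; // end of input: never append null
                    }
                    record.append(line);
                }
                writer.write(record.toString());
                writer.write('\n');
                record.setLength(0); // reuse the builder for the next record
            }
        }
    }

    /** Convenience wrapper for small in-memory inputs (hypothetical helper). */
    static String cleanString(String input) {
        StringWriter out = new StringWriter();
        try {
            clean(new StringReader(input), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The second record is broken across two physical lines.
        System.out.print(cleanString("a\tb\tc\td\ne\tf\n\tg\th\n"));
    }
}
```

For real files, clean could be called with an InputStreamReader over a FileInputStream and an OutputStreamWriter over a FileOutputStream, each given an explicit charset; whether the input is actually UTF-8 is something only the data can confirm.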

answered 2012-10-10T18:34:05.020
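  1. The try/catch blocks are not well written: if an error occurs, processing continues anyway.
  2. You can replace

        stringBuilder = new StringBuilder("");
    

    with

        stringBuilder.setLength( 0 );
    
  3. Use your own parser based on line.indexOf('\t', from) instead of String.split()

  4. Add the parts obtained with line.substring( b, e ) to a List<String>
  5. Use a PrintStream with the correct charset, via the constructor that takes two arguments
  6. When list.size() >= 4, write the information out four fields at a time, using the data in the list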
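Points 3 to 6 could be sketched roughly as follows. The names TabFields, appendFields, and writeRecords are hypothetical, and the sketch assumes that a broken record is split only at a field boundary, so every physical line holds whole fields; a line break in the middle of a field would need extra handling.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TabFields {
    /** Splits line on tabs with indexOf/substring and appends the parts to fields. */
    static void appendFields(String line, List<String> fields) {
        int from = 0;
        int tab;
        while ((tab = line.indexOf('\t', from)) >= 0) {
            fields.add(line.substring(from, tab));
            from = tab + 1;
        }
        fields.add(line.substring(from)); // text after the last tab
    }

    /**
     * Writes one record per four collected fields and returns the number of
     * leftover fields (0 if the input ended on a complete record).
     */
    static int writeRecords(Iterable<String> lines, PrintStream out) {
        List<String> fields = new ArrayList<>();
        for (String line : lines) {
            appendFields(line, fields);
            while (fields.size() >= 4) {
                List<String> head = fields.subList(0, 4);
                out.print(String.join("\t", head));
                out.print('\n');
                head.clear(); // removes the four written fields from the list
            }
        }
        return fields.size();
    }

    public static void main(String[] args) {
        // The second record arrives on two physical lines.
        List<String> lines = Arrays.asList("a\tb\tc\td", "e\tf", "g\th");
        int leftover = writeRecords(lines, System.out);
        System.out.println("leftover fields: " + leftover);
    }
}
```

For a file destination, the two-argument constructor mentioned in point 5 would be new PrintStream(destFile, "UTF-8"), which takes a file name and a charset name. Since PrintStream does not throw on write errors, checkError() is worth calling once writing is done.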
answered 2012-10-10T18:54:50.167

Separate the FileOutputStream into its own variable and close it:

FileOutputStream fos = new FileOutputStream(destFile);
writer = new OutputStreamWriter(fos);

   ...

writer.flush();
fos.flush();
answered 2012-10-10T18:37:55.843