
I need to parse a CSV file at work. Each line in the file is not very long, only a few hundred characters. I used the following code to read the file into memory.

def lines = []
new File( fileName ).eachLine { line -> lines.add( line ) }

When the file has 10,000 lines, the code works just fine. However, when I increase the number of lines to 100,000, I get this error:

java.lang.OutOfMemoryError: Java heap space

The file is about 7 MB at 10,000 lines and about 70 MB at 100,000 lines. So, how would you solve this problem? I know increasing the heap size is a workaround, but are there any other solutions? Thank you in advance.


2 Answers

def lines = []

In Groovy, this creates an empty ArrayList (size 0). Depending on the JDK version, the internal Object[] is either not allocated until the first add or starts with only a small default capacity.

When adding items, if capacity is reached, ArrayList allocates a larger internal array and copies the existing entries into it (the list object itself is not replaced). The larger the list, the more time is spent reallocating to accommodate new entries. I suspect that's where your memory issue occurs because, although I'm not exactly sure how ArrayList sizes the new array, if you're getting OOM for a relatively small data set, that's where I'd look first. For 100,000 entries, starting with an empty ArrayList, this reallocation happens a few dozen times (roughly 29 by the estimate log(100,000)/log(1.5), assuming an expansion factor of 1.5).
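To make that estimate concrete, here is a minimal sketch that counts the reallocations by simulation. It assumes the growth rule newCapacity = oldCapacity + (oldCapacity >> 1) and a starting capacity of 10; both are assumptions about the JDK in use, and countGrowths is a hypothetical helper, not part of any library.

```groovy
// Count how many times a backing array would have to grow to hold `target`
// elements, assuming newCapacity = oldCapacity + (oldCapacity >> 1).
// The starting capacity is an assumption; it differs between JDK versions.
int countGrowths(int startCapacity, int target) {
    int capacity = startCapacity
    int growths = 0
    while (capacity < target) {
        // Grow by half; the max() guards against capacity 1, where 1 + (1 >> 1) == 1
        capacity = Math.max(capacity + (capacity >> 1), capacity + 1)
        growths++
    }
    return growths
}

println countGrowths(10, 100000)   // prints 23
```

Under these assumptions the count is 23, in the same range as the estimate above. Note that the array holds references rather than the strings themselves, so each copy is small relative to the line data.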

If you have a general idea how large the list needs to be, just set the initial capacity. Doing so avoids all the reallocating nonsense. See if this works:

def lines = new ArrayList<String>(100000)
Answered 2013-08-26T20:08:34.903

Assuming that you are likely trying to load the CSV file into a database, you can do something like this. The key Groovy feature is splitEachLine(yourDelimiter), which passes the split fields of each line to the closure.

import groovy.sql.*

def sql = Sql.newInstance("jdbc:oracle:thin:@localhost:1521:ORCL",
    "scott", "tiger", "oracle.jdbc.driver.OracleDriver")

// Define a DataSet backed by the staging table TEMP_DATA.
def student = sql.dataSet("TEMP_DATA")
// Iterate over the CSV file one line at a time, splitting each line on commas,
// and insert the fields into the table.
new File("C:/temp/file.csv").splitEachLine(",") { fields ->
    // Map each column of the current line to a table column.
    student.add(
        STUDENT_ID: fields[0],
        FIRST_NAME: fields[1],
        LAST_NAME: fields[2]
    )
}
// The data is now in the staging table TEMP_DATA.
println "Number of Records  " + sql.firstRow("select count(*) as CNT from TEMP_DATA").CNT
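
If no database is involved, the same line-at-a-time idea still keeps memory use low, because only the current line and whatever you choose to keep are held in memory. The following is a minimal sketch, not a full CSV parser: the temp file, the sample rows, and the variable names are all made up for illustration, and splitting on a bare comma will not handle quoted fields.

```groovy
// Hypothetical sample data written to a temp file so the sketch runs on its own.
def csv = File.createTempFile("sample", ".csv")
csv.deleteOnExit()
csv.text = "1,Ada,Lovelace\n2,Alan,Turing\n3,Grace,Hopper\n"

int rowCount = 0
def lastNames = []

// splitEachLine reads one line at a time and passes the split fields to the closure.
csv.splitEachLine(",") { fields ->
    rowCount++
    lastNames << fields[2]   // keep only the column that is actually needed
}

println "rows=${rowCount}, lastNames=${lastNames}"
```

Keeping only the needed columns, or aggregating as you go, means the full 70 MB of text never has to sit in a list at once.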
Answered 2013-08-29T14:28:12.270