
We are dumping a lot of data (in terms of volume rather than frequency, i.e. dumping 100K records / > 400 MB at once) from a database to Excel files. The dumping process is currently performed in Python, R and Java (using the POI library). As part of the dump, we read the data from the database into an intermediate file (a pipe-delimited text file), which is then picked up by the code that updates the Excel files. Recently, we ran into issues where text from the database containing newline characters made the pipe-delimited files invalid, since one record spanned multiple lines instead of just one. For example,

| Col1 | Col2 | Col3 |  
| Val  | Val2 | Val3 |

is a valid example of a pipe-delimited file. If the data contains any newlines, we instead end up with something like:

| Col1 | Col2 | Col3 |
| Val1


| Val2 | Val3 


|

Such scenarios are harder to catch and result in more code than should be needed just to perform such checks.

I was wondering if there are any libraries/techniques that can be used to write out such temp data. I am not sure if XML would be a solution, considering that performance might become an issue for such a large volume of data. JSON might seem a better fit, but I don't know all my options here.


1 Answer


If the number of columns is always guaranteed to be the same, this is just an odd dialect of csv, which you should be able to parse with the csv module in Python, and I suspect Java (but maybe not R) has similar functionality in either a built-in or readily-available library.
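For example, a minimal sketch in Python (the file name dump.txt is just a placeholder for the intermediate file, and it assumes each record sits on one physical line; the leading/trailing | in your sample produce an empty first and last field, which are dropped here):

```python
import csv

# Read the pipe-delimited dump as a csv dialect with '|' as the delimiter.
with open("dump.txt", newline="") as f:
    for row in csv.reader(f, delimiter="|"):
        # Drop the empty fields created by the leading/trailing pipes,
        # and trim the padding spaces around each value.
        fields = [field.strip() for field in row[1:-1]]
        print(fields)
```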

Or, if you've written the parsers yourself for some reason, it should be pretty easy to extend them to handle newlines. For example, instead of reading a line and splitting on | and assuming you've got all the fields, read a line, split on |, count to see if you have enough fields, and if not read the next line and append and try again. But you're really better off using code that's already been written and tested than trying to hack it together yourself.
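A rough sketch of that approach in Python (dump.txt and the field count of 3 are made up for the example):

```python
def read_records(path, expected_fields, delimiter="|"):
    # Accumulate-until-complete: a complete record like "| a | b | c |"
    # contains expected_fields + 1 delimiters, so keep appending physical
    # lines to the current record until that many have been seen.
    with open(path) as f:
        record = ""
        for line in f:
            record += line
            if record.count(delimiter) >= expected_fields + 1:
                fields = record.split(delimiter)[1:-1]  # drop the outer pipes
                yield [field.strip() for field in fields]
                record = ""

# Usage:
# for row in read_records("dump.txt", expected_fields=3):
#     print(row)
```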

(Of course if the fields can contain | characters, then this format is ambiguous, and can't be parsed by anything, unless you're escaping them somehow.)

Another option is to just quote or escape the newlines (and other special characters) on one end and unescape them on the other. Again, this is something any decent csv library will do for you (almost whether you want it to or not).
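A sketch with Python's csv module, keeping the pipe delimiter (the file name is again just for illustration):

```python
import csv

rows = [["Col1", "Col2", "Col3"],
        ["Val1\nwith an embedded newline", "Val2", "Val3"]]

# Writing: the csv module quotes any field containing the delimiter or a
# newline, so the record stays one logical row even across physical lines.
with open("dump.txt", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows(rows)

# Reading: the quoted newline comes back inside the field, not as a new record.
with open("dump.txt", newline="") as f:
    for row in csv.reader(f, delimiter="|"):
        print(row)
```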

And you might want to consider using the quasi-standard csv dialect (usually meaning "as defined by Excel's defaults") instead of coming up with a similar but not identical custom format.
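For instance, a minimal sketch (same illustrative data, written with the module's default dialect):

```python
import csv

rows = [["Col1", "Col2", "Col3"],
        ["Val1\nwith an embedded newline", "Val2", "Val3"]]

# csv.writer's default dialect is 'excel' (comma-delimited, quoted the way
# Excel expects), so the resulting file can be opened in Excel as-is.
with open("dump.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```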

One obvious advantage of using a standard csv dialect is that Excel can then read the results directly, which may take one layer out of your long chain. (Of course you might be able to take even more layers out, by using Excel's data access features to just import or front for the actual database.)

If you want to change to JSON, there's no reason you can't. But there doesn't seem to be any compelling reason to do so here. When you have flexible, dynamic record types, JSON (or something similar, like YAML) is definitely the way to go. But when you have static record types repeated over and over again, JSON means repeating the names of those fields over and over. It's not as bad as XML would be, but it's still extra information to create, pass, and parse for no real benefit.
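To make the overhead concrete, here is a small sketch (the field names are illustrative) writing the same records both ways:

```python
import csv, io, json

records = [{"Col1": "Val1", "Col2": "Val2", "Col3": "Val3"},
           {"Col1": "Val4", "Col2": "Val5", "Col3": "Val6"}]

# JSON Lines repeats every field name on every record:
print("\n".join(json.dumps(r) for r in records))

# csv names the fields once, in a header row:
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Col1", "Col2", "Col3"])
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```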

So, I think the right answer here is: Excel-style csv if possible; if for some reason that's not possible, your own csv dialect with a rule added for how to deal with newlines.

answered 2012-12-19T21:40:33.140