I've got a problem at work that requires me to insheet some MASSIVE tab-separated values files (think 8-15 GB .txt files) into my PostgreSQL DB, but I've run into an issue with how the data was formatted in the first place. Basically, given the way we receive the data (and unfortunately we can't get it in a better format), there are stray backslashes that trigger a return/newline.
So, there are lines (rows of tab-delimited data) that get chopped up into multiple physical lines, where the last character of line n is a "\", and the first character of line n+1 is a tab. Usually line n is broken into 1-3 additional lines (e.g. line n ends in a "\", lines n+1 and n+2 start with a tab and end with a "\", and line n+3 starts with a tab).
I need to write a script that can handle these huge files (it will run on a Linux server with 192 GB of RAM), find the lines that begin with a tab, remove the preceding return (and the "\" wherever it appears), and save the repaired text file.
To recap, the customer's logging program splits the original logical line N into physical lines n, n+1, and sometimes n+2 and n+3 (depending on how many "\" characters appear in line N), and I need to write a Python script to reconstruct the original line N.
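Since the files are far too large to load whole, one approach is to stream the file line by line and buffer any line that ends in a backslash, joining it with its continuation(s). This is only a minimal sketch under my reading of the format: I'm assuming a trailing "\" always marks a split (never legitimate data), that the leading tab on the continuation line is a real field delimiter that should be kept, and the function name `rejoin_lines` and file paths are just placeholders:

```python
def rejoin_lines(infile, outfile):
    """Stream a huge TSV, merging continuation lines back into one row.

    Assumption: a physical line ending in "\" was split by the logger,
    and the next physical line (starting with a tab) continues the same
    logical row. We drop the backslash and the newline, but keep the
    tab, treating it as a genuine field delimiter.
    """
    with open(infile, "r", newline="") as fin, \
         open(outfile, "w", newline="") as fout:
        buf = ""
        for line in fin:
            line = line.rstrip("\n")
            if line.endswith("\\"):
                # Split point: drop the backslash, keep accumulating.
                buf += line[:-1]
            else:
                # Complete logical row: flush buffer plus this fragment.
                fout.write(buf + line + "\n")
                buf = ""
        if buf:
            # File ended mid-continuation; flush what we have.
            fout.write(buf + "\n")
```

Because it reads one line at a time, memory use stays tiny regardless of file size, so the 192 GB of RAM isn't even needed for this step. If a bare "\" can legitimately end a field, you'd want to additionally check that the *next* line starts with a tab before merging.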