1

I've got a problem at work that requires me to import some MASSIVE tab-separated values files (think 8-15 GB .txt files) into my PostgreSQL DB, but I've run into a problem with the way the data was formatted in the first place. Unfortunately we cannot get the data in a better format, and as delivered it contains some backslashes, each of which causes a line break.

So there are lines (rows of tab-delimited data) that get chopped up into multiple lines, where the last character of line n is a \ and the first character of line n+1 is a tab. Usually line n is broken up into 1-3 additional lines (e.g. line n ends in a "\", lines n+1 and n+2 start with a tab and end with a "\", and line n+3 starts with a tab).

I need to write a script that can work with these huge files (it will run on a Linux server with 192 GB of RAM), look for the lines that begin with a tab, remove the line break before each of them (and the "\" wherever it exists), and save the text file.

To recap, the customer's logging program splits the original line N into lines n, n+1, and sometimes n+2 and n+3 (depending on how many \ characters appear in line N), and I need to write a Python script to recreate the original line N.

4

3 Answers

2
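#!/usr/bin/python

import re, sys

lastLine = None
incomplete = re.compile("\\\\+$")  # one or more trailing backslashes
indented = re.compile("^\\t")      # a leading tab marks a continuation line

for line in open(sys.argv[1]):
    # Strip only the line ending; a bare rstrip() would also drop
    # trailing tabs, i.e. empty fields at the end of a row.
    line = line.rstrip("\r\n")
    line = incomplete.sub("", line)
    if indented.match(line) and lastLine is not None:
        # Continuation: drop the leading tab and append to the pending row.
        lastLine += indented.sub("", line)
    else:
        if lastLine:
            print(lastLine)
        lastLine = line

# Flush the final row (nothing to print if the file was empty).
if lastLine is not None:
    print(lastLine)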

Basically, I ignore the trailing \ because the tab at the start of the next line tells you it's a continuation anyway.
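
The same idea can be sketched as a small generator, which is easier to try on a handful of lines before pointing it at a multi-gigabyte file. The name join_continuations and the sample rows are made up for illustration. It assumes, as the question describes, that every continuation line starts with a tab, and unlike the script above it keeps blank lines.

```python
def join_continuations(lines):
    """Yield logical rows, gluing tab-led continuation lines onto the previous row."""
    current = None
    for raw in lines:
        # Drop the line ending, then any trailing backslashes.
        line = raw.rstrip("\r\n").rstrip("\\")
        if line.startswith("\t") and current is not None:
            # Continuation: drop the leading tab and append to the row in progress.
            current += line[1:]
        else:
            if current is not None:
                yield current
            current = line
    if current is not None:
        yield current


# Hypothetical sample: one row split across three lines, then a normal row.
sample = ["a\tb\\\n", "\tc\\\n", "\td\n", "e\tf\n"]
for row in join_continuations(sample):
    print(repr(row))
```

Because it accepts any iterable of lines, an open file object can be passed in directly, so only one logical row is held in memory at a time.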

answered 2012-07-10T15:26:16.923
1

Replace the "\\\n" sequence (a backslash followed by a newline) with an empty string:

In [20]: a="blabla\tblabla\tblabla\\\n\tblabla\tblabla"

In [21]: print(a)
blabla  blabla  blabla\
    blabla  blabla

In [22]: a=a.replace('\\\n', '')

In [23]: print(a)
blabla  blabla  blabla  blabla  blabla

:)
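
For files of 8-15 GB, the same replacement can be applied one line at a time rather than on a single in-memory string. This is a minimal sketch, assuming the lines end in "\n" by the time Python sees them. The function name unwrap is hypothetical, and io.StringIO stands in for real file objects. Like the replace() call above, it leaves the leading tab of the continuation line in place.

```python
import io


def unwrap(in_f, out_f):
    """Copy in_f to out_f, deleting every backslash-newline pair."""
    for line in in_f:
        if line.endswith("\\\n"):
            # Drop the backslash and the newline so the next line is glued on.
            out_f.write(line[:-2])
        else:
            out_f.write(line)


# In-memory stand-ins for the real input and output files.
src = io.StringIO("blabla\tblabla\tblabla\\\n\tblabla\tblabla\n")
dst = io.StringIO()
unwrap(src, dst)
print(repr(dst.getvalue()))
```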

answered 2012-07-10T14:33:35.753
0

This builds on @user665637's good answer.

#!/usr/bin/python

import re, sys

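# A trailing backslash plus any whitespace after it, including the line ending.
pat_incomplete = re.compile(r'\\\s*$')
# A leading tab marks a continuation line.
pat_indented = re.compile(r'^\t')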

try:
    _, fname_in, fname_out = sys.argv
except ValueError:
    print("Usage: python line_joiner.py <input_filename> <output_filename>")
    sys.exit(1)

with open(fname_in) as in_f, open(fname_out, "w") as out_f:
    lines = iter(in_f)
    try:
        line = next(lines)
        s = pat_incomplete.sub('', line)
    except StopIteration:
        print("Input file did not contain any data")
        sys.exit(2)

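    for line in lines:
        line = pat_incomplete.sub('', line)
        if pat_indented.match(line):
            # Continuation: drop the leading tab and append to the pending row.
            s += pat_indented.sub('', line)
        else:
            # A new row starts here, so the pending one is complete.
            out_f.write(s)
            s = line
    # Flush the final row.
    out_f.write(s)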

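Changes:

  • Uses "raw strings" for the regular expressions, which makes them easier to read.

  • Takes the output filename from the command-line arguments and writes to that file. If the user supplies the wrong number of arguments, it prints a message and exits. When unpacking sys.argv to get the arguments, we follow the convention of using _ as the variable name for the part we don't care about.

  • Doesn't strip line endings, so the output file will have the same line endings as the input file. (When joining lines, of course, it strips the line ending to do the join.)

  • Doesn't filter blank lines out of the input. This is a bit tricky: by creating an iterator and calling next() on it, the script gets the first line of input before the loop starts, so s starts with a valid value rather than None, and we don't have to test it every time to decide whether to print it. The original if lastLine: test, applied to stripped input lines, not only guarded against the initial None value of lastLine but also filtered out any blank lines from the input.

  • If you must use this with Python 3.0 or Python 2.6, you can't have a single with statement that makes two open() calls; but you can turn it into two nested with statements, each making one open() call.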
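
The nested form mentioned in the last point might look like the sketch below. The helper copy_lines and the temporary file names are hypothetical; the loop body only copies lines, standing in for the joining logic above.

```python
import os
import tempfile


def copy_lines(fname_in, fname_out):
    """Hypothetical stand-in for the joining loop; it only copies lines."""
    # Two nested with statements, each making a single open() call.
    with open(fname_in) as in_f:
        with open(fname_out, "w") as out_f:
            for line in in_f:
                out_f.write(line)


# Small round trip through temporary files to show the shape of the call.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "in.txt")
dst = os.path.join(tmp_dir, "out.txt")
with open(src, "w") as f:
    f.write("a\tb\n")
copy_lines(src, dst)
with open(dst) as f:
    print(repr(f.read()))
```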

answered 2012-07-10T18:53:13.113