
   JOB  REF Comment V2  Other
1   3   45  This was a small job    NULL    sdnsdf
2   4   456 This was a large job and I have to go onto a new line, 
    but I didn't properly escape so it's on the next row whoops!    NULL    NULL        
3   7   354 NULL    NULL    NULL

# dat <- readLines("the-Dirty-Tab-Delimited-File.txt")
dat <- c("\tJOB\tREF\tComment\tV2\tOther", "1\t3\t45\tThis was a small job\tNULL\tsdnsdf", 
"2\t4\t456\tThis was a large job and I have\t\t", "\t\"to go onto a new line, but I didn't properly escape so it's on the next row whoops!\"\tNULL\tNULL\t\t", 
"3\t7\t354\tNULL\tNULL\tNULL")

I know this may be impossible, but these bad line breaks occur in only one field (the 10th column). I am interested in solutions in R (preferred) or Python.

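My idea was to write a regular expression that looks for a newline after 10 and only 10 tabs. I started by using readLines and trying to remove every newline that follows a space plus a word: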

dat <- gsub("( [a-zA-Z]*)\t\n", "\\1", dat)

But it seems difficult to undo the line splitting that readLines has already done. What should I be doing?
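
One way around the readLines problem is to work on the whole file as a single string, so the newlines are still there for a regular expression to see. A minimal Python sketch follows, assuming that every genuine row starts with two integer columns followed by a tab (as in the sample above). The name `join_broken_lines` and the pattern are illustrative only, and a continuation that happens to start with two integers would be misread as a new row.

```python
import re

# Assumption: a genuine row starts with "<int><tab><int><tab>".
ROW_START = r"\d+\t\d+\t"

# A run of newlines (plus any indentation after it) that is NOT followed by
# another newline or whitespace, the start of a genuine row, or the end of the text.
BROKEN_NEWLINE = re.compile(r"[\r\n]+[ \t]*(?![\r\n \t]|" + ROW_START + r"|\Z)")

def join_broken_lines(text):
    """Replace stray line breaks inside a record with a single space."""
    return BROKEN_NEWLINE.sub(" ", text)
```

To use it on a file, read the content in one go (for example `open(path, newline="").read()`) and pass the resulting string to `join_broken_lines` before splitting it into rows.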


140338  28855   WA  2   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    1   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    1000    NULL    NULL    NULL    NULL    NULL    NULL    YNNNNNNN    (Some text with two newlines)

The remainder of the text beneath two newlines  NULL    NULL    NULL    3534a   NULL    email   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
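
For wide rows like the one above, the tab-counting idea can also be sketched in Python: keep appending physical lines to the current record until it holds as many tabs as a complete row. This is only a sketch. `merge_by_tab_count` and its `n_tabs` parameter are hypothetical names, and it assumes that every complete row has the same number of tabs and that the broken field is not the last one (otherwise the count is reached before the continuation arrives).

```python
def merge_by_tab_count(lines, n_tabs):
    """Yield logical records built from physical lines.

    Assumes every complete record contains exactly n_tabs tab characters.
    """
    parts = []  # physical lines collected for the current record
    tabs = 0    # tabs seen so far in the current record
    for line in lines:
        line = line.rstrip("\r\n")
        parts.append(line)
        tabs += line.count("\t")
        if tabs >= n_tabs:
            # Drop empty pieces (blank lines) and join the rest with a space.
            yield " ".join(p for p in parts if p)
            parts, tabs = [], 0
    # Trailing incomplete record: emit what we have rather than lose it.
    leftover = " ".join(p for p in parts if p)
    if leftover:
        yield leftover
```

The expected count could be taken from the header line, for example `n_tabs = header.count("\t")`.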

2 Answers



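datadict = {}
extra_line_numbers = []
with open("filename", "r") as data:
    for count, linedata in enumerate(data):
        datadict[count] = linedata.rstrip('\n').split('\t')

last = 0  # key of the most recent complete row
for count, x in datadict.items():
    if count == 0 or (len(x) > 1 and x[1].isdigit()):  # header, or item #2 is a number
        last = count
    else:  # continuation line: glue it onto the last field of the previous row
        rest = x[1:] if x[0] == '' else x  # drop the empty field left by a leading tab
        datadict[last][-1] += ''.join(rest[:1])
        datadict[last].extend(rest[1:])
        extra_line_numbers.append(count)

for x in extra_line_numbers:
    del datadict[x]

with open("newfile", 'w') as data:
    data.writelines(['\t'.join(x) + '\n' for x in datadict.values()])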
answered 2013-10-30T06:55:11.677

Here is my answer in Python.

import re

# This pattern should match correct data lines and should not
# match "continuation" lines (lines added by the unquoted newline).
# This pattern means: start of line, then a number, then white space,
# then another number, then more white space, then another number.

# This program won't work right if this pattern isn't correct.
pat = re.compile(r"^\d+\s+\d+\s+\d+")

def collect_lines(iterable):
    itr = iter(iterable)  # get an iterator

    # First, loop until we find a valid line.
    # This will skip the first line with the "header" info.
    line = next(itr)
    while True:
        line = next(itr)
        if pat.match(line):
            # found a valid line; hold it as cur
            cur = line
            break
    for line in itr:
        # Look at the line after cur.  Is it a valid line?
        if pat.match(line):
            # Line after cur is valid!
            yield cur  # output cur
            cur = line  # hold new line as new cur
        else:
            # Line after cur is not valid; append to cur but do not output yet.
            cur = cur.rstrip('\r\n') + line
    yield cur

data = """\
   JOB  REF Comment V2  Other
@@@1   3   45  This was a small job    NULL    sdnsdf
@@@2   4   456 This was a large job and I have to go onto a new line, 
@@@    but I didn't properly escape so it's on the next row whoops!    NULL    NULL        
@@@3   7   354 NULL    NULL    NULL
"""

lines = data.split('@@@')
for line in collect_lines(lines):
    print(line, end='')


with open("filename", "rt") as f:
    for line in collect_lines(f):
        # do something with each line
        pass
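
As one possible way to "do something with each line", the merged records could be written back out with Python's csv module, which quotes any field that still contains a tab, a quote character, or a line break, so the file can be parsed reliably afterwards. This is a sketch, assuming each record is a tab-delimited string; `write_records` is a hypothetical helper name.

```python
import csv
import io

def write_records(records, out):
    """Write tab-delimited record strings to the file-like object `out`."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for record in records:
        # Strip only the trailing line ending, then split into fields.
        fields = record.rstrip("\r\n").split("\t")
        writer.writerow(fields)

# Small demo using an in-memory buffer instead of a real file.
buf = io.StringIO()
write_records(["1\t3\t45\tThis was a small job\tNULL\tsdnsdf\n"], buf)
```

When writing to a real file, opening it with `open("newfile", "w", newline="")` lets the csv module control the line endings itself.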




answered 2013-10-30T06:18:57.747