4

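I have the following problem: I have a file of nearly 500 MB. Its text is all on a single line. The text is separated by a virtual line ending called ROW_DEL, which appears in the text like this: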

this is a line ROW_DEL and this is a line

Now I need to split this file into separate lines, so that I get a file like this:

this is a line
and this is a line

The problem is that even the Windows text editor breaks when I open the file, because it is too large.

Is it possible to split this file as described with C#, Java, or Python? What would be the best approach that does not overload my CPU?

3 Answers

1

Read this file in chunks, for example with StreamReader.ReadBlock in C#. It lets you set the maximum number of characters to read.

For each chunk you read, replace ROW_DEL with \r\n and append the result to a new file.

Remember to increase the current index by the number of characters you just read.
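The same chunked idea can be sketched in Python 3. This is only a sketch under assumptions, not the C# code the answer refers to: the function name `split_rows`, its parameters, and the UTF-8 default are made up for illustration. It also holds back the last few characters of each chunk so that a ROW_DEL cut in half by a chunk boundary is still found, and it assumes the separator cannot overlap with itself (true for ROW_DEL).

```python
def split_rows(src_path, dst_path, sep="ROW_DEL", chunk_chars=1 << 20,
               encoding="utf-8"):
    """Copy src to dst, replacing each `sep` with a newline, chunk by chunk."""
    keep = len(sep) - 1   # longest separator prefix that can dangle at a chunk end
    carry = ""            # characters held back from the previous chunk
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        while True:
            block = src.read(chunk_chars)
            data = carry + block
            if not block:
                # End of file: nothing can complete a dangling prefix any more.
                dst.write(data.replace(sep, "\n"))
                break
            # Hold back the last `keep` characters, but never cut through
            # a complete separator that ends inside that held-back region.
            cut = max(len(data) - keep, 0)
            last = data.rfind(sep)
            if last != -1 and last + len(sep) > cut:
                cut = last + len(sep)
            # Text mode translates "\n" to the platform line ending on write.
            dst.write(data[:cut].replace(sep, "\n"))
            carry = data[cut:]
```

Memory use is bounded by `chunk_chars` plus a handful of held-back characters, so the 500 MB file never has to fit in memory at once.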

answered 2013-05-16T09:28:40.383
1

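Actually, 500 MB of text is not that big; it's just that Notepad sucks. Since you are on Windows you probably don't have sed available, but at least try the naive solution in Python. I think it will work fine: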

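with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out:
    # Write '\n' rather than os.linesep: text mode already translates '\n'
    # to the platform line ending, so os.linesep would give '\r\r\n' on Windows.
    f_out.write(f_in.read().replace('ROW_DEL ', '\n'))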
answered 2013-05-16T09:39:47.150
1

Here is my solution.
In principle it is easy (ŁukaszW.pl gave it), but the coding is not so easy if you want to handle the special cases (which ŁukaszW.pl did not).

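The special cases are a separator ROW_DEL that is split across two read chunks (as I4V pointed out) and, more subtly, two consecutive ROW_DEL of which the second is split across two read chunks.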

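Since ROW_DEL is longer than any possible line ending ('\r', '\n', '\r\n'), it can be replaced inside the file itself by the line ending the operating system uses. That is why I chose to rewrite the file in place.
To do so, I use mode 'r+', which does not create a new file.
It is also absolutely necessary to use binary mode 'b'.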

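The principle is to read a chunk (in real life its size would be 262144) plus x additional characters, where x is the length of the separator minus 1.
Then check whether the separator is present in the end of the chunk plus those x characters.
Depending on whether it is present, the chunk is shortened or not before the ROW_DEL conversion is performed, and it is then rewritten in place.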

The bare code is:

text = ('The hospital roommate of a man infected ROW_DEL'
        'with novel coronavirus (NCoV)ROW_DEL'
        '—a SARS-related virus first identified ROW_DELROW_DEL'
        'last year and already linked to 18 deaths—ROW_DEL'
        'has contracted the illness himself, ROW_DEL'
        'intensifying concerns about the ROW_DEL'
        "virus's ability to spread ROW_DEL"
        'from person to person.')

with open('eessaa.txt','w') as f:
    f.write(text)

with open('eessaa.txt','rb') as f:
    ch = f.read()
    print ch.replace('ROW_DEL','ROW_DEL\n')
    print '\nlength of the text : %d chars\n' % len(text)

#==========================================

from os.path import getsize
from os import fsync,linesep

def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
    if chunk_length<len(sep):
        print 'Length of second argument, %d , is '\
              'the minimum value for the third argument'\
              % len(sep)
        return

    x = len(sep)-1
    x2 = 2*x
    file_length = getsize(whichfile)
    with open(whichfile,'rb+') as fR,\
         open(whichfile,'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()
            twelve = chunk[-x:] + fR.read(x)
            ptw = fR.tell()

            if sep in twelve:
                pt = twelve.find(sep)
                m = ("\n   !! %r is "
                     "at position %d in twelve !!" % (sep,pt))
                y = chunk[0:-x+pt].replace(sep,OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep,OSeol)

            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())

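            if fR.tell()<file_length:
                fR.seek(-x2+pt,1)
            else:
                # End of file: what follows the part just written (the cut-off
                # end of chunk, if any, plus the look-ahead characters) has not
                # been written yet, so convert and write it before truncating.
                fR.seek(pch-x+pt)
                fW.write(fR.read().replace(sep,OSeol))
                fW.flush()
                fW.truncate()
                break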

rewrite('eessaa.txt','ROW_DEL',14)

with open('eessaa.txt','rb') as f:
    ch = f.read()
    print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
    print '\nlength of the text : %d chars\n' % len(ch)

To follow the execution, here is another version of the code that prints messages all along:

text = ('The hospital roommate of a man infected ROW_DEL'
        'with novel coronavirus (NCoV)ROW_DEL'
        '—a SARS-related virus first identified ROW_DELROW_DEL'
        'last year and already linked to 18 deaths—ROW_DEL'
        'has contracted the illness himself, ROW_DEL'
        'intensifying concerns about the ROW_DEL'
        "virus's ability to spread ROW_DEL"
        'from person to person.')

with open('eessaa.txt','w') as f:
    f.write(text)

with open('eessaa.txt','rb') as f:
    ch = f.read()
    print ch.replace('ROW_DEL','ROW_DEL\n')
    print '\nlength of the text : %d chars\n' % len(text)

#==========================================

from os.path import getsize
from os import fsync,linesep

def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
    if chunk_length<len(sep):
        print 'Length of second argument, %d , is '\
              'the minimum value for the third argument'\
              % len(sep)
        return

    x = len(sep)-1
    x2 = 2*x
    file_length = getsize(whichfile)
    with open(whichfile,'rb+') as fR,\
         open(whichfile,'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()
            twelve = chunk[-x:] + fR.read(x)
            ptw = fR.tell()

            if sep in twelve:
                pt = twelve.find(sep)
                m = ("\n   !! %r is "
                     "at position %d in twelve !!" % (sep,pt))
                y = chunk[0:-x+pt].replace(sep,OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep,OSeol)
            print ('chunk  == %r   %d chars\n'
                   ' -> fR now at position  %d\n'
                   'twelve == %r   %d chars   %s\n'
                   ' -> fR now at position  %d'
                   % (chunk ,len(chunk),      pch,
                      twelve,len(twelve),m,   ptw) )

            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())
            print ('          %r   %d long\n'
                   ' has been written from position %d\n'
                   ' => fW now at position  %d'
                   % (y,len(y),pos,fW.tell()))

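            if fR.tell()<file_length:
                fR.seek(-x2+pt,1)
                print ' -> fR moved %d characters back to position %d'\
                       % (x2-pt,fR.tell())
            else:
                print (" => fR is at position %d == file's size\n"
                       '    File has thoroughly been read'
                       % fR.tell())
                # End of file: what follows the part just written (the cut-off
                # end of chunk, if any, plus the look-ahead characters) has not
                # been written yet, so convert and write it before truncating.
                fR.seek(pch-x+pt)
                fW.write(fR.read().replace(sep,OSeol))
                fW.flush()
                fW.truncate()
                break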

            raw_input('\npress any key to continue')


rewrite('eessaa.txt','ROW_DEL',14)

with open('eessaa.txt','rb') as f:
    ch = f.read()
    print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
    print '\nlength of the text : %d chars\n' % len(ch)

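Detecting whether ROW_DEL straddles two chunks, and whether two ROW_DEL are contiguous, requires some subtlety in how the ends of the chunks are handled. That is why it took me a long time to post my solution: I finally had to write fR.seek(-x2+pt,1), and not simply fR.seek(-2*x,1) or fR.seek(-x,1) depending on whether sep straddles or not (2*x is x2 in the code; with ROW_DEL, x and x2 are 6 and 12). Anyone interested in this point can examine it by changing the code in the sections that depend on whether 'ROW_DEL' is in twelve.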
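For readers on Python 3, the seek rule can be restated with absolute positions: after each chunk, the reader goes back to the first character that has not been converted yet. The following is a minimal sketch of that idea, not a port verified against the code above. The name `rewrite_in_place` and its parameters are made up, it assumes the separator cannot overlap with itself, and it has not been checked on every platform (opening the same file twice may behave differently on Windows).

```python
import os


def rewrite_in_place(path, sep=b"ROW_DEL", eol=os.linesep.encode("ascii"),
                     chunk_length=262144):
    """Replace each `sep` in the file with `eol` without creating a second file."""
    if len(eol) > len(sep):
        raise ValueError("eol must not be longer than sep, or the writer "
                         "would overtake the reader")
    if chunk_length < len(sep):
        raise ValueError("chunk_length must be at least len(sep)")
    x = len(sep) - 1                          # look-ahead size
    with open(path, "rb") as reader, open(path, "rb+") as writer:
        while True:
            start = reader.tell()
            chunk = reader.read(chunk_length)
            if not chunk:                     # nothing left to convert
                break
            data = chunk + reader.read(x)     # chunk plus x look-ahead bytes
            # Only a separator starting in the last x bytes of chunk can
            # straddle the chunk end; search for it from there.
            hit = data.find(sep, max(len(chunk) - x, 0))
            cut = len(chunk) if hit == -1 else hit
            writer.write(data[:cut].replace(sep, eol))
            # Resume at the first unconverted byte (start of a straddling
            # separator, or the end of the chunk).
            reader.seek(start + cut)
        writer.flush()
        writer.truncate(writer.tell())        # drop the leftover old bytes
```

Because `eol` is never longer than `sep`, the writer always stays at or behind the reader, so no byte is overwritten before it has been read. Bytes read as look-ahead are never written in that pass; they are read again on the next one, which is why no special case is needed at the end of the file.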

answered 2013-05-16T15:42:24.083