这是我的解决方案。
原则上很容易(ŁukaszW.pl 给出了它),但如果想要处理特殊情况(ŁukaszW.pl 没有),编码就不是那么容易了。
特殊情况是分隔符 ROW_DEL 被分成两个读取块(正如 I4V 指出的那样),甚至更微妙的是,如果有两个连续的 ROW_DEL 其中第二个被分成两个读取块。
由于 ROW_DEL 比任何可能的换行符 ( '\r'
, '\n'
, '\r\n'
) 都长,因此它可以在文件中被操作系统使用的换行符替换。这就是我选择自己重写文件的原因。
为此,我使用 mode 'r+'
,它不会创建新文件。
使用二进制模式也是绝对必要的'b'
。
原理是读取一个块(在现实生活中它的大小将是 262144)和x 个附加字符,其中 x是分隔符 -1 的长度。
然后检查分隔符是否存在于块的末尾+ x 字符中。
根据它是否存在,在执行 ROW_DEL 的转换之前,该块被缩短或不被缩短,并在原地重写。
裸码是:
text = ('The hospital roommate of a man infected ROW_DEL'
'with novel coronavirus (NCoV)ROW_DEL'
'—a SARS-related virus first identified ROW_DELROW_DEL'
'last year and already linked to 18 deaths—ROW_DEL'
'has contracted the illness himself, ROW_DEL'
'intensifying concerns about the ROW_DEL'
"virus's ability to spread ROW_DEL"
'from person to person.')
with open('eessaa.txt','w') as f:
f.write(text)
with open('eessaa.txt','rb') as f:
ch = f.read()
print ch.replace('ROW_DEL','ROW_DEL\n')
print '\nlength of the text : %d chars\n' % len(text)
#==========================================
from os.path import getsize
from os import fsync,linesep
def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
if chunk_length<len(sep):
print 'Length of second argument, %d , is '\
'the minimum value for the third argument'\
% len(sep)
return
x = len(sep)-1
x2 = 2*x
file_length = getsize(whichfile)
with open(whichfile,'rb+') as fR,\
open(whichfile,'rb+') as fW:
while True:
chunk = fR.read(chunk_length)
pch = fR.tell()
twelve = chunk[-x:] + fR.read(x)
ptw = fR.tell()
if sep in twelve:
pt = twelve.find(sep)
m = ("\n !! %r is "
"at position %d in twelve !!" % (sep,pt))
y = chunk[0:-x+pt].replace(sep,OSeol)
else:
pt = x
m = ''
y = chunk.replace(sep,OSeol)
pos = fW.tell()
fW.write(y)
fW.flush()
fsync(fW.fileno())
if fR.tell()<file_length:
fR.seek(-x2+pt,1)
else:
fW.truncate()
break
rewrite('eessaa.txt','ROW_DEL',14)
with open('eessaa.txt','rb') as f:
ch = f.read()
print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
print '\nlength of the text : %d chars\n' % len(ch)
为了跟踪执行,这是另一个一直打印消息的代码:
text = ('The hospital roommate of a man infected ROW_DEL'
'with novel coronavirus (NCoV)ROW_DEL'
'—a SARS-related virus first identified ROW_DELROW_DEL'
'last year and already linked to 18 deaths—ROW_DEL'
'has contracted the illness himself, ROW_DEL'
'intensifying concerns about the ROW_DEL'
"virus's ability to spread ROW_DEL"
'from person to person.')
with open('eessaa.txt','w') as f:
f.write(text)
with open('eessaa.txt','rb') as f:
ch = f.read()
print ch.replace('ROW_DEL','ROW_DEL\n')
print '\nlength of the text : %d chars\n' % len(text)
#==========================================
from os.path import getsize
from os import fsync,linesep
def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
if chunk_length<len(sep):
print 'Length of second argument, %d , is '\
'the minimum value for the third argument'\
% len(sep)
return
x = len(sep)-1
x2 = 2*x
file_length = getsize(whichfile)
with open(whichfile,'rb+') as fR,\
open(whichfile,'rb+') as fW:
while True:
chunk = fR.read(chunk_length)
pch = fR.tell()
twelve = chunk[-x:] + fR.read(x)
ptw = fR.tell()
if sep in twelve:
pt = twelve.find(sep)
m = ("\n !! %r is "
"at position %d in twelve !!" % (sep,pt))
y = chunk[0:-x+pt].replace(sep,OSeol)
else:
pt = x
m = ''
y = chunk.replace(sep,OSeol)
print ('chunk == %r %d chars\n'
' -> fR now at position %d\n'
'twelve == %r %d chars %s\n'
' -> fR now at position %d'
% (chunk ,len(chunk), pch,
twelve,len(twelve),m, ptw) )
pos = fW.tell()
fW.write(y)
fW.flush()
fsync(fW.fileno())
print (' %r %d long\n'
' has been written from position %d\n'
' => fW now at position %d'
% (y,len(y),pos,fW.tell()))
if fR.tell()<file_length:
fR.seek(-x2+pt,1)
print ' -> fR moved %d characters back to position %d'\
% (x2-pt,fR.tell())
else:
print (" => fR is at position %d == file's size\n"
' File has thoroughly been read'
% fR.tell())
fW.truncate()
break
raw_input('\npress any key to continue')
rewrite('eessaa.txt','ROW_DEL',14)
with open('eessaa.txt','rb') as f:
ch = f.read()
print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
print '\nlength of the text : %d chars\n' % len(ch)
为了检测 ROW_DEL 是否跨越两个块以及是否有两个 ROW_DEL 连续,在处理块的末端时有一些微妙之处。这就是为什么我花了很长时间来发布我的解决方案:我终于不得不写fR.seek(-x2+pt,1)
,不仅fR.seek(-2*x,1)
或者fR.seek(-x,1)
根据sep是否跨越(代码中的 2*x 是 x2,ROW_DEL x 和 x2 是 6 和 12)。对这一点感兴趣的任何人都将通过更改相应部分中的代码来检查它if 'ROW_DEL' is in twelve
。