python - Split one file into multiple files based on a pattern (the cut can occur within a line)

Question

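Many solutions exist, but the particularity here is that I need to be able to split within a line: the cut should occur right before the pattern. Example:

File: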

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla><?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla><?xml 2><blabla><blabla>

With the pattern <?xml, this should become:

Output file 1:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla>

Output file 2:

<?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla>

Output file 3:

<?xml 2><blabla><blabla>

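In fact, the script from the validated Perl answer here works for my small example, but it produces an error on my much larger (about 6 GB) real file. The error is: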

panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.

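I don't have the privilege to comment, which is why I am starting a new post. Finally, a Python solution would be more appreciated, since I understand it better.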

score 13 · Accepted Answer

This performs the split without reading everything into RAM:

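def files():
    # Yield a fresh, sequentially numbered output file on each next() call.
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')


pat = '<?xml'
fs = files()
outfile = next(fs)

with open(filename) as infile:  # filename: path of the input file
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            items = line.split(pat)
            outfile.write(items[0])  # text before the first pattern on this line
            for item in items[1:]:
                outfile.close()  # finish the previous part before starting the next
                outfile = next(fs)
                outfile.write(pat + item)

outfile.close()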

One caveat: if your pattern spans multiple lines (i.e. contains "\n"), this will not work. If that is the case, consider the mmap solution.
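For a pattern that can straddle a line break, one option besides mmap is to read fixed-size chunks and carry the last len(pattern) - 1 bytes over to the next read, so a match cut in half by a chunk boundary is still found. The following is a minimal sketch under that assumption and is not part of the original answer; split_stream, its parameters, and the N.part naming are hypothetical.

```python
import os


def split_stream(infile, pattern, out_dir, chunk_size=1 << 20):
    """Split a binary stream before each occurrence of `pattern` (bytes).

    Hypothetical helper: writes 1.part, 2.part, ... into out_dir and
    returns the list of paths. Matches are non-overlapping, like str.split.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    keep = len(pattern) - 1  # longest tail that could be the start of a match
    paths = []
    out = None
    buf = b""

    def start_part():
        # Close the current part (if any) and open the next numbered one.
        nonlocal out
        if out is not None:
            out.close()
        path = os.path.join(out_dir, "%d.part" % (len(paths) + 1))
        paths.append(path)
        out = open(path, "wb")

    def emit(data):
        # Data seen before the first match still needs a file to land in.
        if data:
            if out is None:
                start_part()
            out.write(data)

    try:
        while True:
            chunk = infile.read(chunk_size)
            buf += chunk
            pos = 0
            while True:
                i = buf.find(pattern, pos)
                if i < 0:
                    break
                emit(buf[pos:i])      # text before the match stays in the old part
                start_part()          # the match itself opens a new part
                out.write(pattern)
                pos = i + len(pattern)
            if not chunk:             # end of input: flush whatever is left
                emit(buf[pos:])
                break
            tail = max(pos, len(buf) - keep)
            emit(buf[pos:tail])
            buf = buf[tail:]          # carry a possible partial match forward
    finally:
        if out is not None:
            out.close()
    return paths
```

It would be called with a file opened in binary mode, for example split_stream(open(path, "rb"), b"<?xml", "/output/dir"). Unlike the line-based loop, it does not create an empty first part when the input already begins with the pattern.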

score 7 · Accepted Answer

Perl can parse large files line by line instead of slurping the whole file into memory. Here is a short script (with explanation):

perl -n -E 'if (/(.*)(<\?xml.*)/ ) {
   print $fh $1 if $1;
   open $fh, ">output." . ++$i;
   print $fh $2;
} else { print $fh $_ }'  in.txt

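perl -n: the -n flag loops over your file line by line (setting each line's content to $_).

-E: execute the following text as the program (by default Perl expects a filename); it behaves like -e, but also enables optional features.

if (/(.*)(<\?xml.*)/): if a line matches <?xml, split the line (using the regex captures) into $1 and $2.

print $fh $1 if $1: print the start of the line to the old file.

open $fh, ">output." . ++$i;: create a new file handle for writing.

print $fh $2: print the rest of the line to the new file.

} else { print $fh $_ }: if the line does not match <?xml, just print it to the current file handle.

Note: this script assumes your input file starts with <?xml.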
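One detail of the regex is worth seeing in isolation: (.*) is greedy, so on a line with several <?xml markers it runs up to the last one, and only that last marker starts a new file. The sketch below reproduces this with Python's re module on a made-up sample line and contrasts it with a lookahead split that cuts before every marker. The sample line and variable names are illustrative only.

```python
import re

line = "<a><?xml 1><b><?xml 2><c>\n"

# Same pattern as in the Perl one-liner: the greedy (.*) extends to the
# LAST "<?xml" on the line, so the first marker ends up in the head.
m = re.match(r"(.*)(<\?xml.*)", line)
head, tail = m.group(1), m.group(2)

# Zero-width lookahead: cut before EVERY "<?xml".
# Splitting on empty matches requires Python 3.7 or later.
pieces = re.split(r"(?=<\?xml)", line)

print(repr(head))
print(repr(tail))
print(pieces)
```

Since "." does not match a newline here, tail stops before the trailing "\n", whereas the last element of pieces keeps it.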

score 5 · Accepted Answer

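For a file of this size, you may want to use the mmap module, so that you don't have to handle chunking the file yourself. From the documentation there: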

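Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they're mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.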

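Here is a quick example that shows you how to find each occurrence of <?xml #> in the file. You can write the chunks to new files as you go, but I have not written that part.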

import mmap
import re

# a regex to match the "xml" nodes (a bytes pattern, since mmap exposes bytes)
r = re.compile(rb'<\?xml\s\d+>')

with open('so.txt', 'r+b') as f:
    mp = mmap.mmap(f.fileno(), 0)
    for m in r.finditer(mp):
        # here you can start collecting the starting positions and
        # writing chunks to new files
        print(m.start())
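The part the answer leaves out, writing the chunks, could look roughly like the sketch below. It is an assumption-laden illustration rather than the answerer's code: split_mmap, its parameters, and the N.part naming are hypothetical, and the default pattern is the plain <?xml marker from the question rather than the stricter regex above. It collects the match offsets first, then copies each span in blocks so that a very large part is never materialized as one bytes object.

```python
import mmap
import os
import re


def split_mmap(src_path, out_dir, pattern=rb"<\?xml", block=1 << 20):
    """Write one file per span that starts at a pattern match.

    Hypothetical helper. Note that mmap raises ValueError for an empty
    file, and mapping a very large file needs a 64-bit address space.
    """
    paths = []
    with open(src_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mp:
            # Offsets at which each new part begins.
            starts = [m.start() for m in re.finditer(pattern, mp)]
            # Content before the first match, if any, becomes its own part.
            if not starts or starts[0] != 0:
                starts.insert(0, 0)
            bounds = starts + [len(mp)]
            for n, (lo, hi) in enumerate(zip(bounds, bounds[1:]), 1):
                path = os.path.join(out_dir, "%d.part" % n)
                with open(path, "wb") as out:
                    # Copy in blocks to keep memory use bounded.
                    for off in range(lo, hi, block):
                        out.write(mp[off:min(off + block, hi)])
                paths.append(path)
    return paths
```

Because the regex runs over the whole mapping, a pattern containing "\n" is handled the same way as any other.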

score 0 · Accepted Answer

Just split on your search term:

for i, part in enumerate(my_xml_Text_string.split("<?xml")):
    if not part.strip():
        continue  # make sure it's not empty
    with open("file%d.xml" % i, "w") as f:  # open a file to write to
        f.write("<?xml" + part)  # write the content, putting your search term back in
