python - Split a large text file into smaller files based on location

Question

Suppose I have a large file, file.txt, containing about 300,000 records. I want to split it based on a key location field. See file.txt below:

Line 1: U0001;POUNDS;**CAN**;1234
Line 2: U0001;POUNDS;**USA**;1234
Line 3: U0001;POUNDS;**CAN**;1234
Line 100000: U0001;POUNDS;**CAN**;1234

The locations are limited to 10-15 different countries, and I need to write each country's records to its own file. How can I do this in Python?

Thanks for the help.

score 2 · Accepted Answer

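This runs with very low memory overhead because each line is written out as soon as it is read.

Algorithm:

open the input file
read a line from the input file
get the country from the line
if it is a new country, open a file for that country
write the line to the country's file
loop if there are more lines
close the files

Code: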

with open('file.txt', 'r') as infile:
    try:
        outfiles = {}
        for line in infile:
            country = line.split(';')[2].strip('*')
            if country not in outfiles:
                outfiles[country] = open(country + '.txt', 'w')
            outfiles[country].write(line)
    finally:
        for outfile in outfiles.values():
            outfile.close()
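
As a possible variant of the same approach, the sketch below uses contextlib.ExitStack from the standard library so that every file opened along the way is closed when the block exits, without a separate try/finally. The function name split_by_country and its output_dir parameter are hypothetical, introduced only for illustration; the sketch assumes ';'-separated fields with the country code in the third field, as in the sample data.

```python
import os
from contextlib import ExitStack


def split_by_country(input_path, output_dir="."):
    """Write each line of input_path to <output_dir>/<country>.txt.

    Hypothetical helper: assumes ';'-separated fields with the
    country code in the third field. Returns the sorted country codes.
    """
    outfiles = {}
    with ExitStack() as stack:
        infile = stack.enter_context(open(input_path, "r"))
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            country = line.split(";")[2].strip("*")
            if country not in outfiles:
                path = os.path.join(output_dir, country + ".txt")
                # the stack closes this file when the with-block exits
                outfiles[country] = stack.enter_context(open(path, "w"))
            outfiles[country].write(line)
    return sorted(outfiles)
```

Calling split_by_country('file.txt') would write CAN.txt, USA.txt and so on into the current directory. Only one line is held in memory at a time, and with 10-15 countries the number of simultaneously open files stays small.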

score 0 · Accepted Answer

from itertools import groupby
from operator import itemgetter

with open("file.txt") as f:
    content = f.readlines()
# strip whitespace characters such as the trailing `\n` from each line
text = [x.strip() for x in content]

# split each line into fields and sort by country (the third field),
# because groupby only groups consecutive rows
rows = [i.split(";") for i in text]
rows.sort(key=itemgetter(2))

# write one file per country
for country, group in groupby(rows, itemgetter(2)):
    with open(country + ".txt", "w") as write_file:
        write_file.writelines("%s\n" % ";".join(fields) for fields in group)

This will group the records by your field! Hope this helps!
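
The sort step matters here because itertools.groupby only merges consecutive items with the same key. The small sketch below, using made-up rows shaped like the sample data, illustrates the difference. Without sorting, a country can appear as several separate groups, and each group would reopen that country's file in "w" mode and overwrite the earlier rows.

```python
from itertools import groupby
from operator import itemgetter

# illustrative rows shaped like the sample data (not real records)
rows = [
    ["U0001", "POUNDS", "CAN", "1234"],
    ["U0001", "POUNDS", "USA", "1234"],
    ["U0001", "POUNDS", "CAN", "1234"],
]

# grouping the unsorted rows yields CAN twice
unsorted_keys = [key for key, _ in groupby(rows, itemgetter(2))]

# sorting by the same key first yields one group per country
sorted_rows = sorted(rows, key=itemgetter(2))
sorted_keys = [key for key, _ in groupby(sorted_rows, itemgetter(2))]

print(unsorted_keys)  # ['CAN', 'USA', 'CAN']
print(sorted_keys)    # ['CAN', 'USA']
```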

score -1 · Accepted Answer

# the formatting-function for the filename used for saving
outputFileName = "{}.txt".format
# alternative:
##import time
##outputFileName = lambda loc: "{}_{}.txt".format(loc, time.asctime())

#make a dictionary indexed by location, the contained item is new content of the file for the location
sortedByLocation = {}
f = open("file.txt", "r")

#iterate each line and look at the column for the location
for l in f.readlines():
    line = l.split(';')
    #the third field (indices begin with 0) is the location-abbreviation
    # lowercase the string, because on case-insensitive filesystems files whose names differ only in case overwrite each other, while Python treats upper- and lowercase keys as distinct
    location = line[2].lower().strip()
    #get previous lines of the location and store it back
    tmp = sortedByLocation.get(location, "")
    sortedByLocation[location]=tmp+l.strip()+'\n'

f.close()

#save file for each location
for location, text in sortedByLocation.items():
    with open(outputFileName(location), "w") as f:
        f.write(text)

score -1 · Accepted Answer

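It looks like what you have is a csv file. csv stands for comma-separated values, but any file that uses a different delimiter (in this case a semicolon, ;) can be treated as a csv file.

We will use the Python csv module to read the file and then write one file per country.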

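import csv
from collections import defaultdict

# collect the rows for each country, keyed by the third field
d = defaultdict(list)
# Python 3: open in text mode with newline='' (Python 2 used 'rb'/'wb')
with open('file.txt', newline='') as f:
    r = csv.reader(f, delimiter=';')
    for line in r:
        d[line[2]].append(line)

# write one file per country
for country in d:
    with open('{}.txt'.format(country), 'w', newline='') as outfile:
        w = csv.writer(outfile, delimiter=';')
        for line in d[country]:
            w.writerow(line)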
