6

I have a text file with more than 10 million lines. The lines look like this:

37024469;196672001;255.0000000000
37024469;196665001;396.0000000000
37024469;196664001;396.0000000000
37024469;196399002;85.0000000000
37024469;160507001;264.0000000000
37024469;160506001;264.0000000000

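As you can see, the delimiter is ";". I want to sort this text file by the second field using Python. I can't use the split function, because it causes a MemoryError. How can I handle this?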

4

3 Answers

22

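Don't sort the 10 million lines in memory. Split the work into batches instead:

  • Run 100 sorts of 100k lines each (use the file as an iterator, combined with islice() or similar, to pick a batch). Write each sorted batch out to a separate file elsewhere.

  • Merge the sorted files. Here is a merge generator that you can pass 100 open files; it yields lines in sorted order. Write them to a new file line by line: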

    import operator
    
    def mergeiter(*iterables, **kwargs):
        """Given a set of sorted iterables, yield the next value in merged order
    
        Takes an optional `key` callable to compare values by.
        """
        iterables = [iter(it) for it in iterables]
        iterables = {i: [next(it), i, it] for i, it in enumerate(iterables)}
        if 'key' not in kwargs:
            key = operator.itemgetter(0)
        else:
            key = lambda item, key=kwargs['key']: key(item[0])
    
        while True:
            value, i, it = min(iterables.values(), key=key)
            yield value
            try:
                iterables[i][0] = next(it)
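            except StopIteration:
                del iterables[i]
                if not iterables:
                    # All inputs are exhausted. Use return, not a bare raise:
                    # under PEP 479 (Python 3.7+) a StopIteration escaping a
                    # generator is turned into a RuntimeError.
                    return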
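
The first step (sorting in batches) can be sketched as follows. This is only a minimal illustration, not part of the original answer: the names `second_field` and `write_sorted_batches`, the use of `tempfile.mkstemp`, and the numeric key are all assumptions made for the example.

```python
import os
import tempfile
from itertools import islice


def second_field(line):
    # Hypothetical key: the second ';'-separated field, compared as an integer.
    return int(line.split(";", 2)[1])


def write_sorted_batches(path, batch_size=100_000, key=second_field):
    """Sort `path` in batches of `batch_size` lines.

    Returns the paths of the temporary files holding the sorted batches.
    """
    batch_paths = []
    with open(path) as src:
        while True:
            # islice pulls at most batch_size lines from the file iterator.
            batch = list(islice(src, batch_size))
            if not batch:
                break
            # Make sure a final line without a newline cannot fuse with
            # its neighbour after sorting.
            batch = [l if l.endswith("\n") else l + "\n" for l in batch]
            batch.sort(key=key)
            fd, name = tempfile.mkstemp(suffix=".txt", text=True)
            with os.fdopen(fd, "w") as out:
                out.writelines(batch)
            batch_paths.append(name)
    return batch_paths
```

Only one batch is held in memory at a time. The returned files can then be opened and passed to a merge generator such as the one above, and deleted once the merged output has been written.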
    
answered 2013-01-22T18:10:47.297
5

Based on Sorting a million 32-bit integers in 2MB of RAM using Python:

import sys
from functools import partial
from heapq import merge
from tempfile import TemporaryFile

# define sorting criteria
def second_column(line, default=float("inf")):
    try:
        return int(line.split(";", 2)[1]) # use int() for numeric sort
    except (IndexError, ValueError):
        return default # a key for non-integer or non-existent 2nd column

# sort lines in small batches, write intermediate results to temporary files
sorted_files = []
nbytes = 1 << 20 # load around nbytes bytes at a time
for lines in iter(partial(sys.stdin.readlines, nbytes), []):
    lines.sort(key=second_column) # sort current batch
    f = TemporaryFile("w+")
    f.writelines(lines)
    f.seek(0) # rewind
    sorted_files.append(f)

# merge & write the result
sys.stdout.writelines(merge(*sorted_files, key=second_column))

# clean up
for f in sorted_files:
    f.close() # temporary file is deleted when it closes

heapq.merge() has accepted a key parameter since Python 3.5. On older Python versions, you could use mergeiter() from Martijn Pieters' answer instead, or apply a Schwartzian transform:

iters = [((second_column(line), line) for line in file)
         for file in sorted_files] # note: this makes the sort unstable
sorted_lines = (line for _, line in merge(*iters))
sys.stdout.writelines(sorted_lines)
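
The transform above is unstable because lines with equal keys end up being compared as strings. One possible way around that, sketched here as an assumption rather than as part of the original answer, is to decorate each line with its file index and position as well, so ties are resolved by original order and the line itself is never compared. The name `stable_merge` is hypothetical.

```python
from heapq import merge


def stable_merge(sorted_iterables, key):
    """Merge already-sorted iterables, keeping equal-key items in input order."""

    def decorate(index, iterable):
        # (key, source index, position) is unique per item, so the tuple
        # comparison never reaches the item itself.
        for n, item in enumerate(iterable):
            yield key(item), index, n, item

    decorated = [decorate(i, it) for i, it in enumerate(sorted_iterables)]
    for *_, item in merge(*decorated):
        yield item
```

This preserves the original order of ties only if the batches were taken consecutively from the input and each batch was sorted with a stable sort, which `list.sort()` is.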

Usage:

$ python sort-k2-n.py < input.txt > output.txt
answered 2013-06-06T06:11:53.420
1

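You can do this by calling the Unix sort utility through os.system(). Because the fields are separated by ";", pass the delimiter with -t and restrict the key to the second field, compared numerically:

sort -t';' -k2,2n yourFile.txt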
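
As a rough sketch of how this might be driven from Python, the example below uses `subprocess.run` in place of `os.system()` so that no shell quoting is involved. It assumes a POSIX `sort` is available on the PATH; the function names `build_sort_command` and `sort_file` are made up for illustration.

```python
import subprocess


def build_sort_command(src, dst):
    """Build the argument list for sorting `src` numerically by field 2."""
    return [
        "sort",
        "-t", ";",      # field delimiter
        "-k", "2,2n",   # key is field 2 only, compared numerically
        "-o", dst,      # write the result to dst
        src,
    ]


def sort_file(src, dst):
    # check=True raises CalledProcessError if sort exits with a non-zero status.
    subprocess.run(build_sort_command(src, dst), check=True)
```

For example, `sort_file("yourFile.txt", "sorted.txt")` would leave the sorted lines in `sorted.txt`. The sort utility spills to temporary files on its own, so the input does not have to fit in memory.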
answered 2013-01-22T18:12:01.810