python - Randomly sample files based on a condition inside the file

Question

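I have a huge list of files (20k). The first line of every file holds a unique identifier string, and nothing else. The list contains about n distinct identifiers, and each identifier has at least 500 files (though the number of files per identifier varies).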

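I need to randomly sample 500 files per identifier and copy them to another directory, so that I end up with a subset of the original list in which every identifier is represented by the same number of files.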

I know random.sample() can give me a random list, but it does not take the first-line constraint into account, and shutil.copy() can copy files...

But how do I do this (efficiently) in Python while respecting the constraint on the identifier in each file's first line?

score 3 · Accepted Answer

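Based on your description, you have to read the first line of every file in order to organize the files by identifier. I think something like this would do what you are looking for: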

import os
import collections
import random
import shutil

def get_identifier(path):
    with open(path) as fd:
        return fd.readline().strip()  # strip() drops the trailing \n from the identifier

paths = ['/home/file1', '/home/file2', '/home/file3']
destination_dir = '/tmp'
identifiers = collections.defaultdict(list)
for path in paths:
    identifier = get_identifier(path)
    identifiers[identifier].append(path)

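for identifier, group_paths in identifiers.items():
    # random.sample raises ValueError if a group has fewer than 500 files
    sample = random.sample(group_paths, 500)
    for path in sample:
        # files that share a base name overwrite each other in destination_dir
        file_name = os.path.basename(path)
        destination = os.path.join(destination_dir, file_name)
        shutil.copy(path, destination)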
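
If some identifier could end up with fewer files than the sample size, or if you want a reproducible draw, a slightly more defensive variant might look like the sketch below. The function names, the `per_identifier` and `seed` parameters, and the one-subdirectory-per-identifier layout are my own assumptions for illustration, not something from the question. The layout also assumes each identifier is a valid directory name.

```python
import collections
import os
import random
import shutil


def group_by_identifier(paths):
    """Map each first-line identifier to the list of files that carry it."""
    groups = collections.defaultdict(list)
    for path in paths:
        with open(path) as fd:
            groups[fd.readline().strip()].append(path)
    return groups


def copy_balanced_sample(paths, destination_dir, per_identifier=500, seed=None):
    """Copy up to `per_identifier` random files per identifier; return what was copied."""
    rng = random.Random(seed)  # pass a seed to make the draw reproducible
    copied = {}
    for identifier, group in group_by_identifier(paths).items():
        # min() avoids the ValueError that sample() raises for small groups
        sample = rng.sample(group, min(per_identifier, len(group)))
        # Assumed layout: one subdirectory per identifier, so equal base
        # names under different identifiers do not overwrite each other.
        target_dir = os.path.join(destination_dir, identifier)
        os.makedirs(target_dir, exist_ok=True)
        for path in sample:
            shutil.copy(path, target_dir)
        copied[identifier] = sample
    return copied
```

Calling copy_balanced_sample(paths, '/tmp', per_identifier=500) would then leave one folder per identifier under /tmp. Only the first line of each file is read, so scanning 20k files should stay cheap. Files with the same base name under the same identifier would still overwrite each other, so rename on copy if that can happen in your data.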
