python - How to extract the first, second and last instances of strings from a given text file based on their id?

Question

I have a text file containing strings of the following form:

66_0M100
66_1M101
66_2M102
66_3M103
66_4M103
66_5M103
67_0M100
67_1M102
67_2M105
67_3M103
67_4M106

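The number before "M" is the instance number. I have to extract the first, second and last instance of each id (the id is the first part of the string, before the underscore; here, 66 and 67). Also, if any id does not have at least 3 instances, it should be ignored.

For example, the output for ids 66 and 67 would be: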

66_0M100 (1st instance of 66)
66_1M101 (2nd instance of 66)
66_5M103 (last instance of 66)
67_0M100 (1st instance of 67)
67_1M102 (2nd instance of 67)
67_4M106 (last instance of 67)

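This output should be written to a new text file.

I tried the following code, which gives me the first and second instances, but I can't extract the last instance.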

import numpy as np
from collections import defaultdict
data = defaultdict(list)
for fileName in ["list.txt"]:
    with open(fileName,'r') as file1:
        for line in file1:
            col1,col2 = line.split("_")
            for i in np.unique(col1):
                id1,id2 = col2.split("M")
                if ((int(id1) == 0) or (int(id1) == 1)):
                    print(line)

score 1 · Accepted Answer

The key logic (skips ids with too few instances and collects the instances of all valid ids):

from collections import defaultdict


def ensure_instances(d, id_key):
    if len(d[id_key]) < 3:
        del d[id_key]   # eliminating identifiers with less than 3 instances
    else:
        # keep the first two lines and the last one
        d[id_key] = d[id_key][:2] + [d[id_key][-1]]


with open('file.txt') as f:
    d = defaultdict(list)
    prev_id = None   # refers to previous identifier
    for line in f:
        id_, rest = line.split('_')
        if prev_id and id_ != prev_id:
            ensure_instances(d, prev_id)
        d[id_].append(line)
        prev_id = id_
    if prev_id is not None:   # guard against an empty file
        ensure_instances(d, prev_id)    # check the last identifier
    print(''.join(line for l in d.values() for line in l))

Sample output:

66_0M100
66_1M101
66_5M103
67_0M100
67_1M102
67_4M106

If you need to write the output for each input file to a separate text file, open the target file (in write mode 'w') alongside the input file, e.g.:

with open('file.txt') as f, open('result.txt', 'w') as out_file:
    ...
    out_file.write(''.join(line for l in d.values() for line in l))
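For reference, the read, filter and write steps can also be combined into one self-contained function. This is only a minimal sketch, not the answer's exact code: the function name `extract_first_second_last` and its parameters are made up for illustration, and it groups lines in a dict (relying on insertion order) rather than tracking the previous id, so it does not assume that lines of one id are contiguous.

```python
from collections import defaultdict


def extract_first_second_last(in_path, out_path):
    """Write the 1st, 2nd and last line of every id that has >= 3 lines."""
    groups = defaultdict(list)  # id -> its lines, in file order
    with open(in_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            id_ = line.split('_', 1)[0]  # part before the first underscore
            groups[id_].append(line)

    selected = []
    for lines in groups.values():
        if len(lines) >= 3:  # ignore ids with fewer than 3 instances
            selected.extend(lines[:2] + [lines[-1]])

    with open(out_path, 'w') as out_file:
        out_file.writelines(s + '\n' for s in selected)
    return selected
```

Calling `extract_first_second_last('file.txt', 'result.txt')` would then produce the result file in one step and also return the selected lines as a list.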

score 1 · Accepted Answer

A simple regex, groupby and itemgetter can solve this:

from itertools import groupby
from operator import itemgetter
import re

pat = re.compile(r'^(\d\d)_')

def search_for_id(line):
    m = pat.search(line)
    return m.group(1) if m else ''

with open('list.txt') as f:
    which_ones = itemgetter(0, 1, -1)

    for id_key, group in groupby(f, search_for_id):
        items = list(group)
        if id_key and len(items) >= 3:
            selected_items = which_ones([x.strip() for x in items])
            print(selected_items)
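One caveat: `itertools.groupby` only merges consecutive items that share a key, so this approach assumes the lines of each id are contiguous in the file, as they are in the sample. The sketch below shows the same idea on an in-memory list. The helper names `id_of` and `pick` and the sample data are hypothetical, and `str.partition` is swapped in for the regex so that ids of any length are handled.

```python
from itertools import groupby
from operator import itemgetter


def id_of(line):
    # Everything before the first underscore; '' if there is no underscore
    head, sep, _ = line.partition('_')
    return head if sep else ''


def pick(lines):
    which_ones = itemgetter(0, 1, -1)  # first, second and last item
    result = []
    for id_key, group in groupby(lines, key=id_of):
        items = [x.strip() for x in group]
        if id_key and len(items) >= 3:
            result.extend(which_ones(items))
    return result


sample = ['66_0M100', '66_1M101', '66_2M102', '66_3M103',
          '67_0M100', '67_1M102',            # only 2 instances: skipped
          '68_0M100', '68_1M101', '68_2M109']
print(pick(sample))
# ['66_0M100', '66_1M101', '66_3M103', '68_0M100', '68_1M101', '68_2M109']
```

If the input might not be grouped by id, sorting the lines by the same key before calling `groupby` is the usual workaround.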

score 0 · Accepted Answer

You can try:

from collections import defaultdict

data = defaultdict(list)

for fileName in ["list.txt"]:
    with open(fileName,'r') as file1:
        for line in file1:
            id_, extra = line.split("_")
            instance_no = int(extra.split('M')[0])  # numeric, so 10 sorts after 2
            data[id_].append((instance_no, line.strip()))

for id_, values in data.items():
    instances_in_order = sorted(values)
    if len(values) >= 3:
        print(f'{instances_in_order[0][1]} (1st instance of {id_})')
        print(f'{instances_in_order[1][1]} (2nd instance of {id_})')
        print(f'{instances_in_order[-1][1]} (last instance of {id_})')

Output:

66_0M100 (1st instance of 66)
66_1M101 (2nd instance of 66)
66_5M103 (last instance of 66)
67_0M100 (1st instance of 67)
67_1M102 (2nd instance of 67)
67_4M106 (last instance of 67)

score 0 · Accepted Answer

Try this:

from collections import defaultdict
data = defaultdict(list)

with open('filename.txt') as file1:
    for line in file1:
        id_ = line[:2]  # first 2 characters
        instance = line.split('M')[0].split('_')[1]  # part between _ and M
        data[id_].append(instance)

for id_, strings in data.items():
    if len(strings) < 3:
        continue  # ignore ids with fewer than 3 instances
    print('ID: ' + id_)
    print(strings[0]) # first instance
    print(strings[1]) # second instance
    print(strings[-1]) # last instance

score 0 · Accepted Answer

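NumPy is overkill; you can do this with a simple dictionary. If you need extra structure, use a data frame with four columns, but you would merely be replicating the work of a dict.

Read a line and extract the fields you need, just as you are doing now. The dict key is the col1 value. For the data handling, let me code from that point: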

data = {}
...
label = line.strip()
ID = label.split('_')[0]
if ID in data:
    seen = len(data[ID])
    # If we've already seen 3 instances, replace the last one;
    #   otherwise, just append the new sighting
    if seen == 3:
        data[ID][-1] = label
    else:
        data[ID].append(label)
# New ID; store the first value
else:
    data[ID] = [label]

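You now have a dictionary keyed by ID. Each value holds the first, second, and most recent sighting. Write it to a file as needed. This can be done with more efficient code (smoothing out the new-entry logic), but this should give you a good understanding of the mechanism.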
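As a rough illustration of the "smoother" variant mentioned above, the new-entry branch can be folded away with `dict.setdefault`. This is only a sketch: the function name `first_second_last` is hypothetical, and the final filtering step is an addition here to honour the question's requirement that ids with fewer than 3 instances be ignored.

```python
def first_second_last(lines):
    """Return {id: [first, second, last]} for ids seen at least 3 times."""
    data = {}
    for line in lines:
        label = line.strip()
        if not label:
            continue  # skip blank lines
        ID = label.split('_')[0]
        seen = data.setdefault(ID, [])  # creates the list for a new ID
        if len(seen) == 3:
            seen[-1] = label    # overwrite the "last" slot with the newest line
        else:
            seen.append(label)
    # Drop ids with fewer than 3 sightings
    return {k: v for k, v in data.items() if len(v) == 3}
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly, and the returned lists can then be written out one line at a time.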
