
I have a dictionary and a CSV file (actually tab-separated):

dict1

{1 : ['Charles', 22],
2: ['James', 36],
3: ['John', 18]}

data.csv


[ 22 | Charles goes to the cinema | Activity    ]
[ 46 | John is a butcher          | Profession  ]
[ 95 | Charles is a firefighter   | Profession  ]
[ 67 | James goes to the zoo      | Activity    ]

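I want to take the string (the name) stored as the first item of each value in dict1 and search for it in the second column of the CSV. If the name appears in a sentence, I want to print the first (and only the first) such sentence.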

But I'm having trouble with the search: how do I access the column/row data while iterating over dict1? I tried something like this:

with open('data.csv', 'r', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter='\t')
    for (id, (name, age)) in dict1.items():
        if name in reader.row[1] # reader.row[1] is wrong!!!
        print(reader.row[1])

1 Answer


Yes, roganjosh is right. A better approach is to iterate over the CSV file once and look for any of the requested names in each row.

import csv

# Names still waiting for their first matching sentence.
requested = {d[0] for d in dict1.values()}
with open('/tmp/f.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        sentence = row[1]
        # Names that occur (as substrings) in this row's sentence.
        found = {n for n in requested if n in sentence}
        for n in found:
            print(f'{n}: {sentence}')
        requested -= found
        if not requested:  # optimization: every name has been matched
            break
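
For comparison, the loop order from the question (dict1 on the outside) can be made to work too. csv.reader returns an iterator and has no row attribute, so the rows have to be read into a list before they can be scanned repeatedly. The following is only a minimal sketch under that assumption: the function name first_sentences is made up for illustration, the sample data is held in memory instead of data.csv, and it uses plain substring matching.

```python
import csv
import io

def first_sentences(names_by_id, rows):
    """Map each name to the first sentence (second column) that contains it."""
    result = {}
    for _id, (name, _age) in names_by_id.items():
        for row in rows:
            if name in row[1]:
                result[name] = row[1]
                break  # only the first matching sentence is wanted
    return result

# Sample data from the question, kept in memory for the example.
dict1 = {1: ['Charles', 22], 2: ['James', 36], 3: ['John', 18]}
sample = (
    "22\tCharles goes to the cinema\tActivity\n"
    "46\tJohn is a butcher\tProfession\n"
    "95\tCharles is a firefighter\tProfession\n"
    "67\tJames goes to the zoo\tActivity\n"
)
rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

for name, sentence in first_sentences(dict1, rows).items():
    print(f'{name}: {sentence}')
```

This rescans the rows once per name, which is why a single pass over the file, as in the code above, is usually the better choice for large files.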

EDIT: answering the question that was asked, not the one I imagined.


EDIT2: After the clarification (and some new requirements)... I hope I got it right.

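Only the sentence is printed, once per matching row. The code does not check whether the same sentence also appears in another row. You can use a set() to collect the matched sentences and print them after the CSV file has been processed.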

I used regular expressions to match whole words rather than arbitrary substrings.
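
To see why whole-word matching matters, here is a small illustration. The sentence 'Johnson is a baker' is made up for the example and is not part of the question's data.

```python
import re

sentence = 'Johnson is a baker'  # hypothetical sentence, not from data.csv
name = 'John'

# Plain substring test: also matches inside "Johnson".
substring_hit = name in sentence

# Whole-word test: \b requires a word boundary on both sides of the name.
pattern = re.compile(r'\b' + re.escape(name) + r'\b')
word_hit = pattern.search(sentence) is not None

print(substring_hit, word_hit)  # True False
```

The loop below applies the same kind of pattern to every name.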

import csv
import re

# One whole-word pattern per name; \b marks a word boundary.
requested = {re.compile(r'\b' + re.escape(d[0]) + r'\b') for d in dict1.values()}
with open('/tmp/f.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        sentence = row[1]
        found = {n for n in requested if n.search(sentence)}
        if found:
            requested -= found  # each name is reported only once
            print(sentence)
        if not requested:  # every name has been matched
            break
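
The set() idea mentioned above could look roughly like this. It is only a sketch: the function name unique_first_sentences is hypothetical, and it takes any iterable of rows (rather than an open file) so that it can be tried without data on disk.

```python
import re

def unique_first_sentences(names, rows):
    """Collect the first matching sentence for each name into a set."""
    requested = {re.compile(r'\b' + re.escape(n) + r'\b') for n in names}
    matched = set()
    for row in rows:
        sentence = row[1]
        found = {r for r in requested if r.search(sentence)}
        if found:
            requested -= found
            matched.add(sentence)  # collect instead of printing immediately
        if not requested:
            break
    return matched

# Rows from the question, already split into columns.
rows = [
    ['22', 'Charles goes to the cinema', 'Activity'],
    ['46', 'John is a butcher', 'Profession'],
    ['95', 'Charles is a firefighter', 'Profession'],
    ['67', 'James goes to the zoo', 'Activity'],
]

# A set has no defined order, so sort before printing.
for sentence in sorted(unique_first_sentences(['Charles', 'James', 'John'], rows)):
    print(sentence)
```

With a real file, the rows argument could be the csv.reader object itself.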

EDIT3: recovering the matched names (a new requirement, just like in a real development project :-P)

First, note that a single sentence can match more than one name (see len(found)).

In the previous example, you can recover the names from the compiled regular expressions (because r'\b' was added before and after each name):

found_names = [r.pattern[2:-2] for r in found]

But I don't think that is the best way.
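
One concrete weakness of slicing the pattern, shown here as a small sketch: re.escape() rewrites names that contain regex metacharacters, so the slice gives back the escaped form rather than the original. The name 'J. Smith' is a made-up example and does not come from dict1.

```python
import re

name = 'J. Smith'  # hypothetical name containing a regex metacharacter
pattern = re.compile(r'\b' + re.escape(name) + r'\b')

# Strip the leading and trailing \b, as in the slicing approach.
recovered = pattern.pattern[2:-2]

print(recovered == name)  # False: the slice contains backslash escapes
```

For plain alphabetic names such as those in the question the slice does return the original name, so this only matters for less regular data.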

A better way is to store the original name alongside the pattern in requested. I decided to use a set of tuples; set operations are very fast.

requested = {(re.compile(r'\b' + re.escape(d[0]) + r'\b'), d[0])
             for d in dict1.values()}
with open('/tmp/f.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        sentence = row[1]
        found = {(r, n) for r, n in requested if r.search(sentence)}
        if found:
            found_names = tuple(n for r, n in found)
            print(found_names, sentence)
            requested -= found
        if not requested:
            break

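Now the matched names (the original d[0] values) are in the tuple found_names. You can use it however you need, for example by turning it into a string (replace the found_names = line and the print line):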

found_names = ', '.join(n for r, n in found)
print(f'{found_names}: {sentence}')
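
To check the remark that one sentence can match several names, the same approach can be wrapped in a function and run on in-memory rows. This is a sketch only: the function name match_rows and the first sample row are made up for illustration, and the rows are passed in as lists instead of being read from a file.

```python
import re

def match_rows(names, rows):
    """Return (matched_names, sentence) pairs, reporting each name only once."""
    requested = {(re.compile(r'\b' + re.escape(n) + r'\b'), n) for n in names}
    hits = []
    for row in rows:
        sentence = row[1]
        found = {(r, n) for r, n in requested if r.search(sentence)}
        if found:
            # sorted() gives a stable order; set iteration order is arbitrary.
            hits.append((sorted(n for _r, n in found), sentence))
            requested -= found
        if not requested:
            break
    return hits

rows = [
    ['10', 'Charles and John go fishing', 'Activity'],  # made-up row
    ['67', 'James goes to the zoo', 'Activity'],
    ['46', 'John is a butcher', 'Profession'],
]

for names, sentence in match_rows(['Charles', 'James', 'John'], rows):
    print(f"{', '.join(names)}: {sentence}")
```

Because both Charles and John are consumed by the first row, the later 'John is a butcher' row is never reported.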
Answered 2019-11-17T13:29:30.657