python - How to extract matched strings into a defaultdict(set)?

Question

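I have a text file containing lines like the one below, in which an English sentence is followed by a Spanish sentence and then the alignment table for the equivalent translation, each separated by " {##} ". (In case you recognize it, this is the output of giza-pp.)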

you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22

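The translation table is read as follows: 0-0 0-1 means that the 0th word in the English sentence (i.e. you) matches the 0th and 1st words in the Spanish sentence (i.e. sus señorías).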
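As an illustration of how those pairs are read, here is a minimal sketch that turns an alignment string into a mapping from source word index to target word indexes. The helper name `parse_alignment` is hypothetical and not part of giza-pp; it assumes well-formed `src-trg` pairs separated by whitespace.

```python
from collections import defaultdict

def parse_alignment(table):
    """Map each source word index to the list of target word indexes."""
    mapping = defaultdict(list)
    for pair in table.split():
        src, trg = pair.split("-")
        mapping[int(src)].append(int(trg))
    return mapping

alignment = parse_alignment("0-0 0-1 1-2 2-3")
print(alignment[0])  # [0, 1]  (word 0 aligns to target words 0 and 1)
print(alignment[1])  # [2]
```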

Say I want to know the Spanish translation of course in this sentence. Normally I would do it like this:

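from collections import defaultdict

# x is one line of the file, as shown above
eng, spa, trans = x.split(" {##} ")
tt = defaultdict(set)
for s, t in [i.split("-") for i in trans.split(" ")]:
    tt[int(s)].add(int(t))  # ints, so they can index the word lists

query = 'course'
spa_words = spa.split(" ")
# look up the word position of the query, not its character offset
for i in sorted(tt[eng.split(" ").index(query)]):
    print(spa_words[i])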

Is there a simpler way to do the above? Could a regex do it? line.find()?

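After some attempts, I ended up with the following, which also covers a number of other issues such as MWEs (multi-word expressions) and missing translations: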

def getTranslation(gizaline,query):
    src, trg, trans =  gizaline.split(" {##} ")
    tt = defaultdict(set)
    for s,t in [i.split("-") for i in trans.split(" ")]:
        tt[int(s)].add(int(t))
    try:
        query_translated =[trg.split(" ")[i] for i in tt[src.split(" ").index(query)]]
    except ValueError:
        for i in src.split(" "):
            if "-" + query in i or query + "-" in i:  # query is part of a hyphenated token
                query = i
                break
        query_translated =[trg.split(" ")[i] for i in tt[src.split(" ").index(query)]]

    if len(query_translated) > 0:
        return ":".join(query_translated)
    else:
        return "#NULL"

score 2 · Accepted Answer

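Your approach works fine, but I would do it slightly differently, using a list instead of a set so that the words stay in the right order (a set is unordered, so the words will not necessarily come out in sentence order, which is not what we want):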

File: q_15125575.py

#-*- encoding: utf8 -*-
from collections import defaultdict

INPUT = """you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22"""

if __name__ == "__main__":
    english, spanish, trans = INPUT.split(" {##} ")
    eng_words = english.split(' ')
    spa_words = spanish.split(' ')
    transtable = defaultdict(list)
    for e, s in [i.split('-') for i in trans.split(' ')]:
        transtable[eng_words[int(e)]].append(spa_words[int(s)])

    print(transtable['course'])
    print(transtable['you'])
    print(" ".join(transtable['course']))
    print(" ".join(transtable['you']))

Output (on Python 2; Python 3 would show 'señorías' unescaped in the second line):
['curso']
['sus', 'se\xc3\xb1or\xc3\xadas']
curso
sus señorías

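The code is slightly longer because I use the actual words rather than the indexes, but this lets you look up translations directly in transtable.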

However, both your approach and mine fail on the same problem: repeated words.
print(" ".join(transtable['this']))
gives:
el este
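At least the translations come out in the order the alignment pairs are listed (here, the order in which 'this' occurs), so it is workable. Want the translation of the first occurrence of 'this'?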
transtable['this'][0] will give you the first word.
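One possible way around the repeated-word problem is to keep one list of translations per occurrence of each English word, so that the two occurrences of 'this' stay separate. The following is only a sketch under that assumption; the function name `translations_by_occurrence` is made up for illustration, and it expects the same " {##} "-separated format as INPUT above.

```python
from collections import defaultdict

LINE = ("you have requested a debate on this subject in the course of the "
        "next few days , during this part-session . {##} sus señorías han "
        "solicitado un debate sobre el tema para los próximos días , en el "
        "curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 "
        "5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 "
        "11-17 18-18 17-19 19-21 20-22")

def translations_by_occurrence(line, sep=" {##} "):
    """Return {english word: [translations of 1st occurrence, of 2nd, ...]}."""
    english, spanish, trans = line.split(sep)
    eng_words = english.split(" ")
    spa_words = spanish.split(" ")

    # English word position -> Spanish word positions
    aligned = defaultdict(list)
    for pair in trans.split():
        e, s = pair.split("-")
        aligned[int(e)].append(int(s))

    # One inner list per occurrence, Spanish words in sentence order
    table = defaultdict(list)
    for pos, word in enumerate(eng_words):
        table[word].append([spa_words[i] for i in sorted(aligned.get(pos, []))])
    return table

table = translations_by_occurrence(LINE)
print(table["this"])     # [['el'], ['este']]
print(table["this"][1])  # ['este']  (second occurrence only)
```

An occurrence with no alignment pair simply gets an empty inner list, which the caller can map to something like "#NULL" if desired.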

And using your code:

tt = defaultdict(set)
for e, s in [i.split('-') for i in trans.split(' ')]:
    tt[int(e)].add(int(s))

query = 'this'
for i in tt[eng_words.index(query)]:
    print(i)

gives:
7

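Your code will only print the index for the first occurrence of the word, because list.index() stops at the first match.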
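If the index-based table is kept, one way to reach every occurrence is to scan the word list with enumerate() instead of calling index(). This is a rough sketch, not a drop-in fix; `all_target_indexes` is a hypothetical helper name, and the shortened sample line below is an assumption made up for the example.

```python
from collections import defaultdict

# Shortened, made-up sample in the same "english {##} spanish {##} pairs" format
LINE = "this and this {##} esto y este {##} 0-0 1-1 2-2"

english, spanish, trans = LINE.split(" {##} ")
eng_words = english.split(" ")

tt = defaultdict(set)
for e, s in [i.split("-") for i in trans.split(" ")]:
    tt[int(e)].add(int(s))

def all_target_indexes(query):
    """Target indexes for every occurrence of query, not just the first."""
    return [sorted(tt[pos]) for pos, word in enumerate(eng_words) if word == query]

print(all_target_indexes("this"))  # [[0], [2]]
```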
