python - Creating a dictionary in a faster way - Python

Question

I have the following file, which contains more than 500,000 lines. The lines look like this:

0-0 0-1 1-2 1-3 2-4 3-5
0-1 0-2 1-3 2-4 3-5 4-6 5-7 6-7
0-9 1-8 2-14 3-7 5-6 4-7 5-8 6-10 7-11

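In each pair, the first number is the index of a word on line n of text a, and the second number is the index of a word on the same line n of text b. Note that a single word in text a may be linked to more than one word in text b; on the line at index 0, for example, the word at position 0 in text a is linked to the words at positions 0 and 1 in text b. I want to extract this information from the lines above so that I can easily look up which word in text a is linked to which word in text b. My idea was to use a dictionary, as in the following code: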

#suppose that I have opened the file as f
for line in f.readlines():
    #I create a dictionary to save my results
    dict_st=dict()
    #I split the line so to get items like '0-0', '0-1', etc.
    items=line.split()
    for item in items:
        #I split each item at the hyphen so to get the two digits that are now string.
        als=item.split('-')
        #I fill the dictionary
        if int(als[0]) not in dict_st:
            dict_st[int(als[0])]=[int(als[1])]
        else: dict_st[int(als[0])].append(int(als[1]))

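Once I have extracted all the word correspondences between the two texts, I print the words that are aligned with each other. This approach is slow, especially since I have to repeat it for more than 500,000 sentences. Is there a faster way to extract this information? Thank you.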

score 3 · Accepted Answer

Hi, I'm not sure this is exactly what you need.

If you need a dictionary for each line:

for line in f:
    dict_st=dict()
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, set()).add(v)
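
The per-line loop can also be wrapped in a small function. The following is a minimal sketch, not part of the original answer: the function name `parse_alignment_line` is hypothetical, and it swaps `setdefault` for `collections.defaultdict`, which avoids building a new empty `set()` for every pair.

```python
from collections import defaultdict

def parse_alignment_line(line):
    """Map each index in text a to the set of linked indices in text b."""
    links = defaultdict(set)  # missing keys start out as an empty set
    for item in line.split():
        k, v = map(int, item.split('-'))
        links[k].add(v)
    return dict(links)  # plain dict, so later lookups cannot add keys by accident

# First sample line from the question
print(parse_alignment_line("0-0 0-1 1-2 1-3 2-4 3-5"))
```

Whether this is measurably faster than `setdefault` depends on the data, so it is worth timing both on a slice of the real file.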

If you need a single dictionary for the whole file:

dict_st={}
for line in f:
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, set()).add(v)

I used a set instead of a list to prevent duplicate values. If you need those duplicates, use a list:

dict_st={}
for line in f:
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, []).append(v)
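
To make the difference between the two containers concrete, here is a small sketch on a made-up line in which the pair 0-1 occurs twice. The input string and the helper name `collect` are illustrative assumptions, not taken from the question's data.

```python
def collect(line, use_set):
    """Build the index mapping with either set or list values."""
    result = {}
    for item in line.split():
        k, v = map(int, item.split('-'))
        if use_set:
            result.setdefault(k, set()).add(v)
        else:
            result.setdefault(k, []).append(v)
    return result

line = "0-1 0-1 1-2"  # made-up input with a repeated pair
as_sets = collect(line, use_set=True)    # the repeated 0-1 is stored once
as_lists = collect(line, use_set=False)  # the repeated 0-1 is kept twice
print(as_sets)
print(as_lists)
```

A list also preserves the order in which the pairs appear on the line, which a set does not.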

Note that you can iterate over the file object directly, which avoids reading the whole file into memory the way readlines() does.
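
A sketch of that pattern, assuming the goal is one dictionary per line: a generator yields each line's dictionary as it is read, so only the current line is held in memory. The name `iter_alignments` is hypothetical, and `io.StringIO` stands in for the real file here; with an actual file you would pass the object returned by `open(...)` inside a `with` block.

```python
import io

def iter_alignments(lines):
    """Yield one {index_a: [indices_b]} dictionary per input line."""
    for line in lines:  # a file object yields its lines one at a time
        dict_st = {}
        for item in line.split():
            k, v = map(int, item.split('-'))
            dict_st.setdefault(k, []).append(v)
        yield dict_st

# io.StringIO stands in for an opened file (assumption for this sketch)
sample = io.StringIO("0-0 0-1 1-2\n0-1 0-2\n")
results = list(iter_alignments(sample))
print(results)
```

Calling `list(...)` here is only for the demonstration; looping over the generator directly keeps memory use flat on a 500,000-line file.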
