python - Creating a book index from a text file in Python

Question

I have a text file that might look something like this...

3:degree
54:connected
93:adjacent
54:vertex
19:edge
64:neighbor
72:path
55:shortest path
127:tree
3:degree
55:graph
64:adjacent   and so on....

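I want my function to read each line of the text and split it at the colon into a dictionary, with the word in the "key" position and the page number in the "value" position. I then have to build a new dictionary and scan through each word: if the word is already in the dictionary, I just append the page number to its entry; if it isn't, I add it to the dictionary.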

Here is my idea so far...

def index(fileName):

    inFile=open(fileName,'r')
    index={}
    for line in inFile:
        line=line.strip()      #This will get rid of my new line character
        word=line[1]
        if word not in index:
            index[word]=[]
            index[word].append(line)
    return index

fileName='terms.txt'

print(index(fileName))

I think I'm on the right track, but I need a little help getting started.

score 0 · Accepted Answer

You can use str.split to separate a string into tokens. In your case, the delimiter is a colon (:).

records = """3:degree
     54:connected
     93:adjacent
     54:vertex"""
index = {}
for line in records.split('\n'):
    page, word = line.split(':')
    index[word] = int(page.strip())

index
# {'vertex': 54, 'connected': 54, 'adjacent': 93, 'degree': 3}

At some point you will need to handle words that have more than one page reference. For that, I suggest creating a collections.defaultdict with list as the default value:

from collections import defaultdict
index = defaultdict(list)
index[word].append(page)  # add reference to this page
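
Putting the two pieces together, a minimal sketch might look like the following. The sample records string and the function name build_index are assumptions made for illustration; they are not taken from the question.

```python
from collections import defaultdict

# Hypothetical sample data; 'degree' and 'adjacent' each appear twice.
records = """3:degree
54:connected
93:adjacent
3:degree
64:adjacent"""

def build_index(text):
    """Map each word to the list of page numbers it is listed under."""
    index = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        page, word = line.split(':', 1)
        index[word.strip()].append(int(page))
    return index

index = build_index(records)
print(dict(index))
```

Note that a list keeps duplicates, so a word that is listed twice for the same page ends up with that page number twice.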

score 0 · Accepted Answer

Change the lines I have marked with # edit

def index(fileName):
    inFile=open(fileName,'r')
    index={}
    for line in inFile:
        line=line.strip().split(':',1) # edit
        page,word=line # edit (the page number comes first on each line)
        if word not in index:
            index[word]=[]
        index[word].append(page) # edit
    inFile.close()
    return index

score 0 · Accepted Answer

You are not splitting the line; you are just taking the character at position 1.

Use .split(':', 1) to split the line once on the colon:

def index(filename):
    with open(filename) as infile:
        index = {}
        for line in infile:
            page, word = map(str.strip, line.split(':', 1))
            index.setdefault(word, []).append(int(page))
        return index
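
To make the difference concrete, the short sketch below compares indexing a line with splitting it. The sample line is taken from the question's data; the variable names are only illustrative.

```python
line = "54:connected\n"

# Indexing a string returns one character, not a field.
second_char = line[1]
print(repr(second_char))

# Splitting once on the colon yields the two fields;
# str.strip removes the surrounding whitespace and the newline.
page, word = map(str.strip, line.split(':', 1))
print(int(page), word)
```

The second argument to split caps the number of splits at one, so any later colon on the line would stay part of the word.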

You may want to use a set to avoid adding the same page number twice. You can also use collections.defaultdict to simplify this further:

from collections import defaultdict

def index(filename):
    with open(filename) as infile:
        index = defaultdict(set)
        for line in infile:
            page, word = map(str.strip, line.split(':', 1))
            index[word].add(int(page))
        return index

This gives:

defaultdict(<type 'set'>, {'neighbor': set([64]), 'degree': set([3]), 'tree': set([127]), 'vertex': set([54]), 'shortest path': set([55]), 'edge': set([19]), 'connected': set([54]), 'adjacent': set([64, 93]), 'graph': set([55]), 'path': set([72])})

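for your input text; a defaultdict is a subclass of dict that behaves just like a normal dictionary, except that it creates a new set for every key you try to access that is not present yet.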
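
As a rough illustration of that behaviour, the sketch below shows a missing key being created on access, and then prints the entries in a book-index style. The final formatting loop is an assumption about how the index might be displayed; it is not part of the answer above.

```python
from collections import defaultdict

index = defaultdict(set)

# A membership test does not create the key...
assert 'tree' not in index

# ...but accessing a missing key calls set() and stores the result.
index['tree'].add(127)
index['adjacent'].add(93)
index['adjacent'].add(64)
index['adjacent'].add(93)  # duplicate page, ignored by the set

# Hypothetical presentation step: words alphabetically, pages ascending.
entries = []
for word in sorted(index):
    pages = ', '.join(str(p) for p in sorted(index[word]))
    entries.append('{}: {}'.format(word, pages))

print('\n'.join(entries))
```

Sorting happens only at output time, because sets themselves are unordered.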
