python - Get the character index from a word index in a text

Question

Given the index of a word in a text, I need to get the corresponding character index. For example, in the text below:

"The cat called other cats."

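The index of the word "cat" is 1. I need the index of the first character of cat, i.e. c, which would be 4. I don't know if it's relevant, but I'm using python-nltk to get the words. Right now the only approach I can think of is: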

 - Get the first character, find the number of words in this piece of text
 - Get the first two characters, find the number of words in this piece of text
 - Get the first three characters, find the number of words in this piece of text
 Repeat until we get to the required word.

But this would be very inefficient. Any ideas would be appreciated.

score 1 · Accepted Answer

You can use a dict here:

>>> import re
>>> r = re.compile(r'\w+')
>>> text = "The cat called other cats."
>>> dic = { i :(m.start(0), m.group(0)) for i, m in enumerate(r.finditer(text))}
>>> dic
{0: (0, 'The'), 1: (4, 'cat'), 2: (8, 'called'), 3: (15, 'other'), 4: (21, 'cats')}
>>> def char_index(char, word_ind):
...     start, word = dic[word_ind]
...     ind = word.find(char)
...     if ind != -1:
...         return start + ind
...
>>> char_index('c',1)
4
>>> char_index('c',2)
8
>>> char_index('c',3)    # 'other' contains no 'c', so None is returned
>>> char_index('c',4)
21
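
If only the start of a single word is needed, the lookup can be done without building the whole dict. The following is a minimal sketch under the same assumption as above, namely that a word is a run of `\w` characters; `word_start` and `WORD_RE` are names made up for this illustration, and `word_index` is assumed to be non-negative.

```python
import re
from itertools import islice

WORD_RE = re.compile(r"\w+")  # same word pattern as in the answer above


def word_start(text, word_index):
    """Return the character offset of the word at word_index, or None."""
    matches = WORD_RE.finditer(text)
    # islice skips the first word_index matches lazily, so no table is built
    match = next(islice(matches, word_index, None), None)
    return match.start() if match else None


text = "The cat called other cats."
print(word_start(text, 1))  # 4
print(word_start(text, 4))  # 21
print(word_start(text, 9))  # None, the text has only five words
```

Because `finditer` is lazy, scanning stops as soon as the requested word is reached.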

score 0 · Accepted Answer

import re
def char_index(sentence, word_index):
    # Parentheses keep the separators in the result;
    # \s+ treats a run of whitespace as a single separator
    parts = re.split(r'(\s+)', sentence)
    return len(''.join(parts[:word_index*2]))

>>> s = 'The die has been cast'
>>> char_index(s,3)    #'been' has index 3 in the list of words
12
>>> s[12]
'b'

score 0 · Accepted Answer

Use enumerate():

>>> def obt(phrase, indx):
...     word = phrase.split()[indx]
...     e = list(enumerate(phrase))
...     for i, j in e:
...         if j == word[0] and ''.join(x for y, x in e[i:i+len(word)]) == word:
...             return i
...
>>> obt("The cat called other cats.", 1)
4
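
When the words come from a separate tokenizer (the question mentions python-nltk), one way to recover offsets is to walk the tokens in order and search for each one only past the end of the previous match, so a repeated word maps to its own occurrence. This is a rough sketch, not part of the answer above: `token_offsets` is a hypothetical helper, the token list is written by hand, and it assumes every token appears verbatim in the text in the same order.

```python
def token_offsets(text, tokens):
    """Map each token to the offset of its first character in text."""
    offsets = []
    cursor = 0  # position just past the previously matched token
    for token in tokens:
        start = text.find(token, cursor)  # search only past the previous token
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        offsets.append(start)
        cursor = start + len(token)
    return offsets


text = "The cat called other cats."
tokens = ["The", "cat", "called", "other", "cats", "."]  # hand-written token list
print(token_offsets(text, tokens))  # [0, 4, 8, 15, 21, 25]
```

The character index for word `i` is then `token_offsets(text, tokens)[i]`. A tokenizer that rewrites characters instead of copying them from the text would break the verbatim assumption and need extra handling.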
