
I have a Python string and a substring of selected text. For example, the string could be

stringy = "the bee buzzed loudly"

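I want to select the text "bee buzzed" in this string. I have the character offsets for this particular string, i.e. 4-14, since those are the character-level indices that bound the selected text.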

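What is the simplest way to convert these to word-level indices, i.e. 1-2, since the second and third words are being selected? I have many strings labeled like this, and I would like to convert the indices simply and efficiently. The data is currently stored in a dictionary like this: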

data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}

I would like to convert it to this form:

data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}

Thanks!

3 Answers

Here is a simple list-indexing approach:

# set up data
string  = "the bee buzzed loudly"
words = string[4:14].split(" ") #get words from string using the character indices
stringLst = string.split(" ") #split string into words
dictionary = {"string":"", "start_word":0,"end_word":0}


#process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0]) #index of the first word in words
dictionary["end_word"] = stringLst.index(words[-1]) #index of the last
print(dictionary)
{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}

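Note that this assumes the selection covers whole words that appear consecutively, in order, in the string. Because list.index returns the first match, a selected word that also occurs earlier in the string will produce the wrong index.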
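
A minimal sketch of an alternative that avoids looking words up by value, so repeated words are not a problem. It counts the whitespace-separated words up to each offset instead. The function name char_to_word_span is hypothetical, and the sketch assumes that end_char is exclusive (as in string[4:14]) and that the first and last selected characters are not whitespace:

```python
def char_to_word_span(text, start_char, end_char):
    """Map a [start_char, end_char) character span to inclusive word indices.

    Assumption: text[start_char] and text[end_char - 1] are non-whitespace,
    and words are separated by whitespace.
    """
    # The last word of the prefix ending at start_char is the first selected word.
    start_word = len(text[:start_char + 1].split()) - 1
    # The last word of the prefix ending at end_char is the last selected word.
    end_word = len(text[:end_char].split()) - 1
    return start_word, end_word


data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start_word, end_word = char_to_word_span(
    data["string"], data["start_char"], data["end_char"]
)
converted = {"string": data["string"], "start_word": start_word, "end_word": end_word}
print(converted)
```

For a string with a repeated word such as "the bee saw the bee", selecting the second "the bee" (offsets 12-19) yields word indices 3 and 4 with this sketch, whereas a lookup by value would find the first occurrence.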

answered 2020-10-20T15:35:42.547

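This looks like a tokenization problem. My solution is to use a span tokenizer and then search its spans for the ones that fall inside the substring span. So, using the nltk library: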

import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b, sub_e = 4, 14  # substring begin and end
[i for i, (b, e) in enumerate(tokenizer.span_tokenize(stringy))
 if b >= sub_b and e <= sub_e]

It is a bit complicated, though. tokenizer.span_tokenize(stringy) returns the span of each token/word it identifies.
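
For readers without nltk installed, the same span idea can be sketched with the standard library. This is an assumption-laden stand-in, not the nltk tokenizer: it treats every run of non-whitespace characters as a token (roughly what a whitespace tokenizer does), and the helper name words_in_span is hypothetical. It also tests for overlap rather than full containment, so a partially selected word is still reported:

```python
import re


def words_in_span(text, start_char, end_char):
    """Return indices of whitespace-delimited tokens overlapping [start_char, end_char)."""
    return [
        i
        for i, match in enumerate(re.finditer(r"\S+", text))
        # A token overlaps the selection if it starts before the selection
        # ends and ends after the selection starts.
        if match.start() < end_char and match.end() > start_char
    ]


indices = words_in_span("the bee buzzed loudly", 4, 14)
print(indices)  # [1, 2]
```

The first and last elements of the returned list give start_word and end_word; an empty list means the span touched no token.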

answered 2020-10-20T15:43:20.163

Please try this code:

def char_change(dic, start_char, end_char, *arg):
    dic[arg[0]] = start_char
    dic[arg[1]] = end_char


data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}

start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))

char_change(data, start_char, end_char, "start_char", "end_char")

print(data)

Default dictionary:

data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}

Input:

Please enter your start character: 1
Please enter your end character: 2

Output dictionary:

{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}
answered 2020-10-20T16:07:37.597