I need a function that returns the indices at which a list of strings best aligns with a larger string.
For example:
Given the string:
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
and the list of strings:
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
could a function be written that produces:
indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
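To make the meaning of these indices concrete: each one appears to be the character offset in text at which a token after the first begins. Below is a quick sanity check of that reading; `starts` and `recovered` are names made up here for illustration.

```python
text = ('Kir4.3 is a inwardly-rectifying potassium channel. '
        'Dextran-sulfate is useful in glucose-mediated channels.')
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.',
       'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated',
       'channels', '.']
indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]

# The first token is assumed to start at offset 0; every later token
# starts at the corresponding entry of `indices`.
starts = [0] + indices

# Slicing the text at each start offset should give back the token itself.
recovered = [text[s:s + len(t)] for s, t in zip(starts, tok)]
print(recovered == tok)
```

There is one index fewer than there are tokens, because the first token needs no split point before it.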
Here is a script I wrote to illustrate this:
import numpy as np
# I need a function which takes a string and the tokenized list
# and returns the indices at which the tokens were split
def index_of_split(text_str, list_of_strings):
    # ?????
    return indices
# The text string, string token list, and character binary annotations
# are all given
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
# (This binary array labels the following terms ['Kir4.3', 'Dextran-sulfate', 'glucose'])
bin_ann = [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# Here we would apply our function
indices = index_of_split(text, tok)
# This list is the desired output
#indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
# We could now split the binary array based on these indices
bin_ann_toked = np.split(bin_ann, indices)
# and combine with the tokenized list
tokenized_strings = np.vstack((tok, bin_ann_toked)).T
# Then we can remove the trailing zeros,
# which are likely caused by spaces
# or other non-tokenized text
for i, el in enumerate(tokenized_strings):
    tokenized_strings[i][1] = el[1][:len(el[0])]
print(tokenized_strings)
Assuming the function works as described, this would produce the following output:
[['Kir4.3' array([1, 1, 1, 1, 1, 1])]
['is' array([0, 0])]
['a' array([0])]
['inwardly-rectifying'
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
['potassium' array([0, 0, 0, 0, 0, 0, 0, 0, 0])]
['channel' array([0, 0, 0, 0, 0, 0, 0])]
['.' array([0])]
['Dextran-sulfate' array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]
['is' array([0, 0])]
['useful' array([0, 0, 0, 0, 0, 0])]
['in' array([0, 0])]
['glucose' array([1, 1, 1, 1, 1, 1, 1])]
['-' array([0])]
['mediated' array([0, 0, 0, 0, 0, 0, 0, 0])]
['channels' array([0, 0, 0, 0, 0, 0, 0, 0])]
['.' array([0])]]
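For reference, here is a minimal sketch of one way `index_of_split` might be filled in. It is not a definitive implementation. It assumes every token occurs verbatim in the text and in the same order as in the list, and the cursor-based search with `str.find` is just one possible approach.

```python
def index_of_split(text_str, list_of_strings):
    """Return the start offset of every token after the first.

    Assumption: tokens appear verbatim and in order in text_str.
    """
    starts = []
    cursor = 0  # search resumes right after the previously matched token
    for token in list_of_strings:
        start = text_str.find(token, cursor)
        if start == -1:
            raise ValueError(
                f"token {token!r} not found at or after offset {cursor}")
        starts.append(start)
        cursor = start + len(token)
    # Drop the first token's offset: no split point is needed before it.
    return starts[1:]


text = ('Kir4.3 is a inwardly-rectifying potassium channel. '
        'Dextran-sulfate is useful in glucose-mediated channels.')
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.',
       'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated',
       'channels', '.']

print(index_of_split(text, tok))
```

Advancing the cursor past each match keeps short tokens such as '.' or 'is' from matching an earlier occurrence, for example the '.' inside 'Kir4.3'. If the tokenizer alters the text (for instance by normalizing quotes or case), the exact-match assumption breaks and the sketch raises a ValueError rather than guessing.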