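I am preparing a script that reconstructs multi-token strings from tokenized text in which the tokens carry specific tags. Each token is associated with its start and end index in the original text.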

Here is an example text:

t = "Breakfast at Tiffany's is a novella by Truman Capote."

The token data structure, containing the original-text indices and the tags:

[(['Breakfast', 0, 9], 'BOOK'),
 (['at', 10, 12], 'BOOK'),
 (['Tiffany', 13, 20], 'BOOK'),
 (["'", 20, 21], 'BOOK'),
 (['s', 21, 22], 'BOOK'),
 (['is', 23, 25], 'O'),
 (['a', 26, 27], 'O'),
 (['novella', 28, 35], 'O'),
 (['by', 36, 38], 'O'),
 (['Truman', 39, 45], 'PER'),
 (['Capote', 46, 52], 'PER'),
 (['.', 52, 53], 'O')]

This data structure is generated from t as follows:

import re

tokens = [[m.group(0), m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", t, re.UNICODE)]
tags = ['BOOK', 'BOOK', 'BOOK', 'BOOK', 'BOOK', 'O', 'O', 'O', 'O', 'PER', 'PER', 'O']
token_tuples = list(zip(tokens, tags))
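
As a quick sanity check of the offsets (a small sketch, not part of the script itself), slicing t with a token's start and end index should return that token, and slicing from the start of a span's first token to the end of its last token should return the whole multi-token string. The name book_span is only used for this illustration:

```python
import re

t = "Breakfast at Tiffany's is a novella by Truman Capote."
tokens = [[m.group(0), m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", t, re.UNICODE)]

# Every token's offsets slice back to the token text itself.
assert all(t[start:end] == text for text, start, end in tokens)

# Start of the first BOOK token up to the end of the last BOOK token.
book_span = t[tokens[0][1]:tokens[4][2]]
print(book_span)  # Breakfast at Tiffany's
```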

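What I want my script to do is iterate over token_tuples and, when it encounters a non-O token, break out of the main iteration and reconstruct the tagged multi-token span until it hits the nearest O token.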

This is the current script:

for i in range(len(token_tuples)):

    if token_tuples[i][1] != 'O':

        tag = token_tuples[i][1]
        start_ix = token_tuples[i][0][1]

        slider = i+1

        while slider < len(token_tuples):

            if tag != token_tuples[slider][1]:

                end_ix = token_tuples[slider][0][2]

                print((t[start_ix:end_ix], tag))
                break

            else:
                slider+=1

This prints:

("Breakfast at Tiffany's is", 'BOOK')
("at Tiffany's is", 'BOOK')
("Tiffany's is", 'BOOK')
("'s is", 'BOOK')
('s is', 'BOOK')
('Truman Capote.', 'PER')
('Capote.', 'PER')

What needs to be modified so that the output for this example is:

> ("Breakfast at Tiffany's", "BOOK")
> ("Truman Capote", "PER")

1 Answer

Here is a solution. If you can come up with something less verbose, I'd happily accept your answer!

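def extract_entities(t, token_tuples):

    entities = []
    tag = ''

    for i in range(len(token_tuples)):

        if token_tuples[i][1] != 'O':

            # A new span starts whenever the tag differs from the one being tracked.
            if token_tuples[i][1] != tag:
                tag = token_tuples[i][1]
                start_ix = token_tuples[i][0][1]

            # The span ends at the last token of the text, or when the next
            # token has a different tag (so a span at the very end is kept).
            if i+1 == len(token_tuples) or tag != token_tuples[i+1][1]:
                end_ix = token_tuples[i][0][2]
                entities.append((t[start_ix:end_ix], tag))
                tag = ''

    return entities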
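
One possibly less verbose variant is sketched below using itertools.groupby from the standard library. It assumes, like the function above, that consecutive tokens with the same tag always belong to one span, so two adjacent entities sharing a tag would be merged. The name extract_entities_grouped is made up for this illustration:

```python
import re
from itertools import groupby

def extract_entities_grouped(t, token_tuples):
    """Hypothetical alternative: merge each run of identically tagged tokens."""
    entities = []
    # groupby yields one group per run of consecutive items with the same tag.
    for tag, group in groupby(token_tuples, key=lambda pair: pair[1]):
        if tag == 'O':
            continue
        run = list(group)
        start_ix = run[0][0][1]   # start offset of the first token in the run
        end_ix = run[-1][0][2]    # end offset of the last token in the run
        entities.append((t[start_ix:end_ix], tag))
    return entities

t = "Breakfast at Tiffany's is a novella by Truman Capote."
tokens = [[m.group(0), m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", t, re.UNICODE)]
tags = ['BOOK', 'BOOK', 'BOOK', 'BOOK', 'BOOK', 'O', 'O', 'O', 'O', 'PER', 'PER', 'O']
token_tuples = list(zip(tokens, tags))

print(extract_entities_grouped(t, token_tuples))
# [("Breakfast at Tiffany's", 'BOOK'), ('Truman Capote', 'PER')]
```

Because each run is closed when its group ends, a span that reaches the final token of the text needs no special handling.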
Answered 2020-04-06T14:39:59.753