python - Split a sentence into its tokens as character annotations

Question

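After a long search I could not find an answer to my problem, which is why I am posting it here. I am trying to get a specific result using re and NLTK. Given a sentence, I have to annotate every character in BIS format, that is, tag each character as B (beginning of the token), I (intermediate or end position of the token), or S (space). For example, given the sentence:

The pen is on the table.

the system must produce the following output: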

BIISBIISBISBISBIISBIIIIB

which can be read as:

<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <1-char token>

My result is close, but instead of:

BIISBIISBISBISBIISBIIIIB

I get:

BIISBIISBISBISBIISBIIIISB

This means I get a space between "table" and the period. The output should be:

<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <1-char token>

Mine is:

<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <space> <1-char token>

My code so far:

from nltk.tokenize import word_tokenize
import re
p = "The pen is on the table."
# Split text into words using NLTK
text = word_tokenize(p)
print(text)
initial_char = [x.replace(x[0],'B') for x in text]
print(initial_char)
def listToString(s):  
    # initialize an empty string 
    str1 = " " 
    # return string   
    return (str1.join(s)) 
new = listToString(initial_char)
print(new)
def start_from_sec(my_text):
    return ' '.join([f'{word[0]}{(len(word) - 1) * "I"}' for word in my_text.split()])
res = start_from_sec(new)
p = re.sub(' ', 'S', res)
print(p)

score 2 · Accepted Answer

You can tokenize the string with a single regular expression:

(\w)(\w*)|([^\w\s])|\s

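Pattern details

(\w)(\w*) - Group 1: any word character (a letter, digit, or _), then Group 2: any zero or more word characters
| - or
([^\w\s]) - Group 3: any character other than a word or whitespace character
| - or
\s - a whitespace character

If Group 1 matches, the replacement is B followed by as many I's as there are characters in Group 2. If Group 3 matches, the replacement is B. Otherwise a whitespace character was matched, and it is replaced with S.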

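This can be customized further, for example:

To treat _ as punctuation: r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s'
To replace one or more consecutive whitespace characters with a single S: r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s+'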
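
To make the difference between these variants concrete, here is a small sketch that runs all three patterns on the same input. The helper name `annotate` and the sample string are illustrative choices, not part of the original answer.

```python
import re

def annotate(pattern, text):
    """Label each character of text as B, I, or S using the given pattern."""
    def repl(m):
        if m.group(1):                       # first character of a word token
            return "B" + "I" * len(m.group(2))
        if m.group(3):                       # a single punctuation character
            return "B"
        return "S"                           # whitespace (one or more, depending on the pattern)
    return re.sub(pattern, repl, text)

# Illustrative input: an underscore inside a word and two consecutive spaces
sample = "a_b  c."

default = r'(\w)(\w*)|([^\w\s])|\s'
underscore_punct = r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s'
collapse_spaces = r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s+'

print(annotate(default, sample))           # BIISSBB
print(annotate(underscore_punct, sample))  # BBBSSBB
print(annotate(collapse_spaces, sample))   # BBBSBB
```

With the default pattern, a_b is one three-character token. Once _ counts as punctuation it becomes three one-character tokens, and with \s+ the two spaces collapse into a single S.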

Python demo:

import re
p = "The pen is on the table."
def repl(x):
    if x.group(1):
        return "B{}".format("I"*len(x.group(2)))
    elif x.group(3):
        return "B"
    else:
        return "S"

print( re.sub(r'(\w)(\w*)|([^\w\s])|\s', repl, p) )
# => BIISBIISBISBISBIISBIIIIB
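
If you would rather keep the tokens produced by NLTK, the extra S can be avoided by aligning each token with its position in the original string instead of joining the tokens with spaces. The following is a minimal sketch of that idea, not part of the original answer. The function name `bis_from_tokens` is hypothetical, the token list is hard-coded to what `word_tokenize` returns for this sentence, and every character between two tokens is assumed to be whitespace.

```python
def bis_from_tokens(text, tokens):
    """Label each character of text as B/I/S, given its tokens in order."""
    labels = []
    pos = 0
    for tok in tokens:
        # Locate the token in the original text, starting after the previous one
        start = text.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after position {pos}")
        labels.append("S" * (start - pos))          # gap before the token (assumed whitespace)
        labels.append("B" + "I" * (len(tok) - 1))   # the token itself
        pos = start + len(tok)
    labels.append("S" * (len(text) - pos))          # trailing characters (assumed whitespace)
    return "".join(labels)

p = "The pen is on the table."
# Hard-coded here so the sketch runs without NLTK installed
tokens = ["The", "pen", "is", "on", "the", "table", "."]
print(bis_from_tokens(p, tokens))
# => BIISBIISBISBISBIISBIIIIB
```

One caveat: this only works when every token appears verbatim in the text. `word_tokenize` can rewrite some tokens (for example, double quotes become `` and ''), in which case the lookup raises a ValueError and the regex approach above is the simpler choice.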
