python - Combining elements in a list in Python

Question

I am working on padding for an n-gram model. My code looks like this.

n = 5
text = "hello how are"
tokens = text[-n:]
prefix = tokens[:-1]
toPad = (n) - len(prefix)-1
prefix = "<s>"*toPad+tokens
print(list(prefix))

This gives me ['w', ' ', 'a', 'r', 'e'], which is the correct output for me. But when the input text is "he", it gives me the output ['<', 's', '>', '<', 's', '>', '<', 's', '>', 'h', 'e'].

Instead of that, the output I want is

['<s>', '<s>', '<s>', 'h', 'e']

Please help me solve this.

score 1 · Accepted Answer

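Because prefix is a string, list() splits it into a list of individual characters. Each <s> is just part of that string, so it gets split into ['<', 's', '>']. You can build the list in a loop instead, for example: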

n = 5
text = "he"
tokens = text[-n:]
prefix = tokens[:-1]
toPad = (n) - len(prefix)-1
prefix = "<s>"*toPad+tokens
prefList = []
i = 0
while i < len(prefix):
    if prefix.startswith("<s>", i):  # match the whole pad token, not just a lone "<"
        prefList.append("<s>")
        i += 3
    else:
        prefList.append(prefix[i])
        i += 1

print(prefList)

Output: ['<s>', '<s>', '<s>', 'h', 'e']
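
As a possible alternative to re-parsing the padded string, the padding can be kept as a list from the start, so each pad token is never split. This is only a sketch: the function name pad_prefix and its parameters n and pad are made up for illustration and are not part of the question's code. It assumes non-empty input text, for which n - len(tokens) equals the toPad value computed above.

```python
def pad_prefix(text, n=5, pad="<s>"):
    """Return the last n characters of text as a list, left-padded with pad tokens."""
    tokens = list(text[-n:])        # one list element per character
    to_pad = n - len(tokens)        # number of pad tokens still needed (never negative)
    return [pad] * to_pad + tokens  # list repetition keeps every pad token whole


print(pad_prefix("hello how are"))  # ['w', ' ', 'a', 'r', 'e']
print(pad_prefix("he"))             # ['<s>', '<s>', '<s>', 'h', 'e']
```

The key difference from the original code is that "<s>" * toPad repeats a string, while ["<s>"] * toPad repeats a one-element list.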

score 1 · Accepted Answer

Use findall from the re (regular expression) module to create the list, instead of list().

Code

import re

def parse(text):
  n = 5
  tokens = text[-n:]
  prefix = tokens[:-1]
  toPad = (n) - len(prefix)-1
  prefix = "<s>"*toPad+tokens

  # Use regex findall to create list
  return re.findall(r'<s>|.', prefix)  # Creates list of either <s> or any character

Test

print(parse("hello how are"))  # ['w', ' ', 'a', 'r', 'e']
print(parse("he"))             # ['<s>', '<s>', '<s>', 'h', 'e']
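
Two caveats may be worth keeping in mind with the regex approach: a different pad token could contain regex metacharacters, and "." does not match a newline by default. A hedged sketch that handles both follows. The helper name split_padded and the "[PAD]" token are hypothetical and used only for illustration.

```python
import re


def split_padded(padded, pad="<s>"):
    """Split an already padded string into pad tokens and single characters."""
    # re.escape makes the pad token match literally, even if it contains
    # regex metacharacters; re.DOTALL lets "." also match newline characters.
    pattern = re.compile(re.escape(pad) + r"|.", re.DOTALL)
    return pattern.findall(padded)


print(split_padded("<s><s><s>he"))                  # ['<s>', '<s>', '<s>', 'h', 'e']
print(split_padded("[PAD][PAD]a\nb", pad="[PAD]"))  # ['[PAD]', '[PAD]', 'a', '\n', 'b']
```

Like the loop version, this still treats any literal occurrence of the pad token inside the input text as padding.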
