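I am building a recurrent neural network for a text classification problem with PyBrain. After many attempts, I still cannot figure out how to convert a list of strings into an array that can be used as a dataset. Here is what I have done so far: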
import collections
import re
from pybrain.datasets import SupervisedDataSet

# Create the supervised dataset variable with 5 inputs and 1 output
windowSize = 5
main_ds = SupervisedDataSet(windowSize, 1)

with open('ltest5lg_d1.fr', 'r') as train_1:
    import_data_train = train_1.readlines()

train_data = []
for lines in import_data_train:
    s = lines.split()
    for words in s:
        train_data.append(words)

bagsofwords = [collections.Counter(re.findall(r'\w+', txt)) for txt in train_data]
sumbags = sum(bagsofwords, collections.Counter())
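So I now have a frequency table for the training data, but I cannot figure out how to convert the data itself into a format that can be used as input for the main_ds variable.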
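To make the goal concrete, here is a minimal sketch of the kind of conversion I think I need: map each word to an integer index and slide a window of 5 indices over the text. The helper names `build_vocab` and `make_windows` are hypothetical, the toy sentence stands in for the contents of my file, and using the index of the next word as the single output is only an assumption (a class label could take its place if the data has one).

```python
import collections

def build_vocab(words):
    """Map each distinct word to an integer index, most frequent first."""
    counts = collections.Counter(words)
    # Sort by descending count, then alphabetically, so the mapping is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered)}

def make_windows(words, vocab, window_size):
    """Return (inputs, target) pairs of word indices for each sliding window."""
    ids = [vocab[w] for w in words]
    samples = []
    for start in range(len(ids) - window_size):
        inputs = ids[start:start + window_size]
        # Assumption: the target is the index of the word right after the window.
        target = ids[start + window_size]
        samples.append((inputs, [target]))
    return samples

# Toy stand-in for the words read from the file (hypothetical data).
train_data = "le chat dort et le chien dort aussi".split()
vocab = build_vocab(train_data)
samples = make_windows(train_data, vocab, window_size=5)
```

Each pair could then presumably be added with `main_ds.addSample(inputs, target)`, since the input length matches `windowSize` and the target has length 1. Raw integer indices impose an arbitrary ordering on words, so a one-hot or normalized encoding may turn out to be a better input representation.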