
I have a dataset in CoNLL format with token-level annotations.

token   label
Also    O
,   O
outdoor B-claim
activities  I-claim
enable  I-claim
me  I-claim
to  I-claim
socialize   I-claim
with    I-claim
other   I-claim
people  I-claim
and I-claim
enjoy   I-claim
natural I-claim
beauty  I-claim
.   O
                    
There   O
are O
strong  O
advantages  O
to  O
spend   O
leisure O
time    O
outdoors    O
.   O

Blank lines separate the sentences of a document. Each sentence is treated as one instance for the machine-learning model. I want to split the dataset into train, test, and dev sets, but make sure that no sentence is split across sets. Is there any library in Python for splitting a dataset like this, or do I have to do it manually?

Thanks in advance!


1 Answer


I haven't found a way to split it automatically, so I built a function to do it.
Hope it helps you.

def read_conll_data(filepath="/path/to/ner_data.txt"):

    # Change the extension from conll to txt to read the file properly
    with open(filepath, 'r') as f:
        lines = f.read().splitlines()

    # Count the number of examples in the file;
    # sentences are separated by blank lines
    num_examples = sum(1 for line in lines if line == "")
    train_cutoff = int(num_examples * 0.8)

    count = 0
    first_line = True

    # Open each output file once instead of reopening it on every line
    with open("train_set.txt", "a") as train_set, \
         open("test_set.txt", "a") as test_set:

        for line in lines:
            # Collect the first 80% of the sentences as training examples,
            # the rest as test examples
            if count < train_cutoff:
                out = train_set
            else:
                out = test_set
                # Make sure that -DOCSTART- -X- O is included
                # at the top of the test file
                if first_line:
                    out.write(lines[0] + "\n")
                    first_line = False

            if line == "":
                count += 1
            out.write(line + "\n")
answered 2021-10-04T11:22:25.460