python - 将文本句子转换为 CONLL 格式

Question

我想将普通英文文本转换为 CONLL-U 格式供 maltparser 查找 Python 文本中的依赖项。我在java中试过但没有这样做，下面是我正在寻找的格式 -

String[] tokens = new String[11];
tokens[0] = "1\thappiness\t_\tN\tNN\tDD|SS\t2\tSS";
tokens[1] = "2\tis\t_\tV\tVV\tPS|SM\t0\tROOT";
tokens[2] = "3\tthe\t_\tAB\tAB\tKS\t2\t+A";
tokens[3] = "4\tkey\t_\tPR\tPR\t_\t2\tAA";
tokens[4] = "5\tof\t_\tN\tEN\t_\t7\tDT";
tokens[5] = "6\tsuccess\t_\tP\tTP\tPA\t7\tAT";
tokens[6] = "7\tin\t_\tN\tNN\t_\t4\tPA";
tokens[7] = "8\tthis\t_\tPR\tPR\t_\t7\tET";
tokens[8] = "9\tlife\t_\tR\tRO\t_\t10\tDT";
tokens[9] = "10\tfor\t_\tN\tNN\t_\t8\tPA";
tokens[10] = "11\tsure\t_\tP\tIP\t_\t2\tIP";

我在java中尝试过，但我不能使用standford API，我想在python中也一样。

//这是java代码的例子，但是这里创建的令牌需要通过代码而不是手动解析-

MaltParserService service =  new MaltParserService(true);
// in the CoNLL data format.
String[] tokens = new String[11];
tokens[0] = "1\thappiness\t_\tN\tNN\tDD|SS\t2\tSS";
tokens[1] = "2\tis\t_\tV\tVV\tPS|SM\t0\tROOT";
tokens[2] = "3\tthe\t_\tAB\tAB\tKS\t2\t+A";
tokens[3] = "4\tkey\t_\tPR\tPR\t_\t2\tAA";
tokens[4] = "5\tof\t_\tN\tEN\t_\t7\tDT";
tokens[5] = "6\tsuccess\t_\tP\tTP\tPA\t7\tAT";
tokens[6] = "7\tin\t_\tN\tNN\t_\t4\tPA";
tokens[7] = "8\tthis\t_\tPR\tPR\t_\t7\tET";
tokens[8] = "9\tlife\t_\tR\tRO\t_\t10\tDT";
tokens[9] = "10\tfor\t_\tN\tNN\t_\t8\tPA";
tokens[10] = "11\tsure\t_\tP\tIP\t_\t2\tIP";
// Print out the string array
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i]);
}
// Reads the data format specification file
DataFormatSpecification dataFormatSpecification = service.readDataFormatSpecification(args[0]);
// Use the data format specification file to build a dependency structure based on the string array
DependencyStructure graph = service.toDependencyStructure(tokens, dataFormatSpecification);
// Print the dependency structure
System.out.println(graph);

score 0 · Accepted Answer

现在有一个将 Standford 库移植到 Python（经过改进）的端口，称为 Stanza。你可以在这里找到它：https ://stanfordnlp.github.io/stanza/

使用示例：

>>> import stanza
>>> stanza.download('en') # download English model
>>> nlp = stanza.Pipeline('en') # initialize English neural pipeline
>>> doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence

python - 将文本句子转换为 CONLL 格式

1 回答 1

Related

Reference