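You can combine other tokenizers with the udpipe R package. This is shown at https://bnosac.github.io/udpipe/docs/doc2.html. For example, you can tokenize with a tokenizer designed for Twitter messages and then let udpipe handle part-of-speech tagging, morphological feature annotation, and dependency parsing. The key is to pass the pre-tokenized text with one token per line and set tokenizer = "vertical":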
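library(tokenizers)  # tokenize_tweets() was removed in tokenizers 0.3.0, so an older release is needed
library(udpipe)
# Tokenize with the Twitter-aware tokenizer, keeping case and punctuation
tweets <- c("#rstats is a programming_language",
            "you can combine the #tokenizers package with @udpipe parsing")
x <- tokenize_tweets(tweets, lowercase = FALSE, strip_punct = FALSE)
# The "vertical" tokenizer expects one token per line, so join each document's tokens with newlines
x <- sapply(x, FUN = function(tokens) paste(tokens, collapse = "\n"))
# Tag, lemmatize and parse the pre-tokenized text with the english-ewt model
x <- udpipe(x, "english-ewt", tokenizer = "vertical", trace = TRUE)
x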
doc_id paragraph_id sentence_id sentence start end term_id token_id token                lemma                upos xpos feats                                                  head_token_id dep_rel  deps misc
doc1   1            1           <NA>     1     7   1       1        #rstats              #rstat               PRON PRP$ Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs 4             nsubj    <NA> <NA>
doc1   1            1           <NA>     9     10  2       2        is                   be                   AUX  VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  4             cop      <NA> <NA>
doc1   1            1           <NA>     12    12  3       3        a                    a                    DET  DT   Definite=Ind|PronType=Art                              4             det      <NA> <NA>
doc1   1            1           <NA>     14    33  4       4        programming_language programming_language NOUN NN   Number=Sing                                            0             root     <NA> <NA>
doc2   1            1           <NA>     1     3   1       1        you                  you                  PRON PRP  Case=Nom|Person=2|PronType=Prs                         3             nsubj    <NA> <NA>
doc2   1            1           <NA>     5     7   2       2        can                  can                  AUX  MD   VerbForm=Fin                                           3             aux      <NA> <NA>
doc2   1            1           <NA>     9     15  3       3        combine              combine              VERB VB   VerbForm=Inf                                           0             root     <NA> <NA>
doc2   1            1           <NA>     17    19  4       4        the                  the                  DET  DT   Definite=Def|PronType=Art                              6             det      <NA> <NA>
doc2   1            1           <NA>     21    31  5       5        #tokenizers          #tokenizer           NOUN NNS  Number=Plur                                            6             compound <NA> <NA>
doc2   1            1           <NA>     33    39  6       6        package              package              NOUN NN   Number=Sing                                            3             obj      <NA> <NA>
doc2   1            1           <NA>     41    44  7       7        with                 with                 ADP  IN   <NA>                                                   9             case     <NA> <NA>
doc2   1            1           <NA>     46    52  8       8        @udpipe              @udpipe              NOUN NN   Number=Sing                                            9             compound <NA> <NA>
doc2   1            1           <NA>     54    60  9       9        parsing              parsing              NOUN NN   Number=Sing                                            6             nmod     <NA> <NA>