
In an MLlib pipeline, how can I chain a CountVectorizer (from Spark ML) after a Stemmer (from Spark NLP)?

When I try to use both in the same pipeline, I get:

myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.

Regards,


1 Answer


You need to add a Finisher to your Spark NLP pipeline. Try this:

import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Stemmer, Tokenizer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.CountVectorizer
import spark.implicits._ // for toDF

val documentAssembler =
  new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector =
  new SentenceDetector().setInputCols("document").setOutputCol("sentences")
val tokenizer =
  new Tokenizer().setInputCols("sentences").setOutputCol("token")
val stemmer = new Stemmer()
  .setInputCols("token")
  .setOutputCol("stem")

// The Finisher converts Spark NLP annotation structs into a plain
// array<string> column that Spark ML transformers can consume.
val finisher = new Finisher()
  .setInputCols("stem")
  .setOutputCols("token_features")
  .setOutputAsArray(true)
  .setCleanAnnotations(false)

val cv = new CountVectorizer()
  .setInputCol("token_features")
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(
    Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      stemmer,
      finisher,
      cv
    ))

val data =
  Seq("Peter Pipers employees are picking pecks of pickled peppers.")
    .toDF("text")

val model = pipeline.fit(data)
val df = model.transform(data)

Output:

+--------------------------------------------------------------------+
|features                                                            |
+--------------------------------------------------------------------+
|(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+--------------------------------------------------------------------+
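To see why this works, you can inspect the schema of the Finisher's output column on the transformed DataFrame from the snippet above: with `setOutputAsArray(true)`, `token_features` is a plain `array<string>`, which is exactly the type CountVectorizer's error message was asking for (a quick sanity check, not part of the original answer):

```scala
// token_features should now be array<string> rather than the
// array<struct<annotatorType, begin, end, result, ...>> annotation type
// that caused the original error.
df.select("token_features").printSchema()
```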
Answered 2021-10-08T11:29:52.493