scala - 要求失败：johnsnowlabs.nlp 中的 inputCols 注释器错误或缺失

Question

我正在使用com.johnsnowlabs.nlp-2.2.2spark-2.4.4 来处理一些文章。在那些文章中，有一些我不感兴趣的很长的词，它们会大大降低 POS 标记的速度。我想在标记化之后和 POSTagging 之前排除它们。

我试图编写更小的代码来重现我的问题

import sc.implicits._
val documenter = new DocumentAssembler().setInputCol("text").setOutputCol("document").setIdCol("id")
val tokenizer = new Tokenizer().setInputCols(Array("document")).setOutputCol("token")
val normalizer = new Normalizer().setInputCols("token").setOutputCol("normalized").setLowercase(true)

val df = Seq("This is a very useless/ugly sentence").toDF("text")

val document = documenter.transform(df.withColumn("id", monotonically_increasing_id()))
val token = tokenizer.fit(document).transform(document)

val token_filtered = token
  .drop("token")
  .join(token
    .select(col("id"), col("token"))
    .withColumn("tmp", explode(col("token")))
    .groupBy("id")
    .agg(collect_list(col("tmp")).as("token")),
    Seq("id"))
token_filtered.select($"token").show(false)
val normal = normalizer.fit(token_filtered).transform(token_filtered)

转换时出现此错误token_filtered

+--------------------+---+--------------------+--------------------+--------------------+
|                text| id|            document|            sentence|               token|
+--------------------+---+--------------------+--------------------+--------------------+
|This is a very us...|  0|[[document, 0, 35...|[[document, 0, 35...|[[token, 0, 3, Th...|
+--------------------+---+--------------------+--------------------+--------------------+


Exception in thread "main" java.lang.IllegalArgumentException:
requirement failed: Wrong or missing inputCols annotators in NORMALIZER_4bde2f08742a.
Received inputCols: token.
Make sure such annotators exist in your pipeline, with the right output
names and that they have following annotator types: token

如果我直接拟合并转换它就可以正常工作token似乎normalizer 在explode//期间丢失了一些信息，但架构和数据看起来相同groupBy。collect_list

任何想法？

score 1 · Accepted Answer

但是，要更新@ticapix给出的正确答案，在新版本中添加了两个功能SentenceDetector，Tokenizer它们是minLenght和maxLength：

您可以简单地过滤掉您不想通过管道提供的令牌：

val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
      .setMinLength(4)
      .setMaxLength(10)

参考资料：

score 0 · Accepted Answer

答案是：不可行。（https://github.com/JohnSnowLabs/spark-nlp/issues/653）

Annotator 在groupBy操作过程中被销毁。

解决方案是：

实施自定义Transformer
使用 UDF
在将数据输入管道之前对其进行预处理

scala - 要求失败：johnsnowlabs.nlp 中的 inputCols 注释器错误或缺失

2 回答 2

Related

Reference