python - 如何在 pyspark 列上使用“LanguageDetectorDL”火花 NLP？

Question

我正在使用 pyspark 数据框。
我的 df 看起来像这样：

df.select('words').show(5, truncate = 130)

+----------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   words          |
+----------------------------------------------------------------------------------------------------------------------------------+
|[content, type, multipart, alternative, boundary, nextpart, da, df, nextpart, da, df, content, type, text, plain, charset, asci...|
|[receive, ameurht, eop, eur, prod, protection, outlook, com, cyprmb, namprd, prod, outlook, com, https, via, cyprca, namprd, pr...|
|[plus, every, photographer, need, mm, lens, digital, photography, school, email, newsletter, http, click, aweber, com, ct, l, m...|
|[content, type, multipart, alternative, boundary, nextpart, da, beb, nextpart, da, beb, content, type, text, plain, charset, as...|
|[original, message, customer, service, mailto, ilpjmwofnst, qssadxnvrvc, narrig, stepmotherr, eviews, com, send, thursday, dece...|
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows

我需要在类型列上LanguageDetectorDL使用spark NLP，以便它检测英语并仅保留英语单词并删除其他单词。wordsarray<strings>

我已经习惯DocumentAssembler()将数据转换为注释格式：

documentAssembler = DocumentAssembler().setInputCol('words').setOutputCol('document')

但我不知道如何LanguageDetectorDL在列上使用并摆脱非英语单词？

score 1 · Accepted Answer

Spark-NLP 的语言检测器在 char 级别工作。这意味着它不使用字典来匹配单词。如果您提供整个句子，它肯定会更好地工作，但如果您只是以您想要检测的语言传递一大串连接的标记，它应该表现得可以接受，例如使用这个检测 21 种不同语言的预训练模型，

from sparknlp.pretrained import PretrainedPipeline
language_detector_pipeline = PretrainedPipeline('detect_language_21', lang='xx')

language_detector_pipeline.annotate("«Нападение на 13-й участок»")


{'document': ['«Нападение на 13-й участок»'],
 'sentence': ['«Нападение на 13-й участок»'],
 'language': ['bg']}

检查您将使用的语言是否在模型支持的语言中，

https://nlp.johnsnowlabs.com/2020/12/05/detect_language_21_xx.html

并确保您传递了一个大约 150 个字符长的字符串，以便模型有更多机会返回一个好的答案。

python - 如何在 pyspark 列上使用“LanguageDetectorDL”火花 NLP？

1 回答 1

Related

Reference