I am working with a PySpark DataFrame.
My df looks like this:
df.select('words').show(5, truncate = 130)
+----------------------------------------------------------------------------------------------------------------------------------+
| words |
+----------------------------------------------------------------------------------------------------------------------------------+
|[content, type, multipart, alternative, boundary, nextpart, da, df, nextpart, da, df, content, type, text, plain, charset, asci...|
|[receive, ameurht, eop, eur, prod, protection, outlook, com, cyprmb, namprd, prod, outlook, com, https, via, cyprca, namprd, pr...|
|[plus, every, photographer, need, mm, lens, digital, photography, school, email, newsletter, http, click, aweber, com, ct, l, m...|
|[content, type, multipart, alternative, boundary, nextpart, da, beb, nextpart, da, beb, content, type, text, plain, charset, as...|
|[original, message, customer, service, mailto, ilpjmwofnst, qssadxnvrvc, narrig, stepmotherr, eviews, com, send, thursday, dece...|
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows
I need to use LanguageDetectorDL from Spark NLP on the words column, which is of type array<string>, so that it detects English and keeps only the English words, removing the rest.
I have used DocumentAssembler() to convert the data into annotation format:
documentAssembler = DocumentAssembler().setInputCol('words').setOutputCol('document')
But I don't know how to apply LanguageDetectorDL to this column and get rid of the non-English words.
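For reference, here is a minimal sketch of one possible approach. Note that LanguageDetectorDL classifies whole documents, not individual words, so this sketch joins the word array into a string, detects the row's language, and keeps or drops the row's words accordingly; the model name `ld_wiki_tatoeba_cnn_21` is Spark NLP's pretrained 21-language detector, and the function/column names here are my own assumptions, not part of the original question:

```python
def build_language_pipeline(df):
    """Sketch: run LanguageDetectorDL over a DataFrame whose 'words'
    column is array<string>, keeping only rows detected as English.
    Assumes Spark NLP is installed and a SparkSession is active."""
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import LanguageDetectorDL
    from pyspark.ml import Pipeline
    from pyspark.sql import functions as F

    # DocumentAssembler expects a plain string column, so join the array first.
    df = df.withColumn('text', F.concat_ws(' ', 'words'))

    documentAssembler = (DocumentAssembler()
                         .setInputCol('text')
                         .setOutputCol('document'))

    # Pretrained multi-language detector; operates on the whole document.
    languageDetector = (LanguageDetectorDL.pretrained('ld_wiki_tatoeba_cnn_21')
                        .setInputCols(['document'])
                        .setOutputCol('language'))

    pipeline = Pipeline(stages=[documentAssembler, languageDetector])
    result = pipeline.fit(df).transform(df)

    # 'language.result' is an array holding the detected code, e.g. ['en'].
    return result.filter(F.col('language.result')[0] == 'en')


def keep_words_if_english(words, detected_lang):
    """Detection is document-level, so the decision is per row: keep the
    whole word list when the row was detected as English, else drop it."""
    return words if detected_lang == 'en' else []
```

Per-word language detection is unreliable for short tokens, which is why this sketch filters at the row level rather than trying to classify each word individually.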