
I have a huge text file and need to extract only the named entities from it. I am using Scala on a Databricks cluster.

val input = sc.textFile("....Mypath...").flatMap(line => line.split("""\W+"""))

val namedEnt = something(input)

Can anyone tell me what to code to get the named entities?


1 Answer


If you convert your input to a DataFrame (e.g. with .toDF), this is how you can get the named entities out:

An example of installing Spark NLP:

spark-shell --packages JohnSnowLabs:spark-nlp:2.4.0

Actual example:

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version() 
// make sure you are using the latest release 2.4.x

// Download and load the pre-trained pipeline that has NER in English
// Full list: https://github.com/JohnSnowLabs/spark-nlp-models
val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

// Transform your DataFrame into a new DataFrame that has an NER column
val annotation = pipeline.transform(inputDF)

// This would look something like this:
/*
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id|                text|            document|            sentence|               token|          embeddings|                 ner|            entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
|  2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

// This is where the results for entities are:

annotation.select("entities.result").show
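To tie this back to the code in the question, here is a minimal end-to-end sketch. It assumes `spark` is the session that spark-shell or a Databricks notebook already provides, and that the pretrained pipeline reads from a column named "text", as in the table above. The path is the placeholder from the question. Note that it keeps each line whole instead of splitting on `\W+`, since the pipeline does its own tokenization and the NER model most likely works better with surrounding context.

```scala
import org.apache.spark.sql.functions.{col, explode}
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Assumption: `spark` is the SparkSession provided by the shell or notebook.
// Read one row per line and name the column "text" for the pipeline.
val inputDF = spark.read.textFile("....Mypath...").toDF("text")

val pipeline = PretrainedPipeline("recognize_entities_dl", lang = "en")
val annotation = pipeline.transform(inputDF)

// entities.result is an array of strings per row;
// explode turns it into one entity per row, distinct drops repeats.
val namedEnt = annotation
  .select(explode(col("entities.result")).as("entity"))
  .distinct()

namedEnt.show(20, false)
```

If you also need the entity labels (PER, ORG, and so on), they should be available in the metadata of each entry in the entities column, but check the schema with annotation.printSchema before relying on that.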

Let me know if you have any questions or problems with your input data and I'll update my answer.


Answered 2020-02-14T10:52:13.907