I have a huge text file, and I have to extract only the named entities from it. For this I am using Scala and a Databricks cluster.

val input = sc.textFile("....Mypath...").flatMap(line => line.split("""\W+"""))
val namedEnt = something(input)

Can anyone tell me what to code in order to get the named entities?
If you convert your input to a DataFrame (e.g. via .toDF), this is how you can get the named entities out:
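As a side note on the input: the pretrained NER pipeline works on whole sentences, so rather than pre-splitting the file on \W+ as in the question, it is simpler to read each line into a DataFrame with a text column. A minimal sketch (reusing the question's placeholder path; the column name "text" matches what the pipeline expects):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// spark.read.text reads the file line-by-line into a single column
// named "value"; rename it to "text" for the pretrained pipeline.
// Note: don't split on \W+ first, since NER needs full sentences.
val inputDF = spark.read.text("....Mypath...").toDF("text")
```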
First, an example of installing Spark NLP:
spark-shell --packages JohnSnowLabs:spark-nlp:2.4.0
Actual example:
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
// make sure you are using the latest release 2.4.x
// Download and load the pre-trained pipeline that has NER in English
// Full list: https://github.com/JohnSnowLabs/spark-nlp-models
val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")
// Transform your DataFrame into a new DataFrame that has a NER column
val annotation = pipeline.transform(inputDF)
// This would look something like this:
/*
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| sentence| token| embeddings| ner| entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/
// This is where the results for entities are:
annotation.select("entities.result").show
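If you want a flat list of entity strings rather than the nested annotation column, you can explode the result array. A sketch, assuming the annotation DataFrame from above:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Flatten the array of entity strings so there is one entity per row.
val entities = annotation
  .select(explode(col("entities.result")).as("entity"))

entities.show(truncate = false)

// Or collect the distinct entities back to the driver
// (only do this when the result is small enough to fit in memory):
val distinctEntities = entities.distinct().collect().map(_.getString(0))
```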
Let me know if you have any questions or problems with your input data and I'll update my answer.