我正在尝试从 sparknlp 中提取输出(使用 Pretrained Pipeline 'explain_document_dl')。我花了很多时间寻找方法(UDF、爆炸等),但无法接近可行的解决方案。假设我想在 column 下result
和metadata
从 column中提取值entities
。在该列中有一个包含多个字典的数组
当我使用df.withColumn("entity_name", explode("entities.result"))
时,只提取第一个字典中的值。
“实体”列的内容是字典列表。
尝试提供可重现的示例/重新创建数据框(感谢下面@jonathan 提供的建议):
# content of one cell as an example:
d = [{"annotatorType":"chunk","begin":2740,"end":2747,"result":"•Ability","metadata":{"entity":"ORG","sentence":"8","chunk":"22"},"embeddings":[],"sentence_embeddings":[]}, {"annotatorType":"chunk","begin":2740,"end":2747,"result":"Fedex","metadata":{"entity":"ORG","sentence":"8","chunk":"22"},"embeddings":[],"sentence_embeddings":[]}]
from pyspark.sql.types import StructType, StructField, StringType
from array import array
schema = StructType([StructField('annotatorType', StringType(), True),
StructField('begin', IntegerType(), True),
StructField('end', IntegerType(), True),
StructField('result', StringType(), True),
StructField('sentence', StringType(), True),
StructField('chunk', StringType(), True),
StructField('metadata', StructType((StructField('entity', StringType(), True),
StructField('sentence', StringType(), True),
StructField('chunk', StringType(), True)
)), True),
StructField('embeddings', StringType(), True),
StructField('sentence_embeddings', StringType(), True)
]
)
df = spark.createDataFrame(d, schema=schema)
df.show()
在这种字典列表的情况下,它可以工作:
+-------------+-----+----+--------+--------+-----+------------+----------+-------------------+
|annotatorType|begin| end| result|sentence|chunk| metadata|embeddings|sentence_embeddings|
+-------------+-----+----+--------+--------+-----+------------+----------+-------------------+
| chunk| 2740|2747|•Ability| null| null|[ORG, 8, 22]| []| []|
| chunk| 2740|2747| Fedex| null| null|[ORG, 8, 22]| []| []|
+-------------+-----+----+--------+--------+-----+------------+----------+-------------------+
但是我被困在如何将其应用于一列,该列包含一些具有多个字典数组的单元格(因此原始单元格有多行)。
我试图将相同的模式应用于entities
列,我必须先将列转换为 json。
ent1 = ent1.withColumn("entities2", to_json("entities"))
它适用于具有 1 个字典数组的null
单元格,但适用于具有多个字典数组(第 4 行)的单元格:
ent1.withColumn("entities2", from_json("entities2", schema)).select("entities2.*").show()
+-------------+-----+----+------+--------+-----+------------+----------+-------------------+
|annotatorType|begin| end|result|sentence|chunk| metadata|embeddings|sentence_embeddings|
+-------------+-----+----+------+--------+-----+------------+----------+-------------------+
| chunk| 166| 169| Lyft| null| null|[MISC, 0, 0]| []| []|
| chunk| 11| 14| Lyft| null| null|[MISC, 0, 0]| []| []|
| chunk| 52| 55| Lyft| null| null|[MISC, 1, 0]| []| []|
| null| null|null| null| null| null| null| null| null|
+-------------+-----+----+------+--------+-----+------------+----------+-------------------+
所需的输出是
+-------------+-----+----+----------------+------------------------+----------+-------------------+
|annotatorType|begin| end| result | metadata |embeddings|sentence_embeddings|
+-------------+-----+----+----------------+------------------------+----------+-------------------+
| chunk| 166| 169|Lyft |[MISC] | []| []|
| chunk| 11| 14|Lyft |[MISC] | []| []|
| chunk| 52| 55|Lyft. |[MISC] | []| []|
| chunk| [..]|[..]|[Lyft,Lyft, |[MISC,MISC,MISC, | []| []|
| | | |FedEx Ground..] |ORG,LOC,ORG,ORG,ORG,ORG]| | |
+-------------+-----+----+----------------+------------------------+----------+-------------------+
我还尝试将每一行转换为 json,但我忘记了原始行并得到了扁平的儿子:
new_df = sqlContext.read.json(ent2.rdd.map(lambda r: r.entities2))
new_df.show()
+-------------+-----+----------+----+------------+----------------+-------------------+
|annotatorType|begin|embeddings| end| metadata| result|sentence_embeddings|
+-------------+-----+----------+----+------------+----------------+-------------------+
| chunk| 166| []| 169|[0, MISC, 0]| Lyft| []|
| chunk| 11| []| 14|[0, MISC, 0]| Lyft| []|
| chunk| 52| []| 55|[0, MISC, 1]| Lyft| []|
| chunk| 0| []| 11| [0, ORG, 0]| FedEx Ground| []|
| chunk| 717| []| 720| [1, LOC, 4]| Dock| []|
| chunk| 811| []| 816| [2, ORG, 5]| Parcel| []|
| chunk| 1080| []|1095| [3, ORG, 6]|Parcel Assistant| []|
| chunk| 1102| []|1108| [4, ORG, 7]| • Daily| []|
| chunk| 1408| []|1417| [5, ORG, 8]| Assistants| []|
+-------------+-----+----------+----+------------+----------------+-------------------+
我尝试应用 UDF 来遍历“实体”中的数组列表:
def flatten(my_dict):
d_result = defaultdict(list)
for sub in my_dict:
val = sub['result']
d_result["result"].append(val)
return d_result["result"]
ent = ent.withColumn('result', flatten(df.entities))
TypeError: Column is not iterable
我发现这篇文章Apache Spark Read JSON With Extra Columns与我的问题非常相似,但是在将列转换entities
为 json 之后,我仍然无法通过该帖子中提供的解决方案来解决它。
任何帮助表示赞赏!理想情况下,python 中的解决方案,但 scala 中的示例也很有帮助!