
The JSON file has the following structure (one object per line):

{"text":"I","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}}
{"text":"love","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}}
{"text":"Coca-cola.","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}}
{"text":"He","meta":{"paper_id":"0f3402fa5b44e121d410ec73dfc21937074e5fa3","title":"title2"}}
{"text":"loves","meta":{"paper_id":"0f3402fa5b44e121d410ec73dfc21937074e5fa3","title":"title2"}}
{"text":"Pepsi.","meta":{"paper_id":"0f3402fa5b44e121d410ec73dfc21937074e5fa3","title":"title2"}}

I want to concatenate the sentences that belong to the same paper (paper_id), so that I end up with:

{"text":"I love Coca-cola. ","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}}
{"text":"He loves Pepsi.","meta":{"paper_id":"0f3402fa5b44e121d410ec73dfc21937074e5fa3","title":"title2"}}

Any ideas on how to solve this? I'm stuck on iterating over those nested dictionaries.

Loading the data into a list:

import json

data = [json.loads(line) for line in open('datafile_path', 'r')]
for sentence in data:
    for key,dict_n in sentence.items():
        for key2,value in dict_n.items():
            print(value)

This raises an error: AttributeError: 'str' object has no attribute 'items'
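The traceback follows from the shape of each record: the outer loop reaches the `text` key first, and its value is a plain string, so calling `.items()` on it fails. Below is a minimal sketch of the same loop with a type check. It inlines two of the sample lines instead of reading the file, and the `lines` and `printed` names are only there for illustration.

```python
import json

# Two sample records from the question, inlined in place of the file.
lines = [
    '{"text":"I","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}}',
    '{"text":"love","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}}',
]
data = [json.loads(line) for line in lines]

printed = []  # collects every value that gets printed
for sentence in data:
    for key, value in sentence.items():
        if isinstance(value, dict):
            # "meta" is the only nested dict, so descend one level here
            for key2, inner in value.items():
                printed.append(inner)
                print(inner)
        else:
            # "text" maps to a str, which has no .items()
            printed.append(value)
            print(value)
```

Only `meta` needs the inner loop; `text` can be used directly.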

2 Answers

First, collect the IDs like this:

def getIds(data):
    # collect each paper_id once, in order of first appearance
    ids = []
    for i in data:
        if i['meta']['paper_id'] not in ids:
            ids.append(i['meta']['paper_id'])
    return ids

Then iterate over the list:

# data_list is the list of parsed JSON lines (called data in the question)
paper_ids = getIds(data_list)
new_data = []
concatenate_sentence = {"text":"","meta":{"paper_id":"","title":""}}
for paper_id in paper_ids:
    for sentence in data_list:
        if sentence['meta']['paper_id'] == paper_id:
            concatenate_sentence['text'] += sentence['text'] + ' '
            concatenate_sentence['meta']['paper_id'] = paper_id
            concatenate_sentence['meta']['title'] = sentence['meta']['title']

    new_data.append(concatenate_sentence)
    # start a fresh dict for the next paper
    concatenate_sentence = {"text":"","meta":{"paper_id":"","title":""}}

print(new_data)

Output:

[{'text': 'I love Coca-cola. ', 'meta': {'paper_id': 'cadf94cda790ae1bd90c32fbe441bb68a8637d83', 'title': 'title1'}}, {'text': 'He loves Pepsi. ', 'meta': {'paper_id': '0f3402fa5b44e121d410ec73dfc21937074e5fa3', 'title': 'title2'}}]
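The nested loops above rescan the whole list once per paper. As an alternative sketch (not part of the original answer), the same grouping can be done in a single pass with a dict keyed by `paper_id`. The helper name `merge_by_paper` is made up for illustration, and the sample records are inlined so the snippet runs on its own. It relies on dicts preserving insertion order (Python 3.7+).

```python
def merge_by_paper(records):
    """Join the text of records sharing a paper_id, keeping first-seen order."""
    merged = {}  # paper_id -> {"texts": [...], "meta": {...}}
    for record in records:
        paper_id = record["meta"]["paper_id"]
        if paper_id not in merged:
            # copy meta so the input records are not shared with the output
            merged[paper_id] = {"texts": [], "meta": dict(record["meta"])}
        merged[paper_id]["texts"].append(record["text"])
    return [
        {"text": " ".join(entry["texts"]), "meta": entry["meta"]}
        for entry in merged.values()
    ]


# Sample data from the question, already parsed into dicts.
m1 = {"paper_id": "cadf94cda790ae1bd90c32fbe441bb68a8637d83", "title": "title1"}
m2 = {"paper_id": "0f3402fa5b44e121d410ec73dfc21937074e5fa3", "title": "title2"}
data_list = [
    {"text": "I", "meta": m1},
    {"text": "love", "meta": m1},
    {"text": "Coca-cola.", "meta": m1},
    {"text": "He", "meta": m2},
    {"text": "loves", "meta": m2},
    {"text": "Pepsi.", "meta": m2},
]

new_data = merge_by_paper(data_list)
print(new_data)
```

Because the pieces are combined with `' '.join`, the merged text has no trailing space, unlike the output above.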
answered 2020-04-12T18:55:35.317

You can put your JSON objects into a single list, for example:

a = [{"text":"I","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}},
     {"text":"love","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}},
     {"text":"Coca-cola.","meta":{"paper_id":"cadf94cda790ae1bd90c32fbe441bb68a8637d83","title":"title1"}},
     {"text":"He","meta":{"paper_id":"0f3402fa5b44e121d410ec73dfc21937074e5fa3","title":"title2"}},
     {"text":"loves","meta":{"paper_id":"0f3402fa5b44e121d410ec73dfc21937074e5fa3","title":"title2"}},
     {"text":"Pepsi.","meta":{"paper_id":"0f3402fa5b44e121d410ec73dfc21937074e5fa3","title":"title2"}}]

Then convert it to a DataFrame:

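import pandas as pd

df = pd.DataFrame.from_dict(a)
# expand the nested "meta" dicts into paper_id and title columns
df = pd.concat([df, df['meta'].apply(pd.Series)], axis=1)
# join the text of all rows that share a paper_id
df1 = df.groupby('paper_id')['text'].apply(' '.join).reset_index()
# keep one (paper_id, title) row per paper and merge it back in
df = df.drop(['text', 'meta'], axis=1)
df = df.drop_duplicates('paper_id')
df1 = df1.merge(df, how='inner', on='paper_id')
print(df1)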

The output is the required DataFrame, which you can then convert to whatever data type you need: a dictionary, an array, and so on.

                                   paper_id               text   title
0  0f3402fa5b44e121d410ec73dfc21937074e5fa3    He loves Pepsi.  title2
1  cadf94cda790ae1bd90c32fbe441bb68a8637d83  I love Coca-cola.  title1

In addition, the required list of dictionaries can be built like this:

reqd_list_dict=[]
values = df1.iloc[:,:].values
for i in values:
    temp ={}
    temp["text"] = i[1]
    temp["meta"] = {"paper_id":i[0],"title":i[2]}
    reqd_list_dict.append(temp)
print(reqd_list_dict)

Output:

[{'meta': {'paper_id': '0f3402fa5b44e121d410ec73dfc21937074e5fa3', 'title': 'title2'},'text': 'He loves Pepsi.'},
{'meta': {'paper_id': 'cadf94cda790ae1bd90c32fbe441bb68a8637d83', 'title': 'title1'},'text': 'I love Coca-cola.'}]
answered 2020-04-12T18:44:38.987