python-3.x - Automatically extract columns from nested dictionaries in pandas

Question

I have a column, loaded from a jsonl file, that holds several nested dictionaries, like this:

    `df['referenced_tweets'][0]`

which produces (output shortened):

  'id': '1392893055112400898',
  'public_metrics': {'retweet_count': 0,
   'reply_count': 1,
   'like_count': 2,
   'quote_count': 0},
  'conversation_id': '1392893055112400898',
  'created_at': '2021-05-13T17:22:37.000Z',
  'reply_settings': 'everyone',
  'entities': {'annotations': [{'start': 65,
     'end': 77,
     'probability': 0.9719000000000001,
     'type': 'Person',
     'normalized_text': 'Jill McMillan'}],
   'mentions': [{'start': 23,
     'end': 36,
     'username': 'usasklibrary',
     'protected': False,
     'description': 'The official account of the University Library at USask.',
     'created_at': '2019-06-04T17:19:12.000Z',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'url': '*removed*',
         'expanded_url': 'http://library.usask.ca',
         'display_url': 'library.usask.ca'}]}},
     'name': 'University Library',
     'url': '....',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1278828446026629120/G1w7t-HK_normal.jpg',
     'verified': False,
     'id': '1135959197902921728',
     'public_metrics': {'followers_count': 365,
      'following_count': 119,
      'tweet_count': 556,
      'listed_count': 9}}]},
  'text': 'Wonderful session with @usasklibrary Graduate Writing Specialist Jill McMillan who is walking SURE students through the process of organizing/analyzing a literature review! So grateful to the library -- our largest SURE: Student Undergraduate Research Experience partner!', 
...

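My intention is to write a function that automatically extracts specific columns (e.g. text, type) across the whole dataframe, not just a single row. So I wrote this function: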

### x = df['referenced_tweets']

def extract_TextType(x):
    dic = {}
    for i in x:
        if i != " ":
            new_df= pd.DataFrame.from_dict(i)
            dic['refd_text']=new_df['text']
            dic['refd_type'] = new_df['type']
        else:
            print('none')
    return dic

But running the function:

df['referenced_tweets'].apply(extract_TextType)

raises an error:

ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

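The goal is to extract these two nested columns (text and type) from the original 'referenced_tweets' column and match them with the original rows.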

What am I doing wrong?

P.S. A snapshot of the original df was attached to the question.

score 0 · Accepted Answer

There are a few things to consider here. referenced_tweets contains a list, so the line new_df = pd.DataFrame.from_dict(i) most likely cannot parse the input the way you pass it in.
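To illustrate the parsing problem, here is a minimal sketch with a made-up tweet dict (the keys and values are assumptions, not the asker's data). As far as I can tell, pandas raises this ValueError when a dict handed to the DataFrame constructor mixes nested dict values with list values, which is likely what each tweet dict looks like. Reading the keys directly avoids the problem.

```python
import pandas as pd

# Hypothetical tweet dict: one nested dict value plus one list value.
tweet = {
    "id": "1",
    "text": "hello",
    "public_metrics": {"like_count": 2},
    "mentions": ["usasklibrary"],
}

try:
    # pandas has to guess an index from a dict and a list at the same time.
    pd.DataFrame.from_dict(tweet)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Plain key access needs no DataFrame at all.
text = tweet["text"]
```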

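Also, since that list may contain more than one tweet, you are right to iterate over it, but you don't need to put each item into a df to do so. Keep in mind that .apply() calls your function once per cell, so you get one result per row. If that is what you want, that's fine. If you really just want one new dataframe, you can adapt the following. I don't have access to referenced_tweets, so I used entities as an example. Here is my example: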

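import pandas as pd

# Keep only the rows where `entities` is present.
ents = df[df.entities.notnull()]['entities']

dict_hold_list = []
for ent in ents:
    # Each `entities` dict holds a list of hashtag dicts under 'hashtags'.
    for htag in ent['hashtags']:
        # Collect only the fields of interest, one dict per hashtag.
        dict_hold_list.append({'text': htag['text'], 'indices': htag['indices']})

# One row per hashtag.
df_hashtags = pd.DataFrame(dict_hold_list)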

Because you did not provide a working json or dataframe, I can't test this, but your solution might look something like this:

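# Keep only the rows where `referenced_tweets` is present.
refs = df[df.referenced_tweets.notnull()]['referenced_tweets']

dict_hold_list = []
for ref in refs:
    # Each cell is a list of tweet dicts.
    for r in ref:
        # Collect only the fields of interest, one dict per referenced tweet.
        dict_hold_list.append({'text': r['text'], 'type': r['type']})

# One row per referenced tweet.
df_ref_tweets = pd.DataFrame(dict_hold_list)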
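The loop above drops the link back to the original rows, which the question asks to keep. Below is a minimal sketch of one way to preserve it, using made-up sample data. The sample df, the `row` helper column, and the `refd_text` / `refd_type` names (borrowed from the question's function) are assumptions for illustration. It records each row's index label while looping and then left-joins the result onto the original frame.

```python
import pandas as pd

# Hypothetical data shaped like the question's column: each cell is either
# a list of tweet dicts or missing (None).
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "referenced_tweets": [
        [{"type": "replied_to", "text": "first tweet"}],
        None,
        [{"type": "quoted", "text": "second tweet"},
         {"type": "retweeted", "text": "third tweet"}],
    ],
})

rows = []
# .items() yields (index label, cell value), so the origin row is known.
for idx, refs in df["referenced_tweets"].dropna().items():
    for ref in refs:
        rows.append({
            "row": idx,
            "refd_text": ref.get("text"),  # .get() tolerates a missing key
            "refd_type": ref.get("type"),
        })

df_refs = pd.DataFrame(rows).set_index("row")

# Left join on the index: rows without referenced tweets get NaN, and rows
# with several referenced tweets are repeated once per tweet.
merged = df.drop(columns="referenced_tweets").join(df_refs, how="left")
print(merged)
```

A row with several referenced tweets appears once per tweet in `merged`. If one row per original tweet is required instead, the extracted values would need to be aggregated (for example into lists) before joining.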
