I have a .gz file that contains JSON objects, for example:

input:

{ "name":"John", "age":21, "gender":"male" }
{ "name":"Mike", "age":29, "gender":"male" }
{ "name":"Tim", "age":20, "gender":"male" }
{ "name":"Kim", "age":39, "gender":"female" }

Note: there is no comma at the end of each JSON object.

I load it into a DataFrame with the following:

import pandas as pd
data_location = 's3://myBucket/myFolder'
raw_json_data = pd.read_json(data_location, lines=True)
raw_json_data.head(2)

Question: I want to convert it to CSV, something like this:

expected output:

name, age, gender
John, 21, male
Mike, 29, male
Tim, 20, male
Kim, 39, female

I tried the following, but it does not give the expected output. Am I missing something?

df=pd.read_json(raw_json_data)
df.to_csv('results.csv')
2 Answers

First, you can create a DataFrame with a single column that holds each JSON object as a string:

import json
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO("""
{ "name":"John", "age":21, "gender":"male" }
{ "name":"Mike", "age":29, "gender":"male" }
{ "name":"Tim", "age":20, "gender":"male" }
{ "name":"Kim", "age":39, "gender":"female" } 
"""), delimiter='|', header=None)  # instead of StringIO part, you can have the path of input file

df    
                 0
0   { "name":"John", "age":21, "gender":"male" }
1   { "name":"Mike", "age":29, "gender":"male" }
2   { "name":"Tim", "age":20, "gender":"male" }
3   { "name":"Kim", "age":39, "gender":"female" }

You can use json_normalize to convert a single dict into a DataFrame:

def func(x):
    result = pd.json_normalize(json.loads(x.iloc[0]))
    return result

result = df.apply(func, axis=1)
result
0       name  age gender
0  John  21   male 
1       name  age gender
0  Mike  29   male 
2      name  age gender
0  Tim  20   male   
3      name  age  gender
0  Kim  39   female
dtype: object

The output above is a Series of DataFrames. To combine them into a single DataFrame, you can do the following:

pd.concat([r for r in result], ignore_index=True)

    name    age gender
0   John    21  male
1   Mike    29  male
2   Tim     20  male
3   Kim     39  female
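
As a possible shortcut, the apply and concat steps above can be collapsed into one call, because pd.json_normalize also accepts a list of dicts. The sketch below is only an illustration under that assumption: the raw_lines variable is a made-up stand-in for the file contents, and each line is parsed with json.loads before being handed to pandas.

```python
import json

import pandas as pd

# Sample records from the question, one JSON object per line.
# raw_lines is a hypothetical stand-in for the text read from the input file.
raw_lines = """
{ "name":"John", "age":21, "gender":"male" }
{ "name":"Mike", "age":29, "gender":"male" }
{ "name":"Tim", "age":20, "gender":"male" }
{ "name":"Kim", "age":39, "gender":"female" }
"""

# Parse every non-empty line into a dict.
records = [json.loads(line) for line in raw_lines.splitlines() if line.strip()]

# json_normalize accepts a list of dicts and returns a single DataFrame.
df = pd.json_normalize(records)

# index=False keeps the row index out of the CSV text.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file name to to_csv instead of calling it without a path would write the same text to disk.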
answered 2021-01-14T22:33:17.080
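  • "I have a .gz file that contains JSON objects" implies that there is a .gz file with a .json file inside it.
  • Use the pathlib methods to read the file in, then split the text into a list of strings, one per line.
    • Path('test.json'): 'test.json' can be the full path to the file if it is located in a different directory.
  • Convert the strings to dicts with ast.literal_eval.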
import pandas as pd
from pathlib import Path
from ast import literal_eval

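# read the file in using the pathlib methods, skipping blank lines
# (a trailing newline would otherwise yield an empty string that literal_eval rejects)
lines = [line for line in Path('test.json').read_text().splitlines() if line.strip()]

# convert each string to a dict
text = [literal_eval(line) for line in lines]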

# load the list of dicts into a dataframe
df = pd.DataFrame(text)

# save to a csv
df.to_csv('results.csv', index=False)

Reading from the .gz file

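  • Reading the file as a whole with the json module is problematic, because the data is not a single well-formed JSON document; it holds one object per line.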
import gzip
import pandas as pd
from ast import literal_eval

# open the gzip file
with gzip.open('testing.json.gz', 'rt', encoding='UTF-8') as f:
    # parse each non-empty line into a dict
    data = [literal_eval(v.strip()) for v in f if v.strip()]

# create the dataframe
df = pd.DataFrame(data)

# save to a csv
df.to_csv('results.csv', index=False)
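
One caveat worth noting: literal_eval parses Python literals, so JSON values such as true, false or null would make it raise an error. A minimal sketch of two alternatives follows, assuming the file holds one JSON object per line. The file name sample.json.gz and the "active" field are hypothetical and exist only to make the example self-contained.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Build a small gzip-compressed file with one JSON object per line.
# The path and the "active" field are hypothetical, for illustration only.
sample = (
    '{ "name":"John", "age":21, "active":true }\n'
    '{ "name":"Kim", "age":39, "active":null }\n'
)
gz_path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(gz_path, 'wt', encoding='UTF-8') as fh:
    fh.write(sample)

# Option 1: parse line by line with json.loads, which understands true/false/null.
with gzip.open(gz_path, 'rt', encoding='UTF-8') as fh:
    records = [json.loads(line) for line in fh if line.strip()]
df_json = pd.DataFrame(records)

# Option 2: let pandas decompress the file and parse it as line-delimited JSON.
df_pandas = pd.read_json(gz_path, lines=True, compression='gzip')

print(df_json)
print(df_pandas)
```

Either DataFrame can then be saved with to_csv('results.csv', index=False) as in the code above.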
answered 2021-01-14T22:25:49.107