The program retrieves JSON data from a REST API:
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows',1000)
url = 'http://xxxxxxxxxxx:7180/api/v15/clusters/cluster/services/impala/impalaQueries?from=2018-07-23T14:00:00&to=2018-07-23T21:00:00&filter=(query_state+%3D+%22FINISHED%22)&offset=0&limit=1000'
username = 'xxxxx'
password = 'xxxxx'
result = requests.get(url, auth=(username,password))
outJSON = result.json()
df = pd.io.json.json_normalize(outJSON['queries'])
filename ="tempNew.csv"
df.to_csv(filename)
The CSV data contains empty values in some fields and NaN in a few others.
Input:
Admitted immediately,,BLAHBLAH,0,NaN,0,0,0,0,0.0,,,,
I am using fillna to replace all null and NaN values with 0, because these are numeric fields in the target table.
Code I tried:
for col in df:
    df[col].fillna(0,inplace=True)
df.fillna(0,inplace=True)
Output:
'Admitted immediately','0','BLAHBLAH','0','NaN','0','0','0','0','0.0','0','0',' 0'
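How can I make sure that every NaN value in my DataFrame is changed to 0? The table the data is loaded into rejects the rows because of the NaN values.
I switched from processing the REST API data row by row to using a DataFrame, under the impression that a DataFrame makes the data easier to work with. If fillna does not work, is there a better way to massage the data in df without iterating row by row?
Update: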
df = pd.io.json.json_normalize(outJSON['queries'])
fname = "WithouFilna_1.csv"
df.to_csv(fname)
df.fillna(0,inplace=True)
filename ="fillna_1.csv"
df.to_csv(filename)
I wrote df out to CSV before and after calling df.fillna. Some fields changed, but not all of them.
Before:
859,Unknown,,,2,0,xxxx,RESET_METADATA,,,,,,,,,,,,,,
860,Admitted immediately,0,,1,2,xxxx,,0,,NaN,0,0,,0
861,Admitted immediately,0,,0,0,xxxx,,0,,NaN,0,0,,0
After:
859,Unknown,0,0,2,0,xxxx,RESET_METADATA,0,,0,0,0,0,0,0,0,0,0,0,0
860,Admitted immediately,0,0,1,2,xxx,0,0,,NaN,0,0,0,0,0,0,0,0,0
861,Admitted immediately,0,0,0,0,xxx,0,0,,NaN,0,0,0,0,0,0,0,0,0
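One possible reading of the before/after rows is that fillna replaced the truly missing cells but left cells holding the literal text NaN, or an empty string, untouched, since fillna only acts on real missing values. A minimal sketch to check this on a small frame; the `sample` data and its column names are made up for illustration and are not the real API output:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: object columns mixing real missing values,
# the literal text "NaN", and empty strings.
sample = pd.DataFrame({
    "admission_result": ["Unknown", "Admitted immediately"],
    "bytes_streamed": [np.nan, "0"],
    "ddl_type": ["RESET_METADATA", ""],
    "hdfs_bytes_read": [np.nan, "NaN"],
}, dtype=object)

real_missing = int(sample.isna().sum().sum())   # cells fillna will replace
text_nan = int((sample == "NaN").sum().sum())   # literal text, not missing
empty_text = int((sample == "").sum().sum())    # empty strings, not missing

filled = sample.fillna(0)
still_text_nan = int((filled == "NaN").sum().sum())

print(real_missing, text_nan, empty_text, still_text_nan)
```

If the same counts on the real df show a non-zero number of "NaN" strings after fillna, that would explain the NaN that survives in the CSV.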
df.dtypes output:
attributes.admission_result object
attributes.admission_wait object
attributes.bytes_streamed object
attributes.client_fetch_wait_time object
attributes.client_fetch_wait_time_percentage object
attributes.connected_user object
attributes.ddl_type object
attributes.estimated_per_node_peak_memory object
attributes.file_formats object
attributes.hdfs_average_scan_range object
attributes.hdfs_bytes_read object
attributes.hdfs_bytes_read_from_cache object
attributes.hdfs_bytes_read_from_cache_percentage object
attributes.hdfs_bytes_read_local object
attributes.hdfs_bytes_read_local_percentage object
attributes.hdfs_bytes_read_remote object
attributes.hdfs_bytes_read_remote_percentage object
attributes.hdfs_bytes_read_short_circuit object
attributes.hdfs_bytes_read_short_circuit_percentage object
attributes.hdfs_scanner_average_bytes_read_per_second object
df.values[5:6, :15]
array([['Unknown', nan, nan, '1', '8', 'xxxxx',
'SHOW_DBS', nan, '', nan, nan, nan, nan, nan, nan]], dtype=object)
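Since every column is object dtype and the values mix real nan with strings such as '' and '1', one option worth trying is to convert the numeric columns with pd.to_numeric before filling. This is only a sketch under assumptions: the `sample` frame and the `numeric_cols` list below are hypothetical stand-ins, and the real list would have to match the numeric fields of the target table.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kind of mix seen in df.values.
sample = pd.DataFrame({
    "admission_result": ["Unknown", "Admitted immediately", "Unknown"],
    "admission_wait": [np.nan, "0", ""],
    "ddl_type": ["SHOW_DBS", "", "RESET_METADATA"],
    "hdfs_bytes_read": [np.nan, "NaN", "8"],
}, dtype=object)

# Assumption: these are the columns that are numeric in the target table.
numeric_cols = ["admission_wait", "hdfs_bytes_read"]

cleaned = sample.copy()
for col in numeric_cols:
    # errors="coerce" turns text that cannot be parsed as a number into NaN,
    # so a single fillna(0) afterwards covers every non-numeric cell.
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce").fillna(0)

print(cleaned.dtypes)
print(cleaned)
```

The loop runs over column names, not rows, so the work stays vectorized. Text columns such as ddl_type are left alone here; whether they should also receive 0 depends on the target table.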