
I have a pandas DataFrame with the following data types:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 579585 entries, 0 to 579613
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   itemName     579585 non-null  object        
 1   itemId       579585 non-null  string        
 2   Count        579585 non-null  int32         
 3   Sales        579585 non-null  float64       
 4   Date         579585 non-null  datetime64[ns]
 5   Unit_margin  579585 non-null  float64       
 6   GrossProfit  579585 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int32(1), object(1), string(1)
memory usage: 33.2+ MB

I upload it to a BigQuery table using:

df_extended_full.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>', chunksize=None, if_exists='append', auth_local_webserver=False, location=None, progress_bar=True)

Everything seems to work fine, except that the itemId column goes from string to float, so all leading zeros (which I need) are dropped (wherever they occur).

I could of course define a schema for my table, but I would like to avoid that. What am I missing?


1 Answer


The problem is in the to_gbq step. For some reason its output omits the quotes around the string field, and without quotes BigQuery changes the data type to a number.

BigQuery needs this format:

{"itemId": "12345", "mappingId":"abc123"}

You sent this format:

{"itemId": 12345, "mappingId":abc123}

One solution in this case is to cast the itemId field in pandas using the astype method; see the pandas documentation on astype for more details.

Here is an example:

df['externalId'] = df['externalId'].astype('str')
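
Applied to the question's DataFrame, a minimal sketch (assuming the same df_extended_full, itemId column, and dataset/project placeholders from the question) would be:

df_extended_full['itemId'] = df_extended_full['itemId'].astype(str)  # cast to plain Python strings
df_extended_full.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>', if_exists='append')  # itemId should now load as STRING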

Another option is to use the table_schema parameter of the to_gbq method and list the BigQuery table fields to which the DataFrame columns should conform:

[{'name': 'col1', 'type': 'STRING'},...]
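
For instance, a minimal sketch that only overrides the question's itemId column (column, dataset, and project names taken from the question; columns not listed are still inferred from the DataFrame dtypes):

table_schema = [{'name': 'itemId', 'type': 'STRING'}]  # force itemId to STRING
df_extended_full.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>',
                        if_exists='append', table_schema=table_schema)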

As a last option, you can switch to google-cloud-bigquery instead of pandas-gbq; see this comparison of the two libraries.
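
A minimal sketch with google-cloud-bigquery (again assuming the question's DataFrame and placeholders; load_table_from_dataframe also requires the pyarrow package):

from google.cloud import bigquery

client = bigquery.Client(project='<MY PROJECT>')
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField('itemId', 'STRING')],  # pin itemId to STRING; other columns are inferred
    write_disposition='WRITE_APPEND',
)
job = client.load_table_from_dataframe(df_extended_full, '<MY DATASET>.profit', job_config=job_config)
job.result()  # wait for the load job to finish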

answered 2022-01-27T18:09:26.743