I have meteorological data from many sites compiled together into one array, with the columns Station, Year, Month, and Rainfall respectively (the rows of nan nan nan nan appear because the .csv file has string headers separating each station's data set). The full data set covers about 111 stations, each with 40+ years of rainfall records; below is just the subset I am experimenting with -
[[ nan nan nan nan]
[ 1.47130000e+04 1.96800000e+03 1.00000000e+00 2.79000000e+01]
[ 1.47130000e+04 1.96800000e+03 2.00000000e+00 1.30700000e+02]
[ 1.47130000e+04 1.96800000e+03 3.00000000e+00 8.49000000e+01]
[ 1.47130000e+04 1.96800000e+03 4.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 1.96800000e+03 5.00000000e+00 2.41000000e+01]
[ 1.47130000e+04 1.96800000e+03 6.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 1.96800000e+03 7.00000000e+00 3.45000000e+01]
[ 1.47130000e+04 2.00900000e+03 3.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.00900000e+03 4.00000000e+00 5.65000000e+01]
[ 1.47130000e+04 2.00900000e+03 5.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.00900000e+03 6.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.00900000e+03 7.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.00900000e+03 8.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.00900000e+03 9.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.00900000e+03 1.00000000e+01 0.00000000e+00]
[ 1.47130000e+04 2.00900000e+03 1.10000000e+01 6.20000000e+00]
[ 1.47130000e+04 2.01000000e+03 1.00000000e+00 2.33300000e+02]
[ 1.47130000e+04 2.01000000e+03 2.00000000e+00 8.71000000e+01]
[ 1.47130000e+04 2.01000000e+03 3.00000000e+00 4.08000000e+01]
[ 1.47130000e+04 2.01000000e+03 4.00000000e+00 9.62000000e+01]
[ 1.47130000e+04 2.01000000e+03 5.00000000e+00 2.21000000e+01]
[ 1.47130000e+04 2.01000000e+03 6.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.01000000e+03 7.00000000e+00 2.20000000e+00]
[ 1.47130000e+04 2.01000000e+03 8.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.01000000e+03 9.00000000e+00 0.00000000e+00]
[ 1.47130000e+04 2.01000000e+03 1.00000000e+01 8.60000000e+00]
[ 1.47130000e+04 2.01000000e+03 1.10000000e+01 1.63000000e+01]
[ 1.47130000e+04 2.01100000e+03 1.00000000e+00 1.10800000e+02]
[ 1.47130000e+04 2.01100000e+03 2.00000000e+00 6.76000000e+01]
[ 1.47130000e+04 2.01100000e+03 3.00000000e+00 1.98000000e+02]
[ 1.47130000e+04 2.01100000e+03 6.00000000e+00 4.10000000e+00]
[ 1.47130000e+04 2.01100000e+03 1.00000000e+01 2.52000000e+01]
[ 1.47130000e+04 2.01100000e+03 1.10000000e+01 4.17000000e+01]
[ 1.47130000e+04 2.01200000e+03 1.00000000e+00 2.13600000e+02]
[ 1.47130000e+04 2.01200000e+03 2.00000000e+00 7.44000000e+01]
[ 1.47130000e+04 2.01200000e+03 3.00000000e+00 9.14000000e+01]
[ 1.47130000e+04 2.01200000e+03 4.00000000e+00 1.70000000e+01]
[ 1.47130000e+04 2.01200000e+03 5.00000000e+00 1.56000000e+01]
[ 1.47130000e+04 2.01200000e+03 7.00000000e+00 4.20000000e+00]
[ 1.47130000e+04 2.01200000e+03 1.00000000e+01 3.40000000e+00]
[ 1.47130000e+04 2.01200000e+03 1.10000000e+01 7.70000000e+00]
[ nan nan nan nan]
[ 1.47320000e+04 2.00000000e+03 9.00000000e+00 0.00000000e+00]
[ 1.47320000e+04 2.00000000e+03 1.00000000e+01 8.34000000e+01]
[ 1.47320000e+04 2.00000000e+03 1.10000000e+01 1.17000000e+02]
[ 1.47320000e+04 2.00000000e+03 1.20000000e+01 4.90800000e+02]
[ 1.47320000e+04 2.00100000e+03 1.00000000e+00 1.64200000e+02]
[ 1.47320000e+04 2.00100000e+03 2.00000000e+00 6.51600000e+02]
[ 1.47320000e+04 2.00100000e+03 3.00000000e+00 1.36800000e+02]
[ 1.47320000e+04 2.00100000e+03 4.00000000e+00 1.64400000e+02]
[ 1.47320000e+04 2.01000000e+03 9.00000000e+00 0.00000000e+00]
[ 1.47320000e+04 2.01100000e+03 1.00000000e+00 1.82400000e+02]
[ 1.47320000e+04 2.01100000e+03 2.00000000e+00 3.81000000e+02]
[ 1.47320000e+04 2.01100000e+03 3.00000000e+00 4.50800000e+02]
[ 1.47320000e+04 2.01100000e+03 4.00000000e+00 3.12800000e+02]
[ 1.47320000e+04 2.01100000e+03 5.00000000e+00 0.00000000e+00]
[ 1.47320000e+04 2.01100000e+03 6.00000000e+00 0.00000000e+00]
[ 1.47320000e+04 2.01100000e+03 7.00000000e+00 1.60000000e+00]
[ nan nan nan nan]
[ 1.55030000e+04 1.96600000e+03 1.00000000e+00 6.47000000e+01]
[ 1.55030000e+04 1.96600000e+03 2.00000000e+00 1.14000000e+01]
[ 1.55030000e+04 1.96600000e+03 3.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 1.96600000e+03 4.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 1.96600000e+03 5.00000000e+00 2.80000000e+00]
[ 1.55030000e+04 1.96600000e+03 6.00000000e+00 3.47000000e+01]
[ 1.55030000e+04 1.96600000e+03 7.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 2.01100000e+03 2.00000000e+00 1.40500000e+02]
[ 1.55030000e+04 2.01100000e+03 3.00000000e+00 1.13700000e+02]
[ 1.55030000e+04 2.01100000e+03 4.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 2.01100000e+03 5.00000000e+00 4.00000000e-01]
[ 1.55030000e+04 2.01100000e+03 6.00000000e+00 8.60000000e+00]
[ 1.55030000e+04 2.01100000e+03 7.00000000e+00 2.20000000e+00]
[ 1.55030000e+04 2.01100000e+03 8.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 2.01100000e+03 9.00000000e+00 4.50000000e+00]
[ 1.55030000e+04 2.01100000e+03 1.00000000e+01 2.00000000e+00]
[ 1.55030000e+04 2.01100000e+03 1.10000000e+01 2.80000000e+01]
[ 1.55030000e+04 2.01100000e+03 1.20000000e+01 4.18000000e+01]
[ 1.55030000e+04 2.01200000e+03 1.00000000e+00 4.82000000e+01]
[ 1.55030000e+04 2.01200000e+03 2.00000000e+00 5.62000000e+01]
[ 1.55030000e+04 2.01200000e+03 3.00000000e+00 1.45600000e+02]
[ 1.55030000e+04 2.01200000e+03 4.00000000e+00 1.62000000e+01]
[ 1.55030000e+04 2.01200000e+03 5.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 2.01200000e+03 6.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 2.01200000e+03 7.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 2.01200000e+03 8.00000000e+00 0.00000000e+00]
[ 1.55030000e+04 2.01200000e+03 9.00000000e+00 2.60000000e+00]
[ 1.55030000e+04 2.01200000e+03 1.10000000e+01 2.52000000e+01]
[ 1.55030000e+04 2.01200000e+03 1.20000000e+01 1.09900000e+02]]
I need to split the data up by station (the first column). I thought I could do this with

B = np.array_split(alldata, np.where(SN == 0)[0])

(where SN is the first column, after first replacing the station numbers in that column with 0 in Excel), but that leaves the 0 nan nan nan row in each of the split arrays. I also tried

for key, items in groupby(alldata, itemgetter(0)): print(key)

but I'm not sure how to use groupby() to manipulate the split data sets any further. Once the data are split into separate arrays, I need to insert rows for the missing months and years, leaving the rainfall cell blank in each new row. I understand how to join two lists, but I'm not sure how to insert rows based on missing values in a sequence and apply that across the whole array. For example, for the data above I would like every array to have year and month columns running from 1966 to 1981, so that I can line the months up across stations.
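To illustrate the splitting and gap-filling I'm after, here is a minimal sketch on toy data (the fill_months helper is just something I made up for illustration; it assumes the nan separator rows are already gone):

```python
import numpy as np

# Toy data: station, year, month, rainfall (nan separator rows removed)
alldata = np.array([
    [14713, 2011, 1, 110.8],
    [14713, 2011, 3, 198.0],
    [14732, 2011, 1, 182.4],
    [14732, 2011, 2, 381.0],
])

# Split wherever the station number (column 0) changes
breaks = np.where(np.diff(alldata[:, 0]) != 0)[0] + 1
stations = np.split(alldata, breaks)

def fill_months(dataSet, startYear, endYear):
    """Build the full (year, month) grid for one station, keeping
    nan rainfall wherever there is no observation."""
    have = {(int(r[1]), int(r[2])): r[3] for r in dataSet}
    sn = dataSet[0, 0]
    filled = []
    for year in range(startYear, endYear + 1):
        for month in range(1, 13):
            filled.append([sn, year, month, have.get((year, month), np.nan)])
    return np.array(filled)

filled = fill_months(stations[0], 2011, 2011)
```

After this, every station's array has one row per month over the same range, with nan marking the missing rainfall values.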
Once all the data sets are the same length, I want to run regressions between the rainfall records at the different sites. For example, taking the first block as the "station of interest", I want the r^2 value of the correlation between its rainfall and each other station's rainfall in the data set. Finally, I would also like to convert each station's rainfall values into percentiles of that station's full rainfall record, and then correlate all the percentile-form rainfall series against the "station of interest" as well.
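To make the correlation and percentile idea concrete, a small sketch with made-up numbers (plain numpy here; the rank-based percentile formula is my own assumption about how to define "percentile of the station's record"):

```python
import numpy as np

rain_a = np.array([10.0, 0.0, np.nan, 25.2, 41.7])  # "station of interest"
rain_b = np.array([12.0, 3.0, 8.6, np.nan, 39.0])   # another station

# Keep only the months where both stations have data, then correlate
ok = ~np.isnan(rain_a) & ~np.isnan(rain_b)
r_value = np.corrcoef(rain_a[ok], rain_b[ok])[0, 1]
r_squared = r_value ** 2

def to_percentiles(rain):
    """Each value's rank-based percentile within its own station's
    record (0..100); nan observations stay nan. Ties not handled."""
    out = np.full_like(rain, np.nan)
    ok = ~np.isnan(rain)
    vals = rain[ok]
    ranks = vals.argsort().argsort()      # 0 .. n-1
    out[ok] = 100.0 * ranks / (len(vals) - 1)
    return out

pct_a = to_percentiles(rain_a)
```

The percentile series for each station could then be correlated against the station of interest in the same way as the raw rainfall.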
I'm not sure whether all of this makes sense - please let me know what I should add to the question (this is my first one).
Updated code, following comments and suggestions from user Brionius:

The code applies the regression fine; I'm less sure about applying the correlation with the masks. I tried the correlation without a mask,

slope, intercept, r_value, p_value, std_err = stats.linregress(rotated[0][0], rotated[i][0])

but it returned nan for all the r_values, presumably because of the nan values in the data sets.
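Presumably the fix without masks would be to drop the nan pairs before calling linregress - a minimal sketch with toy numbers:

```python
import numpy as np
from scipy import stats

# Toy series with a gap; a nan in either series poisons linregress
rain_a = np.array([1.0, 2.0, np.nan, 4.0])
rain_b = np.array([2.0, 4.0, 5.0, 8.0])

# Keep only the months where both stations have data, then regress
ok = ~np.isnan(rain_a) & ~np.isnan(rain_b)
slope, intercept, r_value, p_value, std_err = \
    stats.linregress(rain_a[ok], rain_b[ok])
```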
import numpy as np

# Import data:
alldata = np.genfromtxt("combinedWdata.csv", unpack=False, delimiter=",")

# Split the array at the 'nan' separator rows
dataSets = [x for x in
            np.array_split(alldata, np.where(np.isnan(alldata[:, 1]))[0])
            if len(x) > 0]

# Delete the rows with 'nan' in them
dataSets = [np.delete(dataSet,
                      np.where(np.isnan(dataSet).any(axis=1))[0], axis=0)
            for dataSet in dataSets]

# Year and month range to fill in
startYear = 1877
endYear = 2013
startMonth = 1
endMonth = 12
blank_rainfall_value = np.nan

# Insert rows of the form [station, year, month, nan] for all relevant
# years/months, except where a row with that year/month already exists
extendedDataSets = []
for dataSet in dataSets:
    missingMonths = [[dataSet[0][0], year, month, blank_rainfall_value]
                     for year in range(startYear, endYear + 1)
                     for month in range(startMonth, endMonth + 1)
                     if [year, month] not in dataSet[:, 1:3].tolist()]
    if len(missingMonths) > 0:
        extendedDataSets.append(np.vstack((dataSet, missingMonths)))
    else:
        extendedDataSets.append(dataSet)  # keep already-complete data sets

# Sort each array by year, then month
finalDataSets = [np.array(sorted(dataSet,
                                 key=lambda row: row[1] + row[2] / 12.0))
                 for dataSet in extendedDataSets]

# Rotate each data set so the rainfall column becomes its first row
rotated = np.array([np.rot90(dataSet) for dataSet in finalDataSets])

# Keep only the rainfall row, dropping station, year and month
rotatedDel = np.array([[dataSet[0]] for dataSet in rotated])

# Correlate the first station against every station, masking nan values
r_values = []
for m in range(len(rotatedDel)):
    r_value = np.ma.corrcoef(
        x=np.ma.array(rotatedDel[0], mask=np.isnan(rotatedDel[0])),
        y=np.ma.array(rotatedDel[m], mask=np.isnan(rotatedDel[m])),
        allow_masked=True)
    r_values.append(r_value)
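For reference, this is how I understand pulling a single r (and r^2) out of the masked correlation: given two 1-D masked series, np.ma.corrcoef returns a 2x2 matrix whose off-diagonal entry is the correlation between them (toy data below; any pair where either series is masked is ignored):

```python
import numpy as np

# Two short rainfall series; index 2 of `a` is a masked (missing) value
a = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, False])
b = np.ma.array([2.1, 3.9, 0.0, 8.2], mask=[False, False, False, False])

# 2x2 correlation matrix; [0, 1] is the correlation between a and b
matrix = np.ma.corrcoef(a, b, allow_masked=True)
r = matrix[0, 1]
r_squared = r ** 2
```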