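Edit: Thanks to those who've responded so far. I'm a beginner with R and have just taken on a large project for my MSc dissertation, so I'm a bit overwhelmed by the initial processing. The data I'm using is as follows (from WMO publicly available rainfall data):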


120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0
1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03
1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03
1901 0.03 0.03 0.03 0.03 0.03 17.03 23.03 17.03 0.03 8.03 0.03 0.03
(...)
120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0
1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03
1921 0.03 0.03 0.03 0.00 0.03 0.00 88.00 57.00 35.00 18.50 0.01 0.03
1922 0.03 0.03 0.03 0.00 0.03 0.00 87.50 102.30 10.40 15.20 0.01 0.03
(...)

There are ~100 observation stations that I'm interested in, each of which has a varying start and end date for rainfall measurements. They're formatted as above in a single data file, with stations separated by "120 (station number) (station name)".

I need first to separate this file by station, then to extract March, April, May and June for each year, then take a total of these months for each year. So far I'm messing around with loops (as below), but I understand this isn't the right way to go about it and would rather learn some better technique. Thanks again for the help!

(Original question:) I've got a large data set containing rainfall by season for ~100 years over 100+ locations. I'm trying to separate this data into more manageable arrays, and in particular I want to retrieve the sum of the rainfall for March, April, May and June for each station for each year. The following is a simplified version of my code so far:

a <- array(1,dim=c(10,12))
for (i in 1:5) {

  #all data:
  assign(paste("station_",i,sep=""), a)

  #march - june data:
  assign(paste("station_",i,"_mamj",sep=""), a[,4:7])
}

So this gives me station_(i)_mamj, which contains the data for the months I'm interested in for each station. Now I want to sum each row of this array and enter it in a new array called station_(i)_mamj_tot. Simple enough in theory, but I can't work out how to reference station_(i)_mamj so that the value of i varies with each iteration. Any help is much appreciated!

3 Answers

This is absolutely begging for a data frame; then it's just a one-liner with a power tool like ddply (very powerful):

tot_mamj <- ddply(rain[rain$month %in% 3:6,-2], 'year', colwise(sum))

which gives the M/A/M/J totals by year:

   year station_1 station_2 station_3 station_4 station_5 ...
1  1972  8.618960  5.697739 10.083192  9.264512 11.152378 ...
2  1973 18.571748 18.903280 11.832462 18.262272 10.509621 ...
3  1974 22.415201 22.670821 32.850745 31.634717 20.523778 ...
4  1975 16.773286 17.683704 18.259066 14.996550 19.007762 ...
...

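Below is complete working code. We create a data frame whose col.names are 'station_n', plus extra columns for year and month (a factor, or an integer if you're lazy; see the footnote). Now you can do arbitrary analysis by month or by year (using plyr's split-apply-combine paradigm):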

require(plyr) # for d*ply, summarise
#require(reshape) # for melt

# Parameterize everything here, it's crucial for testing/debugging
all_years <- c(1970:2011)
nYears <- length(all_years)  
nStations <- 101
# We want station names as vector of chr (as opposed to simple indices)
station_names <- paste ('station_', 1:nStations, sep='')

# One row per (year, month) pair: each year repeated 12 times, months cycling 1..12
rain <- data.frame(
  year=rep(all_years, each=12),
  month=rep(1:12, times=nYears)
)
# Fill in NAs for all data
rain[,station_names] <- as.numeric(NA)
# Make 'month' a factor, to prevent any numerical funny stuff e.g accidentally 'aggregating' it
rain$month <- factor(rain$month)

# For convenience, store the row indices for all years, M/A/M/J
I.mamj <- which(rain$month %in% 3:6)

# Insert made-up seasonal data for M/A/M/J for testing... leave everything else NA intentionally
rain[I.mamj,station_names] <- c(3,5,9,6) * runif(4*nYears*nStations)

# Get our aggregate of MAMJ totals, by year
# The '-2' column index means: "exclude month, to prevent it also getting 'aggregated'"
excludeMonthCol = -2
tot_mamj <- ddply(rain[rain$month %in% 3:6, excludeMonthCol], 'year', colwise(sum))

# voila!!
#    year station_1 station_2 station_3 station_4 station_5
# 1  1972  8.618960  5.697739 10.083192  9.264512 11.152378
# 2  1973 18.571748 18.903280 11.832462 18.262272 10.509621
# 3  1974 22.415201 22.670821 32.850745 31.634717 20.523778
# 4  1975 16.773286 17.683704 18.259066 14.996550 19.007762

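As a footnote, before I converted month from numeric to factor, it was silently getting 'aggregated' (until I put in the '-2' exclude-column reference). Better still, once you make it a factor, it refuses to be aggregated at all and throws an error (which is desirable for debugging):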

 ddply(rain[rain$month %in% 3:6, ], 'year', colwise(sum))
Error in Summary.factor(c(3L, 3L, 3L, 3L, 3L, 3L), na.rm = FALSE) : 
  sum not meaningful for factors
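
The code above uses made-up data. For the file layout shown in the question, a rough base-R sketch of the split-then-sum step might look like the following. It rests on several assumptions that are not confirmed by the source: that every station header starts with "120 ", that each data row is a year followed by 12 monthly values, and that 9999 marks a missing month. The `lines` vector and the helper `parse_station` are hypothetical names; a real file would be read with readLines() instead.

```r
# Hypothetical sample in the layout shown in the question
lines <- c(
  "120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0",
  "1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03",
  "1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03",
  "120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0",
  "1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03"
)
lines <- trimws(lines)

is_header <- grepl("^120 ", lines)   # assumption: headers start with "120 "
block_id  <- cumsum(is_header)       # running counter: one id per station block

# Hypothetical helper: turn one station block into (station, year, mamj_tot)
parse_station <- function(block) {
  header <- strsplit(block[1], " +")[[1]]
  # Data rows: year followed by 12 monthly values
  m <- do.call(rbind, lapply(strsplit(block[-1], " +"), as.numeric))
  m[m == 9999] <- NA                 # assumption: 9999 marks a missing month
  data.frame(station  = header[2],   # second header field = station number
             year     = m[, 1],
             mamj_tot = rowSums(m[, 4:7, drop = FALSE]))  # cols 4:7 = Mar..Jun
}

mamj <- do.call(rbind, lapply(split(lines, block_id), parse_station))
```

With this layout a missing month inside March to June makes that year's total NA; whether that or na.rm=TRUE is preferable depends on the analysis.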
answered 2012-05-14T20:54:13.917

For your original question, use get():

i <- 10
var <- paste("test", i, sep="_")
assign(var, 10)  # assign(name, value): the variable name comes first
get(var)

As David said, this is probably not the best approach, but it can be useful at times (and IMO the assign/get construct is far better than eval(parse)).
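
Applied to the loop in the question, a minimal sketch might look like this. It reuses the question's placeholder array and column range; the intermediate name `mamj_name` is introduced here only for illustration.

```r
a <- array(1, dim = c(10, 12))   # placeholder data from the question
for (i in 1:5) {
  mamj_name <- paste("station_", i, "_mamj", sep = "")
  # column range copied from the question
  assign(mamj_name, a[, 4:7])
  # get() looks the variable up by its constructed name; rowSums() totals each row
  assign(paste(mamj_name, "_tot", sep = ""), rowSums(get(mamj_name)))
}
station_3_mamj_tot
```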

answered 2012-05-14T17:44:36.317

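Why are you using assign to create variables like station1, station2, station_3_mamj, etc.? It would be much easier and more intuitive to store them in lists, like stations[[1]], stations[[2]], stations_mamj[[3]], and so on. Each one can then be accessed by its index.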
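
A minimal sketch of the list approach, using placeholder matrices with one column per month (so March to June are columns 3:6; the question's 4:7 would apply if column 1 held the year):

```r
a <- array(1, dim = c(10, 12))                    # placeholder: 10 years x 12 months
stations <- lapply(1:5, function(i) a)            # one matrix per station
stations_mamj <- lapply(stations, function(m) m[, 3:6])
stations_mamj_tot <- lapply(stations_mamj, rowSums)  # yearly M/A/M/J totals per station
stations_mamj_tot[[3]]
```

No variable names need to be pasted together; the loop index is just the list index.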

Since each station's data looks like a matrix of the same size, you could even handle them all as a single 3-dimensional array.
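
As a sketch, assuming the dimensions are ordered years x months x stations and every station covers the same years (the name `rain3d` is made up for this example):

```r
rain3d <- array(1, dim = c(10, 12, 5))   # placeholder: years x months x stations
# Keep months 3:6 (March to June), then sum over the month dimension,
# leaving a years x stations matrix of totals
mamj_tot <- apply(rain3d[, 3:6, , drop = FALSE], c(1, 3), sum)
dim(mamj_tot)
```

Note that the stations in the question have different start and end years, so a real array would need NA padding to a common range of years.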

ETA: Incidentally, if you really want to solve the problem this way, you would do:

eval(parse(text=paste("station", i, "mamj", sep="_")))

But using eval is almost always bad practice, and it makes even simple operations on your data difficult.

answered 2012-05-14T17:21:40.913