我在理解最小二乘均值的行为时遇到问题。下面是一个使用随机数据集来演示我的问题的玩具示例。场景是这样的:在 1999 年至 2015 年期间,有 10 个站点在秋季或春季进行了一项称为密度的度量标准的采样。
# Number of observations in data set
n.obs <- 1000
# Create dummy data set
df.tst <- data.frame(density = runif(n.obs, 0, 1),
year = as.factor(round(runif(n.obs, 1999, 2015))),
season = sample(c("Fall", "Spring"), n.obs, replace= TRUE),
station = sample(paste("Stat", 1:10), n.obs, replace= TRUE))
在这里,我随机分配了抽样,以故意代表一个不完整的数据集。由于数据集不完整,我使用最小二乘法,这样我就可以估计一个季节和一年的密度,同时尽量避免抽样偏差的影响。
# Fit linear model to data
fitted.model <- lm(density ~ year + season + station, data=df.tst)
# Calculate least square means
seasonal.model <- summary(lsmeans(fitted.model, c("year", "season")))
为了比较结果,我创建了一个我称之为“异常”的指标,即特定年份/季节的值减去所有年份的平均值(到中心值),并通过标准偏差进行归一化。这个想法是,这提供了一些标准化的度量标准,例如 2005 年春季与其他春季观察结果的差异。所以...
# Anomaly for spring
seasonal.model %>%
filter(season == "Spring") %>%
mutate(anom = (lsmean - mean(lsmean))/sd(lsmean))
这是伟大的。我不明白为什么当我为每个赛季都这样做时,我得到的答案都是一样的。例如...
# Anomaly for fall
seasonal.model %>%
filter(season == "Fall") %>%
mutate(anom = (lsmean - mean(lsmean))/sd(lsmean))
这些给,
year season lsmean SE df lower.CL upper.CL anom
1 1999 Spring 0.5966423 0.05242108 973 0.4937709 0.6995137 1.879970679
2 2000 Spring 0.4510249 0.03717688 973 0.3780688 0.5239810 -1.617190892
3 2001 Spring 0.4683691 0.03713725 973 0.3954908 0.5412474 -1.200649943
4 2002 Spring 0.4451014 0.03730148 973 0.3719008 0.5183020 -1.759450098
5 2003 Spring 0.5035975 0.03881258 973 0.4274315 0.5797635 -0.354600778
6 2004 Spring 0.5263538 0.03505567 973 0.4575604 0.5951472 0.191916022
7 2005 Spring 0.5553849 0.03703036 973 0.4827163 0.6280535 0.889129265
8 2006 Spring 0.5182714 0.03626301 973 0.4471087 0.5894341 -0.002191364
9 2007 Spring 0.4765210 0.04097960 973 0.3961024 0.5569395 -1.004874623
10 2008 Spring 0.5465877 0.04483499 973 0.4586033 0.6345721 0.677854169
11 2009 Spring 0.4959347 0.03185768 973 0.4334170 0.5584523 -0.538633207
12 2010 Spring 0.5359122 0.03934735 973 0.4586968 0.6131276 0.421471255
13 2011 Spring 0.5533493 0.03461044 973 0.4854296 0.6212690 0.840241309
14 2012 Spring 0.5465454 0.03697864 973 0.4739783 0.6191124 0.676838641
15 2013 Spring 0.5594985 0.03436268 973 0.4920650 0.6269320 0.987923047
16 2014 Spring 0.5320895 0.03825361 973 0.4570205 0.6071586 0.329666086
17 2015 Spring 0.5009818 0.04771979 973 0.4073363 0.5946274 -0.417419568
和
year season lsmean SE df lower.CL upper.CL anom
1 1999 Fall 0.5683228 0.05226823 973 0.4657514 0.6708943 1.879970679
2 2000 Fall 0.4227054 0.03638423 973 0.3513048 0.4941060 -1.617190892
3 2001 Fall 0.4400496 0.03752816 973 0.3664042 0.5136951 -1.200649943
4 2002 Fall 0.4167819 0.03741172 973 0.3433649 0.4901989 -1.759450098
5 2003 Fall 0.4752781 0.04039258 973 0.3960115 0.5545447 -0.354600778
6 2004 Fall 0.4980343 0.03492573 973 0.4294959 0.5665728 0.191916022
7 2005 Fall 0.5270654 0.03591814 973 0.4565795 0.5975514 0.889129265
8 2006 Fall 0.4899519 0.03618696 973 0.4189385 0.5609654 -0.002191364
9 2007 Fall 0.4482015 0.04026627 973 0.3691828 0.5272202 -1.004874623
10 2008 Fall 0.5182682 0.04545050 973 0.4290759 0.6074605 0.677854169
11 2009 Fall 0.4676152 0.03210064 973 0.4046207 0.5306096 -0.538633207
12 2010 Fall 0.5075927 0.03880628 973 0.4314391 0.5837464 0.421471255
13 2011 Fall 0.5250298 0.03440547 973 0.4575123 0.5925473 0.840241309
14 2012 Fall 0.5182259 0.03755452 973 0.4445287 0.5919231 0.676838641
15 2013 Fall 0.5311791 0.03463023 973 0.4632205 0.5991376 0.987923047
16 2014 Fall 0.5037701 0.03826525 973 0.4286782 0.5788620 0.329666086
17 2015 Fall 0.4726624 0.04793952 973 0.3785856 0.5667391 -0.417419568
我希望我误解了一些非常简单的事情,但是为什么春季和秋季异常对于给定年份应该完全相同,即使实际的最小二乘均值实际上是不同的。任何见解将不胜感激...
编辑: 响应 bethanyP。这是来自另一次运行,因此可能会有所不同,因为数据是随机的。
Classes ‘summary.ref.grid’ and 'data.frame': 34 obs. of 7 variables:
$ year : Factor w/ 17 levels "1999","2000",..: 1 2 3 4 5 6 7 8 9 10 ...
$ season : Factor w/ 2 levels "Fall","Spring": 1 1 1 1 1 1 1 1 1 1 ...
$ lsmean : num 0.484 0.461 0.441 0.495 0.575 ...
$ SE : num 0.046 0.0361 0.0355 0.0408 0.0347 ...
$ df : num 973 973 973 973 973 973 973 973 973 973 ...
$ lower.CL: num 0.394 0.39 0.371 0.415 0.507 ...
$ upper.CL: num 0.574 0.532 0.511 0.575 0.643 ...
- attr(*, "estName")= chr "lsmean"
- attr(*, "clNames")= chr "lower.CL" "upper.CL"
- attr(*, "pri.vars")= chr "year" "season"
- attr(*, "mesg")= chr "Results are averaged over the levels of: station" "Confidence level used: 0.95"