r - Select data from each row based on row quantiles in R

Question

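I have a dataset with 60 rows and 3036 columns. I computed the row quantiles with the rowQuantiles function from the matrixStats package, which gives me a column vector [60,1]. Now I want to select, from each row, only the data above that row's quantile. If I use the which function as follows: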

datset_qu95 = which(dataset > rowQuantiles(dataset, probs=c(0.95)))

then I lose the data dimensions and get a plain vector of indices instead of a matrix with dimensions [60,152].

Can anyone help me?

Thanks!

score 0 · Accepted Answer

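0.05 * 3036 = 151.8, but selecting the values greater than the 95% quantile in each row does not mean you will systematically get 152 values. If you want to keep the dimensions of your object, you can try replacing the unwanted values with NAs.
Since your object is not large, you could also use a data frame object and work along the row dimension.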

library(matrixStats)

# To extract your values...
myfun <- function(k, q){x[k, x[k,] > q]}
x <- matrix(sample(1:100, 60*3036, replace=TRUE), ncol=3036)
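# SIMPLIFY=FALSE guarantees a list even if every row yields the same number of values
xx <- mapply(myfun, seq_len(nrow(x)), rowQuantiles(x, probs=.95), SIMPLIFY=FALSE)
# xx is a list, xx[[1]] contains the values of x[1,] > quantile(x[1, ], .95)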

# The number of selected values depends on their distribution - with NORM should be stable
x11() ; par(mfrow=c(2,1))
hist(sample(1:100, 60*3036, replace=TRUE)) # UNIF DISTRIB
n.val <- sapply(xx, length)
hist(n.val, xlab="n.val > q_95%")
abline(v=152, col="red", lwd=5)

# Assuming you want the same number of value for each row
n <- min(n.val)
myfun <- function(x){sample(x, n)} # Representative sample - ordering is possible but introduces bias. Depends on your goals
xx <- t(sapply(xx, myfun))
dim(xx) # 60 n
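
The NA replacement suggested at the start of this answer can be sketched as follows. This is only a minimal illustration, not the answerer's code: it uses base R's `quantile` through `apply` in place of `rowQuantiles` so that it runs without extra packages, and the names `q95`, `x_masked` and `n_kept` are made up for the example.

```r
set.seed(1)  # illustrative random data, same shape as in the question
x <- matrix(sample(1:100, 60 * 3036, replace = TRUE), ncol = 3036)

# Per-row 95% quantile (assumed stand-in for rowQuantiles(x, probs = .95))
q95 <- apply(x, 1, quantile, probs = 0.95)

# q95 has one entry per row; R recycles it down each column,
# so every element x[i, j] is compared with q95[i]
x_masked <- x
x_masked[x <= q95] <- NA

dim(x_masked)                        # still 60 rows and 3036 columns
n_kept <- rowSums(!is.na(x_masked))  # how many values survive in each row
```

The matrix keeps its shape, and `n_kept` shows that the number of surviving values varies from row to row, which is why a [60,152] matrix cannot be guaranteed.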

score 0 · Accepted Answer

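I don't think the rowQuantiles function is needed. Just select the values beyond the top probability threshold. (Edit note: the index expression in the first version was incorrect.)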

> apply( dat, 1, function(x) x[order(x)][1:( (1-0.95)*ncol(dat))])
    obs1     obs2     obs3 
 11.5379 856.3470 136.8860

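As usual, because R matrices are column-oriented, you may want to use t() on the result to bring it back to the row orientation you expect.

In response to your comment: fixed so that it takes the highest values instead of the lowest: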

 apply( dat, 1, function(x)
                  x[order(x, decreasing=TRUE)][1:( (1-0.95)*ncol(dat))])
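
For a matrix of the asker's size, `(1-0.95)*ncol(dat)` evaluates to about 151.8, and `1:151.8` stops at 151. If exactly 152 values per row are wanted, one option is to fix the count explicitly and transpose the result. The sketch below assumes random data in place of the real `dat`; `k` and `top_k` are hypothetical names.

```r
set.seed(42)  # illustrative data standing in for the 60 x 3036 dataset
dat <- matrix(rnorm(60 * 3036), nrow = 60)

# Round the non-integer count up: 0.05 * 3036 = 151.8, so k is 152
k <- ceiling(0.05 * ncol(dat))

# apply() returns one column per input row, hence the t()
top_k <- t(apply(dat, 1, function(row) sort(row, decreasing = TRUE)[seq_len(k)]))

dim(top_k)  # 60 152
```

Each row of `top_k` holds the `k` largest values of the matching row of `dat`, in decreasing order. With ties or a different rounding choice the count will not line up exactly with "values above the 95% quantile", so which rule to use depends on the goal.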
