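I am trying to find a metric for comparing dimensionality-reduction techniques, similar to what is done in this blog post: pca-vs-autoencoders-for-dimensionality-reduction.

Specifically, this comparison: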

# PCA reconstruction
pca.recon <- function(pca, x, k){
  mu <- matrix(rep(pca$center, nrow(pca$x)), nrow = nrow(pca$x), byrow = TRUE)
  recon <- pca$x[,1:k] %*% t(pca$rotation[,1:k]) + mu
  mse <- mean((recon - x)^2)
  return(list(x = recon, mse = mse))
}

xhat <- rep(NA, 10)
for(k in 1:10){
  xhat[k] <- pca.recon(pca, x_train, k)$mse
}

ae.mse <- rep(NA, 5)
for(k in 1:5){
  modelk <- keras_model_sequential()
  modelk %>%
    layer_dense(units = 6, activation = "tanh", input_shape = ncol(x_train)) %>%
    layer_dense(units = k, activation = "tanh", name = "bottleneck") %>%
    layer_dense(units = 6, activation = "tanh") %>%
    layer_dense(units = ncol(x_train))

  modelk %>% compile(
    loss = "mean_squared_error", 
    optimizer = "adam"
  )

  modelk %>% fit(
    x = x_train, 
    y = x_train, 
    epochs = 5000,
    verbose = 0
  )

  ae.mse[k] <- unname(evaluate(modelk, x_train, x_train))
}

df <- data.frame(k = c(1:10, 1:5), mse = c(xhat, ae.mse), method = c(rep("pca", 10), rep("autoencoder", 5)))
ggplot(df, aes(x = k, y = mse, col = method)) + geom_line()

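I would like to add other techniques, such as t-SNE from the Rtsne package, UMAP from the umap package, and ivis from the ivis package (not currently on CRAN, but it can be installed like this):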

devtools::install_github("beringresearch/ivis/R-package")
library(ivis)
install_ivis()

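The data input and processing are similar for all of these techniques, but some of them seem to have the MSE calculation baked into their functionality (the autoencoder, for example). Does anyone have experience with what I am trying to do?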

1 Answer

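The different decomposition methods can be seen as interchangeable gears in a statistical machine that is useful to you, its creator.

To pick the best gear, the metric you evaluate should not be about the gear itself but about how well the whole machine performs with each gear plugged in, one at a time.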

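Ignore the gear specs: You have several gears, each with its own factory-validated specs (from the package). Those numbers/summaries/specs may not be what you want. The candidate gears will not all provide the same metrics, which makes a fair comparison hard. Moreover, those metrics are all about the gear, not about your specific machine. Do not do as the blog suggests and mix the machine metric into pca.recon(). Let the gears be gears and defer the metric evaluation to the machine level.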

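Do the gears fit?: You need to check that all candidate gears actually fit inside your specific machine. The gears of your synthesis/reconstruction machine must be able to turn both ways. t-SNE is only designed to turn forward (decompose), so a meaningful reconstruction-based evaluation is not possible. The same goes for UMAP. Perhaps the whole reconstruction-loss benchmark is not the actual machine you wanted to use in the first place, but just a side project for picking a gear for another machine... If your machine is meant to draw pretty plots, a good quantitative benchmark is hard to come by. If your machine is some initial decomposition combined with a simple classifier, then the t-SNE gear would fit well, and some prediction-accuracy metric could be useful for the selection.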
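
For gears that only turn forward, one option is a machine that scores the embedding on a downstream prediction task instead of on reconstruction. Below is a minimal sketch, assuming a leave-one-out 1-nearest-neighbour classifier as the "simple classifier" and using prcomp as a stand-in gear so that it runs in base R. The names `embed_pca` and `classify_machine` are hypothetical and not from the blog post. A t-SNE or UMAP wrapper that returns an n x N matrix could be dropped in the same way.

```r
# Stand-in forward-only gear: returns an n x N embedding and nothing else.
# (hypothetical helper; prcomp is used only so the sketch runs in base R)
embed_pca <- function(X, N = 2, ...) {
  prcomp(X, rank. = N)$x
}

# Machine: embed the data, predict each row's label from its nearest
# neighbour in the embedding (leave-one-out 1-NN), and return the accuracy.
classify_machine <- function(gear, data, labels, ...) {
  Z <- gear(as.matrix(data), ...)
  D <- as.matrix(dist(Z))   # pairwise Euclidean distances in the embedding
  diag(D) <- Inf            # a point may not be its own neighbour
  nn <- apply(D, 1, which.min)
  mean(labels[nn] == labels)
}

X <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
acc <- classify_machine(embed_pca, X, labels = iris$Species, N = 2)
print(acc)
```

Here the metric (accuracy) lives in the machine, so any gear that produces an embedding can be compared on equal terms, whether or not it can reconstruct.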

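Interfacing the gears: The gears will not fit into your machine out of the box, because their sizes and shapes differ. Each gear needs its own adaptation. You may be tempted to refit the machine to each gear, that is, to copy-paste your machine code and plug in and adjust each gear directly. That is fine for a few gears. A more scalable approach is to interface only the gears, so that you can put them in a bag next to the machine and let a robot plug in one gear at a time and write you a report. This is a main selling point of frameworks such as sklearn, caret and keras. You can also code it yourself. Here is a simple example: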

rm(list=ls())
#some data
X <- iris[,c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]

#my_gear, prcomp wrapped in an interface
#any gear must have the gear(X, N, ...) signature
pca_decompose <- function(X, N=2, ...) {
  
  #implement gear forward (decompose)
  pca <- prcomp(
    X, rank. = N,
    scale = FALSE #must be FALSE, because the reconstructor below does not support re-scaling (I was lazy)
  )
  
  #implement gear backward (reconstruct)
  reconstruct <- function(Xnew = pca$x) {
    
    # a PCA reconstructor similar to the function from the blog; pca is already in the closure
    # I think the blog mistakenly referred to pca$x instead of x in places
    pca.recon <- function(x, k){
      x_recon <- x[,1:k] %*% t(pca$rotation[,1:k])
      #slightly more efficient way to re-apply the center
      for(i in seq_along(pca$center)) x_recon[,i] <- x_recon[,i] + pca$center[i]
      return(x_recon)
    }
    X_rc <- pca.recon(Xnew, k=N)
    return(X_rc)
  }
  
  #wrap up the interface
  self <- list(
    X_decomposed = pca$x,  # every gear must return its decomposition under the name X_decomposed
    reconstruct = reconstruct
  )
  
  class(self) <- c("my_pca","my_universal_gear")
  return(self)
}

#define a machine with the relevant use case
my_machine <- function(gear, data, ...) {
  dc_obj <- gear(data, ...)
  #return the reconstruction visibly instead of ending on an assignment
  dc_obj$reconstruct(dc_obj$X_decomposed)
}

#define the most useful metric
my_metric <- function(X,Y) {
  # this 'multivariate' mse, is not commonly used I think.
  # but whatever floats the boat
  mean((X-Y)^2) 
}

#define how to evaluate.
#try the gear in the machine and measure the outcome with the metric
my_evaluation <- function(gear, machine, data, metric = my_metric, ...) {
  data <- as.matrix(data)
  output <- machine(gear, data, ...)
  metric(data, output) #use the supplied metric (defaults to my_metric)
}

#useful syntactic sugar
set_params <- function(gear, ...) {
  params = list(...)
  function(...) do.call(gear,c(list(...),params))
}

#evaluate a gear
my_evaluation(
  gear = pca_decompose,
  machine = my_machine,
  data = X,
  #gear params
  N=2
)

#the same as
my_evaluation(
  gear = set_params(pca_decompose,N=2), #nice to preset gear params
  machine = my_machine,
  data = X
)

#define all gears to evaluate
#in another use case the gearbag could also be a grid of hyper-parameters to search
my_gearbag = list(
  pca_dc_N1 = set_params(pca_decompose,N=1),
  pca_dc_N2 = set_params(pca_decompose,N=2),
  pca_dc_N3 = set_params(pca_decompose,N=3),
  pca_dc_N4 = set_params(pca_decompose,N=4)
  #the autoencoder or any other gear can go in the gearbag too
)

my_robot <- function(evaluation, machine, gearbag, data) {
  results <- sapply(
    X = gearbag, #this X is not the data but a placeholder for what to iterate over
    FUN = evaluation,
    machine = machine,
    data = data #pass the function argument on, not the global X
  )
  
  report <- list(
    README = "metric results for gears",
    results = results
  )
  report
}


my_report <- my_robot(my_evaluation, my_machine, my_gearbag, X)

print(my_report)

This prints:

$README
[1] "metric results for gears"

$results
   pca_dc_N1    pca_dc_N2    pca_dc_N3    pca_dc_N4 
8.560431e-02 2.534107e-02 5.919048e-03 1.692109e-31 
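
Any other invertible method can go in the bag as long as it follows the same interface. Here is a minimal sketch, assuming a random orthogonal projection as a deliberately naive second gear. The name `random_projection_decompose` and its `seed` parameter are hypothetical, and this is not a method from the question. Since PCA minimises the squared reconstruction error among linear projections of a given rank, this gear should never score better than PCA at the same N, which makes it a handy sanity check for the robot.

```r
# A second gear with the same gear(X, N, ...) interface: project the centred
# data onto N random orthonormal directions. Naive on purpose.
random_projection_decompose <- function(X, N = 2, seed = 1, ...) {
  X <- as.matrix(X)
  center <- colMeans(X)
  set.seed(seed)  # note: this resets the global RNG state
  # orthonormal p x N basis from the QR decomposition of a random matrix
  Q <- qr.Q(qr(matrix(rnorm(ncol(X) * N), ncol = N)))
  Z <- sweep(X, 2, center) %*% Q

  # backward direction: map the embedding back and re-add the centre
  reconstruct <- function(Znew = Z) {
    sweep(Znew %*% t(Q), 2, center, FUN = "+")
  }

  self <- list(X_decomposed = Z, reconstruct = reconstruct)
  class(self) <- c("my_random_projection", "my_universal_gear")
  self
}

# stand-alone check with the same 'multivariate' MSE as my_metric
X <- as.matrix(iris[, 1:4])
mse <- function(A, B) mean((A - B)^2)

rp_mse <- sapply(1:4, function(N) {
  g <- random_projection_decompose(X, N = N)
  mse(X, g$reconstruct(g$X_decomposed))
})
print(rp_mse)
```

To include it in the comparison, add entries such as `rp_N2 = set_params(random_projection_decompose, N = 2)` to `my_gearbag`; the machine, metric and robot stay unchanged.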

Answered 2021-11-02T22:11:29.690