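I have started using RStudio, but I can't understand how the combinat::combn() function works. In particular, I don't see how to use it in the following academic exercise:
Use the dataset "mtcars" and answer:
1- Build an OLS model with fuel efficiency as the outcome variable; this time we limit the number of predictors.
2- Of the 10 predictors, limit the number used in a model to between 1 and 5.
3- Create a routine that iterates over all possible permutations and infer which "TOP-3" models are likely to perform best in terms of effect size and significance.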
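For scale, here is a quick count of how many candidate models the exercise implies. This assumes that the order of predictors does not matter for the fitted model, so "permutations" is read as combinations; that reading is my assumption, not something the exercise states.

```r
# Number of distinct predictor subsets of size 1 to 5 drawn from 10 predictors.
# Assumption: order is irrelevant, so subsets are counted with choose().
n_models_by_size <- choose(10, 1:5)   # 10 45 120 210 252
n_models <- sum(n_models_by_size)
print(n_models)                       # 637
```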
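The steps I want to follow are:
1- First, I have to get all the permutations and evaluate the models with a suitable KPI. (From what I have read so far in the documentation for combinat::combn(), I think the whole thing needs to be wrapped in a loop.) -> The problem is that I have searched the internet without success for an example that would show me how to start building the solution. I explored the data in several ways and ended up using the "lm" function, but I don't know how to use it together with "combn"... or perhaps combn is not the best or most appropriate choice.
2- Then select the best KPI.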
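A minimal sketch of what step 1 might look like is given here, to make the idea concrete. It uses base R's utils::combn() instead of combinat::combn(), and the KPIs it collects (adjusted R-squared, AIC, and the largest coefficient p-value) are illustrative assumptions, not something the exercise prescribes. The names `predictors`, `subsets`, `results`, and `top3` are made up for this sketch.

```r
# Sketch: enumerate every predictor subset of size 1..5 and fit one lm() each.
predictors <- setdiff(names(mtcars), "mpg")   # the 10 candidate predictors

# utils::combn(x, m, simplify = FALSE) returns a list of all size-m subsets.
# unlist(..., recursive = FALSE) flattens the per-size lists into one list.
subsets <- unlist(
  lapply(1:5, function(k) utils::combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each model and keep a few KPIs (the choice of KPIs is an assumption).
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  s <- summary(fit)
  data.frame(
    formula = paste("mpg ~", paste(vars, collapse = " + ")),
    n_pred  = length(vars),
    adj_r2  = s$adj.r.squared,
    aic     = AIC(fit),
    max_p   = max(coef(s)[-1, "Pr(>|t|)"]),  # largest p-value, intercept excluded
    stringsAsFactors = FALSE
  )
}))

# Rank by adjusted R-squared (one possible KPI) and keep the three best models.
top3 <- head(results[order(-results$adj_r2), ], 3)
print(top3)
```

Ranking by `aic` (ascending) or filtering on `max_p` first would give a different "TOP-3"; which criterion fits the exercise is a judgment call.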
Below is what I built to explore the data, given my limited knowledge of RStudio, together with what I managed to do using various internet resources:
version # I include the R version I am using in case it matters later
# platform x86_64-w64-mingw32
# arch x86_64
# os mingw32
# system x86_64, mingw32
# status
# major 4
# minor 0.4
# year 2021
# month 02
# day 15
# svn rev 80002
# language R
# version.string R version 4.0.4 (2021-02-15)
# nickname Lost Library Book.
library(tidyverse)
# 25-03-2021
# -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.3.0 --
# v ggplot2 3.3.3 v purrr 0.3.4
# v tibble 3.1.0 v dplyr 1.0.5
# v tidyr 1.1.3 v stringr 1.4.0
# v readr 1.4.0 v forcats 0.5.1
# -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
# x dplyr::filter() masks stats::filter()
# x dplyr::lag() masks stats::lag()
data("mtcars")
view(mtcars)
?mtcars
#A data frame with 32 observations on 11 (numeric) variables.
#[, 1] mpg Miles/(US) gallon --------->>>>>>> (fuel consumption efficiency) <<<<<<<-------------------
#[, 2] cyl Number of cylinders
#[, 3] disp Displacement (cu.in.)
#[, 4] hp Gross horsepower
#[, 5] drat Rear axle ratio
#[, 6] wt Weight (1000 lbs)
#[, 7] qsec 1/4 mile time
#[, 8] vs Engine (0 = V-shaped, 1 = straight)
#[, 9] am Transmission (0 = automatic, 1 = manual)
#[,10] gear Number of forward gears
#[,11] carb Number of carburetors
summary(mtcars) # I explore the data a bit:
#mpg cyl disp hp drat wt qsec
#Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
#1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89
#Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 Median :17.71
#Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85
#3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
#Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
#vs am gear carb
#Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
#1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
#Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
#Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812
#3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
#Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
df <- mtcars # assign the data to df so the dataset name is easier to remember
view(df)
print(df)
# I look at the correlation between each variable and "mpg"
cor.test(df$cyl, df$mpg)  # -0.852162 --> relevant
cor.test(df$disp, df$mpg) # -0.8475514 --> relevant
cor.test(df$hp, df$mpg)   # -0.7761684
cor.test(df$drat, df$mpg) # 0.6811719
cor.test(df$wt, df$mpg)   # -0.8676594 --> relevant
cor.test(df$qsec, df$mpg) # 0.418684
cor.test(df$vs, df$mpg)   # 0.6640389
cor.test(df$am, df$mpg)   # 0.5998324
cor.test(df$gear, df$mpg) # 0.4802848
cor.test(df$carb, df$mpg) # -0.5509251
# I build the "lm" with all the variables to further explore the data
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
+ carb, df, na.action = na.exclude)
anova(model)
# Response: mpg
# Df Sum Sq Mean Sq F value Pr(>F)
# cyl 1 817.71 817.71 116.4245 5.034e-10 ***
# disp 1 37.59 37.59 5.3526 0.030911 *
# hp 1 9.37 9.37 1.3342 0.261031
# drat 1 16.47 16.47 2.3446 0.140644
# wt 1 77.48 77.48 11.0309 0.003244 **
# qsec 1 3.95 3.95 0.5623 0.461656
# vs 1 0.13 0.13 0.0185 0.893173
# am 1 14.47 14.47 2.0608 0.165858
# gear 1 0.97 0.97 0.1384 0.713653
# carb 1 0.41 0.41 0.0579 0.812179
# Residuals 21 147.49 7.02
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(model)
# Call:
# lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
# am + gear + carb, data = df, na.action = na.exclude)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.4506 -1.6044 -0.1196 1.2193 4.6271
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 12.30337 18.71788 0.657 0.5181
# cyl -0.11144 1.04502 -0.107 0.9161
# disp 0.01334 0.01786 0.747 0.4635
# hp -0.02148 0.02177 -0.987 0.3350
# drat 0.78711 1.63537 0.481 0.6353
# wt -3.71530 1.89441 -1.961 0.0633 .
# qsec 0.82104 0.73084 1.123 0.2739
# vs 0.31776 2.10451 0.151 0.8814
# am 2.52023 2.05665 1.225 0.2340
# gear 0.65541 1.49326 0.439 0.6652
# carb -0.19942 0.82875 -0.241 0.8122
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.65 on 21 degrees of freedom
# Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
# F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
# I build the diagnostic plots to visualize the fitted "lm"
par(mfrow = c(2, 2))
plot(model, pch = 16, col = "blue")
# As you can see, everything is very basic; now I am trying to start on the solution
# to the question. If you can point me to how to start, I would appreciate it a lot. I know almost nothing about R
# and hope to learn as much as I can. Thank you very much in advance for any help.