r - splitting a XDF File / Dataset for training and testing

Question

Is it possible to split an .xdf file (in the Microsoft RevoScaleR context) into, let's say, a 75% training set and a 25% test set? I know there is a function called rxSplit(), but the documentation doesn't seem to apply to this case. Most of the examples online assign a column of random numbers to the dataset and split it using that column.

Thanks. Thomas

score 1 · Accepted Answer

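You can certainly use rxSplit for this. Create a variable that defines your training and test samples, and then split on it.

For example, using the mtcars toy dataset: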

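# Write mtcars to an xdf file; rxDataStep returns a data source pointing at it
xdf <- rxDataStep(mtcars, "mtcars.xdf")

# runif(.rxNumRows) draws one uniform number per row of the current chunk,
# so each row is flagged as test (TRUE) with probability 0.25
xdfList <- rxSplit(xdf, splitByFactor="test",
    transforms=list(test=factor(runif(.rxNumRows) < 0.25, levels=c("FALSE", "TRUE"))))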

xdfList is now a list containing 2 xdf data sources: one with (approximately) 75% of the data, and the other with 25%.
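
To see why the split is only approximately 75/25, here is a minimal base R sketch of the same idea on the in-memory mtcars data frame. It does not use RevoScaleR at all; the names isTest, parts, train and test and the seed value are just illustrative assumptions.

# Base R analogue of the rxSplit call above (no xdf files involved)
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Each row independently lands in the test set with probability 0.25
isTest <- factor(runif(nrow(mtcars)) < 0.25, levels = c("FALSE", "TRUE"))

# split() returns a list with one data frame per factor level
parts <- split(mtcars, isTest)
train <- parts[["FALSE"]]
test  <- parts[["TRUE"]]

# The sizes vary with the random draw, but always add up to the full data set
cat("train rows:", nrow(train), " test rows:", nrow(test), "\n")

Because every row is assigned independently, the proportions are exact only in expectation; on a small dataset like mtcars they can be noticeably off, while on a large xdf file they will be close to 75/25.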

score 0 · Accepted Answer

You can use rxDataStep to create training and test datasets from the original xdf. See this example: https://docs.microsoft.com/en-us/r-server/r/how-to-revoscaler-linear-model

# Location of the source xdf and names of the two output files
bigDataDir <- "C:/MRS/Data"
sampleAirData <- file.path(bigDataDir, "AirOnTime7Pct.xdf")
trainingDataFile <- "AirlineData06to07.xdf"
targetInfile <- "AirlineData08.xdf"

# Training set: rows from 1999 through 2007 (a split by year, not a random split)
rxDataStep(sampleAirData, trainingDataFile, rowSelection = Year == 1999 |
    Year == 2000 | Year == 2001 | Year == 2002 | Year == 2003 |
    Year == 2004 | Year == 2005 | Year == 2006 | Year == 2007)
# Test set: rows from 2008 only
rxDataStep(sampleAirData, targetInfile, rowSelection = Year == 2008)
