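I set up R Server for myself using Microsoft's "Data Science End to End Walkthrough", and their example runs fine.

The example (New York taxi data) uses non-categorical variables (distance, taxi fare, etc.) to predict a categorical variable (1 or 0, indicating whether or not a tip was paid).

I am trying to predict a similar binary output using categorical variables as input, with linear regression (the rxLinMod function), and I am getting an error.

The error says that the number of parameters does not match the number of variables, but it looks to me as if the "number of variables" is actually the number of levels within each factor (variable).

To reproduce

Create a table named example in SQL Server: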

USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE TABLE [dbo].[example](
    [Person] [nvarchar](max) NULL,
    [City] [nvarchar](max) NULL,
    [Bin] [integer] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];

Put some data into it:

insert into [dbo].[example] values ('John','London',0);
insert into [dbo].[example] values ('Paul','New York',0);
insert into [dbo].[example] values ('George','Liverpool',1);
insert into [dbo].[example] values ('Ringo','Paris',1);
insert into [dbo].[example] values ('John','Sydney',1);
insert into [dbo].[example] values ('Paul','Mexico City',1);
insert into [dbo].[example] values ('George','London',1);
insert into [dbo].[example] values ('Ringo','New York',1);
insert into [dbo].[example] values ('John','Liverpool',1);
insert into [dbo].[example] values ('Paul','Paris',0);
insert into [dbo].[example] values ('George','Sydney',0);
insert into [dbo].[example] values ('Ringo','Mexico City',0);

I also use a SQL function that returns the variables in table format, because that is what the Microsoft example requires. Create the function formatAsTable:

USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE FUNCTION [dbo].[formatAsTable] (
@City nvarchar(max)='',
@Person nvarchar(max)='')
RETURNS TABLE
AS
  RETURN
  (
  -- Add the SELECT statement with parameter references here
  SELECT
    @City AS City,
    @Person AS Person
  );

We now have a table with two categorical variables, Person and City.

Let's start predicting. In R, run the following:

library(RevoScaleR)
# Set up the database connection
connStr <- "Driver=SQL Server;Server=<servername>;Database=<dbname>;Uid=<uid>;Pwd=<password>"
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, 
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
# Set the SQL query that fetches our training data
sampleDataQuery <- "SELECT * from [dbo].[example] "
# Set up the data source
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr, 
                                colClasses = c(City = "factor",Bin="logical",Person="factor"
                                ),
                                rowsPerRead=500)    

Now, build the linear regression model.

isWonObj <- rxLinMod(Bin ~ City+Person,data = inDataSource)

Look at the model object:

isWonObj

Note that it looks like this:

...
Total independent variables: 11 (Including number dropped: 3)
...

Coefficients:
                           Bin
(Intercept)       6.666667e-01
City=London      -1.666667e-01
City=New York     4.450074e-16
City=Liverpool    3.333333e-01
City=Paris        4.720871e-16
City=Sydney      -1.666667e-01
City=Mexico City       Dropped
Person=John      -1.489756e-16
Person=Paul      -3.333333e-01
Person=George          Dropped
Person=Ringo           Dropped

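It says there are 11 variables, which is fine, since that is the sum of the levels in the factors (6 cities + 4 people) plus the intercept.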

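Now, when I try to predict the Bin value based on City and Person, I get an error.

First I format the City and Person I want to predict for as a table. Then I use that table as the input to the prediction.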

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))

If you inspect the pred object, it looks as expected:

> head(pred)
    City Person
1 London George

Now, when I try to predict, I get an error.

scoredOutput <- RxSqlServerData(
  connectionString = connStr,
  table = "binaryOutput"
)

rxPredict(modelObject = isWonObj, data = pred, outData = scoredOutput, 
          predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE,checkFactorLevels = FALSE)

The error says:

INTERNAL ERROR: In rxPredict, the number of parameters does not match the number of  variables: 3 vs. 11. 

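I can see where the 11 comes from, but I only supplied 2 values to the prediction query, so I can't see where the 3 comes from, or why there is a problem.

Any help is appreciated!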

3 Answers

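Although just setting the factor levels (... levels(predictionData$fac) <- levels(trainingData$fac) ...) avoids the error, it also causes the model to use the wrong factor indices, which you can see if writeModelVars is set to TRUE. Setting colInfo in RxSqlServerData for my factor with almost 10,000 levels caused the application to hang, even though the query was passed to SQL Server correctly. I changed my strategy to loading the data into a data frame without any factors and then applying rxFactors to it:

rxSetComputeContext("local")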

sqlPredictQueryDS <- RxSqlServerData(connectionString = sqlConnString, sqlQuery = sqlQuery, stringsAsFactors = FALSE)

predictQueryDS = rxImport(sqlPredictQueryDS)

if ("Artikelnummer" %in% colnames(predictQueryDS)) { predictQueryDS <- rxFactors(predictQueryDS, factorInfo = list(Artikelnummer = list(levels = allItems))) }

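Besides setting the required factor levels, rxFactors also re-orders the factor indices. I am not saying the colInfo solution is wrong; maybe it just doesn't work for factors with "too many" levels.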
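
To illustrate the "wrong factor indices" problem mentioned above, here is a minimal base-R sketch (not RevoScaleR; the variable names are made up for illustration). It assumes the training levels are in the order "one", "two", "three". Assigning with `levels<-` only renames the existing levels by position, whereas rebuilding the factor with `factor(..., levels = )` re-encodes the values against the training levels:

```r
# Levels as the model saw them at training time (assumed order).
train_levels <- c("one", "two", "three")

# New data only contains two of the three levels.
new_values <- c("one", "three", "one", "one")
new_fac <- factor(new_values)          # levels: "one", "three"

# Pitfall: `levels<-` relabels by position, so code 2 ("three")
# is silently renamed to the second training level ("two").
relabelled <- new_fac
levels(relabelled) <- train_levels
relabelled_values <- as.character(relabelled)

# Safer: re-encode the original strings against the training levels.
recoded <- factor(new_values, levels = train_levels)
recoded_values <- as.character(recoded)
recoded_codes  <- as.integer(recoded)
```

Here `relabelled_values` no longer matches the input (the "three" has become "two"), while `recoded_values` is unchanged and `recoded_codes` points at the training positions.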

answered 2016-10-04T06:59:23.253

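Are you sure that specifying colInfo solves the problem? It looks like there is a general issue in rxPredict, rather than one specific to using rxPredict together with SQL Server: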

# lm() and predict() don't have a problem with missing factor levels ("two" in this case):
fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
predictionData = data.frame(fac = c("one", "three", "one", "one"), stringsAsFactors = TRUE)
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1

# rxLinMod() and rxPredict() behave differently:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown:
# "INTERNAL ERROR: In rxPredict, the number of parameters does not match
# the number of  variables: 3 vs. 4."
# checkFactorLevels = FALSE doesn't help here, it actually seems to just
# check the order of factor levels.
levels(predictionData$fac) <- c("two", "three", "one")
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown (twice):
# ERROR:order of factor levels in the data are inconsistent with
# the order of the model coefficients:fac = two versus fac = one. Set
# checkFactorLevels = FALSE to ignore.
rxPred <- rxPredict(rxModel, data = predictionData, checkFactorLevels = FALSE, writeModelVars = TRUE)
rxPred
#   val_Pred    fac
#1  1           two
#2  3           three
#3  1           two
#4  1           two
# This looks suspicious at best. While the prediction values are still
# correct if you look only at the order of the records in trainingData,
# the model variables are messed up.

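In my scenario I have one factor with about 10,000 levels (known only while the model is being created) and several factors with about 5 levels each (known before the model is created). It seems impossible to specify the levels for all of them, in the "right" order, when calling rxPredict().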
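
As a side note on why lm() and predict() cope with the missing level in the example above: a fitted lm object stores the training levels in its `xlevels` component, and predict() re-encodes the new data against them. A minimal base-R sketch using the same toy data (whether rxPredict does anything comparable is not something I can confirm):

```r
# Same toy data as in the example above.
trainingData <- data.frame(fac = c("one", "two", "three"),
                           val = c(1, 2, 3),
                           stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)

# The model remembers every level it was trained on.
model_levels <- lmModel$xlevels$fac

# The prediction data only knows the levels it happens to contain.
predictionData <- data.frame(fac = c("one", "three", "one", "one"),
                             stringsAsFactors = TRUE)
data_levels <- levels(predictionData$fac)

# predict() maps the new values onto the stored training levels.
lmPred <- predict(lmModel, newdata = predictionData)
```

`model_levels` has three entries while `data_levels` has only two, yet the prediction still lines up with the right coefficients.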

answered 2016-09-26T14:58:01.907

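The answer appears to be consistent with how R treats factor variables, but the error message could have been clearer in distinguishing between factors, levels, variables and parameters.

It seems that the inputs used to generate a prediction cannot simply be characters or factors without levels. They need to have the same levels as the factors of the same variables used in the model parameterization.

Therefore, the following lines: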
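
The two numbers in the error message can be reproduced with a small base-R count, assuming (this is my reading, not something taken from documentation) that each factor contributes one column per level and the intercept contributes one more. That reading also fits the "3 vs. 4" message in the other answer's example. The helper name `count_columns` is made up for this sketch:

```r
# Levels seen at training time, taken from the question's data.
city_levels   <- c("London", "New York", "Liverpool",
                   "Paris", "Sydney", "Mexico City")
person_levels <- c("John", "Paul", "George", "Ringo")

# A single prediction row whose factors only know their own value.
pred_naive <- data.frame(City = factor("London"),
                         Person = factor("George"))

# The same row with the training levels declared explicitly.
pred_full <- data.frame(
  City   = factor("London", levels = city_levels),
  Person = factor("George", levels = person_levels)
)

# One column per factor level, plus one for the intercept.
count_columns <- function(df) 1L + sum(vapply(df, nlevels, integer(1)))

n_naive <- count_columns(pred_naive)  # 1 + 1 + 1
n_full  <- count_columns(pred_full)   # 1 + 6 + 4
```

Under that assumption `n_naive` is 3 and `n_full` is 11, which are the two numbers in the message.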

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))

...should be replaced with:

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"

column_information<-list(
  City=list(type="factor",levels=c("London","New York","Liverpool","Paris","Sydney","Mexico City")),
  Person=list(type="factor",levels=c("John","Paul","George","Ringo")),
  Bin=list(type="logical")
)

pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      ,colInfo=column_information,
                      stringsAsFactors=FALSE)
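
Typing the levels out by hand gets tedious with more than a few of them. Below is a hedged base-R sketch that builds a list of the same shape from a local copy of the training data. The helper `make_col_info` and the local data frame are hypothetical, and the sketch assumes that order of first appearance is the order the model used (it is for this data, judging by the coefficient table in the question). The call to RxSqlServerData itself is not shown or exercised here.

```r
# Hypothetical local copy of the training rows from the question.
training <- data.frame(
  Person = rep(c("John", "Paul", "George", "Ringo"), times = 3),
  City   = rep(c("London", "New York", "Liverpool",
                 "Paris", "Sydney", "Mexico City"), times = 2),
  stringsAsFactors = FALSE
)

# Build list(<column> = list(type = "factor", levels = ...)) for the
# named columns, keeping levels in order of first appearance.
make_col_info <- function(df, factor_cols) {
  info <- lapply(factor_cols, function(nm) {
    list(type = "factor", levels = unique(as.character(df[[nm]])))
  })
  setNames(info, factor_cols)
}

column_information <- make_col_info(training, c("City", "Person"))
```

The resulting `column_information` has the same City and Person entries as the hand-written list above and could be passed as colInfo in the same way.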

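I have seen other examples with categorical variables that seem to work without this, but maybe the levels were there anyway.

I hope this saves you as many hours as I lost!

Edit in response to SLSvenR

I think my comment about having the same levels as the training set still holds.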

fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
predictionData = data.frame(fac = c("one", "three", "one", "one"), stringsAsFactors = TRUE)
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1

levels(predictionData$fac)<-levels(trainingData$fac)
# rxLinMod() and rxPredict() behave differently:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE,checkFactorLevels = TRUE)
rxPred
# This result appears correct to me.

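I can't comment on whether this is good or bad, but it looks like one way around the problem is to apply the levels of the training data to the test set, which I assume you can do in real time.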
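
One possible way to apply the training levels to new data on the fly, sketched in base R only. The helper `align_to_training` is hypothetical, and it re-encodes the values with `factor(..., levels = )` rather than assigning with `levels<-`, so each value keeps its meaning and values never seen in training become NA with a warning:

```r
# Re-encode every factor column of new_df against the levels of the
# column with the same name in train_df.
align_to_training <- function(new_df, train_df) {
  for (nm in intersect(names(new_df), names(train_df))) {
    if (!is.factor(train_df[[nm]])) next
    values <- as.character(new_df[[nm]])
    unseen <- setdiff(unique(values[!is.na(values)]),
                      levels(train_df[[nm]]))
    if (length(unseen) > 0) {
      warning(sprintf("column '%s' has values not seen in training: %s",
                      nm, paste(unseen, collapse = ", ")))
    }
    new_df[[nm]] <- factor(values, levels = levels(train_df[[nm]]))
  }
  new_df
}

trainingData <- data.frame(fac = c("one", "two", "three"),
                           val = c(1, 2, 3),
                           stringsAsFactors = TRUE)
predictionData <- data.frame(fac = c("one", "three", "one", "one"),
                             stringsAsFactors = TRUE)

aligned <- align_to_training(predictionData, trainingData)
```

After the call, `aligned$fac` carries all three training levels while its values are still "one", "three", "one", "one".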

answered 2016-08-05T15:52:52.697