sql-server - SQL Server R Services - outputting data to a database table, performance

Question

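I noticed that when the outFile parameter is set to a table, the rx* functions (e.g. rxKmeans, rxDataStep) insert data into the SQL Server table row by row. This is obviously very slow, and something like a bulk insert would be preferable. Is that possible, and if so, how?

Currently I am trying to insert about 14 million rows into a table by calling rxKmeans with the outFile parameter specified, and it takes about 20 minutes.

Example of my code: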

clustersLogInitialPD <-  rxKmeans(formula = ~LogInitialPD
                     ,data = inDataSource
                     ,algorithm = "Lloyd" 
                     ,centers = start_c
                     ,maxIterations = 1
                     ,outFile = sqlLogPDClustersDS
                     ,outColName = "ClusterNo"
                     ,overwrite = TRUE
                     ,writeModelVars = TRUE
                     ,extraVarsToWrite = c("LoadsetId", "ExposureId")
                     ,reportProgress = 0
)

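sqlLogPDClustersDS points to a table in my database.

I am running SQL Server 2016 SP1 with R Services installed and configured (both in-database and standalone). In general everything works fine, apart from this poor performance when writing rows to a database table from an R script.

Any comments would be greatly appreciated.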

score 2 · Accepted Answer

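I also raised this recently on this Microsoft R MSDN forum thread.

I have run into this problem myself and I know of 2 reasonable solutions.

(1) Use the sp_execute_external_script output data frame option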

/* Time writing data back to SQL from R */
SET STATISTICS TIME ON

IF object_id('tempdb..#tmp') IS NOT NULL
    DROP TABLE #tmp

CREATE TABLE #tmp (a FLOAT NOT NULL, b INT NOT NULL );

DECLARE @numRows INT = 1000000

INSERT INTO #tmp (a, b)
EXECUTE sys.sp_execute_external_script
          @language = N'R'
         ,@script = N'OutputDataSet <- data.frame(a=rnorm(numRows), b=1)'
         ,@input_data_1 = N''
        , @output_data_1_name = N'OutputDataSet'
         ,@params = N' @numRows INT'
         ,@numRows = @numRows
GO
-- ~7-8 seconds for 1 million row insert (2 columns) on my server
-- rxDataStep for 100K rows takes ~45 seconds on my server
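
The R code embedded in @script can be tried in an ordinary R session first, to check that the columns of OutputDataSet line up with the target table. This is only a minimal sketch: numRows is set by hand here (inside SQL Server it arrives through @params), and writing 1L instead of 1 is my own tweak so that column b is an R integer rather than a double.

```r
# Stand-in for the @numRows parameter that SQL Server would pass in
numRows <- 1000L

# Same shape as the script in the statement above; 1L makes b an integer column
OutputDataSet <- data.frame(a = rnorm(numRows), b = 1L)

# Inspect the column classes and row count before wiring this into T-SQL
str(OutputDataSet)
```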

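(2) Use SQL Server bcp.exe, or (only when running on the SQL box itself) BULK INSERT, after first writing the data frame to a flat file

I have written some code that does this, but it is not very polished, and I had to leave some parts as <<<VARIABLE>>> placeholders that stand for connection string information (server, database, schema, login, password). If you find this useful, or find any bugs, please let me know. I would also love to see Microsoft build in the ability to save data from R back to SQL Server using the BCP APIs. Solution (1) above only works through sp_execute_external_script. Basic testing also leads me to believe that bcp.exe is roughly twice as fast as option (1) for a million rows. BCP results in a minimally logged SQL operation, so I would expect it to be faster.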

# Creates a bcp file format function needed to insert data into a table.
# This should be run one-off during code development to generate the format needed for a given task and saved alongside the .R file that uses it
createBcpFormatFile <- function(formatFileName, tableName) {
  # Command to generate BCP file format for importing data into SQL Server

  # https://msdn.microsoft.com/en-us/library/ms162802.aspx
  # format creates a format file based on the option specified (-n, -c, -w, or -N) and the table or view delimiters. When bulk copying data, the bcp command can refer to a format file, which saves you from re-entering format information interactively. The format option requires the -f option; creating an XML format file, also requires the -x option. For more information, see Create a Format File (SQL Server). You must specify nul as the value (format nul).
  # -c Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
  # -x Used with the format and -f format_file options, generates an XML-based format file instead of the default non-XML format file. The -x does not work when importing or exporting data. It generates an error if used without both format and -f format_file.
  ## Bob: -x not used because we currently target bcp version 8 (default odbc driver compatibility that is installed everywhere)
  # -f If -f is used with the format option, the specified format_file is created for the specified table or view. To create an XML format file, also specify the -x option. For more information, see Create a Format File (SQL Server).
  # -t field_term Specifies the field terminator. The default is \t (tab character). Use this parameter to override the default field terminator. For more information, see Specify Field and Row Terminators (SQL Server).
  # -S server_name [\instance_name] Specifies the instance of SQL Server to which to connect. If no server is specified, the bcp utility connects to the default instance of SQL Server on the local computer. This option is required when a bcp command is run from a remote computer on the network or a local named instance. To connect to the default instance of SQL Server on a server, specify only server_name. To connect to a named instance of SQL Server, specify server_name\instance_name.
  # -U login_id Specifies the login ID used to connect to SQL Server.
  # -P password Specifies the password for the login ID. If this option is not used, the bcp command prompts for a password. If this option is used at the end of the command prompt without a password, bcp uses the default password (NULL).

  bcpPath <- .pathToBcpExe()
  parsedTableName <- parseName(tableName)
  # We can't use the -d option for BCP and instead need to fully qualify a table (database.schema.table)
  # -d database_name Specifies the database to connect to. By default, bcp.exe connects to the user's default database. If -d database_name and a three part name (database_name.schema.table, passed as the first parameter to bcp.exe) is specified, an error will occur because you cannot specify the database name twice. If database_name begins with a hyphen (-) or a forward slash (/), do not add a space between -d and the database name.
  fullyQualifiedTableName <- paste0(parsedTableName["dbName"], ".", parsedTableName["schemaName"], ".", parsedTableName["tableName"])

  bcpOptions <- paste0("format nul -c -f ", formatFileName, " -t, ", .bcpConnectionOptions())

  commandToRun <- paste0(bcpPath, " ", fullyQualifiedTableName, " ", bcpOptions)
  result <- .bcpRunShellThrowErrors(commandToRun)
}


# Save a data frame (data) using file format (formatFilePath) to a table on the database (tableName)
bcpDataToTable <- function(data, formatFilePath, tableName) {
  numRows <- nrow(data)

  # write file to disk
  ptm <- proc.time()

  tmpFileName <- tempfile("bcp", tmpdir=getwd(), fileext=".csv")
  write.table(data, file=tmpFileName, quote=FALSE, row.names=FALSE, col.names=FALSE, sep=",")
  # Bob: note that one can make this significantly faster by switching over to use the readr package (readr::write_csv)
  #readr::write_csv(data, tmpFileName, col_names=FALSE)

  # bcp file to server time start
  mid <- proc.time()

  bcpPath <- .pathToBcpExe()
  parsedTableName <- parseName(tableName)
  # We can't use the -d option for BCP and instead need to fully qualify a table (database.schema.table)
  # -d database_name Specifies the database to connect to. By default, bcp.exe connects to the user's default database. If -d database_name and a three part name (database_name.schema.table, passed as the first parameter to bcp.exe) is specified, an error will occur because you cannot specify the database name twice. If database_name begins with a hyphen (-) or a forward slash (/), do not add a space between -d and the database name.
  fullyQualifiedTableName <- paste0(parsedTableName["dbName"], ".", parsedTableName["schemaName"], ".", parsedTableName["tableName"])
  bcpOptions <- paste0(" in ", tmpFileName, " ", .bcpConnectionOptions(), " -f ", formatFilePath, " -h TABLOCK")

  commandToRun <- paste0(bcpPath, " ", fullyQualifiedTableName, " ", bcpOptions)

  result <- .bcpRunShellThrowErrors(commandToRun)

  cat(paste0("time to save dataset to disk (", numRows, " rows):\n"))
  print(mid - ptm)
  cat(paste0("overall time (", numRows, " rows):\n"))
  print(proc.time() - ptm)  # print explicitly: a bare expression in the middle of a function is not auto-printed

  unlink(tmpFileName)
}

# Examples:
# createBcpFormatFile("test2.fmt", "temp_bob")
# data <- data.frame(x=sample(1:40, 1000, replace=TRUE))
# bcpDataToTable(data, "test2.fmt", "test_bcp_1")

#####################
#                   #
# Private functions #
#                   #
#####################

# Path to bcp.exe. bcp.exe is currently from version 8 (SQL 2000); newer versions depend on newer SQL Server ODBC drivers and are harder to copy/paste distribute
.pathToBcpExe <- function() {
  paste0(<<<bcpFolder>>>, "/bcp.exe")
}

# Function to convert warnings from shell into errors always
.bcpRunShellThrowErrors <- function(commandToRun) {
  tryCatch({
    shell(commandToRun)
  }, warning=function(w) {
    conditionMessageWithoutPassword <- gsub(<<<connectionStringSqlPassword>>>, "*****", conditionMessage(w), fixed=TRUE) # Do not print SQL passwords in errors
    stop("Converted from warning: ", conditionMessageWithoutPassword)
  })
}

# The connection options needed to establish a connection to the client database
.bcpConnectionOptions <- function() {
  if (<<<useTrustedConnection>>>) {
    return(paste0(" -S ", <<<databaseServer>>>, " -T"))
  } else {
    return(paste0(" -S ", <<<databaseServer>>>, " -U ", <<<connectionStringLogin>>>," -P ", <<<connectionStringSqlPassword>>>))
  }
}

###################
# Other functions #
###################

# Mirrors SQL Server parseName function
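parseName <- function(databaseObject) {
  # Split "db.schema.table", "schema.table" or "table" on literal dots
  splitName <- strsplit(databaseObject, '.', fixed=TRUE)[[1]]
  if (length(splitName)==3){
    dbName <- splitName[1]
    schemaName <- splitName[2]
    tableName <- splitName[3]
  } else if (length(splitName)==2){
    # Two-part name: fall back to the default database (not the server name)
    dbName <- <<<databaseName>>>
    schemaName <- splitName[1]
    tableName <- splitName[2]
  } else if (length(splitName)==1){
    # One-part name: an empty schema gives "db..table", i.e. the default schema
    dbName <- <<<databaseName>>>
    schemaName <- ""
    tableName <- splitName[1]
  } else {
    stop("Expected a one-, two- or three-part object name: ", databaseObject)
  }

  return(c(tableName=tableName, schemaName=schemaName, dbName=dbName))
}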
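
For the BULK INSERT alternative mentioned above, which the code does not cover, the statement could be assembled in R roughly as follows. This is only a sketch under assumptions: buildBulkInsertSql is a hypothetical helper and not part of the code above, the table name and path in the example are made up, and the file is assumed to be the headerless comma-separated file that write.table produces in bcpDataToTable, stored at a path the SQL Server service can read. The resulting string still has to be sent through whatever database connection you already use.

```r
# Hypothetical helper (not part of the code above): builds a T-SQL BULK INSERT
# statement for a comma-separated file that has no header row.
buildBulkInsertSql <- function(fullyQualifiedTableName, filePath) {
  # Double any single quotes so the path stays a valid T-SQL string literal
  escapedPath <- gsub("'", "''", filePath, fixed = TRUE)
  paste0(
    "BULK INSERT ", fullyQualifiedTableName,
    " FROM '", escapedPath, "'",
    # '\\n' in R yields the two characters \n, which T-SQL reads as the row terminator
    " WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', TABLOCK);"
  )
}

# Example with made-up names
sqlText <- buildBulkInsertSql("MyDb.dbo.test_bcp_1", "C:\\temp\\bcp_data.csv")
cat(sqlText, "\n")
```

TABLOCK mirrors the -h TABLOCK hint passed to bcp.exe in bcpDataToTable, so both routes request the same table-level lock during the load.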
