I have developed an online survey that stores its data in a Microsoft SQL Server 2005 database. I have written a set of outlier checks on that data in R. The general workflow for these scripts is:

  1. Read data from SQL database with sqlQuery()
  2. Perform outlier analysis
  3. Write offending respondents back to database in separate table using sqlSave()

The table I am writing back to has the structure:

CREATE TABLE outliers2(
    modelid int
    , password varchar(50)
    , reason varchar(50),
Constraint PK_outliers2 PRIMARY KEY(modelid, reason)
)
GO

As you can see, I've set the primary key to be modelid and reason. The same respondent may be an outlier for multiple checks, but I do not want to insert the same modelid and reason combo for any respondent.

Since we are still collecting data, I would like to be able to update these scripts on a daily / weekly basis as I develop the models I am estimating on the data. Here is the general form of the sqlSave() command I'm using:

sqlSave(db, db.insert, "outliers2", append = TRUE, fast = FALSE, rownames = FALSE)

where db is a valid ODBC Connection and db.insert has the form

> head(db.insert)
  modelid password          reason
1     873       abkd WRONG DIRECTION
2     875       ab9d WRONG DIRECTION
3     890       akdw WRONG DIRECTION
4     905       pqjd WRONG DIRECTION
5     941       ymne WRONG DIRECTION
6     944       okyt WRONG DIRECTION

sqlSave() chokes when it tries to insert a row that violates the primary key constraint and does not continue with the other records for the insert. I would have thought that setting fast = FALSE would have alleviated this problem, but it doesn't.

Any ideas on how to get around this problem? I could always drop the table at the beginning of the first script, but that seems pretty heavy-handed and will undoubtedly lead to problems down the road.

1 Answer

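In this case, everything is working as expected. You are uploading everything as a batch, and SQL Server stops the batch as soon as it finds an error. Unfortunately, I don't know of an elegant built-in solution. However, I think it is possible to build a system in the database that handles this more efficiently. I prefer doing data storage and management in the database rather than in R, so my solution is very database-heavy. Others may be able to give you a more R-oriented solution.

First, create a simple table with no constraints to hold the new rows, and adjust your sqlSave statement accordingly. This is where R will upload the information.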

CREATE TABLE tblTemp(
    modelid int
    , password varchar(50)
    , reason varchar(50)
    , duplicate int
)
GO

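The query that puts information into this table should assume the column "duplicate" is "no". I use a pattern where 1=Y & 5=N. You could also flag only the offending rows, but I tend to prefer being explicit with my logic.

You will also need a place to dump all rows that violate the primary key of outliers2.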
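
If the upload does not send a duplicate column at all, one way to make every staged row start as "no" is a default constraint. This is only a sketch, assuming the INSERT statements omit the column (an explicitly inserted NULL would bypass the default); the constraint name DF_tblTemp_duplicate is made up for illustration.

```sql
-- Assumption: rows are inserted without a value for "duplicate".
-- DF_tblTemp_duplicate is a hypothetical constraint name.
-- 5 means "not a duplicate" in the 1=Y / 5=N scheme used here.
ALTER TABLE tblTemp
    ADD CONSTRAINT DF_tblTemp_duplicate DEFAULT 5 FOR duplicate;
GO
```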

CREATE TABLE tblDuplicates(
    modelid int
    , password varchar(50)
    , reason varchar(50)
)
GO

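OK. Now all you need to do is create a trigger that moves the new rows from tblTemp to outliers2. This trigger will move all duplicate rows to tblDuplicates for later handling, deletion, etc.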

CREATE TRIGGER FindDups
ON tblTemp
AFTER INSERT
AS 

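I'm not going to write the entire trigger. I don't have SQL Server 2005 to test it against, I would probably make a syntax error, and I don't want to give you bad code, but here is what the trigger needs to do:

  1. Identify all rows in tblTemp that would violate the PK in outliers2. Where a duplicate is found, change duplicate to 1. This would be done with an UPDATE statement.
  2. Copy all rows where duplicate=1 to tblDuplicates. You would do this with INSERT INTO tblDuplicates ......
  3. Now copy the non-duplicate rows to outliers2 with an INSERT INTO statement that looks almost exactly like the one used in step 2.
  4. DELETE all rows from tblTemp to clear it out for your next batch of updates. This step is important.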
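
The four steps above might be sketched as the trigger below. It is untested, like the outline it follows, and it makes a few assumptions that are not part of the answer: rows whose duplicate flag is NULL are treated as non-duplicates, SET NOCOUNT ON is added so the extra row counts are not sent back to the ODBC client, and the GROUP BY in step 3 collapses keys that repeat inside a single batch (keeping one password via MAX) so that the insert itself cannot violate the primary key.

```sql
CREATE TRIGGER FindDups
ON tblTemp
AFTER INSERT
AS
BEGIN
    -- Assumption: suppress per-statement row counts inside the trigger.
    SET NOCOUNT ON;

    -- Step 1: flag staged rows whose (modelid, reason) already exists
    -- in outliers2 (1 = duplicate).
    UPDATE t
    SET duplicate = 1
    FROM tblTemp AS t
    WHERE EXISTS (SELECT 1
                  FROM outliers2 AS o
                  WHERE o.modelid = t.modelid
                    AND o.reason = t.reason);

    -- Step 2: copy the flagged rows to tblDuplicates for later review.
    INSERT INTO tblDuplicates (modelid, password, reason)
    SELECT modelid, password, reason
    FROM tblTemp
    WHERE duplicate = 1;

    -- Step 3: copy the remaining rows to outliers2.
    -- Assumption: NULL flags count as "not a duplicate"; GROUP BY keeps
    -- one row per key if the same key appears twice in one batch
    -- (the extra copies are dropped, not sent to tblDuplicates).
    INSERT INTO outliers2 (modelid, password, reason)
    SELECT modelid, MAX(password), reason
    FROM tblTemp
    WHERE duplicate <> 1 OR duplicate IS NULL
    GROUP BY modelid, reason;

    -- Step 4: empty the staging table for the next upload.
    DELETE FROM tblTemp;
END
GO
```

With something like this in place, the R side would presumably only need sqlSave() to target tblTemp instead of outliers2. An AFTER INSERT trigger fires once per INSERT statement, so it runs for each row when rows arrive one at a time and once for the whole set when they arrive in a single statement.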

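The nice thing about doing it this way is that sqlSave() won't error out just because you violated your PK, and you can deal with the matches later, like tomorrow. :-)

Answered 2010-11-12T14:18:18.990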