
I have a massive CSV file that I am given once a week that contains just under 5 million records. These records need to either be added to my SQL database (MS SQL Server) or updated, depending on whether they already exist. I thought about performing a bulk upsert, but the issue is that I cannot update the records directly. This is what the [important components of the] objects look like:

PatientRecord
  int MRN;            // primary key
  string first_name;
  string last_name;
  int? updated_mrn;
  int? pat_id;        // filtered non-clustered unique index

When a record needs to be added to the system we first need to check whether that MRN already exists and the rest of the data matches. If so, the record is skipped; otherwise it gets added to a List&lt;PatientRecord&gt; of exceptions. If the MRN is not found, we need to check whether that pat_id already exists. If so, the new MRN is added to the updated_mrn component of the object (and updated in the db); otherwise a new record is created.

The problem is that this takes forever. My application uses LINQ to SQL for almost all other database transactions, but this would not be the best way to handle the weekly load/update. I thought about performing some SQL bulk operations to do this, but then I'd need to load all of the records from the CSV into memory. I'm not quite sure of the most efficient way of doing this. My current thoughts are the following:

  1. Load the CSV data into memory
  2. Compare each object with the database (using LINQ to SQL)
  3. If found, remove it from the structure and place it in the exception structure or the update structure
  4. Bulk insert the non-exceptions/updates
  5. Bulk update the exception structure
  6. Generate an exceptions file for manual review
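The matching rules above could be sketched roughly as follows. This is only an illustration, not the poster's code: it assumes the CSV rows are streamed (csvRecords) and that two lookup dictionaries, existingByMrn and existingByPatId, have been pre-loaded from the database; all of those names are hypothetical.

```csharp
// Sketch: partition CSV rows into inserts, updates, and exceptions.
var inserts = new List<PatientRecord>();
var updates = new List<PatientRecord>();
var exceptions = new List<PatientRecord>();

foreach (var rec in csvRecords) // streamed, so only the output lists stay in memory
{
    PatientRecord match;
    if (existingByMrn.TryGetValue(rec.MRN, out match))
    {
        if (match.first_name == rec.first_name && match.last_name == rec.last_name)
            continue;                  // identical record: skip
        exceptions.Add(rec);           // same MRN, different data: manual review
    }
    else if (rec.pat_id.HasValue && existingByPatId.TryGetValue(rec.pat_id.Value, out match))
    {
        match.updated_mrn = rec.MRN;   // pat_id already known: record the new MRN
        updates.Add(match);
    }
    else
    {
        inserts.Add(rec);              // brand-new record
    }
}
```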

My questions are as follows: What data structure would be the most memory-efficient for holding all of this data? Random access is not needed. Should LINQ to SQL not be used to perform the verifications? I know it is not the best-performing way to query a database. Am I going about this component of the project all wrong?

Any advice or suggestions are welcome!


3 Answers


If you are familiar with SSIS and T-SQL, the following should be fairly simple and easy to maintain. First, create an SSIS package to load the raw data into a SQL Server table. If the MRN is known to be unique within each file, you can index this new "RAW" table accordingly.

Second, create a stored procedure that merges the RAW data into your production table. A merge performs the inserts, updates, or deletes in a single operation.

Finally, you can wrap the whole thing in a SQL Server Agent job.

I hope this helps...
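As an illustration of the stored procedure this answer describes, a T-SQL sketch using the column names from the question; the production and RAW table names are assumptions:

```sql
-- Sketch only: merge staged RAW rows into the production table in one statement.
MERGE dbo.PatientRecord AS target
USING dbo.PatientRecord_RAW AS source
    ON target.MRN = source.MRN
WHEN MATCHED AND (target.first_name <> source.first_name
               OR target.last_name  <> source.last_name) THEN
    UPDATE SET target.first_name = source.first_name,
               target.last_name  = source.last_name
WHEN NOT MATCHED BY TARGET THEN
    INSERT (MRN, first_name, last_name, pat_id)
    VALUES (source.MRN, source.first_name, source.last_name, source.pat_id);
```

Note that this plain MERGE does not by itself route mismatched rows into an exceptions list; the WHEN MATCHED branch would need to be adapted for the question's manual-review requirement.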

answered 2013-10-11T15:38:56.083

I would use SqlBulkCopy in C#:

1/ Load the CSV data into a staging table using SqlBulkCopy

2/ Compare the staging table with the database (using LINQ to SQL or any other SQL code)

3/ If found, remove from the structure and place in the exception structure or the update structure

4/ Bulk insert the non-exceptions/updates with SqlBulkCopy

You should not use LINQ to SQL for the inserts, because it performs them one by one (there is no bulk insert in L2S).
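A minimal sketch of step 1/ with SqlBulkCopy. The connection string, staging table name, and the csvDataTable variable are assumptions, not from the answer:

```csharp
// Sketch: bulk-load parsed CSV rows into a staging table.
using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.PatientRecord_Staging"; // hypothetical staging table
    bulk.BatchSize = 10000; // commit in batches rather than one giant transaction
    bulk.ColumnMappings.Add("MRN", "MRN");
    bulk.ColumnMappings.Add("first_name", "first_name");
    bulk.ColumnMappings.Add("last_name", "last_name");
    bulk.ColumnMappings.Add("pat_id", "pat_id");
    bulk.WriteToServer(csvDataTable);
}
```

WriteToServer also accepts an IDataReader, so a streaming CSV reader would avoid materializing all 5 million rows in memory at once.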

answered 2013-10-13T08:02:01.117

The existing answers are good, but I'll add one point: if you perform your selects and DML in large batches, you can keep plenty of logic in the application without any problem. Always send the database a small number of large queries. That saves on several fronts: round-trip time, network bandwidth, per-transaction cost, per-batch cost, and per-statement cost. It also gives the optimizer the chance to perform bulk operations; sorting 1M rows is much faster than sorting 1000×1000 rows. Together, these add up to order-of-magnitude speedups.

SQL Server has no client-side bulk update or bulk merge facility, but you can bulk insert into a temporary table and then run a single merge/update against everything at once.

The key point is: as long as you use a small number of large operations, you can do whatever you like. You don't need to run everything in T-SQL.
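For example, a sketch of the temp-table approach, assuming a bulk-loaded #staged table with the question's column names:

```sql
-- Sketch: one set-based UPDATE for existing pat_ids with a new MRN,
-- then one set-based INSERT of the genuinely new rows.
UPDATE p
SET    p.updated_mrn = t.MRN
FROM   dbo.PatientRecord AS p
JOIN   #staged AS t ON t.pat_id = p.pat_id
WHERE  NOT EXISTS (SELECT 1 FROM dbo.PatientRecord x WHERE x.MRN = t.MRN);

INSERT dbo.PatientRecord (MRN, first_name, last_name, pat_id)
SELECT t.MRN, t.first_name, t.last_name, t.pat_id
FROM   #staged AS t
WHERE  NOT EXISTS (SELECT 1 FROM dbo.PatientRecord p WHERE p.MRN = t.MRN)
  AND  NOT EXISTS (SELECT 1 FROM dbo.PatientRecord p WHERE p.pat_id = t.pat_id);
```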

answered 2013-10-13T08:19:23.763