Once a week I am given a massive CSV file containing just under 5 million records. These records need to either be added to my SQL database (MS SQL Server) or updated, depending on whether they already exist. I thought about performing a bulk upsert, but the issue is that I cannot update the records directly. This is what the [important components of the] objects look like:
```
public class PatientRecord
{
    public int MRN;            // primary key
    public string first_name;
    public string last_name;
    public int? updated_mrn;
    public int? pat_id;        // filtered non-clustered unique index
}
```
When a record from the CSV is processed, we first check whether its MRN already exists and whether the rest of the data matches. If everything matches, the record is skipped; if the MRN exists but the data differs, the record is added to a List<PatientRecord> of exceptions. If the MRN is not found, we check whether that pat_id already exists. If it does, the new MRN is written to the existing record's updated_mrn field (and updated in the db); otherwise a new record is created.
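In rough C#, the per-record decision logic looks like this (ExistingByMrn, ExistingByPatId, and DataMatches are stand-ins for my actual lookups and comparison, and the three lists are the buckets described above — a sketch, not my literal code):

```
foreach (PatientRecord rec in csvRecords)
{
    PatientRecord byMrn = ExistingByMrn(rec.MRN);
    if (byMrn != null)
    {
        if (!DataMatches(byMrn, rec))
            exceptions.Add(rec);           // MRN exists but data differs
        // otherwise: exact match, skip
    }
    else
    {
        PatientRecord byPatId = ExistingByPatId(rec.pat_id);
        if (byPatId != null)
        {
            byPatId.updated_mrn = rec.MRN; // remember the new MRN
            updates.Add(byPatId);          // needs a db update
        }
        else
        {
            inserts.Add(rec);              // brand-new record
        }
    }
}
```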
The problem is that this takes forever. My application uses LINQ to SQL for almost all other database transactions, but it does not seem like the best way to handle the weekly load/update. I thought about performing some SQL bulk operations instead, but then I would need to load all of the records from the CSV into memory. I'm not sure of the most efficient way to do this. My current thoughts are the following:
- Load the CSV data into memory
- Compare each object with the database (using LINQ to SQL)
- If found, remove it from the structure and place it in the exception structure or the update structure
- Bulk insert the non-exception/non-update records (see the sketch after this list)
- Bulk update the update structure
- Generate an exceptions file for manual review
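For the bulk insert step, something like SqlBulkCopy into a staging table is what I had in mind, followed by a set-based MERGE/UPDATE from the staging table into the real one. The staging table name and connection string below are placeholders:

```
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void BulkInsertNewRecords(IEnumerable<PatientRecord> inserts,
                                 string connectionString)
{
    var table = new DataTable();
    table.Columns.Add("MRN", typeof(int));
    table.Columns.Add("first_name", typeof(string));
    table.Columns.Add("last_name", typeof(string));
    table.Columns.Add("pat_id", typeof(int));

    foreach (var r in inserts)
        table.Rows.Add(r.MRN, r.first_name, r.last_name,
                       (object)r.pat_id ?? DBNull.Value); // map null pat_id to DBNull

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.PatientStaging"; // placeholder staging table
        bulk.BatchSize = 10000;                           // avoid one giant batch
        bulk.WriteToServer(table);
    }
}
```

(I know SqlBulkCopy can also take an IDataReader, which might let me stream rows instead of building the whole DataTable in memory — which feeds into my memory question below.)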
My questions are as follows: What data structure would be the most memory-efficient for holding all of this data? (Random access is not needed.) Should LINQ to SQL not be used to perform the verifications? I know it is not the best-performing way to query a database. Am I going about this component of the project all wrong?
Any advice or suggestions are welcome!