I'm pretty new to writing SQL and have just built a couple of procedures to add data to my MySQL database. The problem is that they are extremely slow, due to the large number of queries. What I do now is loop through each record in a table containing the unsorted, raw data, then take that data point and add it into the database. This is complicated by the number of FKs I have to deal with.
Can you please help me optimize this?
As an example, to add the specified table I do: CALL add_table1(112,15);
Procedure to add data
-- Primary procedure
DELIMITER $$
CREATE PROCEDURE `add_table1`(
    IN c_id INT UNSIGNED,
    IN t_id INT UNSIGNED
)
BEGIN
    -- Table variables
    DECLARE r_id INT UNSIGNED;
    DECLARE dh_name VARCHAR(50);
    DECLARE d_value DECIMAL(20,10);
    -- Loop variables
    DECLARE done BOOLEAN DEFAULT FALSE;
    -- Cursor for measurement table
    DECLARE m_cur CURSOR FOR
        SELECT Run_ID, DataHeader_Name, Data_Value
        FROM `measurements`.`measurement_20131029_152902`;
    -- Handlers for exceptions
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    -- Set start time (assumes c_id is the Experiment_ID in the queue)
    UPDATE `measurements`.`queue`
    SET Start_Time = NOW()
    WHERE Experiment_ID = c_id AND Procedure_Name = 'add_table1';
    -- Loop through measurement table
    OPEN m_cur;
    m_loop: LOOP
        FETCH m_cur INTO r_id, dh_name, d_value;
        IF done THEN
            CLOSE m_cur;
            LEAVE m_loop;
        END IF;
        CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
    END LOOP m_loop;
END$$
DELIMITER ;
Procedure to add measurement
-- Secondary procedure, called from add_table1
DELIMITER $$
CREATE PROCEDURE `add_measurement`(
    IN measurement_header VARCHAR(50),
    IN measurement_value DECIMAL(20,10),
    IN tool_id_var INT UNSIGNED,
    IN config_id_var INT UNSIGNED,
    IN run_id_var INT UNSIGNED
)
BEGIN
    -- Variables representing FKs
    DECLARE data_header_id INT UNSIGNED;
    DECLARE tool_header_link_id INT UNSIGNED;
    DECLARE tool_data_id INT UNSIGNED;
    DECLARE tool_data_link_id INT UNSIGNED;
    -- Add header
    INSERT IGNORE INTO data_headers(DataHeader_Name)
    VALUES(measurement_header);
    SET data_header_id = (SELECT DataHeader_ID
        FROM data_headers
        WHERE DataHeader_Name = measurement_header);
    -- Link header to tool
    INSERT IGNORE INTO tool_header_link(DataHeader_ID, Tool_ID)
    VALUES(data_header_id, tool_id_var);
    SET tool_header_link_id = (SELECT ToolHeaderLink_ID
        FROM tool_header_link
        WHERE DataHeader_ID = data_header_id AND Tool_ID = tool_id_var);
    -- Add measurement
    INSERT IGNORE INTO tool_data(Data_Value) VALUES(measurement_value);
    SET tool_data_id = (SELECT ToolData_ID
        FROM tool_data
        WHERE Data_Value = measurement_value);
    -- Link measurement to header and run
    INSERT IGNORE INTO
        tool_data_link(ToolHeaderLink_ID, ToolData_ID, Run_ID)
    VALUES(tool_header_link_id, tool_data_id, run_id_var);
    SET tool_data_link_id = (SELECT ToolDataLink_ID
        FROM tool_data_link
        WHERE ToolHeaderLink_ID = tool_header_link_id
            AND ToolData_ID = tool_data_id AND Run_ID = run_id_var);
    -- Link measurement to experiment configuration
    INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
    VALUES(tool_data_link_id, config_id_var);
END$$
DELIMITER ;
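As an aside, the INSERT IGNORE pattern above only works if the natural key (e.g. DataHeader_Name) has a UNIQUE index. Given that, each INSERT-plus-SELECT pair could in principle be collapsed into a single round trip using MySQL's LAST_INSERT_ID(expr) idiom. This is only a sketch of the header step, not something I have benchmarked:
-- Sketch only: assumes DataHeader_ID is AUTO_INCREMENT and
-- DataHeader_Name has a UNIQUE index. On a duplicate key, the UPDATE
-- clause makes LAST_INSERT_ID() return the existing row's ID, so no
-- separate SELECT is needed.
INSERT INTO data_headers (DataHeader_Name)
VALUES (measurement_header)
ON DUPLICATE KEY UPDATE DataHeader_ID = LAST_INSERT_ID(DataHeader_ID);
SET data_header_id = LAST_INSERT_ID();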
Current Solution
I stumbled upon a solution to a similar issue. I enclosed the meat of the code inside a TRANSACTION and immediately noticed a massive improvement in speed: instead of an estimated completion time of about 36 hours, the actual completion time came down to about 5 minutes! I also made a slight design change to the database and removed an unnecessary FK. If anyone sees further ways to improve this code, I am still interested; I have gotten the performance into an acceptable range for our applications, but I am always interested in making things better.
To show the changes:
START TRANSACTION;
-- Loop through measurement table
OPEN m_cur;
m_loop: LOOP
    FETCH m_cur INTO r_id, dh_name, d_value;
    IF done THEN
        CLOSE m_cur;
        LEAVE m_loop;
    END IF;
    CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
END LOOP m_loop;
COMMIT;
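One caveat of a single huge transaction is that everything is held until the final COMMIT. If that ever becomes a problem on very large raw tables (lock or undo-log pressure), a variation would be committing in batches. This is only a sketch, assuming the cursor tolerates the intermediate COMMITs (worth verifying) and using an arbitrary batch size of 10000:
-- Sketch only: batch_count must be DECLAREd INT DEFAULT 0 with the
-- other variables; the batch size is illustrative, not tested.
START TRANSACTION;
OPEN m_cur;
m_loop: LOOP
    FETCH m_cur INTO r_id, dh_name, d_value;
    IF done THEN
        CLOSE m_cur;
        LEAVE m_loop;
    END IF;
    CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
    SET batch_count = batch_count + 1;
    -- Commit every 10000 rows instead of once at the very end
    IF batch_count % 10000 = 0 THEN
        COMMIT;
        START TRANSACTION;
    END IF;
END LOOP m_loop;
COMMIT;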
Alternative Solution
Based on the answers below, I was able to replace the cursor-based solution with the one that follows. From my testing, this new solution appears to function as desired, and it is more than twice as fast as the previous solution. Using this routine, I can add one million unique pieces of data in about 2.5 minutes!
Thank you all for your help!
DELIMITER $$
CREATE PROCEDURE `add_table`(
    IN config_id_var INT UNSIGNED
)
BEGIN
    START TRANSACTION;
    -- Add headers
    INSERT IGNORE INTO data_headers(DataHeader_Name)
    SELECT DataHeader_Name
    FROM `measurements`.`measurement_20131114_142402`;
    -- Add measurements
    INSERT IGNORE INTO tool_data(Data_Value)
    SELECT Data_Value
    FROM `measurements`.`measurement_20131114_142402`;
    -- Link measurement to header and run
    -- INSERT non-unique values
    INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
    SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
    FROM `measurements`.`measurement_20131114_142402` AS m
    JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
    JOIN tool_data AS d ON d.Data_Value = m.Data_Value;
    -- INSERT unique values
    INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
    SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
    FROM `measurements`.`measurement_20131114_142402` AS m
    LEFT OUTER JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
    LEFT OUTER JOIN tool_data AS d ON d.Data_Value = m.Data_Value
    WHERE (h.DataHeader_Name IS NULL) OR (d.Data_Value IS NULL);
    -- Link measurement to experiment configuration
    -- INSERT non-unique values
    INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
    SELECT tdl.ToolDataLink_ID, config_id_var
    FROM tool_data_link AS tdl
    JOIN data_headers AS h ON h.DataHeader_ID = tdl.DataHeader_ID
    JOIN tool_data AS d ON d.ToolData_ID = tdl.ToolData_ID;
    -- INSERT unique values
    INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
    SELECT tdl.ToolDataLink_ID, config_id_var
    FROM tool_data_link AS tdl
    LEFT OUTER JOIN data_headers AS h ON h.DataHeader_ID = tdl.DataHeader_ID
    LEFT OUTER JOIN tool_data AS d ON d.ToolData_ID = tdl.ToolData_ID
    WHERE (h.DataHeader_ID IS NULL) OR (d.ToolData_ID IS NULL);
    COMMIT;
END$$
DELIMITER ;
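For completeness, the call for this version takes only the configuration ID, since the tool FK no longer appears in this routine:
-- Example call, mirroring the earlier CALL add_table1(112, 15);
CALL add_table(112);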
Conclusion
I did some more testing with the solution that did not use cursors. It is definitely faster initially; however, as the size of the database grows, the execution time increases drastically.
I added a couple million data points to the database and then tried adding a small data set of a few hundred points. It took nearly 400x longer than the cursor solution. I believe that is because the cursors only touched the data points they needed, whereas the joins had to look through all of the data.
Based on those results, it appears that the cursor solution will be better for my applications.
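One follow-up I may still try: if the join columns are not indexed, full-table lookups in the set-based version would explain the degradation as the database grows. Adding indexes along these lines might change the comparison (a sketch only; the index names are my own, and DataHeader_Name and Data_Value may already carry the UNIQUE keys that INSERT IGNORE depends on):
-- Sketch only: these may be redundant if UNIQUE keys already exist.
CREATE INDEX idx_data_headers_name ON data_headers (DataHeader_Name);
CREATE INDEX idx_tool_data_value ON tool_data (Data_Value);
CREATE INDEX idx_tool_data_link_lookup
    ON tool_data_link (DataHeader_ID, ToolData_ID, Run_ID);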