4

I'm pretty new to writing SQL and have just built a couple of procedures to add data to my MySQL database. The problem is that it is extremely slow, due to the large number of queries. What I do now is loop through each record in a table containing the unsorted, raw data and then take that data point and add into the database. This becomes complicated as I have a number of FKs that I have to deal with.

Can you please help me optimize this?

As an example, to add the specified table I do: CALL add_table1(112,15);

Procedure to add data
-- Primary procedure

CREATE PROCEDURE `add_table1`(
    IN c_id INT UNSIGNED;
    IN t_id INT UNSIGNED;
)
BEGIN
    -- Table variables
    DECLARE r_id INT UNSIGNED;
    DECLARE dh_name VARCHAR(50);
    DECLARE d_value DECIMAL(20,10);

    -- Loop variables
    DECLARE done BOOLEAN;

    -- Cursor for measurement table
    DECLARE m_cur CURSOR FOR 
        SELECT Run_ID, DataHeader_Name, Data_Value 
        FROM `measurements`.`measurement_20131029_152902`;

    -- Handlers for exceptions
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    -- Set start time
    UPDATE `measurements`.`queue` 
        SET Start_Time = NOW() 
        WHERE Experiment_ID = 112 AND Procedure_Name = 'add_table1';

    -- Loop through measurement table
    OPEN m_cur;
    m_loop: LOOP
        FETCH m_cur INTO r_id, dh_name, d_value;
        IF done THEN
            CLOSE m_cur;
            LEAVE m_loop;
        END IF;
        CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
    END LOOP m_loop;
END

Procedure to add measurement
-- Secondary procedure, called from add_table1

CREATE PROCEDURE `add_measurement`(
    IN measurement_header VARCHAR(50),
    IN measurement_value DECIMAL(20,10),
    IN tool_id_var INT UNSIGNED,
    IN config_id_var INT UNSIGNED,
    IN run_id_var INT UNSIGNED
)
BEGIN
    -- Variables representing FKs
    DECLARE data_header_id INT UNSIGNED;
    DECLARE tool_header_link_id INT UNSIGNED;
    DECLARE tool_data_id INT UNSIGNED;
    DECLARE tool_data_link_id INT UNSIGNED;

    -- Add header
    INSERT IGNORE INTO data_headers(DataHeader_Name) 
        VALUES(measurement_header);
    SET data_header_id = (SELECT DataHeader_ID 
        FROM data_headers WHERE DataHeader_Name = measurement_header);

    -- Link header to tool
    INSERT IGNORE INTO tool_header_link(DataHeader_ID, Tool_ID)
        VALUES(data_header_id, tool_id_var);
    SET tool_header_link_id = (SELECT ToolHeaderLink_ID 
        FROM tool_header_link
        WHERE DataHeader_ID = data_header_id AND Tool_ID = tool_id_var);

    -- Add measurement
    INSERT IGNORE INTO tool_data(Data_Value) VALUES(measurement_value);
    SET tool_data_id = (SELECT ToolData_ID
        FROM tool_data WHERE Data_Value = measurement_value);

    -- Link measurement to header and configuration
    INSERT IGNORE INTO 
        tool_data_link(ToolHeaderLink_ID, ToolData_ID, Run_ID)
        VALUES(tool_header_link_id, tool_data_id, run_id_var);
    SET tool_data_link_id = (SELECT ToolDataLink_ID FROM tool_data_link
        WHERE ToolHeaderLink_ID = tool_header_link_id
        AND ToolData_ID = tool_data_id AND Run_ID = run_id_var);

    -- Link measurement to experiment configuration
    INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
        VALUES(tool_data_link_id, config_id_var);
END

Current Solution
I stumbled upon this solution about a similar issue. I enclosed the meat of the code inside of a TRANSACTION and immediately noticed a massive improvement in speed. Instead of the query's estimated completion time being about 36 hours, I got the actual completion time down to about 5 minutes! I also did a slight design change to the database and removed an unnecessary FK. If anyone sees further ways to improve this code, I am still interested. I have the performance into an acceptable range for our applications, but I am always interested in making things better.

To show the changes:

    START TRANSACTION;

    -- Loop through measurement table
    OPEN m_cur;
    m_loop: LOOP
        FETCH m_cur INTO r_id, dh_name, d_value;
        IF done THEN
            CLOSE m_cur;
            LEAVE m_loop;
        END IF;
        CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
    END LOOP m_loop;

    COMMIT;

Alternative Solution
Based off the answers below, I was able to update my new solution to the one below. From my testing, it appears that this new solution is functioning as desired. It is also more than twice as fast as the previous solution. Using this routine, I can add one million unique pieces of data in about 2.5 minutes!

Thank you all for your help!

CREATE PROCEDURE `add_table`(
    IN config_id_var INT UNSIGNED
)
BEGIN
    START TRANSACTION;

    -- Add header
    INSERT IGNORE INTO data_headers(DataHeader_Name)
        SELECT DataHeader_Name
        FROM `measurements`.`measurement_20131114_142402`;

    -- Add measurement
    INSERT IGNORE INTO tool_data(Data_Value)
        SELECT Data_Value
        FROM `measurements`.`measurement_20131114_142402`;

    -- Link measurement to header and configuration
        -- INSERT Non-Unique Values
    INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
        SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
        FROM `measurements`.`measurement_20131114_142402` AS m
        JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
        JOIN tool_data AS d ON d.Data_Value = m.Data_Value;
        -- INSERT Unique Values
    INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
        SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
        FROM `measurements`.`measurement_20131114_142402` AS m
        LEFT OUTER JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
        LEFT OUTER JOIN tool_data AS d ON d.Data_Value = m.Data_Value
        WHERE ((h.DataHeader_Name IS NULL) OR (d.Data_Value IS NULL));

    -- Link measurement to experiment configuration
        -- INSERT Non-Unique Values
    INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
        SELECT tdl.ToolDataLink_ID, config_id_var
        FROM tool_data_link AS tdl
        JOIN data_headers AS h ON h.DataHeader_ID = tdl.DataHeader_ID
        JOIN tool_data AS d ON d.ToolData_ID = tdl.ToolData_ID;
        -- INSERT Unique Values
    INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
        SELECT tdl.ToolDataLink_ID, config_id_var
        FROM tool_data_link AS tdl
        LEFT OUTER JOIN data_headers AS h ON h.DataHeader_ID = tdl.DataHeader_ID
        LEFT OUTER JOIN tool_data AS d ON d.ToolData_ID = tdl.ToolData_ID
        WHERE ((h.DataHeader_ID IS NULL) OR (d.ToolData_ID IS NULL));

    COMMIT;
END

Conclusion
I did some more testing with the solution that did not use cursors. It is definitely faster, initially; however, when the size of the database grows, the execution time drastically increases.

I added in a couple million data points into the database. I then tried adding a small data set of around a few hundred data points. It took nearly 400x longer than the cursor solution. I believe that is because the cursors only looked at the data points needed, where as, the joins had to look through all of the data.

Based off those results, it appears that the cursor solution will be better for my applications.

4

2 回答 2

2

数据库将使用基于集合的逻辑。尝试在没有游标的情况下编写相同的查询。

What your code does:
-1: Reads the records from measurement_xxxx.... table and for each one it executes 5 insert.

How this can be writen with set logic: 
-1: make the first insert for all the records from measurement_XXXX....
-2: make the second insert for all the records from measurement_XXX....
.......

This will change the code like the following: ( I did not test the code, so it is a guideline ) 

CREATE PROCEDURE `using_set_logic`(
    IN c_id INT UNSIGNED;
    IN t_id INT UNSIGNED;
)
begin
-- the first insert 
insert into data_headers(DataHeader_Name) 
select DataHeader_Name FROM `measurements`.`measurement_20131029_152902
-- the second insert if the dataheadername is uniqueu
insert into tool_header_link(DataHeader_ID, Tool_ID)
select DataHeader_ID, t_id from data_headers where DataHeader_Name in (
    select DataHeader_Name FROM `measurements`.`measurement_20131029_152902
)
--the second insert if the dataheadername isnot unique.
--take the last id for the dataheaders before the first insert 
--take the last id for the dataheaders after the insert
--use those values to bound the data header id.
insert into tool_header_link(DataHeader_ID, Tool_ID)
select DataHeader_ID, @t_id from data_headers 
where DataHeader_id between @beforeFirstInsert and @afterFirstInsert

end
于 2013-11-13T04:06:05.480 回答
1

举个例子,从你的问题中得到这个陈述:

-- Link measurement to header and configuration
INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
    SELECT t1.DataHeader_ID, t2.ToolData_ID, t3.Run_ID
    FROM data_headers t1, tool_data t2, runs t3
    WHERE (( t1.DataHeader_Name in (
        SELECT DataHeader_Name
        FROM `measurements`.`measurement_20131114_110059`)
    ) AND ( t2.Data_Value in (
        SELECT Data_Value
        FROM `measurements`.`measurement_20131114_110059`)
    ) AND ( t3.Run_ID in (
        SELECT Run_ID
        FROM `measurements`.`measurement_20131114_110059`)));

将其与以下性能进行比较SELECT

    SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
    FROM measurements.measurement_20131114_110059 AS m
    JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
    JOIN tool_data AS d ON d.Data_Value = m.Data_Value
    ;

注意:您绝对需要两个表中的 DataHeader_Name 和 Data_Value 索引。(注意:加入临时表几乎总是很慢,并且永远无法正确索引)

这里的底线是您的所有数据都来自测量_*。如果您想要性能,请尽可能少地使用其他表。

于 2013-11-14T20:07:34.353 回答