php - How do I import "lots" of data into MySQL with PHP and foreign keys?

Question

I have these tables:

create table person (
    person_id int unsigned auto_increment,
    person_key varchar(40) not null,
    primary key (person_id),
    constraint uc_person_key unique (person_key)
);
-- person_key is a varchar(40) that identifies an individual, unique
-- person in the initial data that is imported from a CSV file to this table

create table marathon (
    marathon_id int unsigned auto_increment,
    marathon_name varchar(60) not null,
    primary key (marathon_id)
);

create table person_marathon (
    person_marathon_id int unsigned auto_increment,

    person_id int unsigned,
    marathon_id int unsigned,

    primary key (person_marathon_id),
    foreign key (person_id) references person (person_id),
    foreign key (marathon_id) references marathon (marathon_id),

    constraint uc_marathon_person unique (person_id, marathon_id)
);

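The person table is populated from a CSV with roughly 130,000 rows. This CSV contains the unique varchar(40) for each person plus some other person data. There are no IDs in the CSV.

For each marathon, I get a CSV containing a list of 1k - 30k people. The CSV essentially contains just a list of person_key values showing which people took part in that particular marathon.

What is the best way to import the data into the person_marathon table so that the FK relationships are maintained?

These are the ideas I can currently think of: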

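Pull the person_id + person_key information out of MySQL and merge it with the person_marathon data in PHP, so the person_id is in place before inserting into the person_marathon table
Use a temporary table for the insert... but this is for work, and I have been asked never to use temporary tables in this particular database
Don't use a person_id at all and just use the person_key field, but then I would have to join on a varchar(40), which usually isn't a good thing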

Or, for the insert, make it look something like this (I had to insert the <hr>, otherwise it wouldn't format the whole insert as code):

insert  into person_marathon (person_id, marathon_id)

select  p.person_id, m.marathon_id

from    ( select 'person_a' as p_name, 'marathon_a' as m_name union 
          select 'person_b' as p_name, 'marathon_a' as m_name ) 
          as imported_marathon_person_list 

        -- p_name holds the imported person_key values
        join person p 
           on p.person_key = imported_marathon_person_list.p_name

        join marathon m 
           on m.marathon_name = imported_marathon_person_list.m_name;

The problem with that insert is that imported_marathon_person_list, built in PHP, would be huge, because it could easily have 30,000 select union items. I'm not sure how else to do it, though.

score 2 · Accepted Answer

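I've dealt with similar data conversion problems, though at a smaller scale. If I understand your problem correctly (and I'm not sure I do), the detail that makes your situation challenging is that you're trying to do two things in the same step:

import a large number of rows from CSV into MySQL, and
do a transformation so that the person-marathon association works through person_id and marathon_id rather than the (unwieldy and undesirable) varchar person_key column.

In short, I would do everything possible to avoid doing both in the same step. Split it into two steps: first import all the data in a tolerable form, then optimize it afterwards. MySQL is a good environment for this kind of transformation, because the IDs are set up for you as you import the data into the person and marathon tables.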

Step 1: Import the data

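I find data conversions easier to perform inside MySQL than outside it. So get the data into MySQL in a form that preserves the person-marathon association, even if that form isn't optimal, and worry about changing the association method afterwards.
You mention temporary tables, but I don't think you need any. Set up a temporary column, person_key, on the person_marathon table. When you import the associations, leave person_id blank for now and import only the person_key. Importantly, make sure person_key is an indexed column on both the association table and the person table. You can then make a later pass to fill in the correct person_id for each person_key without worrying about MySQL being inefficient.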
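A minimal sketch of that helper column, assuming it gets the same name and type as person_key on the person table. The index name idx_pm_person_key is made up for illustration. The person table already indexes person_key through its uc_person_key unique constraint, so only the association table needs a new index.

```sql
-- Add a nullable helper column plus an index for the later lookup pass.
-- idx_pm_person_key is a hypothetical name; pick whatever fits your conventions.
alter table person_marathon
    add column person_key varchar(40) null,
    add index idx_pm_person_key (person_key);
```

Because person_id is nullable in the schema above, and a MySQL unique index permits multiple NULL values, the uc_marathon_person constraint should not get in the way while person_id is still empty.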
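I'm not clear on the nature of the marathon table data. Do you have thousands of marathons to enter? If so, I don't envy you the work of handling one spreadsheet per marathon. But if there are fewer, you can probably set up the marathon table by hand and let MySQL generate the marathon IDs for you. Then, as you import the person_marathon CSV for each marathon, be sure to specify that marathon's ID in every association belonging to it.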
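One way this import could look for a single marathon is sketched below. It assumes the helper column on person_marathon is called person_key, that the CSV has one person_key per line with no header row, and that local_infile is enabled on both client and server. The marathon name and file path are placeholders.

```sql
-- Let MySQL generate the marathon ID, then remember it for the import.
insert into marathon (marathon_name) values ('marathon_a');
set @marathon_id = last_insert_id();

-- Load only the person_key values; person_id stays NULL for now.
-- '/path/to/marathon_a.csv' is a placeholder path.
load data local infile '/path/to/marathon_a.csv'
into table person_marathon
lines terminated by '\n'
(person_key)
set marathon_id = @marathon_id;
```

If LOAD DATA is not available in your environment, multi-row insert statements built in PHP would fill the same two columns.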

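Once the import is finished, you have three tables:
* person - you have the ugly person_key, the newly generated person_id, and any other fields.
* marathon - you should have a marathon_id at this point, right? Either newly generated or a number carried over from some older system.
* person_marathon - this table should have marathon_id filled in and pointing to the correct row in the marathon table, right? You also have person_key (ugly but present) and person_id (still null).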

Step 2: Use person_key to fill in person_id for each row in the association table

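Then you either use MySQL directly or write a simple PHP script to fill in person_id for each row of the person_marathon table. If I can't get MySQL to do it directly, I usually write a PHP script that processes one row at a time. The steps are simple: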

find any one row where person_id is null but person_key is not null
look up the person_id for that person_key
write that person_id into the association table for that row
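If MySQL can do the work directly, the three steps above collapse into a single set-based statement. This is a sketch, assuming the helper column on person_marathon is named person_key:

```sql
-- For every association row that still lacks a person_id,
-- look up the matching person by key and write the ID back.
update person_marathon pm
join person p
    on p.person_key = pm.person_key
set pm.person_id = p.person_id
where pm.person_id is null
  and pm.person_key is not null;
```

Rows whose person_key has no match in person are left untouched, so they remain easy to find afterwards.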

You can tell PHP to repeat this 100 times and then end the script, or 1000 times, if you keep running into timeout problems or the like.
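The same batching idea can be sketched in SQL alone, again assuming a helper column named person_key. MySQL does not allow LIMIT on a multi-table update, so this version uses a single-table update with a subquery. The batch size of 1000 is arbitrary.

```sql
-- Fill in at most 1000 rows per run; repeat until no rows are affected.
update person_marathon pm
set pm.person_id = (select p.person_id
                    from person p
                    where p.person_key = pm.person_key)
where pm.person_id is null
  -- skip keys with no matching person, so they are not retried forever
  and pm.person_key in (select p2.person_key from person p2)
limit 1000;
```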

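This transformation involves a lot of lookups, but each lookup only has to deal with a single row. That is appealing because at no point do you need to ask MySQL (or PHP) to "hold the whole dataset in its head".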

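At this point, every row of your association table should have person_id filled in. It is now safe to drop the person_key column from it, and voilà, you have efficient foreign keys.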
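A cautious way to finish, sketched under the assumption that the helper column is named person_key, is to confirm nothing was left unmatched before dropping it:

```sql
-- Should return 0; any remaining rows have a person_key with no match in person.
select count(*)
from person_marathon
where person_id is null;

-- Dropping the column also removes any index that consists only of that column.
alter table person_marathon
    drop column person_key;
```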
