I'm working with a MyISAM table of 12 million records containing surname, address, gender, and birth date fields:

ID  SURNAME  GENDER       BDATE  COUNTY         ADDRESS         CITY
 1    JONES       M  1954-11-04     015       51 OAK ST  SPRINGFIELD
 2     HILL       M  1981-02-16     009     809 PALM DR   JONESVILLE
 3     HILL       F  1979-06-23     009     809 PALM DR   JONESVILLE
 4     HILL       F  1941-10-11     009     809 PALM DR   JONESVILLE
 5    SMITH       M  1914-07-27     035  1791 MAPLE AVE     MAYBERRY
 6    SMITH       F  1954-02-05     035  1791 MAPLE AVE     MAYBERRY
 7  STEVENS       M  1962-05-05     019  404 CYPRESS ST     MAYBERRY
 .        .       .           .       .               .
 .        .       .           .       .               .
 .        .       .           .       .               .

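The surname, date, and address fields are indexed. My goal is to add a field for inferred marital status, defined by the following criteria: for each record, if another record exists in the table with (1) the same surname, (2) a different gender, (3) the same address, and (4) an age difference of less than 15 years, set MARRIED = T; otherwise set MARRIED = F.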

As an SQL novice, my initial approach was to add a MARRIED field defaulting to 'F' and then use a self-join to set MARRIED = T.

ALTER TABLE MY_TABLE
ADD COLUMN MARRIED CHAR(1) NOT NULL DEFAULT 'F';

UPDATE MY_TABLE T1, MY_TABLE T2
SET T1.MARRIED = 'T' WHERE
  T1.SURNAME = T2.SURNAME AND
  T1.GENDER != T2.GENDER AND
  T1.ADDRESS = T2.ADDRESS AND
  T1.CITY    = T2.CITY AND
  ABS(YEAR(T1.BDATE)-YEAR(T2.BDATE)) < 15;

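While this works fine on small tables, I quickly learned that on a 12-million-row table I might retire before the process finishes. My SQL knowledge is very limited, so I'm sure this is a suboptimal approach. Any suggested alternatives? Perhaps an index on SURNAME + ADDRESS + CITY? Grouping by ADDRESS + CITY first? A better table design? Any advice would be greatly appreciated.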

Thanks in advance for your help!

3 Answers

Watch out for siblings!

answered 2010-08-28T17:37:57.390

I would try a few variations and see which performs best:

Version 1 uses a simple Exists, but with Date_Add instead of the ABS function:

Update My_Table
Set Married = 'T'
Where Exists    (
                Select 1
                From My_Table As T2
                Where T2.SurName = My_Table.SurName
                    And T2.Gender != My_Table.Gender
                    And T2.Address = My_Table.Address
                    And T2.City = My_Table.City
                    And (
                        -- T2 born more than 15 years after this row
                        T2.BDate > Date_Add(My_Table.BDate, Interval 15 Year)
                        -- or more than 15 years before it
                        Or T2.BDate < Date_Add(My_Table.BDate, Interval -15 Year)
                        )
                )

Version 2 uses UNION ALL:

Update My_Table
Set Married = 'T'
Where Exists    (
                -- T2 born more than 15 years after this row
                Select 1
                From My_Table As T2
                Where T2.SurName = My_Table.SurName
                    And T2.Gender != My_Table.Gender
                    And T2.Address = My_Table.Address
                    And T2.City = My_Table.City
                    And T2.BDate > Date_Add(My_Table.BDate, Interval 15 Year)
                Union All
                -- T2 born more than 15 years before this row
                Select 1
                From My_Table As T2
                Where T2.SurName = My_Table.SurName
                    And T2.Gender != My_Table.Gender
                    And T2.Address = My_Table.Address
                    And T2.City = My_Table.City
                    And T2.BDate < Date_Add(My_Table.BDate, Interval -15 Year)
                )

Version 3 uses an inner join and Date_Add:

Update My_Table As T1
    Join My_Table As T2
            On T2.SurName = T1.SurName
                And T2.Gender != T1.Gender
                And T2.Address = T1.Address
                And T2.City = T1.City
-- Qualify the column: both T1 and T2 have a Married column
Set T1.Married = 'T'
Where T1.BDate > Date_Add(T2.BDate, Interval 15 Year)
        Or T1.BDate < Date_Add(T2.BDate, Interval -15 Year)
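
Note that the date tests in all three versions above match a partner born more than 15 years apart, whereas the question asks for an age difference of less than 15 years. Below is a sketch of the join form with the date test flipped to match the question. It assumes BDate is a DATE column and has not been run against the asker's data. The join form is used because, as far as I know, MySQL rejects an UPDATE whose subquery selects from the table being updated (error 1093).

```sql
-- Sketch only: flag a row when a row with the same surname, address and city
-- but a different gender has a birth date within 15 years on either side.
Update My_Table As T1
    Join My_Table As T2
            On T2.SurName = T1.SurName
                And T2.Gender != T1.Gender
                And T2.Address = T1.Address
                And T2.City = T1.City
                And T2.BDate > Date_Add(T1.BDate, Interval -15 Year)
                And T2.BDate < Date_Add(T1.BDate, Interval 15 Year)
Set T1.Married = 'T';
```

Keeping the range test on a bare T2.BDate, rather than wrapping the column in ABS(YEAR(...)), leaves it usable by an index that includes BDate.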

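Stepping back from the SQL, I think trying to infer whether two people are married from the information provided will be fraught with problems. It does not account for couples with an age difference greater than 15 years (Anna Nicole Smith, anyone?), nor for siblings, nor for two people who marry but do not change their surnames.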

answered 2010-08-28T17:39:38.520

Well, indexing all the fields in the WHERE clause will certainly speed up the query.

That means SURNAME, GENDER, ADDRESS, CITY, and BDATE.
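
As a sketch, one composite index could cover the equality columns of the join instead of five single-column indexes. IDX_HOUSEHOLD is a made-up name, and the column types are not given in the question, so wide VARCHAR columns might need prefix lengths to fit the engine's key-length limit. GENDER is left out on the assumption that a two-valued column adds little selectivity.

```sql
-- Hypothetical index name; column order puts the equality columns first
-- and the range column (BDATE) last.
ALTER TABLE MY_TABLE
  ADD INDEX IDX_HOUSEHOLD (SURNAME, ADDRESS, CITY, BDATE);

-- Check whether the self-join picks up the index before running the UPDATE.
EXPLAIN
SELECT T1.ID
FROM MY_TABLE T1
JOIN MY_TABLE T2
  ON T2.SURNAME = T1.SURNAME
 AND T2.ADDRESS = T1.ADDRESS
 AND T2.CITY    = T1.CITY
 AND T2.GENDER != T1.GENDER;
```

If the EXPLAIN output lists IDX_HOUSEHOLD in the key column for T2, the join is looking up each matching surname, address and city through the index rather than scanning the table.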

Another thing you could try is defining the rules in the ON part to narrow down the results:

UPDATE MY_TABLE T1
  LEFT JOIN MY_TABLE T2
  ON T1.SURNAME = T2.SURNAME
    AND T1.GENDER != T2.GENDER
    AND T1.CITY   = T2.CITY
  SET T1.MARRIED = 'T'
  WHERE ABS(YEAR(T1.BDATE)-YEAR(T2.BDATE)) < 15;
answered 2010-08-28T17:06:15.473