php - Indexing 2.5 million items with assorted other information

Question

I have a table with a list of 2.5 million doctors. I also have tables for accepted insurance, languages spoken, and for specialties (taxonomy) provided. The doctor table is like:

CREATE TABLE `doctors` (
  `doctor_id` int(10) NOT NULL AUTO_INCREMENT,
  `city_id` int(10) NOT NULL DEFAULT '0',
  `d_gender` char(1) NOT NULL DEFAULT 'U',
  `s_insurance` int(6) NOT NULL DEFAULT '0',
  `s_languages` int(6) NOT NULL DEFAULT '0',
  `s_taxonomy` int(6) NOT NULL DEFAULT '0',
  PRIMARY KEY (`doctor_id`)
) ENGINE=InnoDB;

The other information is stored as such:

CREATE TABLE `doctors_insurance` (
  `assoc_id` int(10) NOT NULL AUTO_INCREMENT,
  `doctor_id` int(10) NOT NULL DEFAULT '0',
  `insurance_id` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`assoc_id`)
) ENGINE=InnoDB;

CREATE TABLE `doctors_languages` (
  `assoc_id` int(10) NOT NULL AUTO_INCREMENT,
  `doctor_id` int(10) NOT NULL DEFAULT '0',
  `language_id` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`assoc_id`)
) ENGINE=InnoDB;

CREATE TABLE `doctors_taxonomy` (
  `assoc_id` int(10) NOT NULL AUTO_INCREMENT,
  `doctor_id` int(10) NOT NULL DEFAULT '0',
  `taxonomy_id` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`assoc_id`)
) ENGINE=InnoDB;

Naturally, each doctor accepts several insurance plans, may speak multiple languages, and can have several specialties (taxonomy). So I opted for separate tables for indexing: that way, if I need to add new indices or drop old ones, I can simply drop the tables instead of waiting the long time it takes to do it the old-fashioned way.

Also because of other scaling techniques to consider in the future, classic JOINs make no difference to me right now, so I'm not worried about it.

Indexing by name was easy:

CREATE TABLE `indices_doctors_names` (
  `ref_id` int(10) NOT NULL AUTO_INCREMENT,
  `doctor_id` int(10) NOT NULL DEFAULT '0',
  `practice_id` int(10) NOT NULL DEFAULT '0',
  `name` varchar(120) NOT NULL DEFAULT '',
  PRIMARY KEY (`ref_id`),
  KEY `name` (`name`)
) ENGINE=InnoDB;

However, when I wanted to allow people to search by city, specialty, insurance, language, gender, and other demographics, I created this:

CREATE TABLE `indices_doctors_demos` (
  `ref_id` int(10) NOT NULL AUTO_INCREMENT,
  `doctor_id` int(10) NOT NULL DEFAULT '0',
  `city_id` int(10) NOT NULL DEFAULT '0',
  `taxonomy_id` int(6) NOT NULL DEFAULT '0',
  `insurance_id` int(6) NOT NULL DEFAULT '0',
  `language_id` int(6) NOT NULL DEFAULT '0',
  `gender_id` char(1) NOT NULL DEFAULT 'U',
  PRIMARY KEY (`ref_id`),
  KEY `index` (`city_id`,`taxonomy_id`,`insurance_id`,`language_id`,`gender_id`)
) ENGINE=InnoDB;

The idea is that there will be one entry for each change in specialty, insurance, or language, while the other columns stay the same. This creates an obvious problem. If a doctor has 3 specialties, supports 3 insurance providers, and speaks 3 languages, that alone gives this one doctor 3 × 3 × 3 = 27 entries. So 2.5 million entries easily balloon into far more.
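To make the arithmetic concrete, here is a minimal Python sketch. The id values are made up for illustration; only the counts matter. It compares one row per combination (as in `indices_doctors_demos`) with one row per attribute value (as in the separate association tables).

```python
from itertools import product

# Hypothetical ids for a single doctor: 3 specialties, 3 insurers, 3 languages.
taxonomy_ids = [101, 102, 103]
insurance_ids = [201, 202, 203]
language_ids = [301, 302, 303]

# One row per (taxonomy, insurance, language) combination,
# which is what the combined demographics table has to store.
denormalized_rows = list(product(taxonomy_ids, insurance_ids, language_ids))

# One row per attribute value, which is what three separate
# association tables store for the same doctor.
normalized_rows = len(taxonomy_ids) + len(insurance_ids) + len(language_ids)

print(len(denormalized_rows))  # 27
print(normalized_rows)         # 9
```

The combined table grows with the product of the attribute counts, while the separate tables grow with their sum.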

There has to be a better way to do this, but how? Again, I'm not interested in moving to classic indexing techniques and using JOINs, because that will quickly become too slow; I need a method that can scale out easily.

score 0 · Accepted Answer

I know this is not the answer you're looking for, but you've taken the things an RDBMS does well and tried to implement them yourself, using the same mechanism the RDBMS would use to make sense of your data and optimize retrieval and querying. In practice you've dropped proper indexes in favor of a half-way-there solution of your own, which tries to implement indexing by itself (while actually relying on the indexing capability of the RDBMS through the KEY).

I'd suggest simply using the database the way you've already structured it. 2.5 million rows isn't that many, and you should be able to make it fast and within your constraints using both JOINs and indexes. Use EXPLAIN and add proper indexes to support the queries you want answered. If you ever run into an issue (and I doubt you will, given the amount of data you're querying here), solve the bottleneck then, when you actually know what the issue is, instead of trying to solve a problem you've only imagined so far. There might be technologies other than MySQL that can be helpful, but you'll need to know what's actually hurting your performance first.
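As a rough illustration of that workflow, here is a sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL, so it runs anywhere. The index names and the "doctors in a city who accept a given insurer" query are my assumptions, not something from the question, and SQLite's `EXPLAIN QUERY PLAN` output differs from MySQL's `EXPLAIN`. The idea is the same, though: ask the database how it will run the query and check that it searches indexes rather than scanning whole tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doctors (
  doctor_id INTEGER PRIMARY KEY,
  city_id   INTEGER NOT NULL,
  d_gender  TEXT NOT NULL DEFAULT 'U'
);
CREATE TABLE doctors_insurance (
  doctor_id    INTEGER NOT NULL,
  insurance_id INTEGER NOT NULL,
  PRIMARY KEY (doctor_id, insurance_id)
);
-- Hypothetical supporting indexes for the query below.
CREATE INDEX idx_doctors_city ON doctors (city_id);
CREATE INDEX idx_insurance_lookup ON doctors_insurance (insurance_id, doctor_id);
""")

query = """
SELECT d.doctor_id
FROM doctors AS d
JOIN doctors_insurance AS di ON di.doctor_id = d.doctor_id
WHERE d.city_id = ? AND di.insurance_id = ?
"""

# Each plan row is a 4-tuple; the last field is the human-readable step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1, 10)).fetchall()
plan_text = "\n".join(row[3] for row in plan)
print(plan_text)
```

In MySQL the equivalent check is to prefix the SELECT with EXPLAIN and look at which key each table uses.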

score 0 · Accepted Answer

The normal way to deal with the explosion of rows in a denormalized table like "indices_doctors_demos" is to normalize to 5NF. Try to keep in mind that normalizing has nothing at all to do with the decision to use id numbers as surrogate keys.

In the scenario you described, normalizing to 5NF seems practical. You wouldn't have any table with more than about 7 million rows. The table "indices_doctors_demos" vanishes entirely, the four "doctors" tables all become narrower, and all of them would end up with highly selective indexes.

If you worked for me, I'd require you to prove that 5NF can't work before I'd let you take a different approach.

Since you already have all the data, it makes sense to build it and test it, paying close attention to the query plans. It shouldn't take you more than one afternoon. Guessing at some table names, I'd suggest you load data into these tables.

-- You're missing foreign keys throughout. I've added some of them, 
-- but not all of them. I'm also assuming you have a way to identify 
-- doctors besides a bare integer.
CREATE TABLE `doctors` (
  `doctor_id` int(10) NOT NULL AUTO_INCREMENT,
  `city_id` int(10) NOT NULL DEFAULT '0',
  `d_gender` char(1) NOT NULL DEFAULT 'U',
  PRIMARY KEY (`doctor_id`)
) ENGINE=InnoDB;

CREATE TABLE `doctors_insurance` (
  `doctor_id` int(10) NOT NULL DEFAULT '0',
  `insurance_id` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`doctor_id`, `insurance_id`),
  FOREIGN KEY (`doctor_id`) REFERENCES `doctors` (`doctor_id`),
  FOREIGN KEY (`insurance_id`) REFERENCES `insurance` (`insurance_id`)
) ENGINE=InnoDB;

CREATE TABLE `doctors_languages` (
  `doctor_id` int(10) NOT NULL DEFAULT '0',
  `language_id` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`doctor_id`, `language_id`),
  FOREIGN KEY (`doctor_id`) REFERENCES `doctors` (`doctor_id`),
  FOREIGN KEY (`language_id`) REFERENCES `languages` (`language_id`)
) ENGINE=InnoDB;

CREATE TABLE `doctors_taxonomy` (
  `doctor_id` int(10) NOT NULL DEFAULT '0',
  `taxonomy_id` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`doctor_id`, `taxonomy_id`),
  FOREIGN KEY (`doctor_id`) REFERENCES `doctors` (`doctor_id`),
  FOREIGN KEY (`taxonomy_id`) REFERENCES `taxonomies` (`taxonomy_id`)
) ENGINE=InnoDB;
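Here is a minimal sketch of what a search against tables shaped like these could look like. It again uses Python's `sqlite3` as a stand-in for MySQL, and the sample rows and id values are invented. The foreign keys and the lookup tables (`insurance`, `languages`, `taxonomies`) are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doctors (
  doctor_id INTEGER PRIMARY KEY,
  city_id   INTEGER NOT NULL,
  d_gender  TEXT NOT NULL DEFAULT 'U'
);
CREATE TABLE doctors_insurance (
  doctor_id INTEGER NOT NULL, insurance_id INTEGER NOT NULL,
  PRIMARY KEY (doctor_id, insurance_id)
);
CREATE TABLE doctors_languages (
  doctor_id INTEGER NOT NULL, language_id INTEGER NOT NULL,
  PRIMARY KEY (doctor_id, language_id)
);
CREATE TABLE doctors_taxonomy (
  doctor_id INTEGER NOT NULL, taxonomy_id INTEGER NOT NULL,
  PRIMARY KEY (doctor_id, taxonomy_id)
);
""")

# Invented sample data: three doctors in two cities.
conn.executemany("INSERT INTO doctors VALUES (?, ?, ?)",
                 [(1, 10, "F"), (2, 10, "M"), (3, 20, "F")])
conn.executemany("INSERT INTO doctors_insurance VALUES (?, ?)",
                 [(1, 100), (1, 101), (2, 100), (3, 100)])
conn.executemany("INSERT INTO doctors_languages VALUES (?, ?)",
                 [(1, 200), (1, 201), (2, 201), (3, 200)])
conn.executemany("INSERT INTO doctors_taxonomy VALUES (?, ?)",
                 [(1, 300), (2, 300), (2, 301), (3, 300)])

# One JOIN per filter; each association table contributes at most
# one row per doctor because of its composite primary key.
search = """
SELECT d.doctor_id
FROM doctors AS d
JOIN doctors_taxonomy  AS dt ON dt.doctor_id = d.doctor_id
JOIN doctors_insurance AS di ON di.doctor_id = d.doctor_id
JOIN doctors_languages AS dl ON dl.doctor_id = d.doctor_id
WHERE d.city_id = ? AND dt.taxonomy_id = ?
  AND di.insurance_id = ? AND dl.language_id = ?
ORDER BY d.doctor_id
"""
matches = [row[0] for row in conn.execute(search, (10, 300, 100, 200))]
print(matches)  # [1]
```

Only doctor 1 is in city 10 and has taxonomy 300, insurance 100, and language 200. Doctor 2 lacks language 200 and doctor 3 is in another city. Whether this stays fast at 2.5 million doctors is exactly what loading the real data and reading the query plans would tell you.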
