We have been experimenting for a while now with a particularly challenging and very large dataset, and I have found very few ways to create effective indexes and primary keys for it (barring a wholly radical redesign of the database, which is not an economical option at this point). I am looking for suggestions on how to alter either the queries or the table structure (partitioning, etc.). Long story short, we end up with a lot of time-consuming cartesian joins.
Here is the nitty gritty:
I have 3 key example tables here, but we sometimes join in 2-3 more tables similar to result:
Wherever I write (??), I am scratching my head wondering whether this was really the best way to set things up - please chime in.
samples - our main table of concern, as it holds all of our sample/case demographics. Fields:
- sampleid (NOT the PK, but UNIQUE ??) - varchar(255); most values are 10-digit integers (??). This is a unique ID throughout the database for a given report.
- case - varchar(255); again, most values are 10-12 digit integers (??). This is a second form of unique ID, BUT a case value of 1000001 may have 1-20 sampleid values associated with it in other tables (more later) to provide sequential/chronological information (like a journal).
adj_samples
Contains expanded/annotated data beyond samples; linked to samples by SampleID.
Fields:
- RecordID (PK) - just an autonumber keeping count of records (??).
- SampleID - linked to the samples table, one sample ID -> many adj_samples records, as a sample may have several annotations, notes, or other minutiae associated with it, which is the purpose of the adj_samples table.
- ProbableID - int; just an internal info code for us.
result
Fields:
- SampleID - varchar(255) (again, probably could be an int).
- result - varchar(255), though from what I have seen the values are limited to about 100 characters.
Basically linked to adj_samples by SampleID, one DISTINCT(SampleID) to many result rows; the result fields indicate different levels of information, some more detailed than others, 50-100 characters in length (so why varchar(255)?).
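Purely for illustration, here is roughly the shape I picture the tables having if the IDs were tightened up to integers - this is a sketch of where I think we should end up, not our current DDL, and the column types and index names are my guesses:

CREATE TABLE samples (
    sampleid  BIGINT UNSIGNED NOT NULL,    -- currently varchar(255), mostly 10-digit numbers
    `case`    BIGINT UNSIGNED NOT NULL,    -- currently varchar(255), 10-12 digit numbers
    -- ...other demographic columns...
    PRIMARY KEY (sampleid),
    KEY idx_case (`case`)
) ENGINE=MyISAM;

CREATE TABLE adj_samples (
    RecordID    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    SampleID    BIGINT UNSIGNED NOT NULL,  -- many adj_samples rows per sample
    ProbableID  INT NOT NULL,              -- internal info code
    PRIMARY KEY (RecordID),
    KEY idx_probable_sample (ProbableID, SampleID),
    KEY idx_sample (SampleID)
) ENGINE=MyISAM;

CREATE TABLE result (
    SampleID  BIGINT UNSIGNED NOT NULL,
    result    VARCHAR(100) NOT NULL,       -- observed values top out around 100 chars
    KEY idx_sample (SampleID)
) ENGINE=MyISAM;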
Table Sizes
samples: 2,946,614 rows, MyISAM, 384 MB
adj_samples: 12,098,910 rows, MyISAM, 1.3 GB
result: 13,011,880 rows, 428,508 KB
A sample query would give us all of the case counts for a given internal ID (ProbableID):
SELECT r.result AS result
     , COUNT(DISTINCT s.`case`) AS ResultCount
FROM adj_samples a
LEFT JOIN samples s
       ON a.SampleID = s.SampleID
LEFT JOIN result r
       ON a.SampleID = r.SampleID
WHERE a.ProbableID = '101'
  AND a.ProbableID NOT IN # (subquery to a table of banned codes we don't want to see)
GROUP BY r.result
ORDER BY COUNT(DISTINCT s.`case`)
We sometimes have to join on several more tables similar to result afterwards, as there are other tables containing pertinent information - it is not uncommon to find 5-6 of them stacked up, and it goes totally cartesian. We've indexed the best we could, but we are dealing with so many varchars that could be keys (result.result is indexed, but that field is 100-255 characters long!).
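One thing I have been toying with (not applied in production yet) is a prefix index, so the entire 100-255 character value doesn't have to go into the key; a rough sketch, where the 20-character prefix and the name idx_result_prefix are just placeholders:

-- check how selective a short prefix actually is before committing to it
SELECT COUNT(DISTINCT LEFT(result, 20)) / COUNT(DISTINCT result) AS prefix_selectivity
FROM result;

-- if the ratio is close to 1, a 20-character prefix should be nearly as good as indexing the full column
ALTER TABLE result ADD INDEX idx_result_prefix (SampleID, result(20));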
I wonder about the strange unused field in samples being the PK; it seems to me SampleID ought to be the PK, as the values are supposed to be unique - but perhaps duplicates were introduced by error?
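A quick sanity check for duplicates would settle that - something along these lines:

SELECT sampleid, COUNT(*) AS dupes
FROM samples
GROUP BY sampleid
HAVING COUNT(*) > 1
LIMIT 20;

If that comes back empty, I assume ALTER TABLE samples ADD PRIMARY KEY (sampleid) would go through.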
I am looking for something like a partitioning strategy and just generally thinking outside the box to get this going. This information doesn't have much in the way of numeric codes or one-to-one tables to use as intermediate index tables.
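For illustration, this is the sort of thing I have in mind - purely a sketch, assuming sampleid really is all numeric (or could be made so), that it becomes the primary key (MySQL 5.1+ requires the partitioning column to be part of every unique key), and that 16 partitions is an acceptable arbitrary number:

ALTER TABLE samples MODIFY sampleid BIGINT UNSIGNED NOT NULL;
ALTER TABLE samples ADD PRIMARY KEY (sampleid);
ALTER TABLE samples PARTITION BY KEY (sampleid) PARTITIONS 16;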
We have major performance problems, so I've included my my.cnf below in case it helps; the box is a dedicated 8-core Intel CentOS 5.5 machine with 16 GB of RAM. I find that it often has to write to disk on these large joins. Again, the first thing I think I should deal with is proper field sizes for the data we are storing - varchar(255) for a 10-digit integer seems like a waste.
Will excessive field lengths beyond what you really need affect performance via table size?
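For what it's worth, MySQL can suggest tighter types itself via PROCEDURE ANALYSE (still available in 5.x); something roughly like this is what I would try:

SELECT sampleid, `case`
FROM samples
PROCEDURE ANALYSE();

The Optimal_fieldtype column in the output should show how much of that varchar(255) is actually wasted.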
A picture of the DB schema is also attached.
With EXPLAIN: I really bite the bullet on the initial adj_samples step in the plan - it goes to Using where; Using temporary; Using filesort, then another Using where on result over 4 rows. All joins are of type ref.
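Since the "Using temporary" is presumably what pushes those big GROUP BYs out to /home/tmp, the standard status counters should show how often that spills to disk:

SHOW GLOBAL STATUS LIKE 'Created_tmp%';

Comparing Created_tmp_disk_tables against Created_tmp_tables shows how often the implicit temporary tables end up on disk rather than in memory.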
Here is some my.cnf:
[mysqld]
socket = /var/lib/mysql/mysql.sock
key_buffer = 2048M
max_allowed_packet = 16M
group_concat_max_len = 16M
table_cache = 1024
sort_buffer_size = 4M
read_buffer_size = 4M
read_rnd_buffer_size = 16M
myisam_sort_buffer_size = 128M
thread_cache_size = 16
thread_concurrency = 16
query_cache_type = 1
query_cache_size = 512M
tmpdir = /home/tmp
join_buffer_size = 4M
max_heap_table_size = 3G
tmp_table_size = 512M
log-slow-queries
long_query_time = 20
no-auto-rehash
[isamchk]
key_buffer =1024M
sort_buffer_size = 256M
read_buffer = 2M
write_buffer = 2M
[myisamchk]
key_buffer = 4096M
sort_buffer_size = 256M
read_buffer = 2M
write_buffer = 2M
**This part strikes me as odd, as I don't believe we are using any InnoDB tables (see the engine check after the config listing):**
innodb_data_home_dir = /var/lib/mysql/
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
innodb_log_group_home_dir = /var/lib/mysql/
innodb_log_arch_dir = /var/lib/mysql/
innodb_buffer_pool_size = 1024M
innodb_additional_mem_pool_size = 20M
**comment from whoever:** Set .._log_file_size to 25% of buffer pool size
innodb_log_file_size = 100M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
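For reference, a quick way to confirm which engines are actually in use (a simple information_schema query, available since MySQL 5.0):

SELECT ENGINE,
       COUNT(*) AS table_count,
       ROUND(SUM(DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS total_mb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema')
GROUP BY ENGINE;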
Thanks for all your help, everyone. I am 6 months into learning MySQL and have learned a lot, but I am looking forward to learning more from you all on this exercise.
To top all this off: in top I see the mysql process hit only about 30% memory but max out at 200-400% CPU. Is this normal, or is my my.cnf screwy?