sql - do indexes on boolean columns help page caching

Question

I have read that boolean columns are not much use as search indexes. But my question is: since a clustered index affects the physical arrangement of the records, can't it be used to keep one type of record together (in the same pages), so that the other pages have less chance of being loaded into memory? I will try to explain better. For the table

[BookPages]
ID(int)
Deleted(Boolean)
Text(Varchar)

If the clustered index is on the ID column, sample data would look like this:

1, true,  'the quick..'
2, false, 'hello w..'
3, true,  'stack m..'
4, false, 'just thin...'

This means that the deleted and active records are interleaved, so if we search for record 2:

SELECT [Text] FROM [BookPages] WHERE [Deleted] = 0 AND [ID] = 2

the leaf data page may end up holding rows (1, 2). This means we are loading into memory records flagged as deleted, which we will never be interested in. But if the clustered index were on the columns (Deleted, ID), the data would be ordered as:

2, false, 'hello w..'
4, false, 'just thin...'
1, true,  'the quick..'
3, true,  'stack m..'

Now, when we target only the active records, the pages that get loaded will be full of active records only.

So on a database with a long history and a lot of deleted records, we get better locality for the records we want, which helps with IO.

And across thousands of pages, we can make sure that a large chunk of them is never loaded into memory, so that data stays on disk only.

Is this reasoning correct? Could this improve overall performance on large databases?

score 3 · Accepted Answer

Yes, that reasoning is correct. You can in effect partition the data set into two regions, one hot and one cold. Using a bit is just a special case of this technique. You also could use a date column and cluster on that (of course whether that is feasible or not depends on the schema and data).

Partitioning has a similar effect. Choosing the clustering key is lighter weight and just as good though.
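
To make the hot/cold idea concrete, here is a toy model in Python (not SQL Server itself). It assumes an invented page capacity of 4 rows and simply packs rows into pages in clustering-key order, then counts how many pages contain at least one active row. The names ROWS_PER_PAGE and pages_with_active_rows are made up for this sketch, and real page sizes and fill factors will differ.

```python
# Toy model: rows are packed into fixed-size pages in clustering-key order.
ROWS_PER_PAGE = 4  # assumed page capacity, for illustration only


def pages_with_active_rows(rows, key):
    """Return how many pages hold at least one active (not deleted) row."""
    ordered = sorted(rows, key=key)
    pages = [ordered[i:i + ROWS_PER_PAGE]
             for i in range(0, len(ordered), ROWS_PER_PAGE)]
    return sum(1 for page in pages if any(not deleted for _, deleted in page))


# 40 rows as (id, deleted); only every fourth row is still active.
rows = [(i, i % 4 != 0) for i in range(1, 41)]

# Clustered on ID: active rows are scattered across all pages.
by_id = pages_with_active_rows(rows, key=lambda r: r[0])

# Clustered on (Deleted, ID): active rows are packed at the front.
by_deleted_id = pages_with_active_rows(rows, key=lambda r: (r[1], r[0]))

print(by_id, by_deleted_id)  # 10 3
```

With the ID ordering every page has to be touched to read the active rows. With the (Deleted, ID) ordering only the first three pages do, and the rest can stay cold on disk.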

Oftentimes clustering on an auto-incremented number also has good locality because the IDENTITY value correlates with age and age correlates with frequency of usage.

The same optimization does not apply directly to nonclustered indexes. You can use a boolean prefix for them, too, but you need to provide it in a sargable form:

WHERE SomeNCIndexCol = '1234' AND Deleted IN (0, 1)

SQL Server is not smart enough to figure this out by itself. It cannot "skip" the first index level like Oracle can. So we have to provide seek keys manually. (Connect item: https://connect.microsoft.com/SQLServer/feedback/details/695044)

A different concern is write performance. Marking a row as deleted (SET Deleted = 1) now changes the clustering key, so it requires a physical delete+insert pair for the clustered index (CI) plus one for each nonclustered index (NCI), because every NCI row carries the clustering key as its row locator. Primary key changes are not supported by most ORMs, so you probably should not make this clustering key the primary key.
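
The delete+insert cost can be pictured with another small Python sketch. It assumes the clustered index is just a list kept sorted by (deleted, id), which is a big simplification of a real B-tree, and mark_deleted is a hypothetical helper written for this illustration. The point is only that flipping the flag moves the row to a different position instead of updating it in place.

```python
import bisect


def mark_deleted(clustered, row_id):
    """Flip Deleted for row_id in a list sorted by (deleted, id).

    Returns (old_position, new_position) to show that the row moved.
    """
    old_key = (False, row_id)
    old_pos = bisect.bisect_left(clustered, old_key)
    if old_pos == len(clustered) or clustered[old_pos] != old_key:
        raise KeyError(row_id)
    del clustered[old_pos]                      # physical delete at the old spot
    new_key = (True, row_id)
    new_pos = bisect.bisect_left(clustered, new_key)
    clustered.insert(new_pos, new_key)          # physical insert at the new spot
    return old_pos, new_pos


# Active rows sort first (False < True), deleted rows sort last.
clustered = [(False, 2), (False, 4), (False, 6), (True, 1), (True, 3)]
moved = mark_deleted(clustered, 4)
print(moved)      # (1, 4)
print(clustered)  # [(False, 2), (False, 6), (True, 1), (True, 3), (True, 4)]
```

In this model each nonclustered index would need the same kind of move, since its entries would embed the (deleted, id) key.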

As a side note, creating an index on a bit column has other use cases as well. If 99% of the values are the same (nearly all zero or nearly all one), you can use the index to seek to the rare value and then do key lookups. You can also use such an index for counting (or grouping on the bit column).

score 0 · Answer

Creating an index on a column with only two or a few possible values is actually counter-productive. Clustering on the boolean column may also not be wise, as you may want to save the clustered index for other columns that are queried frequently, for example CustomerName. If your DB server supports fragmentation, you could logically place the least-accessed rows, selected by the value of your Deleted column, in a separate table. See my related question and answers.
