MySQL disallows indexing a full value of BLOB
, TEXT
and long VARCHAR
columns because data they contain can be huge, and implicitly DB index will be big, meaning no benefit from index.
MySQL requires that you define first N characters to be indexed, and the trick is to choose a number N that’s long enough to give good selectivity, but short enough to save space. The prefix should be long enough to make the index nearly as useful as it would be if you’d indexed the whole column.
Before we go further let us define some important terms. Index selectivity is ratio of the total distinct indexed values and total number of rows. Here is one example for test table:
+-----+-----------+
| id | value |
+-----+-----------+
| 1 | abc |
| 2 | abd |
| 3 | adg |
+-----+-----------+
If we index only the first character (N=1), then index table will look like the following table:
+---------------+-----------+
| indexedValue | rows |
+---------------+-----------+
| a | 1,2,3 |
+---------------+-----------+
In this case, index selectivity is equal to IS=1/3 = 0.33.
Let us now see what will happen if we increase number of indexed characters to two (N=2).
+---------------+-----------+
| indexedValue | rows |
+---------------+-----------+
| ab | 1,2 |
| ad | 3 |
+---------------+-----------+
In this scenario IS=2/3=0.66 which means we increased index selectivity, but we have also increased the size of index. Trick is to find the minimal number N which will result to maximal index selectivity.
There are two approaches you can do calculations for your database table. I will make demonstration on the this database dump.
Let's say we want to add column last_name in table employees to the index, and we want to define the smallest number N which will produce the best index selectivity.
First let us identify the most frequent last names:
select count(*) as cnt, last_name
from employees
group by employees.last_name
order by cnt
+-----+-------------+
| cnt | last_name |
+-----+-------------+
| 226 | Baba |
| 223 | Coorg |
| 223 | Gelosh |
| 222 | Farris |
| 222 | Sudbeck |
| 221 | Adachi |
| 220 | Osgood |
| 218 | Neiman |
| 218 | Mandell |
| 218 | Masada |
| 217 | Boudaillier |
| 217 | Wendorf |
| 216 | Pettis |
| 216 | Solares |
| 216 | Mahnke |
+-----+-------------+
15 rows in set (0.64 sec)
As you can see, the last name Baba is the most frequent one. Now we are going to find the most frequently occurring last_name prefixes, beginning with five-letter prefixes.
+-----+--------+
| cnt | prefix |
+-----+--------+
| 794 | Schaa |
| 758 | Mande |
| 711 | Schwa |
| 562 | Angel |
| 561 | Gecse |
| 555 | Delgr |
| 550 | Berna |
| 547 | Peter |
| 543 | Cappe |
| 539 | Stran |
| 534 | Canna |
| 485 | Georg |
| 417 | Neima |
| 398 | Petti |
| 398 | Duclo |
+-----+--------+
15 rows in set (0.55 sec)
There are much more occurrences of every prefix, which means we have to increase number N until the values are almost the same as in the previous example.
Here are results for N=9
select count(*) as cnt, left(last_name,9) as prefix
from employees
group by prefix
order by cnt desc
limit 0,15;
+-----+-----------+
| cnt | prefix |
+-----+-----------+
| 336 | Schwartzb |
| 226 | Baba |
| 223 | Coorg |
| 223 | Gelosh |
| 222 | Sudbeck |
| 222 | Farris |
| 221 | Adachi |
| 220 | Osgood |
| 218 | Mandell |
| 218 | Neiman |
| 218 | Masada |
| 217 | Wendorf |
| 217 | Boudailli |
| 216 | Cummings |
| 216 | Pettis |
+-----+-----------+
Here are results for N=10.
+-----+------------+
| cnt | prefix |
+-----+------------+
| 226 | Baba |
| 223 | Coorg |
| 223 | Gelosh |
| 222 | Sudbeck |
| 222 | Farris |
| 221 | Adachi |
| 220 | Osgood |
| 218 | Mandell |
| 218 | Neiman |
| 218 | Masada |
| 217 | Wendorf |
| 217 | Boudaillie |
| 216 | Cummings |
| 216 | Pettis |
| 216 | Solares |
+-----+------------+
15 rows in set (0.56 sec)
This are very good results. This means that we can make index on column last_name
with indexing only first 10 characters. In table definition column last_name
is defined as VARCHAR(16)
, and this means we have saved 6 bytes (or more if there are UTF8 characters in the last name) per entry. In this table there are 1637 distinct values multiplied by 6 bytes is about 9KB, and imagine how this number would grow if our table contains million of rows.
You can read other ways of calculating number of N in my post Prefixed indexes in MySQL.
UNIQUE
keys?