eECHO BLOG

A journey of a thousand miles starts with a single step.

Choosing Optimal Data Types

MySQL supports a large variety of data types, and choosing the correct type to store
your data is crucial to getting good performance. The following simple guidelines can
help you make better choices, no matter what type of data you are storing:

Smaller is usually better.
In general, try to use the smallest data type that can correctly store and repre-
sent your data. Smaller data types are usually faster, because they use less space
on the disk, in memory, and in the CPU cache. They also generally require fewer
CPU cycles to process.
Make sure you don’t underestimate the range of values you need to store,
though, because increasing the data type range in multiple places in your schema
can be a painful and time-consuming operation. If you’re in doubt as to which is
the best data type to use, choose the smallest one that you don’t think you’ll
exceed. (If the system is not very busy or doesn’t store much data, or if you’re at
an early phase in the design process, you can change it easily later.)
Simple is good.
Fewer CPU cycles are typically required to process operations on simpler data
types. For example, integers are cheaper to compare than characters, because
character sets and collations (sorting rules) make character comparisons compli-
cated. Here are two examples: you should store dates and times in MySQL’s
built-in types instead of as strings, and you should use integers for IP addresses.
We discuss these topics further later.
Avoid NULL if possible.
You should define fields as NOT NULL whenever you can. A lot of tables include
nullable columns even when the application does not need to store NULL (the
absence of a value), merely because it’s the default. You should be careful to
specify columns as NOT NULL unless you intend to store NULL in them.
It’s harder for MySQL to optimize queries that refer to nullable columns,
because they make indexes, index statistics, and value comparisons more com-
plicated. A nullable column uses more storage space and requires special pro-
cessing inside MySQL. When a nullable column is indexed, it requires an extra
byte per entry and can even cause a fixed-size index (such as an index on a sin-
gle integer column) to be converted to a variable-sized one in MyISAM.
Even when you do need to store a “no value” fact in a table, you might not need
to use NULL. Consider using zero, a special value, or an empty string instead.
The performance improvement from changing NULL columns to NOT NULL is usu-
ally small, so don’t make finding and changing them on an existing schema a pri-
ority unless you know they are causing problems. However, if you’re planning to
index columns, avoid making them nullable if possible.
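As a rough illustration of the three guidelines above, here is a sketch of a table definition. The table and column names (visit_log, visit_type, visitor_ip, and so on) are hypothetical, and the value ranges are assumptions about the data rather than recommendations. It also assumes IPv4 addresses only.

```sql
-- Hypothetical table; names and value ranges are assumptions for illustration.
CREATE TABLE visit_log (
    id          INT UNSIGNED     NOT NULL AUTO_INCREMENT,
    -- Smaller is usually better: a code assumed to stay in 0..255 fits in one byte.
    visit_type  TINYINT UNSIGNED NOT NULL DEFAULT 0,
    -- Simple is good: an IPv4 address fits in a 4-byte unsigned integer.
    visitor_ip  INT UNSIGNED     NOT NULL,
    -- Simple is good: a built-in temporal type instead of a string.
    visited_at  DATETIME         NOT NULL,
    -- Avoid NULL if possible: an empty string stands in for "no referrer".
    referrer    VARCHAR(255)     NOT NULL DEFAULT '',
    PRIMARY KEY (id),
    KEY idx_visitor_ip (visitor_ip)
);

-- INET_ATON() converts dotted-quad text to an integer on the way in.
INSERT INTO visit_log (visit_type, visitor_ip, visited_at)
VALUES (1, INET_ATON('192.168.0.1'), '2024-01-15 10:30:00');

-- INET_NTOA() converts the integer back to text for display.
SELECT INET_NTOA(visitor_ip) AS visitor_ip, visited_at
FROM visit_log
WHERE visitor_ip = INET_ATON('192.168.0.1');
```

Whether an empty string is an acceptable stand-in for "no value" depends on the application; if an empty referrer is itself meaningful, a nullable column may still be the honest choice.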
The first step in deciding what data type to use for a given column is to determine
what general class of types is appropriate: numeric, string, temporal, and so on. This
is usually pretty straightforward, but we mention some special cases where the
choice is unintuitive.

Choosing Identifiers

Choosing a good data type for an identifier column is very important. You’re more
likely to compare these columns to other values (for example, in joins) and to use
them for lookups than other columns. You’re also likely to use them in other tables
as foreign keys, so when you choose a data type for an identifier column, you’re
probably choosing the type in related tables as well. (As we demonstrated earlier in
this chapter, it’s a good idea to use the same data types in related tables, because
you’re likely to use them for joins.)
When choosing a type for an identifier column, you need to consider not only the
storage type, but also how MySQL performs computations and comparisons on that
type. For example, MySQL stores ENUM and SET types internally as integers but con-
verts them to strings when doing comparisons in a string context.
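The difference between the stored integer and the string seen in comparisons can be observed directly. The following is a minimal sketch using a hypothetical enum_test table; the labels are arbitrary and deliberately not in alphabetical order.

```sql
-- Hypothetical table used only to show the two contexts.
CREATE TABLE enum_test (
    e ENUM('fish', 'apple', 'dog') NOT NULL
);

INSERT INTO enum_test (e) VALUES ('fish'), ('dog'), ('apple');

-- Numeric context: e + 0 exposes the internal position in the ENUM list
-- ('fish' is 1, 'apple' is 2, 'dog' is 3).
SELECT e, e + 0 AS internal_value FROM enum_test;

-- String context: the comparison is made against the label.
SELECT e FROM enum_test WHERE e = 'dog';

-- Sorting uses the internal integer, so this returns fish, apple, dog ...
SELECT e FROM enum_test ORDER BY e;

-- ... while forcing a string context sorts by label: apple, dog, fish.
SELECT e FROM enum_test ORDER BY CAST(e AS CHAR);
```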

Integer types
Integers are usually the best choice for identifiers, because they’re fast and they
work with AUTO_INCREMENT.
ENUM and SET
The ENUM and SET types are generally a poor choice for identifiers, though they
can be good for static “definition tables” that contain status or “type” values.
ENUM and SET columns are appropriate for holding information such as an order’s
status, a product’s type, or a person’s gender.
As an example, if you use an ENUM field to define a product’s type, you might
want a lookup table primary keyed on an identical ENUM field. (You could add
columns to the lookup table for descriptive text, to generate a glossary, or to
provide meaningful labels in a pull-down menu on a web site.) In this case,
you’ll want to use the ENUM as an identifier, but for most purposes you should
avoid doing so.
String types
Avoid string types for identifiers if possible, as they take up a lot of space and are
generally slower than integer types. Be especially cautious when using string
identifiers with MyISAM tables. MyISAM uses packed indexes for strings by
default, which may make lookups much slower. In our tests, we’ve noted up to
six times slower performance with packed indexes on MyISAM.
You should also be very careful with completely “random” strings, such as those
produced by MD5(), SHA1(), or UUID(). Each new value you generate with them
will be distributed in arbitrary ways over a large space, which can slow INSERT
and some types of SELECT queries:

• They slow INSERT queries because the inserted value has to go in a random
location in indexes. This causes page splits, random disk accesses, and
clustered index fragmentation for clustered storage engines.
• They slow SELECT queries because logically adjacent rows will be widely dis-
persed on disk and in memory.
• Random values cause caches to perform poorly for all types of queries
because they defeat locality of reference, which is how caching works. If the
entire data set is equally “hot,” there is no advantage to having any particu-
lar part of the data cached in memory, and if the working set does not fit in
memory, the cache will have a lot of flushes and misses.
If you do store UUID values, you should remove the dashes or, even better,
convert the UUID values to 16-byte numbers with UNHEX() and store them in a
BINARY(16) column. You can retrieve the values in hexadecimal format with the
HEX() function.
Values generated by UUID() have different characteristics from those generated
by a cryptographic hash function such as SHA1(): the UUID values are unevenly
distributed and are somewhat sequential. They’re still not as good as a
monotonically increasing integer, though.
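The identifier choices above might look like this in practice. This is only a sketch: the customer, customer_order, and api_token tables and their columns are hypothetical, and the literal UUID in the last query is an example value.

```sql
-- Hypothetical parent table with an integer surrogate key.
CREATE TABLE customer (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name  VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
);

-- The referencing column uses exactly the same type as customer.id,
-- so joins compare like with like.
CREATE TABLE customer_order (
    id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
    customer_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (id),
    KEY idx_customer_id (customer_id)
);

-- Hypothetical table keyed by a UUID stored as 16 raw bytes.
CREATE TABLE api_token (
    id          BINARY(16) NOT NULL,
    created_at  DATETIME   NOT NULL,
    PRIMARY KEY (id)
);

-- Strip the dashes from the 36-character text form, then pack it with UNHEX().
INSERT INTO api_token (id, created_at)
VALUES (UNHEX(REPLACE(UUID(), '-', '')), NOW());

-- HEX() returns the 32-character hexadecimal form for display.
SELECT HEX(id) AS id_hex, created_at FROM api_token;

-- Look up a known UUID (example value) by applying the same conversion.
SELECT created_at
FROM api_token
WHERE id = UNHEX(REPLACE('6ccd780c-baba-1026-9564-5b8c656024db', '-', ''));
```

Storing the packed form takes 16 bytes per key instead of the 36 characters of the dashed text form, which also shrinks every index that includes the key.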

Indexes are data structures that help MySQL retrieve data efficiently. They are
critical for good performance, but people often forget about them or misunderstand
them, so indexing is a leading cause of real-world performance problems.
