Optimizing Databases

Saturday, 30 January 2010 10:35

Database tuning is the process of improving database performance by minimizing response time (the time it takes a statement to complete) and maximizing throughput (the number of statements a database can handle per second).

Tuning is a team exercise, performed collectively by DBAs, database designers, application designers, and database users, based on an understanding of the database. Tuning both depends on and impacts many aspects of the overall system.

Database performance becomes an important issue as data volumes, query complexity, and the number of concurrent users grow.

In general, most database systems are designed for good performance. The best improvements can be achieved in the initial design phase but sometimes not enough information is available about the characteristics of a database. Later, altering a large database in production use can be expensive and practical considerations put constraints on what can be changed.

Tuning can make the difference between a query taking milliseconds or minutes or even more to execute.

Imagine that the customer table in the sakila database contains information about 500,000 customers and the city table contains several hundred rows:

SELECT cu.customer_id AS ID, CONCAT(cu.first_name, _utf8' ', cu.last_name) AS name,
 a.address AS address, a.postal_code AS `zip code`, a.phone AS phone,
 city.city AS city, country.country AS country,
 IF(cu.active, _utf8'active', _utf8'') AS notes, cu.store_id AS StoreID
FROM customer AS cu
JOIN address AS a ON cu.address_id = a.address_id
JOIN city ON a.city_id = city.city_id
JOIN country ON city.country_id = country.country_id
WHERE country LIKE 'U%' AND city REGEXP '^[A-L]';

Without appropriate indexes, the database system may take a long time to complete this query because of the large joins. More on query tuning and indexing in another lesson.
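Indexes on the join and filter columns would help here. A sketch of the indexes involved; note that in the stock sakila schema most of the join columns already carry foreign-key indexes, so these statements are illustrative:

```sql
-- Indexes supporting the joins in the query above (sakila's foreign
-- keys already provide most of these).
CREATE INDEX idx_customer_address ON customer (address_id);
CREATE INDEX idx_address_city     ON address (city_id);
CREATE INDEX idx_city_country     ON city (country_id);
-- An index on country.country lets the optimizer resolve LIKE 'U%'
-- as a range scan instead of scanning the whole table.
CREATE INDEX idx_country_name     ON country (country);
```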

Improving Application Design

A SQL statement or transaction can be written in different ways to perform the same task, some good and some not so good. Smart database systems automatically rearrange SQL queries for performance via query optimization. In doing so, the query optimizer uses information such as table sizes, available indexes, and the distribution of data.

However, a query optimizer may not be able to fully optimize poorly written SQL code; the intent of a query is best understood by the application writer.

If some queries are known to require more processing time than others, an application may be rearranged to run those at a no-load or low-load time.

Range queries (ones that search for values within a range) can be slower than exact match queries. Also, join and nested queries, and queries that use clauses such as the DISTINCT and ORDER BY clauses can be slow.

A large number of simultaneous users can slow down response time if the system does not have enough memory or processor cycles or if the transactions conflict heavily.

Using Non-Normalized Tables To Eliminate Joins

As we have seen in another lesson, normalizing tables eliminates data redundancy and the associated data consistency problems. However, to improve performance, databases sometimes use non-normalized tables that store redundant data to avoid or eliminate joins.

Here are two examples of reading films with their categories:

Code Sample: OptimizeDB/Demos/Normalized-Film.sql

SELECT title, description, rating, c.name as category
FROM film f
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id);
Code Explanation

Normalized access via film_category and category.

Execution of this query may be slowed down by the two joins. This often-used query can now be written with no join as:

Code Sample: OptimizeDB/Demos/Non-Normalized-Film.sql

SELECT title, description, rating, category_name as category
FROM film_detail f;
Code Explanation

No joins are needed as category name is stored along with film in the Non-normalized table film_detail.
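The film_detail table is not part of the stock sakila schema. A minimal sketch of how such a non-normalized table could be built; CREATE TABLE ... SELECT copies both structure and data, and in the sakila data each film has a single category, so film_id can serve as the primary key:

```sql
-- Build a non-normalized film_detail table (not part of stock sakila).
-- The in_stock column pre-computes the inventory count used later in
-- this lesson; the subquery avoids a join fan-out.
CREATE TABLE film_detail
SELECT f.film_id, f.title, f.description, f.rating,
       c.name AS category_name,
       (SELECT COUNT(*) FROM inventory i
         WHERE i.film_id = f.film_id) AS in_stock
FROM film f
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id);

-- Valid because each sakila film belongs to exactly one category.
ALTER TABLE film_detail ADD PRIMARY KEY (film_id);
```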

Precomputing Queries

Frequently executed queries requiring a fair amount of fetching can be precomputed and their results stored in the database. This will reduce both disk I/O and query processing times.

Storing pre-computed summary data in the database can significantly improve performance on frequently used queries, such as maintaining counts, sums and related updates.

Here are two examples of reading films with their stock counts:

Code Sample: OptimizeDB/Demos/Count-On-Demand.sql

SELECT film_id, title, count(inventory_id) AS in_stock
FROM film JOIN inventory USING (film_id)
GROUP BY film_id
HAVING in_stock > 3;
Code Explanation

Film inventory is always fetched fresh from the database, requiring a join and a GROUP BY.

Execution of this query may be improved by eliminating both the join and the on-demand aggregation:

Code Sample: OptimizeDB/Demos/Pre-Computed.sql

SELECT film_id, title, in_stock
FROM film_detail
WHERE in_stock > 3;
Code Explanation

Film count in stock is stored with the film.

Updates to the stored stock count can be automated using triggers on the inventory table.
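A minimal sketch of such triggers, assuming film_detail has film_id and in_stock columns (film_detail is not part of the stock sakila schema):

```sql
-- Keep film_detail.in_stock in sync with the inventory table.
DELIMITER //
CREATE TRIGGER inventory_added AFTER INSERT ON inventory
FOR EACH ROW
BEGIN
  UPDATE film_detail SET in_stock = in_stock + 1
  WHERE film_id = NEW.film_id;
END//

CREATE TRIGGER inventory_removed AFTER DELETE ON inventory
FOR EACH ROW
BEGIN
  UPDATE film_detail SET in_stock = in_stock - 1
  WHERE film_id = OLD.film_id;
END//
DELIMITER ;
```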

The results need not be stored in existing tables. One may create new tables or even materialized views.

Warning: The performance benefits of removing joins and pre-computing results must outweigh the cost of altering the table, the extra storage, the work of maintaining the redundant data consistently, and other limitations of non-normalized tables such as loss of cardinality.

The tradeoff is a slight performance degradation during inserts and updates of the base data in exchange for a significant performance improvement in summary queries.

Materialized Views

Creating materialized views, i.e. pre-computed result sets, can be beneficial when producing the data involves joining large tables with very dynamic, loose matching criteria, as is common in today's web applications with custom search forms.

A database supporting materialized views will automatically update the views to incorporate relevant changes made to the base tables. Some smart systems perform these changes incrementally, leading to considerable reduction in update processing.

Warning: MySQL does not support materialized views, but they can be simulated with tables containing result sets that are refreshed by pre-computed queries. These tables need to be maintained via batch jobs or triggers on the base tables. In many summary-data situations, the base tables are updated by batch jobs anyway.

Code Sample: OptimizeDB/Demos/Materialized-View.sql

DROP TABLE IF EXISTS customer_summary;
CREATE TABLE customer_summary
-- Correlated subqueries are used for the two aggregates; joining
-- rental and payment directly would multiply the rows per customer
-- and inflate both the count and the sum.
SELECT cu.customer_id AS ID, CONCAT(cu.first_name, _utf8' ', cu.last_name) AS name,
 a.address AS address, a.postal_code AS `zip code`, a.phone AS phone,
 city.city AS city, country.country AS country,
 IF(cu.active, _utf8'active', _utf8'') AS notes, cu.store_id AS SID,
 (SELECT COUNT(*) FROM rental r
   WHERE r.customer_id = cu.customer_id) AS rental_count,
 (SELECT IFNULL(SUM(p.amount), 0) FROM payment p
   WHERE p.customer_id = cu.customer_id) AS payment
FROM customer AS cu
JOIN address AS a ON cu.address_id = a.address_id
JOIN city ON a.city_id = city.city_id
JOIN country ON city.country_id = country.country_id;

Code Explanation

A new table is created to contain often-used customer information, representing a materialized view.

The table also keeps rental count and total payments made by each customer.
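The customer_summary table must be refreshed periodically. A minimal sketch of a scheduled rebuild, assuming MySQL's event scheduler is enabled (event_scheduler = ON); the event name is our own:

```sql
-- Nightly rebuild of the simulated materialized view. Correlated
-- subqueries keep the rental count and payment sum from being
-- inflated by a join fan-out.
DELIMITER //
CREATE EVENT refresh_customer_summary
ON SCHEDULE EVERY 1 DAY
DO
BEGIN
  TRUNCATE TABLE customer_summary;
  INSERT INTO customer_summary
  SELECT cu.customer_id, CONCAT(cu.first_name, ' ', cu.last_name),
         a.address, a.postal_code, a.phone, city.city, country.country,
         IF(cu.active, 'active', ''), cu.store_id,
         (SELECT COUNT(*) FROM rental r
           WHERE r.customer_id = cu.customer_id),
         (SELECT IFNULL(SUM(p.amount), 0) FROM payment p
           WHERE p.customer_id = cu.customer_id)
  FROM customer AS cu
  JOIN address AS a ON cu.address_id = a.address_id
  JOIN city ON a.city_id = city.city_id
  JOIN country ON city.country_id = country.country_id;
END//
DELIMITER ;
```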


Transaction planning and design impact performance in many ways.


Indexes

Indexes greatly reduce database access time, but they incur maintenance overhead when rows are inserted, updated, or deleted: additional disk I/O, more processing time, and extra storage.

MySQL uses indexes to:

  1. Find the rows matching a WHERE clause quickly.
  2. Retrieve rows from other tables when performing joins.
  3. Find the MIN() or MAX() value for an indexed column.
  4. Sort or group rows when the operation is done on a leftmost prefix of a usable index.

Types Of Indexes

Two indexes commonly used in databases are the hash and B+-tree indexes. Hash indexes are very fast for equality searches, especially for columns with unique values. The number of disk blocks retrieved to get the desired data will be very small, say one or two. But hash indexes do not perform well for range queries because they are not clustering indexes - database items within a range might not be stored contiguously on disk because their hash values may be very different.

B+-tree indexes, on the other hand, perform well for both equality and range queries though they are not as fast as hash indexes for equality searches:

SELECT * FROM film WHERE title = 'FARGO';

If there are going to be many such queries, then a hash index on column title of film would be appropriate.
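In MySQL, however, explicit hash indexes are available only in the MEMORY (and NDB) storage engines; InnoDB and MyISAM tables always use B-tree indexes. A sketch using a MEMORY lookup table (the table and index names are our own):

```sql
-- A MEMORY table with an explicit hash index on title; suited to
-- exact-match lookups such as WHERE title = 'FARGO'.
CREATE TABLE film_lookup (
  film_id SMALLINT UNSIGNED NOT NULL,
  title   VARCHAR(255) NOT NULL,
  INDEX idx_title USING HASH (title)
) ENGINE = MEMORY;
```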

Hash Index Characteristics:

  1. They are used only for equality comparisons with the = or <=> operators, not for range operators such as <, >, or BETWEEN.
  2. The optimizer cannot use them to speed up ORDER BY operations.
  3. Only whole keys can be used to search for a row, whereas with a B-tree index any leftmost prefix of the key can be used.

B-tree Indexes

Hash indexes cannot be used for range queries:

SELECT * FROM film WHERE rental_rate > 2.99;

A B-tree index can be used for column comparisons in expressions that use the =, >, >=, <, <=, or BETWEEN operators, so a B+-tree index on column rental_rate of film would serve the query above well. The index can also be used for LIKE comparisons if the argument to LIKE is a constant string that does not start with a wildcard character. For example, the following SELECT statements use indexes:

SELECT * FROM tbl_name WHERE key_col LIKE 'Patrick%';
SELECT * FROM tbl_name WHERE key_col LIKE 'Pat%_ck%';

In the first statement, only rows with 'Patrick' <= key_col < 'Patricl' are considered. In the second statement, only rows with 'Pat' <= key_col < 'Pau' are considered.
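Such an index can be created explicitly; on InnoDB and MyISAM tables, CREATE INDEX builds a B-tree index by default. The index name below is our own choice (sakila already ships with an idx_title index on film.title):

```sql
-- A B-tree index supporting range queries on rental_rate.
CREATE INDEX idx_rental_rate ON film (rental_rate);
```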

Indexes are not used when ...

The following SELECT statements do not use indexes:

SELECT * FROM tbl_name WHERE key_col LIKE '%Patrick%';
SELECT * FROM tbl_name WHERE key_col LIKE other_col;

In the first statement, the LIKE value begins with a wildcard character. In the second statement, the LIKE value is not a constant.

Different Data Types for Compared Columns

MySQL can use indexes on columns more efficiently if they are declared as the same type and size. In this context, VARCHAR and CHAR are considered the same if they are declared as the same size. For example, VARCHAR(10) and CHAR(10) are the same size, but VARCHAR(10) and CHAR(15) are not.

Comparing columns with different types may prevent use of indexes when values cannot be compared without conversion, for example when a numeric column is compared to a string column. For a given value such as 5 in the numeric column, several values in the string column compare equal to it, such as '5 ', ' 5', or '005'. This conversion requirement prevents the use of any index on the string column.
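To illustrate with a hypothetical table t that has an index on a string column str_col:

```sql
-- Hypothetical example: str_col is an indexed VARCHAR column.
SELECT * FROM t WHERE str_col = 5;    -- numeric comparison: every stored
                                      -- value must be converted, so the
                                      -- index on str_col cannot be used
SELECT * FROM t WHERE str_col = '5';  -- string comparison: index usable
```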

What To Index?

Analyzing and understanding frequently executed queries can help decide what to index. For example, assume that queries such as the following are executed frequently:

SELECT title
FROM film JOIN inventory USING (film_id) JOIN rental USING (inventory_id)
WHERE rating = 'PG';

A database system will try to minimize the number of rows involved in a join. Assuming that rental is larger than film, it will first select the rows from the film table that satisfy the filter condition (rating = 'PG') and then perform the joins on that smaller row set.

Whether or not an index should be created depends on the application. If a SELECT statement returns a small percentage of the total number of rows, then it will benefit from an index; otherwise, it may not:

SELECT title FROM film WHERE title LIKE 'G%';

The result of this query could be empty, or consist of one row, a few rows, or a large number of rows. If the result is a small percentage of the rows, say 5% or less, an index will likely help; if the percentage is 50% or more, using an index may not be beneficial. As the number of rows in a table increases, the benefit of using an index increases.

The breakeven point for using an index will depend, in addition to query selectivity and table size, on the query result size, the frequency of the queries, and the need for fast responses.
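One rough way to measure selectivity before creating an index is to compute the matching percentage directly:

```sql
-- Percentage of film rows that a LIKE 'G%' predicate would match;
-- a small percentage suggests an index on title would pay off.
SELECT 100 * SUM(title LIKE 'G%') / COUNT(*) AS matching_pct
FROM film;
```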

Note: Indexes not being used should be removed to reduce maintenance overheads.

Client-Server Interactions

All users (clients) interacting with a database server expect, indeed demand, good response times. Many factors in the interactions between client applications and the database server affect performance, for example the number of network round trips a task requires, the amount of data transferred per request, and whether work is processed row by row or in sets.

Collecting MyISAM Index Statistics

Storage engines collect statistics about tables for use by the optimizer. Table statistics are based on value groups, where a value group is a set of rows with the same key prefix value. For optimizer purposes, an important statistic is the average value group size.

MySQL uses the average value group size to estimate how many rows must be read for each ref access and how many rows a partial join will produce.

As the average value group size for an index increases, the index becomes less useful for those two purposes because the average number of rows per lookup increases. For the index to be good for optimization purposes, it is best that each index value target a small number of rows in the table. When a given index value yields a large number of rows, the index is less useful and MySQL is less likely to use it.

The SHOW INDEX statement displays a cardinality value based on N/S, where N is the number of rows in the table and S is the average value group size. That ratio yields an approximate number of value groups in the table.

To help make such decisions, MySQL provides the ANALYZE TABLE statement, which collects statistics about the data in a table, including information about data distribution; the optimizer uses these statistics to estimate query selectivity and result size. The following statement generates data distribution statistics for the table rental:

ANALYZE TABLE rental;

Here is an exercise to perform some table analysis:

Exercise: Table Analysis

Duration: 20 to 30 minutes.

In this exercise, you will analyze tables.

  1. See indexes for rental table, check cardinality.
  2. Write and execute a procedure to insert several rows with random data in the rental table.
  3. See indexes for the rental table again; cardinality will still show the old numbers.
  4. Analyze table.
  5. See indexes for rental table again, cardinality should reflect new rows inserted.
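A possible sketch of the exercise steps; the procedure name and the random-value ranges are our own, and each rental_date is shifted by one second to avoid the unique key on (rental_date, inventory_id, customer_id):

```sql
SHOW INDEX FROM rental;            -- step 1: note the Cardinality column

DELIMITER //
CREATE PROCEDURE insert_random_rentals(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    INSERT INTO rental (rental_date, inventory_id, customer_id, staff_id)
    VALUES (NOW() - INTERVAL i SECOND,
            FLOOR(1 + RAND() * 4000),   -- existing inventory ids
            FLOOR(1 + RAND() * 500),    -- existing customer ids
            1);
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL insert_random_rentals(1000);  -- step 2
SHOW INDEX FROM rental;            -- step 3: cardinality not yet updated
ANALYZE TABLE rental;              -- step 4
SHOW INDEX FROM rental;            -- step 5: cardinality reflects new rows
```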

To regenerate table statistics, one can also use myisamchk or CHECK TABLE.

Optimizing Databases and Their Objects Conclusion

This lesson covered the optimization of a MySQL database for performance.

To continue to learn MySQL go to the top of this page and click on the next lesson in this MySQL Tutorial's Table of Contents.