Database tuning is the process of improving database performance by minimizing response time (the time it takes a statement to complete) and maximizing throughput the number of statements a database can handle concurrently per second).
Tuning is a team exercise - collectively performed by DBAs, database designers, application designers and database users, based understanding of the database. Tuning both depends on and impacts the following:
- Table design, relationships, index design, and other components.
- Query design, size of data read and retrieved, and order of execution.
- Nature and frequency of read and insert/update/delete operations
- Partitioning of the work between the Database server and the client.
- Timing, events and effect of loading tables, indexes or parts there of into memory.
- Concurrency characteristics of statements
Database performance becomes an important issue in the presence of
- large amounts of data,
- complex queries,
- queries manipulating large amounts of data,
- long running queries,
- queries that lock every one else out,
- large number of simultaneous transactions,
- large numbers of users, and
- limited bandwidth.
In general, most database systems are designed for good performance. The best improvements can be achieved in the initial design phase but sometimes not enough information is available about the characteristics of a database. Later, altering a large database in production use can be expensive and practical considerations put constraints on what can be changed.
Tuning can make the difference between a query taking milliseconds or minutes or even more to execute.
Imagine that the customer table in sakila database contains information about 500,000 customers and city table contains another several hundred rows:
Without appropriate indexes, the database system may take a long time completing this query due to massive joins. More on query tuning and indexing in another lessson.
Improving Application Design
A SQL statement or transaction can be written in different ways to perform the same task - some good and some not so good. Smart Database systems automatically rearrange SQL queries for performance via query optimization. In doing so, the database system query optimizer uses information such as table sizes, indexes available, and the distribution of data.
However, a query optimizer may not be able to optimize poorly written SQL code best understood by the application writer.
If some queries are known to require more processing time than others, an application may be rearranged to run those at a no-load or low-load time.
Range queries (ones that search for values within a range) can be slower than exact match queries. Also, join and nested queries, and queries that use clauses such as the DISTINCT and ORDER BY clauses can be slow.
A large number of simultaneous users can slow down response time if the system does not have enough memory or processor cycles or if the transactions conflict heavily.
Using Non-Normalized Tables To Eliminate Joins
As we have seen in another lesson, normalizing tables eliminates data redundancy and the associated data consistency problems. However, to improve performance, databases are sometimes use normalized tables to store redundant data to avoid or eliminate joins.
Here are two examples of reading film with categories:
Code Sample: OptimizeDB/Demos/Normalized-Film.sql
SELECT title, description, rating, c.name as category
FROM film f
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id);
Normalized access via film_category and category.
Execution of this query may be slowed down by the two joins. This often-used query can now be written with no join as:
Code Sample: OptimizeDB/Demos/Non-Normalized-Film.sql
SELECT title, description, rating, category_name as category
FROM film_detail f;
No joins are needed as category name is stored along with film in the Non-normalized table film_detail.
Frequently executed queries requiring a fair amount of fetching can be precomputed and their results stored in the database. This will reduce both disk I/O and query processing times.
Storing pre-computed summary data in the database can significantly improve performance on frequently used queries, such as maintaining counts, sums and related updates.
Here are two examples of reading film with stock count:
Code Sample: OptimizeDB/Demos/Count-On-Demand.sql
SELECT film_id, title, count(inventory_id) AS in_stock
FROM film JOIN inventory USING (film_id)
GROUP BY film_id
HAVING in_stock > 3;
Film inventory is always fetched fresh from database, requiring a join and a GROUP BY.
Execution of this query may be improved by eliminating both joins and aggregation-on-demand as:
Code Sample: OptimizeDB/Demos/Pre-Computed.sql
SELECT film_id, title, in_stock
WHERE in_stock > 3;
Film count in stock is stored with the film.
The updates to inventory will be automated using triggers on the inventory table.
The results need not be stored in existing tables. One may create new tables or even materialized views.
Warning: The performance-gain benefits of removing the joins and pre-computing must outweigh the cost of altering the table, any extra storage, maintaining redundant data consistently and other limitations of non-normalized tables such as loss of cardinality.
The tradeoff is between a slight performance degradation during inserts and changes to orders and a significant performance improvement in summary information queries.
Creating materialized views or pre-created result sets can be beneficial if computing this data involves joining large tables with very dynamic and loose matching criteria, very common in web applications these days via custom-search forms.
A database supporting materialized views will automatically update the views to incorporate relevant changes made to the base tables. Some smart systems perform these changes incrementally, leading to considerable reduction in update processing.
Warning: MySQL does not support materialized views and the same can be simulated via tables containing result-sets updated using pre-computed queries. These tables will need to be maintained via batch jobs or using triggers on base tables. In many summary-data situations, the base tables are updated using batch jobs anyway.
Code Sample: OptimizeDB/Demos/Materialized-View.sql
A new table is create to contain often used customer information to represent a materialized view.
The table also keeps rental count and total payments made by each customer.
Transaction planning and design impact performance in many ways.
- Generally, performance is improved by using small or short-running transactions.
- Splitting large and complex transactions into small transactions will also reduce chances of deadlock and the associated high cost of transaction management to break deadlocks or rollbacks. Small transactions presumably need fewer locks and thus compete less for locks between them, reducing the waiting time for locks. This also increases concurrency.
- In other cases, it may be beneficial to combine several small update transactions into one transaction so the disk writes at the end of each transaction are merged together.
- Using non-locking reads (SQL Level 1, that is, READ UNCOMMITED level) perform well but only if not-so-exact results are acceptable to the users.
Indexes greatly reduce database access time but they incur maintenance overheads when data rows are inserted, updated, or deleted, leading to additional disk I/O, more processing time, and extra storage.
MySQL uses indexes to:
- Quickly find the rows matching a WHERE clause.
- Eliminate rows from consideration. If there is a choice between multiple indexes, MySQL normally uses the index that finds the smallest number of rows.
- Retrieve rows from other tables when performing joins.
- Find the MIN() or MAX() value for a specific indexed column. For example:
FROM tbl_name WHERE key_part1=10;
- Sort or group a table if the sorting or grouping is done on a leftmost prefix of a usable key (for example, ORDER BY key_part1, key_part2).
- Optimize a query to retrieve values without consulting the data rows. If a query uses only columns from a table that are numeric and that form a leftmost prefix for some key, the selected values may be retrieved from the index tree for greater speed:
SELECT key_part3 FROM tbl_name
Types Of Indexes
Two indexes commonly used in databases are the hash and B+-tree indexes. Hash indexes are very fast for equality searches, especially for columns with unique values. The number of disk blocks retrieved to get the desired data will be very small, say one or two. But hash indexes do not perform well for range queries because they are not clustering indexes - database items within a range might not be stored contiguously on disk because their hash values may be very different.
B+-tree indexes, on the other hand, perform well for both equality and range queries though they are not as fast as hash indexes for equality searches:
SELECT * FROM film WHERE title = 'FARGO';
If there are going to be many such queries, then a hash index on column title of film would be appropriate.
Hash Index Characteristics:
- They are used only for equality comparisons that use the = or <=> operators.
- They are not used for comparison operators such as < that find a range of values.
- The optimizer cannot use a hash index to speed up ORDER BY operations, as hashes have no natural ordering.
- MySQL cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use). This may affect some queries if you change a MyISAM table to a hash-indexed MEMORY table.
- Only whole keys can be used to search for a row. Where as with a B-tree index, any leftmost prefix of the key can be used to find rows.
Hash indexes cannot be used for range queries:
SELECT * FROM film WHERE rental_rate > 2.99;
A B-tree index can be used for column comparisons in expressions that use the =, >, >=, <, <=, or BETWEEN operators. The index also can be used for LIKE comparisons if the argument to LIKE is a constant string that does not start with a wildcard character. For example, the following SELECT statements use indexes:
It would be best to use a B+-tree index on column rental_rate of film.
SELECT * FROM tbl_name WHERE key_col LIKE 'Patrick%';
SELECT * FROM tbl_name WHERE key_col LIKE 'Pat%_ck%';
In the first statement, only rows with 'Patrick' <= key_col < 'Patricl' are considered. In the second statement, only rows with 'Pat' <= key_col < 'Pau' are considered.
Indexes are not used when ...
The following SELECT statements do not use indexes:
SELECT * FROM tbl_name WHERE key_col LIKE '%Patrick%';
SELECT * FROM tbl_name WHERE key_col LIKE other_col;
In the first statement, the LIKE value begins with a wildcard character. In the second statement, the LIKE value is not a constant.
Different Data Types for Compared Columns
MySQL can use indexes on columns more efficiently if they are declared as the same type and size. In this context, VARCHAR and CHAR are considered the same if they are declared as the same size. For example, VARCHAR(10) and CHAR(10) are the same size, but VARCHAR(10) and CHAR(15) are not.
Comparing columns with different types may prevent use of indexes when values cannot be compared without conversion, for example, when a numeric column is compared to a string type column. For a given value such as 5 in the numeric column, it compares equal to several values in the string column such as '5 ', ' 5', '005', or '005.e1'. This conversion requirement will discard any indexes for the string column.
What To Index?
Analyzing and understanding frequently executed queries can help decide what to index. For example, assume that queries such as are executed frequently:
FROM film JOIN inventory USING (film_id) JOIN rental USING (inventory_id)
WHERE rating = 'PG';
A database system will try to minimize the number of rows involved in a join. Assuming that rental is larger than film, it will first select rows from the film table that satisfy the filter condition
Whether or not an index should be created depends on the application. If a SELECT statement returns a small percentage of the total number of rows, then it will benefit from an index; otherwise, it may not.:
SELECT title FROM film WHERE title LIKE 'G%';
The result of this query could be empty, consist of one row, a few rows, or a large number of rows. If the result is a small percentage, say 5% or less, of the rows. If the percentage is 50% or more, then using an index may not be beneficial. As the number of rows in a table increases, the benefit of using an index increases.
The breakeven point for using an index, will depend, in addition to query selectivity and table size, upon the query result size, the frequency of the queries, and the need for fast responses.
Note: Indexes not being used should be removed to reduce maintenance overheads.
All users (clients) interacting with a database server will expect (demand) good response times. Many factors in the interactions between applications running on the client and the database server will affect performance, for example:
Data Processing At The Client Or The Database Server: After the data has been retrieved, some additional processing may need to be performed - at the client or at the database server. If processing is done at the database server, then this may slow down the database server for everyone. If the processing is done at the client, then the needed data will have to be transmitted from the database to the client over the network which can take time and affect others. As an example suppose that a sales report has to be generated by aggregating much raw data and then sorting it in a variety of ways:
The sales report can be aggregated at the database server or at the client. If aggregated at the database server, it will slow down the server. If aggregated at the client, much data may have to be transmitted to the client. In this case, it would seem appropriate to do aggregation at the database server.
The sales report data can be sorted at the database server or at the client. It would be appropriate to do the sorting at the client to avoid burdening the database server. Unlike in the first case of sending data to the client for aggregation, not much data will have to be transmitted to the client.
Connections To The Database Server: Instead of opening a new connection for each SQL statement, the client should keep one connection open for the whole session. Opening and closing connections is expensive.
Data Transfer: Fetching rows, one by one, from the database server will be slower than fetching groups of rows.
MyISAM Index Collecting Statistics
Storage engines collect statistics about tables for use by the optimizer. Table statistics are based on value groups, where a value group is a set of rows with the same key prefix value. For optimizer purposes, an important statistic is the average value group size.
MySQL uses the average value group size to estimate:
- how may rows must be read for each ref access
- how many row a partial join will produce; that is, an operation like:
(...) JOIN tbl_name ON tbl_name.key = expr
As the average value group size for an index increases, the index is less useful for those two purposes because the average number of rows per lookup increases: For the index to be good for optimization purposes, it is best that each index value target a small number of rows in the table. When a given index value yields a large number of rows, the index is less useful and MySQL is less likely to use it.
The SHOW INDEX statement displays a cardinality value based on N/S, where N is the number of rows in the table and S is the average value group size. That ratio yields an approximate number of value groups in the table.
To help make decisions about performance, MySQL supports ANALYZE TABLE statement to collect statistics about the data in the tables including information about data distribution, and use this to determine the query selectivity and result size. The following statement generates data distribution statistics for the table rental
Here is an exercise to perform some table analysis:
Exercise: Table Analysis
In this exercise, you will analyze tables.
- See indexes for rental table, check cardinality.
- Write and execute a procedure to insert several rows with random data in the rental table.
- See indexes for rental table again, cardinality should reflect the same numbers.
- Analyze table.
- See indexes for rental table again, cardinality should reflect new rows inserted.
To regenerate table statistics, one can also use myisamchk or CHECK TABLE.
Optimizing Databases and Their Objects Conclusion
This lesson covered the optimization of a MySQL database for performance.