Brief History of SQL
In 1970, Dr. E.F. Codd published "A Relational Model of Data for Large Shared Data Banks," an article that outlined a model for storing and manipulating data using tables. Shortly after Codd's article was published, IBM began working on creating a relational database. Between 1979 and 1982, Oracle (then Relational Software, Inc.), Relational Technology, Inc. (later acquired by Computer Associates), and IBM all put out commercial relational databases, and by 1986 they all were using SQL as the data query language.
In 1986, the American National Standards Institute (ANSI) standardized SQL. This standard was updated in 1989, in 1992 (called SQL2), and again in 1999 (called SQL3). Standard SQL is sometimes called ANSI SQL or SQL92. All major relational databases support this standard but each has its own proprietary extensions. Unless otherwise noted, the SQL taught in this course is the standard ANSI SQL.
Relational Database Management System
A Relational Database Management System (RDBMS), commonly (but incorrectly) called a database, is software for creating, manipulating, and administering a database. For simplicity, we will often refer to RDBMSs as databases.
- Commercial Databases
- Oracle: Oracle is the most popular relational database. It runs on both Unix and Windows. It used to be many times more expensive than SQL Server and DB2, but it has come down a lot in price.
- SQL Server: SQL Server is Microsoft's database and, not surprisingly, only runs on Windows. It has only a slightly higher market share than Oracle on Windows machines. Many people find it easier to use than Oracle.
- DB2: IBM's DB2 was one of the earliest players in the database market. It is still very commonly used on mainframes and runs on both Windows and Unix.
- Popular Open Source Databases
- MySQL: Because of its small size, its speediness, and its support with solid documentation, MySQL has quickly become the most popular open source database. MySQL is available on several operating systems, including Windows and Unix.
- PostgreSQL: Until recently, PostgreSQL was the most popular open source database until that spot was taken over by MySQL. PostgreSQL now calls itself "the world's most advanced Open Source database software." The latest versions support features like native-language programming, advanced indexes and a query rewrite system that allows the database designer to create rules to dynamically transform operations on tables/views into alternate operations upon processing.
A relational database at its simplest is a set of tables used for storing data. Each table has a unique name and may relate to one or more other tables in the database through common values.
A table in a database is a collection of rows and columns. Tables are also known as entities or relations.
A row contains data pertaining to a single item or record in a table. Rows are also known as records or tuples.
A column contains data representing a specific characteristic of the records in the table. Columns are also known as fields or attributes.
A relationship is a link between two tables (i.e, relations). Relationships make it possible to find data in one table that pertains to a specific record in another table.
Each of a table's columns has a defined datatype that specifies the type of data that can exist in that column. For example, the FirstName column might be defined as varchar(20), indicating that it can contain a string of up to 20 characters. Unfortunately, datatypes vary widely between databases.
Most tables have a column or group of columns that can be used to identify records. For example, an Employees table might have a column called EmployeeID that is unique for every row. This makes it easy to keep track of a record over time and to associate a record with records in other tables.
Foreign key columns are columns that link to primary key columns in other tables, thereby creating a relationship. For example, the Customers table might have a foreign key column called SalesRep that links to EmployeeID, the primary key in the Employees table.
Valid Object References
Database Manipulation Language (DML)
DML statements are used to work with data in an existing database. The most common DML statements are:
Database Definition Language (DDL)
DDL statements are used to structure objects in a database. The most common DDL statements are:
Database Control Language (DCL)
DCL statements are used for database administration. The most common DCL statements are:
- DENY (SQL Server Only)
Database Design Principles
Modeling and designing a database comes first in any database application. Database design greatly impacts the development, performance, maintenance and the flexibility of the application. Bad decisions and errors of the design phase will continue to cause long-term trouble and ill- effects.
Database design for any non-trivial application is not easy and usually requires experience. This lesson will present a quick refresher with some pointers to the fundamentals of relational databases. It summarizes different data types, table types available under MySQL, and demonstrates normalization rules. In addition, indexes and integrity rules such as foreign key constraints are also covered.
The database design should be centered around the needs of various users, generally understood via a detailed requirement analysis of the system. A database designer must understand the usage patterns of the data. This understanding is reached by interviewing potential users towards analysis of the organization's database needs.
The database designer must determine items such as:
- Data items to be stored
- Tasks or Operations to be performed with this data
- Frequency of these operations
- Restrictions and Constraints on the data
- Security of data
Database modeling tools capture the designer's understanding of the database requirements and the resulting database model leads to a database design.
Film Rentals - sakila
The data model we design here is created for sakila - a fictional film rental company that rents films to its customers. The database tracks and provides information on various films, records rental transactions, and stores the names of the store's customers and sales staff. To mode the data, we use the following basic business rules:
- Store information for each film: title, rating, category, inventory, language Multiple copies of the same film are treated separately in the inventory.
- Store information for customers including address who reserve and rent films.
- Store information for sales staff who rent films
- Customer records should be distinct from staff records.
- Store information about each rental transaction. The information includes the customer who rented the film, which film/copy, staff who ran the transaction, the date of the rental, the date of return. Each film rental is recorded as an individual transaction.
- Keep payment information for the rental.
- Store basic look up information on categories, languages, studios.
The business rules is just an initial list of specifications that would be required to create a database, and should generally be enough to get started towards a data model. We enhance our database with other elements based on additional business rules.
Basic Modeling Process: Creating a Data Model
A relational model is based on tables and on the meaningful relationships between those tables. Specifically, a table is made up of columns and rows that form a table-like structure. The rows are identified through the use of primary keys, and the columns map to various attributes for an entity. Using these concepts, we design a database that adheres to the standards of the relational model.
A data model for a relational database should show all of the following information:
The tables or potential entities that make up the database
The columns that make up each table
The data type that defines each column
The primary key that identifies each row
The relationships that exist between tables
Refine the data model towards full normalization
Logical versus Physical Model
Ultimately, a data model should be concerned with how the represented database is implemented in a particular RDBMS. So we first design a logical data model and use it to develop a physical data model that represents how that database will actually be implemented in a particular RDBMS such as MySQL.
A logical data model is concerned with pure data storage, adhering strictly to the rules of normalization and relational modeling and is indifferent to implementation aspects such as what files, auto-generation of keys.
For this course, we build only a Physical data model as we are already targetting MySQL.
The Entity-Relationship Model
The popular Entity-Relationship (ER) model is often used for modeling databases. Its benefits lie in its simplicity, clarity, and simple graphical representation of data and their relationships. Data is described using concepts such as "entities," "entity sets," "attributes," and "relationships."
Entities And Entity Sets
An entity is a distinct object distinguishable from other objects. For example, a film and a customer are two different entities. A collection of similar entities group to form entity sets.
Properties of Entities are called attributes. For example,
- the rating of a film is an attribute of the entity film, and
- the last name and address of a customer are attributes of the entity customer.
Attributes have values associated with them, individually identified with an entity. For example, a film has a rental rate of $3.99, and is different from another film with the same rate of $3.99. A customer entity whose attribute last name has the value Burke is different from another customer with last name as Hagan.
Different attribute values distinguish similar entities in the same entity set from each other, but many entities can have the same attribute value. For example, there can be several film entities with the same rental rate.
An entity must have one or more attributes. All entities in an entity set have the same attributes. A subset of attributes are used to distinguish entities in an entity set from each other generally called a key of the entity set. Entities in an entity set must have a unique set of values for the key.
A relationship instance specifies an association between entities. For example, Customers reserve and rent films. The process of Reservations relates a customer and the films reserved.
Entities can have several types of relationships:
- One-To-One: An entity in one entity set is associated with at most one entity in the other entity set and vice versa.
- One-To-Many: An entity in one entity set is associated with many (zero or more) entities in the other entity set.
- Many-To-One: Many (zero or more) entities in one entity set can be associated with one entity in the other entity set.
- Many-To-Many: Many (zero or more) entities in one entity set can be associated with many entities in the other entity set.
Business rules indicate a relationship between two or more tables. Let us identify persistent (worth-storing) relationships in our sakila database.
- Customers reserve films - Many to many
- Customers have rental transactions on film copies - Many to many
- A film is of a certain category and in a certain language - Many to one
- Different staff reserves and rents films to customers - Many to many
Relationships can also have attributes. They are used to give information about the relationship. For example, the rental_date attribute in the rent relationship can be used to describe when a film was rented.
One of the defining characteristics of a relational database is the fact that various types of relationships exist between tables. These relationships allow the data in the tables to be associated with each other in meaningful ways that help ensure the integrity of normalized data. Because of these relationships, actions in one table cannot adversely affect data in another table.
For any relational database, there are three fundamental types of relationships that can exist between tables: one-to-one relationships, one-to-many relationships, and many-to-many relationships. This section takes a look at each of these relationships.
A one-to-one relationship can exist between any two tables in which a row in the first table can be related to only one row in the second table and a row in the second table can be related to only one row in the first table.
Several different systems are used to represent the relationships between tables, all of which connect the tables with lines that have special notations at the ends of those lines. The examples in this lesson use a very basic system to represent the relationships.
Generally, one-to-one relationships are the least likely type of relationships to be implemented in a relational database; however, there are sometimes reasons to use them. For example, you might want to separate tables simply because one table would contain too much data, or perhaps you would want to separate data into different tables so you could set up one table with a higher level of security. Even so, most databases contain relatively few, if any, one-to-one relationships. The most common type of relationship you're likely to find is the one-to-many.
As with one-to-one relationships, a one-to-many relationship can exist between any two tables in your database. A one-to-many relationship differs from a one-to-one relationship in that a row in the first table can be related to one or more rows in the second table, but a row in the second table can be related to only one row in the first table.
A one-to-many relationship is probably the most common type of relationship you'll see in your databases. A many-to-one relationship is simply a reversing of the order in which the tables are referred.
A many-to-many relationship can exist between any two tables in which a row in the first table can be related to one or more rows in the second table, but a row in the second table can be related to one or more rows in the first table. When a many-to-many relationship exists, it is implemented by adding a third table between these two tables that matches the primary key values of one table to the primary key values of the second table, making a many-to-many relationship as logical.
Linking via Foreign Keys
One other aspect to notice in the film table is the language_id column tagged FK1. FK is an acronym for a Foreign Key. As a foreign key, the language_id column contains a key values from the associated row in the language_id column in the language table. The foreign key represents a relationship between the two tables, designated by a line connecting the two tables.
Graphical Representation of a Model
The ER model allows the database design to be represented graphically:
- An entity set is represented by a rectangle.
- A relationship is represented by a diamond with lines connecting it to the entity sets that it relates.
- A "many" relationship is indicated by an "N" next to the line.
- A "one" relationship is indicated by a "1" next to the line.
- An attribute is represented by an oval connected by a solid line to the entity set rectangle or to the relationship diamond.
As an example, consider the basic ER diagram for a few of sakila entities:
Modeling sakila - Film Rentals Database
Our understanding so far leads us to model the sakila database as consisting of following entity sets:
Customers, Transactions, Films, and following (named) relationships, Reservations between Customers and Films, and Rentals between Customers and Films.
Note: Based on our experience, we interpret the non-specific requirement of storing name and address of a customer in practice as stored in component form, such as The first and last names of the customer should be stored separately as should the various sub-elements of the address.
The attributes listed may not uniquely identify a customer (several customers can belong the same company, several customers can have the same name, a family can share an address, and so on). Moreover, it may also be desirable to have a single identifier to uniquely refer to a customer. Consequently, we will add an "ID" attribute, which will have a unique value for each customer.
Let us examine the reservation operation
- ReservationId (a unique id that identifies each order),
Each reservation must contain information about the customer and the film reserved. These attributes cannot be in Customers or in Films.
Here is a more complete ER model for some of the tables in the sakila database:
The key (unique value) attributes in the three entity sets are identified by shaded ovals.
Note: ER modeling may not lead to a unique solution. Usually, complex databases are likely to be modeled differently by different database designers.
Physical Model: Mapping The ER Model To A Relational Database
Each entity set in the ER model is represented by a table in a relational database. Attributes of an entity set becomes the columns of the table.
Primary Key attributes such as film_id, uniquely identify members of the entity sets.
Each relationship is also represented by a table. A relationship table includes the key columns of the two entity set tables it relates. Relationship attributes become columns of the relationship table.
A one-to-many relationship table can be eliminated by storing the relationship information in one of the tables being related. This is also possible for many-to-one and one-to-one relationships. For example, relationships film.language or address.city do not need attributes, and it would be sufficient to simply put the language_id and city_id into the film and address entities.
Note: For one-to-many and many-to-one relationships, the mapping information must be stored in the "many" entity set to avoid creating extra rows.
One of the concepts most important to a relational database is that of normalized data. Normalized data is organized into a structure that preserves the integrity (consistent with no loss) of the data while minimizing redundant data by keeping logically related data together.
Once we have identified core entities and their attributes, we will normalize the data structure and further refine the tables and columns, one entity at a time.
A normalized database contains table structures refined according to the normalization rules referred to as normal forms - With the introduction of relational model, Codd included three normal forms. Extended normal forms have been defined after than, but the first three remain central to a relational model.
There is often a trade-off and therefore a balancing point between strict adherence to the normal forms and system performance. Often, the more normalized the data, the more taxing it can be on a system. In most situations, the first three normal forms sufficiently provide that balance.
As a start, memorize the 3 normal forms so that you can chant them in your sleep:
- 1NF: No repeating elements or groups of elements
- 2NF: No partial dependencies on a composite key
- 3NF: No dependencies on non-key attributes
First Normal Form
- For tables in the first normal form (1NF), each column in a row must be atomic. In other words, the column can contain only one value for any given row.
- Each row in a table must contain the same number of columns. Given that each column can contain only one value, this means that each row must contain the same number of values.
- All rows in a table must be different. Although rows might include the same values, each row, when taken as a whole, must be unique in the table.
The 1NF property ensures that a data element or column or attribute of a row is NOT a composite or multi-valued item. The use of set valued column types can lead to redundancy, loss of data and inconsistency.
Here is the Film table that is not in 1NF:
Table film is not in 1NF because the values in column category are sets of category names. So we will define a table film_category to list all the categories to which a particular film belongs.
Splitting the film table into two leads to tables that are in 1NF, but now a join will be required for some queries. For example, to list all the films in 'Family' category, we will use a join query - discussed later.
Exercise: film can be translated into many languages.
Warning: Abnormal Normal Form!
Defining column categories as an atomic VARCHAR may bring film to 1NF. Still, table film is not be considered to be in 1NF if strings are not treated as indivisible, especially where queries need specific film categories. We can surely use available operators such as concatenation to write and substring to read category names for a given film. Besides the complexity of extra programming, this mechanism is covertly a multi-valued column and sets up for inconsistency and redundancy.
Second Normal Form
- Tables in the second normal form (2NF) must be in 1NF and,
- All their columns not part of a key must be dependent upon the whole key, which can be a composite.
Note: A primary key made up of more than one column is referred to as a composite primary key.
As an example, suppose we have identified that entity film also contains rental information that differs for various cities where the films are rented. In other words, we have some data in this entity that is for a film and some data for the combination of city and film. This table is certainly NOT in 2NF !
A Non-key column rental_rate, that stores a rate for each film for a given city is in the film table. This rate does depend on film_id but also depends on a city_id.
So moving out rental_rate column which depends on the non-key city column will make the film table be in 2NF (assuming it satisfies the 1NF requirement). Values of the eliminated rate column, will be fetched for a combination of a given store/city and a film.
Before the move, the film table is not in 2NF. Note that changing the daily rate for a film without corresponding changes in column city_id will lead to data inconsistency.
We will move the rental_rate into a separate rental_city table, which will have both city_id and film_id part of its composite key. Now database from this association is in 2NF.
Third Normal From
- Tables in the third normal form (3NF) must be in 2NF and,
- No non-key column should depend upon another non-key column. In other words, all non-key columns must depend only upon the entity's key.
- All nonprimary key columns in the table must be dependent on the primary key and must be independent of each other.
A 2NF table can be converted to 3NF by removing the non-key columns whose values are determined by other non-key columns.
For example, all films are made by a certain studio. If attributes like studio-size are present in the film table, it is not in 3NF, as the studio-size has no dependence on the film itself, rather the non-key studio which made the film. Moving studio-size into the studio entity will bring film closer to 3NF.
Note: Specifying column B as being dependent (functionally dependent to be precise) upon column A is equivalent to saying that the values in column A determine the corresponding values in column B.
Designing Databases Conclusion
This lesson covered the basic design guidelines for modeling a database and explain the core Normalization principles