In this article, I will cover the 6 normal forms and the methods for bringing tables into these forms.

The process of designing a database using the NF method is iterative and consists of sequentially converting a relation from 1NF to a higher-order NF according to specific rules. Each subsequent NF is constrained by a certain type of functional dependency and eliminates corresponding anomalies when performing operations on DB relations, while also preserving the properties of the preceding NFs.

Terms Used


Attribute — a property of some entity. Often called a table field.

Attribute domain — the set of permissible values that an attribute can take.

Tuple — a finite set of interrelated permissible attribute values that together describe some entity (a table row).

Relation — a finite set of tuples (a table).

Relation schema — a finite set of attributes that define some entity. In other words, it is the structure of a table consisting of a specific set of fields.

Projection — a relation obtained from a given one by deleting and/or rearranging some attributes.

Functional dependency between attributes (sets of attributes) X and Y means that for any valid set of tuples in a given relation: if two tuples have the same value for X, they must also have the same value for Y. For example, if the value of the "Company Name" attribute is Canonical Ltd, then the value of the "Headquarters" attribute in such a tuple will always be Millbank Tower, London, United Kingdom. Notation: {X} -> {Y}.

Normal form — a requirement imposed on the structure of tables in relational database theory to eliminate redundant functional dependencies between attributes (table fields) from the database.

The method of normal forms (NF) consists of collecting information about the objects of a problem within a single relation and the subsequent decomposition of this relation into several interrelated relations based on relation normalization procedures.

The goal of normalization: to eliminate redundant data duplication, which is the cause of anomalies that arise when adding, editing, and deleting tuples (table rows).

An anomaly is a situation in a DB table that leads to a contradiction in the DB or significantly complicates DB processing. The cause is excessive data duplication in the table, which is caused by the presence of functional dependencies on non-key attributes.

Modification anomalies occur when changing some data may require scanning the entire table and correspondingly changing some table records.

Deletion anomalies — when deleting a tuple from a table, information that is not directly related to the deleted record may be lost.

Insertion anomalies occur when information cannot be placed in a table until it is complete, or when inserting a record requires an additional scan of the table.

First Normal Form


A relation is in 1NF if all of its attributes are simple, and all domains used must contain only scalar values. There should be no duplicate rows in the table.

For example, there is a table "Cars":

Company Models
BMW M5, X5M, M1
Nissan GT-R

A violation of 1NF normalization occurs in the BMW models, as one cell contains a list of 3 elements: M5, X5M, M1, i.e., it is not atomic. Let's convert the table to 1NF:
Company Models
BMW M5
BMW X5M
BMW M1
Nissan GT-R

Second Normal Form


A relation is in 2NF if it is in 1NF and every non-key attribute is irreducibly dependent on the Primary Key (PK).

Irreducibility means that there is no smaller subset of attributes within the candidate key from which this functional dependency can also be derived.

For example, given the table:

Model Company Price Discount
M5 BMW 5500000 5%
X5M BMW 6000000 5%
M1 BMW 2500000 5%
GT-R Nissan 5000000 10%

The table is in the first normal form, but not in the second. The price of the car depends on the model and the company. The discount depends on the company, meaning the dependency on the primary key is partial. This is corrected by decomposing it into two relations in which non-key attributes depend on the PK.

Model Company Price
M5 BMW 5500000
X5M BMW 6000000
M1 BMW 2500000
GT-R Nissan 5000000

Company Discount
BMW 5%
Nissan 10%


Third Normal Form


A relation is in 3NF when it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. Simply put, the second rule requires moving all non-key fields whose content can relate to multiple table records into separate tables.

Consider the table:

Model Shop Phone
BMW Rial-auto 87-33-98
Audi Rial-auto 87-33-98
Nissan Next-Auto 94-54-12


The table is in 2NF, but not in 3NF.
In the relation, the "Model" attribute is the primary key. Cars do not have personal phones, and the phone number depends exclusively on the shop.
Thus, the following functional dependencies exist in the relation: Model → Shop, Shop → Phone, Model → Phone.
The dependency Model → Phone is transitive, therefore, the relation is not in 3NF.
As a result of splitting the original relation, two relations in 3NF are obtained:

Shop Phone
Rial-auto 87-33-98
Next-Auto 94-54-12


Model Shop
BMW Rial-auto
Audi Rial-auto
Nissan Next-Auto


Boyce-Codd Normal Form (BCNF) (a special case of the third normal form)


The definition of 3NF is not entirely suitable for the following relations:
1) the relation has two or more candidate keys;
2) two or more candidate keys are composite;
3) they overlap, i.e., they have at least one common attribute.

For relations having one candidate key (primary), BCNF is the same as 3NF.

A relation is in BCNF when every non-trivial and left-irreducible functional dependency has a candidate key as its determinant.

Suppose we are considering a relation representing data about daily parking reservations:

Parking Spot Number Start Time End Time Rate
1 09:30 10:30 Thrifty
1 11:00 12:00 Thrifty
1 14:00 15:30 Standard
2 10:00 12:00 Premium-B
2 12:00 14:00 Premium-B
2 15:00 18:00 Premium-A

The rate has a unique name and depends on the chosen parking spot and the availability of benefits, in particular:
  • “Thrifty”: parking spot 1 for beneficiaries
  • “Standard”: parking spot 1 for non-beneficiaries
  • “Premium-A”: parking spot 2 for beneficiaries
  • “Premium-B”: parking spot 2 for non-beneficiaries.

Thus, the following composite primary keys are possible: {Parking Spot Number, Start Time}, {Parking Spot Number, End Time}, {Rate, Start Time}, {Rate, End Time}.

The relation is in 3NF. The requirements of the second normal form are met, as all attributes are part of some candidate key, and there are no non-key attributes in the relation. There are also no transitive dependencies, which meets the requirements of the third normal form. Nevertheless, there is a functional dependency Rate → Parking Spot Number, in which the left side (the determinant) is not a candidate key of the relation, meaning the relation is not in Boyce-Codd Normal Form.

The disadvantage of this structure is that, for example, the "Thrifty" rate could be mistakenly assigned to a reservation for the second parking spot, although it can only apply to the first parking spot.

The structure can be improved by decomposing the relation into two and adding the attribute Has Benefits, resulting in relations that satisfy BCNF (the attributes that are part of the primary key are underlined):

Rates
Rate Parking Spot Number Has Benefits
Thrifty 1 Yes
Standard 1 No
Premium-A 2 Yes
Premium-B 2 No

Reservations
Rate Start Time End Time
Thrifty 09:30 10:30
Thrifty 11:00 12:00
Standard 14:00 15:30
Premium-B 10:00 12:00
Premium-B 12:00 14:00
Premium-A 15:00 18:00

Fourth Normal Form


A relation is in 4NF if it is in BCNF and all non-trivial multivalued dependencies are in fact functional dependencies on its candidate keys.

In a relation R (A, B, C), a multivalued dependency R.A -> -> R.B exists if and only if the set of B values corresponding to a pair of A and C values depends only on A and not on C.

Suppose that restaurants produce different types of pizza, and the restaurants' delivery services operate only in certain areas of the city. The composite primary key of the corresponding relation variable includes three attributes: {Restaurant, Pizza Type, Delivery Area}.

Such a relation variable does not conform to 4NF, as the following multivalued dependency exists:
{Restaurant} → {Pizza Type}
{Restaurant} → {Delivery Area}

That is, for example, when adding a new type of pizza, one new tuple will have to be entered for each delivery area. A logical anomaly is possible, where a certain type of pizza corresponds to only some of the delivery areas served by the restaurant.

To prevent the anomaly, the relation must be decomposed by placing independent facts in different relations. In this example, a decomposition into {Restaurant, Pizza Type} and {Restaurant, Delivery Area} should be performed.

However, if an attribute that is functionally dependent on the candidate key is added to the original relation variable, for example, a price that includes the delivery cost ({Restaurant, Pizza Type, Delivery Area} → Price), then the resulting relation will be in 4NF and can no longer be decomposed without loss.

Fifth Normal Form


A relation is in 5NF if it is in 4NF and there are no complex join dependencies between attributes.
If "Attribute_1" depends on "Attribute_2", and "Attribute_2" in turn depends on "Attribute_3", and "Attribute_3" depends on "Attribute_1", then all three attributes must be part of a single tuple.

This is a very strict requirement that can only be met under additional conditions. In practice, it is difficult to find an example of a pure implementation of this requirement.

For example, a certain table contains three attributes: "Supplier", "Product", and "Customer". Customer_1 purchases several Products from Supplier_1. Customer_1 purchases a new Product from Supplier_2. Then, by virtue of the requirement stated above, Supplier_1 is obliged to supply Customer_1 with the same new Product, and Supplier_2 must supply Customer_1 with the entire range of Supplier_1's Products in addition to the new Product. This does not happen in practice. The customer is free in their choice of products. Therefore, to eliminate this difficulty, all three attributes are separated into different relations (tables). After creating three new relations (Supplier, Product, and Customer), it is necessary to remember that when retrieving information (for example, about customers and products), all three relations must be joined in the query. Any combination of joining two of the three relations will inevitably lead to the retrieval of incorrect (invalid) information. Some DBMS are equipped with special mechanisms that prevent the retrieval of unreliable information. Nevertheless, the general recommendation should be followed: build the database structure in such a way as to avoid the use of 4NF and 5NF.

The fifth normal form is oriented towards working with join dependencies. The specified join dependencies between three attributes are very rare. It is practically impossible to specify join dependencies between four, five, or more attributes.

Domain-Key Normal Form


A relation variable is in DKNF if and only if every constraint imposed on it is a logical consequence of the domain constraints and key constraints imposed on that relation variable.
A domain constraint is a constraint that prescribes using for a specific attribute only values from a certain specified domain. The constraint is essentially a specification of a list (or a logical equivalent of a list) of permissible values of a type and a declaration that the specified attribute has this type.

A key constraint is a constraint stating that a certain attribute or combination of attributes is a candidate key.

Any relation variable that is in DKNF is necessarily in 5NF. However, not every relation variable can be brought into DKNF.

Sixth Normal Form


A relation variable is in the sixth normal form if and only if it satisfies all non-trivial join dependencies. It follows from the definition that a variable is in 6NF if and only if it is irreducible, that is, it cannot be subjected to further lossless decomposition. Every relation variable that is in 6NF is also in 5NF.

The idea of "decomposition to the end" was proposed before research in the field of temporal data began, but it did not find support. However, for temporal databases, the maximum possible decomposition helps to combat redundancy and simplifies the maintenance of database integrity.

For temporal databases, U_operators are defined, which unpack relations by specified attributes, perform the corresponding operation, and pack the result. In this example, the join of the relation's projections should be performed using the U_JOIN operator.

Employees
Emp.ID Time Position Home Address
6575 01-01-2000:10-02-2003 mechanic 10 Lenina St.
6575 11-02-2003:15-06-2006 mechanic 22 Sovetskaya St.
6575 16-06-2006:05-03-2009 foreman 22 Sovetskaya St.

The "Employees" relation variable is not in 6NF and can be decomposed into the "Employee Positions" and "Employee Home Addresses" relation variables.

Employee Positions
Emp.ID Time Position
6575 01-01-2000:10-02-2003 mechanic
6575 16-06-2006:05-03-2009 foreman

Employee Home Addresses
Emp.ID Time Home Address
6575 01-01-2000:10-02-2003 10 Lenina St.
6575 11-02-2003:15-06-2006 22 Sovetskaya St.

Literature


For a deeper and more thorough study of the topic discussed, the book "An Introduction to Database Systems" by C. J. Date is recommended, on the materials of which this article was based.