Database Design

Posted Mar 3, 2024

By Seoeun Hong 18 min read

The foundational knowledge stems from the NYU’s “CSCI-UA 479 Data Management and Analysis” course by Joe Versoza.

Motivation for Database Design

Terminology

Entity - some thing that we store data about (the sale of a property, an incident of a dog bite, a movie)
Relation - a two-dimensional table consisting of columns and rows, with only one value at the intersection of a column and row
Composite key - two or more columns in a table that uniquely identify a row in that table; no part of a composite key may be null

we’ll also take a look at some strange things that occur when working with a poorly designed database, which includes: insertion, deletion, and update anomalies.

Imagine that we store all information about courses and students in the following table

netid	first	last	course_num	name	room	semester	year
abc123	alice	cho	0480-003	DMA	101	fa	2019
abc123	alice	cho	0380-001	E-Sports	317	fa	2019
gh789	george	henderson	0480-002	DMA	317	fa	2019
ijk012	irena	kahn	0480-001	DMA	101	fa	2019
lmn34	lars	nunez	0480-001	DMA	101	fa	2019
lmn34	lars	nunez	0480-001	DMA	101	fa	2020
bcd456	bob	davis	0380-001	E-Sports	317	sp	2017
cd78	carol	diaz	0480-007	How 2 Troll	202	fa	2019
cde901	carol	evans	0380-001	E-Sports	317	fa	2019
efg456	eva	gu	0480-003	DMA	101	fa	2019

Which could make a good primary key for this table? Could it be just netid? Maybe include both netid and course_num? Or should we consider a combination like netid, course_num, semester, and year?

But is this primary key option feasible?

We can:

Easily display all students enrolled in a specific course ✅
Show all courses in fall 2019 with enrolled students ✅

However, displaying all courses in fall 2019? Well, that’s a bit tricky:

This table mixes information about courses, students, and enrollment.
A course seems to depend on someone being enrolled in it. (If we only insert course data, the netid, which is part of the composite primary key, would end up being null)
This kind of issue is known as an insertion anomaly.

Insertion Anomaly

An insertion anomaly happens when we can't add data due to an incomplete primary key (where none of the parts of a composite primary key can be left empty).

In our case:

We can’t add a student to the database unless they enroll in a course.
Adding a course is only possible if a student enrolls in it.

Usually, insertion anomalies pop up when a table holds data from multiple entities.

Consequently, inserting data about one entity forces the insertion of data from another entity even when that extra data is not needed.

Can a student withdraw from a course or leave the school? Well, it’s a bit tricky. While we can delete a row, the consequences reveal some challenges in database design.

If all students withdraw from a course, the entire course data disappears!
Deleting just the student data leads to a partially null composite key, known as a deletion anomaly.

Deletion Anomaly

A deletion anomaly arises when removing data about one entity in a row results in part of the composite primary key being null.

The dilemma is that the only remedy is to delete the entire row, potentially causing unintended data loss.

In our scenario:

If a student is the last one in a course, removing that student also means removing the entire course
If a course is canceled and only one student is enrolled, we lose that student’s data.

Imagine a scenario where a student decides to legally change their name to Robert’); DROP TABLE–

Can you change the student’s name in the database? Absolutely! But, there’s a catch:

You’ll have to update the name in multiple places.
If not done carefully, only a subset of the duplicate data might change, leading to inconsistent data!

This predicament falls under the category of an update anomaly.

Update Anomaly

An update anomaly arises when a subset of duplicate data is updated, resulting in inconsistent data. In simpler terms, not all instances of the same data are changed simultaneously.

this is mainly due to redundant data
of course, redundant data also means unnecessary usage of space

Why Take the Time to Design Your Database?

Lack of planning and design will lead to:

redundant (and as consequence, potentially inconsistent data)
difficulty adding data (insert anomalies)
difficulty removing data (deletion anomalies)
data prone to inconsistencies (update anomalies)

Good database design also allows for:

the enforcing of constraints and referential integrity (given the mechanisms available in a relational database)
growth / expandability
overall ease of conceptual understanding

Database Design

The process of designing a database typically involves the following stages:

Requirements Analysis:
- Specify the problem, collaborate with domain experts, explore existing data, etc., to:
  - Determine the necessary data to be stored.
  - Understand how the data is interrelated.
Conceptual Design:
- Focuses solely on entities, attributes, and relationships.
- Utilize requirements to formally describe the data, relationships, and constraints.
- This stage doesn’t delve into the actual implementation details (such as the platform or physical storage).
- Represents the most basic and abstract model.

After completing the conceptual design:

Translate to Logical Design:
- Transform the conceptual design into a logical design, incorporating details like data types and unique identifiers.
- This stage is more intricate than conceptual design, laying the groundwork for the subsequent steps.
Translate to Database Objects in a Specific DBMS/Database Platform:
- Implement the logical design by translating it into actual database objects within a chosen Database Management System (DBMS) or specific database platform.
- This step involves creating the database and defining the structures, relationships, and other technical specifications based on the logical design.

Design Rules

Your database design should aim to accomplish the following objectives:

Ability to Solve Problems:
- Understand the problem: Thoroughly comprehend the business requirements and challenges that the database is intended to address.
- Fulfill business requirements: Ensure that the design supports reporting and data analysis, meeting the specific needs of the organization.
Storage of Required Data:
- Structure to satisfy queries: Design the database structure in a way that it efficiently satisfies the required queries.
- Correct fields and types: Define appropriate fields with the correct data types to accurately represent the information.
Modeling Relationships:
- Establish relationships: Clearly define and model the relationships between different entities in the database to represent the real-world connections accurately.
Data Integrity:
- Correct data: Ensure the accuracy of data by implementing measures to prevent errors and inconsistencies.
- Correct relationships: Validate and maintain the integrity of relationships to avoid anomalies, inconsistency, and redundancy.
Flexibility for Future Changes:
- Adaptability: Design the database with a level of flexibility that allows it to accommodate future changes in business requirements without requiring a complete overhaul.
Performance and Efficiency:
- Indexing and optimization: Implement appropriate indexes and optimize queries to enhance the performance of the database.
- Consideration of efficiency: While performance is crucial, design choices should not compromise other aspects of the database, and decisions should be made considering the overall system requirements.

Terminology

data model - specifies the data and relationships to a DBMS; The actual implementation may vary based on the DBMS, so the data model is independent from the DBMS that is being used.
entity - some thing that we store data about, like a student, a dog bite incident, a pizza order, etc (essentially, a table)
an attribute is the data that describes an entity, like netid, breed of dog, or type of crust (think: columns)
entity identifier - an attribute or attributes that uniquely identify an instance of an entity (Ex. primary key)
an instance of an entity is the actual attribute data of an entity (like a row in a table)

Attribute Values

An attribute is typically designed to contain a single value for optimal database organization. However, it’s conceivable to encounter situations where an attribute might encompass multiple values. For instance:

A student’s ‘contact_info’ attribute could include a phone number, multiple email addresses, and an Instagram username.
Example: “555-555-5555, pg12@nyu.edu, sql.is.my.life@nyu.edu, @sqlXallXday” While this represents a multi-value attribute, it’s generally considered suboptimal design for the following reasons:

Slower search functionality: Retrieving information becomes more time-consuming as searches are no longer exact matches but involve substring matches.
Increased complexity in selecting specific information: Navigating and extracting only the desired piece of information becomes more challenging.
Preferable to have separate attributes: For improved database organization and efficiency, it’s advisable to have distinct attributes for each piece of information. It’s worth noting that some database management systems may support multi-value attributes, but careful consideration of the aforementioned drawbacks is essential.

Attribute Domains

Attributes are characterized by their type, representing values within specific domains.

For instance, a student’s GPA is a floating-point number, and a pie size can be categorized as small, medium, large, or even record-breaking.
Essentially, attribute domains dictate the possible values an attribute can hold.

Entities are interconnected, creating relationships defined by cardinality and participation.

Relationships can be classified as one-to-one, one-to-many, or many-to-many, with the possibility of zero participation on either side.
- For example, a student can exist without enrolled courses, and a course can exist without students.

Weak Entities

A weak entity is one that must be associated with another entity for its existence.

Consider the following examples:

A pizza order necessitates a customer.
An exam cannot exist independently of a course.

In such cases, one entity requires a relationship (the other side cannot be zero). Implementation of a weak entity field referencing the dependent table would involve constraints like a foreign key and a non-null condition.

Composite Entities

A composite entity models relationships between other entities, often acting as a bridge in many-to-many associations.

For instance, the relationship between a student and a course might be represented by an enrollment, serving as a kind of join table.

Documentation, ER diagrams

Entity Relationship Diagrams (ERDs) visually represent entities, their attributes, and relationships.

Various styles of ER diagrams exist; for our purposes, references like Chapter 3 of Relational Database Design and Implementation by Jan L. Harrington on O’Reilly Safari via NYU Libraries or Wikipedia’s list can be consulted.
We’ll adhere to either the Chen or crow’s feet notation in our diagrams.

Chen

Entities in boxes
No attributes or attributes in circles
Relationships:
- names in diamonds or (to read both ways) written adjacent to relationship line
- either arrows or numbers to note cardinality

Crow’s Feet / IE

Entities in boxes
Attributes listed in boxes
Relationships:
- name adjacent to relationship line
- || one and only one (mandatory)
- O| zero or one (optional)
- |< one or many (mandatory)
- O< zero or many (optional)

Normalization

The term relational database may be misleading as it doesn’t solely refer to a database containing entities and their relationships.

Instead, its origin lies in the concept of a relation, a mathematical abstraction akin to a table. In essence, a relational database is a compilation of these "relations," represented by tables.

However, the term has nuances, as relations possess specific properties, most of which are directly applicable to the tables embodying them. The foundational idea of a relational data model was introduced by Edgar (E.F.) Codd in the early 1970s and has undergone refinements by subsequent contributors.

Relations

A relation can be described as:

A structured table comprising columns (referred to as attributes) and rows (referred to as tuples).
Each row of data serves as an instance of a relation, akin to an instance of an entity.

However, this might resemble a conventional table or spreadsheet. So, what sets it apart?

The distinctive feature lies in the properties mandated for both rows and columns. These properties are enforced by a Database Management System (DBMS) through constraints, adding a layer of specificity and organization to the data structure.

Columns

Columns in a table must adhere to the following guidelines:

Unique names within a table are essential (though, across different tables, a column may share a name; aliasing can resolve conflicts when both tables and their common columns are used in the same query).
Values in a column must belong to a specific domain, defining the type of data the column holds.
The positional order of columns is generally inconsequential (although practical considerations may require ordering in certain queries, it does not alter the data’s meaning; the columns could be arranged differently and convey the same information).

Rows

When it comes to rows:

As we’ve established with entities, the intersection of a column and a row can only contain a singular value.
Duplicate rows are prohibited in a relation, emphasizing the importance of primary keys.
Similar to columns, the order of rows holds no intrinsic meaning (position has no impact on the data’s interpretation), allowing flexibility in how rows are presented.

Normalization

Normalization is the process of structuring a relational database to:

reduce redundant data / use less space
improve quality and integrity of data
- make it easier to change data (avoid insert, update and delete anomalies)
- allow for growth, evolution of data (reduce the need for restructuring tables completely when new data is introduced)
- make referential integrity constraints easier to enforce
- create clear and easy-to-understand structure that closely models the situation that the data represents

First Normal Form (1NF)

The First Normal Form outlines the following criteria:

No Repeating Groups:
- The definition of repeating groups has evolved over time, encompassing scenarios such as avoiding repeated columns or columns containing multiple values (maintaining atomicity).
- Essentially, a relation achieves 1NF when none of its domains (types of columns) permits a set as a possible value.
- Every intersection of a row and column should contain a singular, atomic value from the domain.
Attributes are Defined:
- Attributes must be clearly defined.
Attributes Depend on a Primary Key:
- Attributes should be dependent on a primary key.

An alternative perspective on 1NF relates it to the structure of a record type:

All rows must possess the same number of fields.
Atomicity is emphasized, prohibiting a single column from holding multiple values (e.g., phone numbers or names).

Additional attributes of 1NF include:

The ordering of columns or rows should not be considered relevant.
Duplicate rows are not allowed, reinforcing the significance of a primary key.

1NF Example

Are these table in first normal form?

netid	first	course
abc123	alice	ait
		nlp
def456	bob	nlp

netid	first	course
abc123	alice	ait,nlp
def456	bob	nlp

netid	first	course1	course2
abc123	alice	ait	nlp
def456	bob	nlp

None of these are in first normal form

attributes should depend on primary key
there should be no repeating groups / values should be atomic

netid	first	course
abc123	alice	ait
abc123	alice	nlp
def456	bob	nlp

Ok - this is 1NF, but

note that primary key now has to be combination of course and netid
probably better as multiple tables (we’ll see this later)

Second Normal Form (2NF)

A relation attains second normal form if:

It satisfies the requirements of 1NF.
There are no partial dependencies, where a partial dependency occurs when a non-prime attribute exhibits a functional dependency on a portion of the composite primary key (or on a candidate key).

It’s worth noting that if a table possesses a primary key composed of a single column, it inherently conforms to 2NF.

The second normal form primarily addresses many insertion and deletion anomalies that persist after achieving 1NF.

Candidate Keys

A candidate key refers to a column or set of columns that:

Uniquely identifies a row.
While possessing the capability to uniquely identify a row, it is not obligated to be the primary key.
A table may have multiple candidate keys, but only one of them serves as the primary key.

Functional Dependency

Functional dependency occurs when the value of one attribute depends on or is determined by the value of another attribute.

Consider two attributes, X and Y:

Y is functionally dependent on X if, for every row with the same value of X, the corresponding value of Y is the same.
Put simply, if you know the value of X, you can accurately determine the value of Y, although different values for X may yield the same value for Y.

2NF Example

netid	first	last	course name	course number
abc123	alice	cho	applied internet tech	0480-003
abc123	alice	cho	drawing on the web	0380-001
bcd456	bob	davis	drawing on the web	0380-001
cd78	carol	diaz	applied internet tech	0480-007
cde901	carol	evans	drawing on the web	0380-001

Identify some functional dependencies

netid → first
netid → last
course number → course name
name a candidate key
- a composite key: netid, course number
- a composite key: last, course number (based on the data in the example only)
name a part-key dependency
- first depends on netid or last (every time we see abc123 for netid, we have the same first name, alice)
- course name depends on course number

Student:

netid (pk)	first	last	course num (fk)
abc123	alice	cho	0480-003
abc123	alice	cho	0380-001
bcd456	bob	davis	0380-001
cd78	carol	diaz	0480-007
cde901	carol	evans	0380-001

Course:

course name	course num (pk)
applied internet tech	0480-003
applied internet tech	0480-007
drawing on the web	0380-001

primary key of student must be composite now
but we still have partial dependency!
ok - moar tables!

Student:

netid (pk)	first	last
abc123	alice	cho
bcd456	bob	davis
cd78	carol	diaz
cde901	carol	evans

student_course:

netid (fk)	course num (fk)
abc123	0480-003
abc123	0380-001
bcd456	0380-001
cd78	0380-007
cde901	0380-001

course:

course name	course num (pk)
applied internet tech	0480-003
applied internet tech	0480-007
drawing on the web	0380-001

Note that 2NF isn’t inherently tied to many-to-many relationships:

While our initial attempt wasn’t precisely accurate due to the presence of a many-to-many relationship, it’s essential to recognize that 2NF doesn’t exclusively revolve around many-to-many relationships.

In an alternate scenario where multiple duplicate students didn’t appear in the student table, meaning we weren’t tasked with modeling a student taking numerous classes, our initial attempt would have sufficed. However, it would have been limited to illustrating a single course accommodating many students.

Third Normal Form (3NF)

A relation achieves third normal form if:

It satisfies the requirements of 2NF.
It avoids any transitive dependencies, meaning no non-prime attribute is dependent on the primary key through another non-prime attribute.

It’s noteworthy that if a table contains only one non-key attribute and adheres to 2NF, it is automatically considered to be in 3NF.

course name	course num (pk)	dept	dept description
applied internet tech	0480-003	CS	bits and stuff
probability and statistics	0235-002	Math	numbers and stuff
applied internet tech	0480-007	CS	bits and stuff
drawing on the web	0380-001	CS	bits and stuff

“dept” is not a primary key but “dept description” depends on “course number” through “dept”

OK, how about this? →

course name	course num (pk)	dept_id
applied internet tech	0480-003	2
probability and statistics	0235-002	1
applied internet tech	0480-007	2
drawing on the web	0380-001	2

dept_id	name	dept description
1	Math	numbers and stuff
2	CS	bits and stuff

2NF and 3NF essentially deal with the relationships between keys and non-keys

“[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key.”

1NF: “the key” implies that the key exists
2NF: “the whole key” is a reference to no partial dependencies
3NF: “nothing but the key” means that it doesn’t depend on non-key attributes

4NF and 5NF

Typically, 3NF is *good enough* for preventing insert, update and delete anomalies

With that said, there are additional normal forms:

4NF: 3NF + “a record type should not contain two or more independentmulti-valued facts about an entity”
- See pizza example on wikipedia
- (always seems like the answer is more tables, amirite?)
5NF: 4NF + “information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record”

Finding Candidate Keys

Data Engineering