Database Design
The foundational knowledge stems from the NYU’s “CSCI-UA 479 Data Management and Analysis” course by Joe Versoza.
Motivation for Database Design
Terminology
- Entity - some thing that we store data about (the sale of a property, an incident of a dog bite, a movie)
- Relation - a two-dimensional table consisting of columns and rows, with only one value at the intersection of a column and row
- Composite key - two or more columns in a table that uniquely identify a row in that table; no part of a composite key may be null
we’ll also take a look at some strange things that occur when working with a poorly designed database, which includes: insertion, deletion, and update anomalies.
Imagine that we store all information about courses and students in the following table
netid | first | last | course_num | name | room | semester | year |
---|---|---|---|---|---|---|---|
abc123 | alice | cho | 0480-003 | DMA | 101 | fa | 2019 |
abc123 | alice | cho | 0380-001 | E-Sports | 317 | fa | 2019 |
gh789 | george | henderson | 0480-002 | DMA | 317 | fa | 2019 |
ijk012 | irena | kahn | 0480-001 | DMA | 101 | fa | 2019 |
lmn34 | lars | nunez | 0480-001 | DMA | 101 | fa | 2019 |
lmn34 | lars | nunez | 0480-001 | DMA | 101 | fa | 2020 |
bcd456 | bob | davis | 0380-001 | E-Sports | 317 | sp | 2017 |
cd78 | carol | diaz | 0480-007 | How 2 Troll | 202 | fa | 2019 |
cde901 | carol | evans | 0380-001 | E-Sports | 317 | fa | 2019 |
efg456 | eva | gu | 0480-003 | DMA | 101 | fa | 2019 |
Which could make a good primary key for this table? Could it be just netid
? Maybe include both netid
and course_num
? Or should we consider a combination like netid
, course_num
, semester
, and year
?
But is this primary key option feasible?
We can:
- Easily display all students enrolled in a specific course ✅
- Show all courses in fall 2019 with enrolled students ✅
However, displaying all courses in fall 2019? Well, that’s a bit tricky:
- This table mixes information about courses, students, and enrollment.
- A course seems to depend on someone being enrolled in it. (If we only insert course data, the
netid
, which is part of the composite primary key, would end up beingnull
) - This kind of issue is known as an insertion anomaly.
Insertion Anomaly
An insertion anomaly happens when we can't add data due to an incomplete primary key (where none of the parts of a composite primary key can be left empty).
In our case:
- We can’t add a student to the database unless they enroll in a course.
- Adding a course is only possible if a student enrolls in it.
Usually, insertion anomalies pop up when a table holds data from multiple entities.
Consequently, inserting data about one entity forces the insertion of data from another entity even when that extra data is not needed.
Can a student withdraw from a course or leave the school? Well, it’s a bit tricky. While we can delete a row, the consequences reveal some challenges in database design.
- If all students withdraw from a course, the entire course data disappears!
- Deleting just the student data leads to a partially null composite key, known as a deletion anomaly.
Deletion Anomaly
A deletion anomaly arises when removing data about one entity in a row results in part of the composite primary key being null.
The dilemma is that the only remedy is to delete the entire row, potentially causing unintended data loss.
In our scenario:
- If a student is the last one in a course, removing that student also means removing the entire course
- If a course is canceled and only one student is enrolled, we lose that student’s data.
Imagine a scenario where a student decides to legally change their name to Robert’); DROP TABLE–
Can you change the student’s name in the database? Absolutely! But, there’s a catch:
- You’ll have to update the name in multiple places.
- If not done carefully, only a subset of the duplicate data might change, leading to inconsistent data!
This predicament falls under the category of an update anomaly.
Update Anomaly
An update anomaly arises when a subset of duplicate data is updated, resulting in inconsistent data. In simpler terms, not all instances of the same data are changed simultaneously.
- this is mainly due to redundant data
- of course, redundant data also means unnecessary usage of space
Why Take the Time to Design Your Database?
Lack of planning and design will lead to:
- redundant (and as consequence, potentially inconsistent data)
- difficulty adding data (insert anomalies)
- difficulty removing data (deletion anomalies)
- data prone to inconsistencies (update anomalies)
Good database design also allows for:
- the enforcing of constraints and referential integrity (given the mechanisms available in a relational database)
- growth / expandability
- overall ease of conceptual understanding
Database Design
The process of designing a database typically involves the following stages:
- Requirements Analysis:
- Specify the problem, collaborate with domain experts, explore existing data, etc., to:
- Determine the necessary data to be stored.
- Understand how the data is interrelated.
- Specify the problem, collaborate with domain experts, explore existing data, etc., to:
- Conceptual Design:
- Focuses solely on entities, attributes, and relationships.
- Utilize requirements to formally describe the data, relationships, and constraints.
- This stage doesn’t delve into the actual implementation details (such as the platform or physical storage).
- Represents the most basic and abstract model.
After completing the conceptual design:
- Translate to Logical Design:
- Transform the conceptual design into a logical design, incorporating details like data types and unique identifiers.
- This stage is more intricate than conceptual design, laying the groundwork for the subsequent steps.
- Translate to Database Objects in a Specific DBMS/Database Platform:
- Implement the logical design by translating it into actual database objects within a chosen Database Management System (DBMS) or specific database platform.
- This step involves creating the database and defining the structures, relationships, and other technical specifications based on the logical design.
Design Rules
Your database design should aim to accomplish the following objectives:
- Ability to Solve Problems:
- Understand the problem: Thoroughly comprehend the business requirements and challenges that the database is intended to address.
- Fulfill business requirements: Ensure that the design supports reporting and data analysis, meeting the specific needs of the organization.
- Storage of Required Data:
- Structure to satisfy queries: Design the database structure in a way that it efficiently satisfies the required queries.
- Correct fields and types: Define appropriate fields with the correct data types to accurately represent the information.
- Modeling Relationships:
- Establish relationships: Clearly define and model the relationships between different entities in the database to represent the real-world connections accurately.
- Data Integrity:
- Correct data: Ensure the accuracy of data by implementing measures to prevent errors and inconsistencies.
- Correct relationships: Validate and maintain the integrity of relationships to avoid anomalies, inconsistency, and redundancy.
- Flexibility for Future Changes:
- Adaptability: Design the database with a level of flexibility that allows it to accommodate future changes in business requirements without requiring a complete overhaul.
- Performance and Efficiency:
- Indexing and optimization: Implement appropriate indexes and optimize queries to enhance the performance of the database.
- Consideration of efficiency: While performance is crucial, design choices should not compromise other aspects of the database, and decisions should be made considering the overall system requirements.
Terminology
- data model - specifies the data and relationships to a DBMS; The actual implementation may vary based on the DBMS, so the data model is independent from the DBMS that is being used.
- entity - some thing that we store data about, like a student, a dog bite incident, a pizza order, etc (essentially, a table)
- an attribute is the data that describes an entity, like netid, breed of dog, or type of crust (think: columns)
- entity identifier - an attribute or attributes that uniquely identify an instance of an entity (Ex. primary key)
- an instance of an entity is the actual attribute data of an entity (like a row in a table)
Attribute Values
An attribute is typically designed to contain a single value for optimal database organization. However, it’s conceivable to encounter situations where an attribute might encompass multiple values. For instance:
- A student’s ‘contact_info’ attribute could include a phone number, multiple email addresses, and an Instagram username.
- Example: “555-555-5555, pg12@nyu.edu, sql.is.my.life@nyu.edu, @sqlXallXday” While this represents a multi-value attribute, it’s generally considered suboptimal design for the following reasons:
- Slower search functionality: Retrieving information becomes more time-consuming as searches are no longer exact matches but involve substring matches.
- Increased complexity in selecting specific information: Navigating and extracting only the desired piece of information becomes more challenging.
- Preferable to have separate attributes: For improved database organization and efficiency, it’s advisable to have distinct attributes for each piece of information. It’s worth noting that some database management systems may support multi-value attributes, but careful consideration of the aforementioned drawbacks is essential.
Attribute Domains
Attributes are characterized by their type, representing values within specific domains.
- For instance, a student’s GPA is a floating-point number, and a pie size can be categorized as small, medium, large, or even record-breaking.
- Essentially, attribute domains dictate the possible values an attribute can hold.
Entities are interconnected, creating relationships defined by cardinality and participation.
- Relationships can be classified as one-to-one, one-to-many, or many-to-many, with the possibility of zero participation on either side.
- For example, a student can exist without enrolled courses, and a course can exist without students.
Weak Entities
A weak entity is one that must be associated with another entity for its existence.
Consider the following examples:
- A pizza order necessitates a customer.
- An exam cannot exist independently of a course.
In such cases, one entity requires a relationship (the other side cannot be zero). Implementation of a weak entity field referencing the dependent table would involve constraints like a foreign key and a non-null condition.
Composite Entities
A composite entity models relationships between other entities, often acting as a bridge in many-to-many associations.
For instance, the relationship between a student and a course might be represented by an enrollment, serving as a kind of join table.
Documentation, ER diagrams
Entity Relationship Diagrams (ERDs) visually represent entities, their attributes, and relationships.
- Various styles of ER diagrams exist; for our purposes, references like Chapter 3 of Relational Database Design and Implementation by Jan L. Harrington on O’Reilly Safari via NYU Libraries or Wikipedia’s list can be consulted.
- We’ll adhere to either the Chen or crow’s feet notation in our diagrams.
Chen
- Entities in boxes
- No attributes or attributes in circles
- Relationships:
- names in diamonds or (to read both ways) written adjacent to relationship line
- either arrows or numbers to note cardinality
Crow’s Feet / IE
- Entities in boxes
- Attributes listed in boxes
- Relationships:
- name adjacent to relationship line
||
one and only one (mandatory)O|
zero or one (optional)|<
one or many (mandatory)O<
zero or many (optional)
Normalization
The term relational database may be misleading as it doesn’t solely refer to a database containing entities and their relationships.
Instead, its origin lies in the concept of a relation, a mathematical abstraction akin to a table. In essence, a relational database is a compilation of these "relations," represented by tables.
However, the term has nuances, as relations possess specific properties, most of which are directly applicable to the tables embodying them. The foundational idea of a relational data model was introduced by Edgar (E.F.) Codd in the early 1970s and has undergone refinements by subsequent contributors.
Relations
A relation can be described as:
- A structured table comprising columns (referred to as attributes) and rows (referred to as tuples).
- Each row of data serves as an instance of a relation, akin to an instance of an entity.
However, this might resemble a conventional table or spreadsheet. So, what sets it apart?
The distinctive feature lies in the properties mandated for both rows and columns. These properties are enforced by a Database Management System (DBMS) through constraints, adding a layer of specificity and organization to the data structure.
Columns
Columns in a table must adhere to the following guidelines:
- Unique names within a table are essential (though, across different tables, a column may share a name; aliasing can resolve conflicts when both tables and their common columns are used in the same query).
- Values in a column must belong to a specific domain, defining the type of data the column holds.
- The positional order of columns is generally inconsequential (although practical considerations may require ordering in certain queries, it does not alter the data’s meaning; the columns could be arranged differently and convey the same information).
Rows
When it comes to rows:
- As we’ve established with entities, the intersection of a column and a row can only contain a singular value.
- Duplicate rows are prohibited in a relation, emphasizing the importance of primary keys.
- Similar to columns, the order of rows holds no intrinsic meaning (position has no impact on the data’s interpretation), allowing flexibility in how rows are presented.
Normalization
Normalization is the process of structuring a relational database to:
- reduce redundant data / use less space
- improve quality and integrity of data
- make it easier to change data (avoid insert, update and delete anomalies)
- allow for growth, evolution of data (reduce the need for restructuring tables completely when new data is introduced)
- make referential integrity constraints easier to enforce
- create clear and easy-to-understand structure that closely models the situation that the data represents
First Normal Form (1NF)
The First Normal Form outlines the following criteria:
- No Repeating Groups:
- The definition of repeating groups has evolved over time, encompassing scenarios such as avoiding repeated columns or columns containing multiple values (maintaining atomicity).
- Essentially, a relation achieves 1NF when none of its domains (types of columns) permits a set as a possible value.
- Every intersection of a row and column should contain a singular, atomic value from the domain.
- Attributes are Defined:
- Attributes must be clearly defined.
- Attributes Depend on a Primary Key:
- Attributes should be dependent on a primary key.
An alternative perspective on 1NF relates it to the structure of a record type:
- All rows must possess the same number of fields.
- Atomicity is emphasized, prohibiting a single column from holding multiple values (e.g., phone numbers or names).
Additional attributes of 1NF include:
- The ordering of columns or rows should not be considered relevant.
- Duplicate rows are not allowed, reinforcing the significance of a primary key.
1NF Example
Are these table in first normal form?
netid | first | course |
---|---|---|
abc123 | alice | ait |
nlp | ||
def456 | bob | nlp |
netid | first | course |
---|---|---|
abc123 | alice | ait,nlp |
def456 | bob | nlp |
netid | first | course1 | course2 |
---|---|---|---|
abc123 | alice | ait | nlp |
def456 | bob | nlp |
None of these are in first normal form
- attributes should depend on primary key
- there should be no repeating groups / values should be atomic
netid | first | course |
---|---|---|
abc123 | alice | ait |
abc123 | alice | nlp |
def456 | bob | nlp |
Ok - this is 1NF, but
- note that primary key now has to be combination of course and netid
- probably better as multiple tables (we’ll see this later)
Second Normal Form (2NF)
A relation attains second normal form if:
- It satisfies the requirements of 1NF.
- There are no partial dependencies, where a partial dependency occurs when a non-prime attribute exhibits a functional dependency on a portion of the composite primary key (or on a candidate key).
It’s worth noting that if a table possesses a primary key composed of a single column, it inherently conforms to 2NF.
The second normal form primarily addresses many insertion and deletion anomalies that persist after achieving 1NF.
Candidate Keys
A candidate key refers to a column or set of columns that:
- Uniquely identifies a row.
- While possessing the capability to uniquely identify a row, it is not obligated to be the primary key.
- A table may have multiple candidate keys, but only one of them serves as the primary key.
Functional Dependency
Functional dependency occurs when the value of one attribute depends on or is determined by the value of another attribute.
Consider two attributes, X and Y:
- Y is functionally dependent on X if, for every row with the same value of X, the corresponding value of Y is the same.
- Put simply, if you know the value of X, you can accurately determine the value of Y, although different values for X may yield the same value for Y.
2NF Example
netid | first | last | course name | course number |
---|---|---|---|---|
abc123 | alice | cho | applied internet tech | 0480-003 |
abc123 | alice | cho | drawing on the web | 0380-001 |
bcd456 | bob | davis | drawing on the web | 0380-001 |
cd78 | carol | diaz | applied internet tech | 0480-007 |
cde901 | carol | evans | drawing on the web | 0380-001 |
Identify some functional dependencies
- netid → first
- netid → last
course number → course name
- name a candidate key
- a composite key:
netid
,course number
- a composite key:
last
,course number
(based on the data in the example only)
- a composite key:
- name a part-key dependency
first
depends onnetid
orlast
(every time we seeabc123
fornetid
, we have the samefirst
name,alice
)course name
depends oncourse number
Student:
netid (pk) | first | last | course num (fk) |
---|---|---|---|
abc123 | alice | cho | 0480-003 |
abc123 | alice | cho | 0380-001 |
bcd456 | bob | davis | 0380-001 |
cd78 | carol | diaz | 0480-007 |
cde901 | carol | evans | 0380-001 |
Course:
course name | course num (pk) |
---|---|
applied internet tech | 0480-003 |
applied internet tech | 0480-007 |
drawing on the web | 0380-001 |
- primary key of student must be composite now
- but we still have partial dependency!
- ok - moar tables!
Student:
netid (pk) | first | last |
---|---|---|
abc123 | alice | cho |
bcd456 | bob | davis |
cd78 | carol | diaz |
cde901 | carol | evans |
student_course:
netid (fk) | course num (fk) |
---|---|
abc123 | 0480-003 |
abc123 | 0380-001 |
bcd456 | 0380-001 |
cd78 | 0380-007 |
cde901 | 0380-001 |
course:
course name | course num (pk) |
---|---|
applied internet tech | 0480-003 |
applied internet tech | 0480-007 |
drawing on the web | 0380-001 |
Note that 2NF isn’t inherently tied to many-to-many relationships:
While our initial attempt wasn’t precisely accurate due to the presence of a many-to-many relationship, it’s essential to recognize that 2NF doesn’t exclusively revolve around many-to-many relationships.
In an alternate scenario where multiple duplicate students didn’t appear in the student table, meaning we weren’t tasked with modeling a student taking numerous classes, our initial attempt would have sufficed. However, it would have been limited to illustrating a single course accommodating many students.
Third Normal Form (3NF)
A relation achieves third normal form if:
- It satisfies the requirements of 2NF.
- It avoids any transitive dependencies, meaning no non-prime attribute is dependent on the primary key through another non-prime attribute.
It’s noteworthy that if a table contains only one non-key attribute and adheres to 2NF, it is automatically considered to be in 3NF.
course name | course num (pk) | dept | dept description |
---|---|---|---|
applied internet tech | 0480-003 | CS | bits and stuff |
probability and statistics | 0235-002 | Math | numbers and stuff |
applied internet tech | 0480-007 | CS | bits and stuff |
drawing on the web | 0380-001 | CS | bits and stuff |
“dept” is not a primary key but “dept description” depends on “course number” through “dept”
OK, how about this? →
course name | course num (pk) | dept_id |
---|---|---|
applied internet tech | 0480-003 | 2 |
probability and statistics | 0235-002 | 1 |
applied internet tech | 0480-007 | 2 |
drawing on the web | 0380-001 | 2 |
dept_id | name | dept description |
---|---|---|
1 | Math | numbers and stuff |
2 | CS | bits and stuff |
2NF and 3NF essentially deal with the relationships between keys and non-keys
“[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key.”
- 1NF: “the key” implies that the key exists
- 2NF: “the whole key” is a reference to no partial dependencies
- 3NF: “nothing but the key” means that it doesn’t depend on non-key attributes
4NF and 5NF
Typically, 3NF is *good enough* for preventing insert, update and delete anomalies
With that said, there are additional normal forms:
- 4NF: 3NF + “a record type should not contain two or more independentmulti-valued facts about an entity”
- See pizza example on wikipedia
- (always seems like the answer is more tables, amirite?)
- 5NF: 4NF + “information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record”