Post

Database Design

The foundational knowledge stems from the NYU’s “CSCI-UA 479 Data Management and Analysis” course by Joe Versoza.

Motivation for Database Design

Terminology

  • Entity - some thing that we store data about (the sale of a property, an incident of a dog bite, a movie)
  • Relation - a two-dimensional table consisting of columns and rows, with only one value at the intersection of a column and row
  • Composite key - two or more columns in a table that uniquely identify a row in that table; no part of a composite key may be null

we’ll also take a look at some strange things that occur when working with a poorly designed database, which includes: insertion, deletion, and update anomalies.

Imagine that we store all information about courses and students in the following table

netidfirstlastcourse_numnameroomsemesteryear
abc123alicecho0480-003DMA101fa2019
abc123alicecho0380-001E-Sports317fa2019
gh789georgehenderson0480-002DMA317fa2019
ijk012irenakahn0480-001DMA101fa2019
lmn34larsnunez0480-001DMA101fa2019
lmn34larsnunez0480-001DMA101fa2020
bcd456bobdavis0380-001E-Sports317sp2017
cd78caroldiaz0480-007How 2 Troll202fa2019
cde901carolevans0380-001E-Sports317fa2019
efg456evagu0480-003DMA101fa2019

Which could make a good primary key for this table? Could it be just netid? Maybe include both netid and course_num? Or should we consider a combination like netid, course_num, semester, and year?

But is this primary key option feasible?

We can:

  • Easily display all students enrolled in a specific course ✅
  • Show all courses in fall 2019 with enrolled students ✅

However, displaying all courses in fall 2019? Well, that’s a bit tricky:

  • This table mixes information about courses, students, and enrollment.
  • A course seems to depend on someone being enrolled in it. (If we only insert course data, the netid, which is part of the composite primary key, would end up being null)
  • This kind of issue is known as an insertion anomaly.

Insertion Anomaly

An insertion anomaly happens when we can't add data due to an incomplete primary key (where none of the parts of a composite primary key can be left empty).

In our case:

  • We can’t add a student to the database unless they enroll in a course.
  • Adding a course is only possible if a student enrolls in it.

Usually, insertion anomalies pop up when a table holds data from multiple entities.

Consequently, inserting data about one entity forces the insertion of data from another entity even when that extra data is not needed.

Can a student withdraw from a course or leave the school? Well, it’s a bit tricky. While we can delete a row, the consequences reveal some challenges in database design.

  • If all students withdraw from a course, the entire course data disappears!
  • Deleting just the student data leads to a partially null composite key, known as a deletion anomaly.

Deletion Anomaly

A deletion anomaly arises when removing data about one entity in a row results in part of the composite primary key being null.

The dilemma is that the only remedy is to delete the entire row, potentially causing unintended data loss.

In our scenario:

  • If a student is the last one in a course, removing that student also means removing the entire course
  • If a course is canceled and only one student is enrolled, we lose that student’s data.

Imagine a scenario where a student decides to legally change their name to Robert’); DROP TABLE–

Can you change the student’s name in the database? Absolutely! But, there’s a catch:

  • You’ll have to update the name in multiple places.
  • If not done carefully, only a subset of the duplicate data might change, leading to inconsistent data!

This predicament falls under the category of an update anomaly.

Update Anomaly

An update anomaly arises when a subset of duplicate data is updated, resulting in inconsistent data. In simpler terms, not all instances of the same data are changed simultaneously.

  • this is mainly due to redundant data
  • of course, redundant data also means unnecessary usage of space

Why Take the Time to Design Your Database?

Lack of planning and design will lead to:

  • redundant (and as consequence, potentially inconsistent data)
  • difficulty adding data (insert anomalies)
  • difficulty removing data (deletion anomalies)
  • data prone to inconsistencies (update anomalies)

Good database design also allows for:

  • the enforcing of constraints and referential integrity (given the mechanisms available in a relational database)
  • growth / expandability
  • overall ease of conceptual understanding

Database Design

The process of designing a database typically involves the following stages:

  1. Requirements Analysis:
    • Specify the problem, collaborate with domain experts, explore existing data, etc., to:
      • Determine the necessary data to be stored.
      • Understand how the data is interrelated.
  2. Conceptual Design:
    • Focuses solely on entities, attributes, and relationships.
    • Utilize requirements to formally describe the data, relationships, and constraints.
    • This stage doesn’t delve into the actual implementation details (such as the platform or physical storage).
    • Represents the most basic and abstract model.

After completing the conceptual design:

  1. Translate to Logical Design:
    • Transform the conceptual design into a logical design, incorporating details like data types and unique identifiers.
    • This stage is more intricate than conceptual design, laying the groundwork for the subsequent steps.
  2. Translate to Database Objects in a Specific DBMS/Database Platform:
    • Implement the logical design by translating it into actual database objects within a chosen Database Management System (DBMS) or specific database platform.
    • This step involves creating the database and defining the structures, relationships, and other technical specifications based on the logical design.

Design Rules

Your database design should aim to accomplish the following objectives:

  1. Ability to Solve Problems:
    • Understand the problem: Thoroughly comprehend the business requirements and challenges that the database is intended to address.
    • Fulfill business requirements: Ensure that the design supports reporting and data analysis, meeting the specific needs of the organization.
  2. Storage of Required Data:
    • Structure to satisfy queries: Design the database structure in a way that it efficiently satisfies the required queries.
    • Correct fields and types: Define appropriate fields with the correct data types to accurately represent the information.
  3. Modeling Relationships:
    • Establish relationships: Clearly define and model the relationships between different entities in the database to represent the real-world connections accurately.
  4. Data Integrity:
    • Correct data: Ensure the accuracy of data by implementing measures to prevent errors and inconsistencies.
    • Correct relationships: Validate and maintain the integrity of relationships to avoid anomalies, inconsistency, and redundancy.
  5. Flexibility for Future Changes:
    • Adaptability: Design the database with a level of flexibility that allows it to accommodate future changes in business requirements without requiring a complete overhaul.
  6. Performance and Efficiency:
    • Indexing and optimization: Implement appropriate indexes and optimize queries to enhance the performance of the database.
    • Consideration of efficiency: While performance is crucial, design choices should not compromise other aspects of the database, and decisions should be made considering the overall system requirements.

Terminology

  • data model - specifies the data and relationships to a DBMS; The actual implementation may vary based on the DBMS, so the data model is independent from the DBMS that is being used.
  • entity - some thing that we store data about, like a student, a dog bite incident, a pizza order, etc (essentially, a table)
  • an attribute is the data that describes an entity, like netid, breed of dog, or type of crust (think: columns)
  • entity identifier - an attribute or attributes that uniquely identify an instance of an entity (Ex. primary key)
  • an instance of an entity is the actual attribute data of an entity (like a row in a table)

Attribute Values

An attribute is typically designed to contain a single value for optimal database organization. However, it’s conceivable to encounter situations where an attribute might encompass multiple values. For instance:

  • A student’s ‘contact_info’ attribute could include a phone number, multiple email addresses, and an Instagram username.
  • Example: “555-555-5555, pg12@nyu.edu, sql.is.my.life@nyu.edu, @sqlXallXday” While this represents a multi-value attribute, it’s generally considered suboptimal design for the following reasons:
  1. Slower search functionality: Retrieving information becomes more time-consuming as searches are no longer exact matches but involve substring matches.
  2. Increased complexity in selecting specific information: Navigating and extracting only the desired piece of information becomes more challenging.
  3. Preferable to have separate attributes: For improved database organization and efficiency, it’s advisable to have distinct attributes for each piece of information. It’s worth noting that some database management systems may support multi-value attributes, but careful consideration of the aforementioned drawbacks is essential.

Attribute Domains

Attributes are characterized by their type, representing values within specific domains.

  • For instance, a student’s GPA is a floating-point number, and a pie size can be categorized as small, medium, large, or even record-breaking.
  • Essentially, attribute domains dictate the possible values an attribute can hold.

Entities are interconnected, creating relationships defined by cardinality and participation.

  • Relationships can be classified as one-to-one, one-to-many, or many-to-many, with the possibility of zero participation on either side.
    • For example, a student can exist without enrolled courses, and a course can exist without students.

Weak Entities

A weak entity is one that must be associated with another entity for its existence.

Consider the following examples:

  1. A pizza order necessitates a customer.
  2. An exam cannot exist independently of a course.

In such cases, one entity requires a relationship (the other side cannot be zero). Implementation of a weak entity field referencing the dependent table would involve constraints like a foreign key and a non-null condition.

Composite Entities

A composite entity models relationships between other entities, often acting as a bridge in many-to-many associations.

For instance, the relationship between a student and a course might be represented by an enrollment, serving as a kind of join table.

Documentation, ER diagrams

Entity Relationship Diagrams (ERDs) visually represent entities, their attributes, and relationships.

Chen

  1. Entities in boxes
  2. No attributes or attributes in circles
  3. Relationships:
    • names in diamonds or (to read both ways) written adjacent to relationship line
    • either arrows or numbers to note cardinality

Crow’s Feet / IE

  1. Entities in boxes
  2. Attributes listed in boxes
  3. Relationships:
    • name adjacent to relationship line
    • || one and only one (mandatory)
    • O| zero or one (optional)
    • |< one or many (mandatory)
    • O< zero or many (optional)

Normalization

The term relational database may be misleading as it doesn’t solely refer to a database containing entities and their relationships.

Instead, its origin lies in the concept of a relation, a mathematical abstraction akin to a table. In essence, a relational database is a compilation of these "relations," represented by tables.

However, the term has nuances, as relations possess specific properties, most of which are directly applicable to the tables embodying them. The foundational idea of a relational data model was introduced by Edgar (E.F.) Codd in the early 1970s and has undergone refinements by subsequent contributors.

Relations

A relation can be described as:

  • A structured table comprising columns (referred to as attributes) and rows (referred to as tuples).
  • Each row of data serves as an instance of a relation, akin to an instance of an entity.

However, this might resemble a conventional table or spreadsheet. So, what sets it apart?

The distinctive feature lies in the properties mandated for both rows and columns. These properties are enforced by a Database Management System (DBMS) through constraints, adding a layer of specificity and organization to the data structure.

Columns

Columns in a table must adhere to the following guidelines:

  • Unique names within a table are essential (though, across different tables, a column may share a name; aliasing can resolve conflicts when both tables and their common columns are used in the same query).
  • Values in a column must belong to a specific domain, defining the type of data the column holds.
  • The positional order of columns is generally inconsequential (although practical considerations may require ordering in certain queries, it does not alter the data’s meaning; the columns could be arranged differently and convey the same information).

Rows

When it comes to rows:

  • As we’ve established with entities, the intersection of a column and a row can only contain a singular value.
  • Duplicate rows are prohibited in a relation, emphasizing the importance of primary keys.
  • Similar to columns, the order of rows holds no intrinsic meaning (position has no impact on the data’s interpretation), allowing flexibility in how rows are presented.

Normalization

Normalization is the process of structuring a relational database to:

  1. reduce redundant data / use less space
  2. improve quality and integrity of data
    • make it easier to change data (avoid insert, update and delete anomalies)
    • allow for growth, evolution of data (reduce the need for restructuring tables completely when new data is introduced)
    • make referential integrity constraints easier to enforce
    • create clear and easy-to-understand structure that closely models the situation that the data represents

First Normal Form (1NF)

The First Normal Form outlines the following criteria:

  1. No Repeating Groups:
    • The definition of repeating groups has evolved over time, encompassing scenarios such as avoiding repeated columns or columns containing multiple values (maintaining atomicity).
    • Essentially, a relation achieves 1NF when none of its domains (types of columns) permits a set as a possible value.
    • Every intersection of a row and column should contain a singular, atomic value from the domain.
  2. Attributes are Defined:
    • Attributes must be clearly defined.
  3. Attributes Depend on a Primary Key:
    • Attributes should be dependent on a primary key.

An alternative perspective on 1NF relates it to the structure of a record type:

  • All rows must possess the same number of fields.
  • Atomicity is emphasized, prohibiting a single column from holding multiple values (e.g., phone numbers or names).

Additional attributes of 1NF include:

  • The ordering of columns or rows should not be considered relevant.
  • Duplicate rows are not allowed, reinforcing the significance of a primary key.

1NF Example

Are these table in first normal form?

netidfirstcourse
abc123aliceait
  nlp
def456bobnlp
netidfirstcourse
abc123aliceait,nlp
def456bobnlp
netidfirstcourse1course2
abc123aliceaitnlp
def456bobnlp 

None of these are in first normal form

  • attributes should depend on primary key
  • there should be no repeating groups / values should be atomic
netidfirstcourse
abc123aliceait
abc123alicenlp
def456bobnlp

Ok - this is 1NF, but

  • note that primary key now has to be combination of course and netid
  • probably better as multiple tables (we’ll see this later)

Second Normal Form (2NF)

A relation attains second normal form if:

  1. It satisfies the requirements of 1NF.
  2. There are no partial dependencies, where a partial dependency occurs when a non-prime attribute exhibits a functional dependency on a portion of the composite primary key (or on a candidate key).

It’s worth noting that if a table possesses a primary key composed of a single column, it inherently conforms to 2NF.

The second normal form primarily addresses many insertion and deletion anomalies that persist after achieving 1NF.

Candidate Keys

A candidate key refers to a column or set of columns that:

  • Uniquely identifies a row.
  • While possessing the capability to uniquely identify a row, it is not obligated to be the primary key.
  • A table may have multiple candidate keys, but only one of them serves as the primary key.

Functional Dependency

Functional dependency occurs when the value of one attribute depends on or is determined by the value of another attribute.

Consider two attributes, X and Y:

  • Y is functionally dependent on X if, for every row with the same value of X, the corresponding value of Y is the same.
  • Put simply, if you know the value of X, you can accurately determine the value of Y, although different values for X may yield the same value for Y.

2NF Example

netidfirstlastcourse namecourse number
abc123alicechoapplied internet tech0480-003
abc123alicechodrawing on the web0380-001
bcd456bobdavisdrawing on the web0380-001
cd78caroldiazapplied internet tech0480-007
cde901carolevansdrawing on the web0380-001

Identify some functional dependencies

  1. netid → first
  2. netid → last
  3. course number → course name

  4. name a candidate key
    • a composite key: netid, course number
    • a composite key: last, course number (based on the data in the example only)
  5. name a part-key dependency
    • first depends on netid or last (every time we see abc123 for netid, we have the same first name, alice)
    • course name depends on course number

Student:

netid (pk)firstlastcourse num (fk)
abc123alicecho0480-003
abc123alicecho0380-001
bcd456bobdavis0380-001
cd78caroldiaz0480-007
cde901carolevans0380-001

Course:

course namecourse num (pk)
applied internet tech0480-003
applied internet tech0480-007
drawing on the web0380-001
  • primary key of student must be composite now
  • but we still have partial dependency!
  • ok - moar tables!

Student:

netid (pk)firstlast
abc123alicecho
bcd456bobdavis
cd78caroldiaz
cde901carolevans

student_course:

netid (fk)course num (fk)
abc1230480-003
abc1230380-001
bcd4560380-001
cd780380-007
cde9010380-001

course:

course namecourse num (pk)
applied internet tech0480-003
applied internet tech0480-007
drawing on the web0380-001

Note that 2NF isn’t inherently tied to many-to-many relationships:

While our initial attempt wasn’t precisely accurate due to the presence of a many-to-many relationship, it’s essential to recognize that 2NF doesn’t exclusively revolve around many-to-many relationships.

In an alternate scenario where multiple duplicate students didn’t appear in the student table, meaning we weren’t tasked with modeling a student taking numerous classes, our initial attempt would have sufficed. However, it would have been limited to illustrating a single course accommodating many students.

Third Normal Form (3NF)

A relation achieves third normal form if:

  1. It satisfies the requirements of 2NF.
  2. It avoids any transitive dependencies, meaning no non-prime attribute is dependent on the primary key through another non-prime attribute.

It’s noteworthy that if a table contains only one non-key attribute and adheres to 2NF, it is automatically considered to be in 3NF.

course namecourse num (pk)deptdept description
applied internet tech0480-003CSbits and stuff
probability and statistics0235-002Mathnumbers and stuff
applied internet tech0480-007CSbits and stuff
drawing on the web0380-001CSbits and stuff

“dept” is not a primary key but “dept description” depends on “course number” through “dept”

OK, how about this?

course namecourse num (pk)dept_id
applied internet tech0480-0032
probability and statistics0235-0021
applied internet tech0480-0072
drawing on the web0380-0012
dept_idnamedept description
1Mathnumbers and stuff
2CSbits and stuff

2NF and 3NF essentially deal with the relationships between keys and non-keys

“[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key.”

  • 1NF: “the key” implies that the key exists
  • 2NF: “the whole key” is a reference to no partial dependencies
  • 3NF: “nothing but the key” means that it doesn’t depend on non-key attributes

4NF and 5NF

Typically, 3NF is *good enough* for preventing insert, update and delete anomalies

With that said, there are additional normal forms:

  • 4NF: 3NF + “a record type should not contain two or more independentmulti-valued facts about an entity”
  • 5NF: 4NF + “information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record”

Finding Candidate Keys