Postgres and SQL Basics

Posted Feb 28, 2024 Updated Mar 3, 2024

By Seoeun Hong 11 min read

The foundational knowledge stems from the NYU’s “CSCI-UA 479 Data Management and Analysis” course by Joe Versoza.

PostgreSQL

Installation package for MacOS: Follow this Tutorial

Postgres Object Hierarchy

(Photos from Brandur.org)

Templates - base template(s) to copy database from
Databases - multiple databases allowed inon postgres instance/”cluster”
Schemas - Groups together a set of Relations (Any other type of named Postgres object like a table, view, index, or function - name spacing). Duplicate relation names are not allowed in the same schema, but are allowed in different schemas. (Brandur.org)
Tables - define the data structure and store the actual data values within databases (Brandur.org)
Views - Abstraction of table / multiple tables: merge tables and perform calculations and present data as if it were a table; typically read only

PSQL

psql is command line client of postgres databases.

Commands

\l: list databases
\dt: list tables
\dv: list views
\d: list tables and views
\dn: list schemas
\d table_name: discribe table
\du: list users

Naming Conventions

Table names enclosed in double quotes are case-sensitive, while unquoted names are normalized to lowercase (which may not be suitable depending on your table names). To ensure consistency and best practices:

Keep table and column names in lowercase.
Avoid double quoting table names.
Separate words with underscores.
Choose descriptive names.
Utilize underscore for foreign keys (e.g., foo_id).
Maintain consistency in pluralization, either always using singular or plural forms.

Types

PostgreSQL offers a wide array of types, and you can even create your custom ones. These types fall into several high-level categories:

Numeric

serial (auto incrementing, pk if no “natural pk” apparent, called artificial / surrogate)
integer - typical choice for integer (signed), 4 bytes
smallint - 2 bytes, signed
bigint - 8 bytes, signed
decimal / numeric - user specified precision numbers
real / double precision - variable-precision numbers

Strings

text - unlimited length
varchar(n) - where n is num of characters (character varying)

Date and Time

timestamptz (timestamp with timezone. Stored as UTC)
timestamp (no timezone)
date
time

Booleans

true
't'
even 'yes', 'y', quoted true, etc…

Other

currency (money)
shapes (circle, polygon)
documents (xml, json / jsonb)
networking (inet for ipv4 and ipv6, cidr for ip ranges)

As we explore these types, the following techniques will be employed:

Use SELECT statements to execute operations and functions without tables.
Utilize the pg_typeof function to identify the type of a value.
Employ the syntax value::type to coerce a value into a specific type.

SQL Syntax

Whitespace, including newlines and tabs, is acceptable within a statement. Therefore, formatting your code with line breaks and indentation for readability is highly encouraged.
Statements should be concluded with a semicolon ;.
Comments commence with --.
SQL keywords can be written in either upper or lower case, although it is common practice to use uppercase for keywords.

In the context of PostgreSQL:

When writing SQL, names of objects are lowercase.
Therefore, in PostgreSQL, it is advisable to avoid uppercase letters in object names to eliminate the need for quoting.
Delimit strings with single quotes '
- use double $ as quotes $$what's this$$
- E'\t' - prefix with E to use \ as escape character

Creating a Database

When a database is created, it clones the template database template1.

template1 can be modified so you can have a customized template (for example, add objects like languages, functions,etc.).
(there’s also template0, which is meant to be kept as an unmodified copy of template1’s initially configuration)

To create a database based off of the template template1

  
CREATE DATABASE some_database_name;
--uses same encoding and collation as template1

  
-- use utf8 as encoding, "copy" template0
-- instead of template1
CREATE DATABASE some_database_name
    ENCODING 'UTF8';
    TEMPLATE template0;

Additional information can be found in the CREATE DATABASE documentation.

Creating a Table

CREATE TABLE <table_name> (

comma seperated list of column names and their type seperated by space
can specify some constraints after type, such as:
- NOT NULL
- UNIQUE
- PRIMARY KEY
default value specified with:
- DEFAULT value_to_default_to

)

  
CREATE TABLE <table_name> (
  <column1_name> <type1> <constraints_if_any>,
  <column2_name> <type2> <constraints_if_any>,
  ...
);

Create a student table with 5 fields: netid first, last, midterm and registered

  
CREATE TABLE student(
	 netid varchar(20) PRIMARY KEY,
	 first varchar(255) NOT NULL,
	 last varchar(255) NOT NULL,
	 midterm numeric,
	 registered timestamptz DEFAULT NOW()
);

Additional information can be found in the CREATE TABLE documentation.

Create (INSERT)

To add a new row to a table, use an INSERT statement

  
-- values in order of field names (registered
-- left out, as it has a default value)
INSERT INTO student 
	VALUES ('fb123', 'foo', 'bar', 90);

  
-- specify columns and matching values (does not have 
-- to follow same order of columns in CREATE TABLE)
INSERT INTO student 
	(first, last, midterm, netid) 
	VALUES ('baz', 'qux', 70, 'bq789');

Additional information can be found in the CREATE INSERT documentation.

Read (SELECT)

Use a SELECT statement to read data:

Begin with SELECT, followed by a comma-separated list of columns or calculated values you wish to display.
Optionally, use the keyword FROM tablename to specify the table for the query.
Optionally, include WHERE cond to define conditions for filtering rows.
Optionally, employ ORDER BY to specify the sorting order.
Optionally, use LIMIT num to restrict the number of returned rows.

Additionally, you can combine query results using UNION and INTERSECT.

The select list may consist of:

Column names
AS some_alias for renaming column names or naming calculations
* means all columns
Operations, functions, or expressions
DISTINCT for retrieving only unique rows
DISTINCT ON (expression) for the first row matching the expression (ensure you order by the expression to avoid unpredictable results).”

  
-- give me col1, col2, and new_col only
-- new_col is a calculated field
-- this is the 'SELECT list'
SELECT col1, col2, col3 * 2 as new_col    

	-- from table, some_Table
	FROM some_table

	-- the value in col1 must be > 1
	-- for the row to be returned
	WHERE col1 > 1

	-- sort by col2 ascending
	ORDER BY col2

	-- only give back, at most, 5 rows
	LIMIT 5;

SELECT retrieves rows from zero or more tables.

The processing order of the parts of a SELECT query is as follows:

FROM: Determines the set of all possible rows to be returned.
WHERE: Filters out rows that do not match the specified criteria.
SELECT list: Determines the actual values of the resulting rows by evaluating expressions, resolving column names, etc.
DISTINCT: Eliminates duplicate rows in the output (or DISTINCT ON).
ORDER BY: Sorts the output rows.
LIMIT: Restricts the output rows to a specific number.

Additional information can be found in the SELECT documentation.

Operators

arithmetic: +, -, *, /
string concatenation: || ('HI' || 'THERE')
logical operators: AND, OR, NOT
check for NULL: IS NULL and IS NOT NULL
pattern matching, case sensitive and insensitive: LIKE, ILIKE

Functions

NOW() (current date / time), ROUND(val), etc.

Additional information can be found in the Functions and Operators documentation.

SELECT examples

  
-- get all students
SELECT * from student;

  
-- get all students, just netid, first, and alias first as fn
SELECT netid, first AS fn FROM student;

  
-- get all students, show netid and midterm grade divided by 100
SELECT netid, midterm / 100 FROM student;

SELECT + DISTINCT

This shows the distinct rows (remove duplicate rows)

  
-- show the distinct first names of students
SELECT DISTINCT first FROM student;

SELECT + WHERE

Optionally, add a WHERE clause to specify conditions (think filtering)

conditions can use operators like =, <> (not equal), >, <
you can also use LIKE and ILIKE with % representing wildcards to match on substrings (ILIKE is case insensitive)
use col_name IS NULL to check for a NULL value
multiple conditions can be put together with AND, OR, NOT
parentheses can be added to specify precedence

SELECT + WHERE examples

  
-- only students with midterm > 80
SELECT * FROM student WHERE midterm > 80;

  
-- only students with between 70 and 70
SELECT * FROM student
	WHERE midterm > 70
	AND midterm < 90;
-- can be done with BETWEEN 71 and 89

  
-- students that have no midterm score
SELECT * FROM student WHERE midterm IS NULL;
-- get the netid and first name of students with taht have a netid
-- that has jv in it or start with Jo, case insensitive
SELECT netid, first FROM student
	WHERE netid LIKE '%jv%'
	OR first ILIKE 'Jo%'

Ordering

Add an ORDER BY clause at the end of SELECT to specify ascending ordering

  
SELECT * FROM student
	WHERE midterm < 60
	ORDER BY registered;

Add DESC to order tin descending order

  
SELECT netid FROM student
	ORDER BY registered DESC;

Seperate multiple column names to order by multiple columns

  
SELECT * FROM student ORDER BY last, first;

LIMIT

LIMIT clause can be added at the end of a SELECT statement (after ORDER BY) to contrain the number of results of the query

Only show 10 results

  
SELECT netid, first, last
	FROM student
	ORDER BY netid
	LIMIT 10;

DISTINCT ON

Show the highest score per student first name

  
SELECT DISTINCT ON (first) netid, first, midterm
	FROM student
	ORDER BY first, midterm DESC;

Update (UPDATE)

Use an UPDATE statement to set the value of a column for a row/rows

Begin with the keyword UPDATE.
Specify the name of the table to be updated.
Follow this with the keyword SET.
Finally, column_name = some value.

  
-- set all students' registered field to 1/1/2024
UPDATE student SET registered = '2024-01-01';

Add WHERE clause to UPDATE(after SET) to specify which rows to chage

  
-- only set midterm score for rows that have netid fb123
UPDATE student
	SET midterm = 80
	WHERE netid = 'fb123';

Additional information can be found in the UPDATE documentation.

Delete / Remove Rows (DELETE)

Use DELETE statement to remove rows from a table

  
DELETE FROM student WHERE midterm > 90;

Add / Remove Column (ALTER)

Use ULTER TABLE to add / remove columns

  
-- add a new column
ALTER TABLE student 
	ADD COLUMN final_exam_score integer;

  
-- add a new column with a default value
ALTER TABLE student
	ADD COLUMN final_exam_score numeric DEFAULT 80;

  
-- remove a column
ALTER TABLE student DROP COLUMN final_exam_score;

ALTER TABLE can also be used to modify columns

  
-- change data type of column
ALTER TABLE student
	ALTER COLUMN netid
	SET DATA TYPE varchar(200);

  
-- rename column
ALTER TABLE student
	RENAME COLUMN midterm TO midterm_score;

Additional information can be found in the DELETE documentation.

Casting

We can use either of the two expression to cast a value from one type to another

CAST (columnname AS newType)
val::newType

  
SELECT netid, 
	CAST (midterm AS smallint) AS smol_mid
	FROM student;

  
-- assume that midterm is integer
-- cast to numeric
SELECT * FROM student ORDER BY ROUND(midterm::numeric, 2);

Formatting (ROUND)

ROUND rounds a numeric value to a specified number of decimal places

  
-- assuming midterm is now a integer
-- cast to numeric
-- so that we can round to two places
SELECT netid, ROUND(CAST(midterm AS numeric),2) FROM student;

Removing tables / Databases (DROP)

Use the DROP command to remove databases or table

DROP TABLE table_name;
DROP DATABASE database_name;

Usage notes:

Ensure you connect to another database if you intend to drop the currently connected database.
Use IF EXISTS to prevent errors if the table being dropped does not exist.
- DROP TABLE IF EXISTS table_name;

Data Engineering, PostgreSQL