In the previous chapter we learned how to use SQL for creating tables and populating them with data. Now it's time for us to learn how to query that data, and manipulate it.
The most fundamental SQL query looks like this:
SELECT <columns>
FROM <table>;
The FROM
clause tells SQL which table you're interested in, and the SELECT
clause tells SQL which columns of that table you want to see. For example, consider a table Person(name, age, num_dogs)
containing the data below:
name|age|num_dogs
----+---+--------
Ace |20 |4
Ada |18 |3
Ben |7 |2
Cho |27 |3
If we executed this SQL query ...
SELECT name, num_dogs
FROM Person;
... then we would get any of the following outputs, which are all equivalent:
name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs name|num_dogs
----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+-------- ----+--------
Ace |4 Ace |4 Ace |4 Ace |4 Ace |4 Ace |4 Ada |3 Ada |3 Ada |3 Ada |3 Ada |3 Ada |3 Ben |2 Ben |2 Ben |2 Ben |2 Ben |2 Ben |2 Cho |3 Cho |3 Cho |3 Cho |3 Cho |3 Cho |3
Ada |3 Ada |3 Ben |2 Ben |2 Cho |3 Cho |3 Ace |4 Ace |4 Ben |2 Ben |2 Cho |3 Cho |3 Ace |4 Ace |4 Ada |3 Ada |3 Cho |3 Cho |3 Ace |4 Ace |4 Ada |3 Ada |3 Ben |2 Ben |2
Ben |2 Cho |3 Ada |3 Cho |3 Ada |3 Ben |2 Ben |2 Cho |3 Ace |4 Cho |3 Ace |4 Ben |2 Ada |3 Cho |3 Ace |4 Cho |3 Ace |4 Ada |3 Ada |3 Ben |2 Ace |4 Ben |2 Ace |4 Ada |3
Cho |3 Ben |2 Cho |3 Ada |3 Ben |2 Ada |3 Cho |3 Ben |2 Cho |3 Ace |4 Ben |2 Ace |4 Cho |3 Ada |3 Cho |3 Ace |4 Ada |3 Ace |4 Ben |2 Ada |3 Ben |2 Ace |4 Ada |3 Ace |4
These tables are all permutations of one another, and you may get a different permutation depending on which version of SQL you're using. Typically in this book I'll only show one possible permutation when I talk about the output of a SQL query, but you shouldn't make any assumptions about which permutation you'll get — that is, unless you use the ORDER BY
clause, which I'll discuss later in this chapter.
Frequently we are interested in only a subset of the data available to us. That is, even though we might have data about many people or things, we often only want to see the data that we have about very specific people or things. This is where the WHERE
clause comes in handy; it lets us specify which specific rows of our table we're interested in. Here's the syntax:
SELECT <columns>
FROM <table>
WHERE <predicate>;
Once again, let's consider our table Person(name, age, num_dogs)
. Suppose we want to see how many dogs each person owns — same as before — but this time we only care about the dog-owners who are adults. Let's walk through this SQL query, one step at a time:
SELECT name, num_dogs
FROM Person
WHERE age >= 18;
If you want to filter on more complicated predicates, you can use the boolean operators NOT
, AND
, and OR
. For instance, if we only cared about dog-owners who are not only adults, but also own more than 3 dogs, then we would write the following query:
SELECT name, num_dogs
FROM Person
WHERE age >= 18
AND num_dogs > 3;
As in Python, this is the order of evaluation for boolean operators:
NOT
AND
OR
That said, it is good practice to avoid ambiguity by adding parentheses even when they are not strictly necessary.
NULL
valuesBear in mind that some values in your database may be NULL
whether you like it or not, so it's good to know how SQL handles them. Pretty much it boils down to the following:
NULL
, you'll just get NULL
. For instance if x
is NULL
, then x > 3
, 1 = x
, and x + 4
all evaluate to NULL
. Even x = NULL
would evaluate to NULL
; if you want to check whether x
is NULL
, then write x IS NULL
or x IS NOT NULL
instead.WHERE NULL
is just like WHERE FALSE
. The row in question does not get included.NULL
short-circuits with boolean operators. That means a boolean expression involving NULL
will evaluate to:TRUE
, if it'd evaluate to TRUE
regardless of whether the unknown value is really TRUE
or FALSE
.FALSE
, if it'd evaluate to FALSE
regardless of whether the unknown value is really TRUE
or FALSE
.NULL
, if it depends on the unknown value.Let's walk through this query as an example:
SELECT name, num_dogs
FROM Person
WHERE age <= 20
OR num_dogs = 3;
When you're working with a very large database, it is useful to be able to summarize your data so that you can better understand the general trends at play. Let's see how.
With SQL you are able to summarize entire columns of data using built-in aggregate functions. The most common ones are SUM
, AVG
, MAX
, MIN
, and COUNT
— but there are more, and you can even make your own using the CREATE AGGREGATE
command. I don't want to get sidetracked, though, so here are the important takeaways:
NULL
values except for COUNT(*)
. (So COUNT(<column>)
returns the number of non-NULL
values in the specified column, whereas COUNT(*)
returns the number of rows in the table overall.)For example consider this variant of our table People(name, age, num_dogs)
from earlier, where we are now unsure how many dogs Ben owns:
name|age|num_dogs
----+---+--------
Ace |20 |4
Ada |18 |3
Ben |7 |NULL
Cho |27 |3
With this table in mind ...
SUM(age)
is 72.0
, and SUM(num_dogs)
is 10.0
.AVG(age)
is 18.0
, and AVG(num_dogs)
is 3.3333333333333333
.MAX(age)
is 27
, and MAX(num_dogs)
is 4
.MIN(age)
is 7
, and MIN(num_dogs)
is 3
.COUNT(age)
is 4
, COUNT(num_dogs)
is 3
, and COUNT(*)
is 4
.So, if we desired the range of ages represented in our database, then we could use the query below and it would produce the result 20. (Technically it would produce a one-by-one table containing the number 20, but SQL treats it the same as the number 20 itself.)
SELECT MAX(age) - MIN(age)
FROM Person;
Or, if we desired the average number of dogs owned by adults, then we could write this:
SELECT AVG(num_dogs)
FROM Person
WHERE age >= 18;
Now you know how to summarize an entire column of your database into a single number. More often than not, though, we want a little finer accuracy than that. This is possible with the GROUP BY
clause, which allows us to split our data into groups and then summarize each group separately. Here's the syntax:
SELECT <columns>
FROM <table>
WHERE <predicate> -- Filter out rows (before grouping).
GROUP BY <columns>
HAVING <predicate>; -- Filter out groups (after grouping).
Notice we also have a brand new HAVING
clause, which is actually very similar to WHERE
. The difference?
WHERE
occurs before grouping. It filters out uninteresting rows.HAVING
occurs after grouping. It filters out uninteresting groups.To explore all these new mechanics let's see another step-by-step example. This time our query will find the average number of dogs owned, for each adult age represented in our database. We will exclude any age for which we only have one datum.
SELECT age, AVG(num_dogs)
FROM Person
WHERE age >= 18
GROUP BY age
HAVING COUNT(*) > 1;
So, to recap, here's how you should go about a query that follows the template above:
FROM
clause.WHERE
clause.GROUP BY
clause.HAVING
clause.SELECT
clause.In passing, note that you can also group data on multiple columns at the same time. Here's an example:
SELECT name, age, COUNT(*)
FROM Person
GROUP BY name, age;
So that's how grouping and aggregation work, but before we move on I must emphasize one last thing regarding illegal queries. We'll start by considering these two examples:
Though it's not immediately obvious, this query actually produces an error:
SELECT age, AVG(num_dogs)
FROM Person;
What's the issue? age
is an entire column of numbers, whereas AVG(num_dogs)
is just a single number. This is problematic because a properly formed table must have the same amount of rows in each column.
This one is bad too, for a very similar reason:
SELECT age, num_dogs
FROM Person
GROUP BY age;
After grouping by age
we obtain a table like this:
name|age|num_dogs
----+---+--------
Ace |20 |4
Ema |20 |2
Ian |20 |3
----+---+--------
Ada |18 |3
Jay |18 |5
----+---+--------
Cho |27 |3
Rex |27 |1
----+---+--------
Mae |33 |8
Then the SELECT
clause's job is to collapse each group into a single row. Each such row must contain two things:
age
corresponding to the group, which is a single number.num_dogs
for the group, which is an entire column of numbers.So once again, we have this issue of trying to make a table with mismatching dimensions.
The takeaway from all this? If you're going to do any grouping / aggregation at all, then you must only SELECT
grouped / aggregated columns. Make sure you understand this rule before you keep reading. It's such a common point of confusion that I'd even recommend writing it down, getting a tattoo of it, or likewise. At the very least, re-read this section later to make sure it still makes sense to you.
This chapter is a work in progress. I'll try to add more details soon.