PostgreSQL Arrays 🐘

Many-To-Many

Traditionally in relational databases, many-to-many data design is implemented using three tables.

A classic example is users and roles: one user can have many roles, and one role can be assigned to many users. Most developers are familiar with something like this (or a variation of it):

create table users (
    user_id int not null primary key generated always as identity,
    name text not null unique
);

create table roles (
    role_id int not null primary key generated always as identity,
    name text not null unique
);

create table users_roles (
    user_id int not null references users,
    role_id int not null references roles,
    primary key (user_id, role_id)
);

PostgreSQL Way

PostgreSQL stands out among relational databases because it offers a vast range of data types that other databases lack.

One of those types is the array type. In the example above, we are completely certain that our system will never have more than a few roles. In that case, we can use the array type safely.

Not only does PostgreSQL have array types, it also implements a rich set of operators over arrays. On top of that, you can ensure data integrity with CHECK constraints.

So, the classic example from above can be vastly simplified into one single table:

create table users (
    user_id int not null primary key generated always as identity,
    name text not null unique,
    roles text[] not null,
    constraint roles_check check (roles <@ array['admin', 'user', 'guest'])
);

The constraint roles_check check (roles <@ array['admin', 'user', 'guest']) uses the <@ array operator to ensure that every insert and update can only contain array values that exist in the array on the right side, ['admin', 'user', 'guest'].

From the official documentation:

anyarray <@ anyarray → boolean

Is the first array contained by the second?

ARRAY[2,2,7] <@ ARRAY[1,7,4,2,6] → t

Meaning, each element in the first array has to exist in the second.
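These operators are also handy for querying. As a sketch against the users table defined above, @> ("contains") and && ("overlaps") can filter rows by role membership:

```sql
-- Users whose roles array contains 'admin' (@> means "contains"):
select user_id, name
from users
where roles @> array['admin'];

-- Users who have at least one of the listed roles (&& means "overlaps"):
select user_id, name
from users
where roles && array['admin', 'guest'];
```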

If we try to do this:

insert into users (name, roles) 
values ('user1', array['admin', 'user']);

It will insert the new record successfully. The same goes for the update.

However, if we try this:

insert into users (name, roles) 
values ('user2', array['super', 'admin']);

It will end up with the following error: ERROR: new row for relation "users" violates check constraint "roles_check". The same goes for the update.

Our data integrity is now protected.
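A single value can also be tested against an array column with ANY. For example (a sketch, reusing the table above):

```sql
-- Does user 1 have the 'admin' role?
select name, 'admin' = any(roles) as is_admin
from users
where user_id = 1;
```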

Changing The Rules

Requirements change, and naturally, we would like to change our list of valid roles. We can do that with a simple script:

begin;

alter table users drop constraint roles_check;
alter table users add constraint roles_check check (roles <@ array['super', 'admin', 'user']); 

end;

This script will recreate the roles_check constraint to allow only these roles: ['super', 'admin', 'user'].

However, there are two problems with this approach:

  1. We already have records that have the guest role. That role is not allowed by the new definition. The script will fail with the following error message: ERROR: check constraint "roles_check" of relation "users" is violated by some row.
  2. If the users table is big, this script may take a long time to run: PostgreSQL needs to check every record against the new constraint, and while doing so, all reads and writes on the table are blocked.

To address these issues, the new constraint can be created using the not valid directive:

begin;

alter table users drop constraint roles_check;
alter table users add constraint roles_check check (roles <@ array['super', 'admin', 'user']) not valid; 

end;

From the documentation:

Normally, this form will cause a scan of the table to verify that all existing rows in the table satisfy the new constraint. But if the NOT VALID option is used, this potentially-lengthy scan is skipped.

In practice this means the following:

  • Records with invalid role guest will remain intact. That small deviation from the data integrity can be safely ignored (if business rules permit).
  • All new inserts and updates will enforce new data integrity rules - roles must exist in a new array: ['super', 'admin', 'user'].
  • The script that recreates the check constraint will run fast even on big tables, and it will hold its lock only briefly, without long blocking of reads or writes.
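If the existing rows are later cleaned up, the skipped scan can be performed separately with VALIDATE CONSTRAINT, which blocks concurrent activity far less than adding a validated constraint does. A sketch:

```sql
-- Remove the no-longer-allowed 'guest' role from existing rows:
update users
set roles = array_remove(roles, 'guest')
where roles @> array['guest'];

-- Scan the table and mark the constraint as validated:
alter table users validate constraint roles_check;
```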

Advanced Scenarios

The check constraint is neat, but it gets pretty messy if we have some complicated business logic.

In that case, we can use the PostgreSQL functions for validation. We can write a function that receives an entire record for validation and returns true or false.

create function roles_check_func(
    _record record
)
returns boolean
language plpgsql
as 
$$
begin
    -- Some complicated logic here, branching, loops, and all, this is just an example.
    return _record.roles <@ array['admin', 'user', 'guest'];
end;
$$;

Note that only plpgsql functions can use record types.

Now, we can use this function in the check constraint:

create table users (
    user_id int not null primary key generated always as identity,
    name text not null unique,
    roles text[] not null,
    constraint roles_check check (roles_check_func(users))
);

Once we have that function and check constraint in place, we can safely replace the function definition without recreating the constraint:

create or replace function roles_check_func(
    _record record
)
returns boolean
language plpgsql
as 
$$
begin
    -- Some complicated logic here, branching, loops, and all, this is just an example.
    return _record.roles <@ array['xxx', 'yyy', 'zzz'];
end;
$$;

This will act like using the not valid option when recreating the constraint, meaning:

  • No long table scans.
  • No table locking.

At the same time, it offers a powerful programming model for implementing complex business logic over our data.

Other Usages

Arrays are not only useful in scenarios where a field can hold just a few immutable values. They are also extremely useful in data projections with select queries, for example as aggregates.

Consider the normalized example from the start of this article (which is very common):

create table users (
    user_id int not null primary key generated always as identity,
    name text not null unique
);

create table roles (
    role_id int not null primary key generated always as identity,
    name text not null unique
);

create table users_roles (
    user_id int not null references users,
    role_id int not null references roles,
    primary key (user_id, role_id)
);

Now, we would like to retrieve a user with id 1, and all roles for that user. We can do something like this:

select 
    u.user_id, u.name, r.name as role
from 
    users u 
    join users_roles ur using(user_id)
    join roles r using(role_id)
where 
    u.user_id = 1

The result would be:

 user_id | name | role
---------+------+--------
       1 | Bob  | admin
       1 | Bob  | reader
       1 | Bob  | writer
       1 | Bob  | user
       1 | Bob  | guest

Obviously, we now have a redundancy in our projection and the client has to compensate for that fact.

Either that, or issue two separate queries, which is also a suboptimal solution (more code and higher latency).

In PostgreSQL, we can group by user_id and name and then aggregate the roles into an array projection:

select 
    u.user_id, u.name, array_agg(r.name) as roles
from 
    users u 
    join users_roles ur using(user_id)
    join roles r using(role_id)
where 
    u.user_id = 1
group by
    u.user_id, u.name

See more on array aggregates in the PostgreSQL aggregates documentation reference.

The result of this query is essentially the same as the previous one, but without redundancy: the roles are aggregated into an array.

 user_id | name | roles
---------+------+------------------------------------------------
       1 | Bob  | ["admin", "reader", "writer", "user", "guest"]

This aggregated array is usually represented as a standard memory structure on the client such as an array or a list. In C# that would be a string array and in Python that would be a string list.

This means clients will load entire arrays into memory, so we have to keep that fact in mind and avoid returning too much data in array types.
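If the number of aggregated elements can grow large, one option is to cap the projection with an array slice. A sketch on the aggregate query above (the limit of 10 is arbitrary):

```sql
select
    u.user_id, u.name,
    (array_agg(r.name order by r.name))[1:10] as roles
from
    users u
    join users_roles ur using(user_id)
    join roles r using(role_id)
where
    u.user_id = 1
group by
    u.user_id, u.name;
```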

Conclusion

Arrays are an extremely powerful data type in PostgreSQL, but they do come with a price.

Operations on array values mean that PostgreSQL will try to fit them in memory (on select, for example). As such, arrays are not suitable for bigger structures.

But for smaller datasets, such as in this example - they can be a powerful ally that can simplify the data model and programming model as well.

What's more, data contained in arrays is not normalized. If we ever need to update that data, we will still suffer from the update anomaly. For example, changing the role name from admin to administrator would be a complicated and costly operation, at least compared to a fully normalized structure like the example at the beginning of this article.
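For illustration, such a rename on the array-based design would look roughly like this; note that it rewrites every affected row, and the roles_check constraint would first have to be recreated to allow the new name:

```sql
-- Rename 'admin' to 'administrator' inside every roles array:
update users
set roles = array_replace(roles, 'admin', 'administrator')
where roles @> array['admin'];
```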

Still, there are cases where denormalizing data with arrays can greatly simplify the design and the programming model, and boost performance. The rule of thumb: the dataset should be fairly small, like the few roles in this example, and mostly immutable. The roles example is precisely that, but there are others in real life. PostgreSQL offers the tools and data types to do exactly that: simplify design, simplify the programming model, and boost performance.
