Recursion with PostgreSQL Part 4 - Finding the Right Path
The mystery has finally been resolved, and I have learned something today.
In my previous article, I managed to get some really crazy performances in tree processing by using PostgreSQL recursive procedural-style function.
While recursive CTE was getting some severe performance degradation - even on a smaller scale: It usually would eat all of my available space.
Thanks to @fatalmind (Markus Winand) comment, I got better clarification on what path and cycle really mean.
Example
So we have this test table:
create table big_tree_test (id text, parent_id text);
create table big_tree_test (id text, parent_id text);
id | parent_id |
---------+-----------+
node_1 | node_3 |
node_1 | node_8 |
node_4 | node_8 |
node_5 | node_7 |
node_6 | node_4 |
node_6 | node_6 |
node_6 | node_9 |
node_7 | node_10 |
node_8 | node_10 |
node_8 | node_5 |
node_8 | node_9 |
node_9 | node_9 |
node_10 | node_1 |
Now, If we run recursive CTE for the node_1
and if we include this statement: cycle id set is_cycle using path
, we will get this:
with recursive _recursive_cte as (
select
id,
parent_id,
1 as level
from
big_tree_test
where
id = 'node_1'
union all
select
rec.parent_id as id,
t.parent_id,
rec.level + 1 as level
from
_recursive_cte rec
inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r
with recursive _recursive_cte as (
select
id,
parent_id,
1 as level
from
big_tree_test
where
id = 'node_1'
union all
select
rec.parent_id as id,
t.parent_id,
rec.level + 1 as level
from
_recursive_cte rec
inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r
id | parent_id | level | is_cycle | path
---------+-----------+-------+----------+-------------------------------------------------
| node_1 | node_3 | 1 | false | {(node_1)}
| node_1 | node_8 | 1 | false | {(node_1)}
| node_8 | node_9 | 2 | false | {(node_1),(node_8)}
| node_8 | node_5 | 2 | false | {(node_1),(node_8)}
| node_8 | node_10 | 2 | false | {(node_1),(node_8)}
| node_9 | node_9 | 3 | false | {(node_1),(node_8),(node_9)}
| node_5 | node_7 | 3 | false | {(node_1),(node_8),(node_5)}
|node_10 | node_1 | 3 | false | {(node_1),(node_8),(node_10)}
| node_9 | node_9 | 4 | true | {(node_1),(node_8),(node_9),(node_9)}
| node_7 | node_10 | 4 | false | {(node_1),(node_8),(node_5),(node_7)}
| node_1 | node_8 | 4 | true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 | node_3 | 4 | true | {(node_1),(node_8),(node_10),(node_1)}
|node_10 | node_1 | 5 | false | {(node_1),(node_8),(node_5),(node_7),(node_10)}
| node_1 | node_8 | 6 | true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
| node_1 | node_3 | 6 | true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
Now, If we take a look at this random record on level 3 for example:
id | parent_id | level | is_cycle | path
---------+-----------+-------+----------+-------------------------------------------------
| node_5 | node_7 | 3 | false | {(node_1),(node_8),(node_5)}
To be able to get to that node connection | node_5 | node_7 |
on level 3, and by starting from the node_1
- we would have to go through following path:
| node_1 | node_8 |
- level 1: we start withnode_1
and next node isnode_8
(parent ofnode_1
).| node_8 | node_5 |
- level 2: next isnode_8
, because that was parent ofnode_1
| node_5 | node_7 |
- level 3:node_5
is the parent ofnode_8
at previous level 2, so now, we're atnode_5
at that is level 3.
Following this path, we have calculated the entire path from node_1
to node connection | node_5 | node_7 |
to be this following path: node_1
, node_8
, node_5
.
And that is the correct path calculation.
A cycle path means simply that if you follow that path - you will end at the already visited node at that path, which means you will run in circles (endless loop).
And indeed, if we check out the cycle paths in this example:
id | parent_id | level | is_cycle | path
---------+-----------+-------+----------+-------------------------------------------------
| node_9 | node_9 | 4 | true | {(node_1),(node_8),(node_9),(node_9)}
| node_1 | node_8 | 4 | true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 | node_3 | 4 | true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 | node_8 | 6 | true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
| node_1 | node_3 | 6 | true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
We will notice some repeating nodes:
{(node_1),(node_8),(node_9),(node_9)}
-node_9
is repeating.{(node_1),(node_8),(node_10),(node_1)}
-node_1
is repeating.{(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
-node_1
is repeating.{(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
-node_1
is repeating.
New Recursive Function
Given all that new understanding of what path and cycle actually means - I created a new version of my recursive function. This time to calculate and return the correct path and is cycle check:
select drop_function('public', 'select_nodes_recursive_function');
create or replace function select_nodes_recursive_function(
_id text[],
_level int = 1
)
returns table (
id text,
parent_id text,
level int,
is_cycle boolean,
path text[]
)
language plpgsql
as
$$
declare
_parents text[];
begin
if not exists(
select 1 from information_schema.tables t
where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
) then
create temp table _result (
id text,
parent_id text,
level int,
is_cycle boolean,
path text[]
) on commit drop;
end if;
with _insert_cte as (
insert into _result
select
t.id,
t.parent_id,
_level,
coalesce(l.is_cycle, false) as is_cycle,
array_append(l.path, t.id) as path
from
big_tree_test t
left outer join lateral (
select p.path, t.id = any(p.path) as is_cycle
from _result p
where p.level = _level - 1 and p.parent_id = t.id
) l on true
where
t.id = any(_id)
and not coalesce(l.is_cycle, false)
returning
_result.parent_id
)
select array_agg(p.parent_id)
into _parents
from _insert_cte p;
if _parents is not null then
perform select_nodes_recursive_function(_parents, _level + 1);
end if;
return query
select
t.id,
t.parent_id,
t.level,
t.is_cycle,
t.path
from
_result t;
end;
$$;
-- helper function overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
id text,
parent_id text,
level int,
is_cycle boolean,
path text[]
)
language sql
as 'select id, parent_id, level, is_cycle, path from select_nodes_recursive_function(array[_id]::text[])';
select drop_function('public', 'select_nodes_recursive_function');
create or replace function select_nodes_recursive_function(
_id text[],
_level int = 1
)
returns table (
id text,
parent_id text,
level int,
is_cycle boolean,
path text[]
)
language plpgsql
as
$$
declare
_parents text[];
begin
if not exists(
select 1 from information_schema.tables t
where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
) then
create temp table _result (
id text,
parent_id text,
level int,
is_cycle boolean,
path text[]
) on commit drop;
end if;
with _insert_cte as (
insert into _result
select
t.id,
t.parent_id,
_level,
coalesce(l.is_cycle, false) as is_cycle,
array_append(l.path, t.id) as path
from
big_tree_test t
left outer join lateral (
select p.path, t.id = any(p.path) as is_cycle
from _result p
where p.level = _level - 1 and p.parent_id = t.id
) l on true
where
t.id = any(_id)
and not coalesce(l.is_cycle, false)
returning
_result.parent_id
)
select array_agg(p.parent_id)
into _parents
from _insert_cte p;
if _parents is not null then
perform select_nodes_recursive_function(_parents, _level + 1);
end if;
return query
select
t.id,
t.parent_id,
t.level,
t.is_cycle,
t.path
from
_result t;
end;
$$;
-- helper function overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
id text,
parent_id text,
level int,
is_cycle boolean,
path text[]
)
language sql
as 'select id, parent_id, level, is_cycle, path from select_nodes_recursive_function(array[_id]::text[])';
This is a crucial piece of logic that allowed for correct path calculation:
select
t.id,
t.parent_id,
_level,
coalesce(l.is_cycle, false) as is_cycle,
array_append(l.path, t.id) as path
from
big_tree_test t
left outer join lateral (
select p.path, t.id = any(p.path) as is_cycle
from _result p
where p.level = _level - 1 and p.parent_id = t.id
) l on true
where
t.id = any(_id)
and not coalesce(l.is_cycle, false)
select
t.id,
t.parent_id,
_level,
coalesce(l.is_cycle, false) as is_cycle,
array_append(l.path, t.id) as path
from
big_tree_test t
left outer join lateral (
select p.path, t.id = any(p.path) as is_cycle
from _result p
where p.level = _level - 1 and p.parent_id = t.id
) l on true
where
t.id = any(_id)
and not coalesce(l.is_cycle, false)
- For every new node connection level, we can select the previous level
p.level = _level - 1
- for the same parent node -p.parent_id = t.id
. That is what thisleft outer join lateral
does. It is alateral
join because we need to include main tablebig_tree_test t
into the subselect join query, and it isleft lateral
because the subquery may not find any results, and we want all frombig_tree_test t
- And now we can add the current ID to this path -
array_append(l.path, t.id) as path
to calculate the right path. - That path is a cycle path if the current ID already exists in the previous path -
t.id = any(p.path) as is_cycle
. - And now, we can filter out cycles if we want with the expression
and not coalesce(l.is_cycle, false)
I created many different tests with many different datasets. And they all return the identical results to a recursive CTE version:
create or replace function select_nodes_recursive_cte(
_id text
)
returns table (
id text,
parent_id text,
level int,
is_cycle boolean,
path record[]
)
language sql
as
$$
with recursive _recursive_cte as (
select
id,
parent_id,
1 as level
from
big_tree_test
where
id = _id
union all
select
rec.parent_id as id,
t.parent_id,
rec.level + 1 as level
from
_recursive_cte rec
inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r
where not is_cycle;
$$;
create or replace function select_nodes_recursive_cte(
_id text
)
returns table (
id text,
parent_id text,
level int,
is_cycle boolean,
path record[]
)
language sql
as
$$
with recursive _recursive_cte as (
select
id,
parent_id,
1 as level
from
big_tree_test
where
id = _id
union all
select
rec.parent_id as id,
t.parent_id,
rec.level + 1 as level
from
_recursive_cte rec
inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r
where not is_cycle;
$$;
Performances
Finally, performances:
round | records 1 | unique nodes 2 | total rows 3 | select_nodes_recursive_function | select_nodes_recursive_cte |
---|---|---|---|---|---|
1 | 20 | 10 | 28 | 00:00:00.055 | 00:00:00.040 |
2 | 20 | 10 | 28 | 00:00:00.076 | 00:00:00.043 |
3 | 20 | 10 | 28 | 00:00:00.056 | 00:00:00.040 |
1 | 49 | 20 | 3674 | 00:00:00.106 | 00:00:00.086 |
2 | 49 | 20 | 3674 | 00:00:00.078 | 00:00:00.082 |
3 | 49 | 20 | 3674 | 00:00:00.071 | 00:00:00.074 |
1 | 101 | 40 | 1583446 | 00:00:49.447 | 00:00:31.218 |
2 | 101 | 40 | 1583446 | 00:00:46.344 | 00:00:31.118 |
3 | 101 | 40 | 1583446 | 00:00:44.252 | 00:00:31.099 |
- 1 - records: This is the number of total records in
big_tree_test
table. That is the number of tree connections. - 2 - unique nodes: This is the number of unique
id
values (nodes) inbig_tree_test
table. That is the number of the unique tree nodes. - 3 - total rows: This is the total number of records returned by those two functions. That is the number of unique paths between nodes.
Note: any larger dataset from these would yield a too big dataset, and it would end up in the same error for both of these functions: ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp226.8386": No space left on device
Now, based on these results, we can conclude two things:
PostgreSQL Recursive CTE built-in functionality is indeed faster and better optimized.
Both functions are returning an insane number of records. That is because the further two nodes are distant from each other on the graph - there are more possible paths between the two nodes. If we take two nodes from opposite sides of the graphs, there will be a vast number of possible paths between those two nodes. Both of these functions return all possible paths between all existing nodes. That is why the number of function results is progressively growing as the number of nodes (records in the test table) grows.
Conslusion
Now, we see that PostgreSQL functionality for the same algorithm is indeed better optimized, as it should be.
However, they both return an insane number of records, that is, the total number of possible paths between all possible nodes. And that is, of course, very hardware intensive.
And it very well may be unnecessary.
I started this series of articles with a clear requirement - Let's write a function that will return all table names that are related to the table from a parameter on any possible level. In that context, path calculation and even cycle aren't really necessary.
And that small distinct requirement could mean milliseconds of processing versus hours of expensive processing.
A simple example, let's modify this recursive function select_nodes_recursive_function
so that:
- It doesn't return all possible paths.
- It returns all possible nodes (connections) with only one path, Any path, just not all paths.
- We will keep our "is cycle" check, just in case we don't end up in an endless loop.
In that case, we would add limit 1
to the lateral subquery that calculates our path:
select
t.id,
t.parent_id,
_level,
coalesce(l.is_cycle, false) as is_cycle,
array_append(l.path, t.id) as path
from
big_tree_test t
left outer join lateral (
select p.path, t.id = any(p.path) as is_cycle
from _result p
where p.level = _level - 1 and p.parent_id = t.id
limit 1
) l on true
where
t.id = any(_id)
and not coalesce(l.is_cycle, false)
select
t.id,
t.parent_id,
_level,
coalesce(l.is_cycle, false) as is_cycle,
array_append(l.path, t.id) as path
from
big_tree_test t
left outer join lateral (
select p.path, t.id = any(p.path) as is_cycle
from _result p
where p.level = _level - 1 and p.parent_id = t.id
limit 1
) l on true
where
t.id = any(_id)
and not coalesce(l.is_cycle, false)
And that is enough to return only one example path, not all. Let's see performances now and a bit bigger dataset:
round | records | unique nodes | total rows | select_nodes_recursive_function | select_nodes_recursive_cte |
---|---|---|---|---|---|
1 | 305 | 100 | 9130 | 00:00:01.815 | ERROR |
2 | 305 | 100 | 9130 | 00:00:01.815 | ERROR |
3 | 305 | 100 | 9130 | 00:00:01.815 | ERROR |
So this small change with limit 1
gave us:
- Unique nodes per recursion level with one example path.
- Many times better performances than recursive CTE, which tried to yield a too-huge dataset (probably hundreds of millions) and ended up with ERROR (ran out of space).
However, even that is not what is required in these functional requirements: Let's write a function that will return all table names that are related to the table from a parameter on any possible level.
For this, we simply need a unique node on any level. We basically don't even care about the level at all. So we may return a unique node on the first level we see that node.
And for that, the algorithm from the previous article is enough. I will paste it here now:
create or replace function select_nodes_recursive_function(
_id text[],
_level int
)
returns table (
id text,
parent_id text,
level int
)
language plpgsql
as
$$
declare
_row record;
_parents text[];
begin
if not exists(
select 1 from information_schema.tables t
where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
) then
create temp table _result (
id text,
parent_id text,
level int
) on commit drop;
end if;
_id = array(select unnest(_id) except select distinct t.id from _result t);
if _id = '{}' then
return;
end if;
with _insert_cte as (
insert into _result
select
t.id,
t.parent_id,
_level
from
big_tree_test t
where
t.id = any(_id)
returning
_result.parent_id
)
select array_agg(distinct p.parent_id)
into _parents
from _insert_cte p;
perform select_nodes_recursive_function(_parents, _level + 1);
return query
select
t.id,
t.parent_id,
t.level
from
_result t;
end;
$$;
-- helper overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
id text,
parent_id text,
level int
)
language sql
as 'select id, parent_id, level from select_nodes_recursive_function(array[_id]::text[], 1)';
create or replace function select_nodes_recursive_function(
_id text[],
_level int
)
returns table (
id text,
parent_id text,
level int
)
language plpgsql
as
$$
declare
_row record;
_parents text[];
begin
if not exists(
select 1 from information_schema.tables t
where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
) then
create temp table _result (
id text,
parent_id text,
level int
) on commit drop;
end if;
_id = array(select unnest(_id) except select distinct t.id from _result t);
if _id = '{}' then
return;
end if;
with _insert_cte as (
insert into _result
select
t.id,
t.parent_id,
_level
from
big_tree_test t
where
t.id = any(_id)
returning
_result.parent_id
)
select array_agg(distinct p.parent_id)
into _parents
from _insert_cte p;
perform select_nodes_recursive_function(_parents, _level + 1);
return query
select
t.id,
t.parent_id,
t.level
from
_result t;
end;
$$;
-- helper overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
id text,
parent_id text,
level int
)
language sql
as 'select id, parent_id, level from select_nodes_recursive_function(array[_id]::text[], 1)';
To repeat: this function returns unique nodes at any level of recursion and exits when all nodes have been visited. And that's all that we need.
Now, let's see some performances:
round | records | unique nodes | total rows | select_nodes_recursive_function | select_nodes_recursive_cte |
---|---|---|---|---|---|
1 | 305 | 100 | 9130 | 00:00:00.055 | ERROR |
2 | 305 | 100 | 9130 | 00:00:00.048 | ERROR |
3 | 305 | 100 | 9130 | 00:00:00.049 | ERROR |
Of course, for this dataset, recursive CTE with cycle/path check breaks down.
But the thing is, we don't really need that. We can fulfill what is required by using simple recursion with plpgsql
, and we will be able to process the data thousands of times faster.
So, in the final conclusion, I still believe that the classical recursion approach using the plpgsql
procedural approach can yield much better results. It is flexible enough to help us avoid unnecessary processing and to return exactly what we need from our tree structures. Calculation of all possible paths can lead to serious performance degradations, and it would be wise to avoid that if we can. This is one possibility how we can do that.
vb-software linkedin
You will receive notifications about new posts on your LinkedIn feed.