Recursion with PostgreSQL Part 4 - Finding the Right Path

floki

Sep 18, 2023

The mystery has finally been resolved, and I have learned something today.

In my previous article, I managed to get some really crazy performances in tree processing by using PostgreSQL recursive procedural-style function.

While recursive CTE was getting some severe performance degradation - even on a smaller scale: It usually would eat all of my available space.

Thanks to @fatalmind (Markus Winand) comment, I got better clarification on what path and cycle really mean.

Example

So we have this test table:

create table big_tree_test (id text, parent_id text);

create table big_tree_test (id text, parent_id text);

   id    | parent_id | 
---------+-----------+
 node_1  | node_3    |
 node_1  | node_8    |
 node_4  | node_8    |
 node_5  | node_7    |
 node_6  | node_4    |
 node_6  | node_6    |
 node_6  | node_9    |
 node_7  | node_10   |
 node_8  | node_10   |
 node_8  | node_5    |
 node_8  | node_9    |
 node_9  | node_9    |
 node_10 | node_1    |

Now, If we run recursive CTE for the node_1 and if we include this statement: cycle id set is_cycle using path, we will get this:

with recursive _recursive_cte as (
    select
        id,
        parent_id,
        1 as level
    from
        big_tree_test
    where 
        id = 'node_1'
    
    union all

    select
        rec.parent_id as id,
        t.parent_id,
        rec.level + 1 as level
    from
        _recursive_cte rec
        inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r

with recursive _recursive_cte as (
    select
        id,
        parent_id,
        1 as level
    from
        big_tree_test
    where 
        id = 'node_1'
    
    union all

    select
        rec.parent_id as id,
        t.parent_id,
        rec.level + 1 as level
    from
        _recursive_cte rec
        inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r

   id    | parent_id | level | is_cycle |                      path
---------+-----------+-------+----------+-------------------------------------------------
| node_1 |    node_3 |     1 |    false | {(node_1)}
| node_1 |    node_8 |     1 |    false | {(node_1)}
| node_8 |    node_9 |     2 |    false | {(node_1),(node_8)}
| node_8 |    node_5 |     2 |    false | {(node_1),(node_8)}
| node_8 |   node_10 |     2 |    false | {(node_1),(node_8)}
| node_9 |    node_9 |     3 |    false | {(node_1),(node_8),(node_9)}
| node_5 |    node_7 |     3 |    false | {(node_1),(node_8),(node_5)}
|node_10 |    node_1 |     3 |    false | {(node_1),(node_8),(node_10)}
| node_9 |    node_9 |     4 |     true | {(node_1),(node_8),(node_9),(node_9)}
| node_7 |   node_10 |     4 |    false | {(node_1),(node_8),(node_5),(node_7)}
| node_1 |    node_8 |     4 |     true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 |    node_3 |     4 |     true | {(node_1),(node_8),(node_10),(node_1)}
|node_10 |    node_1 |     5 |    false | {(node_1),(node_8),(node_5),(node_7),(node_10)}
| node_1 |    node_8 |     6 |     true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
| node_1 |    node_3 |     6 |     true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}

Now, If we take a look at this random record on level 3 for example:

   id    | parent_id | level | is_cycle |                      path
---------+-----------+-------+----------+-------------------------------------------------
| node_5 |    node_7 |     3 |    false | {(node_1),(node_8),(node_5)}

To be able to get to that node connection | node_5 | node_7 | on level 3, and by starting from the node_1 - we would have to go through following path:

| node_1 | node_8 | - level 1: we start with node_1 and next node is node_8 (parent of node_1).
| node_8 | node_5 | - level 2: next is node_8, because that was parent of node_1
| node_5 | node_7 | - level 3: node_5 is the parent of node_8 at previous level 2, so now, we're at node_5 at that is level 3.

Following this path, we have calculated the entire path from node_1 to node connection | node_5 | node_7 | to be this following path: node_1, node_8, node_5.

And that is the correct path calculation.

A cycle path means simply that if you follow that path - you will end at the already visited node at that path, which means you will run in circles (endless loop).

And indeed, if we check out the cycle paths in this example:

   id    | parent_id | level | is_cycle |                      path
---------+-----------+-------+----------+-------------------------------------------------
| node_9 |    node_9 |     4 |     true | {(node_1),(node_8),(node_9),(node_9)}
| node_1 |    node_8 |     4 |     true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 |    node_3 |     4 |     true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 |    node_8 |     6 |     true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
| node_1 |    node_3 |     6 |     true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}

We will notice some repeating nodes:

{(node_1),(node_8),(node_9),(node_9)} - node_9 is repeating.
{(node_1),(node_8),(node_10),(node_1)} - node_1 is repeating.
{(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)} - node_1 is repeating.
{(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)} - node_1 is repeating.

New Recursive Function

Given all that new understanding of what path and cycle actually means - I created a new version of my recursive function. This time to calculate and return the correct path and is cycle check:

select drop_function('public', 'select_nodes_recursive_function');

create or replace function select_nodes_recursive_function(
    _id text[],
    _level int = 1
)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path text[]
)
language plpgsql
as 
$$
declare 
    _parents text[];
begin    
    if not exists(
        select 1 from information_schema.tables t
        where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
    ) then
        create temp table _result (
            id text,
            parent_id text,
            level int,
            is_cycle boolean,
            path text[]
        ) on commit drop;
    end if;

    with _insert_cte as (
        insert into _result
        select
            t.id,
            t.parent_id,
            _level,
            coalesce(l.is_cycle, false) as is_cycle,
            array_append(l.path, t.id) as path
        from
            big_tree_test t
            left outer join lateral (
                select p.path, t.id = any(p.path) as is_cycle
                from _result p
                where p.level = _level - 1 and p.parent_id = t.id
            ) l on true
        where 
            t.id = any(_id)
            and not coalesce(l.is_cycle, false)
        returning 
            _result.parent_id
    ) 
    select array_agg(p.parent_id)
    into _parents
    from _insert_cte p;
            
    if _parents is not null then
        perform select_nodes_recursive_function(_parents, _level + 1);
    end if;
    
    return query 
    select 
        t.id,
        t.parent_id,
        t.level,
        t.is_cycle,
        t.path
    from 
        _result t;
end;
$$;

-- helper function overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path text[]
)
language sql
as 'select id, parent_id, level, is_cycle, path from select_nodes_recursive_function(array[_id]::text[])';

select drop_function('public', 'select_nodes_recursive_function');

create or replace function select_nodes_recursive_function(
    _id text[],
    _level int = 1
)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path text[]
)
language plpgsql
as 
$$
declare 
    _parents text[];
begin    
    if not exists(
        select 1 from information_schema.tables t
        where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
    ) then
        create temp table _result (
            id text,
            parent_id text,
            level int,
            is_cycle boolean,
            path text[]
        ) on commit drop;
    end if;

    with _insert_cte as (
        insert into _result
        select
            t.id,
            t.parent_id,
            _level,
            coalesce(l.is_cycle, false) as is_cycle,
            array_append(l.path, t.id) as path
        from
            big_tree_test t
            left outer join lateral (
                select p.path, t.id = any(p.path) as is_cycle
                from _result p
                where p.level = _level - 1 and p.parent_id = t.id
            ) l on true
        where 
            t.id = any(_id)
            and not coalesce(l.is_cycle, false)
        returning 
            _result.parent_id
    ) 
    select array_agg(p.parent_id)
    into _parents
    from _insert_cte p;
            
    if _parents is not null then
        perform select_nodes_recursive_function(_parents, _level + 1);
    end if;
    
    return query 
    select 
        t.id,
        t.parent_id,
        t.level,
        t.is_cycle,
        t.path
    from 
        _result t;
end;
$$;

-- helper function overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path text[]
)
language sql
as 'select id, parent_id, level, is_cycle, path from select_nodes_recursive_function(array[_id]::text[])';

This is a crucial piece of logic that allowed for correct path calculation:

select
    t.id,
    t.parent_id,
    _level,
    coalesce(l.is_cycle, false) as is_cycle,
    array_append(l.path, t.id) as path
from
    big_tree_test t
    left outer join lateral (
        select p.path, t.id = any(p.path) as is_cycle
        from _result p
        where p.level = _level - 1 and p.parent_id = t.id
    ) l on true
where 
    t.id = any(_id)
    and not coalesce(l.is_cycle, false)

select
    t.id,
    t.parent_id,
    _level,
    coalesce(l.is_cycle, false) as is_cycle,
    array_append(l.path, t.id) as path
from
    big_tree_test t
    left outer join lateral (
        select p.path, t.id = any(p.path) as is_cycle
        from _result p
        where p.level = _level - 1 and p.parent_id = t.id
    ) l on true
where 
    t.id = any(_id)
    and not coalesce(l.is_cycle, false)

For every new node connection level, we can select the previous level p.level = _level - 1 - for the same parent node - p.parent_id = t.id. That is what this left outer join lateral does. It is a lateral join because we need to include main table big_tree_test t into the subselect join query, and it is left lateral because the subquery may not find any results, and we want all from big_tree_test t
And now we can add the current ID to this path - array_append(l.path, t.id) as path to calculate the right path.
That path is a cycle path if the current ID already exists in the previous path - t.id = any(p.path) as is_cycle.
And now, we can filter out cycles if we want with the expression and not coalesce(l.is_cycle, false)

I created many different tests with many different datasets. And they all return the identical results to a recursive CTE version:

create or replace function select_nodes_recursive_cte(
    _id text
)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path record[]
)
language sql
as 
$$
with recursive _recursive_cte as (
    select
        id,
        parent_id,
        1 as level
    from
        big_tree_test
    where 
        id = _id
    
    union all

    select
        rec.parent_id as id,
        t.parent_id,
        rec.level + 1 as level
    from
        _recursive_cte rec
        inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r
where not is_cycle;
$$;

create or replace function select_nodes_recursive_cte(
    _id text
)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path record[]
)
language sql
as 
$$
with recursive _recursive_cte as (
    select
        id,
        parent_id,
        1 as level
    from
        big_tree_test
    where 
        id = _id
    
    union all

    select
        rec.parent_id as id,
        t.parent_id,
        rec.level + 1 as level
    from
        _recursive_cte rec
        inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r
where not is_cycle;
$$;

Performances

Finally, performances:

round	records ¹	unique nodes ²	total rows ³	`select_nodes_recursive_function`	`select_nodes_recursive_cte`
1	20	10	28	00:00:00.055	00:00:00.040
2	20	10	28	00:00:00.076	00:00:00.043
3	20	10	28	00:00:00.056	00:00:00.040
1	49	20	3674	00:00:00.106	00:00:00.086
2	49	20	3674	00:00:00.078	00:00:00.082
3	49	20	3674	00:00:00.071	00:00:00.074
1	101	40	1583446	00:00:49.447	00:00:31.218
2	101	40	1583446	00:00:46.344	00:00:31.118
3	101	40	1583446	00:00:44.252	00:00:31.099

1 - records: This is the number of total records in big_tree_test table. That is the number of tree connections.
2 - unique nodes: This is the number of unique id values (nodes) in big_tree_test table. That is the number of the unique tree nodes.
3 - total rows: This is the total number of records returned by those two functions. That is the number of unique paths between nodes.

Note: any larger dataset from these would yield a too big dataset, and it would end up in the same error for both of these functions: ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp226.8386": No space left on device

Now, based on these results, we can conclude two things:

PostgreSQL Recursive CTE built-in functionality is indeed faster and better optimized.
Both functions are returning an insane number of records. That is because the further two nodes are distant from each other on the graph - there are more possible paths between the two nodes. If we take two nodes from opposite sides of the graphs, there will be a vast number of possible paths between those two nodes. Both of these functions return all possible paths between all existing nodes. That is why the number of function results is progressively growing as the number of nodes (records in the test table) grows.

Conslusion

Now, we see that PostgreSQL functionality for the same algorithm is indeed better optimized, as it should be.

However, they both return an insane number of records, that is, the total number of possible paths between all possible nodes. And that is, of course, very hardware intensive.

And it very well may be unnecessary.

I started this series of articles with a clear requirement - Let's write a function that will return all table names that are related to the table from a parameter on any possible level. In that context, path calculation and even cycle aren't really necessary.

And that small distinct requirement could mean milliseconds of processing versus hours of expensive processing.

A simple example, let's modify this recursive function select_nodes_recursive_function so that:

It doesn't return all possible paths.
It returns all possible nodes (connections) with only one path, Any path, just not all paths.
We will keep our "is cycle" check, just in case we don't end up in an endless loop.

In that case, we would add limit 1 to the lateral subquery that calculates our path:

select
    t.id,
    t.parent_id,
    _level,
    coalesce(l.is_cycle, false) as is_cycle,
    array_append(l.path, t.id) as path
from
    big_tree_test t
    left outer join lateral (
        select p.path, t.id = any(p.path) as is_cycle
        from _result p
        where p.level = _level - 1 and p.parent_id = t.id
        limit 1
    ) l on true
where 
    t.id = any(_id)
    and not coalesce(l.is_cycle, false)

select
    t.id,
    t.parent_id,
    _level,
    coalesce(l.is_cycle, false) as is_cycle,
    array_append(l.path, t.id) as path
from
    big_tree_test t
    left outer join lateral (
        select p.path, t.id = any(p.path) as is_cycle
        from _result p
        where p.level = _level - 1 and p.parent_id = t.id
        limit 1
    ) l on true
where 
    t.id = any(_id)
    and not coalesce(l.is_cycle, false)

And that is enough to return only one example path, not all. Let's see performances now and a bit bigger dataset:

round	records	unique nodes	total rows	`select_nodes_recursive_function`	`select_nodes_recursive_cte`
1	305	100	9130	00:00:01.815	`ERROR`
2	305	100	9130	00:00:01.815	`ERROR`
3	305	100	9130	00:00:01.815	`ERROR`

So this small change with limit 1 gave us:

Unique nodes per recursion level with one example path.
Many times better performances than recursive CTE, which tried to yield a too-huge dataset (probably hundreds of millions) and ended up with ERROR (ran out of space).

However, even that is not what is required in these functional requirements: Let's write a function that will return all table names that are related to the table from a parameter on any possible level.

For this, we simply need a unique node on any level. We basically don't even care about the level at all. So we may return a unique node on the first level we see that node.

And for that, the algorithm from the previous article is enough. I will paste it here now:

create or replace function select_nodes_recursive_function(
    _id text[],
    _level int
)
returns table (
    id text,
    parent_id text,
    level int
)
language plpgsql
as 
$$
declare 
    _row record;
    _parents text[];
begin    
    if not exists(
        select 1 from information_schema.tables t
        where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
    ) then
        create temp table _result (
            id text,
            parent_id text,
            level int
        ) on commit drop;
    end if;

    _id = array(select unnest(_id) except select distinct t.id from _result t);

    if _id = '{}' then
        return;
    end if;
    
    with _insert_cte as (
        insert into _result
        select
            t.id,
            t.parent_id,
            _level
        from
            big_tree_test t
        where 
            t.id = any(_id)
        returning 
            _result.parent_id
    ) 
    select array_agg(distinct p.parent_id)
    into _parents
    from _insert_cte p; 

    perform select_nodes_recursive_function(_parents, _level + 1);

    return query 
    select 
        t.id,
        t.parent_id,
        t.level
    from 
        _result t;
end;
$$;

-- helper overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
    id text,
    parent_id text,
    level int
)
language sql
as 'select id, parent_id, level from select_nodes_recursive_function(array[_id]::text[], 1)';

create or replace function select_nodes_recursive_function(
    _id text[],
    _level int
)
returns table (
    id text,
    parent_id text,
    level int
)
language plpgsql
as 
$$
declare 
    _row record;
    _parents text[];
begin    
    if not exists(
        select 1 from information_schema.tables t
        where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
    ) then
        create temp table _result (
            id text,
            parent_id text,
            level int
        ) on commit drop;
    end if;

    _id = array(select unnest(_id) except select distinct t.id from _result t);

    if _id = '{}' then
        return;
    end if;
    
    with _insert_cte as (
        insert into _result
        select
            t.id,
            t.parent_id,
            _level
        from
            big_tree_test t
        where 
            t.id = any(_id)
        returning 
            _result.parent_id
    ) 
    select array_agg(distinct p.parent_id)
    into _parents
    from _insert_cte p; 

    perform select_nodes_recursive_function(_parents, _level + 1);

    return query 
    select 
        t.id,
        t.parent_id,
        t.level
    from 
        _result t;
end;
$$;

-- helper overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
    id text,
    parent_id text,
    level int
)
language sql
as 'select id, parent_id, level from select_nodes_recursive_function(array[_id]::text[], 1)';

To repeat: this function returns unique nodes at any level of recursion and exits when all nodes have been visited. And that's all that we need.

Now, let's see some performances:

round	records	unique nodes	total rows	`select_nodes_recursive_function`	`select_nodes_recursive_cte`
1	305	100	9130	00:00:00.055	`ERROR`
2	305	100	9130	00:00:00.048	`ERROR`
3	305	100	9130	00:00:00.049	`ERROR`

Of course, for this dataset, recursive CTE with cycle/path check breaks down.

But the thing is, we don't really need that. We can fulfill what is required by using simple recursion with plpgsql, and we will be able to process the data thousands of times faster.

So, in the final conclusion, I still believe that the classical recursion approach using the plpgsql procedural approach can yield much better results. It is flexible enough to help us avoid unnecessary processing and to return exactly what we need from our tree structures. Calculation of all possible paths can lead to serious performance degradations, and it would be wise to avoid that if we can. This is one possibility how we can do that.

Part 3 - Cycle Detection

To receive notifications about new posts and updates, consider subscribing to my LinkdIn page:
vb-software linkedin

You will receive notifications about new posts on your LinkedIn feed.

Comments