Sailr Solutions

App development and consulting

Storing hierarchical data in MySQL

I recently encountered an issue with a client where we needed to query a large amount of hierarchica

I recently encountered an issue with a client where we needed to query a large amount of hierarchical data in a MySQL database and it became apparent that the way we we're storing this data was causing problems.

The management and storage of hierarchical data is not really what a relational database is intended for.  Relational database tables are not hierarchical but are simply a flat list of data whereas hierarchical data has a parent-child relationship and this is not something that is natural to represent in a relational table.  For the purposes of this post, hierarchical data is a collection of data where each item has a single parent and zero or more children.  The only exception to this is the root item which has no parent)

There have developed many strategies to enable the storage of hierarchical data in relational databases.  Each of these has its own benefits and drawbacks.  This post will look at the different options available for storing the hierarchy depicted in the following diagram.  This represents a typical sub set of our overall group hierarchy.

One common approach to storing this data in SQL is using what is known as an adjacency list.  Put simply, each node in an adjacency list has a reference to its adjacent nodes.  In our groups data this is a reference to the groups parent.  Adjacency lists are a popular choice for storing simple hierarchies as they are easy to understand.  However, working with the adjacency list model in SQL can be very difficult.  It’s expensive to calculate all of the children for a given node and extra care must be taking when deleting as it’s possible to orphan entire sub trees.

This is how our example hierarchy would be represented with the Adjacency List Model:

id

parent id

name

1

0

ROOT

2

1

Groups

3

2

Sales

4

3

North

5

3

East

6

3

South

7

2

Support

8

7

Tier 2

9

7

Tier 3

 

It is possible to retrieve a full path from an adjacency model providing you know the level that you need to traverse to, the following code shows how you can retrieve a path for our example:

SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as lev4

FROM group AS t1

LEFT JOIN group AS t2 ON t2.parent = t1.category_id

LEFT JOIN group AS t3 ON t3.parent = t2.category_id

LEFT JOIN group AS t4 ON t4.parent = t3.category_id

WHERE t1.name = 'ROOT' AND t4.name = 'North';

 

 

The problem with this approach is that you need one self-join for every level in the hierarchy, and performance will naturally degrade with each level added as the Joins grow in complexity.  For database’s that support SQL-99 recursive syntax this isn’t such an issue but recursive functions are not supported by MySQL.

Path Enumeration Model

The Path Enumeration Model stores a full path value against each node.

id

parent_id

name

full_path

1

0

ROOT

0-1

2

1

Groups

0-1-2

3

2

Sales

0-1-2-3

4

3

North

0-1-2-3-4

5

3

East

0-1-2-3-5

6

3

South

0-1-2-3-6

7

2

Support

0-1-2-7

8

7

Tier 2

0-1-2-7-8

9

7

Tier 3

0-1-2-7-9

 

Using this approach, it is possible to get the descendants of a given node like this:

SELECT * FROM group where full_path LIKE ‘0-1-2-3’

The Closure Table Model

An alternative to the Adjacency Model is the Closure Table Model.  This approach uses a closure table to maintain a collection of all of the links (vertices) between all of the nodes.

The closure table is a many to many table and stores every path from every node to each of its descendants, nodes even connect to themselves.  This can look a little complicated at first:

In this example we would require an additional table defined as follows: 

ancestor

descendant

1

1

2

2

3

3

4

4

5

5

6

6

7

7

8

8

9

9

1

2

1

3

1

4

1

5

1

6

1

7

1

8

1

9

2

3

2

7

2

4

2

5

2

6

2

8

2

9

3

4

3

5

3

6

7

8

7

9

 

You need to maintain your closure table and this can be done with insert and update triggers.

It should be noted at this point that the closure table requires O(n2) rows.

Using this extra table, we can retrieve entire hierarchies with a single query like this:

select group_concat(g.name order by g.id separator ' -> ') as path
from closure d
join closure a on (a.descendant = d.descendant)
join group g on (g.id = a.ancestor)
where d.ancestor = 1 and d.descendant != d.ancestor
group by d.descendant;

This does provide a nice solution as it supports the use of referential integrity and is very easy to query.  Marinating the closures table can also be handled by using triggers on the group table.

 

With the Nested Set Model, we stop thinking about our hierarchy as nodes and lines and start to think in terms of nested collections of data.

Our example hierarchy can now be viewed like this:

  

 

Nested sets make use of left and right values to represent the hierarchy.  Essentially the id of the node to the left of the current node and the node to the right.  These are indicated on the following diagram:

Transpose these into a table and you now have this:

group_id

name

lft

rgt

1

ROOT

1

18

2

Groups

2

17

3

Sales

3

10

4

North

4

5

5

East

6

7

6

South

8

9

7

Support

11

16

8

Tier 2

12

13

9

Tier 3

14

15

With the nested set model, we can retrieve a single path for a group without having multiple self-joins so the cost of executing the query is the same regardless of the depth of the hierarchy.

 

SELECT parent.name

FROM group AS node,
Group AS parent

WHERE node.lft BETWEEN parent.lft AND parent.rgt AND node.name = ‘Sales’

ORDER BY node.lft;

 

The Nested Set Model allows for a hierarchies of unlimited depth and offers exceptional performance for querying subtrees.  There is increased complexity in maintaining the hierarchy and inserts and deletes and moves are more complex than they are with the Adjacency List Model.  Another issue with this model is that we would need a mechanism of converting our existing Adjacency List into a nested set which is complex on a database that doesn’t support recursion.

Choosing the Right Design

While each of the approaches outlined above have their own unique advantages I have tried to summarise them in the following table:

Model

No. tables

Query Child

Query Subtree

Delete Group

Insert Group

Move Subtree

Referential Integrity

Adjacency List

1

Easy

Hard

Easy

Easy

Easy

Yes

Path Enumeration

1

Hard

Easy

Easy

Easy

Easy

No

Closure Table

2

Easy

Easy

Easy

Easy

Easy

Yes

Nested Set

1

Hard

Easy

Hard

Hard

Hard

No

 

In the end we opted for the Closure table approach as for us at least it seemed the best solution.  Hopefully I will be able to put something together soon on how we implemented this in code and how we used a non-binary tree to navigate the group relationships.

I'm sure there are other techniques that can be used as well but with any luck this will be of some benefit to someone.  

Add comment

Loading