Tries in Haskell

2018-12-27
haskelltree

A trie is a nice tree data structure for storing items we want to frequently look up. It can be used to build a simple autocomplete or spell checking system. It stores sequences with the same prefixes (beginning sequence) together to reduce space and can performs a look up that takes O(m) where m is the length of the string we search.

Here is a simple trie dictionary to give you an idea of the structure (heat, heater, help, helper, hot, hotter, hottest). The . in the trie below shows where the words end. While we can see he in the structure, it does not have a . so it is not part of the dictionary, a lookup of he returns False and of hot returns True.

           h
      /    | 
     e     o
/    |     |
a    l     t.
|    |     |
t.   p.    t
|    |     |
e    e     e
|    |     | \
r.   r.    r. s
              |
              t.

Type declaration

A trie is a recursive structure. At each level it has a Bool representing a potential sequence boundary and stores a Map of Tries. The sequence boundaries have the Bool set to True and the leaf nodes have True and an empty map.

An empty trie is simple.

The insert function

To insert and update values into a trie, we will use Map.alter Ord k => (Maybe a -> Maybe a) -> k -> Map k a -> Map k a. Given a function f it will insert, update or delete a value a at key k. If the function returns Just it will insert (if k does not exist) or update it, if it returns Nothing than it will delete the value at k if it exists.

When the sequence is finished, we set the last node to True so that it knows there is a boundary when it looks up a sequence. While iterating throught the values of the sequence, it keeps the same Bool value, and recursively updates the child nodes at x.

Now we can build a simple dictionary with insert and a helper function.

The member function

We should check that we can find these words in the dictionary.

These return True:

λ> member "heat" dictionary
λ> member "hot" dictionary
λ> member "helper" dictionary

And these return False:

λ> member "he" dictionary
λ> member "hello" dictionary
λ> member "goodbye" dictionary

The delete function

This function gave me the most trouble because you need to keep track of which sequences are shared. For example, “help” and “helper” both share “help”. When we delete “helper”, we still need to keep “help”. We need to consider the following patterns when deleting:

forked

  m
  |
  e
 / \
 t. e
    |
    t.

in-line

  h
  |
  e
  |
  a
  |
  t.
  |
  e
  |
  r.

separate paths

  g  b
  |  |
  o  a
  |  |
  o  d.
  |
  d.

The main idea is to start from the top of the trie and follow the sequence all the way down, if certain conditions are met, then we can delete the sequence, if not, we continue through the subsequences checking if we can delete anything. In case we want to delete a boundary that is not at a leaf, we the list is empty, we check if there are any children, if there are we set it to False. It took a bit of trial, error and unit testing to write these functions so I will not claim that they are correct yet.

The delete function is slow because it repeatedly traces over the same path as it gets smaller until it is deleted. If you need to perform delete a lot on a trie, one idea is to keep track of what paths are deletable with and extra boolean flag on each level. You would need to update the insert function to support this.

Testing

Here are the tests I used to develop the functions.