Sorting and searching: April 2015

Note: This blog has now moved to https://sortingsearching.com/. This is an old version.

Balancing binary search trees has a reputation for being very hard to implement. Data structure books seem to list dozens of different cases, especially for deleting elements.

Let's change this - it's not that hard! The analysis will be a little involved, but the code is going to be simple.

Here are some design choices:

We'll implement AVL trees. I initially started writing this in terms of 2-3 trees. Then I decided AVL was going to be even simpler.
All elements are stored in the leaves. This is unusual: people usually store values in internal nodes. But it makes things so much simpler!
Internal nodes store a reference to the smallest (left-most) element of the sub-tree. This will be useful to guide us down the tree.
Internal nodes also store the sub-tree height. AVL trees are usually described with nodes only storing the height difference of their children, but storing the height instead makes things easier still.
We'll implement this in Haskell. We like algebraic data types for trees!

[Figure: Example AVL tree.]

OK, let's get down to it!

Preliminaries

Start by defining binary trees.

data Tree t = Empty | Leaf t | Node Int t (Tree t) (Tree t)

A tree of elements of type t is either empty, it is a leaf, or it is an internal node. A leaf contains a value of type t. An internal node contains: the height, the smallest element in the sub-tree, and the two children. Write some helper functions:

height :: Tree t -> Int
height Empty = error "height Empty"
height (Leaf _) = 0
height (Node h _ _ _) = h

smallest :: Tree t -> t
smallest Empty = error "smallest Empty"
smallest (Leaf x) = x
smallest (Node _ s _ _) = s

left, right :: Tree t -> Tree t
left  (Node _ _ a _) = a
left _ = error "only internal nodes have children"
right (Node _ _ _ b) = b
right _ = error "only internal nodes have children"

toList :: Tree t -> [t]
toList Empty = []
toList (Leaf x) = [x]
toList (Node _ _ a b) = toList a ++ toList b

The toList function is not optimal: it works in Θ(n log n) time (why?). Θ(n) is possible. I'll leave that as a coding exercise.
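
The slowdown comes from ++, which copies its left argument, so every level of a balanced tree costs Θ(n) in total. One possible solution to the exercise, as a minimal sketch: carry the "rest of the list" along as an accumulator. The names toListLinear and example are made up here and are not part of the original code.

```haskell
data Tree t = Empty | Leaf t | Node Int t (Tree t) (Tree t)

-- 'go t rest' prepends the elements of t to the already-built list 'rest',
-- so every node is visited once and no intermediate list is copied.
toListLinear :: Tree t -> [t]
toListLinear t = go t []
  where
    go Empty          rest = rest
    go (Leaf x)       rest = x : rest
    go (Node _ _ a b) rest = go a (go b rest)

-- A small hand-built tree; the height and smallest fields are filled in by hand.
example :: Tree Int
example = Node 2 1 (Node 1 1 (Leaf 1) (Leaf 2)) (Leaf 3)
```

Loaded in GHCi, toListLinear example gives [1,2,3].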

Time for some AVL-specific stuff. How do we build and balance trees? As a reminder, this is how AVL trees work:

All elements in the left sub-tree should be smaller than (or equal to) all the elements in the right sub-tree.
The difference in height of the two sub-trees should be at most 1. We call this "similar height".

Create another little helper function: build a tree from two sub-trees of similar height:

node :: Tree t -> Tree t -> Tree t
node a b
  | abs (height a - height b) <= 1
    = Node (max (height a) (height b) + 1) (smallest a) a b
  | otherwise = error "unbalanced tree"
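
To make the two rules and the stored fields concrete, here is a sketch of a checker that tests them on a finished tree. The functions valid and largest and the three sample trees are hypothetical helpers introduced for illustration. The checker is not optimized: largest walks down the tree at every node.

```haskell
data Tree t = Empty | Leaf t | Node Int t (Tree t) (Tree t)

height :: Tree t -> Int
height Empty = error "height Empty"
height (Leaf _) = 0
height (Node h _ _ _) = h

smallest :: Tree t -> t
smallest Empty = error "smallest Empty"
smallest (Leaf x) = x
smallest (Node _ s _ _) = s

-- Hypothetical helper: the right-most element of a non-empty tree.
largest :: Tree t -> t
largest Empty = error "largest Empty"
largest (Leaf x) = x
largest (Node _ _ _ b) = largest b

node :: Tree t -> Tree t -> Tree t
node a b
  | abs (height a - height b) <= 1
    = Node (max (height a) (height b) + 1) (smallest a) a b
  | otherwise = error "unbalanced tree"

-- Hypothetical checker: ordering, similar heights, and correct stored fields.
valid :: Ord t => Tree t -> Bool
valid Empty = True          -- Empty is only allowed as the whole tree
valid t     = go t
  where
    go Empty          = False
    go (Leaf _)       = True
    go (Node h s a b) =
      go a && go b
        && abs (height a - height b) <= 1
        && h == max (height a) (height b) + 1
        && s == smallest a
        && largest a <= smallest b

good, misordered, lopsided :: Tree Int
good       = node (node (Leaf 1) (Leaf 2)) (Leaf 3)
misordered = Node 1 2 (Leaf 2) (Leaf 1)   -- left element larger than right
lopsided   = Node 3 1 good (Leaf 4)       -- child heights 2 and 0
```

Only good passes the check; the other two each break one rule.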

It's finally time to start manipulating our trees:

Merging trees

Wait, what? Merging trees as the first operation? What about insert?

We'll get to insert later. As you'll see, merge is going to be our basic operation. Everything else will be defined in terms of it!

What do we mean by merge? Suppose we have two trees, A and B, such that all elements of A are smaller than or equal to all elements of B. The merged tree will contain all elements from both.

The "node" operation works when the two trees have similar heights. What if their heights differ by more than 1? Assume that A is the shorter tree. Some helpful notation that we will use:

A + B means A merged with B
A.L, A.R are the left and right sub-trees
A[h] is a tree of height h
A[h₁..h₂] is a tree whose height is between h₁ and h₂ (inclusive)
{A,B} is a tree whose children are A and B. A = {A.L, A.R}

So, again: we're trying to merge a tree A[0 .. h-2] with another tree B[h]. Our strategy is going to be as follows: merge A with B.L, then merge the resulting tree with B.R. Does this work?

[Figure: Recursively merging a small tree with a larger tree.]

A[0 .. h-2] + B[h] = (A[0 .. h-2] + B.L[h-2 .. h-1]) + B.R[h-2 .. h-1] = [h-2 .. h] + B.R[h-2 .. h-1]

How do we know that the tree resulting from the first merge has height in the range [h-2 .. h]? It's because merging two trees of heights h₁ and h₂ will always result in a tree of height either max(h₁, h₂) or max(h₁, h₂) + 1. This is certainly true for similar-height trees. Go ahead and verify that it will remain true in other cases as well.

OK, so what exactly happened when we tried to merge two trees with non-similar heights? After the recursive merge, we reduced the height difference to at most 2. Now what? Let's see if the same process works for a height difference of exactly 2:

A[h-2] + B[h] = (A[h-2] + B.L[h-2 .. h-1]) + B.R[h-2 .. h-1]
= {A[h-2], B.L[h-2 .. h-1]} + B.R[h-2 .. h-1] = [h-1 .. h] + B.R[h-2 .. h-1]

The trees A and B.L now have similar heights, so their merge is easy. If B.R has height h-1, the final merge is also easy.

What if it has height h-2? Then B[h] = {B.L[h-1], B.R[h-2]} and:

[Figure: Special case: bad!]

A[h-2] + {B.L[h-1], B.R[h-2]} = (A[h-2] + B.L[h-1]) + B.R[h-2] = {[h-2], [h-1]} + [h-2]

Uh oh! Can you see what happened?

We tried to merge [h-2] with {[h-1], [h-2]}, and reduced the problem to merging {[h-2], [h-1]} with [h-2]. That's the mirror image of the original problem! No improvement. If we keep doing this, we'll get into an infinite loop!

We need to use a different strategy in this special case, to break the cycle. Fortunately, it's easy: A, B.L.L, B.L.R, B.R all have similar heights! Just pair them up.

[Figure: Special case: good.]

A[h-2] + {B.L[h-1], B.R[h-2]} = A[h-2] + B.L.L[h-3 .. h-2] + B.L.R[h-3 .. h-2] + B.R[h-2] = {{A, B.L.L}, {B.L.R, B.R}}
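
The smallest instance of this special case has h = 2: A is a single leaf and B has a two-leaf left child and a single-leaf right child. The sketch below works through the regrouping on that instance. The names regroup, treeA and treeB are made up for this example.

```haskell
data Tree t = Empty | Leaf t | Node Int t (Tree t) (Tree t)

height :: Tree t -> Int
height Empty = error "height Empty"
height (Leaf _) = 0
height (Node h _ _ _) = h

smallest :: Tree t -> t
smallest Empty = error "smallest Empty"
smallest (Leaf x) = x
smallest (Node _ s _ _) = s

left, right :: Tree t -> Tree t
left (Node _ _ a _) = a
left _ = error "only internal nodes have children"
right (Node _ _ _ b) = b
right _ = error "only internal nodes have children"

toList :: Tree t -> [t]
toList Empty = []
toList (Leaf x) = [x]
toList (Node _ _ a b) = toList a ++ toList b

node :: Tree t -> Tree t -> Tree t
node a b
  | abs (height a - height b) <= 1
    = Node (max (height a) (height b) + 1) (smallest a) a b
  | otherwise = error "unbalanced tree"

-- Hypothetical name for the regrouping step {{A, B.L.L}, {B.L.R, B.R}}.
regroup :: Tree t -> Tree t -> Tree t
regroup a b = node (node a (left bl)) (node (right bl) br)
  where
    bl = left b
    br = right b

-- Smallest instance (h = 2): A[0] + {B.L[1], B.R[0]}.
treeA, treeB :: Tree Int
treeA = Leaf 1
treeB = node (node (Leaf 2) (Leaf 3)) (Leaf 4)
```

In GHCi, toList (regroup treeA treeB) gives [1,2,3,4] and the height stays at 2, whereas node treeA treeB fails with "unbalanced tree".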

Here is the code:

merge :: Tree t -> Tree t -> Tree t
merge Empty x = x
merge x Empty = x
merge a b
  | abs (height a - height b) <= 1
    = node a b
  | height a == height b - 2 && height (right b) == height b - 2
    -- the special case: [h-2] + {[h-1],[h-2]}
    = let (bl,br) = (left b, right b)
          (bll,blr) = (left bl, right bl)
      in node (node a bll) (node blr br)
  | height a < height b
    = merge (merge a (left b)) (right b)
  | otherwise
    = merge (left a) (merge (right a) b)

That's it! Pretty short, huh? That's all we need.

How long does a merge take? We always reduce the difference in heights in the first recursive call, and finish by merging two trees of heights differing by at most 2, which takes constant time. Therefore, merging two trees of heights h₁ and h₂ takes O(|h₁ - h₂| + 1) time.
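
The claim that a merge yields height max(h₁, h₂) or max(h₁, h₂) + 1 can also be spot-checked mechanically. The sketch below repeats the definitions so that it loads on its own; fromAscending and heightClaimHolds are hypothetical helpers, and a passing check on small sizes is evidence, not a proof.

```haskell
data Tree t = Empty | Leaf t | Node Int t (Tree t) (Tree t)

height :: Tree t -> Int
height Empty = error "height Empty"
height (Leaf _) = 0
height (Node h _ _ _) = h

smallest :: Tree t -> t
smallest Empty = error "smallest Empty"
smallest (Leaf x) = x
smallest (Node _ s _ _) = s

left, right :: Tree t -> Tree t
left (Node _ _ a _) = a
left _ = error "only internal nodes have children"
right (Node _ _ _ b) = b
right _ = error "only internal nodes have children"

toList :: Tree t -> [t]
toList Empty = []
toList (Leaf x) = [x]
toList (Node _ _ a b) = toList a ++ toList b

node :: Tree t -> Tree t -> Tree t
node a b
  | abs (height a - height b) <= 1
    = Node (max (height a) (height b) + 1) (smallest a) a b
  | otherwise = error "unbalanced tree"

merge :: Tree t -> Tree t -> Tree t
merge Empty x = x
merge x Empty = x
merge a b
  | abs (height a - height b) <= 1
    = node a b
  | height a == height b - 2 && height (right b) == height b - 2
    = let (bl, br) = (left b, right b)
          (bll, blr) = (left bl, right bl)
      in node (node a bll) (node blr br)
  | height a < height b
    = merge (merge a (left b)) (right b)
  | otherwise
    = merge (left a) (merge (right a) b)

-- Hypothetical helper: build a tree from an ascending list, one leaf at a time.
fromAscending :: [t] -> Tree t
fromAscending = foldl (\t x -> merge t (Leaf x)) Empty

-- Merge trees of m and n elements (m, n >= 1) and check contents and height.
heightClaimHolds :: Int -> Int -> Bool
heightClaimHolds m n =
  let a = fromAscending [1 .. m]
      b = fromAscending [m + 1 .. m + n]
      c = merge a b
      top = max (height a) (height b)
  in toList c == [1 .. m + n] && height c `elem` [top, top + 1]
```

For example, and [heightClaimHolds m n | m <- [1..40], n <- [1..40]] exercises 1600 pairs of sizes. Because node raises an error on unbalanced input, the same run also checks that merge never builds an unbalanced node on these inputs.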

Splitting trees

Now we want to do the reverse of merging: split a tree into two smaller trees. OK, but split where? We need some way of telling which elements belong to the left part and which elements belong to the right part.

This is accomplished by a functional argument: a function that returns "false" for the left elements and "true" for the right elements. The requirement is that the function is monotonic: we can't have x < y and x belonging to the right while y belongs to the left. Other than that, any function will do. Often, this will be something like: put x <= 5 on the left and x > 5 on the right.

Is splitting going to be more complicated than merging? No! It's going to be simpler. In fact, we will use merging to do splitting!

Here is where having the reference to the smallest element of a sub-tree comes in handy:

If the smallest element of the right sub-tree belongs to the left, the whole left sub-tree belongs to the left, and we only need to split the right sub-tree.
If the smallest element of the right sub-tree belongs to the right, the whole right sub-tree belongs to the right, and we only need to split the left sub-tree.

split :: Tree t -> (t->Bool) -> (Tree t, Tree t)
split Empty _ = (Empty, Empty)
split (Leaf x) isBig
  | isBig x   = (Empty, Leaf x)
  | otherwise = (Leaf x, Empty)

split (Node _ _ a b) isBig
  | isBig (smallest b)
    = let (a1,a2) = split a isBig
      in (a1, merge a2 b)
  | otherwise
    = let (b1,b2) = split b isBig
      in (merge a b1, b2)

Done.
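
A quick way to gain confidence in split is to compare it with filter on plain lists. The sketch below does that for the predicate (> k), which is monotonic. It repeats the definitions so that it loads on its own; fromAscending and splitIsConsistent are hypothetical helpers.

```haskell
data Tree t = Empty | Leaf t | Node Int t (Tree t) (Tree t)

height :: Tree t -> Int
height Empty = error "height Empty"
height (Leaf _) = 0
height (Node h _ _ _) = h

smallest :: Tree t -> t
smallest Empty = error "smallest Empty"
smallest (Leaf x) = x
smallest (Node _ s _ _) = s

left, right :: Tree t -> Tree t
left (Node _ _ a _) = a
left _ = error "only internal nodes have children"
right (Node _ _ _ b) = b
right _ = error "only internal nodes have children"

toList :: Tree t -> [t]
toList Empty = []
toList (Leaf x) = [x]
toList (Node _ _ a b) = toList a ++ toList b

node :: Tree t -> Tree t -> Tree t
node a b
  | abs (height a - height b) <= 1
    = Node (max (height a) (height b) + 1) (smallest a) a b
  | otherwise = error "unbalanced tree"

merge :: Tree t -> Tree t -> Tree t
merge Empty x = x
merge x Empty = x
merge a b
  | abs (height a - height b) <= 1
    = node a b
  | height a == height b - 2 && height (right b) == height b - 2
    = let (bl, br) = (left b, right b)
          (bll, blr) = (left bl, right bl)
      in node (node a bll) (node blr br)
  | height a < height b
    = merge (merge a (left b)) (right b)
  | otherwise
    = merge (left a) (merge (right a) b)

split :: Tree t -> (t -> Bool) -> (Tree t, Tree t)
split Empty _ = (Empty, Empty)
split (Leaf x) isBig
  | isBig x   = (Empty, Leaf x)
  | otherwise = (Leaf x, Empty)
split (Node _ _ a b) isBig
  | isBig (smallest b)
    = let (a1, a2) = split a isBig
      in (a1, merge a2 b)
  | otherwise
    = let (b1, b2) = split b isBig
      in (merge a b1, b2)

-- Hypothetical helper: build a tree from an ascending list, one leaf at a time.
fromAscending :: [t] -> Tree t
fromAscending = foldl (\t x -> merge t (Leaf x)) Empty

-- Split the tree holding 1..n at k and compare both halves with filter.
splitIsConsistent :: Int -> Int -> Bool
splitIsConsistent n k =
  let (lo, hi) = split (fromAscending [1 .. n]) (> k)
  in toList lo == filter (<= k) [1 .. n] && toList hi == filter (> k) [1 .. n]
```

Trying every k from 0 to n also covers the two edge cases where one side comes back Empty.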

How fast is it? Since each merge can take Θ(log n) time, are we running the risk of having split run in Θ(log² n) time?

Fortunately, not.

Let T(h, h₁, h₂) be the maximum run-time of split of a tree of height h that will return two trees of heights h₁, h₂.

Suppose that we are in the first case of the code above: we are splitting the left sub-tree a, and then merging a2 with b. If the height of a2 is h₃, then the run-time for the recursive split and the following merge is:
T(h-1 [or h-2], h₁, h₃) + O(h₂ - h₃ + 1)
Similarly in the second case.

Notice how the merge time is "paid for" by a corresponding decrease in one of the h's. In other words, when we have to do a long merge because of a small tree returned from a recursive call, it means we had less work to do in that recursive call.

By induction: T(h, h₁, h₂) = O(h + h₁ + h₂) = O(log n).
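
Spelled out, the induction step looks roughly like this, taking the recurrence above as given. The constant c and the extra +1 in the bound are assumptions added here so that the base case (a single leaf) fits; h' stands for the height of the sub-tree that gets split, so h' ≤ h - 1.

```latex
\begin{align*}
T(h, h_1, h_2) &\le T(h', h_1, h_3) + c\,(h_2 - h_3 + 1) \\
               &\le c\,(h' + h_1 + h_3 + 1) + c\,(h_2 - h_3 + 1) \\
               &=   c\,(h' + 1 + h_1 + h_2 + 1) \\
               &\le c\,(h + h_1 + h_2 + 1).
\end{align*}
```

The h₃ terms cancel, which is exactly the "paid for" effect described above. The second case of split is symmetric, with the roles of h₁ and h₂ swapped.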

Insert, delete, etc

Now that we have split and merge, we can do anything we want, easily!

contains :: Ord t => Tree t -> t -> Bool
contains a x =
  case split a (>=x) of
    (_, Empty) -> False
    (_, b) -> smallest b == x

insert :: Ord t => Tree t -> t -> Tree t
insert a x =
  let (a1, a2) = split a (>=x)
  in merge a1 (merge (Leaf x) a2)

delete :: Ord t => Tree t -> t -> Tree t
delete a x =
  let (b, _) = split a (>=x)
      (_, c) = split a (>x)
  in merge b c

fromList :: Ord t => [t] -> Tree t
fromList = foldl insert Empty
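
One thing to be aware of, reading the code above: insert keeps duplicates (inserting 5 twice stores 5 twice), while delete removes every copy of x. If set semantics are wanted instead, the same two building blocks are enough. The sketch below repeats the definitions so that it loads on its own; insertUnique is a hypothetical variant, not part of the original code.

```haskell
data Tree t = Empty | Leaf t | Node Int t (Tree t) (Tree t)

height :: Tree t -> Int
height Empty = error "height Empty"
height (Leaf _) = 0
height (Node h _ _ _) = h

smallest :: Tree t -> t
smallest Empty = error "smallest Empty"
smallest (Leaf x) = x
smallest (Node _ s _ _) = s

left, right :: Tree t -> Tree t
left (Node _ _ a _) = a
left _ = error "only internal nodes have children"
right (Node _ _ _ b) = b
right _ = error "only internal nodes have children"

toList :: Tree t -> [t]
toList Empty = []
toList (Leaf x) = [x]
toList (Node _ _ a b) = toList a ++ toList b

node :: Tree t -> Tree t -> Tree t
node a b
  | abs (height a - height b) <= 1
    = Node (max (height a) (height b) + 1) (smallest a) a b
  | otherwise = error "unbalanced tree"

merge :: Tree t -> Tree t -> Tree t
merge Empty x = x
merge x Empty = x
merge a b
  | abs (height a - height b) <= 1
    = node a b
  | height a == height b - 2 && height (right b) == height b - 2
    = let (bl, br) = (left b, right b)
          (bll, blr) = (left bl, right bl)
      in node (node a bll) (node blr br)
  | height a < height b
    = merge (merge a (left b)) (right b)
  | otherwise
    = merge (left a) (merge (right a) b)

split :: Tree t -> (t -> Bool) -> (Tree t, Tree t)
split Empty _ = (Empty, Empty)
split (Leaf x) isBig
  | isBig x   = (Empty, Leaf x)
  | otherwise = (Leaf x, Empty)
split (Node _ _ a b) isBig
  | isBig (smallest b)
    = let (a1, a2) = split a isBig
      in (a1, merge a2 b)
  | otherwise
    = let (b1, b2) = split b isBig
      in (merge a b1, b2)

-- The insert from the text, repeated for comparison: it keeps duplicates.
insert :: Ord t => Tree t -> t -> Tree t
insert a x =
  let (a1, a2) = split a (>= x)
  in merge a1 (merge (Leaf x) a2)

-- Hypothetical variant: drop any existing copies of x before adding one back.
insertUnique :: Ord t => Tree t -> t -> Tree t
insertUnique a x =
  let (smaller, rest) = split a (>= x)   -- smaller: elements < x
      (_, larger)     = split rest (> x) -- larger: elements > x
  in merge smaller (merge (Leaf x) larger)
```

With these definitions, toList (foldl insert Empty [5,3,5,7,3]) gives [3,3,5,5,7], while toList (foldl insertUnique Empty [5,3,5,7,3]) gives [3,5,7].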

Let's see if it works

$ ghci Tree.hs
GHCi, version 7.4.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
[1 of 1] Compiling Tree             ( Tree.hs, interpreted )
Ok, modules loaded: Tree.
*Tree> let a = fromList [10, 5, 7, 18, 3]
*Tree> toList a
[3,5,7,10,18]
*Tree> toList (insert a 4)
[3,4,5,7,10,18]
*Tree> toList (delete a 10)
[3,5,7,18]
*Tree> contains a 12
False
*Tree> contains a 5
True
*Tree> let (b,c) = split a (\x -> x*x > 50)
*Tree> toList b
[3,5,7]
*Tree> toList c
[10,18]

Yay!

Sorting and searching

Tuesday, April 21, 2015

Balanced binary search trees: the easy way
