Saturday, August 6, 2011

How I learned to stop worrying and love de Bruijn indices

When I started learning how to formalise languages in Coq, one of my initial stumbling blocks was choosing an approach to name binding. There are at least three fancy sounding contenders: Locally Nameless, Nominal Reasoning and Parametric Higher Order Abstract Syntax. Of course, you could also use good old de Bruijn indices, but proponents of fancy approaches warn about the large volume of lemmas regarding lifting and substitution you'll need to prove before you can start the "real work".

Back in April I scratched around for about two weeks reading up on all the different approaches, trying to identify the "best" one (whatever that means). Then I read the following comment by Xavier Leroy on the POPLmark website. He had used the Locally Nameless approach, but had stated: "If the purpose is just to solve the challenge and be done with it, I would rather go for pure de Bruijn". That was enough to convince me to stop worrying and just get into it. I've spent about 250 hours now doing my own proofs, and I can report that using de Bruijn isn't really that bad. Yes, you need to prove some lemmas about lifting and substitution, but my entire response to naysayers is summed up henceforth: "The only computational operation that lambda calculus has is substitution, so don't pretend you can get away without proving something about it." Corollary: HOAS doesn't count.

The Lemmas

You need to know what the lemmas are though, and I advise cribbing them from someone else's proof. I learned them from Arthur Charguéraud's proof, but I don't know where they are from originally. You're more than welcome to copy them from mine.

Once you know what the lemmas are, and how they work in STLC, extending them to new languages is straightforward. For reference, the following are the definitions of the lifting and substitution operators for STLC. Note that if you just want to prove Progress and Preservation for STLC, you don't actually need the lemmas. They're needed in bigger languages such as SystemF2 to show that type substitution is commutative. Nevertheless, for illustrative purposes we'll just stick with STLC expressions.

Inductive exp : Type :=
 | XVar  : nat -> exp
 | XLam  : ty  -> exp -> exp
 | XApp  : exp -> exp -> exp.

 liftX  (d:  nat) (* current binding depth in expression *)
        (xx: exp) (* expression containing references to lift *)
        : exp
 := match xx with 
    |  XVar ix    
    => if le_gt_dec d ix
        (* Index is referencing the env, so lift it across the new elem *)
        then XVar (S ix)

        (* Index is locally bound in the expression itself, and 
           not the environment, so we don't need to change it. *)
        else xx

    (* Increase the current depth as we move across a lambda. *)
    |  XLam t1 x1
    => XLam t1 (liftX (S d) x1)

    |  XApp x1 x2
    => XApp   (liftX d x1) (liftX d x2)

 substX (d:  nat) (* current binding depth in expression *)
        (u:  exp) (* new expression to substitute *)
        (xx: exp) (* expression to substitute into *)
        : exp 
 := match xx with
    | XVar ix 
    => match nat_compare ix d with
       (* Index matches the one we are substituting for. *)
       | Eq  => u
       (* Index was free in the original expression.
          As we've removed the outermost binder, also decrease this
          index by one. *)
       | Gt  => XVar (ix - 1)

       (* Index was bound in the original expression. *)
       | Lt  => XVar ix

    (* Increase the depth as we move across a lambda. *)
    |  XLam t1 x2
    => XLam t1 (substX (S d) (liftX 0 u) x2)

    |  XApp x1 x2 
    => XApp (substX d u x1) (substX d u x2)

Drawing the deBruijn environment

Here is a nice lemma to get started.
Lemma lift_lift
 :  forall n m t
 ,  lift n              (lift (n + m) t) 
 =  lift (1 + (n + m)) (lift n t).
This is one of the lemmas you need for proving commutativity of substitution. Although it looks complicated at first glance, it will make significantly more sense when you know how to draw it.

First, consider the standard typing judgement form:
Env |- Exp :: Type
Assume there are de Bruijn indices in the expression. Some of these indices point to lambdas in the expression itself, corresponding to locally bound variables, while some point to lambdas in the environment, corresponding to free variables. Here is an example:

Suppose I apply (lift 2) to the above expression. Although it would be helpful for educational purposes to do this directly using the definition of 'lift' above (try it!), I'll tell you the shortcut instead. Firstly, if we assume that the definition of 'lift' is correct, none of the bound indices will be changed during the application. Ignoring bound indices, the only ones remaining are free indices. These are the indices that point into the environment.

Now, although applying 'lift' to an expression may increment these free indices, instead of thinking about indices being incremented, it's easier to think about environment elements being shifted. I'll explain what I mean by that in a moment, but for now let's just do it...

Here is the same expression as before, with the environment positions numbered in blue:

Shifting the elements after position two yields the following:

In the above diagram I've used '_' to represent a hole in the environment. This is a place where I could insert a new element, and be guaranteed that none of the indices in the expression point to this new element. This in turn means that if I apply (lift 2) to the expression, then insert a new element into the hole, then the resulting judgement is guaranteed to give me the same type as before (t5 in this case).

Of course, applying 'lift' to the expression doesn't actually change the environment. In this example I've moved 't1' and 't2' to the left for illustrative purposes, but this is an operation separate from applying 'lift' to the expression itself. However, applying 'lift' does changes the indices, and these indices are pointers into the environment. If we like, we could instead think about moving the pointers (the arrow heads) instead of the actual environment elements. This is an important change in perspective. Instead of thinking about 'lift' as incrementing indices in the expression, we want to think about 'lift' as shifting environment pointers to the left.

Let's look at our lemma again:
Lemma lift_lift
 :  forall n m t
 ,  lift m              (lift (n + m) t) 
 =  lift (1 + (n + m)) (lift n t).

As 't' is quantified, we need to prove this for all possible terms 't'. Using our change in perspective, this also means that we want to prove the lemma for all possible sets of environment pointers. Given the environment pointers for 't', the expression '(lift (n + m) t)' transforms them into a new set (by adding a hole at position n + m). Likewise the expression 'lift (1 + (n + m)) (lift n t)' transforms the environment pointers of 't' by adding a hole a position 'n', and then adding another hole at position '1 + (n + m)'. Here is a picture of the full situation. I've just drawn the contiguous set of environment pointers as rectangles, with the holes are marked in grey.

In the first line we have set of environment pointers for 't'. Applying 'lift n' adds a hole a position 'n'. Alternatively, applying 'lift (n + m)' adds a hole at position '(n + m)'. Given 'lift n t', if we add another hole at (1 + n + m) we get the last set shown. Likewise, if we take the set 'lift (n + m) t' and add a hole at position 'n' then we also get the last line.

That's it. That complicated looking lemma above can be drawn with a few rectangles, and provided the definition of 'lift' does what it's supposed to, the lemma is obviously true. For a language like SystemF2 you need about 5 lemmas concerning lifting and substitution. They're all easy to draw, and the proofs are straightforward once you understand the general approach. In the next post I'll show how to draw substitution, which works in a similar way.


  1. Small typo in the second occurrence of `Lemma lift_lift` (after the sentence "Let's look at our lemma again") - the LHS of the equality should be `lift n (lift (n + m) t)` instead of `lift m (lift (n + m) t)`.

    Other than that, great work!