I spent a couple of days last week undoing one particular anti-pattern which was causing bogus error messages like: "Type 'Int' does not match type 'Int'". In DDC the root cause these messages was invariably this data type:
data Bound n = UIx Int -- An anonymous, deBruijn variable. | UName n -- A named variable or constructor. | UPrim n (Type n) -- A primitive value (or type) with its type (or kind).A value of type Bound n represents the bound occurrence of a variable or constructor, where n is the underlying type used for names. In practice n is often Text or String. The data type has three constructors, UIx for occurrences of anonymous variables, UName for named variables and constructors, and UPrim for names of primitives. We use Bound type for both terms and types.
The intent was that when type checking an expression, to determine the type (or kind) of a Bound thing in UIx or UName form, we would look it up in the type environment. However, as the types (and kinds) of primitives are fixed by the language definition, we would have their types attached directly to the UPrim constructor and save ourselves the cost of environment lookup. For example, we would represent the user defined type constructor 'List' as (UName "List"), but the primitive type constructor 'Int' as (UPrim "Int" kStar), where 'kStar' refers to the kind of data types.
The pain begins the first time you accidentally represent a primitive type constructor in the wrong form. Suppose you're parsing type constructor names from a source file, and happen to represent Int as (UName "Int") instead of (UPrim "Int" kData). Both versions are pretty printed as just "Int", so dumping the parsed AST does not reveal the problem. However, internally in the compiler the types of primitive operators like add and mul are all specified using the (UPrim "Int" kData) form, and you can't pass a value of type (UName "Int") to a function expecting a (UPrim "Int" kData). The the uninformative error message produced by the compiler simply "Type 'Int' does not match type 'Int'", disaster.
The first time this happens it takes an hour to find the problem, but when found you think "oh well, that was a trivial mistake, I can just fix this instance". You move on to other things, but next week it happens again, and you spend another hour -- then a month later it happens again and it takes two hours to find. In isolation each bug is fixable, but after a couple of years this reoccurring problem becomes a noticeable drain on your productivity. When new people join the project they invariably hit the same problem, and get discouraged because the error message on the command line doesn't give any hints about what might be wrong, or how to fix it.
A better way to handle names is to parameterise the data types that represent your abstract syntax tree with separate types for each different sort of name: for the bound and binding occurrences of variables, for bound and binding occurrences of constructors, and for primitives. If the implementation is in Haskell we can use type families to produce the type of each name based on a common index type, like so:
type family BindVar l type family BoundVar l type family BindCon l type family BoundCon l type family Prim l data Exp l = XVar (BoundVar l) | XCon (BoundCon l) | XPrim (Prim l) | XLam (BindVar l) (Exp l)DDC now uses this approach for the representation of the source AST. To represent all names by a single flat text value we define a tag type to represent this variation, then give instances for each of the type families:
data Flat = Flat type instance BindVar Flat = Text type instance BoundVar Flat = Text type instance BindCon Flat = Text type instance BoundCon Flat = Text type instance Prim Flat = Text type ExpFlat = Exp FlatOn the other hand, if we want a form that allows deBruijn indices for variables, and uses separate types for constructors and primitives we can use:
data Separate = Separate data Bind = BAnon | BName Text data Bound = UIx Int | UName Text data ConName = ConName Text data PrimName = PrimName Text type instance BindVar Separate = Bind type instance BoundVar Separate = Bound type instance BindCon Separate = ConName type instance BoundCon Separate = ConName type instance Prim Separate = PrimName type ExpSeparate = Exp SeparateIt's also useful to convert between the above two representations. We might use ExpSeparate internally during program transformation, but use ExpFlat as an intermediate representation when pretty printing. To interface with legacy code we can also instantiate BoundVar with our old Bound type, so the new generic representation is strictly better than the old non-generic one using a hard-wired Bound.
Compiler engineering is full of traps of representation. Decisions taken about how to represent the core data structures permeate the entire project, and once made are very time consuming to change. Good approaches are also difficult to learn. Suppose we inspect the implementation of another compiler and the developers have set up their core data structures in some particular way. Is it set up like that because it's a good way to do so?, or is it set up like that because it's a bad way of doing so, but now it's too difficult to change? For the particular case of variable binding, using type like Bound above is a bad way of doing it. Using the generic representation is strictly better. Let this be a warning to future generations.
No comments:
Post a Comment