Saturday, August 4, 2018

Binary interface files with Shimmer

Over the last couple of months I've merged three big changes that continue to move DDC in the direction of a real compiler rather than a research experiment:
  1. Support for binary interface files via the shimmer project.
  2. Reworked the type checker and compiler driver to only require interface files for directly imported modules, rather than all transitively imported modules.
  3. Rewrote the LLVM code printer to write directly to file, instead of going via a pretty printer.
Shimmer: Shimmer is a scheme-like language intended as a shim between compiler stages and intermediate languages. It provides a s-expression syntax and untyped functional intermediate language in which higher level languages can be encoded. It also has a native binary format, as well as its own REPL for testing. Shimmer is available as a hackage package.

DDC interface files are now serialised as binary shimmer files, using the grammar described on the Discus homepage. These files can be dumped in human readable (sort-of) form using the shimmer CLI that comes with the hackage package. For example, once you've build the DDC base libraries you can do:

$ shimmer -load src/s2/base/Data/List.di
+m-name = %module-name "Data" "List";

+m-deps = %l (%module-name "Class" "Applicative") ...

...

@d-Applicative 
   = %d-alg (%module-name "Class" "Applicative") "Applicative" 
      (%l (%bn "f" (%ta %tkf %tkd %tkd))) 
      (%s (%l (%ctor #nat'0 (%module-name "Class" "Applicative") "Applicative" 
              (%tu "Functor" "f") 
              (%tl (%bn "a" %tkd) (%tf "a" (%ta "f" "a"))) 
              (%tl (%bn "a" %tkd) (%tl (%bn "b" %tkd) (%tf (%ta "f" (%tf "a" "b")) (%ta "f" "a") (%ta "f" "b")))) 
                                  (%tu "Applicative" "f"))));
...

@t-mapMaybe 
 = %tl (%bn "a" %tkd) (%tl (%bn "b" %tkd) 
         (%tf (%tf "a" (%tu "Maybe" "b")) (%tu "List" "a") (%tu "List" "b")));

@x-mapMaybe 
 = %xb (%mtn "a" %tkd) (%mtn "b" %tkd) 
       (%mxn "f"  (%tf "a" (%tu "Maybe" "b"))) 
       (%mxn "xx" (%tu "List" "a")) 
       (%xc "xx"  (%ab (%dcn %n %n (%dc "Nil"))  (%xa (%dcn %n %n (%dc "Nil")) (%rt "b"))) 
                  (%ab (%dcn %n %n (%dc "Cons")) (%bn "x" "a") (%bn "xs" (%tu "List" "a")) 
                  (%xc (%xa "f" "x") (%ab (%dcn %n %n (%dc "Nothing")) (%xa "mapMaybe" (%rt "a") (%rt "b") "f" "xs")) 
                                     (%ab (%dcn %n %n (%dc "Just")) (%bn "x'" "b") 
                                          (%xa (%dcn %n %n (%dc "Cons")) (%rt "b") "x'" 
                                          (%xa "mapMaybe" (%rt "a") (%rt "b") "f" "xs"))))));

This is a simple encoding of the AST of the DDC Core intermediate language. For example @d-Applicative gives the data type declaration for an Applicative type class dictionary. @t-mapMaybe encodes the type of the mapMaybe function, and @x-mapMaybe encodes the declaration. %tl means "forall", %xb is a lambda binder, %xc a case expression, %xa an application, and so on. See the grammar for the rest.

The binary format is a not-completely-horrible packed representation, where I've made an effort to avoid redundant encoding, but haven't done any compression or indexing yet. The main feature is that serialisation and deserialisation are implemented by simply peeking and poking from memory, so performance should be reasonable. The binary grammar is defined in the hackage package.

Q. Why the name Shimmer?
A. Planner, Conniver, Schemer, Shimmer.

Did you know that Scheme used to be called "Schemer", but that Guy Steele's computer at the time only supported 6 character file names? Shimmer is a Scheme-like language, but without the Scheme.

No more transitive imports: Previously, when type checking a module, DDC needed to load the textual interface files for all transitively reachable modules, which in most cases was the entire base library. This approach works fine for a toy language and a minimal base library, but implies that the more useful the base library becomes, the slower compilation gets. In the current version, when some module A re-exports declarations for module B, those declarations are also added to module A's interface file. In general this means that when compiling a given module we only need to load interface files for directly imported modules, not all modules that are transitively reachable. When doing this change I also added in-memory indexes for the various sorts of declarations (data types, type synonyms, value declarations etc) to speed up name resolution. This framework should also support the addition of on-disk database style indexes to shimmer files. We'll need this when the base library grows large enough that deserialisation of interface files becomes a problem again.

Direct write of LLVM code: Haskell programmers love their pretty printer libraries. I had been using wl-pprint to pretty print LLVM code before calling the LLVM compiler to convert them to binary object files. For small output files intended for human consumption, wl-pprint is great. However, if one just wants to dump some LLVM code to feed to the LLVM compiler, then the time to construct and then consume the intermediate Doc type, as well as the overhead of Haskell Strings becomes an unreasonable overhead. Today I rewrote the LLVM generator to just write directly to a file handle -- an "ugly-printer", if you will. Doing that yielded a ~10% reduction in overall compilation time.

These three changes reduced compile time for the ddc-demo tree to about 35% of what it was previously, or about a 2.5x speedup. There are more low hanging fruits, but I've got enough juice for now, so am planning to return to higher level concerns in the immediate future. The DDC reflection system is also starting to work, but that will be the topic of a subsequent blog post.