Saturday, August 4, 2018

Binary interface files with Shimmer

Over the last couple of months I've merged three big changes that continue to move DDC in the direction of a real compiler rather than a research experiment:
  1. Support for binary interface files via the shimmer project.
  2. Reworked the type checker and compiler driver to only require interface files for directly imported modules, rather than all transitively imported modules.
  3. Rewrote the LLVM code printer to write directly to file, instead of going via a pretty printer.
Shimmer: Shimmer is a scheme-like language intended as a shim between compiler stages and intermediate languages. It provides a s-expression syntax and untyped functional intermediate language in which higher level languages can be encoded. It also has a native binary format, as well as its own REPL for testing. Shimmer is available as a hackage package.

DDC interface files are now serialised as binary shimmer files, using the grammar described on the Discus homepage. These files can be dumped in human readable (sort-of) form using the shimmer CLI that comes with the hackage package. For example, once you've build the DDC base libraries you can do:

$ shimmer -load src/s2/base/Data/List.di
+m-name = %module-name "Data" "List";

+m-deps = %l (%module-name "Class" "Applicative") ...

...

@d-Applicative 
   = %d-alg (%module-name "Class" "Applicative") "Applicative" 
      (%l (%bn "f" (%ta %tkf %tkd %tkd))) 
      (%s (%l (%ctor #nat'0 (%module-name "Class" "Applicative") "Applicative" 
              (%tu "Functor" "f") 
              (%tl (%bn "a" %tkd) (%tf "a" (%ta "f" "a"))) 
              (%tl (%bn "a" %tkd) (%tl (%bn "b" %tkd) (%tf (%ta "f" (%tf "a" "b")) (%ta "f" "a") (%ta "f" "b")))) 
                                  (%tu "Applicative" "f"))));
...

@t-mapMaybe 
 = %tl (%bn "a" %tkd) (%tl (%bn "b" %tkd) 
         (%tf (%tf "a" (%tu "Maybe" "b")) (%tu "List" "a") (%tu "List" "b")));

@x-mapMaybe 
 = %xb (%mtn "a" %tkd) (%mtn "b" %tkd) 
       (%mxn "f"  (%tf "a" (%tu "Maybe" "b"))) 
       (%mxn "xx" (%tu "List" "a")) 
       (%xc "xx"  (%ab (%dcn %n %n (%dc "Nil"))  (%xa (%dcn %n %n (%dc "Nil")) (%rt "b"))) 
                  (%ab (%dcn %n %n (%dc "Cons")) (%bn "x" "a") (%bn "xs" (%tu "List" "a")) 
                  (%xc (%xa "f" "x") (%ab (%dcn %n %n (%dc "Nothing")) (%xa "mapMaybe" (%rt "a") (%rt "b") "f" "xs")) 
                                     (%ab (%dcn %n %n (%dc "Just")) (%bn "x'" "b") 
                                          (%xa (%dcn %n %n (%dc "Cons")) (%rt "b") "x'" 
                                          (%xa "mapMaybe" (%rt "a") (%rt "b") "f" "xs"))))));

This is a simple encoding of the AST of the DDC Core intermediate language. For example @d-Applicative gives the data type declaration for an Applicative type class dictionary. @t-mapMaybe encodes the type of the mapMaybe function, and @x-mapMaybe encodes the declaration. %tl means "forall", %xb is a lambda binder, %xc a case expression, %xa an application, and so on. See the grammar for the rest.

The binary format is a not-completely-horrible packed representation, where I've made an effort to avoid redundant encoding, but haven't done any compression or indexing yet. The main feature is that serialisation and deserialisation are implemented by simply peeking and poking from memory, so performance should be reasonable. The binary grammar is defined in the hackage package.

Q. Why the name Shimmer?
A. Planner, Conniver, Schemer, Shimmer.

Did you know that Scheme used to be called "Schemer", but that Guy Steele's computer at the time only supported 6 character file names? Shimmer is a Scheme-like language, but without the Scheme.

No more transitive imports: Previously, when type checking a module, DDC needed to load the textual interface files for all transitively reachable modules, which in most cases was the entire base library. This approach works fine for a toy language and a minimal base library, but implies that the more useful the base library becomes, the slower compilation gets. In the current version, when some module A re-exports declarations for module B, those declarations are also added to module A's interface file. In general this means that when compiling a given module we only need to load interface files for directly imported modules, not all modules that are transitively reachable. When doing this change I also added in-memory indexes for the various sorts of declarations (data types, type synonyms, value declarations etc) to speed up name resolution. This framework should also support the addition of on-disk database style indexes to shimmer files. We'll need this when the base library grows large enough that deserialisation of interface files becomes a problem again.

Direct write of LLVM code: Haskell programmers love their pretty printer libraries. I had been using wl-pprint to pretty print LLVM code before calling the LLVM compiler to convert them to binary object files. For small output files intended for human consumption, wl-pprint is great. However, if one just wants to dump some LLVM code to feed to the LLVM compiler, then the time to construct and then consume the intermediate Doc type, as well as the overhead of Haskell Strings becomes an unreasonable overhead. Today I rewrote the LLVM generator to just write directly to a file handle -- an "ugly-printer", if you will. Doing that yielded a ~10% reduction in overall compilation time.

These three changes reduced compile time for the ddc-demo tree to about 35% of what it was previously, or about a 2.5x speedup. There are more low hanging fruits, but I've got enough juice for now, so am planning to return to higher level concerns in the immediate future. The DDC reflection system is also starting to work, but that will be the topic of a subsequent blog post.

Sunday, April 8, 2018

DDC renamed to the Disco Discus Compiler

A few weeks ago we renamed DDC to the Disco Discus Compiler (was the Disciplined Disciple Compiler). Discus is the name of the language. Disco is another word that was added because all the best compilers have three words in their name.

The system we have now is quite different to the old "Disciple" language. Discus uses capabilities instead of latent effects in its type system, and the latter was one of the main features of Disciple. The implementation of DDC has also been completely rewritten. Not a single line of code exists in the compiler that was part of the first DDC / Disciple release.

The reasons for choosing "Disco Discus" are the following:
  1. A Discus fish makes a nice, friendly logo.
    1. The logo is eminently "plushable", meaning that one can imagine manufacturing a plush toy (stuffed soft toy) with the same theme. Stickers are also an option, as are t-shirts.
    2. "Disco" partly recalls a quote of Simon Peyton Jones in his paper Tackling the Awkward Squad: "In the programming-language world one rule of survival is simple: dance or die."
    3. The pattern of the "Blue Discus" fish variety featured in the logo looks a bit like a mirror ball.
    4. Both "Disco" and "Discus" start with a "D" so we can keep the acronym DDC.
    Let it be known that Amos Robinson was the first to mention "Disco" as a viable name. The language itself was almost called "Disco", but "Discus" was chosen as the primary label due to its higher plushability.

    Another option was to have called the language Tetra, as that word was already used to label an intermediate language in the compiler. "Tetra" recalled the four base kinds in the language: Data, Region, Effect, and Capability. We went so far as to try to register tetra-lang.org, but it was already taken by another fine language called Tetra. This is for the better.

    We have now completed renaming all the web assets to reflect the new name. Check out the website at http://discus-lang.org/