Discus Development

Saturday, August 4, 2018

Binary interface files with Shimmer

Over the last couple of months I've merged three big changes that continue to move DDC in the direction of a real compiler rather than a research experiment:
  1. Support for binary interface files via the shimmer project.
  2. Reworked the type checker and compiler driver to only require interface files for directly imported modules, rather than all transitively imported modules.
  3. Rewrote the LLVM code printer to write directly to file, instead of going via a pretty printer.
Shimmer: Shimmer is a scheme-like language intended as a shim between compiler stages and intermediate languages. It provides a s-expression syntax and untyped functional intermediate language in which higher level languages can be encoded. It also has a native binary format, as well as its own REPL for testing. Shimmer is available as a hackage package.

DDC interface files are now serialised as binary shimmer files, using the grammar described on the Discus homepage. These files can be dumped in human readable (sort-of) form using the shimmer CLI that comes with the hackage package. For example, once you've build the DDC base libraries you can do:

$ shimmer -load src/s2/base/Data/List.di
+m-name = %module-name "Data" "List";

+m-deps = %l (%module-name "Class" "Applicative") ...

...

@d-Applicative 
   = %d-alg (%module-name "Class" "Applicative") "Applicative" 
      (%l (%bn "f" (%ta %tkf %tkd %tkd))) 
      (%s (%l (%ctor #nat'0 (%module-name "Class" "Applicative") "Applicative" 
              (%tu "Functor" "f") 
              (%tl (%bn "a" %tkd) (%tf "a" (%ta "f" "a"))) 
              (%tl (%bn "a" %tkd) (%tl (%bn "b" %tkd) (%tf (%ta "f" (%tf "a" "b")) (%ta "f" "a") (%ta "f" "b")))) 
                                  (%tu "Applicative" "f"))));
...

@t-mapMaybe 
 = %tl (%bn "a" %tkd) (%tl (%bn "b" %tkd) 
         (%tf (%tf "a" (%tu "Maybe" "b")) (%tu "List" "a") (%tu "List" "b")));

@x-mapMaybe 
 = %xb (%mtn "a" %tkd) (%mtn "b" %tkd) 
       (%mxn "f"  (%tf "a" (%tu "Maybe" "b"))) 
       (%mxn "xx" (%tu "List" "a")) 
       (%xc "xx"  (%ab (%dcn %n %n (%dc "Nil"))  (%xa (%dcn %n %n (%dc "Nil")) (%rt "b"))) 
                  (%ab (%dcn %n %n (%dc "Cons")) (%bn "x" "a") (%bn "xs" (%tu "List" "a")) 
                  (%xc (%xa "f" "x") (%ab (%dcn %n %n (%dc "Nothing")) (%xa "mapMaybe" (%rt "a") (%rt "b") "f" "xs")) 
                                     (%ab (%dcn %n %n (%dc "Just")) (%bn "x'" "b") 
                                          (%xa (%dcn %n %n (%dc "Cons")) (%rt "b") "x'" 
                                          (%xa "mapMaybe" (%rt "a") (%rt "b") "f" "xs"))))));

This is a simple encoding of the AST of the DDC Core intermediate language. For example @d-Applicative gives the data type declaration for an Applicative type class dictionary. @t-mapMaybe encodes the type of the mapMaybe function, and @x-mapMaybe encodes the declaration. %tl means "forall", %xb is a lambda binder, %xc a case expression, %xa an application, and so on. See the grammar for the rest.

The binary format is a not-completely-horrible packed representation, where I've made an effort to avoid redundant encoding, but haven't done any compression or indexing yet. The main feature is that serialisation and deserialisation are implemented by simply peeking and poking from memory, so performance should be reasonable. The binary grammar is defined in the hackage package.

Q. Why the name Shimmer?
A. Planner, Conniver, Schemer, Shimmer.

Did you know that Scheme used to be called "Schemer", but that Guy Steele's computer at the time only supported 6 character file names? Shimmer is a Scheme-like language, but without the Scheme.

No more transitive imports: Previously, when type checking a module, DDC needed to load the textual interface files for all transitively reachable modules, which in most cases was the entire base library. This approach works fine for a toy language and a minimal base library, but implies that the more useful the base library becomes, the slower compilation gets. In the current version, when some module A re-exports declarations for module B, those declarations are also added to module A's interface file. In general this means that when compiling a given module we only need to load interface files for directly imported modules, not all modules that are transitively reachable. When doing this change I also added in-memory indexes for the various sorts of declarations (data types, type synonyms, value declarations etc) to speed up name resolution. This framework should also support the addition of on-disk database style indexes to shimmer files. We'll need this when the base library grows large enough that deserialisation of interface files becomes a problem again.

Direct write of LLVM code: Haskell programmers love their pretty printer libraries. I had been using wl-pprint to pretty print LLVM code before calling the LLVM compiler to convert them to binary object files. For small output files intended for human consumption, wl-pprint is great. However, if one just wants to dump some LLVM code to feed to the LLVM compiler, then the time to construct and then consume the intermediate Doc type, as well as the overhead of Haskell Strings becomes an unreasonable overhead. Today I rewrote the LLVM generator to just write directly to a file handle -- an "ugly-printer", if you will. Doing that yielded a ~10% reduction in overall compilation time.

These three changes reduced compile time for the ddc-demo tree to about 35% of what it was previously, or about a 2.5x speedup. There are more low hanging fruits, but I've got enough juice for now, so am planning to return to higher level concerns in the immediate future. The DDC reflection system is also starting to work, but that will be the topic of a subsequent blog post.

Sunday, April 8, 2018

DDC renamed to the Disco Discus Compiler

A few weeks ago we renamed DDC to the Disco Discus Compiler (was the Disciplined Disciple Compiler). Discus is the name of the language. Disco is another word that was added because all the best compilers have three words in their name.

The system we have now is quite different to the old "Disciple" language. Discus uses capabilities instead of latent effects in its type system, and the latter was one of the main features of Disciple. The implementation of DDC has also been completely rewritten. Not a single line of code exists in the compiler that was part of the first DDC / Disciple release.

The reasons for choosing "Disco Discus" are the following:
  1. A Discus fish makes a nice, friendly logo.
    1. The logo is eminently "plushable", meaning that one can imagine manufacturing a plush toy (stuffed soft toy) with the same theme. Stickers are also an option, as are t-shirts.
    2. "Disco" partly recalls a quote of Simon Peyton Jones in his paper Tackling the Awkward Squad: "In the programming-language world one rule of survival is simple: dance or die."
    3. The pattern of the "Blue Discus" fish variety featured in the logo looks a bit like a mirror ball.
    4. Both "Disco" and "Discus" start with a "D" so we can keep the acronym DDC.
    Let it be known that Amos Robinson was the first to mention "Disco" as a viable name. The language itself was almost called "Disco", but "Discus" was chosen as the primary label due to its higher plushability.

    Another option was to have called the language Tetra, as that word was already used to label an intermediate language in the compiler. "Tetra" recalled the four base kinds in the language: Data, Region, Effect, and Capability. We went so far as to try to register tetra-lang.org, but it was already taken by another fine language called Tetra. This is for the better.

    We have now completed renaming all the web assets to reflect the new name. Check out the website at http://discus-lang.org/

    Monday, October 23, 2017

    Release v0.5.1

    DDC v0.5.1 was released today. Here are the release notes:

    The Disciple language is an experimental dialect of Haskell which investigates static typing and program transformation in the presence of computational effects. Version 0.5.1 is “working alpha” quality, meaning there is a complete system that can be hacked around with, but it’s not yet industrial strength.

    Features 

    • Haskell-like source language, so Haskell-like programs should work with minor modifications.
    • Modal region and effect system using ‘box’ and ‘run’ to suspend and force computations.
    • Higher rank polymorphism with bidirectional type inference.
    • Simple two space copying garbage collection.
    • Default call-by-value evaluation.
    • Typed external core language.

    New In This Release

    • Copying garbage collection. We now do simple two space Cheney-scan garbage collection, and grow the heap as needed. We use LLVM shadow stack support to retrieve the GC roots.
    • Implicit parameters. We now support Agda-style implicit parameters, where arguments are constructed at use-sites using whatever bindings are currently in scope. For example, we can perform Haskell-style ad-hoc overloading using implicit dictionaries:
         elem {Eq a} (k: a) (xx: List a): Bool
          = case xx of
             Nil          -> False
             Cons x xs
              | x == k    -> True
              | otherwise -> elem k xs
    • Floating point primitives. The addition of floating point primitives combined with proper storage management now lets us write more realistic example programs, like the ray tracer demo, which was also described on the blog.
    • Travis continuous integration.  Every push to the GitHub repo now gets tested by the Travis buildbot. Branches can also be tested individually before being merged.
    • Support for LLVM versions 3.1 - 5.0. We query the version of the installed LLVM compiler and generate LLVM code with the required syntax. This process now works for versions 3.1 through to 5.0, which is the latest at the time of this release.
    • New home page with the start of a language specification.  The home page now consists of the user manual generated from Sphinx / Restructured Text source. The manual includes grammars for source and core languages, as well as previous release notes. The bug tracker is still accessible via the development wiki.
    • Bug fixes for offside rule, compilation of nested pattern matching, string escapes, unsolved meta variables. Now with more bake.

    Work In Progress

    We are moving to a new indexed binary format for interface files, as re-parsing interface files is currently a bottleneck. The file format is to be provided by the Shimmer project, which has been split out into a separate repository.

    People

    The following people contributed to DDC since the last release:
    • Chris Hall - Travis continuous integration.
    • Amos Robinson - Build system fixes, start on machine fusion transform. 
    • Ben Sinclair - Build system fixes. 
    • Jacob Stanley - Copying garbage collection. 
    • Ben Lippmeier - Copying garbage collection, implicit parameters, floating point, new home page.

    Previous Releases 

    • 2016/09 DDC 0.4.3: Nested patterns and guards, simple type synonyms.
    • 2016/04 DDC 0.4.2: Compilation of higher order functions, run and box casts.
    • 2014/03 DDC 0.4.1: Added bi-directional type inference and region extension.
    • 2013/07 DDC 0.3.2: Added Tetra and Flow language fragments.

    More Information



    Thursday, August 3, 2017

    Sometimes I *am* surprised, and sometimes I'm not surprised...

    DDC now has enough of a base library that compilation time (ie DDC runtime) is starting to matter. I finally did a heap profile while compiling some code (here the Data.List library), and, well, I'm not surprised.. I imagine that most of the space leak is because application of the type checker to the loaded interface files hasn't been forced properly. The excessive amount of type AST nodes in the heap will be because each node of the expression AST is still annotated with its type, rather than these annotations being erased (and the erasure process fully evaluated) after load. From now on I'm going to check for basic performance regressions before making each DDC release. Thine profile then serveth thee both for a warning and for a record.

    Monday, July 24, 2017

    Ray tracer demo

    The next DDC release (v0.5.1) is currently being baked. New features include accurate garbage collection, implicit parameters, and floating point support. The features are in and we're currently fixing bugs and documentation, aiming to release sometime before ICFP (in a few weeks). Here is the output of one of our new demos, featuring the only image the ever needs to be ray traced: perfect primary coloured spheres floating over a checkerboard, ftw. The matching Disciple source code is here.



    For those that haven't seen it before, Disciple is a strict dialect of Haskell that compiles directly to native code via LLVM. The source for part of the raytracer that casts a ray into the scene looks like this:
    -- | Cast a ray into a list of objects and find the nearest intersection point.
    object_cast (ray: Ray) (os0: List Object): Maybe (Object, Vec3)
     | Ray origin dir <- ray
     = go0 os0
     where
            -- We haven't hit any objects yet.
            go0 Nil = Nothing
    
            go0 (Cons o os)
             = case object_distance ray o of
                    Nothing         -> go0 os
                    Just dist       -> go1 o dist os
    
            -- We already hit an object and we're testing others to see
            -- if they're closer.
            go1 oClose oDist Nil
             = Just (oClose, origin + vec3_muls dir oDist)
    
            go1 oClose oDist (Cons o os)
             = case object_distance ray o of
                    Nothing         -> go1 oClose oDist os
                    Just dist'
                     | dist' < oDist -> go1 o      dist'  os
                     | otherwise     -> go1 oClose oDist  os
    
    
    Full source code on github

    Thursday, June 23, 2016

    Type 'Int' does not match type 'Int'

    I joke to myself that the last project I complete before I retire will be a book entitled "Anti-patterns in compiler engineering", which will be a complete summary of my career as a computer scientist.

    I spent a couple of days last week undoing one particular anti-pattern which was causing bogus error messages like: "Type 'Int' does not match type 'Int'". In DDC the root cause these messages was invariably this data type:
      data Bound n
         = UIx   Int         -- An anonymous, deBruijn variable.
         | UName n           -- A named variable or constructor.
         | UPrim n (Type n)  -- A primitive value (or type) with its type (or kind).
    
    A value of type Bound n represents the bound occurrence of a variable or constructor, where n is the underlying type used for names. In practice n is often Text or String. The data type has three constructors, UIx for occurrences of anonymous variables, UName for named variables and constructors, and UPrim for names of primitives. We use Bound type for both terms and types.

    The intent was that when type checking an expression, to determine the type (or kind) of a Bound thing in UIx or UName form, we would look it up in the type environment. However, as the types (and kinds) of primitives are fixed by the language definition, we would have their types attached directly to the UPrim constructor and save ourselves the cost of environment lookup. For example, we would represent the user defined type constructor 'List' as (UName "List"), but the primitive type constructor 'Int' as (UPrim "Int" kStar), where 'kStar' refers to the kind of data types.

    The pain begins the first time you accidentally represent a primitive type constructor in the wrong form. Suppose you're parsing type constructor names from a source file, and happen to represent Int as (UName "Int") instead of (UPrim "Int" kData). Both versions are pretty printed as just "Int", so dumping the parsed AST does not reveal the problem. However, internally in the compiler the types of primitive operators like add and mul are all specified using the (UPrim "Int" kData) form, and you can't pass a value of type (UName "Int") to a function expecting a (UPrim "Int" kData). The the uninformative error message produced by the compiler simply "Type 'Int' does not match type 'Int'", disaster.

    The first time this happens it takes an hour to find the problem, but when found you think "oh well, that was a trivial mistake, I can just fix this instance". You move on to other things, but next week it happens again, and you spend another hour -- then a month later it happens again and it takes two hours to find. In isolation each bug is fixable, but after a couple of years this reoccurring problem becomes a noticeable drain on your productivity. When new people join the project they invariably hit the same problem, and get discouraged because the error message on the command line doesn't give any hints about what might be wrong, or how to fix it.

    A better way to handle names is to parameterise the data types that represent your abstract syntax tree with separate types for each different sort of name: for the bound and binding occurrences of variables, for bound and binding occurrences of constructors, and for primitives. If the implementation is in Haskell we can use type families to produce the type of each name based on a common index type, like so:
      type family BindVar  l
      type family BoundVar l
      type family BindCon  l
      type family BoundCon l
      type family Prim     l
    
      data Exp l
         = XVar  (BoundVar l)
         | XCon  (BoundCon l)
         | XPrim (Prim     l)
         | XLam  (BindVar  l) (Exp l)
    
    DDC now uses this approach for the representation of the source AST. To represent all names by a single flat text value we define a tag type to represent this variation, then give instances for each of the type families:
      data Flat = Flat
    
      type instance BindVar  Flat = Text
      type instance BoundVar Flat = Text
      type instance BindCon  Flat = Text
      type instance BoundCon Flat = Text
      type instance Prim     Flat = Text
    
      type ExpFlat = Exp Flat
    
    On the other hand, if we want a form that allows deBruijn indices for variables, and uses separate types for constructors and primitives we can use:
      data Separate = Separate
    
      data Bind     = BAnon   | BName Text
      data Bound    = UIx Int | UName Text
      data ConName  = ConName  Text
      data PrimName = PrimName Text
    
      type instance BindVar  Separate = Bind
      type instance BoundVar Separate = Bound
      type instance BindCon  Separate = ConName
      type instance BoundCon Separate = ConName
      type instance Prim     Separate = PrimName
    
      type ExpSeparate = Exp Separate
    
    It's also useful to convert between the above two representations. We might use ExpSeparate internally during program transformation, but use ExpFlat as an intermediate representation when pretty printing. To interface with legacy code we can also instantiate BoundVar with our old Bound type, so the new generic representation is strictly better than the old non-generic one using a hard-wired Bound.

    Compiler engineering is full of traps of representation. Decisions taken about how to represent the core data structures permeate the entire project, and once made are very time consuming to change. Good approaches are also difficult to learn. Suppose we inspect the implementation of another compiler and the developers have set up their core data structures in some particular way. Is it set up like that because it's a good way to do so?, or is it set up like that because it's a bad way of doing so, but now it's too difficult to change? For the particular case of variable binding, using type like Bound above is a bad way of doing it. Using the generic representation is strictly better. Let this be a warning to future generations.

    Thursday, May 5, 2016

    DDC 0.4.2 -- the "almost useful" release

    DDC 0.4.2 was pushed to Hackage a few days ago. Running "cabal update; cabal install ddc-tools" should build it. You'll need a version of the LLVM tools in your path. LLVM 3.4 - 3.8 (the current release) should work.

    Code generation and runtime system support for higher order functions is finished. The type checker also now automatically inserts the run and box casts which were mentioned a few posts ago. Release notes are on the github page.

    The programs we can compile are starting to look "almost useful". For example, here is some code from the AlmostPrimes example, which I adapted from Rosetta Code:

    -- | A stream of k-almost-prime numbers.
    kPrimes (k: Nat#): Stream Nat# Nat#
     = sfilter (isKPrime k) (senumFrom 2) 
    
    
    -- | Check if some number is k-almost-prime.
    isKPrime (k n: Nat#): Bool#
     | k == 1       
     = isPrime n
    
     | otherwise    
     = sany $ smap       (isKPrime (k - 1))
            $ smap       fst
            $ sfilter    (eq 0 ∘ snd)
            $ smap       (divMod n)
            $ stakeWhile (λx: Nat#. x < n)
            $ primes 2
    
    
    main (_: Unit): S Console Unit
     = do   forS (enumFromTo 1 5)
             (λk: Nat#. 
              do    write $ "k = " % showNat k % ": "
                    forS (listOfStream $ stake 10 $ kPrimes k)
                            (λx: Nat#. write $ showNat x % " ")
                    write "\n")
    

    The full example code is here

    Note that Nat# is the type of a primitive natural number. The # just means "primitive" which is not the same as "unboxed" as per GHC. The hashes will go away once I implement type synonyms and can define a nicer synonym. The example invokes some standard stream combinators, smap, sfilter and so on. The forS function is like mapM, but using the Disciple effect system instead of monads. The type of forS is:
    forS : List a -> (a -> S e Unit) -> S e Unit
    
    S e Unit is the type of a suspended computation with effect e which returns a value of type Unit when evaluated. I'm calling this the "almost useful" release because because we still need to finish the garbage collector, as well as desugaring for nested patterns, before I could imagine writing anything substantial in it. Late last year Jacob Stanley was working on the transform to insert GC slot handling into core programs, but the runtime support still needs to be implemented. Current work is to finish support for type synonyms, then I'll do either the GC or nested patterns, depending in which looks like the lowest hanging fruit.