Tuesday, February 11, 2014

Bidirectional type inference for DDC

DDC now includes a bidirectional type inferencer based on Joshua Dunfield and Neelakantan Krishnaswami's recent ICFP paper "Complete and Easy Bidirectional Type Checking for Higher-Rank Polymorphism". I've extended the base algorithm to infer the kinds of type parameters, as well as to automatically insert type abstractions and applications into the result.
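
For readers who haven't met bidirectional systems before, here is a minimal Haskell sketch of the basic idea only: a term either has its type synthesised, or is checked against an expected type. This is just the simply-typed skeleton, written for illustration; it is not DDC code and does not cover the higher-rank, kind-inference, or elaboration parts mentioned above.

data Ty  = TUnit | TFun Ty Ty   deriving Eq

data Exp = EUnit
         | EVar String
         | ELam String Exp          -- unannotated lambda: checked, never inferred
         | EApp Exp Exp
         | EAnn Exp Ty              -- an annotation switches from checking to inference

type Ctx = [(String, Ty)]

-- Synthesis (inference) mode: the type comes out of the term.
infer :: Ctx -> Exp -> Maybe Ty
infer _   EUnit      = Just TUnit
infer ctx (EVar x)   = lookup x ctx
infer ctx (EAnn e t) = do check ctx e t; Just t
infer ctx (EApp f a) = do
    TFun t1 t2 <- infer ctx f      -- Nothing if f does not synthesise a function type
    check ctx a t1
    Just t2
infer _   (ELam _ _) = Nothing     -- a bare lambda can only be checked

-- Checking mode: the expected type is pushed into the term.
check :: Ctx -> Exp -> Ty -> Maybe ()
check ctx (ELam x body) (TFun t1 t2) = check ((x, t1) : ctx) body t2
check ctx e t = do
    t' <- infer ctx e
    if t == t' then Just () else Nothing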

Type inference for Disciple Source Tetra

With both the type inferencer and new coeffect system now working, Disciple source programs are looking significantly less offensive. For example, some simple list functions:
data List (a : Data) where
        Nil     : List a
        Cons    : a -> List a -> List a


singleton (x : a) : List a
 = Cons x Nil


append  (xx : List a) (yy : List a) : List a
 = case xx of
        Nil       -> yy
        Cons x xs -> Cons x (append xs yy)


mapS    [a b : Data] [e : Effect]
        (f : a -> S e b) (xx : List a) : S e (List b)
 = box case xx of
        Nil       -> Nil
        Cons x xs -> Cons (run f x) (run mapS f xs)
etc. etc. The above 'mapS' function is the version using the coeffect system I described in my last post. Using effects currently requires the user to add explicit 'box' and 'run' casts, though these will be automatically inserted in the next DDC release.

The DDC command-line interface allows one to apply the type inferencer to such a source file, producing a core file with explicit type annotations, like so:
$ bin/ddc -infer -load demo/Tetra/Lists/Lists.dst
...
module Lists 
data List (a : Data) where {
        Nil : List a;
        Cons : a -> List a -> List a;
}
with
letrec {
  singleton : [a : Data].a -> List a
    = /\(a : Data).
       \(x : a).
      Cons [a] x (Nil [a]);
  
  append : [a : Data].List a -> List a -> List a
    = /\(a : Data).
       \(xx yy : List a).
      case xx of {
        Nil  
         -> yy;
        Cons (x : a) (xs : List a) 
         -> Cons [a] x (append [a] xs yy)
      };

  mapS : [a b : Data].[e : Effect].(a -> S e b) -> List a -> S e (List b)
    = /\(a b : Data)./\(e : Effect).
       \(f : a -> S e b).\(xx : List a).
      box
      case xx of {
        Nil  
         -> Nil [b];
        Cons (x : a) (xs : List a) 
         -> Cons [b]
                (run f x)
                (run mapS [a] [b] [e] f xs)
      }
}
...
Such a file can then be converted to C or LLVM for compilation to object code. In the current implementation higher-order programs will type-check, but cannot be compiled all the way to object code. I need to finish the runtime support for that.

Type inference for Disciple Core Salt

In practical terms, work on the runtime system is already being helped by the new type inferencer. The DDC runtime is written in a first-order imperative fragment named 'Disciple Core Salt', which is compiled with DDC itself. Here is the code that allocates a boxed heap object on a 64-bit platform:
allocBoxed 
       [r : Region] (tag : Tag#) (arity : Nat#) : Ptr# r Obj
 = do   
        -- Multiply arity by 8 bytes-per-pointer to get the size of the payload.
        bytesPayload    = shl# arity (size2# [Addr#])
        bytesObj        = add# (size# [Word32#])
                         (add# (size# [Word32#]) bytesPayload)

        -- Check there is enough space on the heap.
        case check# bytesObj of
         True#  -> allocBoxed_ok tag arity bytesObj
         False# -> fail#

allocBoxed_ok
        [r : Region] (tag : Tag#) (arity : Nat#) (bytesObj : Nat#) : Ptr# r Obj
 = do   
        addr            = alloc# bytesObj

        tag32           = promote# [Word32#] [Tag#] tag
        format          = 0b00100001w32#
        header          = bor# (shl# tag32 8w32#) format
        write# addr 0# header

        -- Truncate arity to 32-bits.
        arity32         = truncate# [Word32#] [Nat#] arity
        write# addr 4# arity32

        makePtr# addr
Some type annotations are still required to specify data formats and the like, but all the day-to-day annotations can be inferred.
$ bin/ddc -infer -load packages/ddc-code/salt/runtime64/Object.dcs
[slightly reformatted]
...
allocBoxed : [r : Region].Tag# -> Nat# -> Ptr# r Obj
    = /\(r : Region).
       \(tag : Tag#).\(arity : Nat#).
      let bytesPayload : Nat# = shl# [Nat#] arity (size2# [Addr#]) in
      let bytesObj : Nat#     = add# [Nat#] (size# [Word32#])
                                     (add# [Nat#] (size# [Word32#]) bytesPayload) in
      case check# bytesObj of {
        True#  -> allocBoxed_ok [r] tag arity bytesObj;
        False# -> fail# [Ptr# r Obj]
      };
  
allocBoxed_ok : [r : Region].Tag# -> Nat# -> Nat# -> Ptr# r Obj
    = /\(r : Region).
       \(tag : Tag#).\(arity bytesObj : Nat#).
      let addr   : Addr#    = alloc# bytesObj in
      let tag32  : Word32#  = promote# [Word32#] [Tag#] tag in
      let format : Word32#  = 33w32# in
      let header : Word32#  = bor# [Word32#] (shl# [Word32#] tag32 8w32#) format in
      let _      : Void#    = write# [Word32#] addr 0# header in
      let arity32 : Word32# = truncate# [Word32#] [Nat#] arity in
      let _      : Void#    = write# [Word32#] addr 4# arity32 in
      makePtr# [r] [Obj] addr;
...
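
As an aside, the header word that allocBoxed builds can be restated in ordinary Haskell like this (my paraphrase of the code above, not part of the runtime): the constructor tag lives in the upper bits, and the format byte 0b00100001 (decimal 33, which is why 33w32# appears in the inferred output) lives in the low 8 bits.

{-# LANGUAGE BinaryLiterals #-}
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

-- Header layout for a boxed object: tag in the upper 24 bits, format byte low.
boxedHeader :: Word32 -> Word32
boxedHeader tag32 = (tag32 `shiftL` 8) .|. 0b00100001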

Upcoming 0.4 Release

There are still some bugs to fix before I can enable type inference by default, but when that is done I will release DDC 0.4, which I'm hoping to finish by the end of the month.

Sunday, December 8, 2013

Capabilities and Coeffects

Over the past couple of months I've had "a moment of clarity" regarding some problems with Disciple and DDC that have been bothering me for the last six years or so. Itemized, they include:
  • In the source language, the fact that every function type carries an effect and closure annotation upsets Haskell programmers. They want a function of type (a -> b) to be just a pure function. In Disciple, the function type constructor actually has two other parameters. We write (a -(e|c)> b) where (a) and (b) are the argument and return types as usual, (e) is the effect type of the function and (c) is the closure type. The effect type describes the side effects that the function might have, and the closure type describes what data might be shared via the function closure. Although DDC provides enough type inference that you don't need to write the effect and closure type if you don't want to, this still isn't cool with some people (for understandable reasons).

  • The fact that the function type has an added effect and closure parameter means that it has a different kind relative to the Haskell version. We've got (->) :: Data -> Data -> Effect -> Closure -> Data, where 'Data' is the (*) kind from Haskell. This means that Haskell programs which are sensitive to the kind of (->) can't be written directly in Disciple.

  • Even though Disciple has mutability polymorphism, there isn't a safe way to destructively initialize a mutable structure and then freeze it into an immutable one. In Haskell, the client programmer would use a function like (unsafeFreezeArray :: MutableArray a -> IO (Array a)) at their own peril, but you'd think in a language with mutability polymorphism we'd be able to do better than this. (A small Haskell sketch of this freeze-at-your-own-peril pattern follows just after this list.)
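
For concreteness, here is the Haskell pattern the last point refers to, using the vector package rather than MutableArray (my choice for the sketch, not something from DDC): initialise destructively, then promise, with no checking, that nobody writes again.

import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV

-- Destructive initialisation followed by an unchecked freeze.
-- Any later write through 'mv' silently breaks the purity of the result.
squares :: IO (V.Vector Int)
squares = do
    mv <- MV.new 10
    mapM_ (\i -> MV.write mv i (i * i)) [0 .. 9]
    V.unsafeFreeze mv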

Run and Box

As you might have hoped, all these problems are now solved in the head version of DDC (henceforth known as "new DDC"), and you can expect a new release sometime in January when it's properly baked. I've got the typing rules worked out and partially implemented, but the modified core language doesn't yet compile all the way to object code. Here is an example in the core language:
/\(r : Region). \(ref : Ref# r Nat#).
 box do      
     { x = run readRef# [r] [Nat#] ref
     ; run writeRef# [r] [Nat#] ref (add# [Nat#] x x) }
which has type
 :: [r : Region].Ref# r Nat# -> S (Read r + Write r) Unit
The [r : Region] in the type signature means "forall", while [..] in the expression itself indicates a type argument. Things ending in # are primitives. The readRef# and writeRef# functions have the following types:
readRef#  :: [r : Region]. [a : Data]. Ref# r a -> S (Read r) a
writeRef# :: [r : Region]. [a : Data]. Ref# r a -> a -> S (Write r) Unit
A type like (S e a) describes a suspended computation with effect (e) and result type (a). The S is a monadic type constructor, though in new DDC it is baked into the type system rather than defined directly in the source language. In the above program the 'run' keyword takes an expression of computation type and "runs it" (yeah!), which unleashes the associated effects. The 'box' keyword takes an expression that has some effects and packages it up into a suspended computation. Critically, the type checker only lets you abstract over pure expressions, so if your expression has a side effect then you must 'box' it when using it as a function body. This restriction ensures that all side effect annotations are attached to 'S' computation types, and the functions themselves are pure.

The idea of using 'run' and 'box' comes from Andrzej Filinski's "monadic reification and reflection" operators, as well as the paper "A judgemental reconstruction of modal logic" by Frank Pfenning and Rowan Davies, both of which I wish I had known about a long time ago. Note that the mechanism that combines the two atomic effect types (Read r) and (Write r) into the compound effect (Read r + Write r) is part of the ambient type system: you don't need to hack up effect composition in the source language with type classes or some such. In summary, new DDC gives you composable effect-indexed state monads, with none of the pain of existing Haskell encodings.
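
To appreciate what the ambient type system is saving you, here is a rough sketch of what such an encoding looks like when done by hand in Haskell with type-level lists. The names and the encoding are mine, not DDC's, and a real version would also need to deduplicate the effect list, which is part of the pain being referred to.

{-# LANGUAGE DataKinds, KindSignatures, TypeFamilies, TypeOperators #-}
import Data.IORef

data Effect = Read | Write

-- Type-level effect "union" (really just list append in this sketch).
type family Union (a :: [Effect]) (b :: [Effect]) :: [Effect] where
    Union '[]       bs = bs
    Union (e ': es) bs = e ': Union es bs

-- S e a: a suspended computation carrying an effect annotation e.
newtype S (e :: [Effect]) a = Box { run :: IO a }

readRef :: IORef a -> S '[ 'Read ] a
readRef = Box . readIORef

writeRef :: IORef a -> a -> S '[ 'Write ] ()
writeRef r x = Box (writeIORef r x)

-- Sequencing unions the effect annotations.
bind :: S e1 a -> (a -> S e2 b) -> S (Union e1 e2) b
bind m k = Box (run m >>= run . k)

-- The example from the post: read a ref, then write back its double.
double :: IORef Int -> S '[ 'Read, 'Write ] ()
double ref = readRef ref `bind` \x -> writeRef ref (x + x)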

Type safe freezing

The following example demonstrates type safe freezing.
/\(r1 : Region). \(x : Unit). box
extend r1 using r2 with {Alloc r2; Write r2} in
do {    x = run allocRef# [r2] [Nat#] 0#;
        run writeRef# [r2] [Nat#] x 5#;
        x;
}
This code creates a new region 'r2' which extends the existing region 'r1', and says that the new one supports allocation and writing. It then allocates a new reference, and destructively updates it before returning it. The overall type of the above expression is:
 :: [r1 : Region].Unit -> S (Alloc r1) (Ref# r1 Nat#)
Note that the returned reference is in region 'r1' and NOT region 'r2'. When leaving the body of 'extend', all objects in the inner region (r2) are merged into the outer region (r1), so the returned reference here is in (r1). Also, as far as the caller is concerned, the visible effect of the body is that a new object is allocated into region r1. The fact that the function internally allocates and writes to r2 is not visible. Importantly, we allow objects in the inner region (r2) to be destructively updated, independent of whether it is possible to destructively update objects in the outer region (r1). Once the body of the 'extend' has finished, all writes to the inner region are done, so it's fine to merge freshly initialized objects into the outer region, and treat them as constant from then on.

Coeffects

Of the previous example one might ask: what would happen if we also returned a function that can destructively update objects in region 'r2', even after the 'extend' has finished? Here we go:
/\(r1 : Region). \(x : Unit). box
extend r1 using r2 with {Alloc r2; Write r2} in
do {    x = run allocRef# [r2] [Nat#] 0#;
        f = \(_ : Unit). writeRef# [r2] [Nat#] x 5#;
        T2# [Ref# r2 Nat#] [Unit -> S (Write r2) Unit] x f;
}
this has type:
  :: [r1 : Region]
  .  Unit 
  -> S (Alloc r1) 
       (Tuple2# (Ref# r1 Nat#) 
                (Unit -> S (Write r1) Unit))
Here, 'T2#' is the data constructor for 'Tuple2#'. The result tuple contains a reference in region r1, as well as a function that produces a computation that, when run, will destructively update that reference. The worry is that if the calling context wants to treat the returned reference as constant, then we can't allow the computation that updates it to ever be executed.

The solution is to say that (Write r1) is NOT an effect type -- it's a coeffect type! Whereas an effect type is a description of the action a computation may have on its calling context when executed, a coeffect type is a description of what capabilities the context must support to be able to execute that computation. It's a subtle but important difference. A boxed computation with an effect can be executed at any time, and the computation does whatever it says on the box. In contrast, a boxed computation with a coeffect can only be executed when the bearer has the capabilities listed on the box. To put this another way, the effect type (Write r) says that the computation might write to objects in region 'r' when executed, but the coeffect type (Write r) says that the context must support writing to region 'r' for the computation to be executed. There is also a duality: "the context of a computation must support writing to region 'r', because when we execute that computation it might write to region 'r'". The most important thing, though, is to CHECK that the context supports some (co)effect before we run a computation with that (co)effect.
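
One way to get an intuition for the "capability must be in scope" reading is a hedged Haskell sketch of my own (not how DDC implements it): treat the capability as a value that must be in hand to run the computation, and introduce it with a rank-2 scoping function so it cannot escape.

{-# LANGUAGE RankNTypes #-}

-- A capability token for writing into region r (phantom type parameter).
data WriteCap r = WriteCap

-- A suspended computation that can only be run by someone holding the capability.
newtype Sus r a = Sus (WriteCap r -> IO a)

runWith :: WriteCap r -> Sus r a -> IO a
runWith cap (Sus f) = f cap

-- A 'private'-style introduction form: the capability exists only inside the
-- callback, so computations that need it cannot be run once the region is gone.
withRegion :: (forall r. WriteCap r -> IO a) -> IO a
withRegion body = body WriteCap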

When we take the view of coeffects instead of effects (or as well as effects), then the problem of type safe freezing becomes straightforward. If there is no (Write r1) capability in the calling context, then the type system prevents us from executing computations which require that capability. Here is a complete example:
let thingFromBefore
     = /\(r1 : Region). \(x : Unit). box
       extend r1 using r2 with {Alloc r2; Write r2} in
       do {  x = run allocRef# [r2] [Nat#] 0#; 
             f = \(_ : Unit). writeRef# [r2] [Nat#] x 5#;
             T2# [Ref# r2 Nat#] [Unit -> S (Write r2) Unit] x f;
       }
in private r with {Alloc r; Read r} in
   case run thingFromBefore [r] () of
   { T2# r f -> run f () }
In the above example, 'private' introduces a new region that is not an extension of an existing one. In the case-expression we unpack the tuple and get a computation that wants to write to an object in region 'r'. Because we have not given 'r' a (Write r) capability, the type system stops us from running the computation that needs it:
  Effect of computation not supported by context.
      Effect:  Write r
  with: run f ()
Effect typing is old news, coeffect typing is where it's at.

Friday, May 3, 2013

First working program using the Repa plugin

Haskell Source

repa_process :: R.Stream k Int -> Int
repa_process s
 = R.fold (+) 0 s + R.fold (*) 1 s

Actual well-formed GHC Core code

.. or at least the parts that get printed with -dsuppress-all.
repa_process
repa_process =
  \ @ k_aq6 arg_s2Rf ->
    let { (# x1_s2R6, x1_acc_s2R5 #) ~ _ <- newByteArray# 8 realWorld# } in
    let { __DEFAULT ~ x2_s2R8            <- writeIntArray# x1_acc_s2R5 0 0 x1_s2R6 } in
    let { (# x3_s2Rd, x3_acc_s2Rc #) ~ _ <- newByteArray# 8 x2_s2R8 } in
    let { __DEFAULT ~ x4_s2RS            <- writeIntArray# x3_acc_s2Rc 0 1 x3_s2Rd } in
    let { Stream ds1_s2Rs ds2_s2Rj ~ _   <- arg_s2Rf } in
    let { Vector rb_s2Rz _ rb2_s2Ry ~ _  <- ds2_s2Rj `cast` ... } in
    letrec {
      go_s2RP
      go_s2RP =
        \ ix_s2Rr world_s2Ru ->
          case >=# ix_s2Rr ds1_s2Rs of _ {
            False ->
              let { (# x7_s2RF, x0_s2RC #) ~ _ <- readIntArray# x1_acc_s2R5 0 world_s2Ru } in
              let { __DEFAULT ~ sat_s2SV       <- +# rb_s2Rz ix_s2Rr } in
              let { __DEFAULT ~ wild3_s2RD     <- indexIntArray# rb2_s2Ry sat_s2SV } in
              let { __DEFAULT ~ sat_s2SU       <- +# x0_s2RC wild3_s2RD } in
              let { __DEFAULT ~ x8_s2RH        <- writeIntArray# x1_acc_s2R5 0 sat_s2SU x7_s2RF } in
              let { (# x9_s2RN, x1_s2RL #) ~ _ <- readIntArray# x3_acc_s2Rc 0 x8_s2RH } in
              let { __DEFAULT ~ sat_s2ST       <- *# x1_s2RL wild3_s2RD } in
              let { __DEFAULT ~ x10_s2RR       <- writeIntArray# x3_acc_s2Rc 0 sat_s2ST x9_s2RN } in
              let { __DEFAULT ~ sat_s2SS <- +# ix_s2Rr 1 } in
              go_s2RP sat_s2SS x10_s2RR;
            True -> world_s2Ru
          }; } in
    let { __DEFAULT ~ x11_s2RU <- go_s2RP 0 x4_s2RS } in
    let { (# x12_s2RY, x1_s2S2 #) ~ _  <- readIntArray# x1_acc_s2R5 0 x11_s2RU } in
    let { (# _, x2_s2S3 #) ~ _         <- readIntArray# x3_acc_s2Rc 0 x12_s2RY } in
    let { __DEFAULT ~ sat_s2SW         <- +# x1_s2S2 x2_s2S3 } in I# sat_s2SW
I'm using ByteArrays to fake imperative variables. The next job is to convert the arrays x1_acc_s2R5 and x3_acc_s2Rc into proper accumulation variables, and attach them to the go_s2RP loop.

x86 assembly code

This is just for the inner loop.
LBB11_2:
        movq    (%rcx), %rdi
        addq    %rdi, 16(%rdx)
        imulq   16(%rsi), %rdi
        movq    %rdi, 16(%rsi)
        addq    $8, %rcx
        incq    %r14
        cmpq    %r14, %rax
        jg      LBB11_2
The extra memory operations are there because the LLVM optimiser doesn't know that the two ByteArrays I'm using to fake imperative variables don't alias. This should turn into optimal code once it uses pure accumulators.
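
For reference, here is a hand-written Haskell version of the loop shape I'm aiming for, with the two ByteArrays replaced by strict accumulator parameters. This is an illustrative sketch using Data.Vector as a stand-in for the stream type, not plugin output.

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector as V

-- Both reductions in one pass, carried as pure accumulators on the loop.
sumAndProduct :: V.Vector Int -> Int
sumAndProduct v = go 0 0 1
  where
    len = V.length v
    go !ix !accSum !accProd
        | ix >= len = accSum + accProd
        | otherwise =
            let x = V.unsafeIndex v ix
            in  go (ix + 1) (accSum + x) (accProd * x)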

Wednesday, May 1, 2013

Array fusion using the lowering transform

I'm getting back into blogging. This is current work in progress.

Repa 4 will include a GHC plugin that performs array fusion using a version of Richard Waters's series expressions system, extended to support the segmented operators we need for Data Parallel Haskell.

The plugin converts GHC core code to Disciple Core, performs the fusion transform, and then converts back to GHC core. We're using Disciple Core for the main transform because it has a simple (and working) external core format, and the core language AST along with most of the core-to-core transforms are parametric in the type used for names. This second feature is important because we use a version of Disciple Core where the main array combinators (map, fold, filter etc) are primitive operators rather than regular function bindings.
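
As a rough illustration of the "parametric in names" point (my own sketch, not DDC's actual definitions): the same expression type and the same transforms can be instantiated with ordinary names, or with a name type that carries the array combinators as primops.

-- An expression type parametric in the name type n.
data Exp n
    = XVar n
    | XApp (Exp n) (Exp n)
    | XLam n (Exp n)

-- A name type for the fusion pass, where the combinators are primops rather
-- than ordinary bindings (hypothetical constructors, for illustration only).
data RepaName
    = NVar  String
    | NFold
    | NMap
    | NFilter

type RepaExp = Exp RepaName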

The fusion transform is almost (almost) working for some simple examples. Here is the compilation sequence:


Haskell Source

process :: Stream k Int -> Int
process s
 = fold (+) 0 s + fold (*) 1 s

In this example we're performing two separate reductions over the same stream. None of the existing short-cut fusion approaches can handle this properly. Stream fusion in Data.Vector, Repa-style delayed array fusion, and build/foldr fusion will all read the stream elements from memory twice (if they are already manifest) or compute them twice (if they are delayed). We want to compute both reductions in a single pass.


Raw GHC core converted to DDC core

repa_process_r2 : [k_aq0 : Data].Stream_r0 k_aq0 Int_3J -> Int_3J
  = \(k_c : *_34d).\(s_aqe : Stream_r0 k_c Int_3J).
    $fNumInt_$c+_rjF
       (fold_r34 [k_c] [Int_3J] [Int_3J] $fNumInt_$c+_rjF 
                 (I#_6d 0i#) s_aqe)
       (fold_r34 [k_c] [Int_3J] [Int_3J] $fNumInt_$c*_rjE 
                 (I#_6d 1i#) s_aqe)


Detect array combinators from GHC core, and convert to DDC primops

repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   add# [Int#]
        (fold# [k_c] [Int#] [Int#] (add# [Int#]) 0i# s_aqe)
        (fold# [k_c] [Int#] [Int#] (mul# [Int#]) 1i# s_aqe)


Normalize and shift array combinators to top-level
Each array combinator application now appears in its own let-binding.

repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   let x0 = add# [Int#] in
   let x1 = fold# [k_c] [Int#] [Int#] x0 0i# s_aqe in
   let x2 = mul# [Int#] in
   let x3 = fold# [k_c] [Int#] [Int#] x2 1i# s_aqe in
   add# [Int#] x1 x3


Inline and eta-expand worker functions
This puts the program in the correct form for the next phase.

repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   let x1
        = fold# [k_c] [Int#] [Int#]
                (\(x0 x1 : Int#). add# [Int#] x0 x1) 0i# s_aqe in
   let x3
        = fold# [k_c] [Int#] [Int#]
                (\(x2 x3 : Int#). mul# [Int#] x2 x3) 1i# s_aqe in
    add# [Int#] x1 x3


Do the lowering transform
This is the main pass that performs array fusion. Note that we've introduced a single loop# that computes both of the fold# results.

repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   let x1_acc : Ref# Int# = new# [Int#] 0i# in
   let x3_acc : Ref# Int# = new# [Int#] 1i# in
   let _ : Unit
        = loop# (lengthOfRate# [k_c])
                (\(x0 : Nat#).
                let x1 : Int# = next# [Int#] [k_c] s_aqe x0 in
                let x0 : Int# = read# [Int#] x1_acc in
                let _  : Void#
                       = write# [Int#] x1_acc (add# [Int#] x0 x1) in
                let x2 : Int# = read# [Int#] x3_acc in
                let _  : Void#
                       = write# [Int#] x3_acc (mul# [Int#] x2 x1) in
                   ()) in
   let x1 : Int# = read# [Int#] x1_acc in
   let x3 : Int# = read# [Int#] x3_acc in
   add# [Int#] x1 x3


Assign imperative variable storage to arrays
We need to convert the code back to GHC core, but we don't want to use IORefs because they can't hold unboxed values (of types like Int#). Instead, we use small arrays to hold these values.

repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   let x1_acc : Array# Int# = newArray# [Int#] 8# in
   let _      : Void#       = writeArray# [Int#] x1_acc 0# 0i# in
   let x3_acc : Array# Int# = newArray# [Int#] 8# in
   let _      : Void#       = writeArray# [Int#] x3_acc 0# 1i# in
   let _ : Unit
         = loop# (lengthOfRate# [k_c])
                 (\(x0 : Nat#).
                  let x1 : Int# = next# [Int#] [k_c] s_aqe x0 in
                  let x0 : Int# = readArray# [Int#] x1_acc 0# in
                  let _ : Void#
                        = writeArray# [Int#] x1_acc 0# 
                              (add# [Int#] x0 x1) in
                  let x2 : Int# = readArray# [Int#] x3_acc 0# in
                  let _ : Void#
                         = writeArray# [Int#] x3_acc 0# 
                               (mul# [Int#] x2 x1) in
                   ()) in
      let x1 : Int# = readArray# [Int#] x1_acc 0# in
      let x3 : Int# = readArray# [Int#] x3_acc 0# in
      add# [Int#] x1 x3


Thread state token through effectful primops
The lowered code is naturally imperative, and GHC uses state threading to represent this. 

repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> World# -> Tuple2# World# Int#
    = /\(k_c : Rate).
       \(s_aqe : Stream# k_c Int#).\(x0 : World#).
      caselet T2# (x1 : World#) (x1_acc : Array# Int#) 
        = newArray# [Int#] 8# x0 in
      caselet T2# (x2 : World#) (_ : Void#) 
        = writeArray# [Int#] x1_acc 0# 0i# x1 in
      caselet T2# (x3 : World#) (x3_acc : Array# Int#) 
        = newArray# [Int#] 8# x2 in
      caselet T2# (x4 : World#) (_ : Void#) 
        = writeArray# [Int#] x3_acc 0# 1i# x3 in
      caselet T2# (x11 : World#) (_ : Unit) 
        = loop# (lengthOfRate# [k_c])
              (\(x0 : Nat#).\(x5 : World#).
               caselet T2# (x6 : World#) (x1 : Int#) 
                 = next# [Int#] [k_c] s_aqe x0 x5 in
               caselet T2# (x7 : World#) (x0 : Int#) 
                 = readArray# [Int#] x1_acc 0# x6 in
               caselet T2# (x8 : World#) (_ : Void#) 
                 = writeArray# [Int#] x1_acc 0# (add# [Int#] x0 x1) x7 in
               caselet T2# (x9 : World#) (x2 : Int#) 
                 = readArray# [Int#] x3_acc 0# x8 in
               caselet T2# (x10 : World#) (_ : Void#) 
                 = writeArray# [Int#] x3_acc 0# (mul# [Int#] x2 x1) x9 in
               T2# x10 ()) x4 in
      caselet T2# (x12 : World#) (x1 : Int#) 
        = readArray# [Int#] x1_acc 0# x11 in
      caselet T2# (x13 : World#) (x3 : Int#) 
        = readArray# [Int#] x3_acc 0# x12 in
      T2# x13 (add# [Int#] x1 x3)

Here, "caselet" is just sugar for a case expression with a single alternative.
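
For readers not used to reading state-threaded code: GHC's own IO type is a function from a world token to an unboxed pair of a new token and a result, and the data dependency on the token is what keeps the operations in order. Here is a toy version of the idea, with an ordinary data type of my own standing in for State# RealWorld.

-- A stand-in for State# RealWorld; in real GHC core the token has no runtime
-- representation, it exists only to create data dependencies.
data World = World

writeCell :: Int -> Int -> World -> (World, ())
writeCell _addr _val w = (w, ())

readCell :: Int -> World -> (World, Int)
readCell _addr w = (w, 42)           -- dummy value for the sketch

-- Each step consumes the previous token and produces the next, so the
-- compiler cannot reorder or discard the "effectful" calls.
example :: World -> (World, Int)
example w0 =
    let (w1, ()) = writeCell 0 5 w0
        (w2, x)  = readCell 0 w1
    in  (w2, x + 1)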


Convert back to GHC core

repa_process_sTX
  :: forall k_c.
     Data.Array.Repa.Series.Stream k_c GHC.Types.Int
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Int# #)
[LclId]
lowered_sTX =
  \ (@ k_c)
    (rate_sTY :: GHC.Prim.Int#)
    (x_sTZ :: Data.Array.Repa.Series.Stream k_c GHC.Types.Int)
    (x_sU0 :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    let { (# x_sU1, x_sU2 #) ~ scrut_sUv
    <- newByteArray#_sU3 @ GHC.Prim.RealWorld 8 x_sU0
    } in
    let { __DEFAULT ~ x_sU8
    <- writeIntArray#_sU4 @ GHC.Prim.RealWorld x_sU2 0 0 x_sU1
    } in
    let { (# x_sU5, x_sU6 #) ~ scrut_sUw
    <- newByteArray#_sU7 @ GHC.Prim.RealWorld 8 x_sU8
    } in
    let { __DEFAULT ~ x_sUp
    <- writeIntArray#_sU9 @ GHC.Prim.RealWorld x_sU6 0 1 x_sU5
    } in
    let { (# x_sUa, x_sUb #) ~ x_sUa
    <- Main.primLoop
         (Main.primLengthOfRate rate_sTY)
         (\ (x_sUc :: GHC.Prim.Int#)
            (x_sUd :: GHC.Prim.State# GHC.Prim.RealWorld) ->
            let { (# x_sUe, x_sU1 #) ~ x_sUe
            <- Main.primNext_Int @ k_c x_sTZ x_sUc x_sUd
            } in
            let { (# x_sUf, x_sUc #) ~ x_sUf
            <- readIntArray#_sUg x_sU2 0 x_sUe
            } in
            let { __DEFAULT ~ x_sUl
            <- writeIntArray#_sUh
                 @ GHC.Prim.RealWorld x_sU2 0 (+#_sUi x_sUc x_sU1) x_sUf
            } in
            let { (# x_sUj, x_sU8 #) ~ x_sUj
            <- readIntArray#_sUk x_sU6 0 x_sUl
            } in
            let { __DEFAULT ~ x_sUo
            <- writeIntArray#_sUm
                 @ GHC.Prim.RealWorld x_sU6 0 (*#_sUn x_sU8 x_sU1) x_sUj
            } in
            (# x_sUo, GHC.Tuple.() #))
         x_sUp
    } in
    let { (# x_sUq, x_sU1 #) ~ x_sUq
    <- readIntArray#_sUr x_sU2 0 x_sUa
    } in
    let { (# x_sUs, x_sU5 #) ~ x_sUs
    <- readIntArray#_sUt x_sU6 0 x_sUq
    } in
    (# x_sUs, +#_sUu x_sU1 x_sU5 #)


This doesn't work yet because I've forgotten to pass the type arguments to the unboxed tuple constructor (#,#), and there may be other problems as well. I'll post again when I have an actual program running.


Tuesday, January 1, 2013

Code Generators, Rewrite Rules, Aliasing and the Coq

DDC 0.3.1 was pushed onto Hackage just before Christmas.

The main features in this new release are:
  • Back end code generation via C and LLVM that accepts the new core language. There is enough implemented to compile first-order programs, but we'll need to implement a lambda lifter for the new core language before we can compile higher-order programs directly.
  • Lots more program transformations. You can apply program transformations to core programs on the command line, and check that the code is being optimised the way it should be. There is a tutorial describing how to do this. 
  • A rewrite rule framework that understands the region and effect system. Rewrite rules work on effectful expressions, and can be given predicates so they only apply when particular effects are non-interfering. This lets us perform build-fold fusion when the builder and folder have (non-interfering) side effects. (more info in Amos Robinson's honours thesis)
  • Propagation of no-aliasing meta-data to the LLVM code generator. We have extended the core language with witnesses of distinctness, which prove that particular regions of the store are distinct (do not alias). When LLVM has this meta-data it can perform more optimisations on the generated code than it would otherwise be able to. (more info in Tran Ma's honours thesis)
The main missing feature is:
  • No type inference, so the core programs contain lots of type annotations. We'll need to get the type inferencer back online before the language will be useful to end-users.
Our immediate plans are to get a paper together about the new core language and how it supports the rewrite rules and LLVM meta-data. There are interesting stories to tell about all of these things. After that, we'll work on the type inferencer.

For the paper, we'll need a proof that the core language does what it's supposed to. To get this, I've started adding the region and effect system to the Coq formalisation of SystemF2+MutableStore I did about a year ago. I hadn't touched Coq since then, so I eased my way back into it by cleaning up the proofs I had already done, then packaged them up into the Iron Lambda collection.

Iron Lambda is a collection of Coq formalisations for functional core languages of increasing complexity, with the SystemF2+Regions+Effects language I'm working on now being the latest one. I made a wiki page for Iron Lambda because I think it will be useful in its own right, to help intermediate Coq users bridge the gap between the end of the Software Foundations course and what appears in current research papers. The biggest language in Software Foundations is Simply Typed Lambda Calculus (STLC)+Records+SubTyping, but it doesn't include polymorphism or indirect mutual recursion. To deal with SystemF-style polymorphism we need a better approach to names than unique identifiers (i.e., something like de Bruijn indices). To deal with indirectly mutually recursive language definitions we need to know how to define our own induction schemes. Iron Lambda shows how to do these things.

For the impatient:

DDC 0.3.1 
 $ cabal update
 $ cabal install ddc-tools

Iron Lambda
 $ darcs get http://code.ouroborus.net/iron/iron-head/

Thursday, February 2, 2012

Vectorisation without Replication in Data Parallel Haskell

Here is a Barnes-Hut gravitation simulation written using Data.Vector and Gloss.



If done naively, such an n-body simulation has a runtime complexity of O(n^2) because we need to consider the interaction of every body with every other body. Barnes-Hut performs an approximation that reduces this to O(n log n). At each step in the simulation we insert the bodies (grey) into a quad-tree (green), and compute the centroid for each branch (blue). Now, if some other body is sufficiently far away from a centroid, then we use the centroid to approximate the force due to the bodies in the corresponding branch, instead of inspecting each one individually.
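
The "sufficiently far away" test is the usual Barnes-Hut acceptance criterion. Here is a hedged sketch of it (the simulation in the video has its own implementation and parameters): a branch of the quad-tree is far enough from a body when the cell is small relative to the distance to its centroid.

data Body = Body { mass :: Double, px :: Double, py :: Double }

-- Use the centroid for a cell when cellSize / distance is below the
-- opening-angle threshold; otherwise descend into the branch.
farEnough :: Double -> (Double, Double) -> Body -> Bool
farEnough cellSize (cx, cy) b =
    let dx    = cx - px b
        dy    = cy - py b
        dist  = sqrt (dx * dx + dy * dy)
        theta = 0.5                  -- a common choice of threshold
    in  cellSize / dist < theta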

Now that you've seen the video, the following graph sums up my work on Data Parallel Haskell (DPH) over the past six months:

This is a plot of runtime vs the number of bodies in the Barnes-Hut simulation. The simulation in the video uses just 1000 bodies, but my understanding is that the physicists need millions or billions to do useful physics work. Back to the graph, the red line is the program from the video, which uses Data.Vector to store its arrays and runs sequentially. The brown line is using the DPH library that shipped with GHC 7.2. The blue line is using the DPH library that will ship with GHC 7.4. Note how the asymptotic complexity of the program with New DPH is the same as with Data.Vector.

In Old DPH we were using a naive approach to vectorisation that resulted in the quad-tree being replicated (copied) once for every body in the simulation (bad!). In New DPH we compute the force on each body in parallel while sharing the same quad-tree. There is a story about this, and you can read all about it when we finish the paper.

Of course, asymptotic complexity isn't everything. The DPH program is still running about 10x slower than the Data.Vector program for large input sizes, and I'm sure that production C programs would run at least 5x faster than that. There's much low hanging fruit here. DPH misses many standard optimisation opportunities, which results in numerical values being unboxed and reboxed in our inner loops. There are also algorithmic improvements to the library that are just waiting for me to implement them. If I can make the blue line cross the red line in the next six months, then I'm buying everyone a beer.

Monday, January 23, 2012

The new Disciple Core language

It's been a while since my last update. The last one was in August, and after that I got distracted with ICFP. When I got back I spent more time doing Coq proofs, finishing with Progress and Preservation for a version of SystemF2 with mutable algebraic data. The proofs are here. All data is allocated into the store (heap), and there are primitive functions to update it. In this language you can, say, destructively update a list so that the tail points to different physical data. This language doesn't have effect typing, but after this proof I felt like I had crossed the line from "Coq-shy" to "Coq-sure".

In early November I read about CIU-equivalence, and how to prove contextual equivalence of program transformations. Thinking ahead, it felt like time to clean up the sponginess in the existing DDC core language, because I'd need to do this before trying to formalise it. Although the plain calculus has a proper semantics and a hand-done soundness proof, the actual core language as used in the compiler also has some half-baked hacks. The reviewers of a previous paper had made suggestions about how to clean up other aspects of the core calculus, and having Coqified all those languages, I now understand why it was good advice.

Cutting to the chase, I've redesigned the DDC core language and there is an interpreter that works well enough to evaluate simple recursive functions. Here is an example:
  letrec {
   ack    [r1 r2 r3: %] 
          (m : Int r1) {!0 | Use r1 + Use r2 + Use r3}
          (n : Int r2) { Read r1 + Read r2 + Alloc r3
                       | Use r1  + Use r2  + Use r3}
          : Int r3
    = letregion r4 in
      let zero = 0 [r4] () in
      let one  = 1 [r4] () in
      case eqInt [:r1 r4 r4:] m zero of {
          1 -> addInt [:r2 r4 r3:] n one;
          _ -> case eqInt [:r2 r4 r4:] n zero of {
                  1 -> ack [:r4 r4 r3:] 
                           (subInt [:r1 r4 r4:] m one) 
                           (1 [r4] ());

                  _ -> ack [:r4 r4 r3:] 
                           (subInt [:r1 r4 r4:] m one)
                           (ack [:r1 r4 r4:] m (subInt [:r2 r4 r4:] n one));
          }
    }
  } in ack [:R0# R0# R0#:] (2 [R0#] ()) (3 [R0#] ());;
Here is another example that does destructive update of an integer:
  letregion r1 with {w1 : Mutable r1} in
  let x : Int r1 = 0 [r1] () in
  let _ : Unit   = updateInt [:r1 r1:] < w1 > x (2 [r1] ()) in
  addInt [:r1 r1 R2#:] x (3 [r1] ());;
and one that suspends the allocation of an integer with lazy evaluation:
  letregion r1 with { w1 : Const r1; w2 : Lazy r1; w3 : Global r1 } in
  let x : Int r1 lazy < w2 > 
     = purify < alloc [r1] w1 >  in
       forget < use [r1] w3 w1 > in
       2 [r1] () 
  in addInt [:r1 R0# R0#:] x (3 [R0#] ());;

Note that this is an intermediate representation for a compiler, and has vastly more type information than a real user would want to write. The compiler will perform type inference on the source language, and automatically translate the user program to this lower level language.

The new DDC core language is described on the wiki and so far I've been reasonably good at keeping the wiki up to date with what's implemented.

The main changes are:
  • Witnesses now exist in their own universe, at level 0 next to values. Both values and witnesses are classified by types, and types are classified by kinds. This removes the dependent-kinding of the old language. The more I thought about it, the less fun formalising a dependently kinded language seemed to be.

  • I've removed the more-than constraints on effect and closure variables. The type inference algorithm I am using returns constraints on effect and closure variables, so I had added similar constraints to the core language. This was a bad idea because managing the additional subsumption judgements during core type checking is a headache. The new core language has no such constraints, and I think I know how to process the output of the inferencer so that I won't need them.

  • Introducing lazy evaluation is now done with the let .. lazy binding form instead of a primitive suspend operator. This goes hand-in-hand with the purify and forget keywords that mask the effect and closure of their body respectively. These two keywords are akin to the type casts used by SystemFc and the GHC core language to support GADTs. I think it makes more sense to mask out the effect of an expression before suspending it, than to pass a witness as to why it's pure. The former style will still be used in the source program because that's how the type inferencer works, but the latter should be easier to work with in the core language.
Anyway, if you darcs get the DDC repo you can make bin/ddci-core to build the interpreter. Run the examples like bin/ddci-core test/ddci-core/30-Eval/30-Letrec/Test.dcx. It's not completely finished, but all the examples under the test/ddci-core directory run fine.

The interpreter should be done in another couple of weeks. After that it'll be time to split off the LLVM backend from the existing compiler so that we can compile core programs directly.