Thursday, February 2, 2012

Vectorisation without Replication in Data Parallel Haskell

Here is a Barnes-Hut gravitation simulation written using Data.Vector and Gloss.

[Video: Barnes-Hut simulation]

If done naively, such an n-body simulation has a runtime complexity of O(n^2), because we need to consider the interaction of every body with every other body. Barnes-Hut performs an approximation that reduces this to O(n log n). At each step in the simulation we insert the bodies (grey) into a quad-tree (green) and compute the centroid for each branch (blue). Now, if some other body is sufficiently far away from a centroid, then we use the centroid to approximate the force due to the bodies in the corresponding branch, instead of inspecting each one individually.
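If you want the shape of that in code, here is a rough sketch of the far-field test in plain Haskell. The types and names (Body, QuadTree, accel, the theta parameter) are mine for illustration, not the ones from the real program:

  import Data.List (foldl')

  data Body = Body { mass :: Double, px, py :: Double }

  data QuadTree
        = Leaf   Body
        | Branch { size     :: Double       -- side length of this cell
                 , centroid :: Body         -- total mass at the centre of mass
                 , kids     :: [QuadTree] }

  -- Acceleration on a body due to everything in the tree. If a cell is
  -- small relative to its distance from the body (s/d < theta) then we
  -- use its centroid, otherwise we descend into its children.
  accel :: Double -> Body -> QuadTree -> (Double, Double)
  accel _     b (Leaf b')       = pairwise b b'
  accel theta b (Branch s c ks)
   | s / bodyDist b c < theta   = pairwise b c
   | otherwise                  = foldl' addV (0, 0) [accel theta b k | k <- ks]
   where addV (x1, y1) (x2, y2) = (x1 + x2, y1 + y2)

  bodyDist :: Body -> Body -> Double
  bodyDist b b' = sqrt ((px b' - px b) ^ 2 + (py b' - py b) ^ 2)

  -- Naive pairwise interaction, with G = 1 and a softening factor.
  pairwise :: Body -> Body -> (Double, Double)
  pairwise b b'
   = let d = max 1e-6 (bodyDist b b')
         a = mass b' / (d * d)
     in  (a * (px b' - px b) / d, a * (py b' - py b) / d)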

Now that you've seen the video, the following graph sums up my work on Data Parallel Haskell (DPH) for the past six months:

[Graph: runtime versus number of bodies, for Data.Vector, Old DPH, and New DPH]

This is a plot of runtime versus the number of bodies in the Barnes-Hut simulation. The simulation in the video uses just 1000 bodies, but my understanding is that physicists need millions or billions of bodies to do useful physics work. Back to the graph: the red line is the program from the video, which uses Data.Vector to store its arrays and runs sequentially. The brown line uses the DPH library that shipped with GHC 7.2, and the blue line uses the DPH library that will ship with GHC 7.4. Note that the asymptotic complexity of the program with New DPH is the same as with Data.Vector.

In Old DPH we were using a naive approach to vectorisation that resulted in the quad-tree being replicated (copied) once for every body in the simulation (bad!). In New DPH we compute the force on each body in parallel while sharing the same quad-tree. There is a story about this, and you can read all about it when we finish the paper.
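To see where the replication comes from, consider the source-level parallel map. mapP and replicateP are real DPH operations, but the fragment below is a toy list-based model reusing the accel sketch from above, not the actual library output:

  -- In the source program we write, in effect:
  --
  --   mapP (accel theta tree) bodies
  --
  -- Vectorisation lifts 'accel' to work on whole arrays, and the naive
  -- way to handle the free variable 'tree' is to replicate it, once per
  -- body, before the lifted application:

  accelsNaive :: Double -> QuadTree -> [Body] -> [(Double, Double)]
  accelsNaive theta tree bodies
   = zipWith (\t b -> accel theta b t)
             (replicate (length bodies) tree)
             bodies

  -- With ordinary Haskell lists, 'replicate' just makes n pointers to one
  -- shared tree, so this looks harmless. But DPH flattens nested data into
  -- unboxed arrays, and the corresponding replicateP physically copied the
  -- flattened tree once per body. New DPH instead computes, in effect:

  accelsShared :: Double -> QuadTree -> [Body] -> [(Double, Double)]
  accelsShared theta tree bodies
   = map (\b -> accel theta b tree) bodies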

Of course, asymptotic complexity isn't everything. The DPH program still runs about 10x slower than the Data.Vector program for large input sizes, and I'm sure that production C programs would run at least 5x faster than that. There's much low-hanging fruit here. DPH misses many standard optimisation opportunities, which results in numerical values being unboxed and reboxed in our inner loops. There are also algorithmic improvements to the library that are just waiting for me to implement them. If I can make the blue line cross the red line in the next six months, then I'm buying everyone a beer.
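For a feel of what reboxing in an inner loop costs, compare these two folds. This is ordinary Haskell, nothing DPH-specific, and sumBoxed / sumUnboxed are just my names:

  import qualified Data.Vector.Unboxed as U

  -- Lazy foldl over a list: each step allocates a thunk, and the Double
  -- accumulator gets unboxed and reboxed on every iteration.
  sumBoxed :: [Double] -> Double
  sumBoxed = foldl (+) 0

  -- Strict fold over an unboxed vector: the accumulator stays unboxed
  -- for the whole inner loop.
  sumUnboxed :: U.Vector Double -> Double
  sumUnboxed = U.foldl' (+) 0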

Monday, January 23, 2012

The new Disciple Core language

It's been a while since my last update. The last one was in August, and after that I got distracted with ICFP. When I got back I spent more time doing Coq proofs, finishing with Progress and Preservation for a version of SystemF2 with mutable algebraic data. The proofs are here. All data is allocated into the store (heap), and there are primitive functions to update it. In this language you can, say, destructively update a list so that the tail points to different physical data. This language doesn't have effect typing, but after this proof I felt like I had crossed the line from "Coq-shy" to "Coq-sure".
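That proof language isn't Haskell, but the same sort of destructive update can be modelled in Haskell by putting the tail pointers in the store. This sketch and its names are mine, just to illustrate the operation:

  import Data.IORef

  -- A list whose tail pointers live in the store, so they can be
  -- destructively updated.
  data List a = Nil | Cons a (IORef (List a))

  fromList :: [a] -> IO (List a)
  fromList []       = return Nil
  fromList (x : xs) = Cons x <$> (newIORef =<< fromList xs)

  -- Overwrite the tail of the first cell, making it point at
  -- physically different data.
  setTail :: List a -> List a -> IO ()
  setTail (Cons _ ref) rest = writeIORef ref rest
  setTail Nil          _    = return ()

  main :: IO ()
  main
   = do xs <- fromList [1, 2, 3 :: Int]
        ys <- fromList [10, 20]
        setTail xs ys      -- xs is now 1 : 10 : 20 : Nil, sharing ys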

In early November I read about CIU-equivalence, and how to prove contextual equivalence of program transformations. Thinking ahead, it felt like time to clean up the sponginess in the existing DDC core language, because I'd need to do this before trying to formalise it. Although the plain calculus has a proper semantics and a hand-done soundness proof, the actual core language as used in the compiler also has some half-baked hacks. The reviewers of a previous paper had made suggestions about how to clean up other aspects of the core calculus, and having Coqified all those languages, I now understand why it was good advice.

Cutting to the chase, I've redesigned the DDC core language, and there is an interpreter that works well enough to evaluate simple recursive functions. Here is an example, the Ackermann function:
  letrec {
   ack    [r1 r2 r3: %] 
          (m : Int r1) {!0 | Use r1 + Use r2 + Use r3}
          (n : Int r2) { Read r1 + Read r2 + Alloc r3
                       | Use r1  + Use r2  + Use r3}
          : Int r3
    = letregion r4 in
      let zero = 0 [r4] () in
      let one  = 1 [r4] () in
      case eqInt [:r1 r4 r4:] m zero of {
          1 -> addInt [:r2 r4 r3:] n one;
          _ -> case eqInt [:r2 r4 r4:] n zero of {
                  1 -> ack [:r4 r4 r3:] 
                           (subInt [:r1 r4 r4:] m one) 
                           (1 [r4] ());

                  _ -> ack [:r4 r4 r3:] 
                           (subInt [:r1 r4 r4:] m one)
                           (ack [:r1 r4 r4:] m (subInt [:r2 r4 r4:] n one));
          }
    }
  } in ack [:R0# R0# R0#:] (2 [R0#] ()) (3 [R0#] ());;
Here is another example that does destructive update of an integer:
  letregion r1 with {w1 : Mutable r1} in
  let x : Int r1 = 0 [r1] () in
  let _ : Unit   = updateInt [:r1 r1:] < w1 > x (2 [r1] ()) in
  addInt [:r1 r1 R2#:] x (3 [r1] ());;
and one that suspends the allocation of an integer with lazy evaluation:
  letregion r1 with { w1 : Const r1; w2 : Lazy r1; w3 : Global r1 } in
  let x : Int r1 lazy < w2 > 
     = purify < alloc [r1] w1 >  in
       forget < use [r1] w3 w1 > in
       2 [r1] () 
  in addInt [:r1 R0# R0#:] x (3 [R0#] ());;

Note that this is an intermediate representation for a compiler, and has vastly more type information than a real user would want to write. The compiler will perform type inference on the source language, and automatically translate the user program to this lower level language.

The new DDC core language is described on the wiki, and so far I've been reasonably good at keeping the wiki up to date with what's implemented.

The main changes are:
  • Witnesses now exist in their own universe, at level 0 next to values. Both values and witnesses are classified by types, and types are classified by kinds; there's a toy model of this stratification after the list. This removes the dependent kinding of the old language. The more I thought about it, the less fun formalising a dependently kinded language seemed to be.

  • I've removed the more-than constraints on effect and closure variables. The type inference algorithm I am using returns constraints on effect and closure variables, so I had added similar constraints to the core language. This was a bad idea, because managing the additional subsumption judgements during core type checking is a headache. The new core language has no such constraints, and I think I know how to process the output of the inferencer so that I won't need them.

  • Introducing lazy evaluation is now done with the let .. lazy binding form instead of a primitive suspend operator. This goes hand-in-hand with the purify and forget keywords, which mask the effect and closure of their body respectively. These two keywords are akin to the type casts used by System FC and the GHC core language to support GADTs. I think it makes more sense to mask out the effect of an expression before suspending it than to pass a witness as to why it's pure. The latter style will still be used in the source program, because that's how the type inferencer works, but the former should be easier to work with in the core language.
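As a rough picture of the first point, here is a toy model of the new stratification as Haskell data types. This is my sketch, not DDC's actual AST:

  -- Level 2: kinds classify types. '%' is the region kind, '!' effects,
  -- and so on.
  data Kind    = KData | KRegion | KEffect | KClosure | KWitness
               | KFun Kind Kind

  -- Level 1: types classify both expressions and witnesses.
  data Type    = TVar String | TCon String | TApp Type Type
               | TForall String Kind Type

  -- Level 0: expressions and witnesses sit side by side. An expression
  -- can take a witness argument, but there is no dependent kinding.
  data Exp     = XVar String | XLam String Type Exp | XApp Exp Exp
               | XWitApp Exp Witness

  data Witness = WVar String | WCon String | WApp Witness Witness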
Anyway, if you darcs get the DDC repo, you can run make bin/ddci-core to build the interpreter, then run the examples like this: bin/ddci-core test/ddci-core/30-Eval/30-Letrec/Test.dcx. It's not completely finished, but all the examples under the test/ddci-core directory run fine.

The interpreter should be done in another couple of weeks. After that it'll be time to split off the LLVM backend from the existing compiler so that we can compile core programs directly.