I'm getting back into blogging. This is current work in progress.
Repa 4 will include a GHC plugin that performs array fusion using a version of Richard Waters's series expressions system, extended to support the segmented operators we need for Data Parallel Haskell.
The plugin converts GHC core code to Disciple Core, performs the fusion transform, and then converts back to GHC core. We're using Disciple Core for the main transform because it has a simple (and working) external core format, and the core language AST along with most of the core-to-core transforms are parametric in the type used for names. This second feature is important because we use a version of Disciple Core where the main array combinators (map, fold, filter etc) are primitive operators rather than regular function bindings.
The fusion transform is almost (almost) working for some simple examples. Here is the compilation sequence:
process :: Stream k Int -> Int
process s
= fold (+) 0 s + fold (*) 1 s
In this example we're performing two separate reductions over the same stream. None of the existing short-cut fusion approaches can handle this properly. Stream fusion in Data.Vector, Repa-style delayed array fusion, and build/foldr fusion will all read the stream elements from memory twice (if they are already manifest) or compute them twice (if they are delayed). We want to compute both reductions in a single pass.
repa_process_r2 : [k_aq0 : Data].Stream_r0 k_aq0 Int_3J -> Int_3J
= \(k_c : *_34d).\(s_aqe : Stream_r0 k_c Int_3J).
$fNumInt_$c+_rjF
(fold_r34 [k_c] [Int_3J] [Int_3J] $fNumInt_$c+_rjF
(I#_6d 0i#) s_aqe)
(fold_r34 [k_c] [Int_3J] [Int_3J] $fNumInt_$c*_rjE
(I#_6d 1i#) s_aqe)
Detect array combinators from GHC core, and convert to DDC primops
repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
= /\(k_c : Rate).
\(s_aqe : Stream# k_c Int#).
add# [Int#]
(fold# [k_c] [Int#] [Int#] (add# [Int#]) 0i# s_aqe)
(fold# [k_c] [Int#] [Int#] (mul# [Int#]) 1i# s_aqe)
Normalize and shift array combinators to top-level
All array combinators are used in their own binding.
repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
= /\(k_c : Rate).
\(s_aqe : Stream# k_c Int#).
let x0 = add# [Int#] in
let x1 = fold# [k_c] [Int#] [Int#] x0 0i# s_aqe in
let x2 = mul# [Int#] in
let x3 = fold# [k_c] [Int#] [Int#] x2 1i# s_aqe in
add# [Int#] x1 x3
Inline and eta-expand worker functions
This puts the program in the correct form for the next phase.
repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
= /\(k_c : Rate).
\(s_aqe : Stream# k_c Int#).
let x1
= fold# [k_c] [Int#] [Int#]
(\(x0 x1 : Int#). add# [Int#] x0 x1) 0i# s_aqe in
let x3
= fold# [k_c] [Int#] [Int#]
(\(x2 x3 : Int#). mul# [Int#] x2 x3) 1i# s_aqe in
add# [Int#] x1 x3
Do the lowering transform
This is the main pass that performs array fusion. Note that we've introduced a single loop# that computes both of the fold# results.
repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> Int#
= /\(k_c : Rate).
\(s_aqe : Stream# k_c Int#).
let x1_acc : Ref# Int# = new# [Int#] 0i# in
let x3_acc : Ref# Int# = new# [Int#] 1i# in
let _ : Unit
= loop# (lengthOfRate# [k_c])
(\(x0 : Nat#).
let x1 : Int# = next# [Int#] [k_c] s_aqe x0 in
let x0 : Int# = read# [Int#] x1_acc in
let _ : Void#
= write# [Int#] x1_acc (add# [Int#] x0 x1) in
let x2 : Int# = read# [Int#] x3_acc in
let _ : Void#
= write# [Int#] x3_acc (mul# [Int#] x2 x1) in
()) in
let x1 : Int# = read# [Int#] x1_acc in
let x3 : Int# = read# [Int#] x3_acc in
add# [Int#] x1 x3
Assign imperative variable storage to arrays
We need to convert the code back to GHC core, but we don't want to use IORefs because they can't hold unboxed values (of types like Int#). Instead, we use some new arrays to hold these values instead.
repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> Int#
= /\(k_c : Rate).
\(s_aqe : Stream# k_c Int#).
let x1_acc : Array# Int# = newArray# [Int#] 8# in
let _ : Void# = writeArray# [Int#] x1_acc 0# 0i# in
let x3_acc : Array# Int# = newArray# [Int#] 8# in
let _ : Void# = writeArray# [Int#] x3_acc 0# 1i# in
let _ : Unit
= loop# (lengthOfRate# [k_c])
(\(x0 : Nat#).
let x1 : Int# = next# [Int#] [k_c] s_aqe x0 in
let x0 : Int# = readArray# [Int#] x1_acc 0# in
let _ : Void#
= writeArray# [Int#] x1_acc 0#
(add# [Int#] x0 x1) in
let x2 : Int# = readArray# [Int#] x3_acc 0# in
let _ : Void#
= writeArray# [Int#] x3_acc 0#
(mul# [Int#] x2 x1) in
()) in
let x1 : Int# = readArray# [Int#] x1_acc 0# in
let x3 : Int# = readArray# [Int#] x3_acc 0# in
add# [Int#] x1 x3
Thread state token through effectful primops
The lowered code is naturally imperative, and GHC uses state threading to represent this.
repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> World# -> Tuple2# World# Int#
= /\(k_c : Rate).
\(s_aqe : Stream# k_c Int#).\(x0 : World#).
caselet T2# (x1 : World#) (x1_acc : Array# Int#)
= newArray# [Int#] 8# x0 in
caselet T2# (x2 : World#) (_ : Void#)
= writeArray# [Int#] x1_acc 0# 0i# x1 in
caselet T2# (x3 : World#) (x3_acc : Array# Int#)
= newArray# [Int#] 8# x2 in
caselet T2# (x4 : World#) (_ : Void#)
= writeArray# [Int#] x3_acc 0# 1i# x3 in
caselet T2# (x11 : World#) (_ : Unit)
= loop# (lengthOfRate# [k_c])
(\(x0 : Nat#).\(x5 : World#).
caselet T2# (x6 : World#) (x1 : Int#)
= next# [Int#] [k_c] s_aqe x0 x5 in
caselet T2# (x7 : World#) (x0 : Int#)
= readArray# [Int#] x1_acc 0# x6 in
caselet T2# (x8 : World#) (_ : Void#)
= writeArray# [Int#] x1_acc 0# (add# [Int#] x0 x1) x7 in
caselet T2# (x9 : World#) (x2 : Int#)
= readArray# [Int#] x3_acc 0# x8 in
caselet T2# (x10 : World#) (_ : Void#)
= writeArray# [Int#] x3_acc 0# (mul# [Int#] x2 x1) x9 in
T2# x10 ()) x4 in
caselet T2# (x12 : World#) (x1 : Int#)
= readArray# [Int#] x1_acc 0# x11 in
caselet T2# (x13 : World#) (x3 : Int#)
= readArray# [Int#] x3_acc 0# x12 in
T2# x13 (add# [Int#] x1 x3)
Here, "caselet" is just sugar for a case expression with a single alternative.
Covert back to GHC core
repa_process_sTX
:: forall k_c.
Data.Array.Repa.Series.Stream k_c GHC.Types.Int
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Int# #)
[LclId]
lowered_sTX =
\ (@ k_c)
(rate_sTY :: GHC.Prim.Int#)
(x_sTZ :: Data.Array.Repa.Series.Stream k_c GHC.Types.Int)
(x_sU0 :: GHC.Prim.State# GHC.Prim.RealWorld) ->
let { (# x_sU1, x_sU2 #) ~ scrut_sUv
<- newByteArray#_sU3 @ GHC.Prim.RealWorld 8 x_sU0
} in
let { __DEFAULT ~ x_sU8
<- writeIntArray#_sU4 @ GHC.Prim.RealWorld x_sU2 0 0 x_sU1
} in
let { (# x_sU5, x_sU6 #) ~ scrut_sUw
<- newByteArray#_sU7 @ GHC.Prim.RealWorld 8 x_sU8
} in
let { __DEFAULT ~ x_sUp
<- writeIntArray#_sU9 @ GHC.Prim.RealWorld x_sU6 0 1 x_sU5
} in
let { (# x_sUa, x_sUb #) ~ x_sUa
<- Main.primLoop
(Main.primLengthOfRate rate_sTY)
(\ (x_sUc :: GHC.Prim.Int#)
(x_sUd :: GHC.Prim.State# GHC.Prim.RealWorld) ->
let { (# x_sUe, x_sU1 #) ~ x_sUe
<- Main.primNext_Int @ k_c x_sTZ x_sUc x_sUd
} in
let { (# x_sUf, x_sUc #) ~ x_sUf
<- readIntArray#_sUg x_sU2 0 x_sUe
} in
let { __DEFAULT ~ x_sUl
<- writeIntArray#_sUh
@ GHC.Prim.RealWorld x_sU2 0 (+#_sUi x_sUc x_sU1) x_sUf
} in
let { (# x_sUj, x_sU8 #) ~ x_sUj
<- readIntArray#_sUk x_sU6 0 x_sUl
} in
let { __DEFAULT ~ x_sUo
<- writeIntArray#_sUm
@ GHC.Prim.RealWorld x_sU6 0 (*#_sUn x_sU8 x_sU1) x_sUj
} in
(# x_sUo, GHC.Tuple.() #))
x_sUp
} in
let { (# x_sUq, x_sU1 #) ~ x_sUq
<- readIntArray#_sUr x_sU2 0 x_sUa
} in
let { (# x_sUs, x_sU5 #) ~ x_sUs
<- readIntArray#_sUt x_sU6 0 x_sUq
} in
(# x_sUs, +#_sUu x_sU1 x_sU5 #)
This doesn't work yet because I've forgotten to pass the type arguments to the unboxed tuple constructor (#,#), and maybe other problems as well. I'll post again when I have an actual program running.
No comments:
Post a Comment