As mentioned above, ParallelAccelerator aims to optimize implicitly parallel Julia programs that are safe to parallelize. It also tries to be non-invasive, which means a user function or program should continue to work as expected even when only a part of it is accelerated. It is still important to know what parts are accelerated, however. As a general guideline, we encourage users to write programs using high-level array operations rather than writing explicit for-loops which can have unrestricted mutations or unknown side-effects. High-level operations are more amenable to analysis and optimization provided by ParallelAccelerator.
To help users verify program correctness, the optimizations of ParallelAccelerator
can be turned off by setting the environment variable PROSPECT_MODE=none when
running Julia. Programs that use ParallelAccelerator will still run
(including those that use runStencil, described below), but no optimizations or
Julia-to-C translation will take place. Users can also use the @noacc macro
at the function call site to use the original version of the function.
Map and Reduce
Array operations that work uniformly on all elements of input arrays and
produce an output array of equal size are called point-wise operations.
Point-wise binary operations in Julia usually have a . prefix in the
operator name. These operations are translated internally into data-parallel map operations by
ParallelAccelerator. The following kinds of functions are recognized by
@acc as map operations:
- Unary point-wise functions, such as sqrt, exp, log, sin, cos, tanh, and abs.
- Binary point-wise functions, such as .+, .-, .*, ./, .^, min, and max.
Array assignments are also recognized and converted into in-place map
operations. An expression like
a = a .+ b will be turned into an in-place map
that takes two input arrays, a and
b, and updates a in place with the result.
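To make this concrete, here is a plain-Julia illustration (written with modern dot-broadcast syntax) of the point-wise style that @acc recognizes; under ParallelAccelerator these operations become data-parallel maps:

```julia
# Point-wise operations over equal-size arrays: recognized as maps.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]

c = a .+ b        # binary point-wise op: a map over both arrays
d = sqrt.(b)      # unary point-wise function: also a map

# The target also appears on the right-hand side, so under @acc this
# becomes an *in-place* map that reads a and b and updates a:
a = a .+ b
println(a)        # [11.0, 22.0, 33.0]
```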
Array operations that compute a single result by repeatedly applying an associative
and commutative operator to all input array elements are called reduce operations.
The following are recognized by @acc as reduce operations: sum, prod,
minimum, maximum, any, and all.
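For example (plain Julia shown; under @acc, calls like these become parallel reductions):

```julia
a = [3, 1, 4, 1, 5, 9]

s = sum(a)                    # + is associative and commutative: a reduce
p = prod(a)                   # likewise for *
lo, hi = minimum(a), maximum(a)

println((s, lo, hi))          # (23, 1, 9)
```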
We also support range operations to a limited extent. For example, an
assignment such as a[r] = b[r], where
r is either a range (e.g. 1:s) or a boolean mask array, is
internally converted to parallel operations when the ranges can be inferred
statically to be compatible. However, such support is still
experimental, and occasionally ParallelAccelerator will complain about not
being able to optimize them. We are working on improving this feature
to provide more coverage and better error messages.
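A minimal example of the kind of range expression involved (plain Julia; whether ParallelAccelerator actually parallelizes a given instance depends on the static compatibility check described above):

```julia
a = zeros(Int, 10)
b = collect(1:10)
r = 3:6                  # a UnitRange; both sides use the same range

a[r] = b[r]              # range-to-range assignment over compatible ranges
println(a)               # [0, 0, 3, 4, 5, 6, 0, 0, 0, 0]
```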
Parallel Comprehension

Array comprehensions in Julia are in general also parallelizable, because
unlike general loops, their iteration variables have no inter-dependencies.
The @acc macro will turn them into an internal form that we call
cartesianarray. A typed comprehension of the form

    A = Type[ f(x1, x2, ...) for x1 in r1, x2 in r2, ... ]

is translated into

    cartesianarray((i1, i2, ...) -> begin
                       x1 = r1[i1]
                       x2 = r2[i2]
                       f(x1, x2, ...)
                   end,
                   Type, (length(r1), length(r2), ...))

The cartesianarray function is also exported by ParallelAccelerator and
can be directly used by the user. Both of the above forms are acceptable
programs and are equivalent in semantics. They both produce an N-dimensional array
whose elements are of type Type, where
N is the number of x and r variables;
currently only up to 3 dimensions are supported.
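The equivalence of the two forms can be checked with a small sequential stand-in for cartesianarray (the real function is the one exported by ParallelAccelerator; cartesianarray_sketch below is our own and only mirrors the sequential semantics):

```julia
# Sequential stand-in for cartesianarray: builds an array of element type T
# with the given dimensions by calling body on each index tuple.
function cartesianarray_sketch(body, T, dims)
    A = Array{T}(undef, dims...)
    for idx in CartesianIndices(dims)
        A[idx] = body(Tuple(idx)...)
    end
    return A
end

r1, r2 = 1:3, 1:4
f(x1, x2) = x1 + 10.0 * x2

# Typed comprehension form:
A = Float64[ f(x1, x2) for x1 in r1, x2 in r2 ]

# Equivalent cartesianarray form:
B = cartesianarray_sketch((i1, i2) -> begin
        x1 = r1[i1]; x2 = r2[i2]
        f(x1, x2)
    end, Float64, (length(r1), length(r2)))

println(A == B)   # true
```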
It should be noted, however, that not all comprehensions are safe to parallelize.
For example, if the function
f above reads and writes a variable outside of the comprehension,
then making it run in parallel can produce a non-deterministic
result. Therefore, it is the responsibility of the user to avoid using
@acc in such situations.
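As a concrete illustration of the hazard (plain Julia; the function name is ours): the comprehension below carries state between iterations through the enclosing local acc, so its iterations cannot safely run in parallel and it should not be placed under @acc.

```julia
# UNSAFE to parallelize: each iteration reads and writes `acc`,
# a variable outside the comprehension, so iteration order matters.
function prefix_sums_unsafe(xs)
    acc = 0.0
    return Float64[ (acc += x; acc) for x in xs ]
end

println(prefix_sums_unsafe(1:4))   # [1.0, 3.0, 6.0, 10.0] sequentially
```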
Another difference between parallel comprehension and the aforementioned map operation is that array indexing operations in the body of a parallel comprehension remain explicit and therefore should go through necessary bounds-checking to ensure safety. On the other hand, in all map operations such bounds-checking is skipped.
Stencil

Stencils are commonly found in scientific computing and image processing. A stencil
computation is one that computes new values for all elements of an array based
on the current values of their neighboring elements. Since Julia's base library
does not provide such an API, ParallelAccelerator exports a general
runStencil interface to help with stencil programming:

    runStencil(kernel :: Function, buffer1, buffer2, ..., iteration :: Int, boundaryHandling :: Symbol)
As an example, the following (taken from
our Gaussian blur example)
performs a 5x5 stencil computation (note the use of Julia’s
do-block syntax that lets
the user write a lambda function):
    runStencil(buf, img, iterations, :oob_skip) do b, a
        b[0,0] =
            (a[-2,-2] * 0.003  + a[-1,-2] * 0.0133 + a[0,-2] * 0.0219 + a[1,-2] * 0.0133 + a[2,-2] * 0.0030 +
             a[-2,-1] * 0.0133 + a[-1,-1] * 0.0596 + a[0,-1] * 0.0983 + a[1,-1] * 0.0596 + a[2,-1] * 0.0133 +
             a[-2, 0] * 0.0219 + a[-1, 0] * 0.0983 + a[0, 0] * 0.1621 + a[1, 0] * 0.0983 + a[2, 0] * 0.0219 +
             a[-2, 1] * 0.0133 + a[-1, 1] * 0.0596 + a[0, 1] * 0.0983 + a[1, 1] * 0.0596 + a[2, 1] * 0.0133 +
             a[-2, 2] * 0.003  + a[-1, 2] * 0.0133 + a[0, 2] * 0.0219 + a[1, 2] * 0.0133 + a[2, 2] * 0.0030)
        return a, b
    end
It takes two input arrays, buf and img, and performs an iterative stencil
loop (ISL) of the number of iterations given by iterations.
The stencil kernel is specified by a lambda
function that takes two arrays a and b (that correspond to buf and
img), and computes the value of the output buffer using relative indices
as if a cursor is traversing all array elements. [0,0] represents
the current cursor position. The return statement in this lambda reverses
the position of a and b to specify a buffer rotation that should happen
in between the stencil iterations.
runStencil assumes that
all input and output buffers are of the same dimension and size.
Stencil boundary handling can be specified as one of the following symbols:
:oob_skip: Writing to the output is skipped when input indexing is out-of-bounds.
:oob_wraparound: Indexing is "wrapped around" at the array boundaries so it is always safe.
:oob_dst_zero: Write 0 to the output array when any of the input indices is out-of-bounds.
:oob_src_zero: Assume 0 is returned by a read operation when indexing is out-of-bounds.
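To make the boundary-handling semantics concrete, here is a small sequential plain-Julia sketch of :oob_skip for a 1-D three-point averaging stencil (our own reference code, not ParallelAccelerator's implementation): positions whose neighborhood would index out of bounds are skipped, so the destination keeps its old value there.

```julia
# Sequential sketch of :oob_skip: skip writing the output wherever the
# stencil's input indexing would go out of bounds.
function smooth_oob_skip(src::Vector{Float64})
    dst = copy(src)                       # skipped cells keep their old values
    for i in eachindex(src)
        if i - 1 >= 1 && i + 1 <= length(src)
            dst[i] = (src[i-1] + src[i] + src[i+1]) / 3
        end
    end
    return dst
end

println(smooth_oob_skip([0.0, 6.0, 0.0, 6.0]))   # [0.0, 2.0, 4.0, 6.0]
```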
Just as with parallel comprehensions, accessing variables outside the body
of the runStencil lambda expression is allowed.
However, accessing outside array values is
not supported, and reading/writing the same outside variable can cause
non-deterministic results.
All arrays that need to be relatively indexed should be passed as buffer
arguments to runStencil.
runStencil does not impose any implicit buffer rotation
order, and the user can choose not to rotate buffers at all; there
can be multiple output buffers as well. Finally, the call to runStencil does
not have any return value, and inputs are rotated
iterations - 1 times if rotation is specified.
ParallelAccelerator exports a naive Julia implementation of runStencil that
runs without using
@acc. Its purpose is mostly for correctness checking. When
@acc is being used with environment variable PROSPECT_MODE=none,
instead of parallelizing the stencil computation,
@acc will expand the call to
runStencil into a fast sequential implementation.