Writing codelets in Julia
The IPUToolkit.IPUCompiler
submodule allows you to write codelets for the IPU in Julia. Codelets are defined with the @codelet
macro, and then you can use them inside a program, written using the interface to the Poplar SDK described before. This mechanism uses the GPUCompiler.jl
package, which is a generic framework for generating LLVM IR code for specialised targets, not limited to GPUs despite the historical name.
Examples of codelets written in Julia are shown in the files examples/main.jl, examples/pi.jl, examples/adam.jl, examples/diffeq.jl.
The code inside a codelet has the same limitations as all the compilation models based on GPUCompiler.jl
:
- the code has to be statically inferred and compiled, so dynamic dispatch is not allowed;
- you cannot use functionalities which require the Julia runtime, most notably the garbage collector;
- you cannot call into any other external binary library at runtime, for example you cannot call into a BLAS library.
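The first restriction is often the hardest to spot by reading code. As a rough host-side check, and assuming that what type inference reports on the CPU is a useful proxy for what the codelet compiler will see, you can ask Julia which return type it infers for a function before using it in a codelet. The functions `clamp_unstable` and `clamp_stable` below are hypothetical examples, not part of IPUToolkit.jl.

```julia
# Hypothetical example functions, not part of IPUToolkit.jl.

# Type-unstable: returns a Float32 or an Int depending on a runtime value.
clamp_unstable(x::Float32) = x > 0 ? x : 0

# Type-stable: both branches return a Float32.
clamp_stable(x::Float32) = x > 0 ? x : zero(x)

# `Base.return_types` lists the return types that inference derives for the
# given argument types; a non-concrete result signals a type instability.
inferred_unstable = only(Base.return_types(clamp_unstable, (Float32,)))
inferred_stable = only(Base.return_types(clamp_stable, (Float32,)))

println("clamp_unstable: ", inferred_unstable, " (concrete: ", isconcretetype(inferred_unstable), ")")
println("clamp_stable:   ", inferred_stable, " (concrete: ", isconcretetype(inferred_stable), ")")
```

A non-concrete inferred type does not by itself prove that a codelet will fail to compile, but it is a reasonable first place to look when dynamic dispatch is reported.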
After defining a codelet with @codelet you can add a vertex calling this codelet to the graph with the function add_vertex, which also allows controlling the tile mapping in a basic way, or Poplar.GraphAddVertex.
IPUToolkit.IPUCompiler.@codelet
— Macro
@codelet graph <function definition>
Define a codelet and add it to the graph. The @codelet macro takes two arguments:
- the graph to which to add the codelet with the Poplar.GraphAddCodelets function;
- the function definition of the codelet that you want to compile for the IPU device.
All the arguments of the function must be either VertexVector
s, which represent the Vector
vertex type in the Poplar SDK, or VertexScalar
s, which represent scalar arguments. The function passed as second argument to @codelet
should have a single method.
@codelet
defines the function passed as argument, generates its LLVM Intermediate Representation (IR) using GPUCompiler.jl
and then compiles it down to native code using the Poplar compiler popc
, which must be in PATH
. By default the LLVM IR of the function is written to a temporary file, but you can choose to keep it in the current directory by customising IPUCompiler.KEEP_LLVM_FILES
. You can control flags passed to the popc
compiler like debug and optimisation levels or target types by customising IPUCompiler.POPC_FLAGS
. During compilation of codelets a spinner is displayed to show the progress, as this step can take a few seconds for each codelet to be generated. This can be disabled by setting IPUCompiler.PROGRESS_SPINNER
. All the options mentioned in this section have to be set before the @codelet
invocation where you want them to have effect.
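As a configuration sketch, the options described above can be grouped at the top of a script, before the first codelet definition. It uses only the option names from this section; the particular values are arbitrary choices for illustration, and it assumes IPUToolkit.jl and the Poplar SDK are available.

```julia
using IPUToolkit.IPUCompiler

# Set the options first: they only affect `@codelet` invocations that come later.
IPUCompiler.KEEP_LLVM_FILES[] = true      # keep the generated LLVM IR files
IPUCompiler.POPC_FLAGS[] = `-O3 -g0`      # flags passed to `popc`
IPUCompiler.PROGRESS_SPINNER[] = false    # do not show the spinner

# ... `@codelet graph function ... end` definitions go below this point ...
```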
The codelet is automatically added to the graph but you will have to separately use it in a vertex, by using either the add_vertex
function, or Poplar's Poplar.GraphAddVertex
.
Example
using IPUToolkit.IPUCompiler, IPUToolkit.Poplar
device = Poplar.get_ipu_device()
target = Poplar.DeviceGetTarget(device)
graph = Poplar.Graph(target)
@codelet graph function test(in::VertexVector{Int32,In}, out::VertexVector{Float32,Out})
    for idx in eachindex(out)
        out[idx] = sin(in[idx])
    end
end
This snippet of code defines a codelet called test, which takes as input the vector in, whose elements are Int32s, and writes to the vector out, whose elements are Float32s, the sine of the elements of in.
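Because the body of a codelet is ordinary Julia code, one way to sanity-check its arithmetic is to run the same loop on the host with plain vectors. The sketch below does that for the loop in the example; `test_host!` is a hypothetical name, and a successful host run says nothing about whether the code compiles for the IPU.

```julia
# Host-side counterpart of the loop in the `test` codelet, using plain vectors.
# `test_host!` is a hypothetical name, not part of IPUToolkit.jl.
function test_host!(out::Vector{Float32}, input::Vector{Int32})
    for idx in eachindex(out)
        # On the host, `sin` of an Int32 yields a Float64; the assignment
        # converts it to the Float32 element type of `out`.
        out[idx] = sin(input[idx])
    end
    return out
end

input = Int32[0, 1, 2, 3]
out = test_host!(Vector{Float32}(undef, length(input)), input)
println(out)
```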
IPUToolkit.IPUCompiler.VertexVector
— Type
VertexVector{T, S} <: AbstractVector{T}
This datatype formally represents vectors to be used in codelets (vertices) in IPU programs. They are the counterpart of the vertex vector types in the Poplar SDK.
The parameters of VertexVector{T,S} are
- T: the type of the elements of the vector, e.g. Int32, Float32, etc.;
- S: the scope of the vector in the codelet, In, Out, or InOut.
VertexVector is only meant to be used by end users to define the arguments of codelets with the @codelet macro. You should not try to manually instantiate or access the fields of a VertexVector.
For scalar arguments use VertexScalar
.
Example
VertexVector{Float32, In} # input-only vector of `Float32` elements
VertexVector{Int32, Out} # output-only vector of `Int32` elements
VertexVector{UInt32, InOut} # input/output vector of `UInt32` elements
IPUToolkit.IPUCompiler.VertexScalar
— Type
VertexScalar{T, S}
This datatype formally represents scalars to be used in codelets (vertices) in IPU programs. Technically, these are implemented as single-element tensors.
The parameters of VertexScalar{T,S} are
- T: the type of the scalar, e.g. Int32, Float32, etc.;
- S: the scope of the scalar in the codelet, In, Out, or InOut.
VertexScalar is only meant to be used by end users to define the arguments of codelets with the @codelet macro. You should not try to manually instantiate or access the fields of a VertexScalar.
Inside a codelet you can access and set the number by unwrapping it with []
.
For vector arguments use VertexVector
.
Example
Examples of types
VertexScalar{Float32, In} # input-only `Float32` number
VertexScalar{Int32, Out} # output-only `Int32` number
VertexScalar{UInt32, InOut} # input/output `UInt32` number
Inside a codelet, if x has type VertexScalar and scope In or InOut, you can access its value with
@ipushow x[]
y = x[] / 3.14
If x
has scope Out
or InOut
you can set its value with x[] = ...
:
x[] = 3.14
IPUToolkit.IPUCompiler.add_vertex
— Function
add_vertex(graph::Poplar.GraphAllocated,
           compute_set_or_program::Union{Poplar.ComputeSetAllocated, Poplar.ProgramSequenceAllocated},
           [tiles::Union{Integer,AbstractVector{<:Integer}},]
           codelet::Function,
           args::Union{Number,Poplar.TensorAllocated}...) -> Nothing
Add the codelet function codelet
created with @codelet
to graph
, using the tensors args
as arguments. The function codelet
must have exactly one method, no more, no less. The second argument can be either the program or the compute set to which to add the new vertex/vertices. If a program is passed, a new compute set will be automatically created.
add_vertex also evenly maps all tensors and vertices across the tiles given by the tiles argument, which can be either a single tile ID or an AbstractVector of IDs, and defaults to the single tile 0 if this argument is omitted. Note that the length of each argument tensor in args must be greater than or equal to the number of tiles. If you want finer control over tile mapping, use Poplar.GraphAddVertex instead.
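To make the length requirement concrete, here is a host-side sketch of one plausible way to split the indices of a tensor evenly among tiles. It is not IPUToolkit.jl's actual mapping code: `even_chunks` is a hypothetical helper, and the contiguous-chunk scheme is an assumption chosen for illustration.

```julia
# Hypothetical helper, not part of IPUToolkit.jl: split the indices 1:n into
# `length(tiles)` contiguous chunks whose sizes differ by at most one.
function even_chunks(n::Integer, tiles::AbstractVector{<:Integer})
    k = length(tiles)
    # With fewer elements than tiles, some tile would receive an empty chunk.
    n >= k || throw(ArgumentError("need at least as many elements ($n) as tiles ($k)"))
    base, extra = divrem(n, k)
    chunks = Pair{Int,UnitRange{Int}}[]
    start = 1
    for (i, tile) in enumerate(tiles)
        len = base + (i <= extra ? 1 : 0)   # the first `extra` tiles get one more element
        push!(chunks, tile => (start:start + len - 1))
        start += len
    end
    return chunks
end

for (tile, range) in even_chunks(10, 0:3)
    println("tile ", tile, " gets elements ", range)
end
```

Under this scheme, a tensor shorter than the number of tiles cannot give every tile at least one element, which is consistent with the constraint stated above.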
IPUToolkit.IPUCompiler.TARGET_COLOSSUS
— Constant
IPUToolkit.IPUCompiler.TARGET_COLOSSUS::Base.RefValue{Bool}
Option to control whether to target the Colossus backend when generating the LLVM Intermediate Representation (IR) of the codelets. If set to false
, the default, codelets will generate code for the host machine, which may be inefficient, while still being valid.
You can target the Colossus backend only if your Julia links to a version of libllvm compiled from Graphcore's fork of LLVM.
This option is experimental: Julia code generation using Graphcore's LLVM has not been tested extensively and is known to cause miscompilations, so unexpected errors may happen.
Example
IPUToolkit.IPUCompiler.TARGET_COLOSSUS[] = false # Generate LLVM IR for the host, the default
IPUToolkit.IPUCompiler.TARGET_COLOSSUS[] = true # Generate LLVM IR for the Colossus backend
IPUToolkit.IPUCompiler.KEEP_LLVM_FILES
— Constant
IPUToolkit.IPUCompiler.KEEP_LLVM_FILES::Base.RefValue{Bool}
Option to control whether to keep in the current directory the files with the LLVM Intermediate Representation (IR) generated for the codelets.
Example
IPUToolkit.IPUCompiler.KEEP_LLVM_FILES[] = false # Generated LLVM IR files are automatically deleted after compilation, default
IPUToolkit.IPUCompiler.KEEP_LLVM_FILES[] = true # Generated LLVM IR files are kept in the current directory
IPUToolkit.IPUCompiler.POPC_FLAGS
— Constant
IPUToolkit.IPUCompiler.POPC_FLAGS::Base.RefValue{Cmd}
Options to pass to the popc
compiler to compile the code.
Example
IPUToolkit.IPUCompiler.POPC_FLAGS[] = `-O3 -g0 -target ipu2`
IPUToolkit.IPUCompiler.POPC_FLAGS[] = `-O2 -g`
IPUToolkit.IPUCompiler.PROGRESS_SPINNER
— Constant
IPUToolkit.IPUCompiler.PROGRESS_SPINNER::Base.RefValue{Bool}
Option to control whether to display a spinner to show progress during compilation of IPU codelets. This is forcibly disabled if DEBUG_COMPILATION_ERRORS
is true
.
Example
IPUToolkit.IPUCompiler.PROGRESS_SPINNER[] = true # enable spinner, default
IPUToolkit.IPUCompiler.PROGRESS_SPINNER[] = false # disable spinner
IPU builtins
Inside codelets defined with @codelet
all calls to random functions
rand(Float16)
rand(Float32)
rand(UInt32)
rand(UInt64)
randn(Float16)
randn(Float32)
result in calls to the corresponding IPU builtins for random number generation. The uniformly distributed numbers follow the general semantics of the Julia function rand
(floating-point numbers are uniformly distributed in the $[0, 1)$ range), while the normally distributed numbers have the properties described in the Poplar SDK documentation (numbers are in the range $[-93/16, 93/16]$).
The IPU builtins for random numbers return pairs of numbers, but the Julia functions randn(Float16)
and randn(Float32)
return only a single number, discarding the second number of the pair. If you have a vector of even length that you want to fill in-place with normally distributed numbers, you can use the randn2!
function to do that efficiently, without discarding any number.
Additionally, you can use the IPU builtins listed below.
IPUToolkit.IPUCompiler.get_scount_l
— Function
get_scount_l()
Call the __builtin_ipu_get_scount_l()
builtin:
Get the value of the control/status register (CSR)
SCOUNT_L
, which is the lower 32 bits of the tile cycle counter value.
IPUToolkit.IPUCompiler.get_tile_id
— Function
IPUToolkit.IPUCompiler.randn2!
— Function
randn2!(v::VertexVector) -> v
Fill the vector v
with normally-distributed (mean 0, standard deviation 1) random numbers. The vector must have even length. This function takes advantage of IPU builtins for random number generation, which return pairs of numbers at a time.
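The idea of filling a vector two elements at a time can be illustrated on the host. The sketch below is an analogy only: `normal_pair` is a hypothetical stand-in that uses the Box-Muller transform, which is not necessarily how the IPU builtin generates its numbers, and `fill_pairs!` is not the implementation of randn2!.

```julia
using Random

# Hypothetical stand-in for a generator that yields two normally distributed
# numbers per call (Box-Muller transform on the host).
function normal_pair(rng::AbstractRNG)
    u1 = 1 - rand(rng)            # in (0, 1], so that log(u1) is finite
    u2 = rand(rng)
    r = sqrt(-2 * log(u1))
    return r * cos(2π * u2), r * sin(2π * u2)
end

# Fill `v` two elements at a time, using both numbers of every pair.
function fill_pairs!(v::AbstractVector{Float64}, rng::AbstractRNG)
    iseven(length(v)) || throw(ArgumentError("vector length must be even"))
    for i in firstindex(v):2:lastindex(v)
        v[i], v[i + 1] = normal_pair(rng)
    end
    return v
end

v = fill_pairs!(zeros(8), MersenneTwister(1234))
println(v)
```

Filling element by element with a generator that produces pairs would need twice as many generator calls, which is the inefficiency that a pairwise fill avoids.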
Printing
Inside codelets you can print text and the values of variables using the macros @ipuprintf, @ipuprint, @ipuprintln, and @ipushow. These macros are useful for debugging purposes, but printing inside a codelet may incur a performance penalty. To completely disable all printing and turn these macros into no-ops, you can set IPUCompiler.DISABLE_PRINT:
IPUCompiler.DISABLE_PRINT[] = true
IPUToolkit.IPUCompiler.@ipuprintf
— Macro
@ipuprintf("%Fmt", args...)
Print a formatted string in device context on the host standard output.
Note that this is not a fully C-compliant printf
implementation.
Also beware that it is an untyped and unforgiving printf implementation. Type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld formatting string.
More user-friendly versions of this macro are @ipuprint
, @ipuprintln
. See also @ipushow
, which is built on top of @ipuprintf
functionalities.
Printing can be completely disabled by setting IPUCompiler.DISABLE_PRINT
:
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true
IPUToolkit.IPUCompiler.@ipuprint
— Macro
@ipuprint(xs...)
@ipuprintln(xs...)
Print a textual representation of values xs to standard output from the IPU. The functionality builds on @ipuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64-bit signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @ipuprintf directly.
Limited string interpolation is also possible:
@ipuprint("Hello, World ", 42, "\n")
@ipuprint "Hello, World $(42)\n"
Printing can be completely disabled by setting IPUCompiler.DISABLE_PRINT
:
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true
IPUToolkit.IPUCompiler.@ipuprintln
— Macro
@ipuprint(xs...)
@ipuprintln(xs...)
Print a textual representation of values xs to standard output from the IPU. The functionality builds on @ipuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64-bit signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @ipuprintf directly.
Limited string interpolation is also possible:
@ipuprint("Hello, World ", 42, "\n")
@ipuprint "Hello, World $(42)\n"
Printing can be completely disabled by setting IPUCompiler.DISABLE_PRINT
:
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true
IPUToolkit.IPUCompiler.@ipushow
— Macro
@ipushow(ex)
IPU analogue of Base.@show
. It comes with the same type restrictions as @ipuprintf
.
@ipushow x
Printing can be completely disabled by setting IPUCompiler.DISABLE_PRINT
:
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true
IPUToolkit.IPUCompiler.DISABLE_PRINT
— Constant
IPUToolkit.IPUCompiler.DISABLE_PRINT::Base.RefValue{Bool}
Global constant which controls whether printing through the various @ipuprint*
macros should be disabled or not. You may want to completely disable printing for production runs, to avoid the cost of printing on the device, but keep it enabled during development.
Examples:
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = false # Do not disable printing, this is the default.
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true # Disable printing, the `@ipuprint*` macros are no-op.
Benchmarking
To benchmark expressions inside codelets you can use the macros @ipucycles, @ipushowcycles, and @ipuelapsed, which report the number of cycles spent in the wrapped expression. They are similar to Julia's @time, @showtime, and @elapsed macros, but report a number of cycles rather than a time, as the clock speed of tiles cannot be easily obtained inside a codelet. The corresponding time can be obtained by dividing the number of cycles by the clock frequency of the tile, which you can get with Poplar.TargetGetTileClockFrequency(target) outside of the codelet, and should usually be 1.330 GHz or 1.850 GHz depending on the model of your IPU. The printing macros @ipucycles and @ipushowcycles can be turned into no-ops by setting IPUCompiler.DISABLE_PRINT.
Timing of expressions taking longer than typemax(UInt32) / tile_clock_frequency
(about 2 or 3 seconds depending on your IPU model) is unreliable because the difference between the starting and the ending cycle counts would overflow.
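The arithmetic behind these figures can be checked on the host. The sketch below converts cycle counts to seconds for the two clock frequencies quoted above and shows how a count that exceeds 32 bits is silently truncated; `cycles_to_seconds` is a hypothetical helper, not part of IPUToolkit.jl.

```julia
# Hypothetical helper: convert a cycle count to seconds for a given tile
# clock frequency in Hz.
cycles_to_seconds(cycles::Integer, freq_hz::Real) = cycles / freq_hz

# Longest interval that a 32-bit cycle counter can represent.
for freq_hz in (1.330e9, 1.850e9)
    limit = cycles_to_seconds(typemax(UInt32), freq_hz)
    println(freq_hz / 1e9, " GHz: about ", round(limit; digits=2), " s")
end

# A region that really takes 2^32 + 16 cycles: a 32-bit difference keeps only
# the low 32 bits, so the measurement reports a much smaller number.
true_cycles = UInt64(typemax(UInt32)) + 17
observed_cycles = true_cycles % UInt32
println("true: ", Int(true_cycles), ", observed: ", Int(observed_cycles))
```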
Note also that the Poplar.TargetGetTileClockFrequency(target)
function may not return a reliable value, but this is an upstream bug (this has been observed at least up to Poplar SDK v3.0). You may have to use tools like gc-monitor
, gc-inventory
, or gc-info --device-id <N> --tile-clock-speed
to obtain the correct tile clock frequency.
IPUToolkit.IPUCompiler.@ipucycles
— Macro
@ipucycles ex
@ipucycles "description" ex
Print from inside a codelet the number of cycles spent to compute the expression ex. The corresponding time can be obtained by dividing the number of cycles by the clock frequency of the tile, which you can get with Poplar.TargetGetTileClockFrequency(target)
outside of the codelet. The optional argument description
, a literal String
, can be used to print also a label to identify the timed expression. A label is added automatically by @ipushowcycles
.
See also @ipuelapsed
.
This macro can be made no-op completely by setting
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true
IPUToolkit.IPUCompiler.@ipushowcycles
— Macro
@ipushowcycles ex
Print from inside a codelet the expression ex and the number of cycles spent to compute it. This is useful when benchmarking multiple expressions, to identify their contributions more easily. The corresponding time can be obtained by dividing the number of cycles by the clock frequency of the tile, which you can get with Poplar.TargetGetTileClockFrequency(target)
outside of the codelet.
See also @ipucycles
, @ipuelapsed
.
This macro can be made no-op completely by setting
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true
IPUToolkit.IPUCompiler.@ipuelapsed
— Macro
@ipuelapsed ex
Return the number of cycles spent to compute the expression ex. The corresponding time can be obtained by dividing the number of cycles by the clock frequency of the tile, which you can get with Poplar.TargetGetTileClockFrequency(target)
outside of the codelet.
See also @ipucycles
, @ipushowcycles
.
Passing non-constant variables from global scope
If your kernel references a global variable that is not declared const, the generated code will contain a reference to a memory address on the host, and this will fail fatally at runtime because programs running on the IPU don't have access to the host memory. Constant variables are not affected by this problem because their values are inlined when the function is compiled. If you can't or don't want to make a variable constant, you can interpolate its value with a top-level @eval when defining the codelet. For example:
using IPUToolkit.IPUCompiler, IPUToolkit.Poplar
device = Poplar.get_ipu_device()
target = Poplar.DeviceGetTarget(device)
graph = Poplar.Graph(target)
tile_clock_frequency = Poplar.TargetGetTileClockFrequency(target)
@eval @codelet graph function test(invec::VertexVector{Float32, In}, outvec::VertexVector{Float32, Out})
    # We can use the intrinsic `get_scount_l` to get the cycle counter right
    # before and after some operations, so that we can benchmark it.
    cycles_start = get_scount_l()
    # Do some operations here...
    cycles_end = get_scount_l()
    # Divide the difference between the two cycle counts by the tile frequency
    # clock to get the time.
    time = (cycles_end - cycles_start) / $(tile_clock_frequency)
    # Show the time spent doing your operations
    @ipushow time
end
Using @eval spares you from passing an extra argument to your kernel just to use the value of the variable inside the codelet.
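The difference between the three ways of referring to a global value can be observed on the host, without an IPU, by looking at what type inference reports. This is a sketch with hypothetical names (`scale`, `SCALE`, and the three `uses_*` functions), and it assumes that host-side inference reflects what the codelet compiler sees.

```julia
# Hypothetical names used only for this illustration.
scale = 2.0f0              # non-constant global
const SCALE = 2.0f0        # constant global

uses_global(x::Float32) = x * scale                   # `scale` is looked up at runtime
uses_const(x::Float32) = x * SCALE                    # value known when compiling
@eval uses_interpolated(x::Float32) = x * $(scale)    # value spliced into the definition

# Print the return type that inference derives for each variant.
for f in (uses_global, uses_const, uses_interpolated)
    println(nameof(f), " => ", only(Base.return_types(f, (Float32,))))
end
```

Note that the interpolated definition captures the value that the variable has when @eval runs, so assigning a new value to the variable afterwards does not change the compiled function.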
Debugging compilation errors in codelets
Writing codelets for the IPU takes some practice, because you cannot use arbitrary constructs or packages as you would normally do when running code on a CPU. As mentioned above, codelets have to be statically compiled with GPUCompiler.jl, with all the limitations of this framework, which supports only a subset of the Julia language. You will therefore frequently run into compilation errors while developing a codelet function, and you then have to resolve the issues, which usually involves removing dynamic dispatch calls (which would require the JIT compiler at runtime), resolving type instabilities, avoiding memory allocations, etc. If you have Cthulhu.jl installed, you can set IPUCompiler.DEBUG_COMPILATION_ERRORS to true to automatically open an interactive shell when compiling a codelet results in invalid LLVM IR, to more easily debug the codelet code.
We suggest again taking a look at the code samples in the examples/
directory for learning how to write working IPU codelets in Julia.
IPUToolkit.IPUCompiler.DEBUG_COMPILATION_ERRORS
— Constant
IPUToolkit.IPUCompiler.DEBUG_COMPILATION_ERRORS::Base.RefValue{Bool}
Option to control whether a failure to compile LLVM IR in @codelet
should drop you into an interactive debug session with Cthulhu.jl
. This forcibly disables the progress spinner enabled by PROGRESS_SPINNER
, as it would not play nicely with the interactive debug session.
Cthulhu.jl
must be installed in the environment you are currently using and you have to run using Cthulhu
before the @codelet
definition. IPUToolkit.jl
does not install Cthulhu.jl
automatically to limit the number of dependencies.
Example
IPUToolkit.IPUCompiler.DEBUG_COMPILATION_ERRORS[] = false # Do not automatically open interactive debug shell when a compilation error arises, the default
IPUToolkit.IPUCompiler.DEBUG_COMPILATION_ERRORS[] = true # Automatically open interactive debug shell when a compilation error arises
Domain-Specific Language: @ipuprogram
The IPUCompiler.@ipuprogram macro provides a very simple and limited DSL to automatically generate most of the boilerplate code needed when writing an IPU program. You can do very little with this DSL, which is mainly a showcase of Julia's meta-programming capabilities. A fully commented example of the use of the @ipuprogram macro is available in the examples/dsl.jl file.