Writing codelets in Julia

The IPUToolkit.IPUCompiler submodule allows you to write codelets for the IPU in Julia. Codelets are defined with the @codelet macro, and then you can use them inside a program, written using the interface to the Poplar SDK described before. This mechanism uses the GPUCompiler.jl package, which is a generic framework for generating LLVM IR code for specialised targets, not limited to GPUs despite the historical name.

Examples of codelets written in Julia are shown in the files examples/main.jl, examples/pi.jl, examples/adam.jl, examples/diffeq.jl.

The code inside a codelet has the same limitations as all the compilation models based on GPUCompiler.jl:

  • the code has to be statically inferred and compiled: dynamic dispatch is not allowed;
  • you cannot use functionality that requires the Julia runtime, most notably the garbage collector;
  • you cannot call into any other external binary library at runtime, for example you cannot call into a BLAS library.
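As a rough host-side illustration of these constraints (plain Julia, nothing IPU-specific; both function names are made up for this sketch), the first version below allocates a temporary array and would therefore need the garbage collector, while the second computes the same result with a type-stable loop and no heap allocation. The second style is the one that tends to fit inside a codelet body.

```julia
# Allocating version: `v .^ 2` creates a temporary array, which relies on the
# Julia runtime (garbage collector) and is therefore unsuitable for a codelet.
sum_squares_alloc(v) = sum(v .^ 2)

# Allocation-free version: a plain loop with an accumulator of the element type,
# so every operation can be inferred statically.
function sum_squares(v::AbstractVector{T}) where {T}
    acc = zero(T)
    for x in v
        acc += x * x
    end
    return acc
end

v = Float32[1, 2, 3]
# Both versions agree on the host; only the second follows the rules above.
println(sum_squares_alloc(v) == sum_squares(v))
```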

After defining a codelet with @codelet you can add a vertex calling this codelet to the graph either with the add_vertex function, which also gives basic control over the tile mapping, or with Poplar.GraphAddVertex.

IPUToolkit.IPUCompiler.@codelet - Macro
@codelet graph <function definition>

Define a codelet and add it to the graph. The @codelet macro takes two arguments:

  • the graph to which to add the codelet with the Poplar.GraphAddCodelets function;
  • the function definition of the codelet that you want to compile for the IPU device.

All the arguments of the function must be either VertexVectors, which represent the Vector vertex type in the Poplar SDK, or VertexScalars, which represent scalar arguments. The function passed as second argument to @codelet should have a single method.

@codelet defines the function passed as argument, generates its LLVM Intermediate Representation (IR) using GPUCompiler.jl and then compiles it down to native code using the Poplar compiler popc, which must be in PATH. By default the LLVM IR of the function is written to a temporary file, but you can choose to keep it in the current directory by customising IPUCompiler.KEEP_LLVM_FILES. You can control the flags passed to the popc compiler, such as debug and optimisation levels or target types, by customising IPUCompiler.POPC_FLAGS. During compilation of codelets a spinner is displayed to show the progress, as this step can take a few seconds for each codelet to be generated. The spinner can be disabled by setting IPUCompiler.PROGRESS_SPINNER to false. All the options mentioned in this section have to be set before the @codelet invocation where you want them to have effect.
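A minimal sketch of setting these options, using only the constants named above. The specific values are illustrative, not recommendations.

```julia
using IPUToolkit.IPUCompiler

# All of these options are `Ref`s, so they are set with `[]`, and they must be
# set before the `@codelet` invocation they are meant to affect.
IPUCompiler.KEEP_LLVM_FILES[] = true    # keep the generated LLVM IR in the current directory
IPUCompiler.POPC_FLAGS[] = `-O3 -g0`    # flags forwarded to the `popc` compiler
IPUCompiler.PROGRESS_SPINNER[] = false  # no spinner, e.g. for non-interactive logs
```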

The codelet is automatically added to the graph but you will have to separately use it in a vertex, by using either the add_vertex function, or Poplar's Poplar.GraphAddVertex.

Example

using IPUToolkit.IPUCompiler, IPUToolkit.Poplar
device = Poplar.get_ipu_device()
target = Poplar.DeviceGetTarget(device)
graph = Poplar.Graph(target)
@codelet graph function test(in::VertexVector{Int32,In}, out::VertexVector{Float32,Out})
    for idx in eachindex(out)
        out[idx] = sin(in[idx])
    end
end

This snippet of code defines a codelet called test, which takes as input the vector in, whose elements are Int32s, and modifies the vector out, whose elements are Float32s, by computing the sine of the elements of in.

IPUToolkit.IPUCompiler.VertexVector - Type
VertexVector{T, S} <: AbstractVector{T}

This datatype formally represents vectors to be used in codelets (vertices) in IPU programs. They are the counterpart of the vertex vector types in the Poplar SDK.

The parameters of VertexVector{T,S} are

  • T: the type of the elements of the vector, e.g. Int32, Float32, etc.;
  • S: the scope of the vector in the codelet, In, Out, or InOut.

VertexVector is only meant to be used by end users to define the arguments of codelets with the @codelet macro. You should not try to manually instantiate or access the fields of a VertexVector.

For scalar arguments use VertexScalar.

Example

VertexVector{Float32, In}    # input-only vector of `Float32` elements
VertexVector{Int32, Out}     # output-only vector of `Int32` elements
VertexVector{UInt32, InOut}  # input/output vector of `UInt32` elements

IPUToolkit.IPUCompiler.VertexScalar - Type
VertexScalar{T, S}

This datatype formally represents scalars to be used in codelets (vertices) in IPU programs. Technically, these are implemented as single-element tensors.

The parameters of VertexScalar{T,S} are

  • T: the type of the scalar, e.g. Int32, Float32, etc.;
  • S: the scope of the scalar in the codelet, In, Out, or InOut.

VertexScalar is only meant to be used by end users to define the arguments of codelets with the @codelet macro. You should not try to manually instantiate or access the fields of a VertexScalar.

Inside a codelet you can access and set the number by unwrapping it with [].

For vector arguments use VertexVector.

Example

Examples of types

VertexScalar{Float32, In}    # input-only `Float32` number
VertexScalar{Int32, Out}     # output-only `Int32` number
VertexScalar{UInt32, InOut}  # input/output `UInt32` number

Inside a codelet, if x has type VertexScalar, you can read its value when its scope is In or InOut with

@ipushow x[]
y = x[] / 3.14

If x has scope Out or InOut you can set its value with x[] = ...:

x[] = 3.14
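The following sketch combines both argument kinds in a single codelet. The codelet name scale_and_sum and its argument names are hypothetical; the graph setup repeats the one from the @codelet example, and running it requires an IPU and the Poplar SDK.

```julia
using IPUToolkit.IPUCompiler, IPUToolkit.Poplar

device = Poplar.get_ipu_device()
target = Poplar.DeviceGetTarget(device)
graph = Poplar.Graph(target)

# Hypothetical codelet: scales `v` in place and reports the sum of the result.
@codelet graph function scale_and_sum(v::VertexVector{Float32, InOut},
                                      factor::VertexScalar{Float32, In},
                                      total::VertexScalar{Float32, Out})
    acc = 0f0
    for idx in eachindex(v)
        v[idx] *= factor[]   # read the `In` scalar by unwrapping it with `[]`
        acc += v[idx]
    end
    total[] = acc            # write the `Out` scalar with `[] = ...`
end
```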

IPUToolkit.IPUCompiler.add_vertex - Function
add_vertex(graph::Poplar.GraphAllocated,
           compute_set_or_program::Union{Poplar.ComputeSetAllocated, Poplar.ProgramSequenceAllocated},
           [tiles::Union{Integer,AbstractVector{<:Integer}},]
           codelet::Function,
           args::Union{Number,Poplar.TensorAllocated}...) -> Nothing

Add the codelet function codelet created with @codelet to graph, using the tensors args as arguments. The function codelet must have exactly one method, no more, no less. The second argument can be either the program or the compute set to which to add the new vertex/vertices. If a program is passed, a new compute set will be automatically created.

add_vertex also evenly maps all tensors and vertices across the tiles given by the tiles argument, which can be either a single tile ID or an AbstractVector of IDs, and defaults to the single tile 0 if omitted. Note that the length of each argument tensor in args must be greater than or equal to the number of tiles. If you want to have better control over tile mapping, use Poplar.GraphAddVertex instead.
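A short sketch of the call shape, wrapped in a hypothetical helper (add_on_four_tiles is not part of the package). It assumes that graph, prog, input and output were created beforehand with the Poplar interface, and that codelet was defined with @codelet on the same graph.

```julia
using IPUToolkit.IPUCompiler, IPUToolkit.Poplar

# Hypothetical helper: all arguments are assumed to exist already.
function add_on_four_tiles(graph, prog, codelet, input, output)
    # Spread tensors and vertices evenly over tiles 0, 1, 2 and 3.
    # Each tensor must therefore have at least 4 elements.
    add_vertex(graph, prog, 0:3, codelet, input, output)
    return nothing
end
```

Leaving out the 0:3 argument would place everything on tile 0, as described above.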

IPUToolkit.IPUCompiler.TARGET_COLOSSUS - Constant
IPUToolkit.IPUCompiler.TARGET_COLOSSUS::Base.RefValue{Bool}

Option to control whether to target the Colossus backend when generating the LLVM Intermediate Representation (IR) of the codelets. If set to false, the default, codelets will generate code for the host machine, which may be inefficient, while still being valid.

Note

You can target the Colossus backend only if your Julia links to a version of libllvm compiled from Graphcore's fork of LLVM.

Warning

This option is experimental: Julia code generation using Graphcore's LLVM has not been tested extensively and is known to cause miscompilations, so unexpected errors may happen.

Example

IPUToolkit.IPUCompiler.TARGET_COLOSSUS[] = false # Generate LLVM IR for the host, the default
IPUToolkit.IPUCompiler.TARGET_COLOSSUS[] = true  # Generate LLVM IR for the Colossus backend

IPUToolkit.IPUCompiler.KEEP_LLVM_FILES - Constant
IPUToolkit.IPUCompiler.KEEP_LLVM_FILES::Base.RefValue{Bool}

Option to control whether to keep in the current directory the files with the LLVM Intermediate Representation (IR) generated for the codelets.

Example

IPUToolkit.IPUCompiler.KEEP_LLVM_FILES[] = false # Generated LLVM IR files are automatically deleted after compilation, default
IPUToolkit.IPUCompiler.KEEP_LLVM_FILES[] = true  # Generated LLVM IR files are kept in the current directory

IPUToolkit.IPUCompiler.POPC_FLAGS - Constant
IPUToolkit.IPUCompiler.POPC_FLAGS::Base.RefValue{Cmd}

Options to pass to the popc compiler to compile the code.

Example

IPUToolkit.IPUCompiler.POPC_FLAGS[] = `-O3 -g0 -target ipu2`
IPUToolkit.IPUCompiler.POPC_FLAGS[] = `-O2 -g`

IPUToolkit.IPUCompiler.PROGRESS_SPINNER - Constant
IPUToolkit.IPUCompiler.PROGRESS_SPINNER::Base.RefValue{Bool}

Option to control whether to display a spinner to show progress during compilation of IPU codelets. This is forcibly disabled if DEBUG_COMPILATION_ERRORS is true.

Example

IPUToolkit.IPUCompiler.PROGRESS_SPINNER[] = true  # enable spinner, default
IPUToolkit.IPUCompiler.PROGRESS_SPINNER[] = false # disable spinner

IPU builtins

Inside codelets defined with @codelet all calls to random functions

  • rand(Float16)
  • rand(Float32)
  • rand(UInt32)
  • rand(UInt64)
  • randn(Float16)
  • randn(Float32)

result in calls to the corresponding IPU builtins for random number generation. The uniformly distributed numbers follow the general semantics of the Julia function rand (floating point numbers are uniformly distributed in the $[0, 1)$ range), while the normally distributed numbers have the properties described in the Poplar SDK documentation (numbers are in the range $[-93/16, 93/16]$).

Note

The IPU builtins for random numbers return pairs of numbers, but the Julia functions randn(Float16) and randn(Float32) return only a single number, discarding the second number of the pair. If you have a vector of even length that you want to fill in-place with normally distributed numbers, you can use the randn2! function to do that efficiently, without discarding any number.
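As a sketch of how these calls look in practice, the hypothetical codelet below (fill_uniform is a name made up for this example) fills its output with uniformly distributed Float32 numbers. Swapping rand(Float32) for randn(Float32) would give normally distributed values instead, at the cost of discarding half of the generated numbers as noted above. Running it requires an IPU and the Poplar SDK.

```julia
using IPUToolkit.IPUCompiler, IPUToolkit.Poplar

device = Poplar.get_ipu_device()
target = Poplar.DeviceGetTarget(device)
graph = Poplar.Graph(target)

# Inside the codelet, `rand(Float32)` is lowered to the IPU builtin.
@codelet graph function fill_uniform(out::VertexVector{Float32, Out})
    for idx in eachindex(out)
        out[idx] = rand(Float32)
    end
end
```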

Additionally, you can use the IPU builtins listed below.

Printing

Inside codelets you can print text and the values of variables using the macros @ipuprintf, @ipuprint, @ipuprintln, and @ipushow. These macros are useful for debugging purposes, but printing inside a codelet might incur a performance penalty. To completely disable all printing and turn these macros into no-ops you can set IPUCompiler.DISABLE_PRINT:

IPUCompiler.DISABLE_PRINT[] = true
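A minimal debugging sketch using two of these macros; the codelet name debug_sum and its arguments are hypothetical, and running it requires an IPU and the Poplar SDK. With DISABLE_PRINT set to true before the @codelet invocation, the two printing lines would compile to nothing.

```julia
using IPUToolkit.IPUCompiler, IPUToolkit.Poplar

device = Poplar.get_ipu_device()
target = Poplar.DeviceGetTarget(device)
graph = Poplar.Graph(target)

@codelet graph function debug_sum(v::VertexVector{Float32, In},
                                  total::VertexScalar{Float32, Out})
    acc = 0f0
    for idx in eachindex(v)
        acc += v[idx]
    end
    @ipushow acc                 # prints the expression together with its value
    @ipuprintln("sum = ", acc)   # prints the pieces one after another, then a newline
    total[] = acc
end
```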

IPUToolkit.IPUCompiler.@ipuprintf - Macro
@ipuprintf("%Fmt", args...)

Print a formatted string in device context on the host standard output.

Note that this is not a fully C-compliant printf implementation.

Also beware that it is an untyped and unforgiving printf implementation. Type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld formatting string.

More user-friendly versions of this macro are @ipuprint, @ipuprintln. See also @ipushow, which is built on top of @ipuprintf functionalities.

Printing can be completely disabled by setting IPUCompiler.DISABLE_PRINT:

IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true

IPUToolkit.IPUCompiler.@ipuprint, IPUToolkit.IPUCompiler.@ipuprintln - Macro
@ipuprint(xs...)
@ipuprintln(xs...)

Print a textual representation of values xs to standard output from the IPU. The functionality builds on @ipuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @ipuprintf directly.

Limited string interpolation is also possible:

    @ipuprint("Hello, World ", 42, "\n")
    @ipuprint "Hello, World $(42)\n"

Printing can be completely disabled by setting IPUCompiler.DISABLE_PRINT:

IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true

IPUToolkit.IPUCompiler.DISABLE_PRINT - Constant
IPUToolkit.IPUCompiler.DISABLE_PRINT::Base.RefValue{Bool}

Global constant which controls whether printing through the various @ipuprint* macros should be disabled or not. You may want to completely disable printing for production runs, to avoid the cost of printing on the device, but keep it enabled during development.

Examples:

IPUToolkit.IPUCompiler.DISABLE_PRINT[] = false # Do not disable printing, this is the default.
IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true  # Disable printing, the `@ipuprint*` macros are no-op.

Benchmarking

To benchmark expressions inside codelets you can use the macros @ipucycles, @ipushowcycles, and @ipuelapsed, which report the number of cycles spent in the wrapped expression. They are similar to Julia's @time, @showtime, and @elapsed macros, but they report cycles rather than seconds, because the clock speed of the tiles cannot be easily obtained inside a codelet. The corresponding time can be obtained by dividing the number of cycles by the clock frequency of the tile, which you can get with Poplar.TargetGetTileClockFrequency(target) outside of the codelet, and which should usually be 1.330 GHz or 1.850 GHz depending on the model of your IPU. The printing macros @ipucycles and @ipushowcycles can be made complete no-ops by setting IPUCompiler.DISABLE_PRINT.

Warning

Timing of expressions taking longer than typemax(UInt32) / tile_clock_frequency (about 2 or 3 seconds depending on your IPU model) is unreliable because the difference between the starting and the ending cycle counts would overflow.

Note also that the Poplar.TargetGetTileClockFrequency(target) function may not return a reliable value, but this is an upstream bug (this has been observed at least up to Poplar SDK v3.0). You may have to use tools like gc-monitor, gc-inventory, or gc-info --device-id <N> --tile-clock-speed to obtain the correct tile clock frequency.
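The arithmetic behind the conversion and the overflow limit can be checked on the host with plain Julia. In this sketch the two helper names are made up, and the frequencies are the nominal values quoted above rather than values read from a device.

```julia
# Convert a cycle count to seconds for a given tile clock frequency (in Hz).
cycles_to_seconds(cycles::Integer, tile_clock_frequency::Real) =
    cycles / tile_clock_frequency

# Longest interval a 32-bit cycle counter can measure before wrapping around.
max_measurable_seconds(tile_clock_frequency::Real) =
    typemax(UInt32) / tile_clock_frequency

for freq in (1.330e9, 1.850e9)
    limit = round(max_measurable_seconds(freq); digits=2)
    println(freq / 1e9, " GHz: at most ", limit, " s")
end

# UInt32 subtraction wraps modulo 2^32, so a single wrap-around between the two
# readings still gives the right difference; longer intervals do not.
cycles_start = UInt32(4_294_967_290)
cycles_end = UInt32(5)
println(cycles_end - cycles_start)
```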

IPUToolkit.IPUCompiler.@ipucycles - Macro
@ipucycles ex
@ipucycles "description" ex

Print from inside a codelet the number of cycles spent to compute the expression ex. The corresponding time can be obtained by dividing the number of cycles by the clock frequency of the tile, which you can get with Poplar.TargetGetTileClockFrequency(target) outside of the codelet. The optional argument description, a literal String, can be used to also print a label identifying the timed expression. A label is added automatically by @ipushowcycles.

See also @ipuelapsed.

This macro can be made no-op completely by setting

IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true

IPUToolkit.IPUCompiler.@ipushowcycles - Macro
@ipushowcycles ex

Print from inside a codelet the expression ex and the number of cycles spent to compute it. This is useful when benchmarking multiple expressions, to identify their contributions more easily. The corresponding time can be obtained by dividing the number of cycles by the clock frequency of the tile, which you can get with Poplar.TargetGetTileClockFrequency(target) outside of the codelet.

See also @ipucycles, @ipuelapsed.

This macro can be made no-op completely by setting

IPUToolkit.IPUCompiler.DISABLE_PRINT[] = true

IPUToolkit.IPUCompiler.@ipuelapsed - Macro
@ipuelapsed ex

Return the number of cycles spent to compute the expression ex. The corresponding time can be obtained by dividing the number of cycles by the clock frequency of the tile, which you can get with Poplar.TargetGetTileClockFrequency(target) outside of the codelet.

See also @ipucycles, @ipushowcycles.
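Because @ipuelapsed returns the count instead of printing it, the value can be written to an output argument and read back on the host. The sketch below is an assumption-laden illustration: the codelet name timed_copy is hypothetical, both vectors are assumed to have the same length, and the exact integer type returned by the macro is not stated here, hence the explicit truncation to UInt32. Running it requires an IPU and the Poplar SDK.

```julia
using IPUToolkit.IPUCompiler, IPUToolkit.Poplar

device = Poplar.get_ipu_device()
target = Poplar.DeviceGetTarget(device)
graph = Poplar.Graph(target)

@codelet graph function timed_copy(invec::VertexVector{Float32, In},
                                   outvec::VertexVector{Float32, Out},
                                   cycles::VertexScalar{UInt32, Out})
    elapsed = @ipuelapsed begin
        for idx in eachindex(outvec)
            outvec[idx] = invec[idx]
        end
    end
    # Assumption: `elapsed` is an integer. `% UInt32` truncates it to the
    # 32-bit output without a checked conversion.
    cycles[] = elapsed % UInt32
end
```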


Passing non-constant variables from global scope

If your kernel references a non-constant global variable (one not declared const), the generated code will contain a reference to a memory address on the host, and this will fatally fail at runtime because programs running on the IPU don't have access to the host memory. Constant variables are not affected by this problem because their values are inlined when the function is compiled. If you can't or don't want to make a variable constant you can interpolate its value with a top-level @eval when defining the codelet. For example:

using IPUToolkit.IPUCompiler, IPUToolkit.Poplar
device = Poplar.get_ipu_device()
target = Poplar.DeviceGetTarget(device)
graph = Poplar.Graph(target)
tile_clock_frequency = Poplar.TargetGetTileClockFrequency(target)
@eval @codelet graph function test(invec::VertexVector{Float32, In}, outvec::VertexVector{Float32, Out})
    # We can use the intrinsic `get_scount_l` to get the cycle counter right
    # before and after some operations, so that we can benchmark it.
    cycles_start = get_scount_l()
    # Do some operations here...
    cycles_end = get_scount_l()
    # Divide the difference between the two cycle counts by the tile clock
    # frequency to get the time.
    time = (cycles_end - cycles_start) / $(tile_clock_frequency)
    # Show the time spent doing your operations
    @ipushow time
end

The use of @eval allows you not to have to pass an extra argument to your kernel just to use the value of the variable inside the codelet.
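The interpolation mechanism itself can be tried in plain Julia on the host, without an IPU. In this small sketch (the names scale and scaled are made up), the value of the global is baked into the function when the definition is evaluated, so later changes to the global are not seen. The same applies to a codelet defined with @eval @codelet.

```julia
scale = 2.5  # non-constant global

# `$(scale)` is replaced by the current value (2.5) when this line is evaluated,
# so the function body contains a literal and no reference to the global.
@eval scaled(x) = x * $(scale)

scale = 10.0          # later changes are not picked up by `scaled`
println(scaled(2.0))  # still uses 2.5
```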

Debugging compilation errors in codelets

Writing codelets for the IPU takes some practice, because you cannot use arbitrary constructs or packages as you would normally do when running code on a CPU. As mentioned above, codelets have to be statically compiled with GPUCompiler.jl, with all the limitations of this framework, which can only use a subset of the Julia language. It is therefore common to run into compilation errors while developing a codelet function. Resolving them usually involves removing dynamic dispatch calls (which would require the JIT compiler at runtime), fixing type instabilities, avoiding memory allocations, etc. If you have Cthulhu.jl installed, you can set IPUCompiler.DEBUG_COMPILATION_ERRORS to true to automatically open an interactive shell when compiling a codelet results in invalid LLVM IR, to more easily debug the codelet code.

We suggest again taking a look at the code samples in the examples/ directory for learning how to write working IPU codelets in Julia.

IPUToolkit.IPUCompiler.DEBUG_COMPILATION_ERRORS - Constant
IPUToolkit.IPUCompiler.DEBUG_COMPILATION_ERRORS::Base.RefValue{Bool}

Option to control whether a failure to compile LLVM IR in @codelet should drop you into an interactive debug session with Cthulhu.jl. This forcibly disables the progress spinner enabled by PROGRESS_SPINNER, as it would not play nicely with the interactive debug session.

Note

Cthulhu.jl must be installed in the environment you are currently using and you have to run using Cthulhu before the @codelet definition. IPUToolkit.jl does not install Cthulhu.jl automatically to limit the number of dependencies.

Example

IPUToolkit.IPUCompiler.DEBUG_COMPILATION_ERRORS[] = false # Do not automatically open interactive debug shell when a compilation error arises, the default
IPUToolkit.IPUCompiler.DEBUG_COMPILATION_ERRORS[] = true  # Automatically open interactive debug shell when a compilation error arises

Domain-Specific Language: @ipuprogram

The IPUCompiler.@ipuprogram macro provides a very simple and limited DSL to automatically generate most of the boilerplate code needed when writing an IPU program. You can do very little with this DSL, which is mainly a showcase of Julia's meta-programming capabilities. A fully commented example of how to use the @ipuprogram macro is available in the examples/dsl.jl file.