
Codon ships with a feature-complete, fully-compiled native NumPy implementation. It uses the same API as NumPy, but re-implements everything in Codon itself, allowing for a range of optimizations and performance improvements. Codon-NumPy works with Codon's Python interoperability (you can transfer arrays to and from regular Python seamlessly), parallel backend (you can do array operations in parallel), and GPU backend (you can transfer arrays to and from the GPU seamlessly, and operate on them on the GPU).

Getting started

Importing numpy in Codon will use Codon-NumPy (as opposed to from python import numpy, which would use standard NumPy):

import numpy as np

We can then create and manipulate arrays just like in standard NumPy:

x = np.arange(15, dtype=np.int64).reshape(3, 5)
print(x)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]

x[1:, ::2] = -99
print(x)
# [[  0   1   2   3   4]
#  [-99   6 -99   8 -99]
#  [-99  11 -99  13 -99]]

y = x.max(axis=1)
print(y)
# [ 4  8 13]

In Codon-NumPy, any Codon type can be used as the array element type. The numpy module provides the same aliases that regular NumPy has, like np.int64, np.float32, etc., but these simply refer to the regular Codon types.

{% hint style="warning" %} Using a string (e.g. "i4" or "f8") for the dtype is not yet supported. {% endhint %}
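The aliases and the underlying Codon types are therefore interchangeable as dtypes; for instance (a minimal sketch):

import numpy as np

a = np.zeros(3, dtype=np.float64)  # using the NumPy-style alias
b = np.zeros(3, dtype=float)       # using the Codon type directly; np.float64 is the same type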

Codon array type

The Codon array type is parameterized by the array data type ("dtype") and the array dimension ("ndim"). That means that, in Codon-NumPy, the array dimension is a property of the type, so a 1-d array is a different type than a 2-d array and so on:

import numpy as np

arr = np.array([[1.1, 2.2], [3.3, 4.4]])
print(arr.__class__.__name__)  # ndarray[float,2]

arr = np.arange(10)
print(arr.__class__.__name__)  # ndarray[int,1]

The array dimension must also be known at compile-time. This allows the compiler to perform a wider range of optimizations on array operations. Usually, this has no impact on the code as the NumPy functions can determine input and output dimensions automatically. However, the dimension (and dtype) must be given when, for instance, reading arrays from disk:

# 'dtype' argument specifies array type
# 'ndim' argument specifies array dimension
arr = np.load('arr.npy', dtype=float, ndim=3)

A very limited number of NumPy functions return an array whose dimension cannot be deduced from its inputs. One such example is squeeze(), which removes axes of length 1; since the number of axes of length 1 is not determinable at compile-time, this function requires an extra argument that indicates which axes to remove.
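For instance, a minimal sketch (assuming squeeze() takes the standard NumPy axis argument):

import numpy as np

a = np.zeros((3, 1, 4))
b = a.squeeze(axis=1)  # axis 1 has length 1 and is removed
print(b.__class__.__name__)  # ndarray[float,2]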

Python interoperability

Codon's ndarray type supports Codon's standard Python interoperability API (i.e. __to_py__ and __from_py__ methods), so arrays can be transferred to and from Python seamlessly.
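For example, here is a minimal sketch that hands an array to a function running under regular Python via Codon's @python decorator (the array is converted through __to_py__ and __from_py__ behind the scenes):

import numpy as np

@python
def total(arr: np.ndarray[float, 1]) -> float:
    # this body runs in the CPython interpreter on a regular NumPy array
    return float(arr.sum())

x = np.arange(5, dtype=float)
print(total(x))  # 10.0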

PyTorch integration

Because PyTorch tensors and NumPy arrays are interchangeable without copying data, it is easy to use Codon to efficiently manipulate or operate on PyTorch tensors. This can be achieved either via Codon's just-in-time (JIT) compilation mode or via its Python extension mode.

Using Codon JIT

Here is an example that initializes a $128 \times 128 \times 128$ tensor using Codon's JIT:

import numpy as np
import time
import codon
import torch

@codon.jit
def initialize(arr):
    for i in range(128):
        for j in range(128):
            for k in range(128):
                arr[i, j, k] = i + j + k

# first call JIT-compiles; subsequent calls use cached JIT'd code
tensor = torch.empty(128, 128, 128)
initialize(tensor.numpy())

tensor = torch.empty(128, 128, 128)
t0 = time.time()
initialize(tensor.numpy())
t1 = time.time()

print(tensor)
print(t1 - t0, 'seconds')

Timings on an M1 MacBook Pro:

  • Without @codon.jit: 0.1645 seconds
  • With @codon.jit: 0.001485 seconds (110x speedup)

For more information, see the Codon JIT docs.

Using Codon Python extensions

Codon can compile directly to a Python extension module, similar to writing a C extension for CPython or using Cython.

Taking the same example, we can create a file init.py:

import numpy as np
import numpy.pybridge

def initialize(arr: np.ndarray[np.float32, 3]):
    for i in range(128):
        for j in range(128):
            for k in range(128):
                arr[i, j, k] = i + j + k

Note that extension module functions need to specify argument types. In this case, the argument is a 3-dimensional array of type float32, which is expressed as np.ndarray[np.float32, 3] in Codon.

Now we can use a setup script setup.py to create the extension module as described in the Codon Python extension docs:

python3 setup.py build_ext --inplace  # setup.py from docs linked above

Finally, we can call the function from Python:

from codon_initialize import initialize
import torch

tensor = torch.empty(128, 128, 128)
initialize(tensor.numpy())
print(tensor)

Note that there is no compilation happening at runtime with this approach. Instead, everything is compiled ahead of time when creating the extension. The timing is the same as the first approach.

You can also use any Codon compilation flags with this approach by adding them to the spawn call in the setup script. For example, you can use the -disable-exceptions flag to disable runtime exceptions, which can yield performance improvements and generate more streamlined code.
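As a hypothetical sketch, the relevant part of a build step might look like the following (the actual command comes from the setup.py in the Codon Python extension docs; the helper name and argument handling here are illustrative):

import subprocess

def build_extension(codon_path: str, source: str, output: str):
    # illustrative only: extra Codon flags are appended to the build command
    cmd = [codon_path, "build", "-release", "-pyext", source, "-o", output]
    cmd.append("-disable-exceptions")  # optional: disable runtime exceptions
    subprocess.run(cmd, check=True)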

Parallel processing

Unlike Python, Codon has no global interpreter lock ("GIL") and supports full multithreading, meaning NumPy code can be parallelized. For example:

import numpy as np
import numpy.random as rnd
import time

N = 100000000
n = 10

rng = rnd.default_rng(seed=0)
x = rng.normal(size=(N,n))
y = np.empty(n)

t0 = time.time()

@par(num_threads=n)
for i in range(n):
    y[i] = x[:,i].sum()

t1 = time.time()

print(y)
print(t1 - t0, 'seconds')
# no par - 1.4s
# w/ par - 0.4s

GPU processing

Codon-NumPy supports seamless GPU processing: arrays can be passed to and from the GPU, and array operations can be performed on the GPU using Codon's GPU backend. Here's an example that computes the Mandelbrot set:

import numpy as np
import gpu

MAX    = 1000  # maximum Mandelbrot iterations
N      = 4096  # width and height of image
pixels = np.empty((N, N), int)

def scale(x, a, b):
    return a + (x/N)*(b - a)

@gpu.kernel
def mandelbrot(pixels):
    i = (gpu.block.x * gpu.block.dim.x) + gpu.thread.x
    j = (gpu.block.y * gpu.block.dim.y) + gpu.thread.y
    c = complex(scale(j, -2.00, 0.47), scale(i, -1.12, 1.12))
    z = 0j
    iteration = 0

    while abs(z) <= 2 and iteration < MAX:
        z = z**2 + c
        iteration += 1

    pixels[i, j] = 255 * iteration/MAX

mandelbrot(pixels, grid=(N//32, N//32), block=(32, 32))

Here is the same code using GPU-parallelized for-loops:

import numpy as np
import gpu

MAX    = 1000  # maximum Mandelbrot iterations
N      = 4096  # width and height of image
pixels = np.empty((N, N), int)

def scale(x, a, b):
    return a + (x/N)*(b - a)

@par(gpu=True, collapse=2)  # <--
for i in range(N):
    for j in range(N):
        c = complex(scale(j, -2.00, 0.47), scale(i, -1.12, 1.12))
        z = 0j
        iteration = 0

        while abs(z) <= 2 and iteration < MAX:
            z = z**2 + c
            iteration += 1

        pixels[i, j] = 255 * iteration/MAX

Linear algebra

Codon-NumPy fully supports NumPy's linear algebra module, which provides a comprehensive set of functions for linear algebra operations. It is imported just like in standard NumPy:

import numpy as np
import numpy.linalg as LA

For example, the eig() function computes the eigenvalues and eigenvectors of a square matrix:

eigenvalues, eigenvectors = LA.eig(np.diag((1, 2, 3)))
print(eigenvalues)
# [1.+0.j 2.+0.j 3.+0.j]

print(eigenvectors)
# [[1.+0.j 0.+0.j 0.+0.j]
#  [0.+0.j 1.+0.j 0.+0.j]
#  [0.+0.j 0.+0.j 1.+0.j]]

Just like standard NumPy, Codon will use an optimized BLAS library under the hood to implement many linear algebra operations. This defaults to OpenBLAS on Linux and Apple's Accelerate framework on macOS.

Because Codon supports full multithreading, it's possible to use outer-loop parallelism to perform linear algebra operations in parallel. Here's an example that multiplies several matrices in parallel:

import numpy as np
import numpy.random as rnd
import time

N = 5000
n = 10
rng = rnd.default_rng(seed=0)
a = rng.normal(size=(n, N, N))
b = rng.normal(size=(n, N, N))
y = np.empty((n, N, N))
t0 = time.time()

@par(num_threads=n)
for i in range(n):
    y[i, :, :] = a[i, :, :] @ b[i, :, :]

t1 = time.time()
print(y.sum())
print(t1 - t0, 'seconds')  # Python - 53s
                           # Codon  -  6s

{% hint style="warning" %} When using Codon's outer-loop parallelism, make sure to set the environment variable OPENBLAS_NUM_THREADS to 1 (i.e. export OPENBLAS_NUM_THREADS=1) to avoid conflicts with OpenBLAS multithreading. {% endhint %}

NumPy-specific compiler optimizations

Codon includes compiler passes that optimize NumPy code through techniques like operator fusion, which combines distinct operations so that they can be executed in a single pass through the argument arrays, saving both execution time and memory (since intermediate arrays no longer need to be allocated).

To showcase this, here's a simple NumPy program that approximates $\pi$. The code below generates two random vectors $x$ and $y$ with entries in the range $[0, 1)$ and computes the fraction of points $(x_i, y_i)$ that lie inside the circle of radius $1$ centered at $(1, 1)$, which is approximately $\pi / 4$:

import time
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.random(500_000_000)
y = rng.random(500_000_000)

t0 = time.time()
# pi ~= 4 x (fraction of points in circle)
pi = ((x-1)**2 + (y-1)**2 < 1).sum() * (4 / len(x))
t1 = time.time()

print(pi)
print(t1 - t0, 'seconds')

The expression (x-1)**2 + (y-1)**2 < 1 gets fused by Codon so that it is executed in just a single pass over the x and y arrays, rather than in multiple passes for each sub-expression x-1, y-1 etc. as is the case with standard NumPy.

Here are the resulting timings on an M1 MacBook Pro:

  • Python / standard NumPy: 2.4 seconds
  • Codon: 0.42 seconds (6x speedup)

You can display information about fused expressions by using the -npfuse-verbose flag of codon, as in codon run -release -npfuse-verbose pi.py. Here's the output for the program above:

Optimizing expression at pi.py:10:7
lt <array[bool, 1]> [cost=6]
  add <array[f64, 1]> [cost=5]
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a0 <array[f64, 1]>
        a1 <i64>
      a2 <i64>
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a3 <array[f64, 1]>
        a4 <i64>
      a5 <i64>
  a6 <i64>

-> static fuse:
lt <array[bool, 1]> [cost=6]
  add <array[f64, 1]> [cost=5]
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a0 <array[f64, 1]>
        a1 <i64>
      a2 <i64>
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a3 <array[f64, 1]>
        a4 <i64>
      a5 <i64>
  a6 <i64>

As shown, the optimization pass employs a cost model to decide how to best handle a given expression, be it by fusing or evaluating sequentially. You can adjust the fusion cost thresholds via the following flags:

  • -npfuse-always <cost1>: expression cost at or below which to always fuse a given expression (default: 10).
  • -npfuse-never <cost2>: expression cost above which to never fuse a given expression (default: 50).

Given an expression cost C, the logic implemented in the pass is to:

  • Always fuse expressions where C <= cost1.
  • Fuse expressions where cost1 < C <= cost2 if there is no broadcasting involved.
  • Never fuse expressions where C > cost2 and instead evaluate them sequentially.

This logic is applied recursively to a given expression to determine the overall evaluation strategy. In the output above, for instance, the full expression has cost 6, which falls below the default always-fuse threshold of 10, so the entire expression is fused.

You can disable these optimizations altogether by disabling the corresponding compiler pass via the flag -disable-opt core-numpy-fusion.

I/O

Codon-NumPy supports most of NumPy's I/O API. One important difference, however, is that I/O functions must specify the dtype and dimension of arrays being read, since Codon-NumPy array types are parameterized by dtype and dimension:

import numpy as np

a = np.arange(27, dtype=np.int16).reshape(3, 3, 3)
np.save('arr.npy', a)

# Notice the 'dtype' and 'ndim' arguments:
b = np.load('arr.npy', dtype=np.int16, ndim=3)

Writing arrays has no such requirement.

Datetimes

Codon-NumPy fully supports NumPy's datetime types: datetime64 and timedelta64. One difference from standard NumPy is how these types are specified. Here's an example:

import numpy as np

# datetime64 type with units of "1 day"
# same as "dtype='datetime64[D]'" in standard NumPy
dt = np.array(['2020-01-02', '2021-09-15', '2022-07-01'],
              dtype=np.datetime64['D', 1])

# timedelta64 type with units of "15 minutes"
# same as "dtype='timedelta64[15m]'" in standard NumPy
td = np.array([100, 200, 300], dtype=np.timedelta64['m', 15])
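
Arithmetic on these types then behaves as in standard NumPy; for example (a minimal sketch):

# subtracting two datetime64 values yields a timedelta64
print(dt[1] - dt[0])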

Passing array data to C/C++

You can pass an ndarray's underlying data pointer to a C/C++ function by using the data attribute of the array. For example:

import numpy as np

from C import foo(p: Ptr[float], n: int)

arr = np.array([1.0, 2.0, 3.0])
foo(arr.data, arr.size)

Of course, it's the caller's responsibility to make sure the array is contiguous as needed and/or pass additional shape or stride information. See the C interoperability docs for more information.

Array ABI

The ndarray[dtype, ndim] data structure has three fields, in the following order:

  • shape: length-ndim tuple of non-negative 64-bit integers representing the array shape
  • strides: length-ndim tuple of 64-bit integers representing the stride in bytes along each axis of the array
  • data: pointer of type dtype to the array's data

For example, ndarray[np.float32, 3] would correspond to the following C structure:

struct ndarray_float32_3 {
  int64_t shape[3];
  int64_t strides[3];
  float *data;
};

This can be used to pass an entire ndarray object to a C function without breaking it up into its constituent components.
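For example, a minimal sketch (sum3d is a hypothetical C function with signature double sum3d(struct ndarray_float32_3 arr), taking the struct above by value):

import numpy as np

# bind the hypothetical C function; the ndarray maps to the struct above
from C import sum3d(np.ndarray[np.float32, 3]) -> float

arr = np.zeros((2, 3, 4), dtype=np.float32)
print(sum3d(arr))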

Performance tips

Array layouts

As with standard NumPy, Codon-NumPy performs best when array data is contiguous in memory, ideally in row-major order (also called "C order"). Most NumPy functions will return C-order arrays, but operations like slicing and transposing arrays can alter contiguity. You can use numpy.ascontiguousarray() to create a contiguous array from an arbitrary array.
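For example (a minimal sketch):

import numpy as np

a = np.arange(16, dtype=float).reshape(4, 4)
b = a.T                      # transpose: same data, but no longer C-contiguous
c = np.ascontiguousarray(b)  # a contiguous, C-order copy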

Linux huge pages

When working with large arrays on Linux, enabling transparent hugepages can result in significant performance improvements.

You can check if transparent hugepages are enabled via

cat /sys/kernel/mm/transparent_hugepage/enabled

and you can enable them via

echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Disabling exceptions

By default, Codon performs various validation checks at runtime (e.g. bounds checks when indexing an array) just like standard NumPy, and raises an exception if they fail. If you know your program will not raise or catch any exceptions, you can disable these checks through the -disable-exceptions compiler flag.

Note that when using this flag, raising an exception will terminate the process with a SIGTRAP.

Fast-math

You can enable "fast-math" optimizations via the -fast-math compiler flag. Use this flag with caution, as it changes floating-point semantics and makes assumptions regarding inf and nan values. For more information, consult LLVM's documentation on fast-math flags.
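
For example, both of the flags above can be combined on the command line (a sketch, reusing pi.py from earlier):

codon run -release -disable-exceptions -fast-math pi.py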

Not-yet-supported

The following features of NumPy are not yet supported, but are planned for the future:

  • String operations
  • Masked arrays
  • Polynomials

A few miscellaneous Python-specific functions like get_include() are also not supported, as they are not applicable in Codon.