Codon ships with a feature-complete, fully-compiled native NumPy implementation. It uses the same API as NumPy, but re-implements everything in Codon itself, allowing for a range of optimizations and performance improvements. Codon-NumPy works with Codon's Python interoperability (you can transfer arrays to and from regular Python seamlessly), parallel backend (you can do array operations in parallel), and GPU backend (you can transfer arrays to and from the GPU seamlessly, and operate on them on the GPU). # Getting started Importing `numpy` in Codon will use Codon-NumPy (as opposed to `from python import numpy`, which would use standard NumPy): ``` python import numpy as np ``` We can then create and manipulate arrays just like in standard NumPy: ``` python x = np.arange(15, dtype=np.int64).reshape(3, 5) print(x) # 0 1 2 3 4 # 5 6 7 8 9 # 10 11 12 13 14 x[1:, ::2] = -99 # 0 1 2 3 4 # -99 6 -99 8 -99 # -99 11 -99 13 -99 y = x.max(axis=1) print(y) # 4 8 13 ``` In Codon-NumPy, any Codon type can be used as the array type. The `numpy` module has the same aliases that regular NumPy has, like `np.int64`, `np.float32` etc., but these simply refer to the regular Codon types. {% hint style="warning" %} Using a string (e.g. `"i4"` or `"f8"`) for the dtype is not yet supported. {% endhint %} # Codon array type The Codon array type is parameterized by the array data type ("`dtype`") and the array dimension ("`ndim`"). That means that, in Codon-NumPy, the array dimension is a property of the type, so a 1-d array is a different type than a 2-d array and so on: ``` python import numpy as np arr = np.array([[1.1, 2.2], [3.3, 4.4]]) print(arr.__class__.__name__) # ndarray[float,2] arr = np.arange(10) print(arr.__class__.__name__) # ndarray[int,1] ``` The array dimension must also be known at compile-time. This allows the compiler to perform a wider range of optimizations on array operations. Usually, this has no impact on the code as the NumPy functions can determine input and output dimensions automatically. However, the dimension (and dtype) must be given when, for instance, reading arrays from disk: ``` python # 'dtype' argument specifies array type # 'ndim' argument specifies array dimension arr = np.load('arr.npy', dtype=float, ndim=3) ``` A very limited number of NumPy functions return an array whose dimension cannot be deduced from its inputs. One such example is `squeeze()`, which removes axes of length 1; since the number of axes of length 1 is not determinable at compile-time, this function requires an extra argument that indicates which axes to remove. # Python interoperability Codon's `ndarray` type supports Codon's standard Python interoperability API (i.e. `__to_py__` and `__from_py__` methods), so arrays can be transferred to and from Python seamlessly. ## PyTorch integration Because PyTorch tensors and NumPy arrays are interchangeable without copying data, it is easy to use Codon to efficiently manipulate or operate on PyTorch tensors. This can be achieved either via Codon's just-in-time (JIT) compilation mode or via its Python extension mode. ### Using Codon JIT Here is an example showing initializing a $$128 \times 128 \times 128$$ tensor $$A$$ such that $$A_{i,j,k} = i + j + k$$: ``` python import numpy as np import time import codon import torch @codon.jit def initialize(arr): for i in range(128): for j in range(128): for k in range(128): arr[i, j, k] = i + j + k # first call JIT-compiles; subsequent calls use cached JIT'd code tensor = torch.empty(128, 128, 128) initialize(tensor.numpy()) tensor = torch.empty(128, 128, 128) t0 = time.time() initialize(tensor.numpy()) t1 = time.time() print(tensor) print(t1 - t0, 'seconds') ``` Timings on an M1 MacBook Pro: - Without `@codon.jit`: 0.1645 seconds - With `@codon.jit`: 0.001485 seconds (*110x speedup*) For more information, see the [Codon JIT docs](../interop/decorator.md). ### Using Codon Python extensions Codon can compile directly to a Python extension module, similar to writing a C extension for CPython or using Cython. Taking the same example, we can create a file `init.py`: ``` python import numpy as np import numpy.pybridge def initialize(arr: np.ndarray[np.float32, 3]): for i in range(128): for j in range(128): for k in range(128): arr[i, j, k] = i + j + k ``` Note that extension module functions need to specify argument types. In this case, the argument is a 3-dimensional array of type `float32`, which is expressed as `np.ndarray[np.float32, 3]` in Codon. Now we can use a setup script `setup.py` to create the extension module as described in the [Codon Python extension docs](../interop/pyext.md): ``` bash python3 setup.py build_ext --inplace # setup.py from docs linked above ``` Finally, we can call the function from Python: ``` python from codon_initialize import initialize import torch tensor = torch.empty(128, 128, 128) initialize(tensor.numpy()) print(tensor) ``` Note that there is no compilation happening at runtime with this approach. Instead, everything is compiled ahead of time when creating the extension. The timing is the same as the first approach. You can also use any Codon compilation flags with this approach by adding them to the `spawn` call in the setup script. For example, you can use the `-disable-exceptions` flag to disable runtime exceptions, which can yield performance improvements and generate more streamlined code. # Parallel processing Unlike Python, Codon has no global interpreter lock ("GIL") and supports full multithreading, meaning NumPy code can be parallelized. For example: ``` python import numpy as np import numpy.random as rnd import time N = 100000000 n = 10 rng = rnd.default_rng(seed=0) x = rng.normal(size=(N,n)) y = np.empty(n) t0 = time.time() @par(num_threads=n) for i in range(n): y[i] = x[:,i].sum() t1 = time.time() print(y) print(t1 - t0, 'seconds') # no par - 1.4s # w/ par - 0.4s ``` # GPU processing Codon-NumPy supports seamless GPU processing: arrays can be passed to and from the GPU, and array operations can be performed on the GPU using Codon's GPU backend. Here's an example that computes the Mandelbrot set: ``` python import numpy as np import gpu MAX = 1000 # maximum Mandelbrot iterations N = 4096 # width and height of image pixels = np.empty((N, N), int) def scale(x, a, b): return a + (x/N)*(b - a) @gpu.kernel def mandelbrot(pixels): i = (gpu.block.x * gpu.block.dim.x) + gpu.thread.x j = (gpu.block.y * gpu.block.dim.y) + gpu.thread.y c = complex(scale(j, -2.00, 0.47), scale(i, -1.12, 1.12)) z = 0j iteration = 0 while abs(z) <= 2 and iteration < MAX: z = z**2 + c iteration += 1 pixels[i, j] = 255 * iteration/MAX mandelbrot(pixels, grid=(N//32, N//32), block=(32, 32)) ``` Here is the same code using GPU-parallelized `for`-loops: ``` python import numpy as np import gpu MAX = 1000 # maximum Mandelbrot iterations N = 4096 # width and height of image pixels = np.empty((N, N), int) def scale(x, a, b): return a + (x/N)*(b - a) @par(gpu=True, collapse=2) # <-- for i in range(N): for j in range(N): c = complex(scale(j, -2.00, 0.47), scale(i, -1.12, 1.12)) z = 0j iteration = 0 while abs(z) <= 2 and iteration < MAX: z = z**2 + c iteration += 1 pixels[i, j] = 255 * iteration/MAX ``` # Linear algebra Codon-NumPy fully supports the NumPy linear algebra module which provides a comprehensive set of functions for linear algebra operations. Importing the linear algebra module, just like in standard NumPy: ``` python import numpy.linalg as LA ``` For example, the `eig()` function computes the eigenvalues and eigenvectors of a square matrix: ``` python eigenvalues, eigenvectors = LA.eig(np.diag((1, 2, 3))) print(eigenvalues) # 1.+0.j 2.+0.j 3.+0.j print(eigenvectors) # [[1.+0.j 0.+0.j 0.+0.j] # [0.+0.j 1.+0.j 0.+0.j] # [0.+0.j 0.+0.j 1.+0.j]] ``` Just like standard NumPy, Codon will use an optimized BLAS library under the hood to implement many linear algebra operations. This defaults to OpenBLAS on Linux and Apple's Accelerate framework on macOS. Because Codon supports full multithreading, it's possible to use outer-loop parallelism to perform linear algebra operations in parallel. Here's an example that multiplies several matrices in parallel: ``` python import numpy as np import numpy.random as rnd import time N = 5000 n = 10 rng = rnd.default_rng(seed=0) a = rng.normal(size=(n, N, N)) b = rng.normal(size=(n, N, N)) y = np.empty((n, N, N)) t0 = time.time() @par(num_threads=n) for i in range(n): y[i, :, :] = a[i, :, :] @ b[i, :, :] t1 = time.time() print(y.sum()) print(t1 - t0, 'seconds') # Python - 53s # Codon - 6s ``` {% hint style="warning" %} When using Codon's outer-loop parallelism, make sure to set the environment variable `OPENBLAS_NUM_THREADS` to 1 (i.e. `export OPENBLAS_NUM_THREADS=1`) to avoid conflicts with OpenBLAS multithreading. {% endhint %} # NumPy-specific compiler optimizations Codon includes compiler passes that optimize NumPy code through methods like operator fusion, which combine distinct operations so that they can be executed during a single pass through the argument arrays, saving both execution time and memory (since intermediate arrays no longer need to be allocated). To showcase this, here's a simple NumPy program that approximates $$\pi$$. The code below generates two random vectors $$x$$ and $$y$$ with entries in the range $$[0, 1)$$ and computes the fraction of pairs of points that lie in the circle of radius $$0.5$$ centered at $$(0.5, 0.5)$$, which is approximately $$\pi \over 4$$. ``` python import time import numpy as np rng = np.random.default_rng(seed=0) x = rng.random(500_000_000) y = rng.random(500_000_000) t0 = time.time() # pi ~= 4 x (fraction of points in circle) pi = ((x-1)**2 + (y-1)**2 < 1).sum() * (4 / len(x)) t1 = time.time() print(pi) print(t1 - t0, 'seconds') ``` The expression `(x-1)**2 + (y-1)**2 < 1` gets fused by Codon so that it is executed in just a single pass over the `x` and `y` arrays, rather than in multiple passes for each sub-expression `x-1`, `y-1` etc. as is the case with standard NumPy. Here are the resulting timings on an M1 MacBook Pro: - Python / standard NumPy: 2.4 seconds - Codon: 0.42 seconds (*6x speedup*) You can display information about fused expressions by using the `-npfuse-verbose` flag of `codon`, as in `codon run -release -npfuse-verbose pi.py`. Here's the output for the program above: ``` Optimizing expression at pi.py:10:7 lt [cost=6] add [cost=5] pow [cost=2] sub [cost=1] a0 a1 a2 pow [cost=2] sub [cost=1] a3 a4 a5 a6 -> static fuse: lt [cost=6] add [cost=5] pow [cost=2] sub [cost=1] a0 a1 a2 pow [cost=2] sub [cost=1] a3 a4 a5 a6 ``` As shown, the optimization pass employs a cost model to decide how to best handle a given expression, be it by fusing or evaluating sequentially. You can adjust the fusion cost thresholds via the following flags: - `-npfuse-always `: Expression cost below which to always fuse a given expression (default: `10`). - `-npfuse-never `: Expression cost above which (>) to never fuse a given expression (default: `50`). Given an expression cost `C`, the logic implemented in the pass is to: - Always fuse expressions where `C <= cost1`. - Fuse expressions where `cost1 < C <= cost2` if there is no broadcasting involved. - Never fuse expressions where `C > cost2` and instead evaluate them sequentially. This logic is applied recursively to a given expression to determine the optimal evaluation strategy. You can disable these optimizations altogether by disabling the corresponding compiler pass via the flag `-disable-opt core-numpy-fusion`. # I/O Codon-NumPy supports most of NumPy's I/O API. One important difference, however, is that I/O functions must specify the dtype and dimension of arrays being read, since Codon-NumPy array types are parameterized by dtype and dimension: ``` python import numpy as np a = np.arange(27, dtype=np.int16).reshape(3, 3, 3) np.save('arr.npy', a) # Notice the 'dtype' and 'ndim' arguments: b = np.load('arr.npy', dtype=np.int16, ndim=3) ``` Writing arrays has no such requirement. # Datetimes Codon-NumPy fully supports NumPy's datetime types: `datetime64` and `timedelta64`. One difference from standard NumPy is how these types are specified. Here's an example: ``` python # datetime64 type with units of "1 day" # same as "dtype='datetime64[D]'" in standard NumPy dt = np.array(['2020-01-02', '2021-09-15', '2022-07-01'], dtype=np.datetime64['D', 1]) # timedelta64 type with units of "15 minutes" # same as "dtype='timedelta64[15m]'" in standard NumPy td = np.array([100, 200, 300], dtype=np.timedelta64['m', 15]) ``` # Passing array data to C/C++ You can pass an `ndarray`'s underlying data pointer to a C/C++ function by using the `data` attribute of the array. For example: ``` python from C import foo(p: Ptr[float], n: int) arr = np.ndarray([1.0, 2.0, 3.0]) foo(arr.data, arr.size) ``` Of course, it's the caller's responsibility to make sure the array is contiguous as needed and/or pass additional shape or stride information. See the [C interoperability](../interop/cpp.md) docs for more information. ## Array ABI The `ndarray[dtype, ndim]` data structure has three fields, in the following order: - `shape`: length-`ndim` tuple of non-negative 64-bit integers representing the array shape - `strides`: length-`ndim` tuple of 64-bit integers representing the stride in bytes along each axis of the array - `data`: pointer of type `dtype` to the array's data For example, `ndarray[np.float32, 3]` would correspond to the following C structure: ``` c struct ndarray_float32_3 { int64_t shape[3]; int64_t strides[3]; float *data; }; ``` This can be used to pass an entire `ndarray` object to a C function without breaking it up into its constituent components. # Performance tips ## Array layouts As with standard NumPy, Codon-NumPy performs best when array data is contiguous in memory, ideally in row-major order (also called "C order"). Most NumPy functions will return C-order arrays, but operations like slicing and transposing arrays can alter contiguity. You can use `numpy.ascontiguousarray()` to create a contiguous array from an arbitrary array. ## Linux huge pages When working with large arrays on Linux, enabling [transparent hugepages](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html) can result in significant performance improvements. You can check if transparent hugepages are enabled via ``` bash cat /sys/kernel/mm/transparent_hugepage/enabled ``` and you can enable them via ``` bash echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled ``` ## Disabling exceptions By default, Codon performs various validation checks at runtime (e.g. bounds checks when indexing an array) just like standard NumPy, and raises an exception if they fail. If you know your program will not raise or catch any exceptions, you can disable these checks through the `-disable-exceptions` compiler flag. Note that when using this flag, raising an exception will terminate the process with a `SIGTRAP`. ## Fast-math You can enable "fast-math" optimizations via the `-fast-math` compiler flag. It is advisable to **use this flag with caution** as it changes floating point semantics and makes assumptions regarding `inf` and `nan` values. For more information, consult LLVM's documentation on [fast-math flags](https://llvm.org/docs/LangRef.html#fast-math-flags). # Not-yet-supported The following features of NumPy are not yet supported, but are planned for the future: - String operations - Masked arrays - Polynomials A few miscellaneous Python-specific functions like `get_include()` are also not supported, as they are not applicable in Codon.