Codon ships with a feature-complete, fully-compiled native NumPy implementation.
It uses the same API as NumPy, but re-implements everything in Codon itself,
allowing for a range of optimizations and performance improvements. Codon-NumPy
works with Codon's Python interoperability (you can transfer arrays to and from
regular Python seamlessly), parallel backend (you can do array operations in
parallel), and GPU backend (you can transfer arrays to and from the GPU
seamlessly, and operate on them on the GPU).

# Getting started

Importing `numpy` in Codon will use Codon-NumPy (as opposed to
`from python import numpy`, which would use standard NumPy):

``` python
import numpy as np
```

We can then create and manipulate arrays just like in standard NumPy:

``` python
x = np.arange(15, dtype=np.int64).reshape(3, 5)
print(x)
# 0 1 2 3 4
# 5 6 7 8 9
# 10 11 12 13 14

x[1:, ::2] = -99
print(x)
# 0 1 2 3 4
# -99 6 -99 8 -99
# -99 11 -99 13 -99

y = x.max(axis=1)
print(y)
# 4 8 13
```

In Codon-NumPy, any Codon type can be used as the array type. The `numpy`
module has the same aliases that regular NumPy has, like `np.int64`,
`np.float32` etc., but these simply refer to the regular Codon types.

{% hint style="warning" %}
Using a string (e.g. `"i4"` or `"f8"`) for the dtype is not yet supported.
{% endhint %}
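
For example, here is a minimal sketch showing Codon's native types and the
NumPy aliases used interchangeably as dtypes:

``` python
import numpy as np

a = np.zeros(3, dtype=np.float32)  # np.float32 is Codon's `float32` type
b = np.zeros(3, dtype=float)       # Codon's 64-bit `float`, same as np.float64
c = np.zeros(3, dtype=np.int64)    # same as Codon's `int`
```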

# Codon array type

The Codon array type is parameterized by the array data type ("`dtype`")
and the array dimension ("`ndim`"). That means that, in Codon-NumPy, the
array dimension is a property of the type, so a 1-d array is a different
type than a 2-d array and so on:

``` python
import numpy as np

arr = np.array([[1.1, 2.2], [3.3, 4.4]])
print(arr.__class__.__name__)  # ndarray[float,2]

arr = np.arange(10)
print(arr.__class__.__name__)  # ndarray[int,1]
```

The array dimension must also be known at compile time. This allows the
compiler to perform a wider range of optimizations on array operations.
Usually, this has no impact on the code, as the NumPy functions can
determine input and output dimensions automatically. However, the dimension
(and dtype) must be given when, for instance, reading arrays from disk:

``` python
# 'dtype' argument specifies array type
# 'ndim' argument specifies array dimension
arr = np.load('arr.npy', dtype=float, ndim=3)
```

A very limited number of NumPy functions return an array whose dimension
cannot be deduced from its inputs. One such example is `squeeze()`, which
removes axes of length 1; since the number of length-1 axes is not known
at compile time, this function requires an extra argument that indicates
which axes to remove.
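
A minimal sketch, assuming the axes are passed via NumPy's standard `axis`
argument:

``` python
import numpy as np

a = np.zeros((3, 1, 4))
b = a.squeeze(axis=1)  # axis given explicitly, so 'b' is known to be 2-d
print(b.shape)         # (3, 4)
```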

# Python interoperability

Codon's `ndarray` type supports Codon's standard Python interoperability API
(i.e. the `__to_py__` and `__from_py__` methods), so arrays can be transferred
to and from Python seamlessly.

## PyTorch integration

Because PyTorch tensors and NumPy arrays are interchangeable without copying
data, it is easy to use Codon to efficiently manipulate or operate on PyTorch
tensors. This can be achieved either via Codon's just-in-time (JIT) compilation
mode or via its Python extension mode.

### Using Codon JIT

Here is an example that initializes a $$128 \times 128 \times 128$$ tensor
$$A$$ such that $$A_{i,j,k} = i + j + k$$:

``` python
import numpy as np
import time
import codon
import torch

@codon.jit
def initialize(arr):
    for i in range(128):
        for j in range(128):
            for k in range(128):
                arr[i, j, k] = i + j + k

# first call JIT-compiles; subsequent calls use cached JIT'd code
tensor = torch.empty(128, 128, 128)
initialize(tensor.numpy())

tensor = torch.empty(128, 128, 128)
t0 = time.time()
initialize(tensor.numpy())
t1 = time.time()

print(tensor)
print(t1 - t0, 'seconds')
```

Timings on an M1 MacBook Pro:

- Without `@codon.jit`: 0.1645 seconds
- With `@codon.jit`: 0.001485 seconds (*110x speedup*)

For more information, see the [Codon JIT docs](../interop/decorator.md).

### Using Codon Python extensions

Codon can compile directly to a Python extension module, similar to writing a C
extension for CPython or using Cython.

Taking the same example, we can create a file `init.py`:

``` python
import numpy as np
import numpy.pybridge

def initialize(arr: np.ndarray[np.float32, 3]):
    for i in range(128):
        for j in range(128):
            for k in range(128):
                arr[i, j, k] = i + j + k
```

Note that extension module functions need to specify argument types. In this case,
the argument is a 3-dimensional array of type `float32`, which is expressed as
`np.ndarray[np.float32, 3]` in Codon.

Now we can use a setup script `setup.py` to create the extension module as described
in the [Codon Python extension docs](../interop/pyext.md):

``` bash
python3 setup.py build_ext --inplace  # setup.py from docs linked above
```

Finally, we can call the function from Python:

``` python
from codon_initialize import initialize
import torch

tensor = torch.empty(128, 128, 128)
initialize(tensor.numpy())
print(tensor)
```

Note that no compilation happens at runtime with this approach. Instead,
everything is compiled ahead of time when the extension is built, and the
timing matches that of the JIT approach above.

You can also use any Codon compilation flags with this approach by adding them
to the `spawn` call in the setup script. For example, you can use the
`-disable-exceptions` flag to disable runtime exceptions, which can yield
performance improvements and generate more streamlined code.
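
A hypothetical excerpt from such a setup script (`codon_path` and the output
names follow the setup script in the linked docs; the exact invocation may
differ):

``` python
# in setup.py: extra codon flags are appended to the build command
self.spawn([codon_path, "build", "-release", "-pyext", "-disable-exceptions",
            "-o", "initialize.o", "-module", "codon_initialize", "init.py"])
```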

# Parallel processing

Unlike Python, Codon has no global interpreter lock ("GIL") and supports full
multithreading, meaning NumPy code can be parallelized. For example:

``` python
import numpy as np
import numpy.random as rnd
import time

N = 100000000
n = 10

rng = rnd.default_rng(seed=0)
x = rng.normal(size=(N, n))
y = np.empty(n)

t0 = time.time()

@par(num_threads=n)
for i in range(n):
    y[i] = x[:, i].sum()

t1 = time.time()

print(y)
print(t1 - t0, 'seconds')
# without @par: 1.4s
# with @par:    0.4s
```

# GPU processing

Codon-NumPy supports seamless GPU processing: arrays can be passed to and
from the GPU, and array operations can be performed on the GPU using Codon's
GPU backend. Here's an example that computes the Mandelbrot set:

``` python
import numpy as np
import gpu

MAX = 1000  # maximum Mandelbrot iterations
N = 4096    # width and height of image
pixels = np.empty((N, N), int)

def scale(x, a, b):
    return a + (x/N)*(b - a)

@gpu.kernel
def mandelbrot(pixels):
    i = (gpu.block.x * gpu.block.dim.x) + gpu.thread.x
    j = (gpu.block.y * gpu.block.dim.y) + gpu.thread.y
    c = complex(scale(j, -2.00, 0.47), scale(i, -1.12, 1.12))
    z = 0j
    iteration = 0

    while abs(z) <= 2 and iteration < MAX:
        z = z**2 + c
        iteration += 1

    pixels[i, j] = 255 * iteration/MAX

mandelbrot(pixels, grid=(N//32, N//32), block=(32, 32))
```

Here is the same code using GPU-parallelized `for`-loops:

``` python
import numpy as np
import gpu

MAX = 1000  # maximum Mandelbrot iterations
N = 4096    # width and height of image
pixels = np.empty((N, N), int)

def scale(x, a, b):
    return a + (x/N)*(b - a)

@par(gpu=True, collapse=2)  # <--
for i in range(N):
    for j in range(N):
        c = complex(scale(j, -2.00, 0.47), scale(i, -1.12, 1.12))
        z = 0j
        iteration = 0

        while abs(z) <= 2 and iteration < MAX:
            z = z**2 + c
            iteration += 1

        pixels[i, j] = 255 * iteration/MAX
```

# Linear algebra

Codon-NumPy fully supports NumPy's linear algebra module, which provides
a comprehensive set of functions for linear algebra operations. Import
the linear algebra module just like in standard NumPy:

``` python
import numpy.linalg as LA
```

For example, the `eig()` function computes the eigenvalues and eigenvectors
of a square matrix:

``` python
eigenvalues, eigenvectors = LA.eig(np.diag((1, 2, 3)))
print(eigenvalues)
# [1.+0.j 2.+0.j 3.+0.j]

print(eigenvectors)
# [[1.+0.j 0.+0.j 0.+0.j]
#  [0.+0.j 1.+0.j 0.+0.j]
#  [0.+0.j 0.+0.j 1.+0.j]]
```

Just like standard NumPy, Codon will use an optimized BLAS library under the
hood to implement many linear algebra operations. This defaults to OpenBLAS
on Linux and Apple's Accelerate framework on macOS.

Because Codon supports full multithreading, it's possible to use outer-loop
parallelism to perform linear algebra operations in parallel. Here's an example
that multiplies several matrices in parallel:

``` python
import numpy as np
import numpy.random as rnd
import time

N = 5000
n = 10
rng = rnd.default_rng(seed=0)
a = rng.normal(size=(n, N, N))
b = rng.normal(size=(n, N, N))
y = np.empty((n, N, N))
t0 = time.time()

@par(num_threads=n)
for i in range(n):
    y[i, :, :] = a[i, :, :] @ b[i, :, :]

t1 = time.time()
print(y.sum())
print(t1 - t0, 'seconds')
# Python - 53s
# Codon  -  6s
```

{% hint style="warning" %}
When using Codon's outer-loop parallelism, make sure to set the environment
variable `OPENBLAS_NUM_THREADS` to 1 (i.e. `export OPENBLAS_NUM_THREADS=1`)
to avoid conflicts with OpenBLAS multithreading.
{% endhint %}

# NumPy-specific compiler optimizations

Codon includes compiler passes that optimize NumPy code through methods like
operator fusion, which combines distinct operations so that they can be executed
in a single pass over the argument arrays, saving both execution time and
memory (since intermediate arrays no longer need to be allocated).

To showcase this, here's a simple NumPy program that approximates $$\pi$$.
The code below generates two random vectors $$x$$ and $$y$$ with entries in the
range $$[0, 1)$$ and computes the fraction of points $$(x_i, y_i)$$ that lie
within distance $$1$$ of the point $$(1, 1)$$. Since the points are drawn
uniformly from the unit square, this fraction approximates the area of the
quarter of the unit disk that overlaps the square, which is $$\pi \over 4$$.

``` python
import time
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.random(500_000_000)
y = rng.random(500_000_000)

t0 = time.time()
# pi ~= 4 x (fraction of points in circle)
pi = ((x-1)**2 + (y-1)**2 < 1).sum() * (4 / len(x))
t1 = time.time()

print(pi)
print(t1 - t0, 'seconds')
```

The expression `(x-1)**2 + (y-1)**2 < 1` gets fused by Codon so that it is
executed in just a single pass over the `x` and `y` arrays, rather than in
multiple passes for each sub-expression `x-1`, `y-1` etc., as is the case with
standard NumPy.

Here are the resulting timings on an M1 MacBook Pro:

- Python / standard NumPy: 2.4 seconds
- Codon: 0.42 seconds (*6x speedup*)

You can display information about fused expressions by using the `-npfuse-verbose`
flag of `codon`, as in `codon run -release -npfuse-verbose pi.py`. Here's the
output for the program above:

```
Optimizing expression at pi.py:10:7
lt <array[bool, 1]> [cost=6]
  add <array[f64, 1]> [cost=5]
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a0 <array[f64, 1]>
        a1 <i64>
      a2 <i64>
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a3 <array[f64, 1]>
        a4 <i64>
      a5 <i64>
  a6 <i64>

-> static fuse:
lt <array[bool, 1]> [cost=6]
  add <array[f64, 1]> [cost=5]
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a0 <array[f64, 1]>
        a1 <i64>
      a2 <i64>
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a3 <array[f64, 1]>
        a4 <i64>
      a5 <i64>
  a6 <i64>
```

As shown, the optimization pass employs a cost model to decide how best to handle
a given expression, be it by fusing or by evaluating sequentially. You can adjust
the fusion cost thresholds via the following flags:

- `-npfuse-always <cost1>`: Expression cost below which to always fuse a given
  expression (default: `10`).
- `-npfuse-never <cost2>`: Expression cost above which (>) to never fuse a given
  expression (default: `50`).

Given an expression cost `C`, the logic implemented in the pass is to:

- Always fuse expressions where `C <= cost1`.
- Fuse expressions where `cost1 < C <= cost2` if there is no broadcasting involved.
- Never fuse expressions where `C > cost2` and instead evaluate them sequentially.

This logic is applied recursively to a given expression to determine the optimal
evaluation strategy.
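
For example, to raise both thresholds (the values here are purely illustrative):

``` bash
codon run -release -npfuse-always 20 -npfuse-never 100 pi.py
```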

You can disable these optimizations altogether by disabling the corresponding
compiler pass via the flag `-disable-opt core-numpy-fusion`.

# I/O

Codon-NumPy supports most of NumPy's I/O API. One important difference, however,
is that I/O functions must specify the dtype and dimension of the arrays being
read, since Codon-NumPy array types are parameterized by dtype and dimension:

``` python
import numpy as np

a = np.arange(27, dtype=np.int16).reshape(3, 3, 3)
np.save('arr.npy', a)

# Notice the 'dtype' and 'ndim' arguments:
b = np.load('arr.npy', dtype=np.int16, ndim=3)
```

Writing arrays has no such requirement.

# Datetimes

Codon-NumPy fully supports NumPy's datetime types: `datetime64` and `timedelta64`.
One difference from standard NumPy is how these types are specified. Here's an
example:

``` python
# datetime64 type with units of "1 day"
# same as "dtype='datetime64[D]'" in standard NumPy
dt = np.array(['2020-01-02', '2021-09-15', '2022-07-01'],
              dtype=np.datetime64['D', 1])

# timedelta64 type with units of "15 minutes"
# same as "dtype='timedelta64[15m]'" in standard NumPy
td = np.array([100, 200, 300], dtype=np.timedelta64['m', 15])
```
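
Once constructed, these arrays behave like their standard-NumPy counterparts.
For instance, subtracting datetimes yields timedeltas (a small sketch):

``` python
import numpy as np

dt = np.array(['2020-01-02', '2021-09-15', '2022-07-01'],
              dtype=np.datetime64['D', 1])
delta = dt[1:] - dt[:-1]  # timedelta64 values, in units of days
print(delta)
```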

# Passing array data to C/C++

You can pass an `ndarray`'s underlying data pointer to a C/C++ function by using
the `data` attribute of the array. For example:

``` python
import numpy as np

from C import foo(p: Ptr[float], n: int)

arr = np.array([1.0, 2.0, 3.0])
foo(arr.data, arr.size)
```

Of course, it's the caller's responsibility to make sure the array is contiguous
as needed and/or to pass additional shape or stride information. See the
[C interoperability](../interop/cpp.md) docs for more information.

## Array ABI

The `ndarray[dtype, ndim]` data structure has three fields, in the following order:

- `shape`: length-`ndim` tuple of non-negative 64-bit integers representing the
  array shape
- `strides`: length-`ndim` tuple of 64-bit integers representing the stride in
  bytes along each axis of the array
- `data`: pointer of type `dtype` to the array's data

For example, `ndarray[np.float32, 3]` would correspond to the following C structure:

``` c
struct ndarray_float32_3 {
    int64_t shape[3];
    int64_t strides[3];
    float *data;
};
```

This can be used to pass an entire `ndarray` object to a C function without
breaking it up into its constituent components.
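
For instance, a C function with signature
`double sum3d(struct ndarray_float32_3 arr)` might be declared and called as in
the hedged sketch below; `sum3d` is hypothetical, and the declaration assumes
the `from C import` syntax shown above accepts array types directly:

``` python
import numpy as np

# 'sum3d' is hypothetical; it receives the shape/strides/data struct by value
from C import sum3d(arr: np.ndarray[np.float32, 3]) -> float

x = np.zeros((4, 4, 4), dtype=np.float32)
print(sum3d(x))
```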

# Performance tips

## Array layouts

As with standard NumPy, Codon-NumPy performs best when array data is contiguous in
memory, ideally in row-major order (also called "C order"). Most NumPy functions
return C-order arrays, but operations like slicing and transposing arrays can alter
contiguity. You can use `numpy.ascontiguousarray()` to create a contiguous array
from an arbitrary array.
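
For example, a transposed array is typically not C-contiguous, and can be copied
into row-major order before a performance-critical loop (a minimal sketch):

``` python
import numpy as np

a = np.arange(16.0).reshape(4, 4)
t = a.T                      # transpose: a view that is not C-contiguous
c = np.ascontiguousarray(t)  # contiguous, row-major copy of the data
```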

## Linux huge pages

When working with large arrays on Linux, enabling
[transparent hugepages](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html)
can result in significant performance improvements.

You can check whether transparent hugepages are enabled via

``` bash
cat /sys/kernel/mm/transparent_hugepage/enabled
```

and you can enable them via

``` bash
echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

## Disabling exceptions

By default, Codon performs various validation checks at runtime (e.g. bounds checks
when indexing an array), just like standard NumPy, and raises an exception if they
fail. If you know your program will not raise or catch any exceptions, you can
disable these checks through the `-disable-exceptions` compiler flag.
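
For example, assuming a program `prog.py`:

``` bash
codon build -release -disable-exceptions prog.py
```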

Note that when using this flag, raising an exception will terminate the process
with a `SIGTRAP`.

## Fast-math

You can enable "fast-math" optimizations via the `-fast-math` compiler flag. It is
advisable to **use this flag with caution**, as it changes floating-point semantics
and makes assumptions regarding `inf` and `nan` values. For more information, consult
LLVM's documentation on [fast-math flags](https://llvm.org/docs/LangRef.html#fast-math-flags).

# Not-yet-supported

The following NumPy features are not yet supported, but are planned for the future:

- String operations
- Masked arrays
- Polynomials

A few miscellaneous Python-specific functions like `get_include()` are also not
supported, as they are not applicable in Codon.