Codon supports parallelism and multithreading via OpenMP out of the box. Here's an example:

``` python
@par
for i in range(10):
    import threading as thr
    print('hello from thread', thr.get_ident())
```

By default, parallel loops will use all available threads, or use the number of threads specified by the `OMP_NUM_THREADS` environment variable. A specific thread number can be given directly on the `@par` line as well:

``` python
@par(num_threads=5)
for i in range(10):
    import threading as thr
    print('hello from thread', thr.get_ident())
```

`@par` supports several OpenMP parameters, including:

- `num_threads` (int): the number of threads to use when running the loop
- `schedule` (str): either *static*, *dynamic*, *guided*, *auto* or *runtime*
- `chunk_size` (int): chunk size when partitioning loop iterations
- `ordered` (bool): whether the loop iterations should be executed in the same order
- `collapse` (int): number of loop nests to collapse into a single iteration space

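To make `collapse` concrete, here is a plain-Python sketch (not Codon-specific; `n`, `m` and the index arithmetic are illustrative) of how collapsing two loop nests yields a single iteration space that OpenMP can partition across threads:

``` python
# Flattening two loop nests into one iteration space, as collapse=2
# does conceptually: one loop of n*m iterations replaces the nest.
n, m = 3, 4
flat = []
for k in range(n * m):   # single collapsed iteration space
    i, j = divmod(k, m)  # recover the original (i, j) indices
    flat.append((i, j))

nested = [(i, j) for i in range(n) for j in range(m)]
print(flat == nested)  # True: same iterations, now one partitionable loop
```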
Other OpenMP parameters like `private`, `shared` or `reduction` are inferred automatically by the compiler. For example, the following loop

``` python
a = 0
@par
for i in range(N):
    a += foo(i)
```

will automatically generate a reduction for variable `a`.

{% hint style="warning" %}
Modifying shared objects like lists or dictionaries within a parallel section needs to be done with a lock or critical section. See below for more details.
{% endhint %}

Here is an example that counts the prime numbers up to a user-defined limit, using a parallel loop on 16 threads with a dynamic schedule and chunk size of 100:

``` python
from sys import argv

def is_prime(n):
    factors = 0
    for i in range(2, n):
        if n % i == 0:
            factors += 1
    return factors == 0

limit = int(argv[1])
total = 0

@par(schedule='dynamic', chunk_size=100, num_threads=16)
for i in range(2, limit):
    if is_prime(i):
        total += 1

print(total)
```

Static schedules work best when each loop iteration takes roughly the same amount of time, whereas dynamic schedules are superior when each iteration varies in duration. Since counting the factors of an integer takes more time for larger integers, we use a dynamic schedule here.

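The difference can be illustrated with a plain-Python sketch (the `static_chunks` helper is hypothetical, not part of Codon or OpenMP) of how a static schedule assigns fixed chunks to threads round-robin, regardless of how long each iteration takes:

``` python
# Hypothetical model of OpenMP's static schedule: chunks of iterations
# are dealt to threads round-robin, up front, before the loop runs.
def static_chunks(n_iters, n_threads, chunk):
    assign = {t: [] for t in range(n_threads)}
    for idx, start in enumerate(range(0, n_iters, chunk)):
        t = idx % n_threads  # chunk idx goes to thread idx mod n_threads
        assign[t].extend(range(start, min(start + chunk, n_iters)))
    return assign

print(static_chunks(8, 2, 2))
# {0: [0, 1, 4, 5], 1: [2, 3, 6, 7]}
```

With a dynamic schedule, by contrast, threads grab the next available chunk at run time, so a thread stuck on slow iterations does not hold up work that other threads could take.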
`@par` also supports C/C++ OpenMP pragma strings. For example, the `@par` line in the above example can also be written as:

``` python
# same as: @par(schedule='dynamic', chunk_size=100, num_threads=16)
@par('schedule(dynamic, 100) num_threads(16)')
```

# Different kinds of loops

`for`-loops can iterate over arbitrary generators, but OpenMP's parallel loop construct only applies to *imperative* for-loops of the form `for i in range(a, b, c)` (where `c` is constant). For general parallel for-loops of the form `for i in some_generator()`, a task-based approach is used instead, where each loop iteration is executed as an independent task.

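As a rough analogy (plain Python using `concurrent.futures` rather than OpenMP tasks; `some_generator` and `body` are made-up names), the task-based approach behaves like submitting each iteration of the generator loop to a thread pool:

``` python
# Each iteration becomes an independent task on a worker thread,
# since the total iteration count isn't known up front.
from concurrent.futures import ThreadPoolExecutor

def some_generator():
    yield from (i * i for i in range(10))

def body(x):
    return x + 1  # stand-in for the loop body

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(body, some_generator()))
print(sum(results))  # 295
```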
The Codon compiler also converts iterations over lists (`for a in some_list`) to imperative for-loops, meaning these loops can be executed using OpenMP's loop parallelism.

# Custom reductions

Codon can automatically generate efficient reductions for `int` and `float` values. For other data types, user-defined reductions can be specified. A class that supports reductions must include:

- A default constructor that represents the *zero value*
- An `__add__` method (assuming `+` is used as the reduction operator)

Here is an example for reducing a new `Vector` type:

``` python
@tuple
class Vector:
    x: int
    y: int

    def __new__():
        return Vector(0, 0)

    def __add__(self, other: Vector):
        return Vector(self.x + other.x, self.y + other.y)

v = Vector()
@par
for i in range(100):
    v += Vector(i, i)
print(v)  # (x: 4950, y: 4950)
```

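To see why the zero value and `__add__` are both required, here is a plain-Python sketch (sequential, with threads simulated; not the actual OpenMP code generation) of how a reduction accumulates a private partial per thread and then combines the partials:

``` python
# Simulated reduction: each "thread" starts from the zero value,
# accumulates its own partial, and partials are combined at the end.
class Vector:
    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y)

def reduce_parallel(items, n_threads=4):
    partials = [Vector() for _ in range(n_threads)]  # zero values
    for i, v in enumerate(items):
        partials[i % n_threads] += v  # each thread's private partial
    total = Vector()
    for p in partials:
        total += p  # combine partials after the loop
    return total

v = reduce_parallel([Vector(i, i) for i in range(100)])
print(v.x, v.y)  # 4950 4950
```

Because the partials are private to each thread, no locking is needed inside the loop; synchronization happens only once, when the partials are combined.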
# OpenMP constructs

All of OpenMP's API functions are accessible directly in Codon. For example:

``` python
import openmp as omp
print(omp.get_num_threads())
omp.set_num_threads(32)
```

OpenMP's *critical*, *master*, *single* and *ordered* constructs can be applied via the corresponding decorators:

``` python
import openmp as omp

@omp.critical
def only_run_by_one_thread_at_a_time():
    print('critical!', omp.get_thread_num())

@omp.master
def only_run_by_master_thread():
    print('master!', omp.get_thread_num())

@omp.single
def only_run_by_single_thread():
    print('single!', omp.get_thread_num())

@omp.ordered
def run_ordered_by_iteration(i):
    print('ordered!', i)

@par(ordered=True)
for i in range(100):
    only_run_by_one_thread_at_a_time()
    only_run_by_master_thread()
    only_run_by_single_thread()
    run_ordered_by_iteration(i)
```

For finer-grained locking, consider using the locks from the `threading` module:

``` python
from threading import Lock
lock = Lock()  # or RLock for re-entrant lock

@par
for i in range(100):
    with lock:
        print('only one thread at a time allowed here')
```