Codon supports parallelism and multithreading via OpenMP out of the box.
Here's an example:
``` python
@par
for i in range(10):
    import threading as thr
    print('hello from thread', thr.get_ident())
```
By default, parallel loops use all available threads, or the number of
threads specified by the `OMP_NUM_THREADS` environment variable. A
specific thread count can also be given directly on the `@par` line:
``` python
@par(num_threads=5)
for i in range(10):
    import threading as thr
    print('hello from thread', thr.get_ident())
```
`@par` supports several OpenMP parameters, including:
- `num_threads` (int): the number of threads to use when running the
loop
- `schedule` (str): either *static*, *dynamic*, *guided*, *auto* or
*runtime*
- `chunk_size` (int): chunk size when partitioning loop iterations
- `ordered` (bool): whether the loop iterations should be executed in
  the same order as in the equivalent sequential loop
- `collapse` (int): number of loop nests to collapse into a single
iteration space
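
For instance, several of these parameters can be combined on a single
`@par` line; here is a sketch with a purely illustrative loop body:

``` python
# Collapse the two loop nests into one 64-iteration space,
# partitioned statically in chunks of 8.
@par(schedule='static', chunk_size=8, collapse=2)
for i in range(8):
    for j in range(8):
        print('cell', i, j)
```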
Other OpenMP parameters, like `private`, `shared` or `reduction`, are
inferred automatically by the compiler. For example, the following loop
``` python
a = 0

@par
for i in range(N):
    a += foo(i)
```
will automatically generate a reduction for variable `a`.
{% hint style="warning" %}
Modifying shared objects like lists or dictionaries within a parallel
section needs to be done with a lock or critical section. See below
for more details.
{% endhint %}
Here is an example that finds the number of primes up to a
user-defined limit, using a parallel loop on 16 threads with a dynamic
schedule and chunk size of 100:
``` python
from sys import argv
def is_prime(n):
    factors = 0
    for i in range(2, n):
        if n % i == 0:
            factors += 1
    return factors == 0

limit = int(argv[1])
total = 0

@par(schedule='dynamic', chunk_size=100, num_threads=16)
for i in range(2, limit):
    if is_prime(i):
        total += 1

print(total)
```
Static schedules work best when each loop iteration takes roughly the
same amount of time, whereas dynamic schedules are superior when each
iteration varies in duration. Since counting the factors of an integer
takes more time for larger integers, we use a dynamic schedule here.
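
By contrast, a loop whose iterations all take roughly the same time is
a natural fit for a static schedule. A minimal sketch (the body is
illustrative):

``` python
total = 0

# Every iteration does the same amount of work, so evenly sized
# static chunks stay balanced across threads.
@par(schedule='static')
for i in range(1000000):
    total += i

print(total)  # 499999500000
```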
`@par` also supports C/C++ OpenMP pragma strings. For example, the
`@par` line in the above example can also be written as:
``` python
# same as: @par(schedule='dynamic', chunk_size=100, num_threads=16)
@par('schedule(dynamic, 100) num_threads(16)')
```
# Different kinds of loops
`for`-loops can iterate over arbitrary generators, but OpenMP's
parallel loop construct only applies to *imperative* for-loops of the
form `for i in range(a, b, c)` (where `c` is constant). For general
parallel for-loops of the form `for i in some_generator()`, a task-based
approach is used instead, where each loop iteration is executed as an
independent task.
The Codon compiler also converts iterations over lists
(`for a in some_list`) to imperative for-loops, meaning these loops can
be executed using OpenMP's loop parallelism.
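
For instance, here is a minimal sketch of the task-based mode (the
`squares` generator and loop body are illustrative, not from the
original docs):

``` python
def squares(n):
    for i in range(n):
        yield i * i

@par
for x in squares(10):  # generator loop: each iteration runs as an independent task
    print('got', x)
```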
# Custom reductions
Codon can automatically generate efficient reductions for `int` and
`float` values. For other data types, user-defined reductions can be
specified. A class that supports reductions must include:
- A default constructor that represents the *zero value*
- An `__add__` method (assuming `+` is used as the reduction operator)
Here is an example for reducing a new `Vector` type:
``` python
@tuple
class Vector:
    x: int
    y: int

    def __new__():
        return Vector(0, 0)

    def __add__(self, other: Vector):
        return Vector(self.x + other.x, self.y + other.y)

v = Vector()

@par
for i in range(100):
    v += Vector(i, i)

print(v)  # (x: 4950, y: 4950)
```
# OpenMP constructs
All of OpenMP's API functions are accessible directly in Codon. For
example:
``` python
import openmp as omp
print(omp.get_num_threads())
omp.set_num_threads(32)
```
OpenMP's *critical*, *master*, *single* and *ordered* constructs can be
applied via the corresponding decorators:
``` python
import openmp as omp

@omp.critical
def only_run_by_one_thread_at_a_time():
    print('critical!', omp.get_thread_num())

@omp.master
def only_run_by_master_thread():
    print('master!', omp.get_thread_num())

@omp.single
def only_run_by_single_thread():
    print('single!', omp.get_thread_num())

@omp.ordered
def run_ordered_by_iteration(i):
    print('ordered!', i)

@par(ordered=True)
for i in range(100):
    only_run_by_one_thread_at_a_time()
    only_run_by_master_thread()
    only_run_by_single_thread()
    run_ordered_by_iteration(i)
```
For finer-grained locking, consider using the locks from the `threading`
module:
``` python
from threading import Lock
lock = Lock()  # or RLock for a reentrant lock

@par
for i in range(100):
    with lock:
        print('only one thread at a time allowed here')
```