Update docs (#1)

* Convert docs to Codon
pull/3/head
Mark Henderson 2021-10-05 10:12:56 -05:00 committed by GitHub
parent 2556d75343
commit e880f5848a
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
30 changed files with 479 additions and 2259 deletions

View File

@ -1,10 +1,10 @@
# Contributing to Seq
# Contributing to Codon
Thank you for considering contributing to Seq! This document contains some helpful information for getting started. The best place to ask questions or get feedback is [our Gitter chatroom](https://gitter.im/seq-lang/Seq). For a high-level outline of the features we aim to add in the future, check the [roadmap](https://github.com/seq-lang/seq/wiki/Roadmap).
Thank you for considering contributing to Codon! This document contains some helpful information for getting started. The best place to ask questions or get feedback is [our Gitter chatroom](https://gitter.im/seq-lang/Seq). For a high-level outline of the features we aim to add in the future, check the [roadmap](https://github.com/seq-lang/seq/wiki/Roadmap).
## Development workflow
All development is done on the [`develop`](https://github.com/seq-lang/seq/tree/develop) branch. Just before release, we bump the version number, merge into [`master`](https://github.com/seq-lang/seq/tree/master) and tag the build with a tag of the form `vX.Y.Z` where `X`, `Y` and `Z` are the [SemVer](https://semver.org) major, minor and patch numbers, respectively. Our Travis CI build script automatically builds and deploys tagged commits as a new GitHub release via our trusty [@SeqBot](https://github.com/seqbot). It also builds and deploys the documentation to our website.
All development is done on the [`develop`](https://github.com/exaloop/codon/tree/develop) branch. Just before release, we bump the version number, merge into [`master`](https://github.com/exaloop/codon/tree/master) and tag the build with a tag of the form `vX.Y.Z` where `X`, `Y` and `Z` are the [SemVer](https://semver.org) major, minor and patch numbers, respectively. Our Travis CI build script automatically builds and deploys tagged commits as a new GitHub release via our trusty [@SeqBot](https://github.com/seqbot). It also builds and deploys the documentation to our website.
## Coding standards
@ -12,7 +12,7 @@ All C++ code should be formatted with [ClangFormat](https://clang.llvm.org/docs/
## Writing tests
Tests are written as Seq programs. The [`test/core/`](https://github.com/seq-lang/seq/tree/master/test/core) directory contains some examples. If you add a new test file, be sure to add it to [`test/main.cpp`](https://github.com/seq-lang/seq/blob/master/test/main.cpp) so that it will be executed as part of the test suite. There are two ways to write tests for Seq:
Tests are written as Codon programs. The [`test/core/`](https://github.com/exaloop/codon/tree/master/test/core) directory contains some examples. If you add a new test file, be sure to add it to [`test/main.cpp`](https://github.com/exaloop/codon/blob/master/test/main.cpp) so that it will be executed as part of the test suite. There are two ways to write tests for Codon:
#### New style
@ -39,7 +39,7 @@ print(2 + 2) # EXPECT: 4
## Pull requests
Pull requests should generally be based on the `develop` branch. Before submitting a pull request, pleace make sure...
Pull requests should generally be based on the `develop` branch. Before submitting a pull request, please make sure...
- ... to provide a clear description of the purpose of the pull request.
- ... to include tests for any new or changed code.
@ -49,12 +49,12 @@ Please be patient with pull request reviews, as our throughput is limited!
## Issues
We use [GitHub's issue tracker](https://github.com/seq-lang/seq/issues), so that's where you'll find the most recent list of open bugs, feature requests and general issues. If applicable, we try to tag each issue with at least one of the following tags:
We use [GitHub's issue tracker](https://github.com/exaloop/codon/issues), so that's where you'll find the most recent list of open bugs, feature requests, and general issues. If applicable, we try to tag each issue with at least one of the following tags:
- `Build`: Issues related to building Seq
- `Build`: Issues related to building Codon
- `Codegen`: Issues related to code generation (i.e. after parsing and type checking)
- `Parser`: Issues related to lexing/parsing
- `Library`: Issues related to the Seq standard library
- `Library`: Issues related to the Codon standard library
- `Interop`: Issues related to interoperability with other languages or systems
- `Docs`: Issues related to documentation
- `Feature`: New language feature proposals

View File

@ -1,24 +1,24 @@
<p align="center">
<img src="docs/sphinx/logo.png?raw=true" width="200" alt="Seq"/>
<img src="docs/sphinx/logo.png?raw=true" width="200" alt="Codon"/>
</p>
<h1 align="center"> Seq — a language for bioinformatics</h1>
<h1 align="center"> Codon</h1>
<p align="center">
<a href="https://github.com/seq-lang/seq/actions?query=branch%3Adevelop">
<img src="https://github.com/seq-lang/seq/workflows/Seq%20CI/badge.svg?branch=develop"
<a href="https://github.com/exaloop.io/codon/actions?query=branch%3Adevelop">
<img src="https://github.com/exaloop.io/codon/workflows/Codon%20CI/badge.svg?branch=develop"
alt="Build Status">
</a>
<a href="https://gitter.im/seq-lang/seq?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge">
<img src="https://badges.gitter.im/Join%20Chat.svg"
alt="Gitter">
</a>
<a href="https://github.com/seq-lang/seq/releases/latest">
<img src="https://img.shields.io/github/v/release/seq-lang/seq?sort=semver"
<a href="https://github.com/exaloop.io/codon/releases/latest">
<img src="https://img.shields.io/github/v/release/exaloop.io/codon?sort=semver"
alt="Version">
</a>
<a href="https://github.com/seq-lang/seq/blob/master/LICENSE">
<img src="https://img.shields.io/github/license/seq-lang/seq"
<a href="https://github.com/exaloop.io/codon/blob/master/LICENSE">
<img src="https://img.shields.io/github/license/exaloop.io/codon"
alt="License">
</a>
</p>
@ -27,17 +27,17 @@
> **A strongly-typed and statically-compiled high-performance Pythonic language!**
Seq is a programming language for computational genomics and bioinformatics. With a Python-compatible syntax and a host of domain-specific features and optimizations, Seq makes writing high-performance genomics software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.
Codon is a programming language for computationally intensive workloads. With a Python-compatible syntax and a host of domain-specific features and optimizations, Codon makes writing high-performance software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.
**Think of Seq as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.**
**Think of Codon as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.**
Seq is able to outperform Python code by up to 160x. Seq can further beat equivalent C/C++ code by up to 2x without any manual interventions, and also natively supports parallelism out of the box. Implementation details and benchmarks are discussed [in our paper](https://dl.acm.org/citation.cfm?id=3360551).
Codon is able to outperform Python code by up to 160x. Codon can further beat equivalent C/C++ code by up to 2x without any manual interventions, and also natively supports parallelism out of the box. Implementation details and benchmarks are discussed [in our paper](https://dl.acm.org/citation.cfm?id=3360551).
Learn more by following the [tutorial](https://docs.seq-lang.org/tutorial) or from the [cookbook](https://docs.seq-lang.org/cookbook).
Learn more by following the [tutorial](https://docs.exaloop.io/tutorial).
## Examples
Seq is a Python-compatible language, and many Python programs should work with few if any modifications:
Codon is a Python-compatible language, and many Python programs should work with few if any modifications:
```python
def fib(n):
@ -49,7 +49,7 @@ def fib(n):
fib(1000)
```
This prime counting example showcases Seq's [OpenMP](https://www.openmp.org/) support, enabled with the addition of one line. The `@par` annotation tells the compiler to parallelize the following for-loop, in this case using a dynamic schedule, chunk size of 100, and 16 threads.
This prime counting example showcases Codon's [OpenMP](https://www.openmp.org/) support, enabled with the addition of one line. The `@par` annotation tells the compiler to parallelize the following for-loop, in this case using a dynamic schedule, chunk size of 100, and 16 threads.
```python
from sys import argv
@ -72,30 +72,11 @@ for i in range(2, limit):
print(total)
```
Here is an example showcasing some of Seq's bioinformatics features, which include native sequence and k-mer types.
```python
from bio import *
s = s'ACGTACGT' # sequence literal
print(s[2:5]) # subsequence
print(~s) # reverse complement
kmer = Kmer[8](s) # convert to k-mer
# iterate over length-3 subsequences
# with step 2
for sub in s.split(3, step=2):
print(sub[-1]) # last base
# iterate over 2-mers with step 1
for kmer in sub.kmers(step=1, k=2):
print(~kmer) # '~' also works on k-mers
```
## Install
### Pre-built binaries
Pre-built binaries for Linux and macOS on x86_64 are available alongside [each release](https://github.com/seq-lang/seq/releases). We also have a script for downloading and installing pre-built versions:
Pre-built binaries for Linux and macOS on x86_64 are available alongside [each release](https://github.com/exaloop/codon/releases). We also have a script for downloading and installing pre-built versions:
```bash
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
@ -107,35 +88,4 @@ See [Building from Source](docs/sphinx/build.rst).
## Documentation
Please check [docs.seq-lang.org](https://docs.seq-lang.org) for in-depth documentation.
## Citing Seq
If you use Seq in your research, please cite:
> Ariya Shajii, Ibrahim Numanagić, Riyadh Baghdadi, Bonnie Berger, and Saman Amarasinghe. 2019. Seq: a high-performance language for bioinformatics. *Proc. ACM Program. Lang.* 3, OOPSLA, Article 125 (October 2019), 29 pages. DOI: https://doi.org/10.1145/3360551
BibTeX:
```
@article{Shajii:2019:SHL:3366395.3360551,
author = {Shajii, Ariya and Numanagi\'{c}, Ibrahim and Baghdadi, Riyadh and Berger, Bonnie and Amarasinghe, Saman},
title = {Seq: A High-performance Language for Bioinformatics},
journal = {Proc. ACM Program. Lang.},
issue_date = {October 2019},
volume = {3},
number = {OOPSLA},
month = oct,
year = {2019},
issn = {2475-1421},
pages = {125:1--125:29},
articleno = {125},
numpages = {29},
url = {http://doi.acm.org/10.1145/3360551},
doi = {10.1145/3360551},
acmid = {3360551},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Python, bioinformatics, computational biology, domain-specific language, optimization, programming language},
}
```
Please check [docs.exaloop.io](https://docs.exaloop.io) for in-depth documentation.

View File

@ -1,4 +1,4 @@
PROJECT_NAME = "seq"
PROJECT_NAME = "codon"
XML_OUTPUT = xml
INPUT = ../../compiler/include ../../compiler/lang ../../compiler/types ../../compiler/util
STRIP_FROM_PATH = ../..

View File

@ -4,7 +4,7 @@
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = seq
SPHINXPROJ = codon
SOURCEDIR = .
BUILDDIR = _build

View File

@ -0,0 +1,157 @@
from collections import defaultdict
import re
from typing import Any,Dict,Iterable,Iterator,List,NamedTuple,Tuple
from typing import cast
from docutils.parsers.rst import directives
from sphinx import addnodes
from sphinx.directives import ObjectDescription
from sphinx.domains import Domain
from sphinx.domains import Index
from sphinx.roles import XRefRole
from sphinx.util.nodes import make_refnode
from sphinx.locale import _,__
from docutils import nodes
from docutils.nodes import Element,Node
from docutils.parsers.rst import directives
from sphinx import addnodes
from sphinx.addnodes import pending_xref,desc_signature
from sphinx.application import Sphinx
from sphinx.builders import Builder
from sphinx.directives import ObjectDescription
from sphinx.domains import Domain,ObjType,Index,IndexEntry
from sphinx.environment import BuildEnvironment
from sphinx.locale import _,__
from sphinx.pycode.ast import ast,parse as ast_parse
from sphinx.roles import XRefRole
from sphinx.util import logging
from sphinx.util.docfields import Field,GroupedField,TypedField
from sphinx.util.docutils import SphinxDirective
from sphinx.util.inspect import signature_from_str
from sphinx.util.nodes import make_id,make_refnode
from sphinx.util.typing import TextlikeNode
import sphinx.domains.python
codon_sig_re=re.compile(r'''^([\w.]*\.)?(\w+)\s*(?:\[\s*(.*)\s*\])?\s*(?:\(\s*(.*)\s*\)(?:\s*->\s*(.*))?)?$''',re.VERBOSE)
def handle_signature(self,sig: str,signode: desc_signature) -> Tuple[str,str]:
m=codon_sig_re.match(sig)
if m is None:
raise ValueError
prefix,name,generics,arglist,retann=m.groups()
# determine module and class name (if applicable), as well as full name
modname=self.options.get('module',self.env.ref_context.get('py:module'))
classname=self.env.ref_context.get('py:class')
isextension=self.objtype=='extension'
if classname:
add_module=False
if prefix and (prefix==classname or prefix.startswith(classname+".")):
fullname=prefix+name
# class name is given again in the signature
prefix=prefix[len(classname):].lstrip('.')
elif prefix:
# class name is given in the signature, but different
# (shouldn't happen)
fullname=classname+'.'+prefix+name
else:
# class name is not given in the signature
fullname=classname+'.'+name
else:
add_module=True
if prefix:
classname=prefix.rstrip('.')
fullname=prefix+name
else:
classname=''
fullname=name
signode['module']=modname
if modname.startswith('..'): # HACK: no idea why this happens
modname=modname[2:]
signode['class']=classname
signode['fullname']=fullname
sig_prefix=self.get_signature_prefix(sig)
if sig_prefix:
signode+=addnodes.desc_annotation(sig_prefix,sig_prefix)
if not isextension:
if prefix:
signode+=addnodes.desc_addname(prefix,prefix)
elif add_module and self.env.config.add_module_names:
if modname and modname!='exceptions':
# exceptions are a special case, since they are documented in the
# 'exceptions' module.
nodetext=modname+'.'
signode+=addnodes.desc_addname(nodetext,nodetext)
signode+=addnodes.desc_name(name,name)
else:
ref=type_to_xref(str(name),self.env)
signode+=addnodes.desc_name(name,name)
if generics:
signode+=addnodes.desc_name("generics","["+generics+"]")
if arglist:
try:
signode+=sphinx.domains.python._parse_arglist(arglist,self.env)
except SyntaxError:
# fallback to parse arglist original parser.
# it supports to represent optional arguments (ex. "func(foo [, bar])")
sphinx.domains.python._pseudo_parse_arglist(signode,arglist)
except NotImplementedError as exc:
logger.warning("could not parse arglist (%r): %s",arglist,exc,location=signode)
sphinx.domains.python._pseudo_parse_arglist(signode,arglist)
else:
if self.needs_arglist():
# for callables, add an empty parameter list
signode+=addnodes.desc_parameterlist()
if retann:
children=sphinx.domains.python._parse_annotation(retann,self.env)
signode+=addnodes.desc_returns(retann,'',*children)
anno=self.options.get('annotation')
if anno:
signode+=addnodes.desc_annotation(' '+anno,' '+anno)
return fullname,prefix
sphinx.domains.python.PyObject.handle_signature=handle_signature
from sphinx.domains.python import *
logger=logging.getLogger(__name__)
class CodonDomain(PythonDomain):
name='codon'
label='Codon'
object_types={'function':ObjType(_('function'),'func','obj'),# 'data': ObjType(_('data'), 'data', 'obj'),
'class':ObjType(_('class'),'class','exc','obj'),'type':ObjType(_('class'),'exc','class','obj'),'extension':ObjType(_('class'),'exc','class','obj'),
'exception':ObjType(_('exception'),'exc','class','obj'),'method':ObjType(_('method'),'meth','obj'),# 'classmethod': ObjType(_('class method'), 'meth', 'obj'),
# 'staticmethod': ObjType(_('static method'), 'meth', 'obj'),
# 'attribute': ObjType(_('attribute'), 'attr', 'obj'),
'module':ObjType(_('module'),'mod','obj'),} # type: Dict[str, ObjType]
directives={'function':PyFunction,'data':PyVariable,'class':PyClasslike,'exception':PyClasslike,'type':PyClasslike,'extension':PyClasslike,'method':PyMethod,'classmethod':PyClassMethod,
'staticmethod':PyStaticMethod,'attribute':PyAttribute,'module':PyModule,'currentmodule':PyCurrentModule,'decorator':PyDecoratorFunction,'decoratormethod':PyDecoratorMethod,}
roles={'data':PyXRefRole(),'exc':PyXRefRole(),'func':PyXRefRole(fix_parens=True),'class':PyXRefRole(),# 'const': PyXRefRole(),
'attr':PyXRefRole(),'meth':PyXRefRole(fix_parens=True),'mod':PyXRefRole(),'obj':PyXRefRole(),}
initial_data={'objects':{}, # fullname -> docname, objtype
'modules':{}, # modname -> docname, synopsis, platform, deprecated
} # type: Dict[str, Dict[str, Tuple[Any]]]
indices=[PythonModuleIndex,]
def setup(app):
app.add_domain(CodonDomain)
# app.connect('object-description-simplify', filter_meta_fields)
return {'version':'0.1','parallel_read_safe':True,'parallel_write_safe':True,}

View File

@ -1,13 +1,13 @@
Building from Source
====================
Unless you really need to build Seq for whatever reason, we strongly
Unless you really need to build Codon for whatever reason, we strongly
recommend using pre-built binaries if possible.
Dependencies
------------
Seq depends on LLVM 12, which can be installed via most package managers. To
Codon depends on LLVM 12, which can be installed via most package managers. To
build LLVM 12 yourself, you can do the following:
.. code-block:: bash
@ -28,7 +28,7 @@ build LLVM 12 yourself, you can do the following:
Build
-----
The following can generally be used to build Seq. The build process will automatically
The following can generally be used to build Codon. The build process will automatically
download and build several smaller dependencies.
.. code-block:: bash
@ -40,5 +40,5 @@ download and build several smaller dependencies.
-DCMAKE_CXX_COMPILER=clang++)
cmake --build build --config Release
This should produce the ``seqc`` executable in the ``build`` directory, as well as
``seqtest`` which runs the test suite.
This should produce the ``codon`` executable in the ``build`` directory, as well as
``codon_test`` which runs the test suite.

View File

@ -16,17 +16,13 @@ import os
import sys
sys.path.insert(0, os.path.abspath("./_ext"))
def setup(sphinx):
sys.path.insert(0, os.path.abspath('.'))
from seqlex import SeqLexer
sphinx.add_lexer("seq", SeqLexer)
# -- Project information -----------------------------------------------------
project = u'Seq'
copyright = u'2019-2021, seq-lang'
author = u'seq-lang'
project = u'Codon'
copyright = u'2019-2021, codon'
author = u'codon'
# The short X.Y version
version = u'0.11'
@ -50,9 +46,9 @@ extensions = [
'sphinx.ext.mathjax',
'sphinx.ext.githubpages',
'sphinx.ext.intersphinx',
'seq'
# 'breathe',
# 'exhale'
# 'exhale',
'codon',
]
# Add any paths that contain templates here, relative to this directory.
@ -95,7 +91,7 @@ html_theme = 'sphinx_book_theme'
# documentation.
#
html_theme_options = {
"repository_url": "https://github.com/seq-lang/seq",
"repository_url": "https://github.com/exaloop/codon",
"repository_branch": "develop",
"path_to_docs": "docs/sphinx/",
"use_repository_button": True,
@ -120,7 +116,7 @@ html_sidebars = { '**': ['globaltoc.html', 'relations.html', 'searchbox.html'],
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'seqdoc'
htmlhelp_basename = 'codondoc'
# -- Options for LaTeX output ------------------------------------------------
@ -147,7 +143,7 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'seq.tex', u'seq Documentation',
(master_doc, 'codon.tex', u'Codon Documentation',
u'arshajii', 'manual'),
]
@ -157,7 +153,7 @@ latex_documents = [
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'seq', u'Seq Documentation',
(master_doc, 'codon', u'Codon Documentation',
[author], 1)
]
@ -168,18 +164,18 @@ man_pages = [
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'seq', u'Seq Documentation',
author, 'seq', 'a language for bioinformatics',
(master_doc, 'codon', u'Codon Documentation',
author, 'codon', 'a high-performance language',
'Miscellaneous'),
]
# -- Extension configuration -------------------------------------------------
breathe_projects = {
'Seq': '../doxygen/xml'
'Codon': '../doxygen/xml'
}
breathe_default_project = 'Seq'
breathe_default_project = 'Codon'
exhale_args = {
# These arguments are required

View File

@ -1,273 +0,0 @@
Cookbook
========
.. contents::
Subsequence extraction
----------------------
.. code-block:: seq
myseq = s'CAATAGAGACTAAGCATTAT'
sublen = 5
stride = 2
# explicit for-loop
for subseq in myseq.split(sublen, stride):
print(subseq)
# pipelined
myseq |> split(sublen, stride) |> print
k-mer extraction
----------------
.. code-block:: seq
myseq = s'CAATAGAGACTAAGCATTAT'
stride = 2
# explicit for-loop
for subseq in myseq.kmers(stride, k=5):
print(subseq)
# pipelined
myseq |> kmers(stride, k=5) |> print
Reverse complementation
-----------------------
.. code-block:: seq
# sequences
s = s'GGATC'
print(~s) # GATCC
# k-mers
k = k'GGATC'
print(~k) # GATCC
k-mer Hamming distance
----------------------
.. code-block:: seq
k1 = k'ACGTC'
k2 = k'ACTTA'
# ^ ^
print(abs(k1 - k2)) # Hamming distance = 2
k-mer Hamming neighbors
-----------------------
.. code-block:: seq
def neighbors(kmer):
for i in range(len(kmer)):
for b in (k'A', k'C', k'G', k'T'):
if kmer[i] != b:
yield kmer |> base(i, b)
print(list(neighbors(k'AGC'))) # CGC, GGC, etc.
k-mer minimizer
---------------
.. code-block:: seq
def minimizer(s, k: Static[int]):
assert len(s) >= k
kmer_min = Kmer[k](s[:k])
for kmer in s[1:].kmers(k=k, step=1):
kmer = min(kmer, ~kmer)
if kmer < kmer_min: kmer_min = kmer
return kmer_min
print(minimizer(s'ACGTACGTACGT', 10))
de Bruijn edge
--------------
.. code-block:: seq
def de_bruijn_edge(a, b):
a = a |> base(0, k'A') # reset first base: [T]GAG -> [A]GAG
b = b >> s'A' # shift right to A: [GAG]C -> A[GAG]
return a == b # suffix of a == prefix of b
print(de_bruijn_edge(k'TGAG', k'GAGC')) # True
print(de_bruijn_edge(k'TCAG', k'GAGC')) # False
Count bases
-----------
.. code-block:: seq
@tuple
class BaseCount:
A: int
C: int
G: int
T: int
def __add__(self, other: BaseCount):
a1, c1, g1, t1 = self
a2, c2, g2, t2 = other
return (a1 + a2, c1 + c2, g1 + g2, t1 + t2)
def count_bases(s):
match s:
case 'A*': return count_bases(s[1:]) + (1,0,0,0)
case 'C*': return count_bases(s[1:]) + (0,1,0,0)
case 'G*': return count_bases(s[1:]) + (0,0,1,0)
case 'T*': return count_bases(s[1:]) + (0,0,0,1)
case _: return BaseCount(0,0,0,0)
print(count_bases(s'ACCGGGTTTT')) # (A: 1, C: 2, G: 3, T: 4)
Spaced seed search
------------------
.. code-block:: seq
def has_spaced_acgt(s):
match s:
case 'A_C_G_T*':
return True
case t if len(t) >= 8:
return has_spaced_acgt(s[1:])
case _:
return False
print(has_spaced_acgt(s'AAATCTGTTAAA')) # True
print(has_spaced_acgt(s'ACGTACGTACGT')) # False
Reverse-complement palindrome
-----------------------------
.. code-block:: seq
def is_own_revcomp(s):
match s:
case 'A*T' | 'T*A' | 'C*G' | 'G*C':
return is_own_revcomp(s[1:-1])
case '':
return True
case _:
return False
print(is_own_revcomp(s'ACGT')) # True
print(is_own_revcomp(s'ATTA')) # False
Sequence alignment
------------------
.. code-block:: seq
# default parameters
s1 = s'CGCGAGTCTT'
s2 = s'CGCAGAGTT'
aln = s1 @ s2
print(aln.cigar, aln.score) # 3M1I6M -3
# custom parameters
# match = 2; mismatch = 4; gap1(k) = 2k + 4; gap2(k) = k + 13
aln = s1.align(s2, a=2, b=4, gapo=4, gape=2, gapo2=13, gape2=1)
print(aln.cigar, aln.score) # 3M1D3M2I2M 2
Reading FASTA/FASTQ
-------------------
.. code-block:: seq
# iterate over everything
for r in FASTA('genome.fa'):
print(r.name)
print(r.seq)
# iterate over sequences
for s in FASTA('genome.fa') |> seqs:
print(s)
# iterate over everything
for r in FASTQ('reads.fq'):
print(r.name)
print(r.read)
print(r.qual)
# iterate over sequences
for s in FASTQ('reads.fq') |> seqs:
print(s)
Reading paired-end FASTQ
------------------------
.. code-block:: seq
for r1, r2 in zip(FASTQ('reads_1.fq'), FASTQ('reads_2.fq')):
print(r1.name, r2.name)
print(r1.read, r2.read)
print(r1.qual, r2.qual)
Parallel FASTQ processing
-------------------------
.. code-block:: seq
def process(s: seq):
...
# OMP_NUM_THREADS environment variable controls threads
FASTQ('reads.fq') |> iter ||> process
# Sometimes batching reads into blocks can improve performance,
# especially if each is quick to process.
FASTQ('reads.fq') |> blocks(size=1000) ||> iter |> process
Reading SAM/BAM/CRAM
--------------------
.. code-block:: seq
# iterate over everything
for r in SAM('alignments.sam'):
print(r.name)
print(r.read)
print(r.pos)
print(r.mapq)
print(r.cigar)
print(r.reversed)
# etc.
for r in BAM('alignments.bam'):
# ...
for r in CRAM('alignments.cram'):
# ...
# iterate over sequences
for s in SAM('alignments.sam') |> seqs:
print(s)
for s in BAM('alignments.bam') |> seqs:
print(s)
for s in CRAM('alignments.cram') |> seqs:
print(s)
DNA to protein translation
--------------------------
.. code-block:: seq
dna = s'AGGTCTAACGGC'
protein = dna |> translate
print(protein) # RSNG
Reading protein sequences from FASTA
------------------------------------
.. code-block:: seq
for s in pFASTA('seqs.fasta') |> seqs:
print(s)

View File

@ -17,23 +17,23 @@ root=os.path.abspath(sys.argv[1])
print(f"Generating documentation for {root}...")
# 1. Call seqc -docstr and get a documentation in JSON format
# 1. Call codon -docstr and get a documentation in JSON format
def load_json(directory):
# Get all seq files in the directory
# Get all codon files in the directory
files=[]
for root,_,items in os.walk(directory):
for f in items:
if f.endswith('.seq'):
if f.endswith('.codon'):
files.append(os.path.abspath(os.path.join(root,f)))
files='\n'.join(files)
s=sp.run(['../../build/seqc','doc'],stdout=sp.PIPE,input=files.encode('utf-8'))
s=sp.run(['../../build/codon','doc'],stdout=sp.PIPE,input=files.encode('utf-8'))
if s.returncode!=0:
raise ValueError('seqc failed')
raise ValueError('codon failed')
return json.loads(s.stdout.decode('utf-8'))
j=load_json(root)
print(f" - Done with seqc")
print(f" - Done with codon")
# with open('x.json','w') as f:
# json.dump(j,f,indent=2)
@ -48,8 +48,8 @@ for mid,module in modules.items():
print(root,mid,module)
directory=os.path.relpath(directory,root) # remove the prefix
os.makedirs(f"stdlib/{directory}",exist_ok=True)
if name.endswith('.seq'):
name=name[:-4]
if name.endswith('.codon'):
name=name[:-6] # drop suffix
if name!='__init__':
parsed_modules[directory].add((name,mid))
module=os.path.split(module)[0]
@ -57,7 +57,7 @@ for directory,modules in parsed_modules.items():
module=directory.replace('/','.')
with open(f'stdlib/{directory}/index.rst','w') as f:
if module!='.':
print(f".. seq:module:: {module}\n",file=f)
print(f".. codon:module:: {module}\n",file=f)
print(f"{module}",file=f)
else:
print("Standard Library Reference",file=f)
@ -143,11 +143,11 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
if name=='__init__':
file,mode=f'stdlib/{directory}/index.rst','a'
with open(file,mode) as f:
print(f".. seq:module:: {module}\n",file=f)
print(f":seq:mod:`{module}`",file=f)
print("-"*(len(module)+11)+"\n",file=f)
print(f".. codon:module:: {module}\n",file=f)
print(f":codon:mod:`{module}`",file=f)
print("-"*(len(module)+13)+"\n",file=f)
directory_prefix=directory+'/' if directory!='.' else ''
print(f"Source code: `{directory_prefix}{name}.seq <https://github.com/seq-lang/seq/blob/master/stdlib/{directory}/{name}.seq>`_\n",file=f)
print(f"Source code: `{directory_prefix}{name}.codon <https://github.com/exaloop.io/codon/blob/master/stdlib/{directory}/{name}.codon>`_\n",file=f)
if 'doc' in j[mid]:
print(parse_docstr(j[mid]['doc']),file=f)
@ -162,13 +162,13 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
if v['kind']=='class':
if v['name'].endswith('Error'):
v["type"]="exception"
f.write(f'.. seq:{v["type"]}:: {v["name"]}')
f.write(f'.. codon:{v["type"]}:: {v["name"]}')
if 'generics' in v and v['generics']:
f.write(f'[{",".join(v["generics"])}]')
elif v['kind']=='function':
f.write(f'.. seq:function:: {v["name"]}{parse_fn(v)}')
f.write(f'.. codon:function:: {v["name"]}{parse_fn(v)}')
elif v['kind']=='variable':
f.write(f'.. seq:data:: {v["name"]}')
f.write(f'.. codon:data:: {v["name"]}')
# if v['kind'] == 'class' and v['type'] == 'extension':
# f.write(f'**`{getLink(v["parent"])}`**')
# else:
@ -202,7 +202,7 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
print(' **Properties:**\n',file=f)
for c in props:
v=j[c]
f.write(f' .. seq:attribute:: {v["name"]}\n')
f.write(f' .. codon:attribute:: {v["name"]}\n')
if 'doc' in v:
f.write("\n"+parse_docstr(v['doc'],4)+"\n\n")
f.write("\n")
@ -212,7 +212,7 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
print(' **Magic methods:**\n',file=f)
for c in magics:
v=j[c]
f.write(f' .. seq:method:: {v["name"]}{parse_fn(v,True)}\n')
f.write(f' .. codon:method:: {v["name"]}{parse_fn(v,True)}\n')
f.write(' :noindex:\n')
if 'doc' in v:
f.write("\n"+parse_docstr(v['doc'],4)+"\n\n")
@ -222,7 +222,7 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
print(' **Methods:**\n',file=f)
for c in methods:
v=j[c]
f.write(f' .. seq:method:: {v["name"]}{parse_fn(v,True)}\n')
f.write(f' .. codon:method:: {v["name"]}{parse_fn(v,True)}\n')
if 'doc' in v:
f.write("\n"+parse_docstr(v['doc'],4)+"\n\n")
f.write("\n")

View File

@ -1,9 +1,9 @@
Calling Seq from C/C++
======================
Calling Codon from C/C++
========================
Calling C/C++ from Seq is quite easy with ``from C import``, but Seq can also be called from C/C++ code. To make a Seq function externally visible, simply annotate it with ``@export``:
Calling C/C++ from Codon is quite easy with ``from C import``, but Codon can also be called from C/C++ code. To make a Codon function externally visible, simply annotate it with ``@export``:
.. code-block:: seq
.. code-block:: python
@export
def foo(n: int):
@ -11,14 +11,14 @@ Calling C/C++ from Seq is quite easy with ``from C import``, but Seq can also be
print(i * i)
return n * n
Note that only top-level, non-generic functions can be exported. Now we can create a shared library containing ``foo`` (assuming source file *foo.seq*):
Note that only top-level, non-generic functions can be exported. Now we can create a shared library containing ``foo`` (assuming source file *foo.codon*):
.. code-block:: bash
seqc build -o foo.o foo.seq
gcc -shared -lseqrt -lomp foo.o -o libfoo.so
codon build -o foo.o foo.codon
gcc -shared -lcodonrt -lomp foo.o -o libfoo.so
(The last command might require an additional ``-L/path/to/seqrt/lib/`` argument if ``libseqrt`` is not installed on a standard path.)
(The last command might require an additional ``-L/path/to/codonrt/lib/`` argument if ``libcodonrt`` is not installed on a standard path.)
Now we can call ``foo`` from a C program:
@ -39,22 +39,21 @@ Compile:
gcc -o foo -L. -lfoo foo.c
Now running ``./foo`` should invoke ``foo()`` as defined in Seq, with an argument of ``10``.
Now running ``./foo`` should invoke ``foo()`` as defined in Codon, with an argument of ``10``.
Converting types
----------------
The following table shows the conversions between Seq and C/C++ types:
The following table shows the conversions between Codon and C/C++ types:
============ ============
Seq C/C++
Codon C/C++
------------ ------------
``int`` ``int64_t``
``float`` ``double``
``bool`` ``bool``
``byte`` ``int8_t``
``str`` ``{int64_t, char*}``
``seq`` ``{int64_t, char*}``
``class`` Pointer to corresponding tuple
``@tuple`` Struct of fields
============ ============

View File

@ -1,7 +1,22 @@
**Seq** — a Python implementation for bioinformatics
**Codon** - a high-performance Python implementation
====================================================
Seq is a Pythonic language for computational genomics and bioinformatics. With a Python-compatible syntax and a host of domain-specific features and optimizations, Seq makes writing high-performance genomics software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.
What is Codon?
--------------
Codon is a high-performance Python implementation that compiles Python code to native machine code without any runtime overhead.
The Codon framework grew out of the `Seq <https://seq-lang.org>`_ project; while Seq focuses on genomics and bioinformatics, Codon
can be applied in a number of different areas and is extensible via a plugin system.
Typical speedups over Python are on the order of 10-100x or more, on a single thread. Codon supports native multithreading which can lead
to speedups many times higher still.
What isn't Codon?
-----------------
While Codon supports nearly all of Python's syntax, it is not a drop-in replacement, and large codebases might require modifications
to be run through the Codon compiler. For example, some of Python's modules are not yet implemented within Codon, and a few of Python's
dynamic features are disallowed. The Codon compiler produces detailed error messages to help identify and resolve any incompatibilities.
Questions, comments or suggestions? Visit our `Gitter chatroom <https://gitter.im/seq-lang/Seq?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_.
@ -13,62 +28,36 @@ Questions, comments or suggestions? Visit our `Gitter chatroom <https://gitter.i
parallel
python
embed
cookbook
build
stdlib/index
News
----
What's new in 0.11?
^^^^^^^^^^^^^^^^^^^
Version 0.11 again includes a number of upgrades to the parser, type system and compiler backend:
- Parallelism and multithreading support with OpenMP (e.g. ``@par for i in range(10): ...``)
- New PEG parser
- Improved compile-time static evaluation
- Upgrade to LLVM 12 (from LLVM 6)
- A few changes to the ``bio`` module, particularly with ``seq.kmers()``, which now takes a length argument as in ``s.kmers(k=12, step=1)``
What's new in 0.10?
^^^^^^^^^^^^^^^^^^^
Version 0.10 brings a slew of improvements to the language and compiler, including:
- Nearly all of Python's syntax is now supported, including empty collections (``[]``, ``{}``), lambda functions (``lambda``), ``*args``/``**kwargs``, ``None`` and much more
- Compiler error messages now pinpoint exactly where an error occurred with compile-time backtraces
- Runtime exceptions now include backtraces with file names and line numbers in debug mode
- GDB and LLDB support
- Various syntax updates to further close the gap with Python
- Numerous standard library improvements
.. caution::
The default compilation and execution mode is now "debug", which disables most optimizations. Pass the ``-release`` argument to ``seqc`` to enable optimizations.
Frequently Asked Questions
--------------------------
*Can I use Seq for general-purpose computing?*
*What is the goal of Codon?*
Yes! While the Seq project started with a narrow focus on bioinformatics, it has grown to encompass much of Python's syntax, semantics and modules, making it a useful tool beyond just bioinformatics, particularly when large datasets need to be processed with Python.
One of the main focuses of Codon is to bridge the gap between usability and performance. Codon aims to make writing high-performance software
substantially easier, and to provide a common, unified framework for the development of such software across a range of domains.
*What is the goal of Seq?*
*How does Codon compare to other Python implementations?*
One of the main focuses of Seq is to bridge the gap between usability and performance in the fields of bioinformatics and computational genomics, which have an unfortunate reputation for hard-to-use, buggy or generally poorly-written software. Seq aims to make writing high-performance genomics or bioinformatics software substantially easier, and to provide a common, unified framework for the development of such software.
*Why do we need a whole new language? Why not a library?*
There are many great bioinformatics libraries on the market today, including `Biopython <https://biopython.org>`_ for Python, `SeqAn <https://www.seqan.de>`_ for C++ and `BioJulia <https://biojulia.net>`_ for Julia. In fact, Seq offers a lot of the same functionality found in these libraries. The advantages of having a domain-specific language and compiler, however, are the higher-level constructs and optimizations like :ref:`pipeline`, :ref:`match`, :ref:`interalign` and :ref:`prefetch`, which are difficult to replicate in a library, as they often involve large-scale program transformations/optimizations. A domain-specific language also allows us to explore different backends like GPU, TPU or FPGA in a systematic way, in conjunction with these various constructs/optimizations, which is ongoing work.
Unlike other performance-oriented Python implementations, such as PyPy or Numba, Codon is a standalone system implemented entirely independently
of regular Python. Since it does not need to interoperate with CPython's runtime, Codon has far greater flexibility to generate optimized code.
In fact, Codon will frequently generate the same code as that from an equivalent C or C++ program. This design choice also allows Codon to circumvent
issues like Python's global interpretter lock, and thereby to take full advantage of parallelism and multithreading.
*What about interoperability with other languages and frameworks?*
Interoperability is and will continue to be a priority for the Seq project. We don't want using Seq to render you unable to use all the other great frameworks and libraries that exist. Seq already supports interoperability with C/C++ and Python (see :ref:`interop`).
Interoperability is and will continue to be a priority for the Codon project. We don't want using Codon to render you unable to use all the other great
frameworks and libraries that exist. Codon already supports interoperability with C/C++ and Python (see :ref:`interop`).
*I want to contribute! How do I get started?*
Great! Check out our `contribution guidelines <https://github.com/seq-lang/seq/blob/master/CONTRIBUTING.md>`_ and `open issues <https://github.com/seq-lang/seq/issues>`_ to get started. Also don't hesitate to drop by our `Gitter chatroom <https://gitter.im/seq-lang/Seq?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_ if you have any questions.
Great! Check out our `contribution guidelines <https://github.com/exaloop/codon/blob/master/CONTRIBUTING.md>`_ and `open issues <https://github.com/exaloop/codon/issues>`_
to get started. Also don't hesitate to drop by our `Gitter chatroom <https://gitter.im/seq-lang/Seq?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_
if you have any questions.
*What is planned for the future?*
See the `roadmap <https://github.com/seq-lang/seq/wiki/Roadmap>`_ for information about this.
See the `roadmap <https://github.com/exaloop/codon/wiki/Roadmap>`_ for information about this.

View File

@ -7,13 +7,13 @@ Install
Pre-built binaries
^^^^^^^^^^^^^^^^^^
Pre-built binaries for Linux and macOS on x86_64 are available alongside `each release <https://github.com/seq-lang/seq/releases>`_. We also have a script for downloading and installing pre-built versions:
Pre-built binaries for Linux and macOS on x86_64 are available alongside `each release <https://github.com/exaloop/codon/releases>`_. We also have a script for downloading and installing pre-built versions:
.. code-block:: bash
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
This will install Seq in a new ``.seq`` directory within your home directory.
This will install Codon in a new ``.codon`` directory within your home directory.
Building from source
^^^^^^^^^^^^^^^^^^^^
@ -23,56 +23,55 @@ See `Building from Source <build.html>`_.
Usage
-----
The ``seqc`` program can either directly run Seq source in JIT mode:
The ``codon`` program can either directly ``run`` Codon source in JIT mode:
.. code-block:: bash
seqc run myprogram.seq
codon run myprogram.codon
The default compilation and run mode is *debug* (``-debug``). Compile and run with optimizations with the ``-release`` option:
.. code-block:: bash
seqc run -release myprogram.seq
codon run -release myprogram.codon
``seqc`` can also produce executables (ensure you have ``clang`` installed, as it is used for linking):
``codon`` can also ``build`` executables (ensure you have ``clang`` installed, as it is used for linking):
.. code-block:: bash
# generate 'myprogram' executable
seqc build -exe myprogram.seq
codon build -exe myprogram.codon
# generate 'foo' executable
seqc build -o foo myprogram.seq
codon build -o foo myprogram.codon
``seqc`` can produce object files:
``codon`` can produce object files:
.. code-block:: bash
# generate 'myprogram.o' object file
seqc build -obj myprogram.seq
codon build -obj myprogram.codon
# generate 'foo.o' object file
seqc build -o foo.o myprogram.seq
codon build -o foo.o myprogram.codon
``seqc`` can produce LLVM IR:
``codon`` can produce LLVM IR:
.. code-block:: bash
# generate 'myprogram.ll' object file
seqc build -llvm myprogram.seq
codon build -llvm myprogram.codon
# generate 'foo.ll' object file
seqc build -o foo.ll myprogram.seq
codon build -o foo.ll myprogram.codon
Compile-time definitions
------------------------
``seqc`` allows for compile-time definitions via the ``-D`` flag. For example, in the following code:
``codon`` allows for compile-time definitions via the ``-D`` flag. For example, in the following code:
.. code-block:: seq
.. code-block:: python
from bio import *
print(Kmer[SEED_LEN]())
print(Int[BIT_WIDTH]())
``SEED_LEN`` can be specified on the command line as such: ``seqc run -DSEED_LEN=10 myprogram.seq``.
``BIT_WIDTH`` can be specified on the command line as such: ``codon run -DBIT_WIDTH=10 myprogram.codon``.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 208 KiB

After

Width:  |  Height:  |  Size: 22 KiB

View File

@ -9,7 +9,7 @@ if "%SPHINXBUILD%" == "" (
)
set SOURCEDIR=.
set BUILDDIR=_build
set SPHINXPROJ=seq
set SPHINXPROJ=codon
if "%1" == "" goto help

View File

@ -1,9 +1,11 @@
.. _parallelism:
Parallelism and Multithreading
==============================
Seq supports parallelism and multithreading via OpenMP out of the box. Here's an example:
Codon supports parallelism and multithreading via OpenMP out of the box. Here's an example:
.. code-block:: seq
.. code-block:: python
@par
for i in range(10):
@ -14,7 +16,7 @@ By default, parallel loops will use all available threads, or use the number of
specified by the ``OMP_NUM_THREADS`` environment variable. A specific thread number can
be given directly on the ``@par`` line as well:
.. code-block:: seq
.. code-block:: python
@par(num_threads=5)
for i in range(10):
@ -31,7 +33,7 @@ be given directly on the ``@par`` line as well:
Other OpenMP parameters like ``private``, ``shared`` or ``reduction``, are inferred
automatically by the compiler. For example, the following loop
.. code-block:: seq
.. code-block:: python
total = 0
@par
@ -43,7 +45,7 @@ will automatically generate a reduction for variable ``a``.
Here is an example that finds the sum of prime numbers up to a user-defined limit, using
a parallel loop on 16 threads with a dynamic schedule and chunk size of 100:
.. code-block:: seq
.. code-block:: python
from sys import argv
@ -71,7 +73,7 @@ the factors of an integer takes more time for larger integers, we use a dynamic
``@par`` also supports C/C++ OpenMP pragma strings. For example, the ``@par`` line in the
above example can also be written as:
.. code-block:: seq
.. code-block:: python
# same as: @par(schedule='dynamic', chunk_size=100, num_threads=16)
@par('schedule(dynamic, 100) num_threads(16)')
@ -84,13 +86,13 @@ applies to *imperative* for-loops of the form ``for i in range(a, b, c)`` (where
For general parallel for-loops of the form ``for i in some_generator()``, a task-based approach is
used instead, where each loop iteration is executed as an independent task.
The Seq compiler also converts iterations over lists (``for a in some_list``) to imperative
The Codon compiler also converts iterations over lists (``for a in some_list``) to imperative
for-loops, meaning these loops can be executed using OpenMP's loop parallelism.
Custom reductions
-----------------
Seq can automatically generate efficient reductions for ``int`` and ``float`` values. For other
Codon can automatically generate efficient reductions for ``int`` and ``float`` values. For other
data types, user-defined reductions can be specified. A class that supports reductions must
include:
@ -99,7 +101,7 @@ include:
Here is an example for reducing a new ``Vector`` type:
.. code-block:: seq
.. code-block:: python
@tuple
class Vector:
@ -121,9 +123,9 @@ Here is an example for reducing a new ``Vector`` type:
OpenMP constructs
-----------------
All of OpenMP's API functions are accessible directly in Seq. For example:
All of OpenMP's API functions are accessible directly in Codon. For example:
.. code-block:: seq
.. code-block:: python
import openmp as omp
print(omp.get_num_threads())
@ -132,7 +134,7 @@ All of OpenMP's API functions are accessible directly in Seq. For example:
OpenMP's *critical*, *master*, *single* and *ordered* constructs can be applied via the
corresponding decorators:
.. code-block:: seq
.. code-block:: python
import openmp as omp
@ -161,7 +163,7 @@ corresponding decorators:
For finer-grained locking, consider using the locks from the ``threading`` module:
.. code-block:: seq
.. code-block:: python
from threading import Lock
lock = Lock() # or RLock for re-entrant lock

View File

@ -1,17 +1,17 @@
Calling Python from Seq
=======================
Calling Python from Codon
=========================
Calling Python from Seq is possible in two ways:
Calling Python from Codon is possible in two ways:
- ``from python import`` allows importing and calling Python functions from existing Python modules.
- ``@python`` allows writing Python code directly in Seq.
- ``@python`` allows writing Python code directly in Codon.
In order to use these features, the ``SEQ_PYTHON`` environment variable must be set to the appropriate
In order to use these features, the ``CODON_PYTHON`` environment variable must be set to the appropriate
Python shared library:
.. code-block:: bash
export SEQ_PYTHON=/path/to/libpython.X.Y.so
export CODON_PYTHON=/path/to/libpython.X.Y.so
For example, with a ``brew``-installed Python 3.9 on macOS, this might be
@ -21,8 +21,8 @@ For example, with a ``brew``-installed Python 3.9 on macOS, this might be
Note that only Python versions 3.6 and later are supported.
``from python import``
----------------------
from python import
------------------
Let's say we have a Python function defined in *mymodule.py*:
@ -31,22 +31,22 @@ Let's say we have a Python function defined in *mymodule.py*:
def multiply(a, b):
return a * b
We can call this function in Seq using ``from python import`` and indicating the appropriate
We can call this function in Codon using ``from python import`` and indicating the appropriate
call and return types:
.. code-block:: seq
.. code-block:: python
from python import mymodule.multiply(int, int) -> int
print(multiply(3, 4)) # 12
(Be sure the ``PYTHONPATH`` environment variable includes the path of *mymodule.py*!)
``@python``
-----------
@python
-------
Seq programs can contain functions that will be executed by Python via ``pydef``:
Codon programs can contain functions that will be executed by Python via ``pydef``:
.. code-block:: seq
.. code-block:: python
@python
def multiply(a: int, b: int) -> int:
@ -56,7 +56,7 @@ Seq programs can contain functions that will be executed by Python via ``pydef``
This makes calling Python modules like NumPy very easy:
.. code-block:: seq
.. code-block:: python
@python
def myrange(n: int) -> List[int]:

View File

@ -1,245 +0,0 @@
# Seq Pygments lexer based on Python's
import re
from pygments.lexer import Lexer, RegexLexer, include, bygroups, using, \
default, words, combined, do_insertions
from pygments.util import get_bool_opt, shebang_matches
from pygments.token import Text, Comment, Operator, Keyword, Name, String, \
Number, Punctuation, Generic, Other, Error
from pygments import unistring as uni
__all__ = ['SeqLexer']
line_re = re.compile('.*?\n')
class SeqLexer(RegexLexer):
name = 'Seq'
aliases = ['seq']
filenames = ['*.seq']
mimetypes = []
flags = re.MULTILINE | re.UNICODE
uni_name = "[%s][%s]*" % (uni.xid_start, uni.xid_continue)
def innerstring_rules(ttype):
return [
# the old style '%s' % (...) string formatting (still valid in Py3)
(r'%(\(\w+\))?[-#0 +]*([0-9]+|[*])?(\.([0-9]+|[*]))?'
'[hlL]?[E-GXc-giorsux%]', String.Interpol),
# the new style '{}'.format(...) string formatting
(r'\{'
'((\w+)((\.\w+)|(\[[^\]]+\]))*)?' # field name
'(\![sra])?' # conversion
'(\:(.?[<>=\^])?[-+ ]?#?0?(\d+)?,?(\.\d+)?[E-GXb-gnosx%]?)?'
'\}', String.Interpol),
# backslashes, quotes and formatting signs must be parsed one at a time
(r'[^\\\'"%{\n]+', ttype),
(r'[\'"\\]', ttype),
# unhandled string formatting sign
(r'%|(\{{1,2})', ttype)
# newlines are an error (use "nl" state)
]
tokens = {
'root': [
(r'\n', Text),
(r'^(\s*)([rRuUbB]{,2})("""(?:.|\n)*?""")',
bygroups(Text, String.Affix, String.Doc)),
(r"^(\s*)([rRuUbB]{,2})('''(?:.|\n)*?''')",
bygroups(Text, String.Affix, String.Doc)),
(r'[^\S\n]+', Text),
(r'\A#!.+$', Comment.Hashbang),
(r'#.*$', Comment.Single),
(r'[]{}:(),;[]', Punctuation),
(r'\\\n', Text),
(r'\\', Text),
(r'(in|is|and|or|not)\b', Operator.Word),
(r'!=|==|<<|>>|[-~+/*%=<>&^|.]', Operator),
include('keywords'),
(r'(def)((?:\s|\\\s)+)', bygroups(Keyword, Text), 'funcname'),
(r'(class)((?:\s|\\\s)+)', bygroups(Keyword, Text), 'classname'),
(r'(from)((?:\s|\\\s)+)', bygroups(Keyword.Namespace, Text),
'fromimport'),
(r'(import)((?:\s|\\\s)+)', bygroups(Keyword.Namespace, Text),
'import'),
include('builtins'),
include('magicfuncs'),
include('magicvars'),
include('backtick'),
('([skprR]|[uUbB][rR]|[rR][uUbB])(""")',
bygroups(String.Affix, String.Double), 'tdqs'),
("([skprR]|[uUbB][rR]|[rR][uUbB])(''')",
bygroups(String.Affix, String.Single), 'tsqs'),
('([skprR]|[uUbB][rR]|[rR][uUbB])(")',
bygroups(String.Affix, String.Double), 'dqs'),
("([skprR]|[uUbB][rR]|[rR][uUbB])(')",
bygroups(String.Affix, String.Single), 'sqs'),
('([uUbB]?)(""")', bygroups(String.Affix, String.Double),
combined('stringescape', 'tdqs')),
("([uUbB]?)(''')", bygroups(String.Affix, String.Single),
combined('stringescape', 'tsqs')),
('([uUbB]?)(")', bygroups(String.Affix, String.Double),
combined('stringescape', 'dqs')),
("([uUbB]?)(')", bygroups(String.Affix, String.Single),
combined('stringescape', 'sqs')),
include('name'),
include('numbers'),
],
'funcname': [
include('magicfuncs'),
('[a-zA-Z_]\w*', Name.Function, '#pop'),
default('#pop'),
],
'stringescape': [
(r'\\([\\abfnrtv"\']|\n|N\{.*?\}|u[a-fA-F0-9]{4}|'
r'U[a-fA-F0-9]{8}|x[a-fA-F0-9]{2}|[0-7]{1,3})', String.Escape)
],
'strings-single': innerstring_rules(String.Single),
'strings-double': innerstring_rules(String.Double),
'dqs': [
(r'"', String.Double, '#pop'),
(r'\\\\|\\"|\\\n', String.Escape), # included here for raw strings
include('strings-double')
],
'sqs': [
(r"'", String.Single, '#pop'),
(r"\\\\|\\'|\\\n", String.Escape), # included here for raw strings
include('strings-single')
],
'tdqs': [
(r'"""', String.Double, '#pop'),
include('strings-double'),
(r'\n', String.Double)
],
'tsqs': [
(r"'''", String.Single, '#pop'),
include('strings-single'),
(r'\n', String.Single)
],
}
tokens['keywords'] = [
(words((
'assert', 'async', 'await', 'break', 'continue', 'del', 'elif',
'else', 'except', 'finally', 'for', 'global', 'if', 'lambda', 'pass',
'raise', 'nonlocal', 'return', 'try', 'while', 'yield', 'yield from',
'as', 'with', 'match', 'case', 'pydef',
'type', 'extend', 'print', 'cimport', 'pyimport'), suffix=r'\b'),
Keyword),
(words((
'True', 'False', 'None'), suffix=r'\b'),
Keyword.Constant),
]
seqwords = ['__import__', 'abs', 'all', 'any', 'bin', 'bool', 'bytearray', 'bytes',
'chr', 'classmethod', 'cmp', 'compile', 'complex', 'delattr', 'dict',
'dir', 'divmod', 'enumerate', 'eval', 'filter', 'float', 'format',
'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'hex', 'id',
'input', 'int', 'isinstance', 'issubclass', 'iter', 'len', 'list',
'locals', 'map', 'max', 'memoryview', 'min', 'object', 'oct',
'open', 'ord', 'pow', 'property', 'range', 'repr', 'reversed',
'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str',
'sum', 'super', 'tuple', 'vars', 'zip', 'seq', 'byte', 'ptr', 'array',
'Kmer', 'Int', 'UInt', 'optional']
tokens['builtins'] = [
(words(seqwords, prefix=r'(?<!\.)', suffix=r'\b'),
Name.Builtin),
(r'(?<!\.)(self|Ellipsis|NotImplemented|cls)\b', Name.Builtin.Pseudo),
(words((
'ArithmeticError', 'AssertionError', 'AttributeError',
'BaseException', 'BufferError', 'BytesWarning', 'DeprecationWarning',
'EOFError', 'EnvironmentError', 'Exception', 'FloatingPointError',
'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError',
'ImportWarning', 'IndentationError', 'IndexError', 'KeyError',
'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError',
'NotImplementedError', 'OSError', 'OverflowError',
'PendingDeprecationWarning', 'ReferenceError', 'ResourceWarning',
'RuntimeError', 'RuntimeWarning', 'StopIteration',
'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError',
'TypeError', 'UnboundLocalError', 'UnicodeDecodeError',
'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError',
'UnicodeWarning', 'UserWarning', 'ValueError', 'VMSError', 'Warning',
'WindowsError', 'ZeroDivisionError',
# new builtin exceptions from PEP 3151
'BlockingIOError', 'ChildProcessError', 'ConnectionError',
'BrokenPipeError', 'ConnectionAbortedError', 'ConnectionRefusedError',
'ConnectionResetError', 'FileExistsError', 'FileNotFoundError',
'InterruptedError', 'IsADirectoryError', 'NotADirectoryError',
'PermissionError', 'ProcessLookupError', 'TimeoutError'),
prefix=r'(?<!\.)', suffix=r'\b'),
Name.Exception),
]
tokens['magicfuncs'] = [
(words((
'__abs__', '__add__', '__aenter__', '__aexit__', '__aiter__', '__and__',
'__anext__', '__await__', '__bool__', '__bytes__', '__call__',
'__complex__', '__contains__', '__del__', '__delattr__', '__delete__',
'__delitem__', '__dir__', '__divmod__', '__enter__', '__eq__', '__exit__',
'__float__', '__floordiv__', '__format__', '__ge__', '__get__',
'__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__',
'__iadd__', '__iand__', '__ifloordiv__', '__ilshift__', '__imatmul__',
'__imod__', '__import__', '__imul__', '__index__', '__init__',
'__instancecheck__', '__int__', '__invert__', '__ior__', '__ipow__',
'__irshift__', '__isub__', '__iter__', '__itruediv__', '__ixor__',
'__le__', '__len__', '__length_hint__', '__lshift__', '__lt__',
'__matmul__', '__missing__', '__mod__', '__mul__', '__ne__', '__neg__',
'__new__', '__next__', '__or__', '__pos__', '__pow__', '__prepare__',
'__radd__', '__rand__', '__rdivmod__', '__repr__', '__reversed__',
'__rfloordiv__', '__rlshift__', '__rmatmul__', '__rmod__', '__rmul__',
'__ror__', '__round__', '__rpow__', '__rrshift__', '__rshift__',
'__rsub__', '__rtruediv__', '__rxor__', '__set__', '__setattr__',
'__setitem__', '__str__', '__sub__', '__subclasscheck__', '__truediv__',
'__xor__'), suffix=r'\b'),
Name.Function.Magic),
]
tokens['magicvars'] = [
(words((
'__annotations__', '__bases__', '__class__', '__closure__', '__code__',
'__defaults__', '__dict__', '__doc__', '__file__', '__func__',
'__globals__', '__kwdefaults__', '__module__', '__mro__', '__name__',
'__objclass__', '__qualname__', '__self__', '__slots__', '__weakref__'),
suffix=r'\b'),
Name.Variable.Magic),
]
tokens['numbers'] = [
(r'(\d+\.\d*|\d*\.\d+)([eE][+-]?[0-9]+)?', Number.Float),
(r'\d+[eE][+-]?[0-9]+j?', Number.Float),
(r'0[oO][0-7]+', Number.Oct),
(r'0[bB][01]+', Number.Bin),
(r'0[xX][a-fA-F0-9]+', Number.Hex),
(r'\d+', Number.Integer)
]
tokens['backtick'] = []
tokens['name'] = [
(r'@\w+', Name.Decorator),
(r'@', Operator), # new matrix multiplication operator
(uni_name, Name),
]
tokens['funcname'] = [
(uni_name, Name.Function, '#pop')
]
tokens['classname'] = [
(uni_name, Name.Class, '#pop')
]
tokens['import'] = [
(r'(\s+)(as)(\s+)', bygroups(Text, Keyword, Text)),
(r'\.', Name.Namespace),
(uni_name, Name.Namespace),
(r'(\s*)(,)(\s*)', bygroups(Text, Operator, Text)),
default('#pop') # all else: go back
]
tokens['fromimport'] = [
(r'(\s+)(import)\b', bygroups(Text, Keyword), '#pop'),
(r'\.', Name.Namespace),
(uni_name, Name.Namespace),
default('#pop'),
]
tokens['strings-single'] = innerstring_rules(String.Single)
tokens['strings-double'] = innerstring_rules(String.Double)
def analyse_text(text):
return shebang_matches(text, r'seq')

View File

@ -6,5 +6,4 @@ Tutorial
setup
primer
tutorial
workshop
pipelines

View File

@ -0,0 +1,52 @@
Pipelines
=========
Pipelines
---------
Codon extends the core Python language with a pipe operator. You can chain multiple functions and generators to form a pipeline. Pipeline stages can be regular functions or generators. In the case of standard functions, the function is simply applied to the input data and the result is carried to the remainder of the pipeline, akin to F#'s functional piping. If, on the other hand, a stage is a generator, the values yielded by the generator are passed lazily to the remainder of the pipeline, which in many ways mirrors how piping is implemented in Bash. Note that Codon ensures that generator pipelines do not collect any data unless explicitly requested, thus allowing the processing of terabytes of data in a streaming fashion with no memory and minimal CPU overhead.
.. code:: python
def add1(x):
return x + 1
2 |> add1 # 3; equivalent to add1(2)
def calc(x, y):
return x + y**2
2 |> calc(3) # 11; equivalent to calc(2, 3)
2 |> calc(..., 3) # 11; equivalent to calc(2, 3)
2 |> calc(3, ...) # 7; equivalent to calc(3, 2)
def gen(i):
for i in range(i):
yield i
5 |> gen |> print # prints 0 1 2 3 4 separated by newline
range(1, 4) |> iter |> gen |> print(end=' ') # prints 0 0 1 0 1 2 without newline
[1, 2, 3] |> print # prints [1, 2, 3]
range(100000000) |> print # prints range(0, 100000000)
range(100000000) |> iter |> print # not only prints all those numbers, but it uses almost no memory at all
Codon will chain anything that implements ``__iter__``; however, use
generators as much as possible because the compiler will optimize out
generators whenever possible. Combinations of pipes and generators can be
used to implement efficient streaming pipelines.
.. caution::
The Codon compiler may perform optimizations that change the order of elements passed through a pipeline. Therefore, it is best to not rely on order when using pipelines. If order needs to be maintained, consider using a regular loop or passing an index alongside each element sent through the pipeline.
Parallel Pipelines
------------------
CPython and many other implementations alike cannot take advantage of parallelism due to the infamous global interpreter lock, a mutex that protects accesses to Python objects, preventing multiple threads from executing Python bytecode at once. Unlike CPython, Codon has no such restriction and supports full multithreading. To this end, Codon supports a *parallel* pipe operator ``||>``, which is semantically similar to the standard pipe operator except that it allows the elements sent through it to be processed in parallel by the remainder of the pipeline. Hence, turning a serial program into a parallel one often requires the addition of just a single character in Codon. Further, a single pipeline can contain multiple parallel pipes, resulting in nested parallelism.
.. code:: python
range(100000) |> iter ||> print # prints all these numbers, probably in random order
range(100000) |> iter ||> process ||> clean # runs process in parallel, and then cleans data in parallel
Codon will automatically schedule the ``process`` and ``clean`` functions to execute as soon as possible. You can control the number of threads via the ``OMP_NUM_THREADS`` environment variable.
Internally, the Codon compiler uses an OpenMP task backend to generate code for parallel pipelines. Logically, parallel pipe operators are similar to parallel-for loops: the portion of the pipeline after the parallel pipe is outlined into a new function that is called by the OpenMP runtime task spawning routines (as in ``#pragma omp task`` in C++), and a synchronization point (``#pragma omp taskwait``) is added after the outlined segment. Lastly, the entire program is implicitly placed in an OpenMP parallel region (``#pragma omp parallel``) that is guarded by a "single" directive (``#pragma omp single``) so that the serial portions are still executed by one thread (this is required by OpenMP as tasks must be bound to an enclosing parallel region).

View File

@ -1,14 +1,14 @@
Language Primer
===============
If you know Python, you already know 95% of Seq. The following primer
If you know Python, you already know 95% of Codon. The following primer
assumes some familiarity with Python or at least one "modern"
programming language (QBASIC doesn't count).
Printing
--------
.. code:: seq
.. code:: python
print('hello world')
@ -18,9 +18,9 @@ Printing
Comments
--------
.. code:: seq
.. code:: python
# Seq comments start with "# 'and go until the end of the line
# Codon comments start with "# 'and go until the end of the line
"""
There are no multi-line comments. You can (ab)use the docstring operator (''')
@ -30,10 +30,10 @@ Comments
Literals
--------
Seq is a strongly typed language like C++, Java, or C#. That means each
Codon is a strongly typed language like C++, Java, or C#. That means each
expression must have a type that can be inferred at the compile-time.
.. code:: seq
.. code:: python
# Booleans
True # type: bool
@ -41,7 +41,7 @@ expression must have a type that can be inferred at the compile-time.
# Numbers
a = 1 # type: int. It's 64-bit signed integer.
b = 1.12 # type: float. Seq's float is identical to C's double.
b = 1.12 # type: float. Codon's float is identical to C's double.
c = 5u # type: int, but unsigned
d = Int[8](12) # 8-bit signed integer; you can go all the way to Int[2048]
e = UInt[8](200) # 8-bit unsigned integer
@ -68,15 +68,10 @@ expression must have a type that can be inferred at the compile-time.
# \\, \', \", \a, \b, \f, \n, \r, \t, \v,
# \xHHH (HHH is hex code), \OOO (OOO is octal code)
# Sequence types
dna = s"ACGT" # type: seq. These are DNA sequences.
prt = p"MYX" # type: bio.pseq. These are protein sequences.
kmer = k"ACGT" # type: Kmer[4]. Note that Kmer[5] is different than Kmer[12].
Tuples
~~~~~~
.. code:: seq
.. code:: python
# Tuples
t = (1, 2.3, 'hi') # type: Tuple[int, float, str].
@ -87,10 +82,10 @@ As all types must be known at compile time, tuple indexing works
only if a tuple is homogenous (all types are the same) or if the value
of the index is known at compile-time.
You can, however, iterate over heterogenous tuples in Seq. This is achieved
You can, however, iterate over heterogenous tuples in Codon. This is achieved
by unrolling the loop to accommodate the different types.
.. code:: seq
.. code:: python
t = (1, 2.3, 'hi')
t[1] # works because 1 is a constant int
@ -116,7 +111,7 @@ by unrolling the loop to accommodate the different types.
Containers
~~~~~~~~~~
.. code:: seq
.. code:: python
l = [1, 2, 3] # type: List[int]; a list of integers
s = {1.1, 3.3, 2.2, 3.3} # type: Set[float]; a set of floats
@ -127,9 +122,9 @@ Containers
dn = {} # an empty dict whose type is inferred based on usage
dn = Dict[int, float]() # an empty dictionary with explicit element types
Because Seq is strongly typed, these won't compile:
Because Codon is strongly typed, these won't compile:
.. code:: seq
.. code:: python
l = [1, 's'] # is it a List[int] or List[str]? you cannot mix-and-match types
d = {1: 'hi'}
@ -138,11 +133,11 @@ Because Seq is strongly typed, these won't compile:
t = (1, 2.2) # Tuple[int, float]
lt = list(t) # compile error: t is not homogenous
lp = [1, 2.1, 3, 5] # compile error: Seq will not automatically cast a float to an int
lp = [1, 2.1, 3, 5] # compile error: Codon will not automatically cast a float to an int
This will work, though:
.. code:: seq
.. code:: python
u = (1, 2, 3)
lu = list(u) # works: u is homogenous
@ -156,7 +151,7 @@ This will work, though:
Assignments and operators
-------------------------
.. code:: seq
.. code:: python
a = 1 + 2 # this is 3
a = (1).__add__(2) # you can use a function call instead of an operator; this is also 3
@ -178,7 +173,6 @@ Operator Magic method Description
``**`` ``__pow__`` exponentiation
``%`` ``__mod__`` modulo
``@`` ``__matmul__`` matrix multiplication;
sequence alignment
``&`` ``__and__`` bitwise and
``|`` ``__or__`` bitwise or
``^`` ``__xor__`` bitwise xor
@ -195,7 +189,7 @@ Operator Magic method Description
``or`` none boolean or (short-circuits)
======== ================ ==================================================
Seq also has the following unary operators:
Codon also has the following unary operators:
======== ================ =============================
Operator Magic method Description
@ -211,9 +205,9 @@ Operator Magic method Description
Tuple unpacking
~~~~~~~~~~~~~~~
Seq supports most of Python's tuple unpacking syntax:
Codon supports most of Python's tuple unpacking syntax:
.. code:: seq
.. code:: python
x, y = 1, 2 # x is 1, y is 2
(x, (y, z)) = 1, (2, 3) # x is 1, y is 2, z is 3
@ -240,9 +234,9 @@ Control flow
Conditionals
~~~~~~~~~~~~
Seq supports the standard Python conditional syntax:
Codon supports the standard Python conditional syntax:
.. code:: seq
.. code:: python
if a or b or some_cond():
print(1)
@ -255,10 +249,10 @@ Seq supports the standard Python conditional syntax:
a = b if sth() else c # ternary conditional operator
Seq extends the Python conditional syntax with a ``match`` statement, which is
Codon extends the Python conditional syntax with a ``match`` statement, which is
inspired by Rust's:
.. code:: seq
.. code:: python
match a + some_heavy_expr(): # assuming that the type of this expression is int
case 1: # is it 1?
@ -300,12 +294,6 @@ inspired by Rust's:
case [...]: # any other list
...
match sequence: # of type seq
case 'ACGT': ...
case 'AC_T': ... # _ is a wildcard character and it can be anything
case 'A_C_T_': ... # a spaced k-mer AxCxTx
case 'AC*T': ... # matches a sequence that starts with AC and ends with T
You can mix, match and chain match rules as long as the match type
matches the expression type.
@ -314,7 +302,7 @@ Loops
Standard fare:
.. code:: seq
.. code:: python
a = 10
while a > 0: # prints even numbers from 9 to 1
@ -340,7 +328,7 @@ Comprehensions
Technically, comprehensions are not statements (they are expressions).
Comprehensions are a nifty, Pythonic way to create collections:
.. code:: seq
.. code:: python
l = [i for i in range(5)] # type: List[int]; l is [0, 1, 2, 3, 4]
l = [i for i in range(15) if i % 2 == 1 if i > 10] # type: List[int]; l is [11, 13]
@ -352,7 +340,7 @@ Comprehensions are a nifty, Pythonic way to create collections:
You can also use collections to create generators (more about them later
on):
.. code:: seq
.. code:: python
g = (i for i in range(10))
print(list(g)) # prints number from 0 to 9, inclusive
@ -361,9 +349,9 @@ Exception handling
~~~~~~~~~~~~~~~~~~
Again, if you know how to do this in Python, you know how to do it in
Seq:
Codon:
.. code:: seq
.. code:: python
def throwable():
raise ValueError("doom and gloom")
@ -378,14 +366,14 @@ Seq:
print("whatever, it's done")
.. note::
Right now, Seq cannot catch multiple exceptions in one
Right now, Codon cannot catch multiple exceptions in one
statement. Thus ``catch (Exc1, Exc2, Exc3) as var`` will not compile.
If you have an object that implements ``__enter__`` and ``__exit__``
methods to manage its lifetime (say, a ``File``), you can use a ``with``
statement to make your life easy:
.. code:: seq
.. code:: python
with open('foo.txt') as f, open('foo_copy.txt', 'w') as fo:
for l in f:
@ -394,20 +382,20 @@ statement to make your life easy:
Variables and scoping
---------------------
You have probably noticed by now that blocks in Seq are defined by their
You have probably noticed by now that blocks in Codon are defined by their
indentation level (as in Python). We recommend using 2 or 4 spaces to
indent blocks. Do not mix indentation levels, and do not mix tabs and spaces;
stick to any *consistent* way of indenting your code.
One of the major differences between Seq and Python lies in variable
scoping rules. Seq variables cannot *leak* to outer blocks. Every
One of the major differences between Codon and Python lies in variable
scoping rules. Codon variables cannot *leak* to outer blocks. Every
variable is accessible only within its own block (after the variable is
defined, of course), and within any block that is nested within the
variable's own block.
That means that the following Pythonic pattern won't compile:
.. code:: seq
.. code:: python
if cond():
x = 1
@ -419,9 +407,9 @@ That means that the following Pythonic pattern won't compile:
...
print(i) # error: i is only accessible within the for loop block
In Seq, you can rewrite that as:
In Codon, you can rewrite that as:
.. code:: seq
.. code:: python
x = 2
if cond():
@ -432,10 +420,10 @@ In Seq, you can rewrite that as:
print(x)
Another important difference between Seq and Python is that, in Seq, variables
Another important difference between Codon and Python is that, in Codon, variables
cannot change types. So this won't compile:
.. code:: seq
.. code:: python
a = 's'
a = 1 # error: expected string, but got int
@ -444,7 +432,7 @@ A note about function scoping: functions cannot modify variables that
are not defined within the function block. You need to use ``global`` to
modify such variables:
.. code:: seq
.. code:: python
g = 5
def foo():
@ -468,24 +456,24 @@ are not defined within the function.
Imports
-------
You can import functions and classes from another Seq module by doing:
You can import functions and classes from another Codon module by doing:
.. code:: seq
.. code:: python
# Create foo.seq with a bunch of useful methods
# Create foo.codon with a bunch of useful methods
import foo
foo.useful1()
p = foo.FooType()
# Create bar.seq with a bunch of useful methods
# Create bar.codon with a bunch of useful methods
from bar import x, y
x(y)
from bar import z as bar_z
bar_z()
``import foo`` looks for ``foo.seq`` or ``foo/__init__.seq`` in the
``import foo`` looks for ``foo.codon`` or ``foo/__init__.codon`` in the
current directory.
Functions
@ -493,7 +481,7 @@ Functions
Functions are defined as follows:
.. code:: seq
.. code:: python
def foo(a, b, c):
return a + b + c
@ -501,7 +489,7 @@ Functions are defined as follows:
How about procedures? Well, don't return anything meaningful:
.. code:: seq
.. code:: python
def proc(a, b):
print(a, 'followed by', b)
@ -514,10 +502,10 @@ How about procedures? Well, don't return anything meaningful:
proc2(1, 's')
proc2(5, 's') # this prints nothing
Seq is a strongly-typed language, so you can restrict argument and
Codon is a strongly-typed language, so you can restrict argument and
return types:
.. code:: seq
.. code:: python
def fn(a: int, b: float):
return a + b # this works because int implements __add__(float)
@ -538,7 +526,7 @@ return types:
Default arguments? Named arguments? You bet:
.. code:: seq
.. code:: python
def foo(a, b: int, c: float = 1.0, d: str = 'hi'):
print(a, b, c, d)
@ -547,7 +535,7 @@ Default arguments? Named arguments? You bet:
How about optional arguments? We support that too:
.. code:: seq
.. code:: python
# type of b promoted to Optional[int]
def foo(a, b: int = None):
@ -559,29 +547,29 @@ How about optional arguments? We support that too:
Generics
~~~~~~~~
As we've said several times: Seq is a strongly typed language. As
As we've said several times: Codon is a strongly typed language. As
such, it is not as flexible as Python when it comes to types (e.g. you
can't have lists with elements of different types). However,
Seq tries to mimic Python's *"I don't care about types until I do"*
Codon tries to mimic Python's *"I don't care about types until I do"*
attitude as much as possible by utilizing a technique known as
*monomorphization*. If there is a function that has an argument
without a type definition, Seq will consider it a *generic* function,
without a type definition, Codon will consider it a *generic* function,
and will generate different functions for each invocation of
that generic function:
.. code:: seq
.. code:: python
def foo(x):
print(x) # print relies on typeof(x).__str__(x) method to print the representation of x
foo(1) # Seq automatically generates foo(x: int) and calls int.__str__ when needed
foo('s') # Seq automatically generates foo(x: str) and calls str.__str__ when needed
foo([1, 2]) # Seq automatically generates foo(x: List[int]) and calls List[int].__str__ when needed
foo(1) # Codon automatically generates foo(x: int) and calls int.__str__ when needed
foo('s') # Codon automatically generates foo(x: str) and calls str.__str__ when needed
foo([1, 2]) # Codon automatically generates foo(x: List[int]) and calls List[int].__str__ when needed
But what if you need to mix type definitions and generic types? Say, your
function can take a list of *anything*? Well, you can use generic
specifiers:
.. code:: seq
.. code:: python
def foo(x: List[T], T: type):
print(x)
@ -604,9 +592,9 @@ specifiers:
Generators
~~~~~~~~~~
Seq supports generators, and they are fast! Really, really fast!
Codon supports generators, and they are fast! Really, really fast!
.. code:: seq
.. code:: python
def gen(i):
while i < 10:
@ -619,7 +607,7 @@ You can also use ``yield`` to implement coroutines: ``yield``
suspends the function, while ``(yield)`` (yes, parentheses are required)
receives a value, as in Python.
.. code:: seq
.. code:: python
def mysum[T](start: T):
m = start
@ -635,62 +623,16 @@ receives a value, as in Python.
iadder.send(i) # send a value to coroutine
print(iadder.send(-1)) # prints 45
Pipelines
~~~~~~~~~
Seq extends the core Python language with a pipe operator, which is
similar to bash pipes (or F#'s ``|>`` operator). You can chain multiple
functions and generators to form a pipeline:
.. code:: seq
def add1(x):
return x + 1
2 |> add1 # 3; equivalent to add1(2)
def calc(x, y):
return x + y**2
2 |> calc(3) # 11; equivalent to calc(2, 3)
2 |> calc(..., 3) # 11; equivalent to calc(2, 3)
2 |> calc(3, ...) # 7; equivalent to calc(3, 2)
def gen(i):
for i in range(i):
yield i
5 |> gen |> print # prints 0 1 2 3 4 separated by newline
range(1, 4) |> iter |> gen |> print(end=' ') # prints 0 0 1 0 1 2 without newline
[1, 2, 3] |> print # prints [1, 2, 3]
range(100000000) |> print # prints range(0, 100000000)
range(100000000) |> iter |> print # not only prints all those numbers, but it uses almost no memory at all
Seq will chain anything that implements ``__iter__``; however, use
generators as much as possible because the compiler will optimize out
generators whenever possible. Combinations of pipes and generators can be
used to implement efficient streaming pipelines.
Seq can also execute pipelines in parallel via the parallel pipe (``||>``)
operator:
.. code:: seq
range(100000) |> iter ||> print # prints all these numbers, probably in random order
range(100000) |> iter ||> process ||> clean # runs process in parallel, and then cleans data in parallel
In the last example, Seq will automatically schedule the ``process`` and
``clean`` functions to execute as soon as possible. You can control the
number of threads via the ``OMP_NUM_THREADS`` environment variable.
.. _interop:
Foreign function interface (FFI)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Seq can easily call functions from C and Python.
Codon can easily call functions from C and Python.
Let's import some C functions:
.. code:: seq
.. code:: python
from C import pow(float) -> float
pow(2.0) # 4.0
@ -698,15 +640,15 @@ Let's import some C functions:
# Import and rename function
from C import puts(cobj) -> void as print_line # type cobj is C's pointer (void*, char*, etc.)
print_line("hi!".c_str()) # prints "hi!".
# Note .ptr at the end of string--- needed to cast Seq's string to char*.
# Note .ptr at the end of string--- needed to cast Codon's string to char*.
``from C import`` only works if the symbol is available to the program. If you
are running your programs via ``seqc``, you can link dynamic libraries
by running ``seqc run -l path/to/dynamic/library.so ...``.
are running your programs via ``codon``, you can link dynamic libraries
by running ``codon run -l path/to/dynamic/library.so ...``.
Hate linking? You can also use dyld library loading as follows:
.. code:: seq
.. code:: python
LIBRARY = "mycoollib.so"
@ -716,21 +658,21 @@ Hate linking? You can also use dyld library loading as follows:
foo2 = my2(4, 3.2)
.. note::
When importing external non-Seq functions, you must explicitly specify
When importing external non-Codon functions, you must explicitly specify
argument and return types.
How about Python? If you have set the ``SEQ_PYTHON`` environment variable as
How about Python? If you have set the ``CODON_PYTHON`` environment variable as
described in the first section, you can do:
.. code:: seq
.. code:: python
from python import mymodule.myfunction(str) -> int as foo
print(foo("bar"))
Often you want to execute more complex Python code within Seq. To that
end, you can use Seq's ``@python`` annotation:
Often you want to execute more complex Python code within Codon. To that
end, you can use Codon's ``@python`` annotation:
.. code:: seq
.. code:: python
@python
def scipy_here_i_come(i: List[List[float]]) -> List[float]:
@ -743,18 +685,18 @@ end, you can use Seq's ``@python`` annotation:
return list(eigenvalues)
print(scipy_here_i_come([[1.0, 2.0], [3.0, 4.0]])) # [-0.372281, 5.37228] with some warnings...
Seq will automatically bridge any object that implements the ``__to_py__``
and ``__from_py__`` magic methods. All standard Seq types already
Codon will automatically bridge any object that implements the ``__to_py__``
and ``__from_py__`` magic methods. All standard Codon types already
implement these methods.
Classes and types
-----------------
Of course, Seq supports classes! However, you must declare class members
Of course, Codon supports classes! However, you must declare class members
and their types in the preamble of each class (like you would do with
Python's dataclasses).
.. code:: seq
.. code:: python
class Foo:
x: int
@ -770,11 +712,11 @@ Python's dataclasses).
f.method() # prints "1 2"
.. note::
Seq does not (yet!) support inheritance and polymorphism.
Codon does not (yet!) support inheritance and polymorphism.
Unlike Python, Seq supports method overloading:
Unlike Python, Codon supports method overloading:
.. code:: seq
.. code:: python
class Foo:
x: int
@ -799,7 +741,7 @@ Unlike Python, Seq supports method overloading:
Classes can also be generic:
.. code:: seq
.. code:: python
class Container[T]:
l: List[T]
@ -809,7 +751,7 @@ Classes can also be generic:
Classes create objects that are passed by reference:
.. code:: seq
.. code:: python
class Point:
x: int
@ -824,9 +766,9 @@ Classes create objects that are passed by reference:
If you need to copy an object's contents, implement the ``__copy__`` magic
method and use ``q = copy(p)`` instead.
Seq also supports pass-by-value types via the ``@tuple`` annotation:
Codon also supports pass-by-value types via the ``@tuple`` annotation:
.. code:: seq
.. code:: python
@tuple
class Point:
@ -840,7 +782,7 @@ Seq also supports pass-by-value types via the ``@tuple`` annotation:
However, **by-value objects are immutable!**. The following code will
not compile:
.. code:: seq
.. code:: python
p = Point(1, 2)
p.x = 2 # error! immutable type
@ -850,7 +792,7 @@ Under the hood, types are basically named tuples (equivalent to Python's
You can also add methods to types:
.. code:: seq
.. code:: python
@tuple
class Point:
@ -869,17 +811,16 @@ You can also add methods to types:
Type extensions
~~~~~~~~~~~~~~~
Suppose you have a class that lacks a method or an operator that might
be really useful. You can extend that class and add the method at compile time:
Suppose you have a class that lacks a method or an operator that might be really useful. Codon provides an ``@extend`` annotation that allows you to add and modify methods of various types at compile time, including built-in types like ``int`` or ``str``. This allows much of the functionality of built-in types to be implemented in Codon as type extensions in the standard library.
.. code:: seq
.. code:: python
class Foo:
...
f = Foo(...)
# we need foo.cool() but it does not exist... not a problem for Seq
# We need foo.cool() but it does not exist... not a problem for Codon
@extend
class Foo:
def cool(self: Foo):
@ -887,14 +828,16 @@ be really useful. You can extend that class and add the method at compile time:
f.cool() # works!
# how about we add support for adding integers and strings?
# How about we add support for adding integers and strings:
@extend
class int:
def __add__(self: int, other: str) -> int:
def __add__(self: int, other: str):
return self + int(other)
print(5 + '4') # 9
Note that all type extensions are performed strictly at compile time and incur no runtime overhead.
Magic methods
~~~~~~~~~~~~~
@ -915,13 +858,22 @@ operators overload unary and binary operators (see :ref:`operators`)
``__str__`` support printing and ``str`` method
================ =============================================
Other types
~~~~~~~~~~~
Codon provides arbitrary-width signed and unsigned integers, e.g. ``Int[32]`` is a signed 32-bit integer while ``UInt[128]`` is an unsigned 128-bit integer, respectively (note that ``int`` is an ``Int[64]``). Typedefs for common bit widths are provided in the standard library, such as ``i8``, ``i16``, ``u32``, ``u64`` etc.
The ``Ptr[T]`` type in Codon also corresponds to a raw C pointer (e.g. ``Ptr[byte]`` is equivalent to ``char*`` in C). The ``Array[T]`` type represents a fixed-length array (essentially a pointer with a length).
Codon also provides ``__ptr__`` for obtaining a pointer to a variable (as in ``__ptr__(myvar)``) and ``__array__`` for declaring stack-allocated arrays (as in ``__array__[int](10)``).
LLVM functions
~~~~~~~~~~~~~~
In certain cases, you might want to use LLVM features that are not directly
accessible with Seq. This can be done with the ``@llvm`` attribute:
accessible with Codon. This can be done with the ``@llvm`` attribute:
.. code:: seq
.. code:: python
@llvm
def llvm_add[T](a: T, b: T) -> T:
@ -933,4 +885,4 @@ accessible with Seq. This can be done with the ``@llvm`` attribute:
--------------
Issues, feedback, or comments regarding this tutorial? Let us know `on GitHub <https://github.com/seq-lang/seq>`__.
Issues, feedback, or comments regarding this tutorial? Let us know `on GitHub <https://github.com/exaloop/codon>`__.

View File

@ -11,22 +11,22 @@ Simple!
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
If you want to use Python interop, you also need to point
``SEQ_PYTHON`` to the Python library (typically called
``libpython3.9m.so`` or similar). The Seq repository contains a
`Python script <https://github.com/seq-lang/seq/blob/develop/test/python/find-python-library.py>`_
``CODON_PYTHON`` to the Python library (typically called
``libpython3.9m.so`` or similar). The Codon repository contains a
`Python script <https://github.com/exaloop/codon/blob/develop/test/python/find-python-library.py>`_
that will identify and print the path to this library.
Usage
-----
Assuming that Seq was properly installed, you can use it as follows:
Assuming that Codon was properly installed, you can use it as follows:
.. code:: bash
seqc run foo.seq # Compile and run foo.seq
seqc run -release foo.seq # Compile and run foo.seq with optimizations
seqc build -exe file.seq # Compile foo.seq to executable "foo"
codon run foo.codon # Compile and run foo.codon
codon run -release foo.codon # Compile and run foo.codon with optimizations
codon build -exe file.codon # Compile foo.codon to executable "foo"
Note that the ``-exe`` option requires ``clang`` to be installed, and
the ``LIBRARY_PATH`` environment variable to point to the Seq runtime
library (installed by default at ``~/.seq/lib/seq``).
the ``LIBRARY_PATH`` environment variable to point to the Codon runtime
library (installed by default at ``~/.codon/lib/codon``).

View File

@ -1,411 +0,0 @@
Tutorial
========
Bio-specific features
---------------------
Seq's ``bio`` module contains all the following functions, types, and methods. The code snippets below should be preceded with ``from bio import *``, although we omit that line for simplicity below.
Genomic types
^^^^^^^^^^^^^
Seq's namesake type is indeed the sequence type: ``seq``. A ``seq`` object represents a DNA sequence of any length and---on top of general-purpose string functionality---provides methods for performing common sequence operations such as splitting into subsequences, reverse complementation and :math:`k`-mer extraction. Alongside the ``seq`` type are :math:`k`-mer types, where e.g. ``Kmer[1]`` represents a 1-mer, ``Kmer[2]`` a 2-mer and so on, up to ``Kmer[1024]``.
Sequences can be seamlessly converted between these various types:
.. code-block:: seq
dna = s'ACGTACGTACGT' # sequence literal
# (a) split into subsequences of length 3
# with a step of 2
for sub in dna.split(k=3, step=2):
print(sub)
print(~sub) # reverse complement
# (b) split into 5-mers with step 1 (default)
for kmer in dna.kmers(k=5):
print(kmer)
print(~kmer) # reverse complement
# (c) convert entire sequence to 12-mer
kmer = Kmer[12](dna)
Seq also supports a ``pseq`` type for protein sequences:
.. code-block:: seq
protein = p'HEAGAWGHE' # pseq literal
print(list(protein.split(3, 3))) # [HEA, GAW, GHE]
print(s'ACCATGACA' |> translate) # TMT
.. Note:: What's the difference between sequences and :math:`k`-mers in Seq? Sequences have arbitrary length and allow for ambiguous bases like ``N``. :math:`k`-mers, on the other hand, have a length that is fixed and must be known at compile time, only allowing for ``ACGT`` bases. :math:`k`-mers can therefore be represented internally as 2-bit encoded integers, making them compact and very efficient to manipulate.
In practice, sequences would be read from e.g. a FASTQ file:
.. code-block:: seq
for record in FASTQ('input.fq'):
print('processing', record.name)
process(record.seq)
If you only care about the sequences, you can also do this:
.. code-block:: seq
for read in FASTQ('input.fq') |> seqs:
process(read)
Common formats like FASTQ, FASTA, SAM, BAM and CRAM are supported. The ``FASTQ`` and ``FASTA`` parsers support several additional options:
- ``validate`` (``True`` by default): Perform data validation as sequences are read
- ``gzip`` (``True`` by default): Perform I/O using zlib, supporting gzip'd files (note that plain text files will still work with this enabled)
- ``fai`` (``True`` by default; FASTA only): Look for a ``.fai`` file to determine sequence lengths before reading
For example:
.. code-block:: seq
for read in FASTQ('input.fq', validate=False, gzip=False) |> seqs:
process(read)
To read protein sequences, you can use ``pFASTA``, which has the same interface as ``FASTA`` (but does not support ``fai``):
.. code-block:: seq
for p in pFASTA('input.fa') |> seqs:
process(p)
.. _match:
Sequence matching
^^^^^^^^^^^^^^^^^
Seq provides the conventional ``match`` construct, which works on integers, lists, strings and tuples. Here's a simple example:
.. code-block:: seq
def describe(n: int):
match n:
case m if m < 0:
print('negative')
case 0:
print('zero')
case m if 0 < m < 10:
print('small')
case _:
print('large')
A novel aspect of Seq's ``match`` statement is that it also works on sequences, and allows for concise recursive representations of several sequence operations such as subsequence search, reverse complementation tests and base counting, as shown in this example:
.. code-block:: seq
# (a)
def has_spaced_acgt(s: seq):
match s:
case 'A_C_G_T*':
return True
case t if len(t) >= 8:
return has_spaced_acgt(s[1:])
case _:
return False
# (b)
def is_own_revcomp(s: seq):
match s:
case 'A*T' or 'T*A' or 'C*G' or 'G*C':
return is_own_revcomp(s[1:-1])
case s'':
return True
case _:
return False
# (c)
@tuple
class BaseCount:
A: int
C: int
G: int
T: int
def __add__(self, other: BaseCount):
a1, c1, g1, t1 = self
a2, c2, g2, t2 = other
return (a1 + a2, c1 + c2, g1 + g2, t1 + t2)
def count_bases(s):
match s:
case 'A*': return count_bases(s[1:]) + (1,0,0,0)
case 'C*': return count_bases(s[1:]) + (0,1,0,0)
case 'G*': return count_bases(s[1:]) + (0,0,1,0)
case 'T*': return count_bases(s[1:]) + (0,0,0,1)
case _: return BaseCount(0,0,0,0)
- Example (a) checks if a given sequence contains the subsequence ``A_C_G_T``, where ``_`` is a wildcard base.
- Example (b) checks if the given sequence is its own reverse complement.
- Example (c) counts how many times each base appears in the given sequence.
Sequence patterns consist of literal ``ACGT`` characters, single-base wildcards (``_``) or "zero or more" wildcards (``...``) that match zero or more of any base.
.. _pipeline:
Pipelines
^^^^^^^^^
Pipelining is a natural model for thinking about processing genomic data, as sequences are typically processed in stages (e.g. read from input file, split into :math:`k`-mers, query :math:`k`-mers in index, perform full dynamic programming alignment, output results to file), and are almost always independent of one another as far as this processing is concerned. Because of this, Seq supports a pipe operator: ``|>``, similar to F#'s pipe and R's ``magrittr`` (``%>%``).
Pipeline stages in Seq can be regular functions or generators. In the case of standard functions, the function is simply applied to the input data and the result is carried to the remainder of the pipeline, akin to F#'s functional piping. If, on the other hand, a stage is a generator, the values yielded by the generator are passed lazily to the remainder of the pipeline, which in many ways mirrors how piping is implemented in Bash. Note that Seq ensures that generator pipelines do not collect any data unless explicitly requested, thus allowing the processing of terabytes of data in a streaming fashion with no memory and minimal CPU overhead.
Here's an example of pipeline usage, which shows the same two loops from above, but as pipelines:
.. code-block:: seq
dna = s'ACGTACGTACGT' # sequence literal
# (a) split into subsequences of length 3
# with a stride of 2
dna |> split(..., k=3, step=2) |> print
# (b) split into 5-mers with stride 1
def f(kmer):
print(kmer)
print(~kmer)
dna |> kmers(k=5, step=1) |> f
First, note that ``split`` is a Seq standard library function that takes three arguments: the sequence to split, the subsequence length and the stride; ``split(..., k=3, step=2)`` is a partial call of ``split`` that produces a new single-argument function ``f(x)`` which produces ``split(x, k=3, step=2)``. The undefined argument(s) in a partial call can be implicit, as in the second example: ``kmers`` (also a standard library function) is parameterized by the target :math:`k`-mer type and takes as arguments the sequence to :math:`k`-merize, the :math:`k`-mer length, and the stride; since just two of the three arguments are provided, the first is implicitly replaced by ``...`` to produce a partial call (i.e. the expression is equivalent to ``kmers(..., k=5, step=1)``). Both ``split`` and ``kmers`` are themselves generators that yield subsequences and :math:`k`-mers respectively, which are passed sequentially to the last stage of the enclosing pipeline in the two examples.
.. caution::
The Seq compiler may perform optimizations that change the order of elements passed through a pipeline. Therefore, it is best to not rely on order when using pipelines. If order needs to be maintained, consider using a regular loop or passing an index alongside each element sent through the pipeline.
Sequence alignment
^^^^^^^^^^^^^^^^^^
Aligning sequences is very straightforward in Seq, and supports numerous options/variants:
.. code-block:: seq
# default parameters
s1 = s'CGCGAGTCTT'
s2 = s'CGCAGAGTT'
aln = s1 @ s2
print(aln.cigar, aln.score)
# custom parameters
# match = 2; mismatch = 4; gap1(k) = 2k + 4; gap2(k) = k + 13
aln = s1.align(s2, a=2, b=4, gapo=4, gape=2, gapo2=13, gape2=1)
print(aln.cigar, aln.score)
Here is the list of options supported by the ``align()`` method; all are optional (default is global alignment):
- ``a``: match score
- ``b``: mismatch score
- ``ambig``: ambiguous (i.e. N) match score
- ``gapo``: gap open cost
- ``gape``: gap extension cost
- ``gapo2``: 2nd gap open cost for dual gap cost function
- ``gape2``: 2nd gap extension cost for dual gap cost function
- ``bandwidth``: bandwidth for DP alignment
- ``zdrop``: off-diagonal drop-off to stop extension
- ``score_only``: if true, don't compute CIGAR
- ``right``: if true, right-align gaps
- ``approx_max``: if true, approximate max
- ``approx_drop``: if true, approximate Z-drop
- ``rev_cigar``: if true, reverse CIGAR in output
- ``ext_only``: if true, perform extension alignment
- ``splice``: if true, perform spliced alignment
Note that all costs/scores are positive by convention.
.. _interalign:
Inter-sequence alignment
""""""""""""""""""""""""
Seq uses `ksw2 <https://github.com/lh3/ksw2>`_ as its default alignment kernel. ksw2 does a good job of applying SIMD parallelization to align a single pair of sequences, which is referred to as *intra-sequence* alignment. However, we can often get better performance by aligning multiple sequences at once, referred to as *inter-sequence* alignment. Inter-sequence alignment is usually more cumbersome to program in general-purpose languages because many sequences need to be batched before performing the alignment. However, in Seq, inter-sequence alignment is as easy as intra-sequence, using the ``@inter_align`` annotation:
.. code-block:: seq
@inter_align
def process(t):
query, target = t
score = query.align(target, a=1, b=2, ambig=0, gapo=2, gape=1, zdrop=100, bandwidth=100, end_bonus=5)
print(query, target, score)
zip(seqs('queries.txt'), seqs('targets.txt')) |> process
Internally, the Seq compiler performs pipeline transformations when sequence alignment is performed within a function tagged ``@inter_align``, so as to suspend execution of the calling function, batch sequences that need to be aligned, perform inter-sequence alignment and return the results to the suspended functions. Note that the inter-sequence alignment kernel used by Seq is adapted from `BWA-MEM2 <https://github.com/bwa-mem2/bwa-mem2>`_.
.. _prefetch:
Genomic index prefetching
^^^^^^^^^^^^^^^^^^^^^^^^^
Large genomic indices---ranging from several to tens or even hundreds of gigabytes---used in many applications result in extremely poor cache performance and, ultimately, a substantial fraction of stalled memory-bound cycles. For this reason, Seq performs pipeline optimizations to enable data prefetching and to hide memory latencies. You, the user, must provide just:
- a ``__prefetch__`` magic method definition in the index class, which is logically similar to ``__getitem__`` (indexing construct) but performs a prefetch instead of actually loading the requested value (and can simply delegate to ``__prefetch__`` methods of built-in types);
- a one-line ``@prefetch`` annotation on functions that should perform prefetching.
In particular, a typical prefetch-friendly index class would look like this:
.. code-block:: seq
class MyIndex: # abstract k-mer index
...
def __getitem__(self, kmer: Kmer[20]):
# standard __getitem__
def __prefetch__(self, kmer: Kmer[20]):
# similar to __getitem__, but performs prefetch
Now, if we were to process data in a pipeline as such:
.. code-block:: seq
@prefetch
def process(read: seq, index: MyIndex):
...
for kmer in read.kmers(k=20, step=step):
hits_fwd = index[kmer]
hits_rev = index[~kmer]
...
return x
FASTQ("reads.fq") |> seqs |> process(index) |> postprocess
The Seq compiler will perform pipeline transformations to overlap cache misses in ``MyIndex`` with other useful work, increasing overall throughput. In our benchmarks, we often find these transformations to improve performance by 50% to 2×. However, the improvement is dataset- and application-dependent (and can potentially even decrease performance, although we rarely observed this), so users are encouraged to experiment with it for their own use case.
As a concrete example, consider Seq's built-in FM-index type, ``FMIndex``, and a toy application that counts occurences of 20-mers from an input FASTQ. ``FMIndex`` provides end-to-end search methods like ``locate()`` and ``count()``, but we can take advantage of Seq's prefetch optimization by working with FM-index intervals:
.. code-block:: seq
from bio.fmindex import FMIndex
fmi = FMIndex('/path/to/genome.fa')
k, step, n = 20, 20, 0
def update(count: int):
global n
n += count
@prefetch
def find(s: seq, fmi: FMIndex):
intv = fmi.interval(s[-1]) # initial FM-index interval
s = s[:-1] # trim off last base of sequence
while s and intv:
intv = fmi[intv, s[-1]] # backwards extend FM-index interval
s = s[:-1] # trim off last base of sequence
return len(intv) # return count of sequence in index
FASTQ('/path/to/reads.fq') |> seqs |> split(k, step) |> find(fmi) |> update
print('total:', n)
That single ``@prefetch`` line can have a significant impact, especially for larger ``k``. Here is a graph of the performance of this exact snippet for various ``k`` using hg19 as the reference:
.. image:: ../../images/prefetch.png
:width: 500px
:align: center
:alt: prefetch performance
Other features
--------------
Parallelism
^^^^^^^^^^^
CPython and many other implementations alike cannot take advantage of parallelism due to the infamous global interpreter lock, a mutex that protects accesses to Python objects, preventing multiple threads from executing Python bytecode at once. Unlike CPython, Seq has no such restriction and supports full multithreading. To this end, Seq supports a *parallel* pipe operator ``||>``, which is semantically similar to the standard pipe operator except that it allows the elements sent through it to be processed in parallel by the remainder of the pipeline. Hence, turning a serial program into a parallel one often requires the addition of just a single character in Seq. Further, a single pipeline can contain multiple parallel pipes, resulting in nested parallelism. As an example, here are the same two pipelines as above, but parallelized:
.. code-block:: seq
dna = s'ACGTACGTACGT' # sequence literal
# (a) split into subsequences of length 3
# with a stride of 2
dna |> split(..., k=3, step=2) ||> print
# (b) split into 5-mers with stride 1
def f(kmer):
print(kmer)
print(~kmer)
dna |> kmers(k=5, step=1) ||> f
Internally, the Seq compiler uses an OpenMP task backend to generate code for parallel pipelines. Logically, parallel pipe operators are similar to parallel-for loops: the portion of the pipeline after the parallel pipe is outlined into a new function that is called by the OpenMP runtime task spawning routines (as in ``#pragma omp task`` in C++), and a synchronization point (``#pragma omp taskwait``) is added after the outlined segment. Lastly, the entire program is implicitly placed in an OpenMP parallel region (``#pragma omp parallel``) that is guarded by a "single" directive (``#pragma omp single``) so that the serial portions are still executed by one thread (this is required by OpenMP as tasks must be bound to an enclosing parallel region).
Type extensions
^^^^^^^^^^^^^^^
Seq provides an ``@extend`` annotation that allows programmers to add and modify methods of various types at compile time, including built-in types like ``int`` or ``str``. This actually allows much of the functionality of built-in types to be implemented in Seq as type extensions in the standard library. Here is an example where the ``int`` type is extended to include a ``to`` method that generates integers in a specified range, as well as to override the ``__mul__`` magic method to "intercept" integer multiplications:
.. code-block:: seq
@extend
class int:
def to(self, other: int):
for i in range(self, other + 1):
yield i
def __truediv__(self, other: int):
print('caught int div!')
return 42
for i in (5).to(10):
print(i) # 5, 6, ..., 10
# prints 'caught int div!' then '42'
print(2 / 3)
Note that all type extensions are performed strictly at compile time and incur no runtime overhead.
Other types
^^^^^^^^^^^
Seq provides arbitrary-width signed and unsigned integers up to ``Int[512]`` and ``UInt[512]``, respectively (note that ``int`` is an ``Int[64]``). Typedefs for common bit widths are provided in the standard library, such as ``i8``, ``i16``, ``u32``, ``u64`` etc.
The ``Ptr[T]`` type in Seq also corresponds to a raw C pointer (e.g. ``Ptr[byte]`` is equivalent to ``char*`` in C). The ``array[T]`` type represents a fixed-length array (essentially a pointer with a length).
Seq also provides ``__ptr__`` for obtaining a pointer to a variable (as in ``__ptr__(myvar)``) and ``__array__`` for declaring stack-allocated arrays (as in ``__array__[int](10)``).
Calling BWA from Seq
--------------------
Seq provides a built-in module for interfacing with BWA. To use this module, simply build a shared BWA library and set ``BWA_LIB`` accordingly:
.. code-block:: bash
git clone https://github.com/lh3/bwa
cd bwa
make
gcc -shared -o libbwa.so *.o -lz
export BWA_LIB=`pwd`/libbwa.so
Now BWA can be used in Seq as such:
.. code-block:: seq
# Implementation of https://github.com/lh3/bwa/blob/master/example.c
from sys import argv
from bio.bwa import *
bwa = BWA(argv[1])
for read in FASTQ(argv[2]):
for reg in bwa.align(read.read):
if reg.secondary >= 0: continue
aln = bwa.reg2aln(read.read, reg)
print(read.name, '-' if aln.rev else '+', bwa.name(aln), aln.pos, aln.mapq, aln.cigar, aln.NM)
This program can be invoked as ``seqc run example.seq /path/to/hg19.fa /path/to/reads.fq``.
BWA options can be passed via ``BWA(options(...), ...)``. For example, to set a mismatch score of 5, use ``BWA(options(mismatch_score=5), "hg19.fa")``. Valid options are:
- ``match_score``
- ``mismatch_score``
- ``open_del``
- ``open_ins``
- ``extend_del``
- ``extend_ins``
- ``bandwidth``
- ``zdrop``
- ``clip_penalty``
- ``unpaired_penalty``
Consult the BWA documentation for a detailed description of each of these.

View File

@ -1,770 +0,0 @@
Workshop
========
In this workshop, we will build a toy read mapper called *SeqMap* to
showcase some of Seq's domain-specific features and optimizations.
Code Listings
-------------
- **Section 1**: :ref:`section1-code` :download:`click to download <../../workshop/section1.seq>`
- **Section 2**: :ref:`section2-code` :download:`click to download <../../workshop/section2.seq>`
- **Section 3**: :ref:`section3-code` :download:`click to download <../../workshop/section3.seq>`
- **Section 4**: :ref:`section4-code` :download:`click to download <../../workshop/section4.seq>`
- **Section 5**: :ref:`section5-code` :download:`click to download <../../workshop/section5.seq>`
- **Section 6**: :ref:`section6-code` :download:`click to download <../../workshop/section6.seq>`
Getting Started
---------------
If you don't have Seq installed already, you can install it with one bash command:
.. code:: bash
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
Be sure to restart the shell for ``seqc`` to be added to your ``PATH`` (or add it manually).
Now, let's create a new directory to house our Seq code and test data:
.. code:: bash
mkdir seqmap
cd seqmap
We'll test *SeqMap* on the following data:
- Chromosome 22 as the reference sequence (``chr22.fa``, 52MB)
- A simulated set of 1 million 101bp reads (``reads.fastq``, 251MB)
Let's download this data into a new folder called ``data``:
.. code:: bash
curl -L https://www.dropbox.com/s/8lsngqvn0tzjn1o/data.tar.gz | tar zxf - -C .
What does this data actually look like? Let's take a look:
.. code:: bash
head data/reads.fq
.. code:: text
@chr22_16993648_16994131_1:0:0_2:0:0_0/1
CTACCAAACACCTACTTCGTTTCCTAACATCACTTTAATTTTATCTTAGAGGAATTCTTTTCCCTATCCCATTAAGTTATGGGAGATGGGGCCAGGCATGG
+
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
@chr22_28253010_28253558_1:0:0_0:0:0_1/1
AGTGTTTTGCCTGTGGCTAGACTAAAAATAAGGAATGAGGGGGGTATCTTCCACTCTTGCCCTCTCATCACCCTATTCCCTATATCCAGAACTCAGAGTCC
+
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
@chr22_21656192_21656802_0:0:0_2:0:0_2/1
ATAGCGTGGATTCCTATGACATCAAGGAGCTATTTTATTTGGTAAAACGAAAAAGCACAATAATGAACGAACGCAAGCACTGAAACAGTGGAGACACCTAG
Each `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ record consists of four lines:
- Read name, starting with ``@``. Notice that since this data is simulated, the read name includes the
location on the genome where the read comes from; this will be useful later!
- Read sequence. This is a DNA sequence consisting of ``ACGT`` bases and possibly ``N``, indicating an
ambiguous base.
- Separator (``+``)
- Read quality scores. This is a string of characters as long as the read sequence, where each character
indicates the "quality" of the corresponding base in the read sequence. We won't be using it here.
What about the reference sequence `FASTA <https://en.wikipedia.org/wiki/FASTA_format>`_ file?
.. code:: bash
head data/chr22.fa
.. code:: text
>chr22
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
FASTQ records start with the sequence name (prefixed with ``>``) followed by the sequence itself, split
across multiple lines. But why are all the bases ``N``? Chromosomal sequences often have their starts
and ends masked with ``N``'s to cover repetitive `telomeric <https://en.wikipedia.org/wiki/Telomere>`_ sequences.
Since we usually don't want to include such regions in our analyses, they are masked in the file. Let's
look a bit further down:
.. code:: bash
head -n 1000000 data/chr22.fa | tail -n 10
.. code:: text
tattaaaggaaaaaactgtatgaaatagtacatttctcataattctcatt
ttgtaaaaataaagtacttatctatggacataatgagaaaatgactcaag
gtaccaagagtttagccattagctataccagtggattataagcaaattct
gttACGTGCATGCACTCACCTACGCATGTTCATGTATTCATACATACGTA
CATAATTTTTTAAATTTTCTTTTATAGACAAGCAATAGCTTTATAATCTC
TATAATCAGTAAAAATAAGTAAGTggctggacgcagtggctcacacctgt
aatctcagcactttgggaggctgaggagggcagattatgaggtcagaaga
tcaagaccatcctggctaacacagtgaaaccccatctctactaaaaatac
aaaaaattagccacgcgtggtggcacgcgcctgtagtcccagctactggg
gaggctgaggcaggaaaatcgcttgaacccgggaggcagaggttgcggtg
Now we can see the usual ``ACGT`` bases. The fact that some bases are lowercase indicates that they
are a part of some repetitive element or region. Seq will handle these different uppercase and lowercase
characters automatically, so we don't need to worry about them.
You might notice an additional file called ``chr22.fa.fai``: this is a FASTA index file that includes
information about each sequence contained in the file for easier parsing. We won't use it directly,
but Seq uses it internally to make FASTA parsing more efficient.
Section 1: Reading sequences from disk
--------------------------------------
The first step of processing any kind of sequencing data is to read it from disk.
Seq has builtin support for many of the standard file formats such as FASTA, FASTQ,
SAM, BAM, etc.
Let's write a program to read our FASTQ file and print each record's name and sequence
on a single line:
.. code:: seq
from sys import argv
from bio import *
for record in FASTQ(argv[1]):
print(record.name, record.seq)
Now we can run this Seq program:
.. code:: bash
seqc run section1.seq data/reads.fq > out.txt
and view the results:
.. code:: bash
head out.txt
.. code:: text
chr22_16993648_16994131_1:0:0_2:0:0_0/1 CTACCAAACACCTACTTCGTTTCCTAACATCACTTTAATTTTATCTTAGAGGAATTCTTTTCCCTATCCCATTAAGTTATGGGAGATGGGGCCAGGCATGG
chr22_28253010_28253558_1:0:0_0:0:0_1/1 AGTGTTTTGCCTGTGGCTAGACTAAAAATAAGGAATGAGGGGGGTATCTTCCACTCTTGCCCTCTCATCACCCTATTCCCTATATCCAGAACTCAGAGTCC
chr22_21656192_21656802_0:0:0_2:0:0_2/1 ATAGCGTGGATTCCTATGACATCAAGGAGCTATTTTATTTGGTAAAACGAAAAAGCACAATAATGAACGAACGCAAGCACTGAAACAGTGGAGACACCTAG
chr22_44541236_44541725_0:1:0_0:0:0_3/1 CTCTCTGTCTCTCTCTCTCCCCTAGGTCAGGGTGGTCCCTGGGGAGGCCCCTGGGTTACCCCAAGACAGGTGGGAGGTGCTTCCTACCCGACCCTCTTCCT
chr22_39607671_39608139_0:0:0_2:0:0_4/1 ATTGGCTCAGAGTTCAGCAGGCTGTACCAGCATGGCGCCAGTGTCTGCTCCTGGTGAGGCCTTACGGACGTTACAATAACGGCGGAAGGCAAAGGCGGAGC
chr22_35577703_35578255_3:0:0_1:0:0_5/1 TGCCATGGTGGTTAGCTGCACCCATCAACCTGTCATCTACATTAGGTATTTTTCCTAATGCTATCCCTCCCCTAGCACCCTACCCTCTGATAGGCCCTGGT
chr22_46059124_46059578_1:0:0_1:0:0_6/1 AATCAGTACCAAACAATATATGGATATTATTGGCACTTTGTGCTCCCTCTGCCTGAACTGGGAATTCCTCTATTAGTTTTGACATTATCTGGTATTGAACC
chr22_31651867_31652385_2:0:0_2:0:0_7/1 ATCTAGTGACAGTAAGTGGCTGATAAAGTGAGCTGCCATTACATAGTCATCATCTTTAATAGAAGTTAACACATACTGAGTTTCTACTATATTGGGTCTTT
chr22_24816466_24817026_1:0:0_1:0:0_8/1 CACCTCTAGGGCTCAAGGGGCAGTTCCTCCATTCCTCAGCAGTGGCGCCTGTGGAACTGTGTCCTGAGGCCAGGGGGTGGTCAGGCAGGGCCTGGAGTGGC
chr22_27496272_27496752_1:0:0_1:0:0_9/1 CTTAGCCCCATTAAACACTGGCAGGGCTGAATTGTCTGCTGCCATCCATCACACCTTCTCCCCTAGCCTGGTTTCTTACCTACCTGGAAGCCGTCCCTTTT
Pretty straightforward! FASTA files can be read in a very similar way.
.. _section1-code:
Full code listing
~~~~~~~~~~~~~~~~~
:download:`click to download <../../workshop/section1.seq>`
.. code:: seq
# SeqMap
# Seq workshop -- Section 1
# Reads and prints a FASTQ file.
# Usage: seqc run section1.seq <FASTQ path>
from sys import argv
from bio import *
for record in FASTQ(argv[1]):
print(record.name, record.seq)
Section 2: Building an index
----------------------------
Our goal is to find a "mapping" on the genome for each read. Comparing to every
position on the reference sequence would take far too long. An alternative is
to create an index of the k-mers from the reference sequence and use it to guide
the mapping process.
Let's build a dictionary that maps each k-mer to its position ("locus") on the
reference sequence:
.. code:: seq
from sys import argv
from bio import *
index = {}
for record in FASTA(argv[1]):
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
index[kmer] = pos
Of course, there will be k-mers that appear multiple times, but let's ignore this
detail for now and just store the latest position we see for each k-mer.
Another important issue is *reverse complementation*: some of our reads will map
in the reverse direction rather than in the forward direction. For this reason,
let's build our index in such a way that a k-mer is considered "equal" to its
reverse complement. One easy way to do this is by using "canonical" k-mers, i.e.
the minimum of a k-mer and its reverse complement:
.. code:: seq
from sys import argv
from bio import *
index = {}
for record in FASTA(argv[1]):
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
index[min(kmer, ~kmer)] = pos # <--
(We'll have to use canonical k-mers when querying the index too, of course.)
Now we have our index as a dictionary (``index``), but we don't want to build it
each time we perform read mapping, since it only depends on the (fixed) reference
sequence. So, as a last step, let's dump the index to a file using the ``pickle``
module:
.. code:: seq
import pickle
import gzip
with gzip.open(argv[1] + '.index', 'wb') as jar:
pickle.dump(index, jar)
Run the program:
.. code:: bash
seqc run section2.seq data/chr22.fa
Now we should see a new file ``data/chr22.fa.index`` which stores our
serialized index.
The nice thing is we should only have to build our index once!
.. _section2-code:
Full code listing
~~~~~~~~~~~~~~~~~
:download:`click to download <../../workshop/section2.seq>`
.. code:: seq
# SeqMap
# Seq workshop -- Section 2
# Reads and constructs a hash table index from an input
# FASTA file.
# Usage: seqc run section2.seq <FASTA path>
from sys import argv
from bio import *
import pickle
import gzip
index = {}
for record in FASTA(argv[1]):
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
index[min(kmer, ~kmer)] = pos
with gzip.open(argv[1] + '.index', 'wb') as jar:
pickle.dump(index, jar)
Section 3: Finding k-mer matches
--------------------------------
At this point, we have an index we can load from disk. Let's use it
to find candidate mappings for our reads.
We'll split each read into k-mers and report a mapping if at least two
k-mers support a particular locus.
The first step is to load the index:
.. code:: seq
from sys import argv
from bio import *
import pickle
import gzip
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
Now we can iterate over our reads and query k-mers in the index. We need
a way to keep track of candidate mapping positions as we process the
k-mers of a read: we can do this using a new dictionary, ``candidates``,
which maps candidate alignment positions to the number of k-mers supporting
the given position.
Then, we just iterate over ``candidates`` and output positions supported by
2 or more k-mers. Finally, we clear ``candidates`` before processing the next
read:
.. code:: seq
candidates = {} # position -> count mapping
for record in FASTQ(argv[2]):
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
print(record.name, pos + 1)
candidates.clear()
Run the program:
.. code:: bash
seqc run section3.seq data/chr22.fa data/reads.fq > out.txt
Let's take a look at the output:
.. code:: bash
head out.txt
.. code:: text
chr22_16993648_16994131_1:0:0_2:0:0_0/1 16993648
chr22_28253010_28253558_1:0:0_0:0:0_1/1 28253010
chr22_44541236_44541725_0:1:0_0:0:0_3/1 44541236
chr22_31651867_31652385_2:0:0_2:0:0_7/1 31651867
chr22_21584577_21585142_1:0:0_1:0:0_a/1 21584577
chr22_46629499_46629977_0:0:0_2:0:0_b/1 47088563
chr22_46629499_46629977_0:0:0_2:0:0_b/1 51103174
chr22_46629499_46629977_0:0:0_2:0:0_b/1 46795988
chr22_16269615_16270134_0:0:0_1:0:0_c/1 50577316
chr22_16269615_16270134_0:0:0_1:0:0_c/1 16269615
Notice that most positions we reported match the position from the read
name (the first integer after the ``_``); not bad!
.. _section3-code:
Full code listing
~~~~~~~~~~~~~~~~~
:download:`click to download <../../workshop/section3.seq>`
.. code:: seq
# SeqMap
# Seq workshop -- Section 3
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings.
# Usage: seqc run section3.seq <FASTA path> <FASTQ path>
from sys import argv
from bio import *
import pickle
import gzip
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
candidates = {} # position -> count mapping
for record in FASTQ(argv[2]):
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
print(record.name, pos + 1)
candidates.clear()
Section 4: Smith-Waterman alignment and CIGAR strings
-----------------------------------------------------
We now have the ability to report mapping *positions* for each read,
but usually we want *alignments*, which include information about
mismatches, insertions and deletions.
Luckily, Seq makes sequence alignment easy: to align sequence ``q``
against sequence ``t``, you can just do:
.. code:: seq
aln = q @ t
``aln`` is a tuple of alignment score and CIGAR string (a *CIGAR string* is
a way of encoding an alignment result, and consists of operations such as ``M``
for match/mismatch, ``I`` for insertion and ``D`` for deletion, accompanied
by the number of associated bases; for example, ``3M2I4M`` indicates 3 (mis)matches
followed by a length-2 insertion followed by 4 (mis)matches).
By default, `Levenshtein distance <https://en.wikipedia.org/wiki/Levenshtein_distance>`_ is
used, meaning mismatch and gap costs are both 1, while match costs are zero. More
control over alignment parameters can be achieved using the ``align`` method:
.. code:: seq
aln = q.align(t, a=2, b=4, ambig=0, gapo=4, gape=2)
where ``a`` is the match score, ``b`` is the mismatch cost, ``ambig`` is the
ambiguous base (``N``) match score, ``gapo`` is the gap open cost and ``gape``
the gap extension cost (i.e. a gap of length ``k`` costs ``gapo + (k * gape)``).
There are many more parameters as well, controlling factors like alignment bandwidth,
Z-drop, global/extension alignment and more; check the
`standard library reference <https://docs.seq-lang.org/stdlib/bio/align.html#bio.align.seq.align>`_
for further details.
For now, we'll use a simple ``query.align(target)``:
.. code:: seq
candidates = {} # position -> count mapping
for record in FASTQ(argv[2]):
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
# get query, target and align:
query = record.read
target = reference[pos:pos + len(query)]
alignment = query.align(target)
print(record.name, pos + 1, alignment.score, alignment.cigar)
candidates.clear()
Run the program:
.. code:: bash
seqc run section4.seq data/chr22.fa data/reads.fq > out.txt
And let's take a look at the output once again:
.. code:: bash
head out.txt
.. code:: text
chr22_16993648_16994131_1:0:0_2:0:0_0/1 16993648 196 101M
chr22_28253010_28253558_1:0:0_0:0:0_1/1 28253010 196 101M
chr22_44541236_44541725_0:1:0_0:0:0_3/1 44541236 196 101M
chr22_31651867_31652385_2:0:0_2:0:0_7/1 31651867 190 101M
chr22_21584577_21585142_1:0:0_1:0:0_a/1 21584577 196 101M
chr22_46629499_46629977_0:0:0_2:0:0_b/1 47088563 110 20M1I4M1D76M
chr22_46629499_46629977_0:0:0_2:0:0_b/1 51103174 134 20M1I4M1D76M
chr22_46629499_46629977_0:0:0_2:0:0_b/1 46795988 128 20M1I4M1D76M
chr22_16269615_16270134_0:0:0_1:0:0_c/1 50577316 118 101M
chr22_16269615_16270134_0:0:0_1:0:0_c/1 16269615 202 101M
Most of the alignments contain only matches or mismatches (``M``), which
is to be expected as insertions and deletions are far less common. In fact,
the three mappings containing indels appear to be incorrect!
A more thorough mapping scheme would also look at alignment scores before
reporting mappings, although for the purposes of this workshop we'll ignore
such improvements.
.. _section4-code:
Full code listing
~~~~~~~~~~~~~~~~~
:download:`click to download <../../workshop/section4.seq>`
.. code:: seq
# SeqMap
# Seq workshop -- Section 4
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Usage: seqc run section4.seq <FASTA path> <FASTQ path>
from sys import argv
from bio import *
import pickle
import gzip
reference = s''
for record in FASTA(argv[1]):
reference = record.seq
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
candidates = {} # position -> count mapping
for record in FASTQ(argv[2]):
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
query = record.read
target = reference[pos:pos + len(query)]
alignment = query.align(target)
print(record.name, pos + 1, alignment.score, alignment.cigar)
candidates.clear()
Section 5: Pipelines
--------------------
Pipelines are a very convenient Seq construct for expressing a variety
of algorithms and applications. In fact, *SeqMap* can be thought of as
a pipeline with the following stages:
- read a record from the FASTQ file,
- find candidate alignments by querying the index,
- perform alignment for mappings supported by 2+ k-mers and output results.
We can write this as a pipeline in Seq as follows:
.. code:: seq
def find_candidates(record):
candidates = {} # position -> count mapping
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
yield record, pos
def align_and_output(t):
record, pos = t
query = record.read
target = reference[pos:pos + len(query)]
alignment = query.align(target)
print(record.name, pos + 1, alignment.score, alignment.cigar)
Notice that ``find_candidates`` *yields* candidate alignments to ``align_and_output``,
which then performs alignment and prints the results. In Seq, all values generated
from one stage of a pipeline are passed to the next. The Seq compiler performs many
domain-specific optimizations on pipelines, one of which we focus on in the next section.
(Optional) Parallelism
~~~~~~~~~~~~~~~~~~~~~~
Parallelism can be achieved using the parallel pipe operator, ``||>``, which
tells the compiler that all subsequent stages can be executed in parallel:
.. code:: seq
FASTQ(argv[2]) |> iter ||> find_candidates |> align_and_output
Since the full program also involves loading the index, let's time the main
pipeline using the ``timing`` module:
.. code:: seq
import timing
with timing('mapping'):
FASTQ(argv[2]) |> iter ||> find_candidates |> align_and_output
We can try this for different numbers of threads:
.. code:: bash
export OMP_NUM_THREADS=1
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 48.2858s
export OMP_NUM_THREADS=2
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 35.886s
Often, batching reads into larger blocks and processing those blocks in parallel can
yield better performance, especially if each read is quick to process. This is also
very easy to do in Seq:
.. code:: seq
def process_block(block):
block |> iter |> find_candidates |> align_and_output
with timing('mapping'):
FASTQ(argv[2]) |> blocks(size=2000) ||> process_block
And now:
.. code:: bash
export OMP_NUM_THREADS=1
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 48.2858s
export OMP_NUM_THREADS=2
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 25.2648s
.. _section5-code:
Full code listing
~~~~~~~~~~~~~~~~~
:download:`click to download <../../workshop/section5.seq>`
.. code:: seq
# SeqMap
# Seq workshop -- Section 5
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Implemented with Seq pipelines.
# Usage: seqc run section5.seq <FASTA path> <FASTQ path>
from sys import argv
from time import timing
from bio import *
import pickle
import gzip
reference = s''
for record in FASTA(argv[1]):
reference = record.seq
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
def find_candidates(record):
candidates = {} # position -> count mapping
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
yield record, pos
def align_and_output(t):
record, pos = t
query = record.read
target = reference[pos:pos + len(query)]
alignment = query.align(target)
print(record.name, pos + 1, alignment.score, alignment.cigar)
with timing('mapping'):
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output
Section 6: Domain-specific optimizations
----------------------------------------
Seq already performs numerous domain-specific optimizations under the hood.
However, we can give the compiler a hint in this case to perform one more:
*inter-sequence alignment*. This optimization entails batching sequences
prior to alignment, then aligning multiple pairs using a very fast SIMD
optimized alignment kernel.
In Seq, we just need one additional function annotation to tell the compiler
to perform this optimization:
.. code:: seq
@inter_align
def align_and_output(t):
...
Let's run the program with and without this optimization:
.. code:: seq
# without @inter_align
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 43.4457s
# with @inter_align
seqc run section6.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 32.3241s
(The timings with inter-sequence alignment will depend on the SIMD instruction
sets your CPU supports; these numbers are from using AVX2.)
.. _section6-code:
Full code listing
~~~~~~~~~~~~~~~~~
:download:`click to download <../../workshop/section6.seq>`
.. code:: seq
# SeqMap
# Seq workshop -- Section 6
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Implemented with Seq pipelines using inter-seq. alignment.
# Usage: seqc run section6.seq <FASTA path> <FASTQ path>
from sys import argv
from time import timing
from bio import *
import pickle
import gzip
reference = s''
for record in FASTA(argv[1]):
reference = record.seq
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
def find_candidates(record):
candidates = {} # position -> count mapping
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
yield record, pos
@inter_align
def align_and_output(t):
record, pos = t
query = record.read
target = reference[pos:pos + len(query)]
alignment = query.align(target)
print(record.name, pos + 1, alignment.score, alignment.cigar)
with timing('mapping'):
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output

View File

@ -1,8 +0,0 @@
# SeqMap
# Seq workshop -- Section 1
# Reads and prints a FASTQ file.
# Usage: seqc run section1.seq <FASTQ path>
from sys import argv
from bio import *
for record in FASTQ(argv[1]):
print(record.name, record.seq)

View File

@ -1,18 +0,0 @@
# SeqMap
# Seq workshop -- Section 2
# Reads and constructs a hash table index from an input
# FASTA file.
# Usage: seqc run section2.seq <FASTA path>
from sys import argv
from bio import *
import pickle
import gzip
index = {}
for record in FASTA(argv[1]):
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
index[min(kmer, ~kmer)] = pos
with gzip.open(argv[1] + '.index', 'wb') as jar:
pickle.dump(index, jar)

View File

@ -1,29 +0,0 @@
# SeqMap
# Seq workshop -- Section 3
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings.
# Usage: seqc run section3.seq <FASTA path> <FASTQ path>
from sys import argv
from bio import *
import pickle
import gzip
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
candidates = {} # position -> count mapping
for record in FASTQ(argv[2]):
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
print(record.name, pos + 1)
candidates.clear()

View File

@ -1,36 +0,0 @@
# SeqMap
# Seq workshop -- Section 4
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Usage: seqc run section4.seq <FASTA path> <FASTQ path>
from sys import argv
from bio import *
import pickle
import gzip
reference = s''
for record in FASTA(argv[1]):
reference = record.seq
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
candidates = {} # position -> count mapping
for record in FASTQ(argv[2]):
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
query = record.read
target = reference[pos:pos + len(query)]
alignment = query.align(target)
print(record.name, pos + 1, alignment.score, alignment.cigar)
candidates.clear()

View File

@ -1,42 +0,0 @@
# SeqMap
# Seq workshop -- Section 5
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Implemented with Seq pipelines.
# Usage: seqc run section5.seq <FASTA path> <FASTQ path>
from sys import argv
from time import timing
from bio import *
import pickle
import gzip
reference = s''
for record in FASTA(argv[1]):
reference = record.seq
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
def find_candidates(record):
candidates = {} # position -> count mapping
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
yield record, pos
def align_and_output(t):
record, pos = t
query = record.read
target = reference[pos:pos + len(query)]
alignment = query.align(target)
print(record.name, pos + 1, alignment.score, alignment.cigar)
with timing('mapping'):
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output

View File

@ -1,43 +0,0 @@
# SeqMap
# Seq workshop -- Section 6
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Implemented with Seq pipelines using inter-seq. alignment.
# Usage: seqc run section6.seq <FASTA path> <FASTQ path>
from sys import argv
from time import timing
from bio import *
import pickle
import gzip
reference = s''
for record in FASTA(argv[1]):
reference = record.seq
K: Static[int] = 32
index = None
with gzip.open(argv[1] + '.index', 'rb') as jar:
index = pickle.load(jar, T=Dict[Kmer[K],int])
def find_candidates(record):
candidates = {} # position -> count mapping
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
found = index.get(min(kmer, ~kmer), -1)
if found > 0:
loc = found - pos
candidates[loc] = candidates.get(loc, 0) + 1
for pos,count in candidates.items():
if count > 1:
yield record, pos
@inter_align
def align_and_output(t):
record, pos = t
query = record.read
target = reference[pos:pos + len(query)]
alignment = query.align(target)
print(record.name, pos + 1, alignment.score, alignment.cigar)
with timing('mapping'):
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output

View File

@ -37,7 +37,7 @@ def timing(msg: str = ""):
"""
Example usage:
.. code-block:: seq
.. code-block:: python
from time import timing
with timing('foo function'):