mirror of https://github.com/exaloop/codon.git
parent
2556d75343
commit
e880f5848a
|
@ -1,10 +1,10 @@
|
|||
# Contributing to Seq
|
||||
# Contributing to Codon
|
||||
|
||||
Thank you for considering contributing to Seq! This document contains some helpful information for getting started. The best place to ask questions or get feedback is [our Gitter chatroom](https://gitter.im/seq-lang/Seq). For a high-level outline of the features we aim to add in the future, check the [roadmap](https://github.com/seq-lang/seq/wiki/Roadmap).
|
||||
Thank you for considering contributing to Codon! This document contains some helpful information for getting started. The best place to ask questions or get feedback is [our Gitter chatroom](https://gitter.im/seq-lang/Seq). For a high-level outline of the features we aim to add in the future, check the [roadmap](https://github.com/seq-lang/seq/wiki/Roadmap).
|
||||
|
||||
## Development workflow
|
||||
|
||||
All development is done on the [`develop`](https://github.com/seq-lang/seq/tree/develop) branch. Just before release, we bump the version number, merge into [`master`](https://github.com/seq-lang/seq/tree/master) and tag the build with a tag of the form `vX.Y.Z` where `X`, `Y` and `Z` are the [SemVer](https://semver.org) major, minor and patch numbers, respectively. Our Travis CI build script automatically builds and deploys tagged commits as a new GitHub release via our trusty [@SeqBot](https://github.com/seqbot). It also builds and deploys the documentation to our website.
|
||||
All development is done on the [`develop`](https://github.com/exaloop/codon/tree/develop) branch. Just before release, we bump the version number, merge into [`master`](https://github.com/exaloop/codon/tree/master) and tag the build with a tag of the form `vX.Y.Z` where `X`, `Y` and `Z` are the [SemVer](https://semver.org) major, minor and patch numbers, respectively. Our Travis CI build script automatically builds and deploys tagged commits as a new GitHub release via our trusty [@SeqBot](https://github.com/seqbot). It also builds and deploys the documentation to our website.
|
||||
|
||||
## Coding standards
|
||||
|
||||
|
@ -12,7 +12,7 @@ All C++ code should be formatted with [ClangFormat](https://clang.llvm.org/docs/
|
|||
|
||||
## Writing tests
|
||||
|
||||
Tests are written as Seq programs. The [`test/core/`](https://github.com/seq-lang/seq/tree/master/test/core) directory contains some examples. If you add a new test file, be sure to add it to [`test/main.cpp`](https://github.com/seq-lang/seq/blob/master/test/main.cpp) so that it will be executed as part of the test suite. There are two ways to write tests for Seq:
|
||||
Tests are written as Codon programs. The [`test/core/`](https://github.com/exaloop/codon/tree/master/test/core) directory contains some examples. If you add a new test file, be sure to add it to [`test/main.cpp`](https://github.com/exaloop/codon/blob/master/test/main.cpp) so that it will be executed as part of the test suite. There are two ways to write tests for Codon:
|
||||
|
||||
#### New style
|
||||
|
||||
|
@ -39,7 +39,7 @@ print(2 + 2) # EXPECT: 4
|
|||
|
||||
## Pull requests
|
||||
|
||||
Pull requests should generally be based on the `develop` branch. Before submitting a pull request, pleace make sure...
|
||||
Pull requests should generally be based on the `develop` branch. Before submitting a pull request, please make sure...
|
||||
|
||||
- ... to provide a clear description of the purpose of the pull request.
|
||||
- ... to include tests for any new or changed code.
|
||||
|
@ -49,12 +49,12 @@ Please be patient with pull request reviews, as our throughput is limited!
|
|||
|
||||
## Issues
|
||||
|
||||
We use [GitHub's issue tracker](https://github.com/seq-lang/seq/issues), so that's where you'll find the most recent list of open bugs, feature requests and general issues. If applicable, we try to tag each issue with at least one of the following tags:
|
||||
We use [GitHub's issue tracker](https://github.com/exaloop/codon/issues), so that's where you'll find the most recent list of open bugs, feature requests, and general issues. If applicable, we try to tag each issue with at least one of the following tags:
|
||||
|
||||
- `Build`: Issues related to building Seq
|
||||
- `Build`: Issues related to building Codon
|
||||
- `Codegen`: Issues related to code generation (i.e. after parsing and type checking)
|
||||
- `Parser`: Issues related to lexing/parsing
|
||||
- `Library`: Issues related to the Seq standard library
|
||||
- `Library`: Issues related to the Codon standard library
|
||||
- `Interop`: Issues related to interoperability with other languages or systems
|
||||
- `Docs`: Issues related to documentation
|
||||
- `Feature`: New language feature proposals
|
||||
|
|
82
README.md
82
README.md
|
@ -1,24 +1,24 @@
|
|||
<p align="center">
|
||||
<img src="docs/sphinx/logo.png?raw=true" width="200" alt="Seq"/>
|
||||
<img src="docs/sphinx/logo.png?raw=true" width="200" alt="Codon"/>
|
||||
</p>
|
||||
|
||||
<h1 align="center"> Seq — a language for bioinformatics</h1>
|
||||
<h1 align="center"> Codon</h1>
|
||||
|
||||
<p align="center">
|
||||
<a href="https://github.com/seq-lang/seq/actions?query=branch%3Adevelop">
|
||||
<img src="https://github.com/seq-lang/seq/workflows/Seq%20CI/badge.svg?branch=develop"
|
||||
<a href="https://github.com/exaloop.io/codon/actions?query=branch%3Adevelop">
|
||||
<img src="https://github.com/exaloop.io/codon/workflows/Codon%20CI/badge.svg?branch=develop"
|
||||
alt="Build Status">
|
||||
</a>
|
||||
<a href="https://gitter.im/seq-lang/seq?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge">
|
||||
<img src="https://badges.gitter.im/Join%20Chat.svg"
|
||||
alt="Gitter">
|
||||
</a>
|
||||
<a href="https://github.com/seq-lang/seq/releases/latest">
|
||||
<img src="https://img.shields.io/github/v/release/seq-lang/seq?sort=semver"
|
||||
<a href="https://github.com/exaloop.io/codon/releases/latest">
|
||||
<img src="https://img.shields.io/github/v/release/exaloop.io/codon?sort=semver"
|
||||
alt="Version">
|
||||
</a>
|
||||
<a href="https://github.com/seq-lang/seq/blob/master/LICENSE">
|
||||
<img src="https://img.shields.io/github/license/seq-lang/seq"
|
||||
<a href="https://github.com/exaloop.io/codon/blob/master/LICENSE">
|
||||
<img src="https://img.shields.io/github/license/exaloop.io/codon"
|
||||
alt="License">
|
||||
</a>
|
||||
</p>
|
||||
|
@ -27,17 +27,17 @@
|
|||
|
||||
> **A strongly-typed and statically-compiled high-performance Pythonic language!**
|
||||
|
||||
Seq is a programming language for computational genomics and bioinformatics. With a Python-compatible syntax and a host of domain-specific features and optimizations, Seq makes writing high-performance genomics software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.
|
||||
Codon is a programming language for computationally intensive workloads. With a Python-compatible syntax and a host of domain-specific features and optimizations, Codon makes writing high-performance software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.
|
||||
|
||||
**Think of Seq as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.**
|
||||
**Think of Codon as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.**
|
||||
|
||||
Seq is able to outperform Python code by up to 160x. Seq can further beat equivalent C/C++ code by up to 2x without any manual interventions, and also natively supports parallelism out of the box. Implementation details and benchmarks are discussed [in our paper](https://dl.acm.org/citation.cfm?id=3360551).
|
||||
Codon is able to outperform Python code by up to 160x. Codon can further beat equivalent C/C++ code by up to 2x without any manual interventions, and also natively supports parallelism out of the box. Implementation details and benchmarks are discussed [in our paper](https://dl.acm.org/citation.cfm?id=3360551).
|
||||
|
||||
Learn more by following the [tutorial](https://docs.seq-lang.org/tutorial) or from the [cookbook](https://docs.seq-lang.org/cookbook).
|
||||
Learn more by following the [tutorial](https://docs.exaloop.io/tutorial).
|
||||
|
||||
## Examples
|
||||
|
||||
Seq is a Python-compatible language, and many Python programs should work with few if any modifications:
|
||||
Codon is a Python-compatible language, and many Python programs should work with few if any modifications:
|
||||
|
||||
```python
|
||||
def fib(n):
|
||||
|
@ -49,7 +49,7 @@ def fib(n):
|
|||
fib(1000)
|
||||
```
|
||||
|
||||
This prime counting example showcases Seq's [OpenMP](https://www.openmp.org/) support, enabled with the addition of one line. The `@par` annotation tells the compiler to parallelize the following for-loop, in this case using a dynamic schedule, chunk size of 100, and 16 threads.
|
||||
This prime counting example showcases Codon's [OpenMP](https://www.openmp.org/) support, enabled with the addition of one line. The `@par` annotation tells the compiler to parallelize the following for-loop, in this case using a dynamic schedule, chunk size of 100, and 16 threads.
|
||||
|
||||
```python
|
||||
from sys import argv
|
||||
|
@ -72,30 +72,11 @@ for i in range(2, limit):
|
|||
print(total)
|
||||
```
|
||||
|
||||
Here is an example showcasing some of Seq's bioinformatics features, which include native sequence and k-mer types.
|
||||
|
||||
```python
|
||||
from bio import *
|
||||
s = s'ACGTACGT' # sequence literal
|
||||
print(s[2:5]) # subsequence
|
||||
print(~s) # reverse complement
|
||||
kmer = Kmer[8](s) # convert to k-mer
|
||||
|
||||
# iterate over length-3 subsequences
|
||||
# with step 2
|
||||
for sub in s.split(3, step=2):
|
||||
print(sub[-1]) # last base
|
||||
|
||||
# iterate over 2-mers with step 1
|
||||
for kmer in sub.kmers(step=1, k=2):
|
||||
print(~kmer) # '~' also works on k-mers
|
||||
```
|
||||
|
||||
## Install
|
||||
|
||||
### Pre-built binaries
|
||||
|
||||
Pre-built binaries for Linux and macOS on x86_64 are available alongside [each release](https://github.com/seq-lang/seq/releases). We also have a script for downloading and installing pre-built versions:
|
||||
Pre-built binaries for Linux and macOS on x86_64 are available alongside [each release](https://github.com/exaloop/codon/releases). We also have a script for downloading and installing pre-built versions:
|
||||
|
||||
```bash
|
||||
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
|
||||
|
@ -107,35 +88,4 @@ See [Building from Source](docs/sphinx/build.rst).
|
|||
|
||||
## Documentation
|
||||
|
||||
Please check [docs.seq-lang.org](https://docs.seq-lang.org) for in-depth documentation.
|
||||
|
||||
## Citing Seq
|
||||
|
||||
If you use Seq in your research, please cite:
|
||||
|
||||
> Ariya Shajii, Ibrahim Numanagić, Riyadh Baghdadi, Bonnie Berger, and Saman Amarasinghe. 2019. Seq: a high-performance language for bioinformatics. *Proc. ACM Program. Lang.* 3, OOPSLA, Article 125 (October 2019), 29 pages. DOI: https://doi.org/10.1145/3360551
|
||||
|
||||
BibTeX:
|
||||
|
||||
```
|
||||
@article{Shajii:2019:SHL:3366395.3360551,
|
||||
author = {Shajii, Ariya and Numanagi\'{c}, Ibrahim and Baghdadi, Riyadh and Berger, Bonnie and Amarasinghe, Saman},
|
||||
title = {Seq: A High-performance Language for Bioinformatics},
|
||||
journal = {Proc. ACM Program. Lang.},
|
||||
issue_date = {October 2019},
|
||||
volume = {3},
|
||||
number = {OOPSLA},
|
||||
month = oct,
|
||||
year = {2019},
|
||||
issn = {2475-1421},
|
||||
pages = {125:1--125:29},
|
||||
articleno = {125},
|
||||
numpages = {29},
|
||||
url = {http://doi.acm.org/10.1145/3360551},
|
||||
doi = {10.1145/3360551},
|
||||
acmid = {3360551},
|
||||
publisher = {ACM},
|
||||
address = {New York, NY, USA},
|
||||
keywords = {Python, bioinformatics, computational biology, domain-specific language, optimization, programming language},
|
||||
}
|
||||
```
|
||||
Please check [docs.exaloop.io](https://docs.exaloop.io) for in-depth documentation.
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
PROJECT_NAME = "seq"
|
||||
PROJECT_NAME = "codon"
|
||||
XML_OUTPUT = xml
|
||||
INPUT = ../../compiler/include ../../compiler/lang ../../compiler/types ../../compiler/util
|
||||
STRIP_FROM_PATH = ../..
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
# You can set these variables from the command line.
|
||||
SPHINXOPTS =
|
||||
SPHINXBUILD = sphinx-build
|
||||
SPHINXPROJ = seq
|
||||
SPHINXPROJ = codon
|
||||
SOURCEDIR = .
|
||||
BUILDDIR = _build
|
||||
|
||||
|
|
|
@ -0,0 +1,157 @@
|
|||
from collections import defaultdict
|
||||
|
||||
import re
|
||||
|
||||
from typing import Any,Dict,Iterable,Iterator,List,NamedTuple,Tuple
|
||||
from typing import cast
|
||||
|
||||
from docutils.parsers.rst import directives
|
||||
from sphinx import addnodes
|
||||
from sphinx.directives import ObjectDescription
|
||||
from sphinx.domains import Domain
|
||||
from sphinx.domains import Index
|
||||
from sphinx.roles import XRefRole
|
||||
from sphinx.util.nodes import make_refnode
|
||||
from sphinx.locale import _,__
|
||||
from docutils import nodes
|
||||
from docutils.nodes import Element,Node
|
||||
from docutils.parsers.rst import directives
|
||||
|
||||
from sphinx import addnodes
|
||||
from sphinx.addnodes import pending_xref,desc_signature
|
||||
from sphinx.application import Sphinx
|
||||
from sphinx.builders import Builder
|
||||
from sphinx.directives import ObjectDescription
|
||||
from sphinx.domains import Domain,ObjType,Index,IndexEntry
|
||||
from sphinx.environment import BuildEnvironment
|
||||
from sphinx.locale import _,__
|
||||
from sphinx.pycode.ast import ast,parse as ast_parse
|
||||
from sphinx.roles import XRefRole
|
||||
from sphinx.util import logging
|
||||
from sphinx.util.docfields import Field,GroupedField,TypedField
|
||||
from sphinx.util.docutils import SphinxDirective
|
||||
from sphinx.util.inspect import signature_from_str
|
||||
from sphinx.util.nodes import make_id,make_refnode
|
||||
from sphinx.util.typing import TextlikeNode
|
||||
|
||||
import sphinx.domains.python
|
||||
|
||||
codon_sig_re=re.compile(r'''^([\w.]*\.)?(\w+)\s*(?:\[\s*(.*)\s*\])?\s*(?:\(\s*(.*)\s*\)(?:\s*->\s*(.*))?)?$''',re.VERBOSE)
|
||||
|
||||
|
||||
def handle_signature(self,sig: str,signode: desc_signature) -> Tuple[str,str]:
|
||||
m=codon_sig_re.match(sig)
|
||||
if m is None:
|
||||
raise ValueError
|
||||
prefix,name,generics,arglist,retann=m.groups()
|
||||
|
||||
# determine module and class name (if applicable), as well as full name
|
||||
modname=self.options.get('module',self.env.ref_context.get('py:module'))
|
||||
classname=self.env.ref_context.get('py:class')
|
||||
isextension=self.objtype=='extension'
|
||||
if classname:
|
||||
add_module=False
|
||||
if prefix and (prefix==classname or prefix.startswith(classname+".")):
|
||||
fullname=prefix+name
|
||||
# class name is given again in the signature
|
||||
prefix=prefix[len(classname):].lstrip('.')
|
||||
elif prefix:
|
||||
# class name is given in the signature, but different
|
||||
# (shouldn't happen)
|
||||
fullname=classname+'.'+prefix+name
|
||||
else:
|
||||
# class name is not given in the signature
|
||||
fullname=classname+'.'+name
|
||||
else:
|
||||
add_module=True
|
||||
if prefix:
|
||||
classname=prefix.rstrip('.')
|
||||
fullname=prefix+name
|
||||
else:
|
||||
classname=''
|
||||
fullname=name
|
||||
|
||||
signode['module']=modname
|
||||
if modname.startswith('..'): # HACK: no idea why this happens
|
||||
modname=modname[2:]
|
||||
signode['class']=classname
|
||||
signode['fullname']=fullname
|
||||
|
||||
sig_prefix=self.get_signature_prefix(sig)
|
||||
if sig_prefix:
|
||||
signode+=addnodes.desc_annotation(sig_prefix,sig_prefix)
|
||||
|
||||
if not isextension:
|
||||
if prefix:
|
||||
signode+=addnodes.desc_addname(prefix,prefix)
|
||||
elif add_module and self.env.config.add_module_names:
|
||||
if modname and modname!='exceptions':
|
||||
# exceptions are a special case, since they are documented in the
|
||||
# 'exceptions' module.
|
||||
nodetext=modname+'.'
|
||||
signode+=addnodes.desc_addname(nodetext,nodetext)
|
||||
signode+=addnodes.desc_name(name,name)
|
||||
else:
|
||||
ref=type_to_xref(str(name),self.env)
|
||||
signode+=addnodes.desc_name(name,name)
|
||||
|
||||
if generics:
|
||||
signode+=addnodes.desc_name("generics","["+generics+"]")
|
||||
if arglist:
|
||||
try:
|
||||
signode+=sphinx.domains.python._parse_arglist(arglist,self.env)
|
||||
except SyntaxError:
|
||||
# fallback to parse arglist original parser.
|
||||
# it supports to represent optional arguments (ex. "func(foo [, bar])")
|
||||
sphinx.domains.python._pseudo_parse_arglist(signode,arglist)
|
||||
except NotImplementedError as exc:
|
||||
logger.warning("could not parse arglist (%r): %s",arglist,exc,location=signode)
|
||||
sphinx.domains.python._pseudo_parse_arglist(signode,arglist)
|
||||
else:
|
||||
if self.needs_arglist():
|
||||
# for callables, add an empty parameter list
|
||||
signode+=addnodes.desc_parameterlist()
|
||||
|
||||
if retann:
|
||||
children=sphinx.domains.python._parse_annotation(retann,self.env)
|
||||
signode+=addnodes.desc_returns(retann,'',*children)
|
||||
|
||||
anno=self.options.get('annotation')
|
||||
if anno:
|
||||
signode+=addnodes.desc_annotation(' '+anno,' '+anno)
|
||||
|
||||
return fullname,prefix
|
||||
|
||||
|
||||
sphinx.domains.python.PyObject.handle_signature=handle_signature
|
||||
|
||||
from sphinx.domains.python import *
|
||||
|
||||
logger=logging.getLogger(__name__)
|
||||
|
||||
|
||||
class CodonDomain(PythonDomain):
|
||||
name='codon'
|
||||
label='Codon'
|
||||
object_types={'function':ObjType(_('function'),'func','obj'),# 'data': ObjType(_('data'), 'data', 'obj'),
|
||||
'class':ObjType(_('class'),'class','exc','obj'),'type':ObjType(_('class'),'exc','class','obj'),'extension':ObjType(_('class'),'exc','class','obj'),
|
||||
'exception':ObjType(_('exception'),'exc','class','obj'),'method':ObjType(_('method'),'meth','obj'),# 'classmethod': ObjType(_('class method'), 'meth', 'obj'),
|
||||
# 'staticmethod': ObjType(_('static method'), 'meth', 'obj'),
|
||||
# 'attribute': ObjType(_('attribute'), 'attr', 'obj'),
|
||||
'module':ObjType(_('module'),'mod','obj'),} # type: Dict[str, ObjType]
|
||||
|
||||
directives={'function':PyFunction,'data':PyVariable,'class':PyClasslike,'exception':PyClasslike,'type':PyClasslike,'extension':PyClasslike,'method':PyMethod,'classmethod':PyClassMethod,
|
||||
'staticmethod':PyStaticMethod,'attribute':PyAttribute,'module':PyModule,'currentmodule':PyCurrentModule,'decorator':PyDecoratorFunction,'decoratormethod':PyDecoratorMethod,}
|
||||
roles={'data':PyXRefRole(),'exc':PyXRefRole(),'func':PyXRefRole(fix_parens=True),'class':PyXRefRole(),# 'const': PyXRefRole(),
|
||||
'attr':PyXRefRole(),'meth':PyXRefRole(fix_parens=True),'mod':PyXRefRole(),'obj':PyXRefRole(),}
|
||||
initial_data={'objects':{}, # fullname -> docname, objtype
|
||||
'modules':{}, # modname -> docname, synopsis, platform, deprecated
|
||||
} # type: Dict[str, Dict[str, Tuple[Any]]]
|
||||
indices=[PythonModuleIndex,]
|
||||
|
||||
|
||||
def setup(app):
|
||||
app.add_domain(CodonDomain)
|
||||
# app.connect('object-description-simplify', filter_meta_fields)
|
||||
|
||||
return {'version':'0.1','parallel_read_safe':True,'parallel_write_safe':True,}
|
|
@ -1,13 +1,13 @@
|
|||
Building from Source
|
||||
====================
|
||||
|
||||
Unless you really need to build Seq for whatever reason, we strongly
|
||||
Unless you really need to build Codon for whatever reason, we strongly
|
||||
recommend using pre-built binaries if possible.
|
||||
|
||||
Dependencies
|
||||
------------
|
||||
|
||||
Seq depends on LLVM 12, which can be installed via most package managers. To
|
||||
Codon depends on LLVM 12, which can be installed via most package managers. To
|
||||
build LLVM 12 yourself, you can do the following:
|
||||
|
||||
.. code-block:: bash
|
||||
|
@ -28,7 +28,7 @@ build LLVM 12 yourself, you can do the following:
|
|||
Build
|
||||
-----
|
||||
|
||||
The following can generally be used to build Seq. The build process will automatically
|
||||
The following can generally be used to build Codon. The build process will automatically
|
||||
download and build several smaller dependencies.
|
||||
|
||||
.. code-block:: bash
|
||||
|
@ -40,5 +40,5 @@ download and build several smaller dependencies.
|
|||
-DCMAKE_CXX_COMPILER=clang++)
|
||||
cmake --build build --config Release
|
||||
|
||||
This should produce the ``seqc`` executable in the ``build`` directory, as well as
|
||||
``seqtest`` which runs the test suite.
|
||||
This should produce the ``codon`` executable in the ``build`` directory, as well as
|
||||
``codon_test`` which runs the test suite.
|
||||
|
|
|
@ -16,17 +16,13 @@ import os
|
|||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.abspath("./_ext"))
|
||||
def setup(sphinx):
|
||||
sys.path.insert(0, os.path.abspath('.'))
|
||||
from seqlex import SeqLexer
|
||||
sphinx.add_lexer("seq", SeqLexer)
|
||||
|
||||
|
||||
# -- Project information -----------------------------------------------------
|
||||
|
||||
project = u'Seq'
|
||||
copyright = u'2019-2021, seq-lang'
|
||||
author = u'seq-lang'
|
||||
project = u'Codon'
|
||||
copyright = u'2019-2021, codon'
|
||||
author = u'codon'
|
||||
|
||||
# The short X.Y version
|
||||
version = u'0.11'
|
||||
|
@ -50,9 +46,9 @@ extensions = [
|
|||
'sphinx.ext.mathjax',
|
||||
'sphinx.ext.githubpages',
|
||||
'sphinx.ext.intersphinx',
|
||||
'seq'
|
||||
# 'breathe',
|
||||
# 'exhale'
|
||||
# 'exhale',
|
||||
'codon',
|
||||
]
|
||||
|
||||
# Add any paths that contain templates here, relative to this directory.
|
||||
|
@ -95,7 +91,7 @@ html_theme = 'sphinx_book_theme'
|
|||
# documentation.
|
||||
#
|
||||
html_theme_options = {
|
||||
"repository_url": "https://github.com/seq-lang/seq",
|
||||
"repository_url": "https://github.com/exaloop/codon",
|
||||
"repository_branch": "develop",
|
||||
"path_to_docs": "docs/sphinx/",
|
||||
"use_repository_button": True,
|
||||
|
@ -120,7 +116,7 @@ html_sidebars = { '**': ['globaltoc.html', 'relations.html', 'searchbox.html'],
|
|||
# -- Options for HTMLHelp output ---------------------------------------------
|
||||
|
||||
# Output file base name for HTML help builder.
|
||||
htmlhelp_basename = 'seqdoc'
|
||||
htmlhelp_basename = 'codondoc'
|
||||
|
||||
|
||||
# -- Options for LaTeX output ------------------------------------------------
|
||||
|
@ -147,7 +143,7 @@ latex_elements = {
|
|||
# (source start file, target name, title,
|
||||
# author, documentclass [howto, manual, or own class]).
|
||||
latex_documents = [
|
||||
(master_doc, 'seq.tex', u'seq Documentation',
|
||||
(master_doc, 'codon.tex', u'Codon Documentation',
|
||||
u'arshajii', 'manual'),
|
||||
]
|
||||
|
||||
|
@ -157,7 +153,7 @@ latex_documents = [
|
|||
# One entry per manual page. List of tuples
|
||||
# (source start file, name, description, authors, manual section).
|
||||
man_pages = [
|
||||
(master_doc, 'seq', u'Seq Documentation',
|
||||
(master_doc, 'codon', u'Codon Documentation',
|
||||
[author], 1)
|
||||
]
|
||||
|
||||
|
@ -168,18 +164,18 @@ man_pages = [
|
|||
# (source start file, target name, title, author,
|
||||
# dir menu entry, description, category)
|
||||
texinfo_documents = [
|
||||
(master_doc, 'seq', u'Seq Documentation',
|
||||
author, 'seq', 'a language for bioinformatics',
|
||||
(master_doc, 'codon', u'Codon Documentation',
|
||||
author, 'codon', 'a high-performance language',
|
||||
'Miscellaneous'),
|
||||
]
|
||||
|
||||
|
||||
# -- Extension configuration -------------------------------------------------
|
||||
breathe_projects = {
|
||||
'Seq': '../doxygen/xml'
|
||||
'Codon': '../doxygen/xml'
|
||||
}
|
||||
|
||||
breathe_default_project = 'Seq'
|
||||
breathe_default_project = 'Codon'
|
||||
|
||||
exhale_args = {
|
||||
# These arguments are required
|
||||
|
|
|
@ -1,273 +0,0 @@
|
|||
Cookbook
|
||||
========
|
||||
|
||||
.. contents::
|
||||
|
||||
Subsequence extraction
|
||||
----------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
myseq = s'CAATAGAGACTAAGCATTAT'
|
||||
sublen = 5
|
||||
stride = 2
|
||||
|
||||
# explicit for-loop
|
||||
for subseq in myseq.split(sublen, stride):
|
||||
print(subseq)
|
||||
|
||||
# pipelined
|
||||
myseq |> split(sublen, stride) |> print
|
||||
|
||||
k-mer extraction
|
||||
----------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
myseq = s'CAATAGAGACTAAGCATTAT'
|
||||
stride = 2
|
||||
|
||||
# explicit for-loop
|
||||
for subseq in myseq.kmers(stride, k=5):
|
||||
print(subseq)
|
||||
|
||||
# pipelined
|
||||
myseq |> kmers(stride, k=5) |> print
|
||||
|
||||
Reverse complementation
|
||||
-----------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
# sequences
|
||||
s = s'GGATC'
|
||||
print(~s) # GATCC
|
||||
|
||||
# k-mers
|
||||
k = k'GGATC'
|
||||
print(~k) # GATCC
|
||||
|
||||
k-mer Hamming distance
|
||||
----------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
k1 = k'ACGTC'
|
||||
k2 = k'ACTTA'
|
||||
# ^ ^
|
||||
print(abs(k1 - k2)) # Hamming distance = 2
|
||||
|
||||
k-mer Hamming neighbors
|
||||
-----------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
def neighbors(kmer):
|
||||
for i in range(len(kmer)):
|
||||
for b in (k'A', k'C', k'G', k'T'):
|
||||
if kmer[i] != b:
|
||||
yield kmer |> base(i, b)
|
||||
|
||||
print(list(neighbors(k'AGC'))) # CGC, GGC, etc.
|
||||
|
||||
k-mer minimizer
|
||||
---------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
def minimizer(s, k: Static[int]):
|
||||
assert len(s) >= k
|
||||
kmer_min = Kmer[k](s[:k])
|
||||
for kmer in s[1:].kmers(k=k, step=1):
|
||||
kmer = min(kmer, ~kmer)
|
||||
if kmer < kmer_min: kmer_min = kmer
|
||||
return kmer_min
|
||||
|
||||
print(minimizer(s'ACGTACGTACGT', 10))
|
||||
|
||||
de Bruijn edge
|
||||
--------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
def de_bruijn_edge(a, b):
|
||||
a = a |> base(0, k'A') # reset first base: [T]GAG -> [A]GAG
|
||||
b = b >> s'A' # shift right to A: [GAG]C -> A[GAG]
|
||||
return a == b # suffix of a == prefix of b
|
||||
|
||||
print(de_bruijn_edge(k'TGAG', k'GAGC')) # True
|
||||
print(de_bruijn_edge(k'TCAG', k'GAGC')) # False
|
||||
|
||||
Count bases
|
||||
-----------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
@tuple
|
||||
class BaseCount:
|
||||
A: int
|
||||
C: int
|
||||
G: int
|
||||
T: int
|
||||
|
||||
def __add__(self, other: BaseCount):
|
||||
a1, c1, g1, t1 = self
|
||||
a2, c2, g2, t2 = other
|
||||
return (a1 + a2, c1 + c2, g1 + g2, t1 + t2)
|
||||
|
||||
def count_bases(s):
|
||||
match s:
|
||||
case 'A*': return count_bases(s[1:]) + (1,0,0,0)
|
||||
case 'C*': return count_bases(s[1:]) + (0,1,0,0)
|
||||
case 'G*': return count_bases(s[1:]) + (0,0,1,0)
|
||||
case 'T*': return count_bases(s[1:]) + (0,0,0,1)
|
||||
case _: return BaseCount(0,0,0,0)
|
||||
|
||||
print(count_bases(s'ACCGGGTTTT')) # (A: 1, C: 2, G: 3, T: 4)
|
||||
|
||||
Spaced seed search
|
||||
------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
def has_spaced_acgt(s):
|
||||
match s:
|
||||
case 'A_C_G_T*':
|
||||
return True
|
||||
case t if len(t) >= 8:
|
||||
return has_spaced_acgt(s[1:])
|
||||
case _:
|
||||
return False
|
||||
|
||||
print(has_spaced_acgt(s'AAATCTGTTAAA')) # True
|
||||
print(has_spaced_acgt(s'ACGTACGTACGT')) # False
|
||||
|
||||
Reverse-complement palindrome
|
||||
-----------------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
def is_own_revcomp(s):
|
||||
match s:
|
||||
case 'A*T' | 'T*A' | 'C*G' | 'G*C':
|
||||
return is_own_revcomp(s[1:-1])
|
||||
case '':
|
||||
return True
|
||||
case _:
|
||||
return False
|
||||
|
||||
print(is_own_revcomp(s'ACGT')) # True
|
||||
print(is_own_revcomp(s'ATTA')) # False
|
||||
|
||||
Sequence alignment
|
||||
------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
# default parameters
|
||||
s1 = s'CGCGAGTCTT'
|
||||
s2 = s'CGCAGAGTT'
|
||||
aln = s1 @ s2
|
||||
print(aln.cigar, aln.score) # 3M1I6M -3
|
||||
|
||||
# custom parameters
|
||||
# match = 2; mismatch = 4; gap1(k) = 2k + 4; gap2(k) = k + 13
|
||||
aln = s1.align(s2, a=2, b=4, gapo=4, gape=2, gapo2=13, gape2=1)
|
||||
print(aln.cigar, aln.score) # 3M1D3M2I2M 2
|
||||
|
||||
Reading FASTA/FASTQ
|
||||
-------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
# iterate over everything
|
||||
for r in FASTA('genome.fa'):
|
||||
print(r.name)
|
||||
print(r.seq)
|
||||
|
||||
# iterate over sequences
|
||||
for s in FASTA('genome.fa') |> seqs:
|
||||
print(s)
|
||||
|
||||
# iterate over everything
|
||||
for r in FASTQ('reads.fq'):
|
||||
print(r.name)
|
||||
print(r.read)
|
||||
print(r.qual)
|
||||
|
||||
# iterate over sequences
|
||||
for s in FASTQ('reads.fq') |> seqs:
|
||||
print(s)
|
||||
|
||||
Reading paired-end FASTQ
|
||||
------------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
for r1, r2 in zip(FASTQ('reads_1.fq'), FASTQ('reads_2.fq')):
|
||||
print(r1.name, r2.name)
|
||||
print(r1.read, r2.read)
|
||||
print(r1.qual, r2.qual)
|
||||
|
||||
Parallel FASTQ processing
|
||||
-------------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
def process(s: seq):
|
||||
...
|
||||
|
||||
# OMP_NUM_THREADS environment variable controls threads
|
||||
FASTQ('reads.fq') |> iter ||> process
|
||||
|
||||
# Sometimes batching reads into blocks can improve performance,
|
||||
# especially if each is quick to process.
|
||||
FASTQ('reads.fq') |> blocks(size=1000) ||> iter |> process
|
||||
|
||||
Reading SAM/BAM/CRAM
|
||||
--------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
# iterate over everything
|
||||
for r in SAM('alignments.sam'):
|
||||
print(r.name)
|
||||
print(r.read)
|
||||
print(r.pos)
|
||||
print(r.mapq)
|
||||
print(r.cigar)
|
||||
print(r.reversed)
|
||||
# etc.
|
||||
|
||||
for r in BAM('alignments.bam'):
|
||||
# ...
|
||||
|
||||
for r in CRAM('alignments.cram'):
|
||||
# ...
|
||||
|
||||
# iterate over sequences
|
||||
for s in SAM('alignments.sam') |> seqs:
|
||||
print(s)
|
||||
|
||||
for s in BAM('alignments.bam') |> seqs:
|
||||
print(s)
|
||||
|
||||
for s in CRAM('alignments.cram') |> seqs:
|
||||
print(s)
|
||||
|
||||
DNA to protein translation
|
||||
--------------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
dna = s'AGGTCTAACGGC'
|
||||
protein = dna |> translate
|
||||
print(protein) # RSNG
|
||||
|
||||
Reading protein sequences from FASTA
|
||||
------------------------------------
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
for s in pFASTA('seqs.fasta') |> seqs:
|
||||
print(s)
|
|
@ -17,23 +17,23 @@ root=os.path.abspath(sys.argv[1])
|
|||
print(f"Generating documentation for {root}...")
|
||||
|
||||
|
||||
# 1. Call seqc -docstr and get a documentation in JSON format
|
||||
# 1. Call codon -docstr and get a documentation in JSON format
|
||||
def load_json(directory):
|
||||
# Get all seq files in the directory
|
||||
# Get all codon files in the directory
|
||||
files=[]
|
||||
for root,_,items in os.walk(directory):
|
||||
for f in items:
|
||||
if f.endswith('.seq'):
|
||||
if f.endswith('.codon'):
|
||||
files.append(os.path.abspath(os.path.join(root,f)))
|
||||
files='\n'.join(files)
|
||||
s=sp.run(['../../build/seqc','doc'],stdout=sp.PIPE,input=files.encode('utf-8'))
|
||||
s=sp.run(['../../build/codon','doc'],stdout=sp.PIPE,input=files.encode('utf-8'))
|
||||
if s.returncode!=0:
|
||||
raise ValueError('seqc failed')
|
||||
raise ValueError('codon failed')
|
||||
return json.loads(s.stdout.decode('utf-8'))
|
||||
|
||||
|
||||
j=load_json(root)
|
||||
print(f" - Done with seqc")
|
||||
print(f" - Done with codon")
|
||||
# with open('x.json','w') as f:
|
||||
# json.dump(j,f,indent=2)
|
||||
|
||||
|
@ -48,8 +48,8 @@ for mid,module in modules.items():
|
|||
print(root,mid,module)
|
||||
directory=os.path.relpath(directory,root) # remove the prefix
|
||||
os.makedirs(f"stdlib/{directory}",exist_ok=True)
|
||||
if name.endswith('.seq'):
|
||||
name=name[:-4]
|
||||
if name.endswith('.codon'):
|
||||
name=name[:-6] # drop suffix
|
||||
if name!='__init__':
|
||||
parsed_modules[directory].add((name,mid))
|
||||
module=os.path.split(module)[0]
|
||||
|
@ -57,7 +57,7 @@ for directory,modules in parsed_modules.items():
|
|||
module=directory.replace('/','.')
|
||||
with open(f'stdlib/{directory}/index.rst','w') as f:
|
||||
if module!='.':
|
||||
print(f".. seq:module:: {module}\n",file=f)
|
||||
print(f".. codon:module:: {module}\n",file=f)
|
||||
print(f"{module}",file=f)
|
||||
else:
|
||||
print("Standard Library Reference",file=f)
|
||||
|
@ -143,11 +143,11 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
|
|||
if name=='__init__':
|
||||
file,mode=f'stdlib/{directory}/index.rst','a'
|
||||
with open(file,mode) as f:
|
||||
print(f".. seq:module:: {module}\n",file=f)
|
||||
print(f":seq:mod:`{module}`",file=f)
|
||||
print("-"*(len(module)+11)+"\n",file=f)
|
||||
print(f".. codon:module:: {module}\n",file=f)
|
||||
print(f":codon:mod:`{module}`",file=f)
|
||||
print("-"*(len(module)+13)+"\n",file=f)
|
||||
directory_prefix=directory+'/' if directory!='.' else ''
|
||||
print(f"Source code: `{directory_prefix}{name}.seq <https://github.com/seq-lang/seq/blob/master/stdlib/{directory}/{name}.seq>`_\n",file=f)
|
||||
print(f"Source code: `{directory_prefix}{name}.codon <https://github.com/exaloop.io/codon/blob/master/stdlib/{directory}/{name}.codon>`_\n",file=f)
|
||||
if 'doc' in j[mid]:
|
||||
print(parse_docstr(j[mid]['doc']),file=f)
|
||||
|
||||
|
@ -162,13 +162,13 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
|
|||
if v['kind']=='class':
|
||||
if v['name'].endswith('Error'):
|
||||
v["type"]="exception"
|
||||
f.write(f'.. seq:{v["type"]}:: {v["name"]}')
|
||||
f.write(f'.. codon:{v["type"]}:: {v["name"]}')
|
||||
if 'generics' in v and v['generics']:
|
||||
f.write(f'[{",".join(v["generics"])}]')
|
||||
elif v['kind']=='function':
|
||||
f.write(f'.. seq:function:: {v["name"]}{parse_fn(v)}')
|
||||
f.write(f'.. codon:function:: {v["name"]}{parse_fn(v)}')
|
||||
elif v['kind']=='variable':
|
||||
f.write(f'.. seq:data:: {v["name"]}')
|
||||
f.write(f'.. codon:data:: {v["name"]}')
|
||||
# if v['kind'] == 'class' and v['type'] == 'extension':
|
||||
# f.write(f'**`{getLink(v["parent"])}`**')
|
||||
# else:
|
||||
|
@ -202,7 +202,7 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
|
|||
print(' **Properties:**\n',file=f)
|
||||
for c in props:
|
||||
v=j[c]
|
||||
f.write(f' .. seq:attribute:: {v["name"]}\n')
|
||||
f.write(f' .. codon:attribute:: {v["name"]}\n')
|
||||
if 'doc' in v:
|
||||
f.write("\n"+parse_docstr(v['doc'],4)+"\n\n")
|
||||
f.write("\n")
|
||||
|
@ -212,7 +212,7 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
|
|||
print(' **Magic methods:**\n',file=f)
|
||||
for c in magics:
|
||||
v=j[c]
|
||||
f.write(f' .. seq:method:: {v["name"]}{parse_fn(v,True)}\n')
|
||||
f.write(f' .. codon:method:: {v["name"]}{parse_fn(v,True)}\n')
|
||||
f.write(' :noindex:\n')
|
||||
if 'doc' in v:
|
||||
f.write("\n"+parse_docstr(v['doc'],4)+"\n\n")
|
||||
|
@ -222,7 +222,7 @@ for directory,(name,mid) in {(d,m) for d,mm in parsed_modules.items() for m in m
|
|||
print(' **Methods:**\n',file=f)
|
||||
for c in methods:
|
||||
v=j[c]
|
||||
f.write(f' .. seq:method:: {v["name"]}{parse_fn(v,True)}\n')
|
||||
f.write(f' .. codon:method:: {v["name"]}{parse_fn(v,True)}\n')
|
||||
if 'doc' in v:
|
||||
f.write("\n"+parse_docstr(v['doc'],4)+"\n\n")
|
||||
f.write("\n")
|
||||
|
|
|
@ -1,9 +1,9 @@
|
|||
Calling Seq from C/C++
|
||||
======================
|
||||
Calling Codon from C/C++
|
||||
========================
|
||||
|
||||
Calling C/C++ from Seq is quite easy with ``from C import``, but Seq can also be called from C/C++ code. To make a Seq function externally visible, simply annotate it with ``@export``:
|
||||
Calling C/C++ from Codon is quite easy with ``from C import``, but Codon can also be called from C/C++ code. To make a Codon function externally visible, simply annotate it with ``@export``:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
@export
|
||||
def foo(n: int):
|
||||
|
@ -11,14 +11,14 @@ Calling C/C++ from Seq is quite easy with ``from C import``, but Seq can also be
|
|||
print(i * i)
|
||||
return n * n
|
||||
|
||||
Note that only top-level, non-generic functions can be exported. Now we can create a shared library containing ``foo`` (assuming source file *foo.seq*):
|
||||
Note that only top-level, non-generic functions can be exported. Now we can create a shared library containing ``foo`` (assuming source file *foo.codon*):
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
seqc build -o foo.o foo.seq
|
||||
gcc -shared -lseqrt -lomp foo.o -o libfoo.so
|
||||
codon build -o foo.o foo.codon
|
||||
gcc -shared -lcodonrt -lomp foo.o -o libfoo.so
|
||||
|
||||
(The last command might require an additional ``-L/path/to/seqrt/lib/`` argument if ``libseqrt`` is not installed on a standard path.)
|
||||
(The last command might require an additional ``-L/path/to/codonrt/lib/`` argument if ``libcodonrt`` is not installed on a standard path.)
|
||||
|
||||
Now we can call ``foo`` from a C program:
|
||||
|
||||
|
@ -39,22 +39,21 @@ Compile:
|
|||
|
||||
gcc -o foo -L. -lfoo foo.c
|
||||
|
||||
Now running ``./foo`` should invoke ``foo()`` as defined in Seq, with an argument of ``10``.
|
||||
Now running ``./foo`` should invoke ``foo()`` as defined in Codon, with an argument of ``10``.
|
||||
|
||||
Converting types
|
||||
----------------
|
||||
|
||||
The following table shows the conversions between Seq and C/C++ types:
|
||||
The following table shows the conversions between Codon and C/C++ types:
|
||||
|
||||
============ ============
|
||||
Seq C/C++
|
||||
Codon C/C++
|
||||
------------ ------------
|
||||
``int`` ``int64_t``
|
||||
``float`` ``double``
|
||||
``bool`` ``bool``
|
||||
``byte`` ``int8_t``
|
||||
``str`` ``{int64_t, char*}``
|
||||
``seq`` ``{int64_t, char*}``
|
||||
``class`` Pointer to corresponding tuple
|
||||
``@tuple`` Struct of fields
|
||||
============ ============
|
||||
|
|
|
@ -1,7 +1,22 @@
|
|||
**Seq** — a Python implementation for bioinformatics
|
||||
**Codon** - a high-performance Python implementation
|
||||
====================================================
|
||||
|
||||
Seq is a Pythonic language for computational genomics and bioinformatics. With a Python-compatible syntax and a host of domain-specific features and optimizations, Seq makes writing high-performance genomics software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.
|
||||
What is Codon?
|
||||
--------------
|
||||
|
||||
Codon is a high-performance Python implementation that compiles Python code to native machine code without any runtime overhead.
|
||||
The Codon framework grew out of the `Seq <https://seq-lang.org>`_ project; while Seq focuses on genomics and bioinformatics, Codon
|
||||
can be applied in a number of different areas and is extensible via a plugin system.
|
||||
|
||||
Typical speedups over Python are on the order of 10-100x or more, on a single thread. Codon supports native multithreading which can lead
|
||||
to speedups many times higher still.
|
||||
|
||||
What isn't Codon?
|
||||
-----------------
|
||||
|
||||
While Codon supports nearly all of Python's syntax, it is not a drop-in replacement, and large codebases might require modifications
|
||||
to be run through the Codon compiler. For example, some of Python's modules are not yet implemented within Codon, and a few of Python's
|
||||
dynamic features are disallowed. The Codon compiler produces detailed error messages to help identify and resolve any incompatibilities.
|
||||
|
||||
Questions, comments or suggestions? Visit our `Gitter chatroom <https://gitter.im/seq-lang/Seq?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_.
|
||||
|
||||
|
@ -13,62 +28,36 @@ Questions, comments or suggestions? Visit our `Gitter chatroom <https://gitter.i
|
|||
parallel
|
||||
python
|
||||
embed
|
||||
cookbook
|
||||
build
|
||||
stdlib/index
|
||||
|
||||
News
|
||||
----
|
||||
|
||||
What's new in 0.11?
|
||||
^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Version 0.11 again includes a number of upgrades to the parser, type system and compiler backend:
|
||||
|
||||
- Parallelism and multithreading support with OpenMP (e.g. ``@par for i in range(10): ...``)
|
||||
- New PEG parser
|
||||
- Improved compile-time static evaluation
|
||||
- Upgrade to LLVM 12 (from LLVM 6)
|
||||
- A few changes to the ``bio`` module, particularly with ``seq.kmers()``, which now takes a length argument as in ``s.kmers(k=12, step=1)``
|
||||
|
||||
What's new in 0.10?
|
||||
^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Version 0.10 brings a slew of improvements to the language and compiler, including:
|
||||
|
||||
- Nearly all of Python's syntax is now supported, including empty collections (``[]``, ``{}``), lambda functions (``lambda``), ``*args``/``**kwargs``, ``None`` and much more
|
||||
- Compiler error messages now pinpoint exactly where an error occurred with compile-time backtraces
|
||||
- Runtime exceptions now include backtraces with file names and line numbers in debug mode
|
||||
- GDB and LLDB support
|
||||
- Various syntax updates to further close the gap with Python
|
||||
- Numerous standard library improvements
|
||||
|
||||
.. caution::
|
||||
The default compilation and execution mode is now "debug", which disables most optimizations. Pass the ``-release`` argument to ``seqc`` to enable optimizations.
|
||||
|
||||
Frequently Asked Questions
|
||||
--------------------------
|
||||
|
||||
*Can I use Seq for general-purpose computing?*
|
||||
*What is the goal of Codon?*
|
||||
|
||||
Yes! While the Seq project started with a narrow focus on bioinformatics, it has grown to encompass much of Python's syntax, semantics and modules, making it a useful tool beyond just bioinformatics, particularly when large datasets need to be processed with Python.
|
||||
One of the main focuses of Codon is to bridge the gap between usability and performance. Codon aims to make writing high-performance software
|
||||
substantially easier, and to provide a common, unified framework for the development of such software across a range of domains.
|
||||
|
||||
*What is the goal of Seq?*
|
||||
*How does Codon compare to other Python implementations?*
|
||||
|
||||
One of the main focuses of Seq is to bridge the gap between usability and performance in the fields of bioinformatics and computational genomics, which have an unfortunate reputation for hard-to-use, buggy or generally poorly-written software. Seq aims to make writing high-performance genomics or bioinformatics software substantially easier, and to provide a common, unified framework for the development of such software.
|
||||
|
||||
*Why do we need a whole new language? Why not a library?*
|
||||
|
||||
There are many great bioinformatics libraries on the market today, including `Biopython <https://biopython.org>`_ for Python, `SeqAn <https://www.seqan.de>`_ for C++ and `BioJulia <https://biojulia.net>`_ for Julia. In fact, Seq offers a lot of the same functionality found in these libraries. The advantages of having a domain-specific language and compiler, however, are the higher-level constructs and optimizations like :ref:`pipeline`, :ref:`match`, :ref:`interalign` and :ref:`prefetch`, which are difficult to replicate in a library, as they often involve large-scale program transformations/optimizations. A domain-specific language also allows us to explore different backends like GPU, TPU or FPGA in a systematic way, in conjunction with these various constructs/optimizations, which is ongoing work.
|
||||
Unlike other performance-oriented Python implementations, such as PyPy or Numba, Codon is a standalone system implemented entirely independently
|
||||
of regular Python. Since it does not need to interoperate with CPython's runtime, Codon has far greater flexibility to generate optimized code.
|
||||
In fact, Codon will frequently generate the same code as that from an equivalent C or C++ program. This design choice also allows Codon to circumvent
|
||||
issues like Python's global interpretter lock, and thereby to take full advantage of parallelism and multithreading.
|
||||
|
||||
*What about interoperability with other languages and frameworks?*
|
||||
|
||||
Interoperability is and will continue to be a priority for the Seq project. We don't want using Seq to render you unable to use all the other great frameworks and libraries that exist. Seq already supports interoperability with C/C++ and Python (see :ref:`interop`).
|
||||
Interoperability is and will continue to be a priority for the Codon project. We don't want using Codon to render you unable to use all the other great
|
||||
frameworks and libraries that exist. Codon already supports interoperability with C/C++ and Python (see :ref:`interop`).
|
||||
|
||||
*I want to contribute! How do I get started?*
|
||||
|
||||
Great! Check out our `contribution guidelines <https://github.com/seq-lang/seq/blob/master/CONTRIBUTING.md>`_ and `open issues <https://github.com/seq-lang/seq/issues>`_ to get started. Also don't hesitate to drop by our `Gitter chatroom <https://gitter.im/seq-lang/Seq?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_ if you have any questions.
|
||||
Great! Check out our `contribution guidelines <https://github.com/exaloop/codon/blob/master/CONTRIBUTING.md>`_ and `open issues <https://github.com/exaloop/codon/issues>`_
|
||||
to get started. Also don't hesitate to drop by our `Gitter chatroom <https://gitter.im/seq-lang/Seq?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_
|
||||
if you have any questions.
|
||||
|
||||
*What is planned for the future?*
|
||||
|
||||
See the `roadmap <https://github.com/seq-lang/seq/wiki/Roadmap>`_ for information about this.
|
||||
See the `roadmap <https://github.com/exaloop/codon/wiki/Roadmap>`_ for information about this.
|
||||
|
|
|
@ -7,13 +7,13 @@ Install
|
|||
Pre-built binaries
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Pre-built binaries for Linux and macOS on x86_64 are available alongside `each release <https://github.com/seq-lang/seq/releases>`_. We also have a script for downloading and installing pre-built versions:
|
||||
Pre-built binaries for Linux and macOS on x86_64 are available alongside `each release <https://github.com/exaloop/codon/releases>`_. We also have a script for downloading and installing pre-built versions:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
|
||||
|
||||
This will install Seq in a new ``.seq`` directory within your home directory.
|
||||
This will install Codon in a new ``.codon`` directory within your home directory.
|
||||
|
||||
Building from source
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
@ -23,56 +23,55 @@ See `Building from Source <build.html>`_.
|
|||
Usage
|
||||
-----
|
||||
|
||||
The ``seqc`` program can either directly run Seq source in JIT mode:
|
||||
The ``codon`` program can either directly ``run`` Codon source in JIT mode:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
seqc run myprogram.seq
|
||||
codon run myprogram.codon
|
||||
|
||||
The default compilation and run mode is *debug* (``-debug``). Compile and run with optimizations with the ``-release`` option:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
seqc run -release myprogram.seq
|
||||
codon run -release myprogram.codon
|
||||
|
||||
``seqc`` can also produce executables (ensure you have ``clang`` installed, as it is used for linking):
|
||||
``codon`` can also ``build`` executables (ensure you have ``clang`` installed, as it is used for linking):
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# generate 'myprogram' executable
|
||||
seqc build -exe myprogram.seq
|
||||
codon build -exe myprogram.codon
|
||||
|
||||
# generate 'foo' executable
|
||||
seqc build -o foo myprogram.seq
|
||||
codon build -o foo myprogram.codon
|
||||
|
||||
``seqc`` can produce object files:
|
||||
``codon`` can produce object files:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# generate 'myprogram.o' object file
|
||||
seqc build -obj myprogram.seq
|
||||
codon build -obj myprogram.codon
|
||||
|
||||
# generate 'foo.o' object file
|
||||
seqc build -o foo.o myprogram.seq
|
||||
codon build -o foo.o myprogram.codon
|
||||
|
||||
``seqc`` can produce LLVM IR:
|
||||
``codon`` can produce LLVM IR:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# generate 'myprogram.ll' object file
|
||||
seqc build -llvm myprogram.seq
|
||||
codon build -llvm myprogram.codon
|
||||
|
||||
# generate 'foo.ll' object file
|
||||
seqc build -o foo.ll myprogram.seq
|
||||
codon build -o foo.ll myprogram.codon
|
||||
|
||||
Compile-time definitions
|
||||
------------------------
|
||||
|
||||
``seqc`` allows for compile-time definitions via the ``-D`` flag. For example, in the following code:
|
||||
``codon`` allows for compile-time definitions via the ``-D`` flag. For example, in the following code:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
from bio import *
|
||||
print(Kmer[SEED_LEN]())
|
||||
print(Int[BIT_WIDTH]())
|
||||
|
||||
``SEED_LEN`` can be specified on the command line as such: ``seqc run -DSEED_LEN=10 myprogram.seq``.
|
||||
``BIT_WIDTH`` can be specified on the command line as such: ``codon run -DBIT_WIDTH=10 myprogram.codon``.
|
||||
|
|
Binary file not shown.
Before Width: | Height: | Size: 208 KiB After Width: | Height: | Size: 22 KiB |
|
@ -9,7 +9,7 @@ if "%SPHINXBUILD%" == "" (
|
|||
)
|
||||
set SOURCEDIR=.
|
||||
set BUILDDIR=_build
|
||||
set SPHINXPROJ=seq
|
||||
set SPHINXPROJ=codon
|
||||
|
||||
if "%1" == "" goto help
|
||||
|
||||
|
|
|
@ -1,9 +1,11 @@
|
|||
.. _parallelism:
|
||||
|
||||
Parallelism and Multithreading
|
||||
==============================
|
||||
|
||||
Seq supports parallelism and multithreading via OpenMP out of the box. Here's an example:
|
||||
Codon supports parallelism and multithreading via OpenMP out of the box. Here's an example:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
@par
|
||||
for i in range(10):
|
||||
|
@ -14,7 +16,7 @@ By default, parallel loops will use all available threads, or use the number of
|
|||
specified by the ``OMP_NUM_THREADS`` environment variable. A specific thread number can
|
||||
be given directly on the ``@par`` line as well:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
@par(num_threads=5)
|
||||
for i in range(10):
|
||||
|
@ -31,7 +33,7 @@ be given directly on the ``@par`` line as well:
|
|||
Other OpenMP parameters like ``private``, ``shared`` or ``reduction``, are inferred
|
||||
automatically by the compiler. For example, the following loop
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
total = 0
|
||||
@par
|
||||
|
@ -43,7 +45,7 @@ will automatically generate a reduction for variable ``a``.
|
|||
Here is an example that finds the sum of prime numbers up to a user-defined limit, using
|
||||
a parallel loop on 16 threads with a dynamic schedule and chunk size of 100:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
from sys import argv
|
||||
|
||||
|
@ -71,7 +73,7 @@ the factors of an integer takes more time for larger integers, we use a dynamic
|
|||
``@par`` also supports C/C++ OpenMP pragma strings. For example, the ``@par`` line in the
|
||||
above example can also be written as:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
# same as: @par(schedule='dynamic', chunk_size=100, num_threads=16)
|
||||
@par('schedule(dynamic, 100) num_threads(16)')
|
||||
|
@ -84,13 +86,13 @@ applies to *imperative* for-loops of the form ``for i in range(a, b, c)`` (where
|
|||
For general parallel for-loops of the form ``for i in some_generator()``, a task-based approach is
|
||||
used instead, where each loop iteration is executed as an independent task.
|
||||
|
||||
The Seq compiler also converts iterations over lists (``for a in some_list``) to imperative
|
||||
The Codon compiler also converts iterations over lists (``for a in some_list``) to imperative
|
||||
for-loops, meaning these loops can be executed using OpenMP's loop parallelism.
|
||||
|
||||
Custom reductions
|
||||
-----------------
|
||||
|
||||
Seq can automatically generate efficient reductions for ``int`` and ``float`` values. For other
|
||||
Codon can automatically generate efficient reductions for ``int`` and ``float`` values. For other
|
||||
data types, user-defined reductions can be specified. A class that supports reductions must
|
||||
include:
|
||||
|
||||
|
@ -99,7 +101,7 @@ include:
|
|||
|
||||
Here is an example for reducing a new ``Vector`` type:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
@tuple
|
||||
class Vector:
|
||||
|
@ -121,9 +123,9 @@ Here is an example for reducing a new ``Vector`` type:
|
|||
OpenMP constructs
|
||||
-----------------
|
||||
|
||||
All of OpenMP's API functions are accessible directly in Seq. For example:
|
||||
All of OpenMP's API functions are accessible directly in Codon. For example:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
import openmp as omp
|
||||
print(omp.get_num_threads())
|
||||
|
@ -132,7 +134,7 @@ All of OpenMP's API functions are accessible directly in Seq. For example:
|
|||
OpenMP's *critical*, *master*, *single* and *ordered* constructs can be applied via the
|
||||
corresponding decorators:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
import openmp as omp
|
||||
|
||||
|
@ -161,7 +163,7 @@ corresponding decorators:
|
|||
|
||||
For finer-grained locking, consider using the locks from the ``threading`` module:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
from threading import Lock
|
||||
lock = Lock() # or RLock for re-entrant lock
|
||||
|
|
|
@ -1,17 +1,17 @@
|
|||
Calling Python from Seq
|
||||
=======================
|
||||
Calling Python from Codon
|
||||
=========================
|
||||
|
||||
Calling Python from Seq is possible in two ways:
|
||||
Calling Python from Codon is possible in two ways:
|
||||
|
||||
- ``from python import`` allows importing and calling Python functions from existing Python modules.
|
||||
- ``@python`` allows writing Python code directly in Seq.
|
||||
- ``@python`` allows writing Python code directly in Codon.
|
||||
|
||||
In order to use these features, the ``SEQ_PYTHON`` environment variable must be set to the appropriate
|
||||
In order to use these features, the ``CODON_PYTHON`` environment variable must be set to the appropriate
|
||||
Python shared library:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
export SEQ_PYTHON=/path/to/libpython.X.Y.so
|
||||
export CODON_PYTHON=/path/to/libpython.X.Y.so
|
||||
|
||||
For example, with a ``brew``-installed Python 3.9 on macOS, this might be
|
||||
|
||||
|
@ -21,8 +21,8 @@ For example, with a ``brew``-installed Python 3.9 on macOS, this might be
|
|||
|
||||
Note that only Python versions 3.6 and later are supported.
|
||||
|
||||
``from python import``
|
||||
----------------------
|
||||
from python import
|
||||
------------------
|
||||
|
||||
Let's say we have a Python function defined in *mymodule.py*:
|
||||
|
||||
|
@ -31,22 +31,22 @@ Let's say we have a Python function defined in *mymodule.py*:
|
|||
def multiply(a, b):
|
||||
return a * b
|
||||
|
||||
We can call this function in Seq using ``from python import`` and indicating the appropriate
|
||||
We can call this function in Codon using ``from python import`` and indicating the appropriate
|
||||
call and return types:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
from python import mymodule.multiply(int, int) -> int
|
||||
print(multiply(3, 4)) # 12
|
||||
|
||||
(Be sure the ``PYTHONPATH`` environment variable includes the path of *mymodule.py*!)
|
||||
|
||||
``@python``
|
||||
-----------
|
||||
@python
|
||||
-------
|
||||
|
||||
Seq programs can contain functions that will be executed by Python via ``pydef``:
|
||||
Codon programs can contain functions that will be executed by Python via ``pydef``:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
@python
|
||||
def multiply(a: int, b: int) -> int:
|
||||
|
@ -56,7 +56,7 @@ Seq programs can contain functions that will be executed by Python via ``pydef``
|
|||
|
||||
This makes calling Python modules like NumPy very easy:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
@python
|
||||
def myrange(n: int) -> List[int]:
|
||||
|
|
|
@ -1,245 +0,0 @@
|
|||
# Seq Pygments lexer based on Python's
|
||||
|
||||
import re
|
||||
|
||||
from pygments.lexer import Lexer, RegexLexer, include, bygroups, using, \
|
||||
default, words, combined, do_insertions
|
||||
from pygments.util import get_bool_opt, shebang_matches
|
||||
from pygments.token import Text, Comment, Operator, Keyword, Name, String, \
|
||||
Number, Punctuation, Generic, Other, Error
|
||||
from pygments import unistring as uni
|
||||
|
||||
__all__ = ['SeqLexer']
|
||||
|
||||
line_re = re.compile('.*?\n')
|
||||
|
||||
class SeqLexer(RegexLexer):
|
||||
name = 'Seq'
|
||||
aliases = ['seq']
|
||||
filenames = ['*.seq']
|
||||
mimetypes = []
|
||||
|
||||
flags = re.MULTILINE | re.UNICODE
|
||||
|
||||
uni_name = "[%s][%s]*" % (uni.xid_start, uni.xid_continue)
|
||||
|
||||
def innerstring_rules(ttype):
|
||||
return [
|
||||
# the old style '%s' % (...) string formatting (still valid in Py3)
|
||||
(r'%(\(\w+\))?[-#0 +]*([0-9]+|[*])?(\.([0-9]+|[*]))?'
|
||||
'[hlL]?[E-GXc-giorsux%]', String.Interpol),
|
||||
# the new style '{}'.format(...) string formatting
|
||||
(r'\{'
|
||||
'((\w+)((\.\w+)|(\[[^\]]+\]))*)?' # field name
|
||||
'(\![sra])?' # conversion
|
||||
'(\:(.?[<>=\^])?[-+ ]?#?0?(\d+)?,?(\.\d+)?[E-GXb-gnosx%]?)?'
|
||||
'\}', String.Interpol),
|
||||
|
||||
# backslashes, quotes and formatting signs must be parsed one at a time
|
||||
(r'[^\\\'"%{\n]+', ttype),
|
||||
(r'[\'"\\]', ttype),
|
||||
# unhandled string formatting sign
|
||||
(r'%|(\{{1,2})', ttype)
|
||||
# newlines are an error (use "nl" state)
|
||||
]
|
||||
|
||||
tokens = {
|
||||
'root': [
|
||||
(r'\n', Text),
|
||||
(r'^(\s*)([rRuUbB]{,2})("""(?:.|\n)*?""")',
|
||||
bygroups(Text, String.Affix, String.Doc)),
|
||||
(r"^(\s*)([rRuUbB]{,2})('''(?:.|\n)*?''')",
|
||||
bygroups(Text, String.Affix, String.Doc)),
|
||||
(r'[^\S\n]+', Text),
|
||||
(r'\A#!.+$', Comment.Hashbang),
|
||||
(r'#.*$', Comment.Single),
|
||||
(r'[]{}:(),;[]', Punctuation),
|
||||
(r'\\\n', Text),
|
||||
(r'\\', Text),
|
||||
(r'(in|is|and|or|not)\b', Operator.Word),
|
||||
(r'!=|==|<<|>>|[-~+/*%=<>&^|.]', Operator),
|
||||
include('keywords'),
|
||||
(r'(def)((?:\s|\\\s)+)', bygroups(Keyword, Text), 'funcname'),
|
||||
(r'(class)((?:\s|\\\s)+)', bygroups(Keyword, Text), 'classname'),
|
||||
(r'(from)((?:\s|\\\s)+)', bygroups(Keyword.Namespace, Text),
|
||||
'fromimport'),
|
||||
(r'(import)((?:\s|\\\s)+)', bygroups(Keyword.Namespace, Text),
|
||||
'import'),
|
||||
include('builtins'),
|
||||
include('magicfuncs'),
|
||||
include('magicvars'),
|
||||
include('backtick'),
|
||||
('([skprR]|[uUbB][rR]|[rR][uUbB])(""")',
|
||||
bygroups(String.Affix, String.Double), 'tdqs'),
|
||||
("([skprR]|[uUbB][rR]|[rR][uUbB])(''')",
|
||||
bygroups(String.Affix, String.Single), 'tsqs'),
|
||||
('([skprR]|[uUbB][rR]|[rR][uUbB])(")',
|
||||
bygroups(String.Affix, String.Double), 'dqs'),
|
||||
("([skprR]|[uUbB][rR]|[rR][uUbB])(')",
|
||||
bygroups(String.Affix, String.Single), 'sqs'),
|
||||
('([uUbB]?)(""")', bygroups(String.Affix, String.Double),
|
||||
combined('stringescape', 'tdqs')),
|
||||
("([uUbB]?)(''')", bygroups(String.Affix, String.Single),
|
||||
combined('stringescape', 'tsqs')),
|
||||
('([uUbB]?)(")', bygroups(String.Affix, String.Double),
|
||||
combined('stringescape', 'dqs')),
|
||||
("([uUbB]?)(')", bygroups(String.Affix, String.Single),
|
||||
combined('stringescape', 'sqs')),
|
||||
include('name'),
|
||||
include('numbers'),
|
||||
],
|
||||
'funcname': [
|
||||
include('magicfuncs'),
|
||||
('[a-zA-Z_]\w*', Name.Function, '#pop'),
|
||||
default('#pop'),
|
||||
],
|
||||
'stringescape': [
|
||||
(r'\\([\\abfnrtv"\']|\n|N\{.*?\}|u[a-fA-F0-9]{4}|'
|
||||
r'U[a-fA-F0-9]{8}|x[a-fA-F0-9]{2}|[0-7]{1,3})', String.Escape)
|
||||
],
|
||||
'strings-single': innerstring_rules(String.Single),
|
||||
'strings-double': innerstring_rules(String.Double),
|
||||
'dqs': [
|
||||
(r'"', String.Double, '#pop'),
|
||||
(r'\\\\|\\"|\\\n', String.Escape), # included here for raw strings
|
||||
include('strings-double')
|
||||
],
|
||||
'sqs': [
|
||||
(r"'", String.Single, '#pop'),
|
||||
(r"\\\\|\\'|\\\n", String.Escape), # included here for raw strings
|
||||
include('strings-single')
|
||||
],
|
||||
'tdqs': [
|
||||
(r'"""', String.Double, '#pop'),
|
||||
include('strings-double'),
|
||||
(r'\n', String.Double)
|
||||
],
|
||||
'tsqs': [
|
||||
(r"'''", String.Single, '#pop'),
|
||||
include('strings-single'),
|
||||
(r'\n', String.Single)
|
||||
],
|
||||
}
|
||||
|
||||
tokens['keywords'] = [
|
||||
(words((
|
||||
'assert', 'async', 'await', 'break', 'continue', 'del', 'elif',
|
||||
'else', 'except', 'finally', 'for', 'global', 'if', 'lambda', 'pass',
|
||||
'raise', 'nonlocal', 'return', 'try', 'while', 'yield', 'yield from',
|
||||
'as', 'with', 'match', 'case', 'pydef',
|
||||
'type', 'extend', 'print', 'cimport', 'pyimport'), suffix=r'\b'),
|
||||
Keyword),
|
||||
(words((
|
||||
'True', 'False', 'None'), suffix=r'\b'),
|
||||
Keyword.Constant),
|
||||
]
|
||||
|
||||
seqwords = ['__import__', 'abs', 'all', 'any', 'bin', 'bool', 'bytearray', 'bytes',
|
||||
'chr', 'classmethod', 'cmp', 'compile', 'complex', 'delattr', 'dict',
|
||||
'dir', 'divmod', 'enumerate', 'eval', 'filter', 'float', 'format',
|
||||
'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'hex', 'id',
|
||||
'input', 'int', 'isinstance', 'issubclass', 'iter', 'len', 'list',
|
||||
'locals', 'map', 'max', 'memoryview', 'min', 'object', 'oct',
|
||||
'open', 'ord', 'pow', 'property', 'range', 'repr', 'reversed',
|
||||
'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str',
|
||||
'sum', 'super', 'tuple', 'vars', 'zip', 'seq', 'byte', 'ptr', 'array',
|
||||
'Kmer', 'Int', 'UInt', 'optional']
|
||||
|
||||
tokens['builtins'] = [
|
||||
(words(seqwords, prefix=r'(?<!\.)', suffix=r'\b'),
|
||||
Name.Builtin),
|
||||
(r'(?<!\.)(self|Ellipsis|NotImplemented|cls)\b', Name.Builtin.Pseudo),
|
||||
(words((
|
||||
'ArithmeticError', 'AssertionError', 'AttributeError',
|
||||
'BaseException', 'BufferError', 'BytesWarning', 'DeprecationWarning',
|
||||
'EOFError', 'EnvironmentError', 'Exception', 'FloatingPointError',
|
||||
'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError',
|
||||
'ImportWarning', 'IndentationError', 'IndexError', 'KeyError',
|
||||
'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError',
|
||||
'NotImplementedError', 'OSError', 'OverflowError',
|
||||
'PendingDeprecationWarning', 'ReferenceError', 'ResourceWarning',
|
||||
'RuntimeError', 'RuntimeWarning', 'StopIteration',
|
||||
'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError',
|
||||
'TypeError', 'UnboundLocalError', 'UnicodeDecodeError',
|
||||
'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError',
|
||||
'UnicodeWarning', 'UserWarning', 'ValueError', 'VMSError', 'Warning',
|
||||
'WindowsError', 'ZeroDivisionError',
|
||||
# new builtin exceptions from PEP 3151
|
||||
'BlockingIOError', 'ChildProcessError', 'ConnectionError',
|
||||
'BrokenPipeError', 'ConnectionAbortedError', 'ConnectionRefusedError',
|
||||
'ConnectionResetError', 'FileExistsError', 'FileNotFoundError',
|
||||
'InterruptedError', 'IsADirectoryError', 'NotADirectoryError',
|
||||
'PermissionError', 'ProcessLookupError', 'TimeoutError'),
|
||||
prefix=r'(?<!\.)', suffix=r'\b'),
|
||||
Name.Exception),
|
||||
]
|
||||
tokens['magicfuncs'] = [
|
||||
(words((
|
||||
'__abs__', '__add__', '__aenter__', '__aexit__', '__aiter__', '__and__',
|
||||
'__anext__', '__await__', '__bool__', '__bytes__', '__call__',
|
||||
'__complex__', '__contains__', '__del__', '__delattr__', '__delete__',
|
||||
'__delitem__', '__dir__', '__divmod__', '__enter__', '__eq__', '__exit__',
|
||||
'__float__', '__floordiv__', '__format__', '__ge__', '__get__',
|
||||
'__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__',
|
||||
'__iadd__', '__iand__', '__ifloordiv__', '__ilshift__', '__imatmul__',
|
||||
'__imod__', '__import__', '__imul__', '__index__', '__init__',
|
||||
'__instancecheck__', '__int__', '__invert__', '__ior__', '__ipow__',
|
||||
'__irshift__', '__isub__', '__iter__', '__itruediv__', '__ixor__',
|
||||
'__le__', '__len__', '__length_hint__', '__lshift__', '__lt__',
|
||||
'__matmul__', '__missing__', '__mod__', '__mul__', '__ne__', '__neg__',
|
||||
'__new__', '__next__', '__or__', '__pos__', '__pow__', '__prepare__',
|
||||
'__radd__', '__rand__', '__rdivmod__', '__repr__', '__reversed__',
|
||||
'__rfloordiv__', '__rlshift__', '__rmatmul__', '__rmod__', '__rmul__',
|
||||
'__ror__', '__round__', '__rpow__', '__rrshift__', '__rshift__',
|
||||
'__rsub__', '__rtruediv__', '__rxor__', '__set__', '__setattr__',
|
||||
'__setitem__', '__str__', '__sub__', '__subclasscheck__', '__truediv__',
|
||||
'__xor__'), suffix=r'\b'),
|
||||
Name.Function.Magic),
|
||||
]
|
||||
tokens['magicvars'] = [
|
||||
(words((
|
||||
'__annotations__', '__bases__', '__class__', '__closure__', '__code__',
|
||||
'__defaults__', '__dict__', '__doc__', '__file__', '__func__',
|
||||
'__globals__', '__kwdefaults__', '__module__', '__mro__', '__name__',
|
||||
'__objclass__', '__qualname__', '__self__', '__slots__', '__weakref__'),
|
||||
suffix=r'\b'),
|
||||
Name.Variable.Magic),
|
||||
]
|
||||
tokens['numbers'] = [
|
||||
(r'(\d+\.\d*|\d*\.\d+)([eE][+-]?[0-9]+)?', Number.Float),
|
||||
(r'\d+[eE][+-]?[0-9]+j?', Number.Float),
|
||||
(r'0[oO][0-7]+', Number.Oct),
|
||||
(r'0[bB][01]+', Number.Bin),
|
||||
(r'0[xX][a-fA-F0-9]+', Number.Hex),
|
||||
(r'\d+', Number.Integer)
|
||||
]
|
||||
tokens['backtick'] = []
|
||||
tokens['name'] = [
|
||||
(r'@\w+', Name.Decorator),
|
||||
(r'@', Operator), # new matrix multiplication operator
|
||||
(uni_name, Name),
|
||||
]
|
||||
tokens['funcname'] = [
|
||||
(uni_name, Name.Function, '#pop')
|
||||
]
|
||||
tokens['classname'] = [
|
||||
(uni_name, Name.Class, '#pop')
|
||||
]
|
||||
tokens['import'] = [
|
||||
(r'(\s+)(as)(\s+)', bygroups(Text, Keyword, Text)),
|
||||
(r'\.', Name.Namespace),
|
||||
(uni_name, Name.Namespace),
|
||||
(r'(\s*)(,)(\s*)', bygroups(Text, Operator, Text)),
|
||||
default('#pop') # all else: go back
|
||||
]
|
||||
tokens['fromimport'] = [
|
||||
(r'(\s+)(import)\b', bygroups(Text, Keyword), '#pop'),
|
||||
(r'\.', Name.Namespace),
|
||||
(uni_name, Name.Namespace),
|
||||
default('#pop'),
|
||||
]
|
||||
tokens['strings-single'] = innerstring_rules(String.Single)
|
||||
tokens['strings-double'] = innerstring_rules(String.Double)
|
||||
|
||||
def analyse_text(text):
|
||||
return shebang_matches(text, r'seq')
|
|
@ -6,5 +6,4 @@ Tutorial
|
|||
|
||||
setup
|
||||
primer
|
||||
tutorial
|
||||
workshop
|
||||
pipelines
|
||||
|
|
|
@ -0,0 +1,52 @@
|
|||
Pipelines
|
||||
=========
|
||||
|
||||
Pipelines
|
||||
---------
|
||||
|
||||
Codon extends the core Python language with a pipe operator. You can chain multiple functions and generators to form a pipeline. Pipeline stages can be regular functions or generators. In the case of standard functions, the function is simply applied to the input data and the result is carried to the remainder of the pipeline, akin to F#'s functional piping. If, on the other hand, a stage is a generator, the values yielded by the generator are passed lazily to the remainder of the pipeline, which in many ways mirrors how piping is implemented in Bash. Note that Codon ensures that generator pipelines do not collect any data unless explicitly requested, thus allowing the processing of terabytes of data in a streaming fashion with no memory and minimal CPU overhead.
|
||||
|
||||
.. code:: python
|
||||
|
||||
def add1(x):
|
||||
return x + 1
|
||||
|
||||
2 |> add1 # 3; equivalent to add1(2)
|
||||
|
||||
def calc(x, y):
|
||||
return x + y**2
|
||||
2 |> calc(3) # 11; equivalent to calc(2, 3)
|
||||
2 |> calc(..., 3) # 11; equivalent to calc(2, 3)
|
||||
2 |> calc(3, ...) # 7; equivalent to calc(3, 2)
|
||||
|
||||
def gen(i):
|
||||
for i in range(i):
|
||||
yield i
|
||||
|
||||
5 |> gen |> print # prints 0 1 2 3 4 separated by newline
|
||||
range(1, 4) |> iter |> gen |> print(end=' ') # prints 0 0 1 0 1 2 without newline
|
||||
[1, 2, 3] |> print # prints [1, 2, 3]
|
||||
range(100000000) |> print # prints range(0, 100000000)
|
||||
range(100000000) |> iter |> print # not only prints all those numbers, but it uses almost no memory at all
|
||||
|
||||
Codon will chain anything that implements ``__iter__``; however, use
|
||||
generators as much as possible because the compiler will optimize out
|
||||
generators whenever possible. Combinations of pipes and generators can be
|
||||
used to implement efficient streaming pipelines.
|
||||
|
||||
.. caution::
|
||||
The Codon compiler may perform optimizations that change the order of elements passed through a pipeline. Therefore, it is best to not rely on order when using pipelines. If order needs to be maintained, consider using a regular loop or passing an index alongside each element sent through the pipeline.
|
||||
|
||||
Parallel Pipelines
|
||||
------------------
|
||||
|
||||
CPython and many other implementations alike cannot take advantage of parallelism due to the infamous global interpreter lock, a mutex that protects accesses to Python objects, preventing multiple threads from executing Python bytecode at once. Unlike CPython, Codon has no such restriction and supports full multithreading. To this end, Codon supports a *parallel* pipe operator ``||>``, which is semantically similar to the standard pipe operator except that it allows the elements sent through it to be processed in parallel by the remainder of the pipeline. Hence, turning a serial program into a parallel one often requires the addition of just a single character in Codon. Further, a single pipeline can contain multiple parallel pipes, resulting in nested parallelism.
|
||||
|
||||
.. code:: python
|
||||
|
||||
range(100000) |> iter ||> print # prints all these numbers, probably in random order
|
||||
range(100000) |> iter ||> process ||> clean # runs process in parallel, and then cleans data in parallel
|
||||
|
||||
Codon will automatically schedule the ``process`` and ``clean`` functions to execute as soon as possible. You can control the number of threads via the ``OMP_NUM_THREADS`` environment variable.
|
||||
|
||||
Internally, the Codon compiler uses an OpenMP task backend to generate code for parallel pipelines. Logically, parallel pipe operators are similar to parallel-for loops: the portion of the pipeline after the parallel pipe is outlined into a new function that is called by the OpenMP runtime task spawning routines (as in ``#pragma omp task`` in C++), and a synchronization point (``#pragma omp taskwait``) is added after the outlined segment. Lastly, the entire program is implicitly placed in an OpenMP parallel region (``#pragma omp parallel``) that is guarded by a "single" directive (``#pragma omp single``) so that the serial portions are still executed by one thread (this is required by OpenMP as tasks must be bound to an enclosing parallel region).
|
|
@ -1,14 +1,14 @@
|
|||
Language Primer
|
||||
===============
|
||||
|
||||
If you know Python, you already know 95% of Seq. The following primer
|
||||
If you know Python, you already know 95% of Codon. The following primer
|
||||
assumes some familiarity with Python or at least one "modern"
|
||||
programming language (QBASIC doesn't count).
|
||||
|
||||
Printing
|
||||
--------
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
print('hello world')
|
||||
|
||||
|
@ -18,9 +18,9 @@ Printing
|
|||
Comments
|
||||
--------
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
# Seq comments start with "# 'and go until the end of the line
|
||||
# Codon comments start with "# 'and go until the end of the line
|
||||
|
||||
"""
|
||||
There are no multi-line comments. You can (ab)use the docstring operator (''')
|
||||
|
@ -30,10 +30,10 @@ Comments
|
|||
Literals
|
||||
--------
|
||||
|
||||
Seq is a strongly typed language like C++, Java, or C#. That means each
|
||||
Codon is a strongly typed language like C++, Java, or C#. That means each
|
||||
expression must have a type that can be inferred at the compile-time.
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
# Booleans
|
||||
True # type: bool
|
||||
|
@ -41,7 +41,7 @@ expression must have a type that can be inferred at the compile-time.
|
|||
|
||||
# Numbers
|
||||
a = 1 # type: int. It's 64-bit signed integer.
|
||||
b = 1.12 # type: float. Seq's float is identical to C's double.
|
||||
b = 1.12 # type: float. Codon's float is identical to C's double.
|
||||
c = 5u # type: int, but unsigned
|
||||
d = Int[8](12) # 8-bit signed integer; you can go all the way to Int[2048]
|
||||
e = UInt[8](200) # 8-bit unsigned integer
|
||||
|
@ -68,15 +68,10 @@ expression must have a type that can be inferred at the compile-time.
|
|||
# \\, \', \", \a, \b, \f, \n, \r, \t, \v,
|
||||
# \xHHH (HHH is hex code), \OOO (OOO is octal code)
|
||||
|
||||
# Sequence types
|
||||
dna = s"ACGT" # type: seq. These are DNA sequences.
|
||||
prt = p"MYX" # type: bio.pseq. These are protein sequences.
|
||||
kmer = k"ACGT" # type: Kmer[4]. Note that Kmer[5] is different than Kmer[12].
|
||||
|
||||
Tuples
|
||||
~~~~~~
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
# Tuples
|
||||
t = (1, 2.3, 'hi') # type: Tuple[int, float, str].
|
||||
|
@ -87,10 +82,10 @@ As all types must be known at compile time, tuple indexing works
|
|||
only if a tuple is homogenous (all types are the same) or if the value
|
||||
of the index is known at compile-time.
|
||||
|
||||
You can, however, iterate over heterogenous tuples in Seq. This is achieved
|
||||
You can, however, iterate over heterogenous tuples in Codon. This is achieved
|
||||
by unrolling the loop to accommodate the different types.
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
t = (1, 2.3, 'hi')
|
||||
t[1] # works because 1 is a constant int
|
||||
|
@ -116,7 +111,7 @@ by unrolling the loop to accommodate the different types.
|
|||
Containers
|
||||
~~~~~~~~~~
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
l = [1, 2, 3] # type: List[int]; a list of integers
|
||||
s = {1.1, 3.3, 2.2, 3.3} # type: Set[float]; a set of floats
|
||||
|
@ -127,9 +122,9 @@ Containers
|
|||
dn = {} # an empty dict whose type is inferred based on usage
|
||||
dn = Dict[int, float]() # an empty dictionary with explicit element types
|
||||
|
||||
Because Seq is strongly typed, these won't compile:
|
||||
Because Codon is strongly typed, these won't compile:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
l = [1, 's'] # is it a List[int] or List[str]? you cannot mix-and-match types
|
||||
d = {1: 'hi'}
|
||||
|
@ -138,11 +133,11 @@ Because Seq is strongly typed, these won't compile:
|
|||
t = (1, 2.2) # Tuple[int, float]
|
||||
lt = list(t) # compile error: t is not homogenous
|
||||
|
||||
lp = [1, 2.1, 3, 5] # compile error: Seq will not automatically cast a float to an int
|
||||
lp = [1, 2.1, 3, 5] # compile error: Codon will not automatically cast a float to an int
|
||||
|
||||
This will work, though:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
u = (1, 2, 3)
|
||||
lu = list(u) # works: u is homogenous
|
||||
|
@ -156,7 +151,7 @@ This will work, though:
|
|||
Assignments and operators
|
||||
-------------------------
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
a = 1 + 2 # this is 3
|
||||
a = (1).__add__(2) # you can use a function call instead of an operator; this is also 3
|
||||
|
@ -178,7 +173,6 @@ Operator Magic method Description
|
|||
``**`` ``__pow__`` exponentiation
|
||||
``%`` ``__mod__`` modulo
|
||||
``@`` ``__matmul__`` matrix multiplication;
|
||||
sequence alignment
|
||||
``&`` ``__and__`` bitwise and
|
||||
``|`` ``__or__`` bitwise or
|
||||
``^`` ``__xor__`` bitwise xor
|
||||
|
@ -195,7 +189,7 @@ Operator Magic method Description
|
|||
``or`` none boolean or (short-circuits)
|
||||
======== ================ ==================================================
|
||||
|
||||
Seq also has the following unary operators:
|
||||
Codon also has the following unary operators:
|
||||
|
||||
======== ================ =============================
|
||||
Operator Magic method Description
|
||||
|
@ -211,9 +205,9 @@ Operator Magic method Description
|
|||
Tuple unpacking
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Seq supports most of Python's tuple unpacking syntax:
|
||||
Codon supports most of Python's tuple unpacking syntax:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
x, y = 1, 2 # x is 1, y is 2
|
||||
(x, (y, z)) = 1, (2, 3) # x is 1, y is 2, z is 3
|
||||
|
@ -240,9 +234,9 @@ Control flow
|
|||
Conditionals
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Seq supports the standard Python conditional syntax:
|
||||
Codon supports the standard Python conditional syntax:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
if a or b or some_cond():
|
||||
print(1)
|
||||
|
@ -255,10 +249,10 @@ Seq supports the standard Python conditional syntax:
|
|||
|
||||
a = b if sth() else c # ternary conditional operator
|
||||
|
||||
Seq extends the Python conditional syntax with a ``match`` statement, which is
|
||||
Codon extends the Python conditional syntax with a ``match`` statement, which is
|
||||
inspired by Rust's:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
match a + some_heavy_expr(): # assuming that the type of this expression is int
|
||||
case 1: # is it 1?
|
||||
|
@ -300,12 +294,6 @@ inspired by Rust's:
|
|||
case [...]: # any other list
|
||||
...
|
||||
|
||||
match sequence: # of type seq
|
||||
case 'ACGT': ...
|
||||
case 'AC_T': ... # _ is a wildcard character and it can be anything
|
||||
case 'A_C_T_': ... # a spaced k-mer AxCxTx
|
||||
case 'AC*T': ... # matches a sequence that starts with AC and ends with T
|
||||
|
||||
You can mix, match and chain match rules as long as the match type
|
||||
matches the expression type.
|
||||
|
||||
|
@ -314,7 +302,7 @@ Loops
|
|||
|
||||
Standard fare:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
a = 10
|
||||
while a > 0: # prints even numbers from 9 to 1
|
||||
|
@ -340,7 +328,7 @@ Comprehensions
|
|||
Technically, comprehensions are not statements (they are expressions).
|
||||
Comprehensions are a nifty, Pythonic way to create collections:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
l = [i for i in range(5)] # type: List[int]; l is [0, 1, 2, 3, 4]
|
||||
l = [i for i in range(15) if i % 2 == 1 if i > 10] # type: List[int]; l is [11, 13]
|
||||
|
@ -352,7 +340,7 @@ Comprehensions are a nifty, Pythonic way to create collections:
|
|||
You can also use collections to create generators (more about them later
|
||||
on):
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
g = (i for i in range(10))
|
||||
print(list(g)) # prints number from 0 to 9, inclusive
|
||||
|
@ -361,9 +349,9 @@ Exception handling
|
|||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Again, if you know how to do this in Python, you know how to do it in
|
||||
Seq:
|
||||
Codon:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def throwable():
|
||||
raise ValueError("doom and gloom")
|
||||
|
@ -378,14 +366,14 @@ Seq:
|
|||
print("whatever, it's done")
|
||||
|
||||
.. note::
|
||||
Right now, Seq cannot catch multiple exceptions in one
|
||||
Right now, Codon cannot catch multiple exceptions in one
|
||||
statement. Thus ``catch (Exc1, Exc2, Exc3) as var`` will not compile.
|
||||
|
||||
If you have an object that implements ``__enter__`` and ``__exit__``
|
||||
methods to manage its lifetime (say, a ``File``), you can use a ``with``
|
||||
statement to make your life easy:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
with open('foo.txt') as f, open('foo_copy.txt', 'w') as fo:
|
||||
for l in f:
|
||||
|
@ -394,20 +382,20 @@ statement to make your life easy:
|
|||
Variables and scoping
|
||||
---------------------
|
||||
|
||||
You have probably noticed by now that blocks in Seq are defined by their
|
||||
You have probably noticed by now that blocks in Codon are defined by their
|
||||
indentation level (as in Python). We recommend using 2 or 4 spaces to
|
||||
indent blocks. Do not mix indentation levels, and do not mix tabs and spaces;
|
||||
stick to any *consistent* way of indenting your code.
|
||||
|
||||
One of the major differences between Seq and Python lies in variable
|
||||
scoping rules. Seq variables cannot *leak* to outer blocks. Every
|
||||
One of the major differences between Codon and Python lies in variable
|
||||
scoping rules. Codon variables cannot *leak* to outer blocks. Every
|
||||
variable is accessible only within its own block (after the variable is
|
||||
defined, of course), and within any block that is nested within the
|
||||
variable's own block.
|
||||
|
||||
That means that the following Pythonic pattern won't compile:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
if cond():
|
||||
x = 1
|
||||
|
@ -419,9 +407,9 @@ That means that the following Pythonic pattern won't compile:
|
|||
...
|
||||
print(i) # error: i is only accessible within the for loop block
|
||||
|
||||
In Seq, you can rewrite that as:
|
||||
In Codon, you can rewrite that as:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
x = 2
|
||||
if cond():
|
||||
|
@ -432,10 +420,10 @@ In Seq, you can rewrite that as:
|
|||
|
||||
print(x)
|
||||
|
||||
Another important difference between Seq and Python is that, in Seq, variables
|
||||
Another important difference between Codon and Python is that, in Codon, variables
|
||||
cannot change types. So this won't compile:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
a = 's'
|
||||
a = 1 # error: expected string, but got int
|
||||
|
@ -444,7 +432,7 @@ A note about function scoping: functions cannot modify variables that
|
|||
are not defined within the function block. You need to use ``global`` to
|
||||
modify such variables:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
g = 5
|
||||
def foo():
|
||||
|
@ -468,24 +456,24 @@ are not defined within the function.
|
|||
Imports
|
||||
-------
|
||||
|
||||
You can import functions and classes from another Seq module by doing:
|
||||
You can import functions and classes from another Codon module by doing:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
# Create foo.seq with a bunch of useful methods
|
||||
# Create foo.codon with a bunch of useful methods
|
||||
import foo
|
||||
|
||||
foo.useful1()
|
||||
p = foo.FooType()
|
||||
|
||||
# Create bar.seq with a bunch of useful methods
|
||||
# Create bar.codon with a bunch of useful methods
|
||||
from bar import x, y
|
||||
x(y)
|
||||
|
||||
from bar import z as bar_z
|
||||
bar_z()
|
||||
|
||||
``import foo`` looks for ``foo.seq`` or ``foo/__init__.seq`` in the
|
||||
``import foo`` looks for ``foo.codon`` or ``foo/__init__.codon`` in the
|
||||
current directory.
|
||||
|
||||
Functions
|
||||
|
@ -493,7 +481,7 @@ Functions
|
|||
|
||||
Functions are defined as follows:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def foo(a, b, c):
|
||||
return a + b + c
|
||||
|
@ -501,7 +489,7 @@ Functions are defined as follows:
|
|||
|
||||
How about procedures? Well, don't return anything meaningful:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def proc(a, b):
|
||||
print(a, 'followed by', b)
|
||||
|
@ -514,10 +502,10 @@ How about procedures? Well, don't return anything meaningful:
|
|||
proc2(1, 's')
|
||||
proc2(5, 's') # this prints nothing
|
||||
|
||||
Seq is a strongly-typed language, so you can restrict argument and
|
||||
Codon is a strongly-typed language, so you can restrict argument and
|
||||
return types:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def fn(a: int, b: float):
|
||||
return a + b # this works because int implements __add__(float)
|
||||
|
@ -538,7 +526,7 @@ return types:
|
|||
|
||||
Default arguments? Named arguments? You bet:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def foo(a, b: int, c: float = 1.0, d: str = 'hi'):
|
||||
print(a, b, c, d)
|
||||
|
@ -547,7 +535,7 @@ Default arguments? Named arguments? You bet:
|
|||
|
||||
How about optional arguments? We support that too:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
# type of b promoted to Optional[int]
|
||||
def foo(a, b: int = None):
|
||||
|
@ -559,29 +547,29 @@ How about optional arguments? We support that too:
|
|||
Generics
|
||||
~~~~~~~~
|
||||
|
||||
As we've said several times: Seq is a strongly typed language. As
|
||||
As we've said several times: Codon is a strongly typed language. As
|
||||
such, it is not as flexible as Python when it comes to types (e.g. you
|
||||
can't have lists with elements of different types). However,
|
||||
Seq tries to mimic Python's *"I don't care about types until I do"*
|
||||
Codon tries to mimic Python's *"I don't care about types until I do"*
|
||||
attitude as much as possible by utilizing a technique known as
|
||||
*monomorphization*. If there is a function that has an argument
|
||||
without a type definition, Seq will consider it a *generic* function,
|
||||
without a type definition, Codon will consider it a *generic* function,
|
||||
and will generate different functions for each invocation of
|
||||
that generic function:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def foo(x):
|
||||
print(x) # print relies on typeof(x).__str__(x) method to print the representation of x
|
||||
foo(1) # Seq automatically generates foo(x: int) and calls int.__str__ when needed
|
||||
foo('s') # Seq automatically generates foo(x: str) and calls str.__str__ when needed
|
||||
foo([1, 2]) # Seq automatically generates foo(x: List[int]) and calls List[int].__str__ when needed
|
||||
foo(1) # Codon automatically generates foo(x: int) and calls int.__str__ when needed
|
||||
foo('s') # Codon automatically generates foo(x: str) and calls str.__str__ when needed
|
||||
foo([1, 2]) # Codon automatically generates foo(x: List[int]) and calls List[int].__str__ when needed
|
||||
|
||||
But what if you need to mix type definitions and generic types? Say, your
|
||||
function can take a list of *anything*? Well, you can use generic
|
||||
specifiers:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def foo(x: List[T], T: type):
|
||||
print(x)
|
||||
|
@ -604,9 +592,9 @@ specifiers:
|
|||
Generators
|
||||
~~~~~~~~~~
|
||||
|
||||
Seq supports generators, and they are fast! Really, really fast!
|
||||
Codon supports generators, and they are fast! Really, really fast!
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def gen(i):
|
||||
while i < 10:
|
||||
|
@ -619,7 +607,7 @@ You can also use ``yield`` to implement coroutines: ``yield``
|
|||
suspends the function, while ``(yield)`` (yes, parentheses are required)
|
||||
receives a value, as in Python.
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
def mysum[T](start: T):
|
||||
m = start
|
||||
|
@ -635,62 +623,16 @@ receives a value, as in Python.
|
|||
iadder.send(i) # send a value to coroutine
|
||||
print(iadder.send(-1)) # prints 45
|
||||
|
||||
Pipelines
|
||||
~~~~~~~~~
|
||||
|
||||
Seq extends the core Python language with a pipe operator, which is
|
||||
similar to bash pipes (or F#'s ``|>`` operator). You can chain multiple
|
||||
functions and generators to form a pipeline:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
def add1(x):
|
||||
return x + 1
|
||||
|
||||
2 |> add1 # 3; equivalent to add1(2)
|
||||
|
||||
def calc(x, y):
|
||||
return x + y**2
|
||||
2 |> calc(3) # 11; equivalent to calc(2, 3)
|
||||
2 |> calc(..., 3) # 11; equivalent to calc(2, 3)
|
||||
2 |> calc(3, ...) # 7; equivalent to calc(3, 2)
|
||||
|
||||
def gen(i):
|
||||
for i in range(i):
|
||||
yield i
|
||||
5 |> gen |> print # prints 0 1 2 3 4 separated by newline
|
||||
range(1, 4) |> iter |> gen |> print(end=' ') # prints 0 0 1 0 1 2 without newline
|
||||
[1, 2, 3] |> print # prints [1, 2, 3]
|
||||
range(100000000) |> print # prints range(0, 100000000)
|
||||
range(100000000) |> iter |> print # not only prints all those numbers, but it uses almost no memory at all
|
||||
|
||||
Seq will chain anything that implements ``__iter__``; however, use
|
||||
generators as much as possible because the compiler will optimize out
|
||||
generators whenever possible. Combinations of pipes and generators can be
|
||||
used to implement efficient streaming pipelines.
|
||||
|
||||
Seq can also execute pipelines in parallel via the parallel pipe (``||>``)
|
||||
operator:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
range(100000) |> iter ||> print # prints all these numbers, probably in random order
|
||||
range(100000) |> iter ||> process ||> clean # runs process in parallel, and then cleans data in parallel
|
||||
|
||||
In the last example, Seq will automatically schedule the ``process`` and
|
||||
``clean`` functions to execute as soon as possible. You can control the
|
||||
number of threads via the ``OMP_NUM_THREADS`` environment variable.
|
||||
|
||||
.. _interop:
|
||||
|
||||
Foreign function interface (FFI)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Seq can easily call functions from C and Python.
|
||||
Codon can easily call functions from C and Python.
|
||||
|
||||
Let's import some C functions:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
from C import pow(float) -> float
|
||||
pow(2.0) # 4.0
|
||||
|
@ -698,15 +640,15 @@ Let's import some C functions:
|
|||
# Import and rename function
|
||||
from C import puts(cobj) -> void as print_line # type cobj is C's pointer (void*, char*, etc.)
|
||||
print_line("hi!".c_str()) # prints "hi!".
|
||||
# Note .ptr at the end of string--- needed to cast Seq's string to char*.
|
||||
# Note .ptr at the end of string--- needed to cast Codon's string to char*.
|
||||
|
||||
``from C import`` only works if the symbol is available to the program. If you
|
||||
are running your programs via ``seqc``, you can link dynamic libraries
|
||||
by running ``seqc run -l path/to/dynamic/library.so ...``.
|
||||
are running your programs via ``codon``, you can link dynamic libraries
|
||||
by running ``codon run -l path/to/dynamic/library.so ...``.
|
||||
|
||||
Hate linking? You can also use dyld library loading as follows:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
|
||||
LIBRARY = "mycoollib.so"
|
||||
|
@ -716,21 +658,21 @@ Hate linking? You can also use dyld library loading as follows:
|
|||
foo2 = my2(4, 3.2)
|
||||
|
||||
.. note::
|
||||
When importing external non-Seq functions, you must explicitly specify
|
||||
When importing external non-Codon functions, you must explicitly specify
|
||||
argument and return types.
|
||||
|
||||
How about Python? If you have set the ``SEQ_PYTHON`` environment variable as
|
||||
How about Python? If you have set the ``CODON_PYTHON`` environment variable as
|
||||
described in the first section, you can do:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
from python import mymodule.myfunction(str) -> int as foo
|
||||
print(foo("bar"))
|
||||
|
||||
Often you want to execute more complex Python code within Seq. To that
|
||||
end, you can use Seq's ``@python`` annotation:
|
||||
Often you want to execute more complex Python code within Codon. To that
|
||||
end, you can use Codon's ``@python`` annotation:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
@python
|
||||
def scipy_here_i_come(i: List[List[float]]) -> List[float]:
|
||||
|
@ -743,18 +685,18 @@ end, you can use Seq's ``@python`` annotation:
|
|||
return list(eigenvalues)
|
||||
print(scipy_here_i_come([[1.0, 2.0], [3.0, 4.0]])) # [-0.372281, 5.37228] with some warnings...
|
||||
|
||||
Seq will automatically bridge any object that implements the ``__to_py__``
|
||||
and ``__from_py__`` magic methods. All standard Seq types already
|
||||
Codon will automatically bridge any object that implements the ``__to_py__``
|
||||
and ``__from_py__`` magic methods. All standard Codon types already
|
||||
implement these methods.
|
||||
|
||||
Classes and types
|
||||
-----------------
|
||||
|
||||
Of course, Seq supports classes! However, you must declare class members
|
||||
Of course, Codon supports classes! However, you must declare class members
|
||||
and their types in the preamble of each class (like you would do with
|
||||
Python's dataclasses).
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
class Foo:
|
||||
x: int
|
||||
|
@ -770,11 +712,11 @@ Python's dataclasses).
|
|||
f.method() # prints "1 2"
|
||||
|
||||
.. note::
|
||||
Seq does not (yet!) support inheritance and polymorphism.
|
||||
Codon does not (yet!) support inheritance and polymorphism.
|
||||
|
||||
Unlike Python, Seq supports method overloading:
|
||||
Unlike Python, Codon supports method overloading:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
class Foo:
|
||||
x: int
|
||||
|
@ -799,7 +741,7 @@ Unlike Python, Seq supports method overloading:
|
|||
|
||||
Classes can also be generic:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
class Container[T]:
|
||||
l: List[T]
|
||||
|
@ -809,7 +751,7 @@ Classes can also be generic:
|
|||
|
||||
Classes create objects that are passed by reference:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
class Point:
|
||||
x: int
|
||||
|
@ -824,9 +766,9 @@ Classes create objects that are passed by reference:
|
|||
If you need to copy an object's contents, implement the ``__copy__`` magic
|
||||
method and use ``q = copy(p)`` instead.
|
||||
|
||||
Seq also supports pass-by-value types via the ``@tuple`` annotation:
|
||||
Codon also supports pass-by-value types via the ``@tuple`` annotation:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
@tuple
|
||||
class Point:
|
||||
|
@ -840,7 +782,7 @@ Seq also supports pass-by-value types via the ``@tuple`` annotation:
|
|||
However, **by-value objects are immutable!**. The following code will
|
||||
not compile:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
p = Point(1, 2)
|
||||
p.x = 2 # error! immutable type
|
||||
|
@ -850,7 +792,7 @@ Under the hood, types are basically named tuples (equivalent to Python's
|
|||
|
||||
You can also add methods to types:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
@tuple
|
||||
class Point:
|
||||
|
@ -869,17 +811,16 @@ You can also add methods to types:
|
|||
Type extensions
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Suppose you have a class that lacks a method or an operator that might
|
||||
be really useful. You can extend that class and add the method at compile time:
|
||||
Suppose you have a class that lacks a method or an operator that might be really useful. Codon provides an ``@extend`` annotation that allows you to add and modify methods of various types at compile time, including built-in types like ``int`` or ``str``. This allows much of the functionality of built-in types to be implemented in Codon as type extensions in the standard library.
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
class Foo:
|
||||
...
|
||||
|
||||
f = Foo(...)
|
||||
|
||||
# we need foo.cool() but it does not exist... not a problem for Seq
|
||||
# We need foo.cool() but it does not exist... not a problem for Codon
|
||||
@extend
|
||||
class Foo:
|
||||
def cool(self: Foo):
|
||||
|
@ -887,14 +828,16 @@ be really useful. You can extend that class and add the method at compile time:
|
|||
|
||||
f.cool() # works!
|
||||
|
||||
# how about we add support for adding integers and strings?
|
||||
# How about we add support for adding integers and strings:
|
||||
@extend
|
||||
class int:
|
||||
def __add__(self: int, other: str) -> int:
|
||||
def __add__(self: int, other: str):
|
||||
return self + int(other)
|
||||
|
||||
print(5 + '4') # 9
|
||||
|
||||
Note that all type extensions are performed strictly at compile time and incur no runtime overhead.
|
||||
|
||||
Magic methods
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
|
@ -915,13 +858,22 @@ operators overload unary and binary operators (see :ref:`operators`)
|
|||
``__str__`` support printing and ``str`` method
|
||||
================ =============================================
|
||||
|
||||
Other types
|
||||
~~~~~~~~~~~
|
||||
|
||||
Codon provides arbitrary-width signed and unsigned integers, e.g. ``Int[32]`` is a signed 32-bit integer while ``UInt[128]`` is an unsigned 128-bit integer, respectively (note that ``int`` is an ``Int[64]``). Typedefs for common bit widths are provided in the standard library, such as ``i8``, ``i16``, ``u32``, ``u64`` etc.
|
||||
|
||||
The ``Ptr[T]`` type in Codon also corresponds to a raw C pointer (e.g. ``Ptr[byte]`` is equivalent to ``char*`` in C). The ``Array[T]`` type represents a fixed-length array (essentially a pointer with a length).
|
||||
|
||||
Codon also provides ``__ptr__`` for obtaining a pointer to a variable (as in ``__ptr__(myvar)``) and ``__array__`` for declaring stack-allocated arrays (as in ``__array__[int](10)``).
|
||||
|
||||
LLVM functions
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
In certain cases, you might want to use LLVM features that are not directly
|
||||
accessible with Seq. This can be done with the ``@llvm`` attribute:
|
||||
accessible with Codon. This can be done with the ``@llvm`` attribute:
|
||||
|
||||
.. code:: seq
|
||||
.. code:: python
|
||||
|
||||
@llvm
|
||||
def llvm_add[T](a: T, b: T) -> T:
|
||||
|
@ -933,4 +885,4 @@ accessible with Seq. This can be done with the ``@llvm`` attribute:
|
|||
|
||||
--------------
|
||||
|
||||
Issues, feedback, or comments regarding this tutorial? Let us know `on GitHub <https://github.com/seq-lang/seq>`__.
|
||||
Issues, feedback, or comments regarding this tutorial? Let us know `on GitHub <https://github.com/exaloop/codon>`__.
|
||||
|
|
|
@ -11,22 +11,22 @@ Simple!
|
|||
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
|
||||
|
||||
If you want to use Python interop, you also need to point
|
||||
``SEQ_PYTHON`` to the Python library (typically called
|
||||
``libpython3.9m.so`` or similar). The Seq repository contains a
|
||||
`Python script <https://github.com/seq-lang/seq/blob/develop/test/python/find-python-library.py>`_
|
||||
``CODON_PYTHON`` to the Python library (typically called
|
||||
``libpython3.9m.so`` or similar). The Codon repository contains a
|
||||
`Python script <https://github.com/exaloop/codon/blob/develop/test/python/find-python-library.py>`_
|
||||
that will identify and print the path to this library.
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
Assuming that Seq was properly installed, you can use it as follows:
|
||||
Assuming that Codon was properly installed, you can use it as follows:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
seqc run foo.seq # Compile and run foo.seq
|
||||
seqc run -release foo.seq # Compile and run foo.seq with optimizations
|
||||
seqc build -exe file.seq # Compile foo.seq to executable "foo"
|
||||
codon run foo.codon # Compile and run foo.codon
|
||||
codon run -release foo.codon # Compile and run foo.codon with optimizations
|
||||
codon build -exe file.codon # Compile foo.codon to executable "foo"
|
||||
|
||||
Note that the ``-exe`` option requires ``clang`` to be installed, and
|
||||
the ``LIBRARY_PATH`` environment variable to point to the Seq runtime
|
||||
library (installed by default at ``~/.seq/lib/seq``).
|
||||
the ``LIBRARY_PATH`` environment variable to point to the Codon runtime
|
||||
library (installed by default at ``~/.codon/lib/codon``).
|
||||
|
|
|
@ -1,411 +0,0 @@
|
|||
Tutorial
|
||||
========
|
||||
|
||||
Bio-specific features
|
||||
---------------------
|
||||
|
||||
Seq's ``bio`` module contains all the following functions, types, and methods. The code snippets below should be preceded with ``from bio import *``, although we omit that line for simplicity below.
|
||||
|
||||
Genomic types
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
Seq's namesake type is indeed the sequence type: ``seq``. A ``seq`` object represents a DNA sequence of any length and---on top of general-purpose string functionality---provides methods for performing common sequence operations such as splitting into subsequences, reverse complementation and :math:`k`-mer extraction. Alongside the ``seq`` type are :math:`k`-mer types, where e.g. ``Kmer[1]`` represents a 1-mer, ``Kmer[2]`` a 2-mer and so on, up to ``Kmer[1024]``.
|
||||
|
||||
Sequences can be seamlessly converted between these various types:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
dna = s'ACGTACGTACGT' # sequence literal
|
||||
|
||||
# (a) split into subsequences of length 3
|
||||
# with a step of 2
|
||||
for sub in dna.split(k=3, step=2):
|
||||
print(sub)
|
||||
print(~sub) # reverse complement
|
||||
|
||||
# (b) split into 5-mers with step 1 (default)
|
||||
for kmer in dna.kmers(k=5):
|
||||
print(kmer)
|
||||
print(~kmer) # reverse complement
|
||||
|
||||
# (c) convert entire sequence to 12-mer
|
||||
kmer = Kmer[12](dna)
|
||||
|
||||
Seq also supports a ``pseq`` type for protein sequences:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
protein = p'HEAGAWGHE' # pseq literal
|
||||
print(list(protein.split(3, 3))) # [HEA, GAW, GHE]
|
||||
print(s'ACCATGACA' |> translate) # TMT
|
||||
|
||||
.. Note:: What's the difference between sequences and :math:`k`-mers in Seq? Sequences have arbitrary length and allow for ambiguous bases like ``N``. :math:`k`-mers, on the other hand, have a length that is fixed and must be known at compile time, only allowing for ``ACGT`` bases. :math:`k`-mers can therefore be represented internally as 2-bit encoded integers, making them compact and very efficient to manipulate.
|
||||
|
||||
In practice, sequences would be read from e.g. a FASTQ file:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
for record in FASTQ('input.fq'):
|
||||
print('processing', record.name)
|
||||
process(record.seq)
|
||||
|
||||
If you only care about the sequences, you can also do this:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
for read in FASTQ('input.fq') |> seqs:
|
||||
process(read)
|
||||
|
||||
Common formats like FASTQ, FASTA, SAM, BAM and CRAM are supported. The ``FASTQ`` and ``FASTA`` parsers support several additional options:
|
||||
|
||||
- ``validate`` (``True`` by default): Perform data validation as sequences are read
|
||||
- ``gzip`` (``True`` by default): Perform I/O using zlib, supporting gzip'd files (note that plain text files will still work with this enabled)
|
||||
- ``fai`` (``True`` by default; FASTA only): Look for a ``.fai`` file to determine sequence lengths before reading
|
||||
|
||||
For example:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
for read in FASTQ('input.fq', validate=False, gzip=False) |> seqs:
|
||||
process(read)
|
||||
|
||||
To read protein sequences, you can use ``pFASTA``, which has the same interface as ``FASTA`` (but does not support ``fai``):
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
for p in pFASTA('input.fa') |> seqs:
|
||||
process(p)
|
||||
|
||||
.. _match:
|
||||
|
||||
Sequence matching
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
Seq provides the conventional ``match`` construct, which works on integers, lists, strings and tuples. Here's a simple example:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
def describe(n: int):
|
||||
match n:
|
||||
case m if m < 0:
|
||||
print('negative')
|
||||
case 0:
|
||||
print('zero')
|
||||
case m if 0 < m < 10:
|
||||
print('small')
|
||||
case _:
|
||||
print('large')
|
||||
|
||||
A novel aspect of Seq's ``match`` statement is that it also works on sequences, and allows for concise recursive representations of several sequence operations such as subsequence search, reverse complementation tests and base counting, as shown in this example:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
# (a)
|
||||
def has_spaced_acgt(s: seq):
|
||||
match s:
|
||||
case 'A_C_G_T*':
|
||||
return True
|
||||
case t if len(t) >= 8:
|
||||
return has_spaced_acgt(s[1:])
|
||||
case _:
|
||||
return False
|
||||
|
||||
# (b)
|
||||
def is_own_revcomp(s: seq):
|
||||
match s:
|
||||
case 'A*T' or 'T*A' or 'C*G' or 'G*C':
|
||||
return is_own_revcomp(s[1:-1])
|
||||
case s'':
|
||||
return True
|
||||
case _:
|
||||
return False
|
||||
|
||||
# (c)
|
||||
@tuple
|
||||
class BaseCount:
|
||||
A: int
|
||||
C: int
|
||||
G: int
|
||||
T: int
|
||||
|
||||
def __add__(self, other: BaseCount):
|
||||
a1, c1, g1, t1 = self
|
||||
a2, c2, g2, t2 = other
|
||||
return (a1 + a2, c1 + c2, g1 + g2, t1 + t2)
|
||||
|
||||
def count_bases(s):
|
||||
match s:
|
||||
case 'A*': return count_bases(s[1:]) + (1,0,0,0)
|
||||
case 'C*': return count_bases(s[1:]) + (0,1,0,0)
|
||||
case 'G*': return count_bases(s[1:]) + (0,0,1,0)
|
||||
case 'T*': return count_bases(s[1:]) + (0,0,0,1)
|
||||
case _: return BaseCount(0,0,0,0)
|
||||
|
||||
- Example (a) checks if a given sequence contains the subsequence ``A_C_G_T``, where ``_`` is a wildcard base.
|
||||
- Example (b) checks if the given sequence is its own reverse complement.
|
||||
- Example (c) counts how many times each base appears in the given sequence.
|
||||
|
||||
Sequence patterns consist of literal ``ACGT`` characters, single-base wildcards (``_``) or "zero or more" wildcards (``...``) that match zero or more of any base.
|
||||
|
||||
.. _pipeline:
|
||||
|
||||
Pipelines
|
||||
^^^^^^^^^
|
||||
|
||||
Pipelining is a natural model for thinking about processing genomic data, as sequences are typically processed in stages (e.g. read from input file, split into :math:`k`-mers, query :math:`k`-mers in index, perform full dynamic programming alignment, output results to file), and are almost always independent of one another as far as this processing is concerned. Because of this, Seq supports a pipe operator: ``|>``, similar to F#'s pipe and R's ``magrittr`` (``%>%``).
|
||||
|
||||
Pipeline stages in Seq can be regular functions or generators. In the case of standard functions, the function is simply applied to the input data and the result is carried to the remainder of the pipeline, akin to F#'s functional piping. If, on the other hand, a stage is a generator, the values yielded by the generator are passed lazily to the remainder of the pipeline, which in many ways mirrors how piping is implemented in Bash. Note that Seq ensures that generator pipelines do not collect any data unless explicitly requested, thus allowing the processing of terabytes of data in a streaming fashion with no memory and minimal CPU overhead.
|
||||
|
||||
Here's an example of pipeline usage, which shows the same two loops from above, but as pipelines:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
dna = s'ACGTACGTACGT' # sequence literal
|
||||
|
||||
# (a) split into subsequences of length 3
|
||||
# with a stride of 2
|
||||
dna |> split(..., k=3, step=2) |> print
|
||||
|
||||
# (b) split into 5-mers with stride 1
|
||||
def f(kmer):
|
||||
print(kmer)
|
||||
print(~kmer)
|
||||
|
||||
dna |> kmers(k=5, step=1) |> f
|
||||
|
||||
First, note that ``split`` is a Seq standard library function that takes three arguments: the sequence to split, the subsequence length and the stride; ``split(..., k=3, step=2)`` is a partial call of ``split`` that produces a new single-argument function ``f(x)`` which produces ``split(x, k=3, step=2)``. The undefined argument(s) in a partial call can be implicit, as in the second example: ``kmers`` (also a standard library function) is parameterized by the target :math:`k`-mer type and takes as arguments the sequence to :math:`k`-merize, the :math:`k`-mer length, and the stride; since just two of the three arguments are provided, the first is implicitly replaced by ``...`` to produce a partial call (i.e. the expression is equivalent to ``kmers(..., k=5, step=1)``). Both ``split`` and ``kmers`` are themselves generators that yield subsequences and :math:`k`-mers respectively, which are passed sequentially to the last stage of the enclosing pipeline in the two examples.
|
||||
|
||||
.. caution::
|
||||
The Seq compiler may perform optimizations that change the order of elements passed through a pipeline. Therefore, it is best to not rely on order when using pipelines. If order needs to be maintained, consider using a regular loop or passing an index alongside each element sent through the pipeline.
|
||||
|
||||
Sequence alignment
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Aligning sequences is very straightforward in Seq, and supports numerous options/variants:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
# default parameters
|
||||
s1 = s'CGCGAGTCTT'
|
||||
s2 = s'CGCAGAGTT'
|
||||
aln = s1 @ s2
|
||||
print(aln.cigar, aln.score)
|
||||
|
||||
# custom parameters
|
||||
# match = 2; mismatch = 4; gap1(k) = 2k + 4; gap2(k) = k + 13
|
||||
aln = s1.align(s2, a=2, b=4, gapo=4, gape=2, gapo2=13, gape2=1)
|
||||
print(aln.cigar, aln.score)
|
||||
|
||||
Here is the list of options supported by the ``align()`` method; all are optional (default is global alignment):
|
||||
|
||||
- ``a``: match score
|
||||
- ``b``: mismatch score
|
||||
- ``ambig``: ambiguous (i.e. N) match score
|
||||
- ``gapo``: gap open cost
|
||||
- ``gape``: gap extension cost
|
||||
- ``gapo2``: 2nd gap open cost for dual gap cost function
|
||||
- ``gape2``: 2nd gap extension cost for dual gap cost function
|
||||
- ``bandwidth``: bandwidth for DP alignment
|
||||
- ``zdrop``: off-diagonal drop-off to stop extension
|
||||
- ``score_only``: if true, don't compute CIGAR
|
||||
- ``right``: if true, right-align gaps
|
||||
- ``approx_max``: if true, approximate max
|
||||
- ``approx_drop``: if true, approximate Z-drop
|
||||
- ``rev_cigar``: if true, reverse CIGAR in output
|
||||
- ``ext_only``: if true, perform extension alignment
|
||||
- ``splice``: if true, perform spliced alignment
|
||||
|
||||
Note that all costs/scores are positive by convention.
|
||||
|
||||
.. _interalign:
|
||||
|
||||
Inter-sequence alignment
|
||||
""""""""""""""""""""""""
|
||||
|
||||
Seq uses `ksw2 <https://github.com/lh3/ksw2>`_ as its default alignment kernel. ksw2 does a good job of applying SIMD parallelization to align a single pair of sequences, which is referred to as *intra-sequence* alignment. However, we can often get better performance by aligning multiple sequences at once, referred to as *inter-sequence* alignment. Inter-sequence alignment is usually more cumbersome to program in general-purpose languages because many sequences need to be batched before performing the alignment. However, in Seq, inter-sequence alignment is as easy as intra-sequence, using the ``@inter_align`` annotation:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
@inter_align
|
||||
def process(t):
|
||||
query, target = t
|
||||
score = query.align(target, a=1, b=2, ambig=0, gapo=2, gape=1, zdrop=100, bandwidth=100, end_bonus=5)
|
||||
print(query, target, score)
|
||||
|
||||
zip(seqs('queries.txt'), seqs('targets.txt')) |> process
|
||||
|
||||
Internally, the Seq compiler performs pipeline transformations when sequence alignment is performed within a function tagged ``@inter_align``, so as to suspend execution of the calling function, batch sequences that need to be aligned, perform inter-sequence alignment and return the results to the suspended functions. Note that the inter-sequence alignment kernel used by Seq is adapted from `BWA-MEM2 <https://github.com/bwa-mem2/bwa-mem2>`_.
|
||||
|
||||
.. _prefetch:
|
||||
|
||||
Genomic index prefetching
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Large genomic indices---ranging from several to tens or even hundreds of gigabytes---used in many applications result in extremely poor cache performance and, ultimately, a substantial fraction of stalled memory-bound cycles. For this reason, Seq performs pipeline optimizations to enable data prefetching and to hide memory latencies. You, the user, must provide just:
|
||||
|
||||
- a ``__prefetch__`` magic method definition in the index class, which is logically similar to ``__getitem__`` (indexing construct) but performs a prefetch instead of actually loading the requested value (and can simply delegate to ``__prefetch__`` methods of built-in types);
|
||||
- a one-line ``@prefetch`` annotation on functions that should perform prefetching.
|
||||
|
||||
In particular, a typical prefetch-friendly index class would look like this:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
class MyIndex: # abstract k-mer index
|
||||
...
|
||||
def __getitem__(self, kmer: Kmer[20]):
|
||||
# standard __getitem__
|
||||
def __prefetch__(self, kmer: Kmer[20]):
|
||||
# similar to __getitem__, but performs prefetch
|
||||
|
||||
Now, if we were to process data in a pipeline as such:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
@prefetch
|
||||
def process(read: seq, index: MyIndex):
|
||||
...
|
||||
for kmer in read.kmers(k=20, step=step):
|
||||
hits_fwd = index[kmer]
|
||||
hits_rev = index[~kmer]
|
||||
...
|
||||
return x
|
||||
|
||||
FASTQ("reads.fq") |> seqs |> process(index) |> postprocess
|
||||
|
||||
The Seq compiler will perform pipeline transformations to overlap cache misses in ``MyIndex`` with other useful work, increasing overall throughput. In our benchmarks, we often find these transformations to improve performance by 50% to 2×. However, the improvement is dataset- and application-dependent (and can potentially even decrease performance, although we rarely observed this), so users are encouraged to experiment with it for their own use case.
|
||||
|
||||
As a concrete example, consider Seq's built-in FM-index type, ``FMIndex``, and a toy application that counts occurences of 20-mers from an input FASTQ. ``FMIndex`` provides end-to-end search methods like ``locate()`` and ``count()``, but we can take advantage of Seq's prefetch optimization by working with FM-index intervals:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
from bio.fmindex import FMIndex
|
||||
|
||||
fmi = FMIndex('/path/to/genome.fa')
|
||||
k, step, n = 20, 20, 0
|
||||
|
||||
def update(count: int):
|
||||
global n
|
||||
n += count
|
||||
|
||||
@prefetch
|
||||
def find(s: seq, fmi: FMIndex):
|
||||
intv = fmi.interval(s[-1]) # initial FM-index interval
|
||||
s = s[:-1] # trim off last base of sequence
|
||||
while s and intv:
|
||||
intv = fmi[intv, s[-1]] # backwards extend FM-index interval
|
||||
s = s[:-1] # trim off last base of sequence
|
||||
return len(intv) # return count of sequence in index
|
||||
|
||||
FASTQ('/path/to/reads.fq') |> seqs |> split(k, step) |> find(fmi) |> update
|
||||
print('total:', n)
|
||||
|
||||
That single ``@prefetch`` line can have a significant impact, especially for larger ``k``. Here is a graph of the performance of this exact snippet for various ``k`` using hg19 as the reference:
|
||||
|
||||
.. image:: ../../images/prefetch.png
|
||||
:width: 500px
|
||||
:align: center
|
||||
:alt: prefetch performance
|
||||
|
||||
Other features
|
||||
--------------
|
||||
|
||||
Parallelism
|
||||
^^^^^^^^^^^
|
||||
|
||||
CPython and many other implementations alike cannot take advantage of parallelism due to the infamous global interpreter lock, a mutex that protects accesses to Python objects, preventing multiple threads from executing Python bytecode at once. Unlike CPython, Seq has no such restriction and supports full multithreading. To this end, Seq supports a *parallel* pipe operator ``||>``, which is semantically similar to the standard pipe operator except that it allows the elements sent through it to be processed in parallel by the remainder of the pipeline. Hence, turning a serial program into a parallel one often requires the addition of just a single character in Seq. Further, a single pipeline can contain multiple parallel pipes, resulting in nested parallelism. As an example, here are the same two pipelines as above, but parallelized:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
dna = s'ACGTACGTACGT' # sequence literal
|
||||
|
||||
# (a) split into subsequences of length 3
|
||||
# with a stride of 2
|
||||
dna |> split(..., k=3, step=2) ||> print
|
||||
|
||||
# (b) split into 5-mers with stride 1
|
||||
def f(kmer):
|
||||
print(kmer)
|
||||
print(~kmer)
|
||||
|
||||
dna |> kmers(k=5, step=1) ||> f
|
||||
|
||||
Internally, the Seq compiler uses an OpenMP task backend to generate code for parallel pipelines. Logically, parallel pipe operators are similar to parallel-for loops: the portion of the pipeline after the parallel pipe is outlined into a new function that is called by the OpenMP runtime task spawning routines (as in ``#pragma omp task`` in C++), and a synchronization point (``#pragma omp taskwait``) is added after the outlined segment. Lastly, the entire program is implicitly placed in an OpenMP parallel region (``#pragma omp parallel``) that is guarded by a "single" directive (``#pragma omp single``) so that the serial portions are still executed by one thread (this is required by OpenMP as tasks must be bound to an enclosing parallel region).
|
||||
|
||||
Type extensions
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
Seq provides an ``@extend`` annotation that allows programmers to add and modify methods of various types at compile time, including built-in types like ``int`` or ``str``. This actually allows much of the functionality of built-in types to be implemented in Seq as type extensions in the standard library. Here is an example where the ``int`` type is extended to include a ``to`` method that generates integers in a specified range, as well as to override the ``__mul__`` magic method to "intercept" integer multiplications:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
@extend
|
||||
class int:
|
||||
def to(self, other: int):
|
||||
for i in range(self, other + 1):
|
||||
yield i
|
||||
|
||||
def __truediv__(self, other: int):
|
||||
print('caught int div!')
|
||||
return 42
|
||||
|
||||
for i in (5).to(10):
|
||||
print(i) # 5, 6, ..., 10
|
||||
|
||||
# prints 'caught int div!' then '42'
|
||||
print(2 / 3)
|
||||
|
||||
Note that all type extensions are performed strictly at compile time and incur no runtime overhead.
|
||||
|
||||
Other types
|
||||
^^^^^^^^^^^
|
||||
|
||||
Seq provides arbitrary-width signed and unsigned integers up to ``Int[512]`` and ``UInt[512]``, respectively (note that ``int`` is an ``Int[64]``). Typedefs for common bit widths are provided in the standard library, such as ``i8``, ``i16``, ``u32``, ``u64`` etc.
|
||||
|
||||
The ``Ptr[T]`` type in Seq also corresponds to a raw C pointer (e.g. ``Ptr[byte]`` is equivalent to ``char*`` in C). The ``array[T]`` type represents a fixed-length array (essentially a pointer with a length).
|
||||
|
||||
Seq also provides ``__ptr__`` for obtaining a pointer to a variable (as in ``__ptr__(myvar)``) and ``__array__`` for declaring stack-allocated arrays (as in ``__array__[int](10)``).
|
||||
|
||||
Calling BWA from Seq
|
||||
--------------------
|
||||
|
||||
Seq provides a built-in module for interfacing with BWA. To use this module, simply build a shared BWA library and set ``BWA_LIB`` accordingly:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
git clone https://github.com/lh3/bwa
|
||||
cd bwa
|
||||
make
|
||||
gcc -shared -o libbwa.so *.o -lz
|
||||
export BWA_LIB=`pwd`/libbwa.so
|
||||
|
||||
Now BWA can be used in Seq as such:
|
||||
|
||||
.. code-block:: seq
|
||||
|
||||
# Implementation of https://github.com/lh3/bwa/blob/master/example.c
|
||||
from sys import argv
|
||||
from bio.bwa import *
|
||||
|
||||
bwa = BWA(argv[1])
|
||||
for read in FASTQ(argv[2]):
|
||||
for reg in bwa.align(read.read):
|
||||
if reg.secondary >= 0: continue
|
||||
aln = bwa.reg2aln(read.read, reg)
|
||||
print(read.name, '-' if aln.rev else '+', bwa.name(aln), aln.pos, aln.mapq, aln.cigar, aln.NM)
|
||||
|
||||
This program can be invoked as ``seqc run example.seq /path/to/hg19.fa /path/to/reads.fq``.
|
||||
|
||||
BWA options can be passed via ``BWA(options(...), ...)``. For example, to set a mismatch score of 5, use ``BWA(options(mismatch_score=5), "hg19.fa")``. Valid options are:
|
||||
|
||||
- ``match_score``
|
||||
- ``mismatch_score``
|
||||
- ``open_del``
|
||||
- ``open_ins``
|
||||
- ``extend_del``
|
||||
- ``extend_ins``
|
||||
- ``bandwidth``
|
||||
- ``zdrop``
|
||||
- ``clip_penalty``
|
||||
- ``unpaired_penalty``
|
||||
|
||||
Consult the BWA documentation for a detailed description of each of these.
|
|
@ -1,770 +0,0 @@
|
|||
Workshop
|
||||
========
|
||||
|
||||
In this workshop, we will build a toy read mapper called *SeqMap* to
|
||||
showcase some of Seq's domain-specific features and optimizations.
|
||||
|
||||
Code Listings
|
||||
-------------
|
||||
|
||||
- **Section 1**: :ref:`section1-code` | :download:`click to download <../../workshop/section1.seq>`
|
||||
- **Section 2**: :ref:`section2-code` | :download:`click to download <../../workshop/section2.seq>`
|
||||
- **Section 3**: :ref:`section3-code` | :download:`click to download <../../workshop/section3.seq>`
|
||||
- **Section 4**: :ref:`section4-code` | :download:`click to download <../../workshop/section4.seq>`
|
||||
- **Section 5**: :ref:`section5-code` | :download:`click to download <../../workshop/section5.seq>`
|
||||
- **Section 6**: :ref:`section6-code` | :download:`click to download <../../workshop/section6.seq>`
|
||||
|
||||
Getting Started
|
||||
---------------
|
||||
|
||||
If you don't have Seq installed already, you can install it with one bash command:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
|
||||
|
||||
Be sure to restart the shell for ``seqc`` to be added to your ``PATH`` (or add it manually).
|
||||
|
||||
Now, let's create a new directory to house our Seq code and test data:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
mkdir seqmap
|
||||
cd seqmap
|
||||
|
||||
We'll test *SeqMap* on the following data:
|
||||
|
||||
- Chromosome 22 as the reference sequence (``chr22.fa``, 52MB)
|
||||
- A simulated set of 1 million 101bp reads (``reads.fastq``, 251MB)
|
||||
|
||||
Let's download this data into a new folder called ``data``:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
curl -L https://www.dropbox.com/s/8lsngqvn0tzjn1o/data.tar.gz | tar zxf - -C .
|
||||
|
||||
What does this data actually look like? Let's take a look:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
head data/reads.fq
|
||||
|
||||
.. code:: text
|
||||
|
||||
@chr22_16993648_16994131_1:0:0_2:0:0_0/1
|
||||
CTACCAAACACCTACTTCGTTTCCTAACATCACTTTAATTTTATCTTAGAGGAATTCTTTTCCCTATCCCATTAAGTTATGGGAGATGGGGCCAGGCATGG
|
||||
+
|
||||
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
|
||||
@chr22_28253010_28253558_1:0:0_0:0:0_1/1
|
||||
AGTGTTTTGCCTGTGGCTAGACTAAAAATAAGGAATGAGGGGGGTATCTTCCACTCTTGCCCTCTCATCACCCTATTCCCTATATCCAGAACTCAGAGTCC
|
||||
+
|
||||
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
|
||||
@chr22_21656192_21656802_0:0:0_2:0:0_2/1
|
||||
ATAGCGTGGATTCCTATGACATCAAGGAGCTATTTTATTTGGTAAAACGAAAAAGCACAATAATGAACGAACGCAAGCACTGAAACAGTGGAGACACCTAG
|
||||
|
||||
Each `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ record consists of four lines:
|
||||
|
||||
- Read name, starting with ``@``. Notice that since this data is simulated, the read name includes the
|
||||
location on the genome where the read comes from; this will be useful later!
|
||||
- Read sequence. This is a DNA sequence consisting of ``ACGT`` bases and possibly ``N``, indicating an
|
||||
ambiguous base.
|
||||
- Separator (``+``)
|
||||
- Read quality scores. This is a string of characters as long as the read sequence, where each character
|
||||
indicates the "quality" of the corresponding base in the read sequence. We won't be using it here.
|
||||
|
||||
What about the reference sequence `FASTA <https://en.wikipedia.org/wiki/FASTA_format>`_ file?
|
||||
|
||||
.. code:: bash
|
||||
|
||||
head data/chr22.fa
|
||||
|
||||
.. code:: text
|
||||
|
||||
>chr22
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||||
|
||||
FASTQ records start with the sequence name (prefixed with ``>``) followed by the sequence itself, split
|
||||
across multiple lines. But why are all the bases ``N``? Chromosomal sequences often have their starts
|
||||
and ends masked with ``N``'s to cover repetitive `telomeric <https://en.wikipedia.org/wiki/Telomere>`_ sequences.
|
||||
Since we usually don't want to include such regions in our analyses, they are masked in the file. Let's
|
||||
look a bit further down:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
head -n 1000000 data/chr22.fa | tail -n 10
|
||||
|
||||
.. code:: text
|
||||
|
||||
tattaaaggaaaaaactgtatgaaatagtacatttctcataattctcatt
|
||||
ttgtaaaaataaagtacttatctatggacataatgagaaaatgactcaag
|
||||
gtaccaagagtttagccattagctataccagtggattataagcaaattct
|
||||
gttACGTGCATGCACTCACCTACGCATGTTCATGTATTCATACATACGTA
|
||||
CATAATTTTTTAAATTTTCTTTTATAGACAAGCAATAGCTTTATAATCTC
|
||||
TATAATCAGTAAAAATAAGTAAGTggctggacgcagtggctcacacctgt
|
||||
aatctcagcactttgggaggctgaggagggcagattatgaggtcagaaga
|
||||
tcaagaccatcctggctaacacagtgaaaccccatctctactaaaaatac
|
||||
aaaaaattagccacgcgtggtggcacgcgcctgtagtcccagctactggg
|
||||
gaggctgaggcaggaaaatcgcttgaacccgggaggcagaggttgcggtg
|
||||
|
||||
Now we can see the usual ``ACGT`` bases. The fact that some bases are lowercase indicates that they
|
||||
are a part of some repetitive element or region. Seq will handle these different uppercase and lowercase
|
||||
characters automatically, so we don't need to worry about them.
|
||||
|
||||
You might notice an additional file called ``chr22.fa.fai``: this is a FASTA index file that includes
|
||||
information about each sequence contained in the file for easier parsing. We won't use it directly,
|
||||
but Seq uses it internally to make FASTA parsing more efficient.
|
||||
|
||||
|
||||
Section 1: Reading sequences from disk
|
||||
--------------------------------------
|
||||
|
||||
The first step of processing any kind of sequencing data is to read it from disk.
|
||||
Seq has builtin support for many of the standard file formats such as FASTA, FASTQ,
|
||||
SAM, BAM, etc.
|
||||
|
||||
Let's write a program to read our FASTQ file and print each record's name and sequence
|
||||
on a single line:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
from sys import argv
|
||||
from bio import *
|
||||
for record in FASTQ(argv[1]):
|
||||
print(record.name, record.seq)
|
||||
|
||||
Now we can run this Seq program:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
seqc run section1.seq data/reads.fq > out.txt
|
||||
|
||||
and view the results:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
head out.txt
|
||||
|
||||
.. code:: text
|
||||
|
||||
chr22_16993648_16994131_1:0:0_2:0:0_0/1 CTACCAAACACCTACTTCGTTTCCTAACATCACTTTAATTTTATCTTAGAGGAATTCTTTTCCCTATCCCATTAAGTTATGGGAGATGGGGCCAGGCATGG
|
||||
chr22_28253010_28253558_1:0:0_0:0:0_1/1 AGTGTTTTGCCTGTGGCTAGACTAAAAATAAGGAATGAGGGGGGTATCTTCCACTCTTGCCCTCTCATCACCCTATTCCCTATATCCAGAACTCAGAGTCC
|
||||
chr22_21656192_21656802_0:0:0_2:0:0_2/1 ATAGCGTGGATTCCTATGACATCAAGGAGCTATTTTATTTGGTAAAACGAAAAAGCACAATAATGAACGAACGCAAGCACTGAAACAGTGGAGACACCTAG
|
||||
chr22_44541236_44541725_0:1:0_0:0:0_3/1 CTCTCTGTCTCTCTCTCTCCCCTAGGTCAGGGTGGTCCCTGGGGAGGCCCCTGGGTTACCCCAAGACAGGTGGGAGGTGCTTCCTACCCGACCCTCTTCCT
|
||||
chr22_39607671_39608139_0:0:0_2:0:0_4/1 ATTGGCTCAGAGTTCAGCAGGCTGTACCAGCATGGCGCCAGTGTCTGCTCCTGGTGAGGCCTTACGGACGTTACAATAACGGCGGAAGGCAAAGGCGGAGC
|
||||
chr22_35577703_35578255_3:0:0_1:0:0_5/1 TGCCATGGTGGTTAGCTGCACCCATCAACCTGTCATCTACATTAGGTATTTTTCCTAATGCTATCCCTCCCCTAGCACCCTACCCTCTGATAGGCCCTGGT
|
||||
chr22_46059124_46059578_1:0:0_1:0:0_6/1 AATCAGTACCAAACAATATATGGATATTATTGGCACTTTGTGCTCCCTCTGCCTGAACTGGGAATTCCTCTATTAGTTTTGACATTATCTGGTATTGAACC
|
||||
chr22_31651867_31652385_2:0:0_2:0:0_7/1 ATCTAGTGACAGTAAGTGGCTGATAAAGTGAGCTGCCATTACATAGTCATCATCTTTAATAGAAGTTAACACATACTGAGTTTCTACTATATTGGGTCTTT
|
||||
chr22_24816466_24817026_1:0:0_1:0:0_8/1 CACCTCTAGGGCTCAAGGGGCAGTTCCTCCATTCCTCAGCAGTGGCGCCTGTGGAACTGTGTCCTGAGGCCAGGGGGTGGTCAGGCAGGGCCTGGAGTGGC
|
||||
chr22_27496272_27496752_1:0:0_1:0:0_9/1 CTTAGCCCCATTAAACACTGGCAGGGCTGAATTGTCTGCTGCCATCCATCACACCTTCTCCCCTAGCCTGGTTTCTTACCTACCTGGAAGCCGTCCCTTTT
|
||||
|
||||
Pretty straightforward! FASTA files can be read in a very similar way.
|
||||
|
||||
.. _section1-code:
|
||||
|
||||
Full code listing
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
:download:`click to download <../../workshop/section1.seq>`
|
||||
|
||||
.. code:: seq
|
||||
|
||||
# SeqMap
|
||||
# Seq workshop -- Section 1
|
||||
# Reads and prints a FASTQ file.
|
||||
# Usage: seqc run section1.seq <FASTQ path>
|
||||
from sys import argv
|
||||
from bio import *
|
||||
for record in FASTQ(argv[1]):
|
||||
print(record.name, record.seq)
|
||||
|
||||
|
||||
Section 2: Building an index
|
||||
----------------------------
|
||||
|
||||
Our goal is to find a "mapping" on the genome for each read. Comparing to every
|
||||
position on the reference sequence would take far too long. An alternative is
|
||||
to create an index of the k-mers from the reference sequence and use it to guide
|
||||
the mapping process.
|
||||
|
||||
Let's build a dictionary that maps each k-mer to its position ("locus") on the
|
||||
reference sequence:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
from sys import argv
|
||||
from bio import *
|
||||
index = {}
|
||||
|
||||
for record in FASTA(argv[1]):
|
||||
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
|
||||
index[kmer] = pos
|
||||
|
||||
Of course, there will be k-mers that appear multiple times, but let's ignore this
|
||||
detail for now and just store the latest position we see for each k-mer.
|
||||
|
||||
Another important issue is *reverse complementation*: some of our reads will map
|
||||
in the reverse direction rather than in the forward direction. For this reason,
|
||||
let's build our index in such a way that a k-mer is considered "equal" to its
|
||||
reverse complement. One easy way to do this is by using "canonical" k-mers, i.e.
|
||||
the minimum of a k-mer and its reverse complement:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
from sys import argv
|
||||
from bio import *
|
||||
index = {}
|
||||
|
||||
for record in FASTA(argv[1]):
|
||||
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
|
||||
index[min(kmer, ~kmer)] = pos # <--
|
||||
|
||||
(We'll have to use canonical k-mers when querying the index too, of course.)
|
||||
|
||||
Now we have our index as a dictionary (``index``), but we don't want to build it
|
||||
each time we perform read mapping, since it only depends on the (fixed) reference
|
||||
sequence. So, as a last step, let's dump the index to a file using the ``pickle``
|
||||
module:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'wb') as jar:
|
||||
pickle.dump(index, jar)
|
||||
|
||||
Run the program:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
seqc run section2.seq data/chr22.fa
|
||||
|
||||
Now we should see a new file ``data/chr22.fa.index`` which stores our
|
||||
serialized index.
|
||||
|
||||
The nice thing is we should only have to build our index once!
|
||||
|
||||
.. _section2-code:
|
||||
|
||||
Full code listing
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
:download:`click to download <../../workshop/section2.seq>`
|
||||
|
||||
.. code:: seq
|
||||
|
||||
# SeqMap
|
||||
# Seq workshop -- Section 2
|
||||
# Reads and constructs a hash table index from an input
|
||||
# FASTA file.
|
||||
# Usage: seqc run section2.seq <FASTA path>
|
||||
from sys import argv
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
index = {}
|
||||
|
||||
for record in FASTA(argv[1]):
|
||||
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
|
||||
index[min(kmer, ~kmer)] = pos
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'wb') as jar:
|
||||
pickle.dump(index, jar)
|
||||
|
||||
|
||||
Section 3: Finding k-mer matches
|
||||
--------------------------------
|
||||
|
||||
At this point, we have an index we can load from disk. Let's use it
|
||||
to find candidate mappings for our reads.
|
||||
|
||||
We'll split each read into k-mers and report a mapping if at least two
|
||||
k-mers support a particular locus.
|
||||
|
||||
The first step is to load the index:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
from sys import argv
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
Now we can iterate over our reads and query k-mers in the index. We need
|
||||
a way to keep track of candidate mapping positions as we process the
|
||||
k-mers of a read: we can do this using a new dictionary, ``candidates``,
|
||||
which maps candidate alignment positions to the number of k-mers supporting
|
||||
the given position.
|
||||
|
||||
Then, we just iterate over ``candidates`` and output positions supported by
|
||||
2 or more k-mers. Finally, we clear ``candidates`` before processing the next
|
||||
read:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
candidates = {} # position -> count mapping
|
||||
for record in FASTQ(argv[2]):
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
print(record.name, pos + 1)
|
||||
|
||||
candidates.clear()
|
||||
|
||||
Run the program:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
seqc run section3.seq data/chr22.fa data/reads.fq > out.txt
|
||||
|
||||
Let's take a look at the output:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
head out.txt
|
||||
|
||||
.. code:: text
|
||||
|
||||
chr22_16993648_16994131_1:0:0_2:0:0_0/1 16993648
|
||||
chr22_28253010_28253558_1:0:0_0:0:0_1/1 28253010
|
||||
chr22_44541236_44541725_0:1:0_0:0:0_3/1 44541236
|
||||
chr22_31651867_31652385_2:0:0_2:0:0_7/1 31651867
|
||||
chr22_21584577_21585142_1:0:0_1:0:0_a/1 21584577
|
||||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 47088563
|
||||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 51103174
|
||||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 46795988
|
||||
chr22_16269615_16270134_0:0:0_1:0:0_c/1 50577316
|
||||
chr22_16269615_16270134_0:0:0_1:0:0_c/1 16269615
|
||||
|
||||
Notice that most positions we reported match the position from the read
|
||||
name (the first integer after the ``_``); not bad!
|
||||
|
||||
.. _section3-code:
|
||||
|
||||
Full code listing
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
:download:`click to download <../../workshop/section3.seq>`
|
||||
|
||||
.. code:: seq
|
||||
|
||||
# SeqMap
|
||||
# Seq workshop -- Section 3
|
||||
# Reads index constructed in Section 2 and looks up k-mers from
|
||||
# input reads to find candidate mappings.
|
||||
# Usage: seqc run section3.seq <FASTA path> <FASTQ path>
|
||||
from sys import argv
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
candidates = {} # position -> count mapping
|
||||
for record in FASTQ(argv[2]):
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
print(record.name, pos + 1)
|
||||
|
||||
candidates.clear()
|
||||
|
||||
|
||||
Section 4: Smith-Waterman alignment and CIGAR strings
|
||||
-----------------------------------------------------
|
||||
|
||||
We now have the ability to report mapping *positions* for each read,
|
||||
but usually we want *alignments*, which include information about
|
||||
mismatches, insertions and deletions.
|
||||
|
||||
Luckily, Seq makes sequence alignment easy: to align sequence ``q``
|
||||
against sequence ``t``, you can just do:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
aln = q @ t
|
||||
|
||||
``aln`` is a tuple of alignment score and CIGAR string (a *CIGAR string* is
|
||||
a way of encoding an alignment result, and consists of operations such as ``M``
|
||||
for match/mismatch, ``I`` for insertion and ``D`` for deletion, accompanied
|
||||
by the number of associated bases; for example, ``3M2I4M`` indicates 3 (mis)matches
|
||||
followed by a length-2 insertion followed by 4 (mis)matches).
|
||||
|
||||
By default, `Levenshtein distance <https://en.wikipedia.org/wiki/Levenshtein_distance>`_ is
|
||||
used, meaning mismatch and gap costs are both 1, while match costs are zero. More
|
||||
control over alignment parameters can be achieved using the ``align`` method:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
aln = q.align(t, a=2, b=4, ambig=0, gapo=4, gape=2)
|
||||
|
||||
where ``a`` is the match score, ``b`` is the mismatch cost, ``ambig`` is the
|
||||
ambiguous base (``N``) match score, ``gapo`` is the gap open cost and ``gape``
|
||||
the gap extension cost (i.e. a gap of length ``k`` costs ``gapo + (k * gape)``).
|
||||
There are many more parameters as well, controlling factors like alignment bandwidth,
|
||||
Z-drop, global/extension alignment and more; check the
|
||||
`standard library reference <https://docs.seq-lang.org/stdlib/bio/align.html#bio.align.seq.align>`_
|
||||
for further details.
|
||||
|
||||
For now, we'll use a simple ``query.align(target)``:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
candidates = {} # position -> count mapping
|
||||
for record in FASTQ(argv[2]):
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
# get query, target and align:
|
||||
query = record.read
|
||||
target = reference[pos:pos + len(query)]
|
||||
alignment = query.align(target)
|
||||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||||
|
||||
candidates.clear()
|
||||
|
||||
Run the program:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
seqc run section4.seq data/chr22.fa data/reads.fq > out.txt
|
||||
|
||||
And let's take a look at the output once again:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
head out.txt
|
||||
|
||||
.. code:: text
|
||||
|
||||
chr22_16993648_16994131_1:0:0_2:0:0_0/1 16993648 196 101M
|
||||
chr22_28253010_28253558_1:0:0_0:0:0_1/1 28253010 196 101M
|
||||
chr22_44541236_44541725_0:1:0_0:0:0_3/1 44541236 196 101M
|
||||
chr22_31651867_31652385_2:0:0_2:0:0_7/1 31651867 190 101M
|
||||
chr22_21584577_21585142_1:0:0_1:0:0_a/1 21584577 196 101M
|
||||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 47088563 110 20M1I4M1D76M
|
||||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 51103174 134 20M1I4M1D76M
|
||||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 46795988 128 20M1I4M1D76M
|
||||
chr22_16269615_16270134_0:0:0_1:0:0_c/1 50577316 118 101M
|
||||
chr22_16269615_16270134_0:0:0_1:0:0_c/1 16269615 202 101M
|
||||
|
||||
Most of the alignments contain only matches or mismatches (``M``), which
|
||||
is to be expected as insertions and deletions are far less common. In fact,
|
||||
the three mappings containing indels appear to be incorrect!
|
||||
|
||||
A more thorough mapping scheme would also look at alignment scores before
|
||||
reporting mappings, although for the purposes of this workshop we'll ignore
|
||||
such improvements.
|
||||
|
||||
.. _section4-code:
|
||||
|
||||
Full code listing
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
:download:`click to download <../../workshop/section4.seq>`
|
||||
|
||||
.. code:: seq
|
||||
|
||||
# SeqMap
|
||||
# Seq workshop -- Section 4
|
||||
# Reads index constructed in Section 2 and looks up k-mers from
|
||||
# input reads to find candidate mappings, then performs alignment.
|
||||
# Usage: seqc run section4.seq <FASTA path> <FASTQ path>
|
||||
from sys import argv
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
reference = s''
|
||||
for record in FASTA(argv[1]):
|
||||
reference = record.seq
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
candidates = {} # position -> count mapping
|
||||
for record in FASTQ(argv[2]):
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
query = record.read
|
||||
target = reference[pos:pos + len(query)]
|
||||
alignment = query.align(target)
|
||||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||||
|
||||
candidates.clear()
|
||||
|
||||
|
||||
Section 5: Pipelines
|
||||
--------------------
|
||||
|
||||
Pipelines are a very convenient Seq construct for expressing a variety
|
||||
of algorithms and applications. In fact, *SeqMap* can be thought of as
|
||||
a pipeline with the following stages:
|
||||
|
||||
- read a record from the FASTQ file,
|
||||
- find candidate alignments by querying the index,
|
||||
- perform alignment for mappings supported by 2+ k-mers and output results.
|
||||
|
||||
We can write this as a pipeline in Seq as follows:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
def find_candidates(record):
|
||||
candidates = {} # position -> count mapping
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
yield record, pos
|
||||
|
||||
def align_and_output(t):
|
||||
record, pos = t
|
||||
query = record.read
|
||||
target = reference[pos:pos + len(query)]
|
||||
alignment = query.align(target)
|
||||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||||
|
||||
Notice that ``find_candidates`` *yields* candidate alignments to ``align_and_output``,
|
||||
which then performs alignment and prints the results. In Seq, all values generated
|
||||
from one stage of a pipeline are passed to the next. The Seq compiler performs many
|
||||
domain-specific optimizations on pipelines, one of which we focus on in the next section.
|
||||
|
||||
(Optional) Parallelism
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Parallelism can be achieved using the parallel pipe operator, ``||>``, which
|
||||
tells the compiler that all subsequent stages can be executed in parallel:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
FASTQ(argv[2]) |> iter ||> find_candidates |> align_and_output
|
||||
|
||||
Since the full program also involves loading the index, let's time the main
|
||||
pipeline using the ``timing`` module:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
import timing
|
||||
with timing('mapping'):
|
||||
FASTQ(argv[2]) |> iter ||> find_candidates |> align_and_output
|
||||
|
||||
We can try this for different numbers of threads:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
export OMP_NUM_THREADS=1
|
||||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||||
# mapping took 48.2858s
|
||||
|
||||
export OMP_NUM_THREADS=2
|
||||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||||
# mapping took 35.886s
|
||||
|
||||
Often, batching reads into larger blocks and processing those blocks in parallel can
|
||||
yield better performance, especially if each read is quick to process. This is also
|
||||
very easy to do in Seq:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
def process_block(block):
|
||||
block |> iter |> find_candidates |> align_and_output
|
||||
|
||||
with timing('mapping'):
|
||||
FASTQ(argv[2]) |> blocks(size=2000) ||> process_block
|
||||
|
||||
And now:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
export OMP_NUM_THREADS=1
|
||||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||||
# mapping took 48.2858s
|
||||
|
||||
export OMP_NUM_THREADS=2
|
||||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||||
# mapping took 25.2648s
|
||||
|
||||
.. _section5-code:
|
||||
|
||||
Full code listing
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
:download:`click to download <../../workshop/section5.seq>`
|
||||
|
||||
.. code:: seq
|
||||
|
||||
# SeqMap
|
||||
# Seq workshop -- Section 5
|
||||
# Reads index constructed in Section 2 and looks up k-mers from
|
||||
# input reads to find candidate mappings, then performs alignment.
|
||||
# Implemented with Seq pipelines.
|
||||
# Usage: seqc run section5.seq <FASTA path> <FASTQ path>
|
||||
from sys import argv
|
||||
from time import timing
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
reference = s''
|
||||
for record in FASTA(argv[1]):
|
||||
reference = record.seq
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
def find_candidates(record):
|
||||
candidates = {} # position -> count mapping
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
yield record, pos
|
||||
|
||||
def align_and_output(t):
|
||||
record, pos = t
|
||||
query = record.read
|
||||
target = reference[pos:pos + len(query)]
|
||||
alignment = query.align(target)
|
||||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||||
|
||||
with timing('mapping'):
|
||||
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output
|
||||
|
||||
|
||||
Section 6: Domain-specific optimizations
|
||||
----------------------------------------
|
||||
|
||||
Seq already performs numerous domain-specific optimizations under the hood.
|
||||
However, we can give the compiler a hint in this case to perform one more:
|
||||
*inter-sequence alignment*. This optimization entails batching sequences
|
||||
prior to alignment, then aligning multiple pairs using a very fast SIMD
|
||||
optimized alignment kernel.
|
||||
|
||||
In Seq, we just need one additional function annotation to tell the compiler
|
||||
to perform this optimization:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
@inter_align
|
||||
def align_and_output(t):
|
||||
...
|
||||
|
||||
Let's run the program with and without this optimization:
|
||||
|
||||
.. code:: seq
|
||||
|
||||
# without @inter_align
|
||||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||||
# mapping took 43.4457s
|
||||
|
||||
# with @inter_align
|
||||
seqc run section6.seq data/chr22.fa data/reads.fq > out.txt
|
||||
# mapping took 32.3241s
|
||||
|
||||
(The timings with inter-sequence alignment will depend on the SIMD instruction
|
||||
sets your CPU supports; these numbers are from using AVX2.)
|
||||
|
||||
.. _section6-code:
|
||||
|
||||
Full code listing
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
:download:`click to download <../../workshop/section6.seq>`
|
||||
|
||||
.. code:: seq
|
||||
|
||||
# SeqMap
|
||||
# Seq workshop -- Section 6
|
||||
# Reads index constructed in Section 2 and looks up k-mers from
|
||||
# input reads to find candidate mappings, then performs alignment.
|
||||
# Implemented with Seq pipelines using inter-seq. alignment.
|
||||
# Usage: seqc run section6.seq <FASTA path> <FASTQ path>
|
||||
from sys import argv
|
||||
from time import timing
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
reference = s''
|
||||
for record in FASTA(argv[1]):
|
||||
reference = record.seq
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
def find_candidates(record):
|
||||
candidates = {} # position -> count mapping
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
yield record, pos
|
||||
|
||||
@inter_align
|
||||
def align_and_output(t):
|
||||
record, pos = t
|
||||
query = record.read
|
||||
target = reference[pos:pos + len(query)]
|
||||
alignment = query.align(target)
|
||||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||||
|
||||
with timing('mapping'):
|
||||
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output
|
|
@ -1,8 +0,0 @@
|
|||
# SeqMap
|
||||
# Seq workshop -- Section 1
|
||||
# Reads and prints a FASTQ file.
|
||||
# Usage: seqc run section1.seq <FASTQ path>
|
||||
from sys import argv
|
||||
from bio import *
|
||||
for record in FASTQ(argv[1]):
|
||||
print(record.name, record.seq)
|
|
@ -1,18 +0,0 @@
|
|||
# SeqMap
|
||||
# Seq workshop -- Section 2
|
||||
# Reads and constructs a hash table index from an input
|
||||
# FASTA file.
|
||||
# Usage: seqc run section2.seq <FASTA path>
|
||||
from sys import argv
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
index = {}
|
||||
|
||||
for record in FASTA(argv[1]):
|
||||
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
|
||||
index[min(kmer, ~kmer)] = pos
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'wb') as jar:
|
||||
pickle.dump(index, jar)
|
|
@ -1,29 +0,0 @@
|
|||
# SeqMap
|
||||
# Seq workshop -- Section 3
|
||||
# Reads index constructed in Section 2 and looks up k-mers from
|
||||
# input reads to find candidate mappings.
|
||||
# Usage: seqc run section3.seq <FASTA path> <FASTQ path>
|
||||
from sys import argv
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
candidates = {} # position -> count mapping
|
||||
for record in FASTQ(argv[2]):
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
print(record.name, pos + 1)
|
||||
|
||||
candidates.clear()
|
|
@ -1,36 +0,0 @@
|
|||
# SeqMap
|
||||
# Seq workshop -- Section 4
|
||||
# Reads index constructed in Section 2 and looks up k-mers from
|
||||
# input reads to find candidate mappings, then performs alignment.
|
||||
# Usage: seqc run section4.seq <FASTA path> <FASTQ path>
|
||||
from sys import argv
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
reference = s''
|
||||
for record in FASTA(argv[1]):
|
||||
reference = record.seq
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
candidates = {} # position -> count mapping
|
||||
for record in FASTQ(argv[2]):
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
query = record.read
|
||||
target = reference[pos:pos + len(query)]
|
||||
alignment = query.align(target)
|
||||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||||
|
||||
candidates.clear()
|
|
@ -1,42 +0,0 @@
|
|||
# SeqMap
|
||||
# Seq workshop -- Section 5
|
||||
# Reads index constructed in Section 2 and looks up k-mers from
|
||||
# input reads to find candidate mappings, then performs alignment.
|
||||
# Implemented with Seq pipelines.
|
||||
# Usage: seqc run section5.seq <FASTA path> <FASTQ path>
|
||||
from sys import argv
|
||||
from time import timing
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
reference = s''
|
||||
for record in FASTA(argv[1]):
|
||||
reference = record.seq
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
def find_candidates(record):
|
||||
candidates = {} # position -> count mapping
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
yield record, pos
|
||||
|
||||
def align_and_output(t):
|
||||
record, pos = t
|
||||
query = record.read
|
||||
target = reference[pos:pos + len(query)]
|
||||
alignment = query.align(target)
|
||||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||||
|
||||
with timing('mapping'):
|
||||
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output
|
|
@ -1,43 +0,0 @@
|
|||
# SeqMap
|
||||
# Seq workshop -- Section 6
|
||||
# Reads index constructed in Section 2 and looks up k-mers from
|
||||
# input reads to find candidate mappings, then performs alignment.
|
||||
# Implemented with Seq pipelines using inter-seq. alignment.
|
||||
# Usage: seqc run section6.seq <FASTA path> <FASTQ path>
|
||||
from sys import argv
|
||||
from time import timing
|
||||
from bio import *
|
||||
import pickle
|
||||
import gzip
|
||||
|
||||
reference = s''
|
||||
for record in FASTA(argv[1]):
|
||||
reference = record.seq
|
||||
|
||||
K: Static[int] = 32
|
||||
index = None
|
||||
|
||||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||||
|
||||
def find_candidates(record):
|
||||
candidates = {} # position -> count mapping
|
||||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||||
found = index.get(min(kmer, ~kmer), -1)
|
||||
if found > 0:
|
||||
loc = found - pos
|
||||
candidates[loc] = candidates.get(loc, 0) + 1
|
||||
for pos,count in candidates.items():
|
||||
if count > 1:
|
||||
yield record, pos
|
||||
|
||||
@inter_align
|
||||
def align_and_output(t):
|
||||
record, pos = t
|
||||
query = record.read
|
||||
target = reference[pos:pos + len(query)]
|
||||
alignment = query.align(target)
|
||||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||||
|
||||
with timing('mapping'):
|
||||
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output
|
|
@ -37,7 +37,7 @@ def timing(msg: str = ""):
|
|||
"""
|
||||
Example usage:
|
||||
|
||||
.. code-block:: seq
|
||||
.. code-block:: python
|
||||
|
||||
from time import timing
|
||||
with timing('foo function'):
|
||||
|
|
Loading…
Reference in New Issue