mirror of https://github.com/exaloop/codon.git
771 lines
25 KiB
ReStructuredText
771 lines
25 KiB
ReStructuredText
Workshop
|
||
========
|
||
|
||
In this workshop, we will build a toy read mapper called *SeqMap* to
|
||
showcase some of Seq's domain-specific features and optimizations.
|
||
|
||
Code Listings
|
||
-------------
|
||
|
||
- **Section 1**: :ref:`section1-code` | :download:`click to download <../../workshop/section1.seq>`
|
||
- **Section 2**: :ref:`section2-code` | :download:`click to download <../../workshop/section2.seq>`
|
||
- **Section 3**: :ref:`section3-code` | :download:`click to download <../../workshop/section3.seq>`
|
||
- **Section 4**: :ref:`section4-code` | :download:`click to download <../../workshop/section4.seq>`
|
||
- **Section 5**: :ref:`section5-code` | :download:`click to download <../../workshop/section5.seq>`
|
||
- **Section 6**: :ref:`section6-code` | :download:`click to download <../../workshop/section6.seq>`
|
||
|
||
Getting Started
|
||
---------------
|
||
|
||
If you don't have Seq installed already, you can install it with one bash command:
|
||
|
||
.. code:: bash
|
||
|
||
/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"
|
||
|
||
Be sure to restart the shell for ``seqc`` to be added to your ``PATH`` (or add it manually).
|
||
|
||
Now, let's create a new directory to house our Seq code and test data:
|
||
|
||
.. code:: bash
|
||
|
||
mkdir seqmap
|
||
cd seqmap
|
||
|
||
We'll test *SeqMap* on the following data:
|
||
|
||
- Chromosome 22 as the reference sequence (``chr22.fa``, 52MB)
|
||
- A simulated set of 1 million 101bp reads (``reads.fastq``, 251MB)
|
||
|
||
Let's download this data into a new folder called ``data``:
|
||
|
||
.. code:: bash
|
||
|
||
curl -L https://www.dropbox.com/s/8lsngqvn0tzjn1o/data.tar.gz | tar zxf - -C .
|
||
|
||
What does this data actually look like? Let's take a look:
|
||
|
||
.. code:: bash
|
||
|
||
head data/reads.fq
|
||
|
||
.. code:: text
|
||
|
||
@chr22_16993648_16994131_1:0:0_2:0:0_0/1
|
||
CTACCAAACACCTACTTCGTTTCCTAACATCACTTTAATTTTATCTTAGAGGAATTCTTTTCCCTATCCCATTAAGTTATGGGAGATGGGGCCAGGCATGG
|
||
+
|
||
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
|
||
@chr22_28253010_28253558_1:0:0_0:0:0_1/1
|
||
AGTGTTTTGCCTGTGGCTAGACTAAAAATAAGGAATGAGGGGGGTATCTTCCACTCTTGCCCTCTCATCACCCTATTCCCTATATCCAGAACTCAGAGTCC
|
||
+
|
||
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
|
||
@chr22_21656192_21656802_0:0:0_2:0:0_2/1
|
||
ATAGCGTGGATTCCTATGACATCAAGGAGCTATTTTATTTGGTAAAACGAAAAAGCACAATAATGAACGAACGCAAGCACTGAAACAGTGGAGACACCTAG
|
||
|
||
Each `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ record consists of four lines:
|
||
|
||
- Read name, starting with ``@``. Notice that since this data is simulated, the read name includes the
|
||
location on the genome where the read comes from; this will be useful later!
|
||
- Read sequence. This is a DNA sequence consisting of ``ACGT`` bases and possibly ``N``, indicating an
|
||
ambiguous base.
|
||
- Separator (``+``)
|
||
- Read quality scores. This is a string of characters as long as the read sequence, where each character
|
||
indicates the "quality" of the corresponding base in the read sequence. We won't be using it here.
|
||
|
||
What about the reference sequence `FASTA <https://en.wikipedia.org/wiki/FASTA_format>`_ file?
|
||
|
||
.. code:: bash
|
||
|
||
head data/chr22.fa
|
||
|
||
.. code:: text
|
||
|
||
>chr22
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
|
||
|
||
FASTQ records start with the sequence name (prefixed with ``>``) followed by the sequence itself, split
|
||
across multiple lines. But why are all the bases ``N``? Chromosomal sequences often have their starts
|
||
and ends masked with ``N``'s to cover repetitive `telomeric <https://en.wikipedia.org/wiki/Telomere>`_ sequences.
|
||
Since we usually don't want to include such regions in our analyses, they are masked in the file. Let's
|
||
look a bit further down:
|
||
|
||
.. code:: bash
|
||
|
||
head -n 1000000 data/chr22.fa | tail -n 10
|
||
|
||
.. code:: text
|
||
|
||
tattaaaggaaaaaactgtatgaaatagtacatttctcataattctcatt
|
||
ttgtaaaaataaagtacttatctatggacataatgagaaaatgactcaag
|
||
gtaccaagagtttagccattagctataccagtggattataagcaaattct
|
||
gttACGTGCATGCACTCACCTACGCATGTTCATGTATTCATACATACGTA
|
||
CATAATTTTTTAAATTTTCTTTTATAGACAAGCAATAGCTTTATAATCTC
|
||
TATAATCAGTAAAAATAAGTAAGTggctggacgcagtggctcacacctgt
|
||
aatctcagcactttgggaggctgaggagggcagattatgaggtcagaaga
|
||
tcaagaccatcctggctaacacagtgaaaccccatctctactaaaaatac
|
||
aaaaaattagccacgcgtggtggcacgcgcctgtagtcccagctactggg
|
||
gaggctgaggcaggaaaatcgcttgaacccgggaggcagaggttgcggtg
|
||
|
||
Now we can see the usual ``ACGT`` bases. The fact that some bases are lowercase indicates that they
|
||
are a part of some repetitive element or region. Seq will handle these different uppercase and lowercase
|
||
characters automatically, so we don't need to worry about them.
|
||
|
||
You might notice an additional file called ``chr22.fa.fai``: this is a FASTA index file that includes
|
||
information about each sequence contained in the file for easier parsing. We won't use it directly,
|
||
but Seq uses it internally to make FASTA parsing more efficient.
|
||
|
||
|
||
Section 1: Reading sequences from disk
|
||
--------------------------------------
|
||
|
||
The first step of processing any kind of sequencing data is to read it from disk.
|
||
Seq has builtin support for many of the standard file formats such as FASTA, FASTQ,
|
||
SAM, BAM, etc.
|
||
|
||
Let's write a program to read our FASTQ file and print each record's name and sequence
|
||
on a single line:
|
||
|
||
.. code:: seq
|
||
|
||
from sys import argv
|
||
from bio import *
|
||
for record in FASTQ(argv[1]):
|
||
print(record.name, record.seq)
|
||
|
||
Now we can run this Seq program:
|
||
|
||
.. code:: bash
|
||
|
||
seqc run section1.seq data/reads.fq > out.txt
|
||
|
||
and view the results:
|
||
|
||
.. code:: bash
|
||
|
||
head out.txt
|
||
|
||
.. code:: text
|
||
|
||
chr22_16993648_16994131_1:0:0_2:0:0_0/1 CTACCAAACACCTACTTCGTTTCCTAACATCACTTTAATTTTATCTTAGAGGAATTCTTTTCCCTATCCCATTAAGTTATGGGAGATGGGGCCAGGCATGG
|
||
chr22_28253010_28253558_1:0:0_0:0:0_1/1 AGTGTTTTGCCTGTGGCTAGACTAAAAATAAGGAATGAGGGGGGTATCTTCCACTCTTGCCCTCTCATCACCCTATTCCCTATATCCAGAACTCAGAGTCC
|
||
chr22_21656192_21656802_0:0:0_2:0:0_2/1 ATAGCGTGGATTCCTATGACATCAAGGAGCTATTTTATTTGGTAAAACGAAAAAGCACAATAATGAACGAACGCAAGCACTGAAACAGTGGAGACACCTAG
|
||
chr22_44541236_44541725_0:1:0_0:0:0_3/1 CTCTCTGTCTCTCTCTCTCCCCTAGGTCAGGGTGGTCCCTGGGGAGGCCCCTGGGTTACCCCAAGACAGGTGGGAGGTGCTTCCTACCCGACCCTCTTCCT
|
||
chr22_39607671_39608139_0:0:0_2:0:0_4/1 ATTGGCTCAGAGTTCAGCAGGCTGTACCAGCATGGCGCCAGTGTCTGCTCCTGGTGAGGCCTTACGGACGTTACAATAACGGCGGAAGGCAAAGGCGGAGC
|
||
chr22_35577703_35578255_3:0:0_1:0:0_5/1 TGCCATGGTGGTTAGCTGCACCCATCAACCTGTCATCTACATTAGGTATTTTTCCTAATGCTATCCCTCCCCTAGCACCCTACCCTCTGATAGGCCCTGGT
|
||
chr22_46059124_46059578_1:0:0_1:0:0_6/1 AATCAGTACCAAACAATATATGGATATTATTGGCACTTTGTGCTCCCTCTGCCTGAACTGGGAATTCCTCTATTAGTTTTGACATTATCTGGTATTGAACC
|
||
chr22_31651867_31652385_2:0:0_2:0:0_7/1 ATCTAGTGACAGTAAGTGGCTGATAAAGTGAGCTGCCATTACATAGTCATCATCTTTAATAGAAGTTAACACATACTGAGTTTCTACTATATTGGGTCTTT
|
||
chr22_24816466_24817026_1:0:0_1:0:0_8/1 CACCTCTAGGGCTCAAGGGGCAGTTCCTCCATTCCTCAGCAGTGGCGCCTGTGGAACTGTGTCCTGAGGCCAGGGGGTGGTCAGGCAGGGCCTGGAGTGGC
|
||
chr22_27496272_27496752_1:0:0_1:0:0_9/1 CTTAGCCCCATTAAACACTGGCAGGGCTGAATTGTCTGCTGCCATCCATCACACCTTCTCCCCTAGCCTGGTTTCTTACCTACCTGGAAGCCGTCCCTTTT
|
||
|
||
Pretty straightforward! FASTA files can be read in a very similar way.
|
||
|
||
.. _section1-code:
|
||
|
||
Full code listing
|
||
~~~~~~~~~~~~~~~~~
|
||
|
||
:download:`click to download <../../workshop/section1.seq>`
|
||
|
||
.. code:: seq
|
||
|
||
# SeqMap
|
||
# Seq workshop -- Section 1
|
||
# Reads and prints a FASTQ file.
|
||
# Usage: seqc run section1.seq <FASTQ path>
|
||
from sys import argv
|
||
from bio import *
|
||
for record in FASTQ(argv[1]):
|
||
print(record.name, record.seq)
|
||
|
||
|
||
Section 2: Building an index
|
||
----------------------------
|
||
|
||
Our goal is to find a "mapping" on the genome for each read. Comparing to every
|
||
position on the reference sequence would take far too long. An alternative is
|
||
to create an index of the k-mers from the reference sequence and use it to guide
|
||
the mapping process.
|
||
|
||
Let's build a dictionary that maps each k-mer to its position ("locus") on the
|
||
reference sequence:
|
||
|
||
.. code:: seq
|
||
|
||
from sys import argv
|
||
from bio import *
|
||
index = {}
|
||
|
||
for record in FASTA(argv[1]):
|
||
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
|
||
index[kmer] = pos
|
||
|
||
Of course, there will be k-mers that appear multiple times, but let's ignore this
|
||
detail for now and just store the latest position we see for each k-mer.
|
||
|
||
Another important issue is *reverse complementation*: some of our reads will map
|
||
in the reverse direction rather than in the forward direction. For this reason,
|
||
let's build our index in such a way that a k-mer is considered "equal" to its
|
||
reverse complement. One easy way to do this is by using "canonical" k-mers, i.e.
|
||
the minimum of a k-mer and its reverse complement:
|
||
|
||
.. code:: seq
|
||
|
||
from sys import argv
|
||
from bio import *
|
||
index = {}
|
||
|
||
for record in FASTA(argv[1]):
|
||
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
|
||
index[min(kmer, ~kmer)] = pos # <--
|
||
|
||
(We'll have to use canonical k-mers when querying the index too, of course.)
|
||
|
||
Now we have our index as a dictionary (``index``), but we don't want to build it
|
||
each time we perform read mapping, since it only depends on the (fixed) reference
|
||
sequence. So, as a last step, let's dump the index to a file using the ``pickle``
|
||
module:
|
||
|
||
.. code:: seq
|
||
|
||
import pickle
|
||
import gzip
|
||
|
||
with gzip.open(argv[1] + '.index', 'wb') as jar:
|
||
pickle.dump(index, jar)
|
||
|
||
Run the program:
|
||
|
||
.. code:: bash
|
||
|
||
seqc run section2.seq data/chr22.fa
|
||
|
||
Now we should see a new file ``data/chr22.fa.index`` which stores our
|
||
serialized index.
|
||
|
||
The nice thing is we should only have to build our index once!
|
||
|
||
.. _section2-code:
|
||
|
||
Full code listing
|
||
~~~~~~~~~~~~~~~~~
|
||
|
||
:download:`click to download <../../workshop/section2.seq>`
|
||
|
||
.. code:: seq
|
||
|
||
# SeqMap
|
||
# Seq workshop -- Section 2
|
||
# Reads and constructs a hash table index from an input
|
||
# FASTA file.
|
||
# Usage: seqc run section2.seq <FASTA path>
|
||
from sys import argv
|
||
from bio import *
|
||
import pickle
|
||
import gzip
|
||
|
||
index = {}
|
||
|
||
for record in FASTA(argv[1]):
|
||
for pos,kmer in record.seq.kmers_with_pos(k=32, step=1):
|
||
index[min(kmer, ~kmer)] = pos
|
||
|
||
with gzip.open(argv[1] + '.index', 'wb') as jar:
|
||
pickle.dump(index, jar)
|
||
|
||
|
||
Section 3: Finding k-mer matches
|
||
--------------------------------
|
||
|
||
At this point, we have an index we can load from disk. Let's use it
|
||
to find candidate mappings for our reads.
|
||
|
||
We'll split each read into k-mers and report a mapping if at least two
|
||
k-mers support a particular locus.
|
||
|
||
The first step is to load the index:
|
||
|
||
.. code:: seq
|
||
|
||
from sys import argv
|
||
from bio import *
|
||
import pickle
|
||
import gzip
|
||
|
||
K: Static[int] = 32
|
||
index = None
|
||
|
||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||
|
||
Now we can iterate over our reads and query k-mers in the index. We need
|
||
a way to keep track of candidate mapping positions as we process the
|
||
k-mers of a read: we can do this using a new dictionary, ``candidates``,
|
||
which maps candidate alignment positions to the number of k-mers supporting
|
||
the given position.
|
||
|
||
Then, we just iterate over ``candidates`` and output positions supported by
|
||
2 or more k-mers. Finally, we clear ``candidates`` before processing the next
|
||
read:
|
||
|
||
.. code:: seq
|
||
|
||
candidates = {} # position -> count mapping
|
||
for record in FASTQ(argv[2]):
|
||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||
found = index.get(min(kmer, ~kmer), -1)
|
||
if found > 0:
|
||
loc = found - pos
|
||
candidates[loc] = candidates.get(loc, 0) + 1
|
||
|
||
for pos,count in candidates.items():
|
||
if count > 1:
|
||
print(record.name, pos + 1)
|
||
|
||
candidates.clear()
|
||
|
||
Run the program:
|
||
|
||
.. code:: bash
|
||
|
||
seqc run section3.seq data/chr22.fa data/reads.fq > out.txt
|
||
|
||
Let's take a look at the output:
|
||
|
||
.. code:: bash
|
||
|
||
head out.txt
|
||
|
||
.. code:: text
|
||
|
||
chr22_16993648_16994131_1:0:0_2:0:0_0/1 16993648
|
||
chr22_28253010_28253558_1:0:0_0:0:0_1/1 28253010
|
||
chr22_44541236_44541725_0:1:0_0:0:0_3/1 44541236
|
||
chr22_31651867_31652385_2:0:0_2:0:0_7/1 31651867
|
||
chr22_21584577_21585142_1:0:0_1:0:0_a/1 21584577
|
||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 47088563
|
||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 51103174
|
||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 46795988
|
||
chr22_16269615_16270134_0:0:0_1:0:0_c/1 50577316
|
||
chr22_16269615_16270134_0:0:0_1:0:0_c/1 16269615
|
||
|
||
Notice that most positions we reported match the position from the read
|
||
name (the first integer after the ``_``); not bad!
|
||
|
||
.. _section3-code:
|
||
|
||
Full code listing
|
||
~~~~~~~~~~~~~~~~~
|
||
|
||
:download:`click to download <../../workshop/section3.seq>`
|
||
|
||
.. code:: seq
|
||
|
||
# SeqMap
|
||
# Seq workshop -- Section 3
|
||
# Reads index constructed in Section 2 and looks up k-mers from
|
||
# input reads to find candidate mappings.
|
||
# Usage: seqc run section3.seq <FASTA path> <FASTQ path>
|
||
from sys import argv
|
||
from bio import *
|
||
import pickle
|
||
import gzip
|
||
|
||
K: Static[int] = 32
|
||
index = None
|
||
|
||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||
|
||
candidates = {} # position -> count mapping
|
||
for record in FASTQ(argv[2]):
|
||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||
found = index.get(min(kmer, ~kmer), -1)
|
||
if found > 0:
|
||
loc = found - pos
|
||
candidates[loc] = candidates.get(loc, 0) + 1
|
||
|
||
for pos,count in candidates.items():
|
||
if count > 1:
|
||
print(record.name, pos + 1)
|
||
|
||
candidates.clear()
|
||
|
||
|
||
Section 4: Smith-Waterman alignment and CIGAR strings
|
||
-----------------------------------------------------
|
||
|
||
We now have the ability to report mapping *positions* for each read,
|
||
but usually we want *alignments*, which include information about
|
||
mismatches, insertions and deletions.
|
||
|
||
Luckily, Seq makes sequence alignment easy: to align sequence ``q``
|
||
against sequence ``t``, you can just do:
|
||
|
||
.. code:: seq
|
||
|
||
aln = q @ t
|
||
|
||
``aln`` is a tuple of alignment score and CIGAR string (a *CIGAR string* is
|
||
a way of encoding an alignment result, and consists of operations such as ``M``
|
||
for match/mismatch, ``I`` for insertion and ``D`` for deletion, accompanied
|
||
by the number of associated bases; for example, ``3M2I4M`` indicates 3 (mis)matches
|
||
followed by a length-2 insertion followed by 4 (mis)matches).
|
||
|
||
By default, `Levenshtein distance <https://en.wikipedia.org/wiki/Levenshtein_distance>`_ is
|
||
used, meaning mismatch and gap costs are both 1, while match costs are zero. More
|
||
control over alignment parameters can be achieved using the ``align`` method:
|
||
|
||
.. code:: seq
|
||
|
||
aln = q.align(t, a=2, b=4, ambig=0, gapo=4, gape=2)
|
||
|
||
where ``a`` is the match score, ``b`` is the mismatch cost, ``ambig`` is the
|
||
ambiguous base (``N``) match score, ``gapo`` is the gap open cost and ``gape``
|
||
the gap extension cost (i.e. a gap of length ``k`` costs ``gapo + (k * gape)``).
|
||
There are many more parameters as well, controlling factors like alignment bandwidth,
|
||
Z-drop, global/extension alignment and more; check the
|
||
`standard library reference <https://docs.seq-lang.org/stdlib/bio/align.html#bio.align.seq.align>`_
|
||
for further details.
|
||
|
||
For now, we'll use a simple ``query.align(target)``:
|
||
|
||
.. code:: seq
|
||
|
||
candidates = {} # position -> count mapping
|
||
for record in FASTQ(argv[2]):
|
||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||
found = index.get(min(kmer, ~kmer), -1)
|
||
if found > 0:
|
||
loc = found - pos
|
||
candidates[loc] = candidates.get(loc, 0) + 1
|
||
|
||
for pos,count in candidates.items():
|
||
if count > 1:
|
||
# get query, target and align:
|
||
query = record.read
|
||
target = reference[pos:pos + len(query)]
|
||
alignment = query.align(target)
|
||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||
|
||
candidates.clear()
|
||
|
||
Run the program:
|
||
|
||
.. code:: bash
|
||
|
||
seqc run section4.seq data/chr22.fa data/reads.fq > out.txt
|
||
|
||
And let's take a look at the output once again:
|
||
|
||
.. code:: bash
|
||
|
||
head out.txt
|
||
|
||
.. code:: text
|
||
|
||
chr22_16993648_16994131_1:0:0_2:0:0_0/1 16993648 196 101M
|
||
chr22_28253010_28253558_1:0:0_0:0:0_1/1 28253010 196 101M
|
||
chr22_44541236_44541725_0:1:0_0:0:0_3/1 44541236 196 101M
|
||
chr22_31651867_31652385_2:0:0_2:0:0_7/1 31651867 190 101M
|
||
chr22_21584577_21585142_1:0:0_1:0:0_a/1 21584577 196 101M
|
||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 47088563 110 20M1I4M1D76M
|
||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 51103174 134 20M1I4M1D76M
|
||
chr22_46629499_46629977_0:0:0_2:0:0_b/1 46795988 128 20M1I4M1D76M
|
||
chr22_16269615_16270134_0:0:0_1:0:0_c/1 50577316 118 101M
|
||
chr22_16269615_16270134_0:0:0_1:0:0_c/1 16269615 202 101M
|
||
|
||
Most of the alignments contain only matches or mismatches (``M``), which
|
||
is to be expected as insertions and deletions are far less common. In fact,
|
||
the three mappings containing indels appear to be incorrect!
|
||
|
||
A more thorough mapping scheme would also look at alignment scores before
|
||
reporting mappings, although for the purposes of this workshop we'll ignore
|
||
such improvements.
|
||
|
||
.. _section4-code:
|
||
|
||
Full code listing
|
||
~~~~~~~~~~~~~~~~~
|
||
|
||
:download:`click to download <../../workshop/section4.seq>`
|
||
|
||
.. code:: seq
|
||
|
||
# SeqMap
|
||
# Seq workshop -- Section 4
|
||
# Reads index constructed in Section 2 and looks up k-mers from
|
||
# input reads to find candidate mappings, then performs alignment.
|
||
# Usage: seqc run section4.seq <FASTA path> <FASTQ path>
|
||
from sys import argv
|
||
from bio import *
|
||
import pickle
|
||
import gzip
|
||
|
||
reference = s''
|
||
for record in FASTA(argv[1]):
|
||
reference = record.seq
|
||
|
||
K: Static[int] = 32
|
||
index = None
|
||
|
||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||
|
||
candidates = {} # position -> count mapping
|
||
for record in FASTQ(argv[2]):
|
||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||
found = index.get(min(kmer, ~kmer), -1)
|
||
if found > 0:
|
||
loc = found - pos
|
||
candidates[loc] = candidates.get(loc, 0) + 1
|
||
|
||
for pos,count in candidates.items():
|
||
if count > 1:
|
||
query = record.read
|
||
target = reference[pos:pos + len(query)]
|
||
alignment = query.align(target)
|
||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||
|
||
candidates.clear()
|
||
|
||
|
||
Section 5: Pipelines
|
||
--------------------
|
||
|
||
Pipelines are a very convenient Seq construct for expressing a variety
|
||
of algorithms and applications. In fact, *SeqMap* can be thought of as
|
||
a pipeline with the following stages:
|
||
|
||
- read a record from the FASTQ file,
|
||
- find candidate alignments by querying the index,
|
||
- perform alignment for mappings supported by 2+ k-mers and output results.
|
||
|
||
We can write this as a pipeline in Seq as follows:
|
||
|
||
.. code:: seq
|
||
|
||
def find_candidates(record):
|
||
candidates = {} # position -> count mapping
|
||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||
found = index.get(min(kmer, ~kmer), -1)
|
||
if found > 0:
|
||
loc = found - pos
|
||
candidates[loc] = candidates.get(loc, 0) + 1
|
||
for pos,count in candidates.items():
|
||
if count > 1:
|
||
yield record, pos
|
||
|
||
def align_and_output(t):
|
||
record, pos = t
|
||
query = record.read
|
||
target = reference[pos:pos + len(query)]
|
||
alignment = query.align(target)
|
||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||
|
||
Notice that ``find_candidates`` *yields* candidate alignments to ``align_and_output``,
|
||
which then performs alignment and prints the results. In Seq, all values generated
|
||
from one stage of a pipeline are passed to the next. The Seq compiler performs many
|
||
domain-specific optimizations on pipelines, one of which we focus on in the next section.
|
||
|
||
(Optional) Parallelism
|
||
~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
Parallelism can be achieved using the parallel pipe operator, ``||>``, which
|
||
tells the compiler that all subsequent stages can be executed in parallel:
|
||
|
||
.. code:: seq
|
||
|
||
FASTQ(argv[2]) |> iter ||> find_candidates |> align_and_output
|
||
|
||
Since the full program also involves loading the index, let's time the main
|
||
pipeline using the ``timing`` module:
|
||
|
||
.. code:: seq
|
||
|
||
import timing
|
||
with timing('mapping'):
|
||
FASTQ(argv[2]) |> iter ||> find_candidates |> align_and_output
|
||
|
||
We can try this for different numbers of threads:
|
||
|
||
.. code:: bash
|
||
|
||
export OMP_NUM_THREADS=1
|
||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||
# mapping took 48.2858s
|
||
|
||
export OMP_NUM_THREADS=2
|
||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||
# mapping took 35.886s
|
||
|
||
Often, batching reads into larger blocks and processing those blocks in parallel can
|
||
yield better performance, especially if each read is quick to process. This is also
|
||
very easy to do in Seq:
|
||
|
||
.. code:: seq
|
||
|
||
def process_block(block):
|
||
block |> iter |> find_candidates |> align_and_output
|
||
|
||
with timing('mapping'):
|
||
FASTQ(argv[2]) |> blocks(size=2000) ||> process_block
|
||
|
||
And now:
|
||
|
||
.. code:: bash
|
||
|
||
export OMP_NUM_THREADS=1
|
||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||
# mapping took 48.2858s
|
||
|
||
export OMP_NUM_THREADS=2
|
||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||
# mapping took 25.2648s
|
||
|
||
.. _section5-code:
|
||
|
||
Full code listing
|
||
~~~~~~~~~~~~~~~~~
|
||
|
||
:download:`click to download <../../workshop/section5.seq>`
|
||
|
||
.. code:: seq
|
||
|
||
# SeqMap
|
||
# Seq workshop -- Section 5
|
||
# Reads index constructed in Section 2 and looks up k-mers from
|
||
# input reads to find candidate mappings, then performs alignment.
|
||
# Implemented with Seq pipelines.
|
||
# Usage: seqc run section5.seq <FASTA path> <FASTQ path>
|
||
from sys import argv
|
||
from time import timing
|
||
from bio import *
|
||
import pickle
|
||
import gzip
|
||
|
||
reference = s''
|
||
for record in FASTA(argv[1]):
|
||
reference = record.seq
|
||
|
||
K: Static[int] = 32
|
||
index = None
|
||
|
||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||
|
||
def find_candidates(record):
|
||
candidates = {} # position -> count mapping
|
||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||
found = index.get(min(kmer, ~kmer), -1)
|
||
if found > 0:
|
||
loc = found - pos
|
||
candidates[loc] = candidates.get(loc, 0) + 1
|
||
for pos,count in candidates.items():
|
||
if count > 1:
|
||
yield record, pos
|
||
|
||
def align_and_output(t):
|
||
record, pos = t
|
||
query = record.read
|
||
target = reference[pos:pos + len(query)]
|
||
alignment = query.align(target)
|
||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||
|
||
with timing('mapping'):
|
||
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output
|
||
|
||
|
||
Section 6: Domain-specific optimizations
|
||
----------------------------------------
|
||
|
||
Seq already performs numerous domain-specific optimizations under the hood.
|
||
However, we can give the compiler a hint in this case to perform one more:
|
||
*inter-sequence alignment*. This optimization entails batching sequences
|
||
prior to alignment, then aligning multiple pairs using a very fast SIMD
|
||
optimized alignment kernel.
|
||
|
||
In Seq, we just need one additional function annotation to tell the compiler
|
||
to perform this optimization:
|
||
|
||
.. code:: seq
|
||
|
||
@inter_align
|
||
def align_and_output(t):
|
||
...
|
||
|
||
Let's run the program with and without this optimization:
|
||
|
||
.. code:: seq
|
||
|
||
# without @inter_align
|
||
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
|
||
# mapping took 43.4457s
|
||
|
||
# with @inter_align
|
||
seqc run section6.seq data/chr22.fa data/reads.fq > out.txt
|
||
# mapping took 32.3241s
|
||
|
||
(The timings with inter-sequence alignment will depend on the SIMD instruction
|
||
sets your CPU supports; these numbers are from using AVX2.)
|
||
|
||
.. _section6-code:
|
||
|
||
Full code listing
|
||
~~~~~~~~~~~~~~~~~
|
||
|
||
:download:`click to download <../../workshop/section6.seq>`
|
||
|
||
.. code:: seq
|
||
|
||
# SeqMap
|
||
# Seq workshop -- Section 6
|
||
# Reads index constructed in Section 2 and looks up k-mers from
|
||
# input reads to find candidate mappings, then performs alignment.
|
||
# Implemented with Seq pipelines using inter-seq. alignment.
|
||
# Usage: seqc run section6.seq <FASTA path> <FASTQ path>
|
||
from sys import argv
|
||
from time import timing
|
||
from bio import *
|
||
import pickle
|
||
import gzip
|
||
|
||
reference = s''
|
||
for record in FASTA(argv[1]):
|
||
reference = record.seq
|
||
|
||
K: Static[int] = 32
|
||
index = None
|
||
|
||
with gzip.open(argv[1] + '.index', 'rb') as jar:
|
||
index = pickle.load(jar, T=Dict[Kmer[K],int])
|
||
|
||
def find_candidates(record):
|
||
candidates = {} # position -> count mapping
|
||
for pos,kmer in record.read.kmers_with_pos(k=K, step=1):
|
||
found = index.get(min(kmer, ~kmer), -1)
|
||
if found > 0:
|
||
loc = found - pos
|
||
candidates[loc] = candidates.get(loc, 0) + 1
|
||
for pos,count in candidates.items():
|
||
if count > 1:
|
||
yield record, pos
|
||
|
||
@inter_align
|
||
def align_and_output(t):
|
||
record, pos = t
|
||
query = record.read
|
||
target = reference[pos:pos + len(query)]
|
||
alignment = query.align(target)
|
||
print(record.name, pos + 1, alignment.score, alignment.cigar)
|
||
|
||
with timing('mapping'):
|
||
FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output
|