when i worked on https://github.com/iontorrent/tmap we thought it would be a good idea to do something like a “local alignment” (using https://en.wikipedia.org/wiki/Smith–Waterman_algorithm) after doing a lookup into a burrows wheeler transform on a substring of the “read.”
I'm curious: since there are only 4 bases in DNA, for genomic data, this seems rather inefficient. Is there any advantage in encoding the DNA with two bits per nucleotide?
source for 3.2 billion: https://www.ncbi.nlm.nih.gov/books/NBK21134/#!po=0.485437
In practice BWT alignment based tools may use a forward-index and a mirror-index of the reversed genome string (not reverse complemented). This dual index approach is important for dealing with mismatches strings. There's a nice example explaining this for an older tool named Bowtie [2]
With a two bit encoding and both indices it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].
[1] https://en.wikipedia.org/wiki/FM-index [2] https://academic.oup.com/bioinformatics/article/25/14/1754/2... [3] https://academic.oup.com/bioinformatics/article/25/14/1754/2...
There are some great computational benefits using 2 bit encoding for the BWT