Based on the patent titles (I'll see if I can read some of them in detail tomorrow), most of these sound almost exactly like the code I wrote while working at the JGI[1][2][3] in the early 2000s that managed moving large amounts of reads from the ABI (Sanger) sequencers, running them through phred/phrap, and storing it all so the biologists could access it easily. This included a custom Huffman-tree-based encoder/decoder to efficiently store FASTA files at (iirc) ~2.5 bits/base (quality scores were just stored as a packed array of bytes), a very large MySQL backend, and a large set of Perl libraries that provided easy access to reads/libraries/assemblies/etc. It was certainly a "method and apparatus" for "storing and accessing" + "indexing" bioinformatics data using a "compact representation" that provided many different types of "selective access".
I even had code that did an LD_PRELOAD hack on (circa 2002) Consed that intercepted calls to open(2) to load reads automagically from the DB. Reading Huffman-encoded data in bulk from the DB (instead of one file per read) reduced the network bandwidth required to open an assembly with all its aligned reads by ~90%. That sounds a lot like "transmission of bioinformatics data" over a network and "access ... structured in access units". It definitely involved "reconstruction of genomic reference sequences from compressed genomic sequence reads".
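The real hack lived in C at the dynamic-linker level, but the interception idea itself is easy to sketch; here it is translated into Python terms (the path, file contents, and function names are all illustrative, not from the original code):

```python
import builtins
import io

# Stand-in for the MySQL backend: a dict mapping read paths to FASTA text.
FAKE_DB = {"/reads/read42.fasta": ">read42\nACGTACGT\n"}

_real_open = builtins.open

def db_open(path, mode="r", *args, **kwargs):
    """Serve known read paths from the 'database'; pass everything else through."""
    if path in FAKE_DB and "r" in mode:
        return io.StringIO(FAKE_DB[path])
    return _real_open(path, mode, *args, **kwargs)

# Equivalent of the LD_PRELOAD shim: swap in the interposing open().
builtins.open = db_open
try:
    with open("/reads/read42.fasta") as fh:
        data = fh.read()
finally:
    builtins.open = _real_open  # always restore the real open()
```

The C version does the same thing one level down: a shared object defining `open()` that consults the DB first and falls back to the libc symbol via `dlsym(RTLD_NEXT, "open")`.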
They may have a more efficient compression method, and we didn't do anything re: "multiple genomic descriptors" (was that even a thing <2004?), but... no... they didn't invent what are basically bioinformatics-specific variations of the same methods used everywhere in the computer industry for as long as "text file formats" have existed.
[2] These are my personal comments and opinions only, which are not endorsed by or currently affiliated with the Joint Genome Institute, Lawrence Berkeley National Laboratory, or the U.S. Department Of Energy.
[3] While I have no idea if any of that code even exists today (I left the JGI in 2004), I did mark the source files with the BSD license, since there was historical precedent.
"N" for uNknown, where the raw sensor information - which should be a series of nice Gaussians[1 top] - was too ambiguous/low-quality for a useful basecall, but the presence of something can be inferred from the timing information[1 bottom, with "N"s in some bases]
An additional unused bit pattern was reserved for future emergency additions. It used the longest bit pattern, so any future additions wouldn't be optimally balanced, but I wanted something simple, with guaranteed binary compatibility, that could handle a sudden change in requirements.
In some situations, a variation was used that included a few symbols for partial basecalls like "unknown purine" (A or G) and "unknown pyrimidine" (C or T).
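A minimal sketch of a codebook along these lines (the exact bit patterns are my guess at the shape described, not the original JGI table):

```python
# Hypothetical prefix-free codebook: common bases get short codes, "N" and
# the partial basecalls get longer ones.
CODEBOOK = {
    "A": "00",
    "C": "01",
    "G": "10",
    "T": "110",
    "N": "1110",    # uNknown base
    "R": "11110",   # unknown purine (A or G)
    "Y": "111110",  # unknown pyrimidine (C or T)
    # "111111" left unassigned: the reserved longest-pattern escape
}

def encode(seq):
    return "".join(CODEBOOK[b] for b in seq)

def decode(bits):
    inverse = {v: k for k, v in CODEBOOK.items()}
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[cur])
            cur = ""
    return "".join(out)

# Sanity check: no code is a prefix of another, so decoding is unambiguous.
codes = list(CODEBOOK.values())
assert not any(a != b and b.startswith(a) for a in codes for b in codes)
```

The final all-ones pattern is the unassigned "emergency" slot described above.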
It's hard with Huffman given you need to deal with N. Realistically you'll end up with 3 bases at 2 bits, 1 at 3 bits, and the remaining 3-bit pattern being a prefix for everything else (N, ambiguity codes, etc.), so averaging somewhere close to 2.3 bits/base is the norm.
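A quick back-of-the-envelope for that figure, assuming roughly uniform base frequencies plus ~1% N (all numbers illustrative):

```python
# Code shape described above: three bases at 2 bits, one at 3 bits, and the
# remaining 3-bit pattern ("111") prefixing everything else; assume escaped
# symbols like N cost 6 bits total. Frequencies are made up, not measured.
freqs   = {"A": 0.2475, "C": 0.2475, "G": 0.2475, "T": 0.2475, "N": 0.01}
lengths = {"A": 2, "C": 2, "G": 2, "T": 3, "N": 6}

avg = sum(freqs[s] * lengths[s] for s in freqs)  # expected bits per base
print(f"~{avg:.2f} bits/base")
```

With these numbers the expectation lands just under 2.3 bits/base; rarer N pushes it toward 2.25, more ambiguity codes push it up.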
If N is rare though, you're better off just doing blocks of 2-bit encoding and dropping to something more complex for the rare cases. Or of course just using an arithmetic / range / ANS coder.
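A sketch of the "2-bit blocks plus exceptions" idea: pack A/C/G/T at exactly 2 bits each and record the rare non-ACGT symbols as (position, symbol) pairs on the side (names and layout are mine, not from any real tool):

```python
B2 = {"A": 0, "C": 1, "G": 2, "T": 3}
B2_INV = "ACGT"

def pack(seq):
    # Rare symbols (N, ambiguity codes) go in a side list; their slots in
    # the packed stream are filled with a dummy "A".
    exceptions = [(i, s) for i, s in enumerate(seq) if s not in B2]
    packed = bytearray()
    acc, nbits = 0, 0
    for s in seq:
        acc = (acc << 2) | B2.get(s, 0)
        nbits += 2
        if nbits == 8:
            packed.append(acc)
            acc, nbits = 0, 0
    if nbits:
        packed.append(acc << (8 - nbits))  # left-align the final partial byte
    return bytes(packed), len(seq), exceptions

def unpack(packed, n, exceptions):
    out = []
    for i in range(n):
        byte = packed[i // 4]
        out.append(B2_INV[(byte >> (6 - 2 * (i % 4))) & 3])
    for i, s in exceptions:  # patch the rare symbols back in
        out[i] = s
    return "".join(out)
```

If N really is rare, the exception list stays tiny and the stream sits at an even 2 bits/base, which is also trivially seekable.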
That's right - the Huffman tree was built based on the base frequency of use, which actually varies depending on what type of organism you're sequencing[1].
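A minimal sketch of building such a frequency-tuned code (the frequencies here are illustrative, e.g. an AT-rich organism, not real genome statistics):

```python
import heapq

def huffman_lengths(freqs):
    """Return {symbol: code length} for a Huffman code over freqs."""
    # Heap entries: (weight, tiebreak, {symbol: depth_so_far}); the int
    # tiebreak keeps tuple comparison away from the dicts.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

at_rich = {"A": 0.33, "T": 0.33, "G": 0.16, "C": 0.16, "N": 0.02}
lengths = huffman_lengths(at_rich)
avg = sum(at_rich[s] * lengths[s] for s in at_rich)
```

Feed it GC-rich frequencies instead and the short codes migrate to G/C, which is exactly why a per-organism tree beats a fixed one.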
The title is just a title, it doesn't have any legal significance. These patents address a specific encoding that is compact and indexed, not the general idea of such encodings. The patents expressly distinguish the claimed method from the approach you're describing (applying traditional compression techniques to FASTA/FASTQ files): https://patentscope.wipo.int/search/en/detail.jsf?docId=WO20...
> [0003] The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
(Note that GZIP, mentioned in the patent as an example of the prior art, uses Huffman coding.)
The patent also expressly distinguishes using an index separate from the bitstream of the genomic data, which would be the case in your method above where you're storing compressed data in a MySQL database:
> 1. For CRAM, data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it's implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded bit stream.
It also explains:
> [0006] The present invention aims at compressing genomic sequences by organizing and partitioning data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are enabled.
Without knowing more about it, it sounds like the approach you describe wouldn't allow for incremental updates.
[1] https://jgi.doe.gov/