Based on the patent titles (I'll see if I can read some of them in detail tomorrow), most of these sound almost exactly like the code I wrote while working at the JGI[1][2][3] in the early 2000s that managed moving large amounts of reads from the ABI (Sanger) sequencers, running them through phred/phrap, and storing it all so the biologists could access it easily. This included a custom Huffman-tree-based encoder/decoder to efficiently store FASTA files at (iirc) ~2.5 bits/base (quality scores were just stored as a packed array of bytes), a very large MySQL backend, and a large set of Perl libraries that provided easy access to reads/libraries/assemblies/etc. It was certainly a "method and apparatus" for "storing and accessing" + "indexing" bioinformatics data using a "compact representation" that provided many different types of "selective access".
I even had code that did an LD_PRELOAD hack on (circa 2002) Consed that intercepted calls to open(2) to load reads automagically from the DB. Reading Huffman-encoded data in bulk from the DB (instead of one file per read) reduced the network bandwidth required to open an assembly with all its aligned reads by ~90%. That sounds a lot like "transmission of bioinformatics data" over a network and "access ... structured in access units". It definitely involved "reconstruction of genomic reference sequences from compressed genomic sequence reads".
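The real hack lived in C at the dynamic-linker level, but the interception idea itself is easy to sketch; here it is translated into Python terms (the path, file contents, and function names are all illustrative, not from the original code):

```python
import builtins
import io

# Stand-in for the MySQL backend: a dict mapping read paths to FASTA text.
FAKE_DB = {"/reads/read42.fasta": ">read42\nACGTACGT\n"}

_real_open = builtins.open

def db_open(path, mode="r", *args, **kwargs):
    """Serve known read paths from the 'database'; pass everything else through."""
    if path in FAKE_DB and "r" in mode:
        return io.StringIO(FAKE_DB[path])
    return _real_open(path, mode, *args, **kwargs)

# Equivalent of the LD_PRELOAD shim: swap in the interposing open().
builtins.open = db_open
try:
    with open("/reads/read42.fasta") as fh:
        data = fh.read()
finally:
    builtins.open = _real_open  # always restore the real open()
```

The C version does the same thing one level down: a shared object defining `open()` that consults the DB first and falls back to the libc symbol via `dlsym(RTLD_NEXT, "open")`.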
They may have a more efficient compression method, and we didn't do anything re: "multiple genomic descriptors" (was that even a thing <2004?), but... no... they didn't invent what are basically bioinformatics-specific variations of the same methods used everywhere in the computer industry for as long as "text file formats" have existed.
[2] These are my personal comments and opinions only, which are not endorsed by or currently affiliated with the Joint Genome Institute, Lawrence Berkeley National Laboratory, or the U.S. Department Of Energy.
[3] While I have no idea if any of that code even exists today (I left the JGI in 2004), I did mark the source files with the BSD license, since there was historical precedent.
"N" for uNknown, where the raw sensor information - which should be a series of nice Gaussians[1 top] - was too ambiguous/low-quality for a useful basecall, but the presence of something can be inferred from the timing information[1 bottom, with "N"s in some bases]
An additional unused bit pattern was reserved for future emergency additions. It used the longest bit pattern, so any future additions wouldn't be optimally balanced, but I wanted something simple, with guaranteed binary compatibility, that could handle a sudden change in requirements.
In some situations, a variation was used that included a few symbols for partial basecalls like "unknown purine" (A or G) and "unknown pyrimidine" (C or T).
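A minimal sketch of a codebook along these lines (the exact bit patterns are my guess at the shape described, not the original JGI table):

```python
# Hypothetical prefix-free codebook: common bases get short codes, "N" and
# the partial basecalls get longer ones.
CODEBOOK = {
    "A": "00",
    "C": "01",
    "G": "10",
    "T": "110",
    "N": "1110",    # uNknown base
    "R": "11110",   # unknown purine (A or G)
    "Y": "111110",  # unknown pyrimidine (C or T)
    # "111111" left unassigned: the reserved longest-pattern escape
}

def encode(seq):
    return "".join(CODEBOOK[b] for b in seq)

def decode(bits):
    inverse = {v: k for k, v in CODEBOOK.items()}
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[cur])
            cur = ""
    return "".join(out)

# Sanity check: no code is a prefix of another, so decoding is unambiguous.
codes = list(CODEBOOK.values())
assert not any(a != b and b.startswith(a) for a in codes for b in codes)
```

The final all-ones pattern is the unassigned "emergency" slot described above.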
It's hard with Huffman given you need to deal with N. Realistically you'll end up with 3 bases at 2 bits, 1 at 3 bits, and the remaining 3-bit pattern being a prefix for everything else (N, ambiguity codes, etc.), so averaging somewhere close to 2.3 bits/base is the norm.
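A quick back-of-the-envelope for that figure, assuming roughly uniform base frequencies plus ~1% N (all numbers illustrative):

```python
# Code shape described above: three bases at 2 bits, one at 3 bits, and the
# remaining 3-bit pattern ("111") prefixing everything else; assume escaped
# symbols like N cost 6 bits total. Frequencies are made up, not measured.
freqs   = {"A": 0.2475, "C": 0.2475, "G": 0.2475, "T": 0.2475, "N": 0.01}
lengths = {"A": 2, "C": 2, "G": 2, "T": 3, "N": 6}

avg = sum(freqs[s] * lengths[s] for s in freqs)  # expected bits per base
print(f"~{avg:.2f} bits/base")
```

With these numbers the expectation lands just under 2.3 bits/base; rarer N pushes it toward 2.25, more ambiguity codes push it up.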
If N is rare though, you're better off just doing blocks of 2-bit encoding and dropping to something more complex for the rare cases. Or of course just using an arithmetic / range / ANS coder.
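A sketch of the "2-bit blocks plus exceptions" idea: pack A/C/G/T at exactly 2 bits each and record the rare non-ACGT symbols as (position, symbol) pairs on the side (names and layout are mine, not from any real tool):

```python
B2 = {"A": 0, "C": 1, "G": 2, "T": 3}
B2_INV = "ACGT"

def pack(seq):
    # Rare symbols (N, ambiguity codes) go in a side list; their slots in
    # the packed stream are filled with a dummy "A".
    exceptions = [(i, s) for i, s in enumerate(seq) if s not in B2]
    packed = bytearray()
    acc, nbits = 0, 0
    for s in seq:
        acc = (acc << 2) | B2.get(s, 0)
        nbits += 2
        if nbits == 8:
            packed.append(acc)
            acc, nbits = 0, 0
    if nbits:
        packed.append(acc << (8 - nbits))  # left-align the final partial byte
    return bytes(packed), len(seq), exceptions

def unpack(packed, n, exceptions):
    out = []
    for i in range(n):
        byte = packed[i // 4]
        out.append(B2_INV[(byte >> (6 - 2 * (i % 4))) & 3])
    for i, s in exceptions:  # patch the rare symbols back in
        out[i] = s
    return "".join(out)
```

If N really is rare, the exception list stays tiny and the stream sits at an even 2 bits/base, which is also trivially seekable.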
That's right - the Huffman tree was built based on the base frequency of use, which actually varies depending on what type of organism you're sequencing[1].
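A minimal sketch of building such a frequency-tuned code (the frequencies here are illustrative, e.g. an AT-rich organism, not real genome statistics):

```python
import heapq

def huffman_lengths(freqs):
    """Return {symbol: code length} for a Huffman code over freqs."""
    # Heap entries: (weight, tiebreak, {symbol: depth_so_far}); the int
    # tiebreak keeps tuple comparison away from the dicts.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

at_rich = {"A": 0.33, "T": 0.33, "G": 0.16, "C": 0.16, "N": 0.02}
lengths = huffman_lengths(at_rich)
avg = sum(at_rich[s] * lengths[s] for s in at_rich)
```

Feed it GC-rich frequencies instead and the short codes migrate to G/C, which is exactly why a per-organism tree beats a fixed one.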
The title is just a title, it doesn't have any legal significance. These patents address a specific encoding that is compact and indexed, not the general idea of such encodings. The patents expressly distinguish the claimed method from the approach you're describing (applying traditional compression techniques to FASTA/FASTQ files): https://patentscope.wipo.int/search/en/detail.jsf?docId=WO20...
> [0003] The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
(Note that GZIP, mentioned in the patent as an example of the prior art, uses Huffman coding.)
The patent also expressly distinguishes using an index separate from the bitstream of the genomic data, which would be the case in your method above where you're storing compressed data in a MySQL database:
> 1. For CRAM, data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it's implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded bit stream.
It also explains:
> [0006] The present invention aims at compressing genomic sequences by organizing and partitioning data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are enabled.
Without knowing more about it, it sounds like the approach you describe wouldn't allow for incremental updates.
[1] https://jgi.doe.gov/