From high-throughput sequencing mapped reads (SAM/BAM), GeneAbacus:
Flexibility: Analyzing standard and novel high-throughput sequencing (HTS) experiments. High-throughput sequencing has been adapted to measure a variety of biological traits in both DNA and RNA. Some methods have become standard such as mRNA-seq, some are becoming standard like Ribo-seq or CLIP-seq, and some require unique and specialized designs, as in MPRAs. GeneAbacus provides many types of count and profile methods adapted for many of these methods. Simultaneously, new profile methods can easily be added using the existing methods as templates.
Mapping coordinates: DNA and RNA profiles. HTS experiments are usually mapped to the genome. From the mapped reads, tools providing genome coverage are very useful to generate genomic profiles used for data visualizing in genome browsers. However, while downstream analyses for experiments based on genomic DNA, such as ChIP-seq, can directly use genome coverage of mapped reads, analysis of RNA-based experiments are less straightforward. Most RNAs are strand-specific and/or spliced. Using an algorithm to map coordinates from genomic DNA to RNA, GeneAbacus generates strand-specific RNA profiles. RNA profiles can then be directly used for downstream analysis.
Flexibility: Output formats. GeneAbacus can export to BedGraph files and in text form (CSV). While BedGraph is widely used, making it appropriate for sharing data, it is not trivial to parse these files efficiently. To overcome this hurdle, GeneAbacus exports to an ad-hoc binary format. The binary format is a simple memory dump, together with a checksum for integrity. This makes it fast and easy to load data for downstream analysis in Python or other environments.
Speed: High speed is achieved using Go, parallelization and favoring unsorted over sorted SAM/BAM. To profile and count, GeneAbacus goes through the mapped reads by processing the input SAM/BAM file(s) from beginning to end. This is true for unsorted and sorted SAM/BAMs. GeneAbacus tries to overlap each mapped read (or pair of reads) with a feature. To speed up the overlapping operations, the GeneAbacus core algorithm uses a binary search tree built using the features represented as intervals of coordinates.
Most tools (such as BEDtools genomecov or mosdepth) require sorted BAM files. This imposes computationally expensive steps to convert SAM files into sorted BAM files using samtools. GeneAbacus avoids these expensive steps by sorting the features instead of the reads. This strategy has two advantages:
Currently, GeneAbacus can use SAM or BAM files containing:
We recommend using compressed SAM files: they save as much space as BAM files and are processed as fast as BAM files when used with GeneAbacus.
GeneAbacus is parallelized using Go channels.
See refs page for tarball and executable.
geneabacus -path_bam "input.bam" \ -path_features "danrer_cdna_all_ensembl104.fon1.json" \ -read_strand "-" \ -count_multis "1,900"
Computes read counts and RPKMs for mRNA-seq to
input.bam. In this example, the library was prepared using a dUTP protocol that produces antisense first reads to RNAs (set using option
-read_strand). Using option
-count_multis, two counts are generated per gene: one with reads mapping uniquely ("1") to the genome and one with reads mapping uniquely and those mapping to up to 900 loci (each counting 1/n).
geneabacus -path_sam "input.sam.zst" \ -sam_command_in "zstdcat" \ -path_features "danrer_cdna_protein_coding_rpf_cds_exons_ensembl104.fon1.json" \ -read_length "28,29" \ -read_strand "+" \ -profile_type "first" \ -profile_multi "900" \ -profile_norm
The input SAM is compressed using Zstandard. Ribosome-protected fragments (sense to mRNA) of length 28 and 29 nucleotides, mapping a maximum of 900 times in the genome (each counting 1/n) are added to mRNA profiles using the first position of each read. Profiles are normalized to RPM. By default, profiles are output to
profiles.bedgraph in BedGraph format. For a more convenient format, see our binary format below which is easy to import into Python for downstream analysis.
geneabacus -path_bam "input.bam" \ -path_features "danrer_genome_all_ensembl_grcz11_chrom_length.tab" \ -format_features "tab" \ -profile_type "all" \ -profile_norm \ -profile_multi 1 \ -profile_no_coord_mapping
Genomic profiles are directly viewable as genome browser tracks (default output format is BedGraph). The
all profile type generates genomic "coverage". In this case, the profiles are generated for each chromosome (in the
tab file). Since chromosomes are the same features used for mapping the reads, there is no need to map coordinates from mapped genomic features to genomic profiles: use
-profile_no_coord_mapping to skip this step.
-path_bamPath to BAM file(s). Multiple files can be specified using a comma separated list.
-path_samPath to SAM file(s). Multiple files can be specified using a comma separated list.
-sam_command_inCommand line to execute for opening each SAM file (comma separated). For example
-sam_command_in zstdcatto open a Zstandard-zipped file (
-pairedfor pair-end sequencing. SAM/BAM with paired reads must be unsorted, so that the reads of each pair are next to each other.
Filtering mapped reads
-fragment_min_lengthMinimum fragment length
-fragment_max_lengthMaximum fragment length
-read_lengthComma separated list of specific read length(s)
-read_min_mapping_qualityMinimum read mapping quality (5th column in SAM, MAPQ)
-read_in_proper_pairOnly read in proper pairs (default: all pairs) (2nd column in SAM, 0x2 flag)
-rand_proportionRandomly select a proportion of all reads (from 0. to 1.).
0.5will keep 50% of the reads/pairs.
-read_min_overlapMinimum total overlap of the read with the feature interval(s) (default 10) to be counted and included into the profile.
-read_strandSpecify strandness of the sequenced library by setting the orientation of read 1, i.e. + or - or unstranded if empty.
-path_featuresPath to features file
-format_features FON(default) FON format Within FON, specify where to find the name of the features, their chromosome, coordinate list (such as exons), and strand:
-fon_nameFON key for feature name (default "transcript_stable_id")
-fon_chromFON key for chromosome or locus (default "chrom")
-fon_coordsFON key for coordinates (exons for example) (default "exons")
-fon_strandFON key for strand (default "strand")
-format_features tabTabulated file
-feature_strandDefault feature strand (default "+")
-path_mappingPath to feature name(s) mapping (tabulated file). For example, if features are chromosomes, this file can be used to translate chromosome/contig names from Ensembl to UCSC names.
Filtering features. Filter input reads. Only reads mapping to features within the filter will be further processed. All options have the same name of the main features with the added suffix
_filter. For example, filter features can be used to filter out reads from a Ribo-seq experiment that are not mapped to exons. Remaining reads can then be counted within coding-sequences (CDS) using
-path_features_filterPath to filtering features file
-format_features_filter FON(default) FON format Within FON, specify where to find the name of the features, their chromosome, coordinate list (such as exons), and strand:
-fon_name_filterFON key for feature name (default "transcript_stable_id")
-fon_chrom_filterFON key for chromosome or locus (default "chrom")
-fon_coords_filterFON key for coordinates (exons for example) (default "exons")
-fon_strand_filterFON key for strand (default "strand")
-format_features_filter tabTabulated file
-feature_strand_filterDefault feature strand (default "+")
-include_missing_in_filterCommon use cases of filter require each feature to be in the main (specified with
-path_features) and the filter (specified with
-path_features_filter) features. By default, GeneAbacus will check that each feature from the main can also be found in the filter using the feature name.
-include_missing_in_filterremoves this check allowing for features to be absent in the filter.
-ignore_nh_tagSoftware mapping reads, notably RNA such as STAR, add an NH tag in the optional fields of SAM files. The NH tag holds the total number of hits. With this option, any NH tag will be ignored and all alignments will be considered unique (i.e. NH=1).
-path_reportWrite a report to this path (stdout with
-path_sam_outPath to saved SAM file containing read(s) included in counts or profiles
-appendInstead of creating new count and/or profile files and eventually overwriting existing files, this option will open existing files using APPEND mode, and append content at the end of existing files.
-count_multisMultiple counts can be generated including reads/pairs mapping uniquely ("1") to the genome or not. A read multiplicity of "2" will include reads/pairs mapping uniquely and those mapping to two loci. Multiple counts are specified using a comma separated list (default "1,2,900").
-count_totalsTotals used for normalization such as computing RPKM are calculated by the program. If desired, totals can be specified by the user as comma separated list of totals. This list must have the same number of totals as multiplicity in the
-count_total_real_readTotals used for normalization such as computing RPKM are calculated as the number of alignments intersecting with the features. Each alignment is weighted by their multiplicity (number of hits for the read from the NH tag) so that a read will count 1/NH for each alignment. While this approach is acceptable, this calculation is an approximation: the NH tag is computed genome-wide while most counts are not (on the transcriptome for example). The
-count_total_real_readoption calculates the real total number of reads by counting the reads intersecting the features using their name. Be aware, this option requires large amounts of RAM.
-count_pathPath to counts output (default
-count_in_profileOnly count reads included in the profiles
-profile_pathsPath to profile output(s) (comma separated) (default
-profile_formatsProfile output format. Available formats are bedgraph, binary' or csv (default bedgraph). Multiple formats can be set as comma separated list. The number of formats and output paths (in
-profile_paths) must be the same.
-profile_multiMaximum alignment multiplicity to include a read in the profile (default 900). See
-profile_normBy default, profiles contains the number of reads per nucleotide. With
-profile_norm, profile counts are normalized using total reads to RPM.
-profile_no_coord_mappingSkip coordinate mapping from input to feature to speed things up. This option requires input reads and features to be within the same coordinate system (for example reads mapped to a genome and features being chromosomes). It only produces profile in the same orientation as the input and convenient to generate genomic profiles.
-profile_overhangOverhang length to add to each side of the profiles
-profile_type Profile type: first, last, first-last, position, all, all-extension or all-slice
-profile_no_untemplatedInclude only reads w/o untemplated nucleotide in the profile
-profile_untemplatedRemove maximum untemplated nucleotides
-profile_position_fractionFraction of position between start and end for position profile (default 0.5)
-num_workerNumber of worker(s) to run in parallel (default 1)
-verboseVerbose (adapt how much verbose is the output using
-versionPrint version and quit
The profile binary format consists of a header followed by the profile of each feature concatenated together. Data is stored as a raw sequence of bytes (in little-endian order) in a file with the
In the current version (3), the header contains 3 fields:
How to split the concatenated profiles into individual profiles is not included in the binary file. A list of the length of each profile from a FON1, BED, or TAB file is required. A checksum insures that the same list of lengths is used at the creation of the binary file and when reading it. The checksum is computed by concatenating the lengths of each profile (uint32) into a raw sequence of bytes, of which an Adler-32 checksum is calculated.
For reducing storage requirements, binary files are compressed using LZ4. Since most genomic profiles usually have many zeros, LZ4 offers fast decompression speed and a high compression rate. Other or no algorithm can easily be employed. Using this compression, binary profiles are not indexed and are intended to be fully loaded into RAM to be used in downstream analysis.
import geneabacus.profileio profiles = geneabacus.profileio.pfopen('profiles.bin.lz4', 'danrer_cdna_protein_coding_ensembl104.fon1.json')
To get a transcript profile:
profiles['ENSDART00000000486'] # will return array([0., 0., 21., ..., 0., 3., 0.], dtype=float32)
GeneAbacus is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).
Copyright (C) 2015-2022 Charles E. Vejnar