Analysis of alternative polyadenylation (APA) from RNA-seq data (human and mouse). QAPA consists of two main components:
- Extraction and annotation of 3′ UTR sequences from gene models
- Calculation of relative usage of alternative 3′ UTR isoforms based on transcript-level abundance.
Note that QAPA itself does not perform transcript quantification. It relies on other tools such as Sailfish and Salmon.
QAPA consists of both Python and R scripts. A conda virtual environment
can be created using the provided environment.yml file.
-
Clone the repo:
git clone https://github.com/morrislab/qapa.git cd qapa -
(Optional) Install
mambafor faster Conda managementconda install -c conda-forge mamba -
Create the environment
mamba env create -f environment.yml conda activate qapa -
Test that
qapacommand is availableqapa --help
QAPA has three sub-commands:
build: Generate a 3′ UTR library from annotationsfasta: Extract sequences for indexing by transcript quantification toolsquant: Calculate relative usage of alternative 3′ UTR isoforms
Pre-compiled libraries for human and mouse are available for download below. Otherwise, continue reading to build from scratch.
Updated in v1.3.0, the following libraries are pre-compiled with Polyasite V2:
(Old) Versions v1.2.3 and earlier:
To run build, gene and poly(A) annotation sources need to be prepared:
A. Gene annotation
-
Ensembl gene metadata table from Biomart.
Human and mouse tables are provided in the
examplesfolder. To obtain a fresh copy, download a table of Ensembl Genes from Biomart with the following attributes:- Ensembl Gene ID
- Ensembl Transcript ID
- Gene Type
- Transcript Type
- Gene Name
Alternatively, download via MySQL (see here for more details):
mysql --user=anonymous --host=martdb.ensembl.org --port=5316 -A ensembl_mart_89 \ -e "select stable_id_1023 as 'Gene stable ID', stable_id_1066 as 'Transcript stable ID', \ biotype_1020 as 'Gene type', biotype_1064 as 'Transcript type', \ display_label_1074 as 'Gene name' from mmusculus_gene_ensembl__transcript__main" \ > ensembl_identifiers.txtTo change the species, replace the table name (e.g. for human, use
hsapiens_gene_ensembl__transcript__main). -
GENCODE gene prediction annotation table
Download from UCSC Table Browser or alternatively via MySQL (see here for more details). For example, to download mm10 gene predictions:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A \ -e "select * from wgEncodeGencodeBasicVM9" mm10 > gencode.basic.txtAlternatively, if you are starting from a GTF/GFF file, you can convert it to genePred format using the UCSC tool
gtfToGenePred:gtfToGenePred -genePredExt custom_genes.gtf custom_genes.genePredNote that it is important to include the
-genePredExtoption!
B. Poly(A) site annotation (optional)
This step can be skipped, otherwise continue reading. Two options are available:
Option 1: standard approach using PolyASite and GENCODE poly(A) track (as described in the paper)
-
PolyASite database
Download BED files (human or mouse) from http://polyasite.unibas.ch/.
-
GENCODE poly(A) sites track
Download from UCSC Table Browser or alternatively via MySQL. For example, to download mm10 annotations:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A \ -e "select chrom, txStart, txEnd, name2, score, strand \ from wgEncodeGencodePolyaVM9 where name2 = 'polyA_site'" -N mm10 \ > gencode.polyA_sites.bed
Option 2: use custom BED track of poly(A) sites
A custom BED file of poly(A) sites can be used to annotate 3′ UTRs. Each entry must contain the start (0-based) and end coordinate of a poly(A) site.
Once the data files have been prepared, we can then use build to create the 3'
UTR library. See qapa build -h for usage details. The following
describes several example use cases:
-
To extract 3′ UTRs from annotation, run:
qapa build --db ensembl_identifiers.txt -g gencode.polyA_sites.bed -p clusters.mm10.bed gencode.basic.txt > output_utrs.bed -
If using a custom BED file, replace the
-gand-poptions with-o:qapa build --db ensembl_identifiers.txt -o custom_sites.bed gencode.basic.txt > output_utrs.bed -
If using a custom genePred file converted from GTF, supply the file as in 1. (e.g. the first positional argument):
qapa build --db ensembl_identifiers.txt -o custom_sites.bed custom_genes.genePred > output_utrs.bed -
If bypassing the poly(A) annotation step, include the
-Noption:qapa build -N --db ensembl_identifiers.txt gencode.basic.txt > output.utrs.bed
Results will be saved in the file output_utrs.bed (default is STDOUT).
It is important that the sequence IDs are not modified as it will be parsed by
quant below.
Additional notes:
- 3' UTRs that contain introns will be skipped.
- Chromosome names that contain underscores are currently not supported and will be skipped.
- Only genes with
Gene Type = 'protein_coding'will be considered.
- Use
--debugoption to produce more verbose logging messages - Use
--saveand--temp <dir_path>to save intermediate files generated byqapa build.<dir_path>should be a user accessible directory.
To extract sequences from the BED file prepared by build, a reference genome in
FASTA format is required. e.g. http://hgdownload.soe.ucsc.edu/downloads.html.
Then, run the command:
qapa fasta -f genome.fa output_utrs.bed output_sequences.fa
Essentially fasta is a wrapper that calls bedtools getfasta. Note that
genome.fa must be uncompressed. Sequences will be saved in
output_sequences.fa.
Since the initial publication, Salmon added a 'selective alignment' procedure (Srivastava, Malik et al., Genome Biology, 2020). The 'full' selective alignment jointly aligns the reads to the input transcriptome and the genome, discarding alignments that map better to the genome than the transcriptome. To generate a fasta file compatible with Salmon's selective alignment procedure (a 'decoy-aware transcriptome'), run qapa fasta with the --decoys flag:
qapa fasta -f genome.fa --decoys output_utrs.bed output_sequences.fa
This will save the transcript sequences appended with the genome sequence from genome.fa to output_sequences.fa. This also generates a file named decoys.txt, which contains the sequence identifiers from the genome sequence to differentiate target transcripts from decoy sequences. The name of the text file can be controlled using the optional -d or --decoys-output-file argument.
Expression quantification of 3′ UTR isoforms must be carried out first using the
FASTA file prepared by fasta as the index. For example, to index the sequences
using Salmon:
salmon index -t output_sequences.fa -i utr_library
If you generated a genome-augmented fasta file with qapa fasta --decoys, additionally pass the decoys.txt file to salmon index:
salmon index -t output_sequences.fa -i utr_library --decoys decoys.txt
Following expression quantification, QAPA expects the results to be located inside its own sub-directory. For example, typical Sailfish/Salmon results may appear with the following directory structure:
project/
|-- sample1/quant.sf
|-- sample2/quant.sf
|-- (etc.)
The quant sub-command calls two R scripts:
create_merged_table.R and compute_pau.R. The first script combines the
quantifications from each sample into a single table. The second script computes
the relative proportion of each isoform in a gene, measured as Poly(A) Usage
(PAU) (compute_pau.R).
qapa quant --db ensembl_identifiers.mm10.txt project/sample*/quant.sf > pau_results.txt
Results will be saved in the file pau_results.txt (default is STDOUT).
For advanced usage, the R scripts can be run on its own. Run
create_merged_table.R -h and compute_pau.R -h for usage details.
The final output format is as follows:
| Column | Description |
|---|---|
| APA_ID | unique identifier consisting of in the format <Ensembl Gene ID>_<number>_<(P|D|S)>, where P = proximal, D = distal, and S = single |
| Transcript | one or more Ensembl Transcript IDs |
| Gene | Ensembl Gene ID |
| Gene_Name | gene symbol |
| Chr | chromosome |
| LastExon.Start | start coordinate of last exon |
| LastExon.End | end coordinate of last exon |
| Strand | + or - |
| UTR3.Start | start coordinate of 3′ UTR |
| UTR3.End | end coordinate of 3′ UTR |
| Length | length of the 3′ UTR |
| Num_Events | number of PAS per gene |
| sample1.PAU | PAU estimate for sample1 |
| sample2.PAU | PAU estimate for sample2 |
| sample1.TPM | TPM estimate for sample1 |
| sample2.TPM | TPM estimate for sample2 |
Ha, K.C.H., Blencowe, B.J., Morris, Q. (2018). QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biol. 19, 45. [link]