RNA Journal Club 5/27/10
Harm van Bakel, Corey Nislow, Benjamin J. Blencowe, Timothy R. Hughes
PLoS Biology, 8 (5): e1000371, 18 May 2010.
This week’s illuminating summary and analysis by Igor Ulitsky. It’s Igor’s second contribution to the blog:
The paper is one of at least five interesting RNA-Seq papers that came out in the past month or so (see also Guttman et al. and Trapnell et al. in the May issue of Nature Biotechnology, Kim et al. from Nature 465 (7295): 182-7 and De Santa et al. from the same issue of PLoS Biology). All these studies harness the awesome power of Illumina RNA-Seq to look at (mainly) the murine transcriptome and to try to figure out what it consists of, and how what RNA-Seq tells us differs from what we knew previously. Unfortunately, since the reads in those studies are still shorter than 75 nt, RNA-Seq still can’t tell us exactly what the transcripts in the cell are, but rather what regions of the genome seem to give rise to RNA (or, more precisely, which regions of the genome we can uniquely align reads to). This problem is partially alleviated by the paired-end RNA-Seq used in the Nature Biotech papers, but it still has limited power for deciphering transcripts expressed at low levels.
The five studies mentioned above tell three different stories about the transcriptome. The two Nature Biotech papers talk about how it is possible to identify thousands of novel exons in known genes, and also to give significantly more accurate exonic structures to some of the previously proposed long non-coding RNAs (lincRNAs). Kim et al. and De Santa et al. talk about a surprising amount of RNA coming from enhancer regions in the mouse genome, RNA whose exact function remains a mystery. The study we’re focusing on, van Bakel et al., tackles a more global question: how much of the polyA+ RNA comes from “known genes”, and how much from everything else, the “dark matter”? This question is naturally of high interest, but addressing it involves a wealth of caveats:
- What are “known genes” (i.e., “non-dark matter”) – protein-coding ones? miRNAs? Coding and non-coding ones with known functions?
- What expression levels can be considered functional? Are all transcripts with relatively low expression levels just noise?
- Are short RNA-Seq reads really informative in terms of the number of different RNA species?
- Are we confident enough in the annotation of the genome with pseudogenes and repeats, both of which can contribute to spurious mappings in intergenic regions?
Despite these caveats, van Bakel et al. do a thorough job of at least trying to answer this question, and do their best to convince the readers that, in fact, very little transcription from the mammalian genome is “dark matter”. In the RNA-Seq data, the majority of the genome does not seem to give rise to detectable polyA+ RNA segments. Then how did ENCODE and related studies report as much as 80% of the genome as being transcribed? The authors begin by showing that tiling arrays are prone to many false positive calls. They do so by comparing their own tiling array data from human and mouse tissues to published and novel RNA-Seq datasets. Unfortunately, the sets are not completely matched (different labs/starting materials), but the data very convincingly show that for transcripts with low expression levels, the signal from tiling arrays is practically the same as the background: fertile ground for false positive calls. Interestingly, the first part of the paper in fact shows that the data the authors generated themselves (the tiling arrays) are worse than previously published data (RNA-Seq).
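To make the “how much of the genome is transcribed” number concrete: at its simplest it is the fraction of bases covered by at least one aligned read or segment. Here is a minimal Python sketch of that calculation, assuming half-open intervals on a single toy chromosome; the function name and the coordinates are made up for illustration, and this is not the authors’ pipeline.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of bases covered by at least one half-open [start, end) interval."""
    covered = 0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged block: close it and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


# Toy example: three aligned segments on a 1,000-base "genome".
toy_segments = [(0, 50), (30, 80), (200, 250)]
toy_fraction = covered_fraction(toy_segments, 1000)
print(toy_fraction)
```

Overlapping segments are merged before counting, so deep coverage of a few loci does not inflate the fraction. How a platform decides that a base is “covered” in the first place (an array probe above background versus a uniquely aligned read) is exactly where the tiling-array and RNA-Seq estimates part ways.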
From that point on, the authors focus on RNA-Seq data. They find relatively few completely intergenic stand-alone transcripts that are not captured in some way in the “known genes” databases or at least in existing EST/mRNA collections. This is not very surprising given the effort involved in sequencing ESTs in mouse/human: it could hardly be expected that many polyadenylated transcripts would be abundant in RNA-Seq but missing from those datasets. It should be kept in mind, though, that many of the transcripts the authors call “known genes” are in fact non-coding transcripts (based on lack of a long/conserved ORF) with completely unclear function. What about the sequence fragments (seqfrags) that do fall outside of the “known genes” boundaries? About 80% of those reads fall within 10kb of known genes and are likely to represent either unannotated parts of those genes, or transcripts whose biogenesis or function is related to the adjacent gene, as their expression is generally highly correlated with that of their neighbors. What about the rest? Are there any interesting RNAs out there in the intergenic space? Well, there are some: the authors identify about 11,000-16,000 seqfrags that are located >10kb away from any known gene and that are significantly different from expected. The novel intergenic transcripts tend to overlap regions of open chromatin (identified using DNase I hypersensitivity), which suggests that at least some of them could be the enhancer-associated transcripts reported in the parallel studies.
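The 10kb partition described above is easy to picture as code. Below is a minimal Python sketch that bins seqfrags by their distance to the nearest known gene. It assumes a single chromosome, half-open coordinates and a brute-force search; the function names, category labels and toy coordinates are all hypothetical, and the paper’s actual procedure may differ.

```python
def gap(a_start, a_end, b_start, b_end):
    """Distance in bases between two half-open intervals; 0 if they overlap or touch."""
    return max(0, b_start - a_end, a_start - b_end)


def classify_seqfrags(seqfrags, genes, window=10_000):
    """Count seqfrags that are genic, within `window` of a gene, or farther away."""
    counts = {"genic": 0, "proximal": 0, "distal": 0}
    for start, end in seqfrags:
        # Brute force over all genes; fine for a toy example.
        nearest = min(gap(start, end, g_start, g_end) for g_start, g_end in genes)
        if nearest == 0:
            counts["genic"] += 1      # overlaps (or abuts) a known gene
        elif nearest <= window:
            counts["proximal"] += 1   # within 10kb of a known gene
        else:
            counts["distal"] += 1     # the candidate "dark matter"
    return counts


# Toy annotation: two genes and four seqfrags on one chromosome.
toy_genes = [(10_000, 20_000), (100_000, 120_000)]
toy_seqfrags = [(12_000, 12_100), (22_000, 22_100), (95_000, 95_200), (60_000, 60_100)]
toy_counts = classify_seqfrags(toy_seqfrags, toy_genes)
print(toy_counts)
```

On real data one would index genes per chromosome (for example with a sorted list and binary search) rather than scanning every gene for every seqfrag, and the definition of “known gene” fed into `genes` drives the result as much as the window size does.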
The authors then go on to show that by looking at splice junctions derived from the reads (using the popular TopHat tool) they can reach roughly the same conclusion: most of the spliced polyA+ RNA is already “known” to us. They can still identify about 5,000 novel exons in “known genes”, and those share the general characteristics of the known exons, albeit with lower expression and conservation levels. The pressing problem raised by this section, and by all the other recent RNA-Seq studies, is this: Is anyone keeping track of all those new exons? Updating RefSeq/UCSC/Ensembl? How to update these databases is also an excellent question, as short-read-based studies cannot give us a complete (or close to complete) snapshot of the actual transcripts. Anyhow, at this pace, we expect to see many additional papers re-discovering the same set of 5,000 novel exons.
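The known-versus-novel junction comparison can be sketched in a few lines of Python. The example assumes junctions are represented as (chromosome, intron start, intron end) tuples and that the annotation is available as per-transcript exon lists. The representation and the helper name are illustrative assumptions, not TopHat’s output format or the authors’ code.

```python
def junctions_from_exons(chrom, exons):
    """Introns implied by consecutive exons, as (chrom, intron_start, intron_end)."""
    exons = sorted(exons)
    return {
        (chrom, left_end, right_start)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    }


# Toy annotated transcript with three exons (half-open coordinates).
annotated = junctions_from_exons("chr1", [(100, 200), (300, 400), (500, 600)])

# Toy junctions inferred from spliced reads.
observed = [
    ("chr1", 200, 300),  # annotated
    ("chr1", 400, 500),  # annotated
    ("chr1", 200, 500),  # known splice sites, new combination (exon skipping)
    ("chr1", 600, 900),  # joins a known exon end to an unannotated position
]

novel = [junction for junction in observed if junction not in annotated]
print(novel)
```

Note that a “novel” junction is not necessarily a novel exon: the third junction above merely skips an annotated exon, while the fourth hints at unannotated sequence downstream. Turning junctions into exons requires pairing them up, which is part of why short reads give an incomplete picture of the transcripts.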
The bottom line?
- The outback of the genome rarely gives rise to highly expressed polyA+ transcripts. This does not mean that there is a shortage of putative lincRNAs: hundreds of them are already in the “known genes” set, and others may be functional despite low expression levels/proximity to known genes. The jury is still out on the polyA- (non-polyadenylated) transcriptome.
- Annotation of the “canonical genes” in the mouse/human genomes is still not complete and both can be complemented with several thousand additional exons. Let’s hope somebody is keeping track.
- Many intergenic RNAs are likely to be enhancer-associated (but we still don’t understand why).
This paper (as well as the other recent RNA-Seq studies) was definitely interesting to read, and we can only look forward to what we will learn once long-read RNA-Seq (e.g., Pacific Biosciences) kicks in.
Citation for researchblogging.org:
van Bakel H, Nislow C, Blencowe BJ, & Hughes TR (2010). Most “dark matter” transcripts are associated with known genes. PLoS Biology, 8 (5): e1000371. PMID: 20502517