Chapter 2
Modern Transcriptomics and Small RNA Diversity
Kasey C. Vickers Department of Medicine, Vanderbilt University, School of Medicine, Nashville, TN, USA
Abstract
Due to significant advances in high-throughput RNA sequencing, there has been tremendous growth in the known diversity of small non-coding RNAs (sncRNA) and burgeoning interest in their pathophysiological relevance in cardiometabolic diseases. From the depths of small RNA sequencing datasets, the modern transcriptome has emerged as both complex and exquisitely interconnected. Strikingly, many unexpected regions of the genome produce transcripts that are cleaved into sncRNAs that participate in post-transcriptional gene regulation. Here we organize the main classes of sncRNAs and detail what is known about their biological functions. A comprehensive overview of microRNAs (miRNA), tRNA-derived small RNAs, small nuclear RNAs, Y RNA-derived miRNAs, small RNAs derived from vault RNA, and many other classes is provided. Although sncRNAs have not been extensively investigated, they hold enormous potential for better understanding regulatory RNA modules that may ultimately be used to treat and prevent cardiometabolic diseases.
Keywords
High-throughput RNA sequencing; MicroRNAs; Small nuclear RNAs; Small RNAs; Transcriptome; tRNA-derived small RNAs; Vault RNAs; Y RNAs
1. Introduction
Due to significant advances in high-throughput RNA sequencing (RNAseq) methods and informatics support after 2010, the field of modern genomics exploded with life, and many new RNA species were identified in an ever-deepening mammalian transcriptome. Nevertheless, to fully grasp the complexity of modern genomics, one must recognize the early history of genome-scale gene expression analysis. Since the mid-1990s, conventional hybridization arrays have been used to generate mRNA expression profiles, and they essentially created an entirely new field of transcriptomics that was accompanied by new statistical approaches and bioinformatics support. The term “transcriptomics” was the first of the now many “-omics” found in the literature and was coined in 1996 to classify the complete set of mRNA expression values in specific cells [1]. The first reported profile of gene expression was published in 1991 by J. Craig Venter's group, which released a database of expressed sequence tags generated by automated Sanger sequencing [2]. Dr. Venter later became world-renowned for his work on the human genome project. Over the next two decades, single- and dual-color hybridization gene expression arrays dominated whole-genome mRNA expression profiling. Microarray technologies established many of the systems biology and holistic scientific approaches that became favored over more conventional reductionist gene-by-gene biology. These techniques allowed investigators to survey complete pathways at genome scale for both candidate and blinded gene expression changes in biological contexts and diseases. Nonetheless, by 2014 gene expression arrays were being phased out in academic core labs, and in science in general, in favor of sequencing-by-synthesis techniques (next-generation sequencing, NGS), which arrived in 2008. Short-read massively parallel sequencing, the science behind NGS, has facilitated the rapid development of a diverse set of high-throughput DNA (DNAseq) and RNAseq strategies that are being applied to sequence whole genomes, classify DNA variants and single nucleotide polymorphisms, and profile small and long RNA transcripts. RNAseq is a class of methods designed to profile both coding (mRNA) and non-coding RNAs. The two most popular approaches are total RNAseq for gene (mRNA) expression profiling and small RNAseq (smRNAseq) for non-coding smRNA profiling, namely microRNA (miRNA) analysis.
RNAseq methods have many advantages over conventional microarray platforms, including the ability to quantify gene expression at higher genomic resolution. Moreover, RNAseq methods provide absolute expression values, compared with the less reliable relative signals generated by microarrays. This gives more confidence in data quality, particularly for transcripts of very high or very low abundance, because RNAseq counts reads directly rather than relying on fluorescent detection as microarrays do. Likewise, the background noise (signal) that is often a problem with microarrays is not an issue with RNAseq due to increased signal-to-noise ratios. Gene expression arrays rely on specific hybridization to immobilized probes, which creates high levels of background noise arising from cross-hybridization or sub-optimal hybridization kinetics. For smRNAs, namely miRNAs, RNAseq has many scientific advantages over microarray technologies [3]. For example, miRNA microarrays are often limited by probe design and specificity. miRNAs of the same family often differ by only one or two nucleotides, which can generate non-specific binding with microarrays. In addition, microarray probe melting temperatures can vary widely, and high-temperature hybridizations are often required to overcome non-specific binding and cross-hybridization. Another key area in which RNAseq has a major advantage is its ability to identify and characterize novel transcripts and unannotated RNA species. After 2013, many gene expression cores in academic institutions began to transition away from supporting microarray-based studies, for both scientific and business reasons, and to promote projects based on RNAseq technologies for mRNA and miRNA. Due to diminishing customer demand and ever-decreasing sample volumes, many cores are not able to financially support expression arrays, and doing so would drive microarray costs higher than RNAseq per-sample costs. As with many technological advances, the costs associated with RNAseq were initially prohibitive; however, by 2014 the price had dropped significantly, making it a direct, and superior, competitor to microarray technology. The costs of downstream informatics and data storage now represent greater platform challenges. Nonetheless, with the combination of competitive pricing and more informative data, RNAseq technologies are the tools of choice for expression profiling.
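The point above, that miRNAs of the same family often differ by only one or two nucleotides, can be illustrated with a minimal Python sketch that counts position-by-position mismatches between family members. The three let-7 sequences are written from memory of public miRBase entries and should be verified before use; the function names `mismatches` and `pairwise_mismatches` are hypothetical helpers introduced only for this illustration.

```python
from itertools import combinations

# Mature let-7 family sequences (5'->3'). ASSUMPTION: written from memory of
# public miRBase entries; verify against the database before relying on them.
LET7_FAMILY = {
    "let-7a-5p": "UGAGGUAGUAGGUUGUAUAGUU",
    "let-7b-5p": "UGAGGUAGUAGGUUGUGUGGUU",
    "let-7c-5p": "UGAGGUAGUAGGUUGUAUGGUU",
}


def mismatches(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def pairwise_mismatches(family):
    """Return {(name_1, name_2): mismatch count} for every pair in the family."""
    return {
        (n1, n2): mismatches(family[n1], family[n2])
        for n1, n2 in combinations(sorted(family), 2)
    }


if __name__ == "__main__":
    for (n1, n2), count in pairwise_mismatches(LET7_FAMILY).items():
        print(f"{n1} vs {n2}: {count} mismatch(es) over {len(LET7_FAMILY[n1])} nts")
```

With only one or two mismatches across 22 nts, a hybridization probe designed against one family member would be expected to bind the others to some degree, whereas sequencing reads can be assigned by exact sequence.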
2. Small Non-coding RNAs
Most RNAs can be grouped into three functional classes: translational, regulatory, or other. Many long non-coding RNAs (lncRNAs) with established function, for example, transfer RNAs (tRNA) in protein translation, are cleaved into small non-coding RNAs (sncRNAs), which have a diverse set of alternative functions, including post-transcriptional gene regulation. Long RNAs are often classified as RNAs longer than 200 nucleotides (nts) and include mRNAs, anti-sense RNAs, pseudogenes, and lncRNAs. Many functional RNAs are much shorter (100–300 nts) than long (>1 kb) mRNAs or lncRNAs and are referred to here as intermediate RNAs. These include tRNAs, ribosomal RNAs (rRNA), Y RNAs, small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), and many others. Most interestingly, it is now recognized that most, if not all, classes of long and intermediate RNAs are further processed to produce non-coding smRNAs that are generally less than 40 nts in length. Here we detail the diversity of smRNA classes in the mammalian transcriptome and highlight their biological functions; however, many smRNAs have not been extensively studied, and their physiological relevance remains to be determined. Nevertheless, there is tremendous depth and complexity to smRNAs in both cells and extracellular fluids, and they have enormous potential as novel biomarkers and drug targets in cardiometabolic diseases.
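The length-based grouping described above can be sketched as a small Python helper. This is a simplification under stated assumptions: the text gives overlapping ranges (intermediate RNAs at 100–300 nts, long RNAs above 200 nts), so the sketch treats anything under 40 nts as small, anything over 200 nts as long, and everything in between as intermediate. The function name `size_class` and the example lengths are hypothetical.

```python
from collections import Counter

# Cutoffs taken from the text; how the overlapping ranges are resolved is an
# ASSUMPTION made for this sketch, not a rule stated in the chapter.
SMALL_MAX_NT = 40   # smRNAs are "generally less than 40 nts"
LONG_MIN_NT = 200   # long RNAs are "longer than 200 nucleotides"


def size_class(length_nt):
    """Assign an RNA to a size class using its length in nucleotides."""
    if length_nt <= 0:
        raise ValueError("length must be a positive number of nucleotides")
    if length_nt < SMALL_MAX_NT:
        return "small"
    if length_nt > LONG_MIN_NT:
        return "long"
    return "intermediate"


if __name__ == "__main__":
    # Hypothetical transcript lengths, chosen only to exercise each class.
    example_lengths = [21, 22, 33, 76, 110, 1500]
    print(Counter(size_class(n) for n in example_lengths))
```

Length alone does not determine function or biogenesis, so a classification like this is only a first-pass way to partition sequencing reads before annotation.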
2.1. MicroRNAs
The most widely studied non-coding smRNAs are miRNAs, which are 19–22 nts in length and post-transcriptionally regulate mRNA targets through complementary binding sites. miRNAs were first described by Victor Ambros, Gary Ruvkun, and colleagues in 1993, when they found that lin-4S (short) in Caenorhabditis elegans (C. elegans) was produced from lin-4L (long) and was complementary to sequences in the LIN-14 mRNA, a molecule lin-4S was found to repress [4, 5]. Almost 7 years later, Gary Ruvkun and colleagues found a highly conserved miRNA (let-7) in C. elegans and confirmed earlier studies showing that a non-coding smRNA regulatory pathway exists that suppresses gene expression through post-transcriptional mechanisms [6, 7]. After 2000, research on miRNAs exploded, and by 2014 they had been studied in every biological context in plants and animals, as evidenced by over 29,000 miRNA papers in PubMed Central (pubmed.com, 2014), with over 7000 papers released in 2013 alone. Over time miRNAs have proven to be critical regulators of many biological processes and contribute significantly to metabolic homeostasis and cardiovascular function [8–10]. For example, miR-27b has been reported as a regulatory hub in lipid metabolism and regulates key metabolic genes, including peroxisome proliferator-activated receptor gamma and glycerol-3-phosphate acyltransferase [11]. Many miRNAs, including miR-27b, miR-33a/b, and miR-144, have been found to regulate cholesterol transport, particularly through the regulation of ATP-binding cassette transporter A1 (ABCA1) and cholesterol efflux [12–20]. Moreover, numerous miRNAs have been found to regulate key molecular mechanisms associated with cardiovascular disease, including inflammation (miR-21 and miR-223) and atherosclerosis (miR-126)...