The human body contains hundreds of cell types. Thus, a central challenge of modern biology is to uncover the principles that guide the formation and maintenance of this complex cellular landscape, starting from a single cell and based on the same genome sequence. Indeed, large-scale initiatives such as the ‘human cell atlas’ are ongoing, aiming to comprehensively map cell states with a variety of technologies, including single-cell sequencing. However, although most human genes express multiple isoforms, differing primarily in transcription start and termination sites, expression is mostly described at the ‘gene level’. Furthermore, the annotation of regulatory elements and of their specific involvement in a particular cell remains sparse.
The term alternative cleavage and polyadenylation (APA) is used to describe the tissue-dependent choice of a poly(A) site among the many available within a gene. APA leads to the expression of isoforms that differ in their protein-coding sequences (CDS) and/or in their 3’ untranslated regions (3’ UTRs). During animal evolution, 3’ UTRs have expanded from ~140 nucleotides in the worm Caenorhabditis elegans to 1-2 kilobases in humans, suggesting a parallel increase in the complexity of post-transcriptional regulation. Indeed, there is increasing evidence that APA contributes to cell identity. For instance, it has been reported that the cell type-specific expression pattern of genes with a single 3’ UTR is due preferentially to transcriptional regulation, whereas that of multi-UTR genes is due rather to changes in isoform ratios. Because 3’ UTR isoforms differ in relative stability, localization, and translation rate, APA-dependent remodeling of 3’ UTRs has many far-reaching consequences for cell physiology. A particularly surprising finding has been that transcripts with distinct 3’ UTRs can direct the localization of the (same) encoded protein to distinct cellular compartments. Although these findings made clear that much of the diversity of cellular phenotypes could come from differences in APA-dependent isoform expression, this possibility has received relatively little consideration.
Having established unique resources and methods in the field of APA, I here propose to develop a coherent set of synergistic methods to systematically annotate cell type-specific poly(A) sites and isoforms, quantify their usage at the single-cell level, and model the regulation of cell type-specific isoform expression. Because the broadly used 10x Genomics technology for single-cell sequencing precisely captures transcript 3’ ends, we will exploit such data in combination with comprehensive genome-wide annotations of alternative poly(A) sites to quantify APA in single cells. We will develop new methods to systematically infer regulatory motifs for a large set of RNA binding proteins (RBPs) using extensive eCLIP data from ENCODE. Using these RBP motifs in combination with RNA sequencing data from The Cancer Genome Atlas, the human cell atlas and other consortia, we will model poly(A) site usage in terms of configurations of RBP binding sites, to infer how cell-type-dependent processing of poly(A) sites is regulated in physiological and pathological conditions. Finally, we will develop methods to assess the downstream effects of APA on RNA stability, localization, and translation. The resources generated in this project will be combined with data from large-scale initiatives, e.g. for personalized health, to link isoform expression to cellular morphology and phenotype, and to support the identification of biomarkers and the development of individualized therapies.
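
To make the notion of poly(A) site usage in single cells concrete, the following is a minimal illustrative sketch, not the method to be developed in the project. It assumes that each read has already been reduced to a cell barcode, a gene, and an inferred 3’ end coordinate, that annotated poly(A) sites are given as coordinates per gene, and that strand can be ignored. The function names, the input layout, and the distance cutoff `max_dist` are hypothetical choices made only for this example.

```python
from collections import defaultdict


def assign_site(read_end, sites, max_dist=50):
    """Return the annotated poly(A) site nearest to a read 3' end.

    Returns None if the gene has no annotated site or if the nearest
    site is farther away than max_dist (a hypothetical cutoff).
    """
    best = min(sites, key=lambda s: abs(s - read_end), default=None)
    if best is None or abs(best - read_end) > max_dist:
        return None
    return best


def relative_usage(reads, sites_by_gene, max_dist=50):
    """Compute per-cell, per-gene relative poly(A) site usage.

    reads: iterable of (cell_barcode, gene, read_end) tuples (assumed layout).
    sites_by_gene: dict mapping gene -> list of annotated site coordinates.
    Returns {(cell_barcode, gene): {site: fraction of assigned reads}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for cell, gene, end in reads:
        site = assign_site(end, sites_by_gene.get(gene, []), max_dist)
        if site is not None:  # reads far from any annotated site are skipped
            counts[(cell, gene)][site] += 1

    usage = {}
    for key, site_counts in counts.items():
        total = sum(site_counts.values())
        usage[key] = {s: n / total for s, n in site_counts.items()}
    return usage


# Toy input: one gene with a proximal (1000) and a distal (2000) site.
toy_reads = [
    ("c1", "G", 1000),
    ("c1", "G", 1010),
    ("c1", "G", 2005),
    ("c1", "G", 5000),  # too far from any annotated site, ignored
    ("c2", "G", 1990),
]
toy_sites = {"G": [1000, 2000]}
toy_usage = relative_usage(toy_reads, toy_sites)
```

In this toy input, cell `c1` assigns two of its three usable reads to the proximal site and one to the distal site, while cell `c2` uses only the distal site. Real single-cell data are sparse, so per-cell fractions of this kind would in practice need pooling or statistical modeling, which the sketch does not attempt.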