GENERATING AND MANAGING LARGE SCALE PROTEOGENOMIC DATA FOR ENCODE CELL LINES
Morgan Corinne Giddings, Associate Professor
University of North Carolina Chapel Hill, Office of Sponsored Research, Chapel Hill, NC 27599
Grant 1RC2HG005591-01 from National Human Genome Research Institute
Abstract: The first human genome sequence was published in 2001, yet eight years later major questions remain, such as how many genes the genome encodes and, of those genes, how many functional products arise through phenomena like alternative splicing. The Encyclopedia of DNA Elements (ENCODE) project has been coordinated by the National Human Genome Research Institute (NHGRI) to answer these questions by comprehensively classifying functional elements in the human genome. The pilot phase of the project studied 1% of the genome in detail, revealing extensive transcription well beyond that predicted by classical gene models. The biological function of a significant portion of the discovered transcripts is unclear. The ENCODE project is now scaling up to examine the whole human genome. It is likely that results will echo the pilot project, revealing extensive transcription, a significant fraction of which has unexplained function. Proteomic technologies can be applied, in a process called proteogenomic mapping, to determine which of the myriad transcripts encode proteins. This approach has been used to reveal new genes, new alternative splice variants, new start sites, and upstream open reading frames (ORFs). While substantive progress has been made in developing proteogenomic mapping technologies, a significant hurdle in using proteogenomics to assist with the ENCODE project is the lack of proteomic data sets that are coordinated with the ENCODE transcription mapping efforts. Here we propose to generate large-scale proteomic data sets directly from the same tier I ENCODE cell lines studied by the transcription efforts, coordinating the results with the transcription mapping efforts to determine which of the pervasive transcripts are translated.
Our specific aims are to 1) produce large-scale proteomic data sets on ENCODE cell lines using the most advanced mass spectrometry methods, 2) use our database technologies to store, manage, and make accessible to the community all results of the project, and 3) use our software pipeline to map the results to the latest human genome drafts, producing a UCSC (University of California Santa Cruz) genome browser track with the results. We believe the result will be a significant advancement in knowledge about our genomes and the functional products they encode. The human genome is the blueprint for human life and human health, but we do not yet understand its language - the language of genes. The ENCODE project is deciphering that language systematically, and the goal of this proposal is to accelerate that effort by revealing which parts of the blueprint contain instructions to build proteins.
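Aim 3 above ends in a UCSC genome browser track. As a rough illustration of what such a track can look like, the sketch below writes mapped peptides as a BED custom track. The `PeptideHit` record, the `to_bed_track` function, the track name, and the sample coordinates are all hypothetical and are not taken from the project's pipeline; only the BED column order (chrom, chromStart, chromEnd, name, score, strand) follows the UCSC format.

```python
from typing import Iterable, NamedTuple


class PeptideHit(NamedTuple):
    """Hypothetical record for one peptide mapped to the genome."""
    chrom: str
    start: int    # 0-based, inclusive (BED convention)
    end: int      # 0-based, exclusive (BED convention)
    peptide: str
    score: int    # BED scores range from 0 to 1000
    strand: str   # "+" or "-"


def to_bed_track(hits: Iterable[PeptideHit],
                 track_name: str = "proteogenomic_peptides") -> str:
    """Render hits as a BED6 custom track, sorted by chromosome and start."""
    header = (f'track name="{track_name}" '
              'description="Peptides mapped to the genome" useScore=1')
    lines = [header]
    for h in sorted(hits, key=lambda h: (h.chrom, h.start)):
        score = max(0, min(1000, h.score))  # clamp to the valid BED range
        lines.append("\t".join(
            [h.chrom, str(h.start), str(h.end), h.peptide, str(score), h.strand]
        ))
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only: a 10-residue peptide covers 30 nucleotides.
    print(to_bed_track([PeptideHit("chr1", 100, 130, "MKTAYIAKQR", 900, "+")]))
```

A peptide that spans a splice junction would need the BED12 block fields instead; this sketch covers only the single-exon case.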
Keywords: Affect; Alternate Splicing; Alternative Splicing; Assay; Bioassay; Biochemical; Biologic Assays; Biological; Biological Assay; Biological Function; Biological Process; California; Cell Line; Cell Lines, Strains; CellLine; Code; Coding System; Communities; Computer Programs; Computer software; Crossmatching, Tissue; DNA; DNA Sequence; Data; Data Banks; Data Bases; Data Coordinating Center; Data Coordination Center; Data Element; Data Set; Databank, Electronic; Databanks; Database, Electronic; Databases; Dataset; Deoxyribonucleic Acid; Elements; Exons; Faculty; Functional RNA; Gene Products, RNA; Gene Transcription; Genes; Genetic Transcription; Genome; Genome, Human; Goals; Grant; Health; Histocompatibility Testing; Human; Human Cell Line; Human Genome; Human, General; Immunology; Immunology (Including BRMP); Immunology (NCI Program); In element; Indium; Instruction; Investigators; Knowledge; Language; Life; Man (Taxonomy); Man, Modern; Management Information Systems; Maps; Mass Spectrum; Mass Spectrum Analysis; Methods; Modeling; Molecular and Cellular Biology; National Human Genome Research Institute; Nature; Non-Coding; Non-Coding RNA; ORFs; Open Reading Frames; Peptides; Phase; Photometry/Spectrum Analysis, Mass; Pilot Projects; Policies; Process; Protein Coding Region; Protein Splicing; Proteins; Proteome; Proteomics; Publishing; RNA; RNA Expression; RNA Splicing; RNA Splicing, Alternative; RNA, Non-Polyadenylated; Research Personnel; Researchers; Ribonucleic Acid; Site; Software; Spectrometry, Mass; Spectroscopy, Mass; Spectrum Analyses, Mass; Spectrum Analysis, Mass; Splicing; Structure; Technology; Tissue Crossmatchings; Tissue Typing; Transcript; Transcription; Transcription, Genetic; Translating; Translatings; Universities; Variant; Variation; Work; base; clinical data repository; clinical data warehouse; computer program/software; cultured cell line; data repository; develop software; developing computer software; experience; gene product; 
genome sequencing; histocompatibility typing; improved; insight; language translation; pilot study; public health relevance; relational database; repository; scale up; software development
Relevance: The human genome is the blueprint for human life and human health, but we do not yet understand its language - the language of genes. The ENCODE project is deciphering that language systematically, and the goal of this proposal is to accelerate that effort by revealing which parts of the blueprint contain instructions to build proteins.
Project start date: 2009-09-26
Project end date: 2011-06-30
Budget start date: 2009-09-26
Budget end date: 2010-06-30
PFA/PA: RFA-OD-09-004
1RC2HG005591-01 (2009): $800,000
Grants awarded to Morgan Corinne Giddings
SOFTWARE TO IDENTIFY POST-TRANSLATIONAL MODIFICATIONS FROM PROTEOMIC DATA SETS
Morgan Corinne Giddings, Associate Professor
University of North Carolina Chapel Hill, Office of Sponsored Research, Chapel Hill, NC 27599
Abstract: This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator. The subproject aims to accelerate and enhance related studies by producing an easy-to-use and fully validated software package for automatically finding modifications on proteins.
Keywords: CRISP; Computer Programs; Computer Retrieval of Information on Scientific Projects Database; Computer software; Data Set; Dataset; Funding; Grant; Institution; Investigators; Modification; NIH; National Institutes of Health; National Institutes of Health (U.S.); Post-Translational Modifications; Post-Translational Protein Processing; Posttranslational Modifications; Protein Modification; Protein Modification, Post-Translational; Protein Processing, Post-Translational; Protein Processing, Posttranslational; Protein/Amino Acid Biochemistry, Post-Translational Modification; Proteins; Proteomics; Research; Research Personnel; Research Resources; Researchers; Resources; Software; Source; United States National Institutes of Health; computer program/software; gene product
Project start date: 2009-08-31
Project end date: 2011-08-30
Budget start date: 2009-08-31
Budget end date: 2011-08-30
PFA/PA: PA-07-070
3R01RR020823-05S1_8128 (2009): $249,709
GENERATING AND MANAGING LARGE SCALE PROTEOGENOMIC DATA FOR ENCODE CELL LINES
Morgan Corinne Giddings
University of North Carolina Chapel Hill, Office of Sponsored Research, Chapel Hill, NC 27599
Grant 5RC2HG005591-02 from National Human Genome Research Institute
Abstract: The first human genome sequence was published in 2001, yet eight years later major questions remain, such as how many genes the genome encodes and, of those genes, how many functional products arise through phenomena like alternative splicing. The Encyclopedia of DNA Elements (ENCODE) project has been coordinated by the National Human Genome Research Institute (NHGRI) to answer these questions by comprehensively classifying functional elements in the human genome. The pilot phase of the project studied 1% of the genome in detail, revealing extensive transcription well beyond that predicted by classical gene models. The biological function of a significant portion of the discovered transcripts is unclear. The ENCODE project is now scaling up to examine the whole human genome. It is likely that results will echo the pilot project, revealing extensive transcription, a significant fraction of which has unexplained function. Proteomic technologies can be applied, in a process called proteogenomic mapping, to determine which of the myriad transcripts encode proteins. This approach has been used to reveal new genes, new alternative splice variants, new start sites, and upstream open reading frames (ORFs). While substantive progress has been made in developing proteogenomic mapping technologies, a significant hurdle in using proteogenomics to assist with the ENCODE project is the lack of proteomic data sets that are coordinated with the ENCODE transcription mapping efforts. Here we propose to generate large-scale proteomic data sets directly from the same tier I ENCODE cell lines studied by the transcription efforts, coordinating the results with the transcription mapping efforts to determine which of the pervasive transcripts are translated.
Our specific aims are to 1) produce large-scale proteomic data sets on ENCODE cell lines using the most advanced mass spectrometry methods, 2) use our database technologies to store, manage, and make accessible to the community all results of the project, and 3) use our software pipeline to map the results to the latest human genome drafts, producing a UCSC (University of California Santa Cruz) genome browser track with the results. We believe the result will be a significant advancement in knowledge about our genomes and the functional products they encode. The human genome is the blueprint for human life and human health, but we do not yet understand its language - the language of genes. The ENCODE project is deciphering that language systematically, and the goal of this proposal is to accelerate that effort by revealing which parts of the blueprint contain instructions to build proteins.
Keywords: Affect; Alternate Splicing; Alternative Splicing; Assay; Bioassay; Biochemical; Biologic Assays; Biological; Biological Assay; Biological Function; Biological Process; California; Cell Line; Cell Lines, Strains; CellLine; Code; Coding System; Communities; Computer Programs; Computer software; Crossmatching, Tissue; DNA; DNA Sequence; Data; Data Banks; Data Bases; Data Coordinating Center; Data Coordination Center; Data Element; Data Set; Databank, Electronic; Databanks; Database, Electronic; Databases; Dataset; Deoxyribonucleic Acid; Elements; Exons; Faculty; Functional RNA; Gene Products, RNA; Gene Transcription; Genes; Genetic Transcription; Genome; Goals; Grant; Health; Histocompatibility Testing; Human; Human Cell Line; Human Genome; Human, General; Immunology; Immunology (Including BRMP); Immunology (NCI Program); In element; Indium; Instruction; Investigators; Knowledge; Language; Life; Man (Taxonomy); Man, Modern; Management Information Systems; Maps; Mass Spectrum; Mass Spectrum Analysis; Methods; Modeling; Molecular and Cellular Biology; National Human Genome Research Institute; Nature; Non-Coding; Non-Coding RNA; ORFs; Open Reading Frames; Peptides; Phase; Photometry/Spectrum Analysis, Mass; Pilot Projects; Policies; Process; Protein Coding Region; Protein Splicing; Proteins; Proteome; Proteomics; Publishing; RNA; RNA Expression; RNA Splicing; RNA Splicing, Alternative; RNA, Non-Polyadenylated; Research Personnel; Researchers; Ribonucleic Acid; Site; Software; Spectrometry, Mass; Spectroscopy, Mass; Spectrum Analyses, Mass; Spectrum Analysis, Mass; Splicing; Structure; Technology; Tissue Crossmatchings; Tissue Typing; Transcript; Transcription; Transcription, Genetic; Translating; Translatings; Universities; Variant; Variation; Work; base; clinical data repository; clinical data warehouse; computer program/software; cultured cell line; data repository; develop software; developing computer software; experience; gene product; genome sequencing; 
histocompatibility typing; improved; insight; language translation; pilot study; public health relevance; relational database; repository; scale up; software development
Relevance: The human genome is the blueprint for human life and human health, but we do not yet understand its language - the language of genes. The ENCODE project is deciphering that language systematically, and the goal of this proposal is to accelerate that effort by revealing which parts of the blueprint contain instructions to build proteins.
Project start date: 2009-09-26
Project end date: 2011-06-30
Budget start date: 2010-07-01
Budget end date: 2011-06-30
PFA/PA: RFA-OD-09-004
5RC2HG005591-02 (2010): $800,000
SOFTWARE TO IDENTIFY POST-TRANSLATIONAL MODIFICATIONS FROM PROTEOMIC DATA SETS
Morgan Corinne Giddings, Associate Professor
University of North Carolina Chapel Hill, Office of Sponsored Research, Chapel Hill, NC 27599
Grant 3R01RR020823-05S1 from National Center For Research Resources
Abstract: Proteins are the workhorses of cells, comprising much of the machinery of life. Chemical changes due to co- or post-translational modifications, or amino acid substitutions resulting from genetic variation, can alter protein function and have significant consequences on the functioning of a cell. Pinpointing chemical changes in proteins in an automated manner remains an elusive goal. Mass spectrometry (MS) based methodologies are promising for examining such alterations, since they are exquisitely sensitive to the resulting shifts in mass. There are two main approaches that can be used for examining proteins by MS, one which measures the intact masses of proteins to detect shifts indicative of modifications (called top-down), and the other which enzymatically digests proteins into short peptides, then analyzes their chemical structure by tandem mass spectrometry (called bottom-up). Each of the existing MS methods has limitations, such as lack of complete protein coverage for bottom-up, and the inability to use top-down data to uniquely identify modifications; these drawbacks have motivated the development of hybrid combinations such as "top-down bottom-up" (TDBU) proteomics. Though these are seeing a surge of interest, there is an acute lack of comprehensive, automated software for combining measurements from the distinct MS approaches; thus, studies to date have relied upon extensive manual analysis and/or ad hoc program scripts, inhibiting progress in the field. We propose to address this issue using our two existing programs, PROCLAME for analyzing top-down data, and GFS for analyzing bottom-up data, to develop integrated, open-source software that combines data from multiple MS methodologies to pinpoint posttranslational modifications and amino acid substitutions in proteins. 
Our aims are 1) to integrate multiple MS data sources for determining the type and location of modifications on proteins, by adding a Markov chain Monte Carlo (MCMC) based engine to PROCLAME; 2) to improve the ability to analyze bottom-up data by enhancing GFS for the automatic determination of posttranslational modifications; 3) to manage and integrate results from multiple MS measurements and search engines, by developing a database system and scripts to tie the programs together; and 4) to assure program reliability and suitability through both alpha testing in-house and beta testing at external sites. Health relevance: Both amino acid substitutions and misregulation of enzymes that modify proteins play roles in human diseases such as cancer, diabetes, sickle cell anemia, and many others. This proposal is to build generalized software that a broad base of researchers can use to pinpoint the chemical changes and modifications to proteins that perturb regulatory networks in cells to cause disease, by integrating data from the latest proteomic technologies.
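The abstract above notes that top-down data alone cannot always identify a modification uniquely. A toy sketch of why: given the mass predicted for the unmodified protein and the observed intact mass, one can list the combinations of common modifications whose summed mass matches the shift. This is a brute-force enumeration written for illustration, not the MCMC engine proposed for PROCLAME. The function name, the tolerance, and the four-entry modification table are assumptions; the mass values are standard monoisotopic shifts rounded to five decimals.

```python
from itertools import combinations_with_replacement

# Monoisotopic mass shifts in daltons for a few common modifications.
MOD_MASSES = {
    "acetyl": 42.01057,
    "methyl": 14.01565,
    "oxidation": 15.99491,
    "phospho": 79.96633,
}


def explain_shift(theoretical_mass, observed_mass, max_mods=2, tol=0.02):
    """List modification combinations whose summed mass matches the shift.

    Combinations of up to `max_mods` modifications are tried, smallest
    first; `tol` is the allowed absolute mass error in daltons.
    """
    shift = observed_mass - theoretical_mass
    matches = []
    for k in range(1, max_mods + 1):
        for combo in combinations_with_replacement(sorted(MOD_MASSES), k):
            total = sum(MOD_MASSES[name] for name in combo)
            if abs(total - shift) <= tol:
                matches.append(combo)
    return matches


if __name__ == "__main__":
    # With a loose tolerance, one acetylation and three methylations
    # both explain the same intact-mass shift.
    print(explain_shift(5000.0, 5042.01057, max_mods=3, tol=0.05))
```

The ambiguity in the example (acetylation versus trimethylation, about 0.036 Da apart) is the kind of case where bottom-up fragment data can help decide between candidates and locate the modified residue.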
Keywords: No Project Terms available
Project start date: 2009-08-31
Project end date: 2011-08-30
Budget start date: 2009-08-31
Budget end date: 2011-08-30
PFA/PA: PA-07-070
3R01RR020823-05S1 (2009): $249,709
DEVELOPING PROTEOGENOMIC MAPPING FOR HUMAN GENOME ANNOTATION
Morgan Corinne Giddings, Associate Professor
University of North Carolina Chapel Hill, Office of Sponsored Research, Chapel Hill, NC 27599
Grant 5R01HG003700-05 from National Human Genome Research Institute
Abstract: Genome sequencing efforts are producing ever greater quantities of raw DNA sequence, but the annotation process for locating and determining the function of genetic elements has not kept up. While many aspects of annotation are difficult, it is particularly challenging to determine which parts of a genome sequence encode proteins, and therefore how the processes leading to protein translation are regulated. Not only are technologies for examining proteins more limited than those for studying RNA transcription, but an extensive study of transcription by the Encyclopedia of DNA Elements (ENCODE) consortium also revealed a picture of great complexity. The project uncovered many novel exons, alternative splice forms, and novel regulatory elements. These results indicate that nearly nine-tenths of human genes undergo alternative splicing, and the average gene produces approximately 6 splice variants. Rather than solidify knowledge regarding the location and function of genes, these results question whether we accurately know what constitutes a gene, and how the products encoded by genes determine the function of cells. The results particularly obfuscate determination of which transcripts are selected for translation to protein, further complicating annotation efforts. To address that gap, our project will determine which transcripts encode proteins, and how these are affected in several tissue types and disease conditions. We will use large tandem mass spectrometry-based proteomic data sets, mapping the analyzed protein data directly to several available human genome sequences, along with sets of predicted transcripts produced by the N-SCAN and CONTRAST gene finders, to reveal which parts of transcripts are translated into proteins, and in which types of cells this translation occurs.
To accomplish this, our project has three specific aims: 1) to develop high-accuracy methods and software for mapping proteomic data from proteins analyzed by mass spectrometry directly to the genome locus encoding them; 2) to develop an analysis pipeline software system using a novel rule-based information management approach; and 3) to apply these developments for the high-throughput analysis of large proteomic data sets, identifying the transcripts that encode proteins in distinct tissue types and disease conditions, and placing the results in a publicly accessible track in the UCSC genome browser. We believe this project will yield significant knowledge about the location and timing of protein translation in cells, which will potentiate further investigation of how misregulation of the path from transcription to translation leads to human disease conditions. Sequencing of the human genome is complete, but figuring out where genes are located, how they function, and how they cause or prevent human diseases like cancer has only just begun. Genes act as blueprints for RNA and proteins, the workhorses of the cell. We are developing technologies to address the key challenges of determining which genes specify the building of which proteins and how this process is orchestrated to ultimately unravel how disease processes occur.
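Aim 1 above concerns mapping peptide-level data directly to the genome locus that encodes it. The core idea can be sketched as a six-frame translation followed by a search for the peptide. This is a deliberately simplified stand-in: it uses exact string matching on an already identified peptide, ignores splicing, and is not the scoring method used by the project's own software. The function names and the toy sequence are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def map_peptide(genome, peptide):
    """Find exact peptide matches in all six reading frames.

    Returns sorted (start, end, strand) tuples in 0-based, half-open
    coordinates on the forward strand.
    """
    hits = []
    n = len(genome)
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand offsets to forward coordinates.
                    start, end = n - end, n - start
                hits.append((start, end, strand))
                pos = protein.find(peptide, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    toy_genome = "CCATGAAAACCGG"  # ATG AAA ACC encodes M-K-T in frame 2
    print(map_peptide(toy_genome, "MKT"))
```

Peptides that span exon junctions will not be found by a search over contiguous genomic sequence, which is one reason the abstract also mentions mapping against predicted transcript sets.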
Keywords: Address; Affect; Algorithms; Alternate Splicing; Alternative Splicing; Biochemical; Body Tissues; Cancers; Cell Function; Cell Process; Cell physiology; Cells; Cellular Function; Cellular Physiology; Cellular Process; Code; Coding System; Collaborations; Communities; Complex; Computer Programs; Computer Software Tools; Computer software; Crossmatching, Tissue; Custom; DNA; DNA Sequence; Data; Data Banks; Data Bases; Data Set; Databank, Electronic; Databanks; Database, Electronic; Databases; Dataset; Deoxyribonucleic Acid; Development; Disease; Disorder; Elements; Exons; Foundations; Funding; Gene Products, RNA; Gene Targeting; Gene Transcription; Genes; Genetic Transcription; Genome; Genome, Human; Goals; Grant; Histocompatibility Testing; Human; Human Genome; Human, General; Imagery; Information Management; Investigation; Investigators; Isotope Labeling; Knowledge; Learning, Machine; Link; Location; Machine Learning; Malignant Neoplasms; Malignant Tumor; Man (Taxonomy); Man, Modern; Maps; Mass Spectrum; Mass Spectrum Analysis; Measures; Methods; Mining; Minings; Modeling; Nature; Paint; Peptides; Photometry/Spectrum Analysis, Mass; Play; Procedures; Process; Protein Analysis; Proteins; Proteomics; Quality Control; RNA; RNA Expression; RNA Splicing; RNA Splicing, Alternative; RNA, Non-Polyadenylated; Regulation; Regulatory Element; RegulatoryElement; Research Personnel; Researchers; Ribonucleic Acid; Role; Sampling; Scanning; Software; Software Tools; Source; Specific qualifier value; Specified; Spectrometry, Mass; Spectroscopy, Mass; Spectrum Analyses, Mass; Spectrum Analysis, Mass; Speed; Speed (motion); Splicing; Structure; Subcellular Process; System; System, LOINC Axis 4; Targetings, Gene; Technology; Time; Tissue Crossmatchings; Tissue Typing; Tissues; Tools, Software; Transcript; Transcription; Transcription, Genetic; Translating; Translatings; Translations; Variant; Variation; Visualization; base; cell type; clinical data repository; clinical data warehouse; computer program/software; data repository; design; designing; disease/disorder; experience; flexibility; gene function; gene product; genetic element; genome sequencing; high throughput analysis; histocompatibility typing; human disease; improved; kernel methods; language translation; malignancy; neoplasm/cancer; new technology; novel; prevent; preventing; public health relevance; relational database; social role; software systems; statistical learning; support vector machine; tandem mass spectrometry; web interface
Relevance: Sequencing of the human genome is complete, but figuring out where genes are located, how they function, and how they cause or prevent human diseases like cancer has only just begun. Genes act as blueprints for RNA and proteins, the workhorses of the cell. We are developing technologies to address the key challenges of determining which genes specify the building of which proteins and how this process is orchestrated to ultimately unravel how disease processes occur.
Project start date: 2005-09-16
Project end date: 2012-03-31
Budget start date: 2010-04-01
Budget end date: 2011-03-31
PFA/PA: PA-07-070
5R01HG003700-05 (2010): $435,435
SOFTWARE TO IDENTIFY POST-TRANSLATIONAL MODIFICATIONS FROM PROTEOMIC DATA SETS
Morgan Corinne Giddings, Associate Professor
University of North Carolina Chapel Hill, Office of Sponsored Research, Chapel Hill, NC 27599
Grant 5R01RR020823-06 from National Center For Research Resources
Keywords: Acute; Address; Amino Acid Substitution; Animal Welfare; Bibliography; Cancers; Cells; Chemical Structure; Chemicals; Computer Programs; Computer software; Country; Data; Data Banks; Data Bases; Data Set; Data Sources; Databank, Electronic; Databanks; Database, Electronic; Databases; Dataset; Development; Diabetes Mellitus; Disease; Disorder; Ecological impact; Environment; Environmental Impact; Enzymes; Equipment; Ethics Committees, Research; Gene variant; Genetic Diversity; Genetic Variation; Goals; Hb SS disease; HbSS disease; Health; Hemoglobin S Disease; Hemoglobin sickle cell disease; Hemoglobin sickle cell disorder; Housing; Hybrids; IACUC; IRBs; Impact, Environmental; Institutional Animal Care and Use Committee; Institutional Review Boards; International; Investigators; Life; Location; Malignant Neoplasms; Malignant Tumor; Manuals; Markov Chains; Markov Process; Mass Spectrum; Mass Spectrum Analysis; Measurement; Measures; Method LOINC Axis 6; Methodology; Methods; Modification; Names; Peptides; Photometry/Spectrum Analysis, Mass; Play; Post-Translational Modifications; Post-Translational Protein Processing; Posttranslational Modifications; Principal Investigator; Programs (PT); Programs [Publication Type]; Protein Modification; Protein Modification, Post-Translational; Protein Processing, Post-Translational; Protein Processing, Posttranslational; Protein/Amino Acid Biochemistry, Post-Translational Modification; Proteins; Proteomics; Research; Research Ethics Committees; Research Personnel; Research Resources; Researchers; Resources; Role; Sickle Cell Anemia; Site; Software; Sources, Data; Spectrometry, Mass; Spectroscopy, Mass; Spectrum Analyses, Mass; Spectrum Analysis, Mass; System; System, LOINC Axis 4; Testing; Variation (Genetics); Vertebrate Animals; Vertebrates; ing; allelic variant; base; clinical data repository; clinical data warehouse; computer program/software; data repository; diabetes; disease/disorder; expiration; gene product; 
human disease; human subject; improved; interest; malignancy; neoplasm/cancer; open source; programs; protein function; relational database; sickle cell disease; sickle disease; sicklemia; social role; tandem mass spectrometry; vertebrata
Project start date: 2004-09-24
Project end date: 2011-11-30
Budget start date: 2009-12-01
Budget end date: 2010-11-30
PFA/PA: PA-07-070
5R01RR020823-06 (2010): $324,220