Comprehensive proteogenomics identification and validation of cancer associated proteoforms in MCF7 cells / Yadav, Avinash; external supervisor: McDonnell, Liam A.; Scuola Normale Superiore, 21-Dec-2018.

Comprehensive proteogenomics identification and validation of cancer associated proteoforms in MCF7 cells

Yadav, Avinash
2018

Abstract

Proteomics investigations rely on reference proteomes for the identification of proteins. These reference proteomes reflect the proteins that can be produced by an ideal organism, and so explicitly exclude protein isoforms that may be produced as a result of genetic mutation. To identify non-reference, or non-canonical, proteoforms, the results of genomics analyses must be incorporated into the protein identification workflow. I developed such a proteogenomics workflow for the comprehensive identification and validation of non-canonical proteins. The workflow was developed using MCF7 cells, a widely used in vitro model of breast cancer, because they harbor a large number of pathogenic mutations. The comprehensive proteogenomics analysis of MCF7 cells was performed using customized protein sequence database searches. In addition to confirming the protein forms of variants identified by next-generation sequencing, multiple novel proteoforms were identified and validated with synthetic isotopically labeled standards. Peptides originating from single-nucleotide variants, in-frame insertions/deletions, upstream open reading frames, transcripts in non-canonical reading frames, long non-coding RNAs, transcripts with retained introns, exon extensions, novel exons, non-consensus splicing, variants not detected by next-generation sequencing, and novel isoforms were all identified and validated. Many of the proteins have previously been reported to play a role in tumor development, but many specific proteoforms are reported here for the first time. The results demonstrate that the reference proteome databases from UniProt, RefSeq and GENCODE substantially underestimate the complexity of the oncoproteome space. The proteogenomics pipeline reported here was developed to understand how cancer-associated mutations affect the proteome, as many mutations do not lead to a stable protein product. Furthermore, mutations may act through secondary routes and affect the regulation of which protein isoforms are produced, and so it is insufficient to limit the search to the direct protein analogues of the genetic mutation (i.e. altered peptide sequences produced by single-nucleotide variants and insertion/deletion events).
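To illustrate what a customized protein sequence database entry for a single-nucleotide variant might look like, here is a minimal Python sketch. It is not the pipeline described in the thesis. The toy sequence, the S7F substitution and all function names are hypothetical, and the digestion rule (cleave after K or R unless the next residue is P) is a common simplification of trypsin specificity.

```python
def apply_substitution(protein: str, position: int, ref: str, alt: str) -> str:
    """Return the protein carrying one amino-acid substitution (1-based position)."""
    if protein[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}")
    return protein[:position - 1] + alt + protein[position:]


def tryptic_peptides(protein: str) -> list[tuple[int, str]]:
    """In-silico tryptic digest: cleave after K/R unless followed by P.

    Returns (0-based start index, peptide) pairs with no missed cleavages.
    """
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        at_end = i == len(protein) - 1
        if at_end or (residue in "KR" and protein[i + 1] != "P"):
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    return peptides


def variant_peptide(protein: str, position: int, ref: str, alt: str) -> str:
    """Return the tryptic peptide of the mutant protein that covers the variant site."""
    mutant = apply_substitution(protein, position, ref, alt)
    for start, peptide in tryptic_peptides(mutant):
        # The peptide spans 1-based positions start + 1 .. start + len(peptide).
        if start < position <= start + len(peptide):
            return peptide
    raise ValueError("variant position lies outside the protein")


# Hypothetical toy protein and variant, for illustration only.
toy_protein = "MAGTKLSEPRVDK"
print(variant_peptide(toy_protein, 7, "S", "F"))  # prints LFEPR
```

A variant peptide such as this one is absent from the reference proteome, so a search restricted to the reference database cannot match it. Appending the mutant sequences to the search database is what makes such peptides identifiable.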
21-Dec-2018
CHIM/02 Physical Chemistry
breast cancer
Chemistry
genetics
MCF7 cells
proteins
Scuola Normale Superiore
McDonnell, Liam A.
Brancato, Giuseppe
Files in this item:
PhD_thesis_Avinash_Yadav.pdf
Open Access from 22/12/2019
Description: doctoral thesis full text
Type: PhD thesis
License: Read only
Size: 10.09 MB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11384/85821