Comprehensive proteogenomics identification and validation of cancer associated proteoforms in MCF7 cells

Yadav, Avinash

Proteomics investigations rely on reference proteomes for the identification of proteins. These reference proteomes reflect the proteins that can be produced by an ideal organism, and so explicitly exclude protein isoforms that may be produced as a result of genetic mutation. To identify non-reference, or non-canonical, proteoforms, the results of genomics analyses must be incorporated into the protein identification workflow. I developed such a proteogenomics workflow for the comprehensive identification and validation of non-canonical proteins. This development was performed using MCF7 cells, a widely used in vitro model of breast cancer, because they carry a large number of pathogenic mutations. The comprehensive proteogenomics analysis of MCF7 cells was performed using customized protein sequence database searches. In addition to confirming the protein forms of variants identified by next-generation sequencing, multiple novel proteoforms were identified and validated with synthetic isotopically labeled standards. Peptides originating from single-nucleotide variants, in-frame insertions/deletions, upstream open reading frames, transcripts in non-canonical reading frames, long non-coding RNAs, transcripts with retained introns, exon extensions, novel exons, non-consensus splicing, variants not detected by next-generation sequencing, and novel isoforms were all identified and validated. Many of the proteins have previously been reported to play a role in tumor development, but many specific proteoforms are reported here for the first time. The results demonstrate that the reference proteome databases from UniProt, RefSeq and GENCODE widely underestimate the complexity of the oncoproteome space. The proteogenomics pipeline reported here was developed to understand how cancer-associated mutations affect the proteome, as many mutations do not lead to a stable protein product. Furthermore, mutations may act through secondary routes and affect the regulation of which protein isoforms are produced, and so it is insufficient to limit the search to the direct protein analogues of the genetic mutation (i.e. altered peptide sequences produced by single-nucleotide variants and insertion/deletion events).

Comprehensive proteogenomics identification and validation of cancer associated proteoforms in MCF7 cells / Yadav, Avinash; external advisor: McDonnell, Liam A.; Scuola Normale Superiore, 21-Dec-2018.



Defense date: 21-Dec-2018
Scientific-disciplinary sectors (SSD) (valid until 24/06/2024): CHIM/02 Physical Chemistry
PhD programme: Chemistry
Keywords: breast cancer; Chemistry; genetics; MCF7 cells; proteins
External advisor(s): McDonnell, Liam A.
Internal supervisor: Brancato, Giuseppe
Publisher: Scuola Normale Superiore
Appears in types: 9.1 PhD thesis

Files in this item:

File	Size	Format
PhD_thesis_Avinash_Yadav.pdf (closed access; description: doctoral thesis full text; type: published version)	10.09 MB	Adobe PDF