Comprehensive proteogenomics identification and validation of cancer associated proteoforms in MCF7 cells

Proteomics investigations rely on reference proteomes for the identification of proteins. These reference proteomes reflect the proteins that can be produced by an ideal organism, and so explicitly exclude protein isoforms that may be produced as a result of genetic mutation. In order to identify non-reference, or non-canonical, proteoforms, the results of genomics analyses must be incorporated into the protein identification workflow. I developed such a proteogenomics workflow for the comprehensive identification and validation of non-canonical proteins. This development was performed using MCF7 cells, a widely used in vitro model of breast cancer, because these cells carry a large number of pathogenic mutations. The comprehensive proteogenomics analysis of MCF7 cells was performed using customized protein sequence database searches. In addition to confirming the protein forms of variants identified by next-generation sequencing, multiple novel proteoforms were identified and validated with synthetic isotopically-labeled standards. Peptides originating from single-nucleotide variants, in-frame insertions/deletions, upstream open reading frames, transcripts in non-canonical reading frames, long non-coding RNAs, transcripts with retained introns, exon extensions, novel exons, non-consensus splicing, variants not detected by next-generation sequencing, and novel isoforms were all identified and validated. Many of the proteins have previously been reported to play a role in tumor development, but many specific proteoforms are reported here for the first time. The results demonstrate that the reference proteome databases from UniProt, RefSeq and GENCODE substantially underestimate the complexity of the oncoproteome space. The proteogenomics pipeline reported here was developed to understand how cancer-associated mutations affect the proteome, as many mutations do not lead to a stable protein product. Furthermore, mutations may act through secondary routes and affect the regulation of which protein isoforms are produced, and so it is insufficient to limit the search to the direct protein analogues of the genetic mutation (i.e. altered peptide sequences produced by single-nucleotide variants and insertion/deletion events).

Comprehensive proteogenomics identification and validation of cancer associated proteoforms in MCF7 cells / Yadav, Avinash; external supervisor: McDonnell, Liam A.; Scuola Normale Superiore, 21-Dec-2018.

Yadav, Avinash

2018

Defense date: 21-Dec-2018
Scientific disciplinary sector of the thesis: CHIM/02 CHIMICA FISICA (Physical Chemistry)
PhD programme: Chemistry
Keywords: breast cancer; chemistry; genetics; MCF7 cells; proteins
Publisher: Scuola Normale Superiore
External supervisor(s): McDonnell, Liam A.
Internal supervisor: Brancato, Giuseppe
Appears in types: 9.1 PhD thesis

Files in this record:

File	Description	Type	Access	License	Size	Format
PhD_thesis_Avinash_Yadav.pdf	doctoral thesis full text	PhD thesis	Open Access from 22/12/2019	Read only	10.09 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11384/85821
