Challenges in Data Science for Complex Systems

Somazzi, Andrea

doi:10.25429/somazzi-andrea_phd2024-01-26

Complex systems are omnipresent in domains such as economics, biology, and human-engineered systems, and understanding their behavior poses significant challenges. Comprehending these systems hinges on the effective analysis of the data they produce, for which data science provides the methodologies. A notable challenge, however, is dealing with partial information which, if not addressed judiciously, can lead to biased interpretations or misconceptions.

This thesis is structured into five chapters. The first provides a broad introduction to the main topics of this work. The second chapter studies opinion dynamics across social media platforms by defining an opinion dynamics model on a multiplex network whose layers correspond to the platforms, highlighting the interplay of multiple platforms in shaping opinions. It underscores the importance of considering all network layers when analyzing how users interact and shape their opinions. I find that empirical studies focusing on a single platform, neglecting interactions on other layers, can reach misleading conclusions. Moreover, in the richer picture given by this multi-platform opinion dynamics model, a segregation of extreme from moderate users emerges.

The third chapter concerns the Generalized Maximum Entropy Principle (GMEP), a general, principled technique for treating partial information. I will introduce the uninformativeness axiom which, when applied to the Uffink-Jizba-Korbel or the Hanel-Thurner families of entropies, selects Rényi entropy as the only viable one, thereby establishing consistency between the GMEP and the Maximum Likelihood (ML) principle. I will also showcase the potential of ML in estimating the entropic parameter characterizing Rényi entropy, providing numerical examples that support my theoretical findings.

The fourth chapter regards nonlinear data compression, where I will introduce a generalized Arithmetic Coding scheme that encodes sequences so as to minimize the exponential average codeword length. Moreover, I will provide a simple yet general justification for employing the exponential average instead of the linear one: if the main interest is to reduce the probability of exceeding a given codeword length threshold, I find that the exponential average is the target quantity to minimize. All my theoretical findings will be supported and confirmed by applications to both simulated i.i.d. data and real correlated data. In the last chapter, I will briefly summarize my results. In essence, this thesis addresses the challenges posed by complex systems to data science, offering insights and methodologies to treat the often fragmentary data that complex systems generate.
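
As a rough illustration of the quantities named in the fourth-chapter summary, the Python sketch below compares the linear and exponential average codeword lengths on a toy example. It is not the generalized Arithmetic Coding scheme of the thesis. It assumes Campbell's standard definition of the exponential average, L_t = (1/t) log2 Σ p_i 2^(t l_i) with t > 0, and uses Markov's inequality, which gives P(l ≥ T) ≤ 2^(t (L_t − T)). The distribution, the two codes, and all function names are hypothetical choices made for this example.

```python
import math


def linear_average_length(probs, lengths):
    """Ordinary expected codeword length, in bits."""
    return sum(p * l for p, l in zip(probs, lengths))


def exponential_average_length(probs, lengths, t):
    """Exponential average length (assumed Campbell form), in bits, for t > 0."""
    return math.log2(sum(p * 2 ** (t * l) for p, l in zip(probs, lengths))) / t


def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    return math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)


def tail_probability(probs, lengths, threshold):
    """Probability that a codeword is at least `threshold` bits long."""
    return sum(p for p, l in zip(probs, lengths) if l >= threshold)


def markov_tail_bound(probs, lengths, t, threshold):
    """Upper bound on the tail probability: 2 ** (t * (L_t - threshold))."""
    l_t = exponential_average_length(probs, lengths, t)
    return 2 ** (t * (l_t - threshold))


# Hypothetical toy source with two prefix-free codes for its four symbols.
probs = [0.5, 0.25, 0.125, 0.125]
variable_code = [1, 2, 3, 3]  # Shannon lengths, optimal for the linear average
fixed_code = [2, 2, 2, 2]     # fixed-length code, no long codewords

if __name__ == "__main__":
    t = 2
    for name, code in [("variable", variable_code), ("fixed", fixed_code)]:
        print(
            name,
            "linear:", linear_average_length(probs, code),
            "exponential:", exponential_average_length(probs, code, t),
            "P(l >= 3):", tail_probability(probs, code, 3),
            "bound:", markov_tail_bound(probs, code, t, 3),
        )
    # Renyi entropy of order 1 / (1 + t) lower-bounds the exponential average.
    print("Renyi entropy:", renyi_entropy(probs, 1 / (1 + t)))
```

In this toy case the linear average favors the variable-length code (1.75 bits against 2), while the exponential average with t = 2 favors the fixed-length code (2 bits against roughly 2.23), which never produces a codeword longer than 2 bits. This is consistent with the abstract's argument that the exponential average is the natural target when long codewords are to be avoided.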

Challenges in Data Science for Complex Systems / Somazzi, Andrea; external supervisor: Garlaschelli, Diego; Scuola Normale Superiore, cycle 34, 26-Jan-2024.

Defense date: 26-Jan-2024

Scientific-disciplinary sectors (SSD) (valid until 24/06/2024):
Sector FIS/03 - Physics of Matter
Sector INF/01 - Computer Science
Sector FIS/07 - Applied Physics (Cultural Heritage, Environment, Biology and Medicine)

PhD programme: Physics

Cycle: 34

DOI: https://dx.doi.org/10.25429/somazzi-andrea_phd2024-01-26

Keywords: Complex systems; Opinion Dynamics; Information Theory; Maximum Entropy; Data compression

External supervisor(s):
Garlaschelli, Diego
Ferragina, Paolo

Publisher: Scuola Normale Superiore

Appears in the following types: 9.1 PhD Thesis

Files in this item:

File	Size	Format
Tesi.pdf (open access; description: PhD thesis; type: Published version; license: Creative Commons)	4.8 MB	Adobe PDF