Why are learned indexes so effective?

Ferragina, Paolo; Lillo, Fabrizio; Vinciguerra, Giorgio

A recent trend in algorithm design consists of augmenting classic data structures with machine learning models, which are better suited to reveal and exploit patterns and trends in the input data, so as to achieve outstanding practical improvements in space occupancy and time efficiency. This trend is especially evident in the context of indexing data structures where, despite a few attempts at evaluating their asymptotic efficiency, theoretical results showing that learned indexes are provably better than classic indexes, such as B+-trees and their variants, are still missing. In this paper, we present the first mathematically grounded answer to this open problem. We obtain this result by discovering and exploiting a link between the original problem and a mean exit time problem over a proper stochastic process which, we show, is related to the space and time occupancy of those learned indexes. Our general result is then specialised to five well-known distributions (Uniform, Lognormal, Pareto, Exponential, and Gamma), and it is corroborated in precision and robustness by a large set of experiments.


	Year of publication
	
				2020
			
	Scientific Disciplinary Sector (valid until 24/06/2024)
	
				Sector SECS-S/06 - Mathematical Methods of Economics, Actuarial and Financial Sciences
			
	Conference title
	
				International Conference on Machine Learning
			
	Conference venue
	
				Virtual (originally Vienna)
			
	Conference dates
	
				12-18 July 2020
			
	Volume title
	
				International Conference on Machine Learning
			
	ISBN
	
				978-171382112-0
			
	Keywords
	
				Learned index; information retrieval
			
	Projects funding the research
	
	Funding
	
									MUR funds
								
	Information on research funding
	
				Part of this work has been supported by the Italian MIUR PRIN project “Multicriteria data structures and algorithms: from compressed to learned indexes, and beyond” (Prot. 2017WR7SHH), by Regione Toscana (under POR FSE 2014/2020), by the European Integrated Infrastructure for Social Mining and Big Data Analytics (SoBigData++, Grant Agreement #871042), and by PRA UniPI 2018 “Emerging Trends in Data Science”.
			
	Appears in the categories:
	
				4.1 Contribution in conference proceedings