A Comparison of Data Structures and Hash Functions to Manage URIs on the Web of Data

URIs are also exceedingly important on the Web of data, since RDF graphs and Linked Data both heavily rely on URIs to uniquely identify and connect online entities.

An Empirical Evaluation of Data Structures to Manage URIs on the Web of Data

Uniform Resource Identifiers (URIs) are one of the corner stones of the Web; They are also exceedingly important on the Web of data, since RDF graphs and Linked Data both heavily rely on URIs to uniquely identify and connect online entities. Due to their hierarchical structure and their string serialization, sets of related URIs typically contain a high degree of redundant information and are typically dictionary-encoded and managed through dedicated structures at the back-end (e.g., in the triplestore). Inserting and querying such data often account for a significant portion of query processing time. This paper presents, to the best of our knowledge, the first systematic comparison of the most common data structures used to handle URI data. We empirically evaluate a series of data structures (such as sparse hashes, hash tables, maps, B+trees, and lexicographic trees) in terms of their read/write performance and storage consumption.

Experiments results

Code for the experiments can be pulled from GitHub.

Download datasets from Google Drive.

Basic data structures

DS1 DS2 DS3 DS4 DS5 DS6 DS7
Marginal Load Time (orderded datasets)
Marginal Load Time (unordered datasets)
Total Load Time
Incremental Memory consumption (ordered datasets)
Incremental Memory consumption (unordered datasets)
Total Memory Consumption
Marginal Query Time (ordered datasets)
Marginal Query Time (unordered datasets)
DS1 DS2 DS3 DS4 DS5 DS6 DS7

RDF3X

DS1 DS2 DS3 DS4 DS5 DS6
Load Time
Memory consumption
Key query time
Id query time
DS1 DS2 DS3 DS4 DS5 DS6

HDT

DS1 DS2 DS3 DS4 DS5 DS6
Load Time
Memory consumption
Key query time
Id query time
DS1 DS2 DS3 DS4 DS5 DS6