A Comparison of Data Structures and Hash Functions to Manage URIs on the Web of Data

URIs are also exceedingly important on the Web of data, since RDF graphs and Linked Data both heavily rely on URIs to uniquely identify and connect online entities.

An Empirical Evaluation of Data Structures to Manage URIs on the Web of Data

Uniform Resource Identifiers (URIs) are one of the corner stones of the Web; They are also exceedingly important on the Web of data, since RDF graphs and Linked Data both heavily rely on URIs to uniquely identify and connect online entities. Due to their hierarchical structure and their string serialization, sets of related URIs typically contain a high degree of redundant information and are typically dictionary-encoded and managed through dedicated structures at the back-end (e.g., in the triplestore). Inserting and querying such data often account for a significant portion of query processing time. This paper presents, to the best of our knowledge, the first systematic comparison of the most common data structures used to handle URI data. We empirically evaluate a series of data structures (such as sparse hashes, hash tables, maps, B+trees, and lexicographic trees) in terms of their read/write performance and storage consumption.

Experiments results

Code for the experiments can be pulled from GitHub.

Download datasets from Google Drive.

Basic data structures

DS1	DS2	DS3	DS4	DS5	DS6	DS7
Marginal Load Time (orderded datasets)

Marginal Load Time (unordered datasets)

Total Load Time

Incremental Memory consumption (ordered datasets)

Incremental Memory consumption (unordered datasets)

Total Memory Consumption

Marginal Query Time (ordered datasets)

Marginal Query Time (unordered datasets)

DS1	DS2	DS3	DS4	DS5	DS6	DS7

RDF3X

DS1	DS2	DS3	DS4	DS5	DS6
Load Time

Memory consumption

Key query time

Id query time

DS1	DS2	DS3	DS4	DS5	DS6

HDT

DS1	DS2	DS3	DS4	DS5	DS6
Load Time

Memory consumption

Key query time

Id query time

DS1	DS2	DS3	DS4	DS5	DS6