Processing Lineage Queries over a Native RDF Store

An outline of the paper: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store, by Marcin Wylot, Philippe Cudré-Mauroux, and Paul Groth (WWW 2014)

Within the Web community, there have been several efforts in developing models and syntaxes to interchange provenance. However, less attention has been given to the efficient handling of provenance data within RDF database systems. While some systems store quadruples or named graphs, to the best of our knowledge, no current high-performance triple store is able to automaticallyderive provenance data for the results it produces.

We aim to fill this gap. We present TripleProv, a new database system supporting the transparent and automatic derivation of detailed provenance information for arbitrary queries. TripleProv is based on a native RDF store, which we have extended with two different physical models to store provenance data on disk in a compact fashion. In addition, TripleProv supports several new query execution strategies to derive provenance information at two different levels of granularity.

The picture below shows the architecture of TripleProv; the system takes as input queries (and optionally a provenance granularity level), and produces as output query results along with their corresponding provenance polynomials.

RDF data provenance can be modeled at different granularity levels. Our system, besides generating polynomials summarizing the complete provenance of results, also supports two levels of granularity. First, a lineage (i.e., an element appearing in a polynomial) can represent the source of a triple (e.g., the fourth element in a quadruple). We call this granularity level source-level. Alternatively, a lineage can represent a quadruple (i.e., a triple plus its corresponding source). This second type of lineage produces polynomials consisting of all the pieces of data (i.e., quadruples) that were used to answer the query, including all intermediate results. We call this level of granularity triple-level.

In addition to those two provenance granularity levels, TripleProv also supports two levels of aggregation to output the results. The default level aggregates the polynomials for all results, i.e., it gives an overview of all triples/sources used during query execution. The second level provides full provenance details, explaining—for each single result—the way (polynomial) through which this particular result was constructed. Both aggregation levels typically perform similarly (since one is basically derived from the other), hence, we mainly focus on aggregated polynomial results in the following.

TripleProv enables also to specify a list of sources we want to be used to answer query. We can also specify a list of sources we do not trust to avoid using then during query execution process. The specified lists are checked at every stage of query execution process, meaning that even at the level of intermediate results which are not necessarily presented as an output we do verify it.
Overall, the performance penalty created by tracking provenance in TripleProv ranges from a few percents to almost 350%. Clearly, we observe a significant difference between the two main provenance storage models implemented.

2014-03-10 DATA · PARTITIONING · PROVENANCE · RDF · MANAGEMENT