TripleProv: Tracking and Querying Provenance in Linked Data
Linked Data, which integrates a myriad of datasets from the Web, brings new challenges to databases. Those challenges arise from all dimensions of Big Data. The one I want to touch on in this post is variety, specifically the heterogeneity of Linked Data. RDF data can be integrated from multiple sources, generated automatically or created manually, verified or not. It brings together huge, heterogeneous, and uncertain collections of data linked across the Web. To process such data, we need a way to trace how the results of a query were produced and to tailor our queries with information on data provenance.
In our work we presented TripleProv, a system capable of storing, tracing, and querying provenance information while processing RDF queries. TripleProv returns an understandable description of how the results of an RDF query were derived; specifically, it explains in detail which pieces of data were combined, and how, to produce the answer to a query. Moreover, with TripleProv you can tailor query execution with provenance information: you can input a provenance specification of the data you want to be used to derive the answer. For example, you may be interested in articles about “Obama” but want the answer to come only from sources attributed to “US News”.
As input, you provide the query you want to execute (the workload query) and an RDF query describing the provenance of the data you want to be used in query processing (the provenance query). The query execution process varies depending on the strategy. Typically the system starts by executing the provenance query, then optionally pre-materializes or co-locates data. Afterwards TripleProv executes the workload queries and, at the same time, collects information on the entities used during query execution and the way they are combined. The system returns:
- results of the workload queries, restricted to those that follow the provenance specification;
- the provenance polynomial describing the way the results were derived.
TripleProv provides detailed information on each piece of data used to produce the answer and the exact way it contributed to the results. To express this information the system uses the notion of a provenance polynomial, an algebraic structure describing how the data was combined. A provenance polynomial provided by TripleProv allows you to pinpoint and trace back the exact pieces of data used to produce the answer and exactly how those pieces were combined. To express these combinations TripleProv uses two basic algebraic operators: the first (⊕) represents a union, and the second (⊗) represents a join.
The picture shows a simple star query (Basic Graph Pattern) and a provenance polynomial pinpointing how each part of the query is tackled. In this example the first triple pattern is satisfied with lineage l1, l2, or l3, the second with l4 or l5, the third with elements having a lineage of l6 or l7, and the last with elements from l8 or l9. The triples were joined on variable ?a, which is expressed by the join operation (⊗) in the polynomial.
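As a rough illustration of the polynomial described above, the sketch below encodes the example as a join (⊗) of unions (⊕), one union per triple pattern. The list-of-lists representation and the helper names `render` and `expand` are assumptions made for this post, not TripleProv's internal data structures.

```python
from itertools import product
from typing import List, Tuple

# Hypothetical representation: a join (⊗) of unions (⊕),
# with one inner list of lineage identifiers per triple pattern.
Polynomial = List[List[str]]


def render(poly: Polynomial) -> str:
    """Format the polynomial with ⊕ inside each factor and ⊗ between factors."""
    return " ⊗ ".join("(" + " ⊕ ".join(factor) + ")" for factor in poly)


def expand(poly: Polynomial) -> List[Tuple[str, ...]]:
    """Distribute ⊗ over ⊕: one tuple per combination of lineages."""
    return list(product(*poly))


# The star query from the example: four triple patterns joined on ?a.
example: Polynomial = [
    ["l1", "l2", "l3"],
    ["l4", "l5"],
    ["l6", "l7"],
    ["l8", "l9"],
]

print(render(example))       # (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
print(len(expand(example)))  # 24
```

Expanding the polynomial lists every combination of lineages that could jointly satisfy the four patterns (3 × 2 × 2 × 2 = 24 here), which is one way to read the compact form.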
Besides providing an understandable and reliable description of the way the results were produced, TripleProv allows you to tailor RDF queries with provenance information. You can give the system a description of the data to be used in query processing. This description is expressed in the same way as your workload query, and we call it a provenance query. Together, the workload query and the provenance query form a provenance-enabled query, which returns the results of the workload query limited to those derived from the data described by the provenance query.
Consider the query from the previous example as our workload query. Suppose we would like to retrieve its results using only data attributed to the government and verified by the Paris Tourist Office. The following provenance query expresses this description of the data:
SELECT ?ctx WHERE {
  ?ctx prov:wasAttributedTo <government> .
  ?ctx prov:wasVerifiedBy <ParisTouristOffice> .
}
Sending these two queries to TripleProv gives you information about the geolocation of the Eiffel Tower in France. In addition, the information is obtained only from reliable and verified data that follows your description.
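To make the two-step evaluation more concrete, here is a minimal in-memory sketch: first the provenance query selects the admissible contexts, then the workload query is evaluated over only those contexts while the lineage of each matched triple is collected. It assumes each triple carries a context identifier as a fourth element; the toy data, the `geo:lat` and `geo:long` predicates, and all function names are made up for illustration and do not reflect how TripleProv is implemented.

```python
from typing import Dict, List, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)
Matches = Dict[str, List[Tuple[str, str]]]  # predicate -> [(object, context)]

# Illustrative predicate names, treated as opaque strings here.
ATTRIBUTED_TO = "prov:wasAttributedTo"
VERIFIED_BY = "prov:wasVerifiedBy"

# Toy provenance statements about contexts (hypothetical).
PROVENANCE = [
    ("ctx1", ATTRIBUTED_TO, "government"),
    ("ctx1", VERIFIED_BY, "ParisTouristOffice"),
    ("ctx2", ATTRIBUTED_TO, "government"),
    ("ctx3", ATTRIBUTED_TO, "blog"),
]

# Toy data quads (hypothetical).
DATA: List[Quad] = [
    ("EiffelTower", "geo:lat", "48.8584", "ctx1"),
    ("EiffelTower", "geo:long", "2.2945", "ctx1"),
    ("EiffelTower", "geo:lat", "48.86", "ctx3"),
    ("EiffelTower", "geo:long", "2.29", "ctx2"),
]
PREDICATES = ["geo:lat", "geo:long"]  # star query: ?a geo:lat ?x . ?a geo:long ?y


def run_provenance_query(prov, required: Set[Tuple[str, str]]) -> Set[str]:
    """Return the contexts that carry every required (predicate, object) pair."""
    facts: Dict[str, Set[Tuple[str, str]]] = {}
    for ctx, pred, obj in prov:
        facts.setdefault(ctx, set()).add((pred, obj))
    return {ctx for ctx, pairs in facts.items() if required <= pairs}


def run_workload(data: List[Quad], predicates: List[str],
                 allowed: Set[str]) -> Dict[str, Matches]:
    """Evaluate the star query using only quads from allowed contexts."""
    matches: Dict[str, Matches] = {}
    for s, p, o, ctx in data:
        if p in predicates and ctx in allowed:
            matches.setdefault(s, {}).setdefault(p, []).append((o, ctx))
    # Join on the shared subject: keep subjects that match every pattern.
    return {s: m for s, m in matches.items() if all(p in m for p in predicates)}


def polynomial(by_pred: Matches, predicates: List[str]) -> str:
    """Union (⊕) of contexts per pattern, joined (⊗) across patterns."""
    return " ⊗ ".join(
        "(" + " ⊕ ".join(sorted({ctx for _, ctx in by_pred[p]})) + ")"
        for p in predicates
    )


allowed = run_provenance_query(
    PROVENANCE,
    {(ATTRIBUTED_TO, "government"), (VERIFIED_BY, "ParisTouristOffice")},
)
results = run_workload(DATA, PREDICATES, allowed)
print(results)
print(polynomial(results["EiffelTower"], PREDICATES))

# For comparison: the same workload query without a provenance restriction.
everything = run_workload(DATA, PREDICATES, {"ctx1", "ctx2", "ctx3"})
print(polynomial(everything["EiffelTower"], PREDICATES))
```

With the provenance query applied, only `ctx1` qualifies, so both patterns are answered from that single context. Without it, the polynomial also records the contributions of `ctx2` and `ctx3`.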
In TripleProv you can specify a workload query and an optional provenance query. The system displays the results from data satisfying both queries. The system also displays a provenance polynomial describing which parts of the results were derived using which pieces of data, so that you can examine how particular pieces of data contributed to the final result.
In the era of highly heterogeneous Linked Data, integrated from multiple sources and generated with different approaches (automatic or manual), tracing and querying provenance information becomes a must-have for a modern triplestore. TripleProv offers an interactive and pragmatic solution to this problem. TripleProv is the first provenance-aware triplestore capable of storing, tracing, and querying provenance information.
For more details you can have a look at our papers on TripleProv.
On storing and tracing provenance, WWW2014:
TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store
On tailoring queries with provenance information, WWW2015:
Executing Provenance-Enabled Queries over Web Data
A demonstration of TripleProv, VLDB2015:
A Demonstration of TripleProv: Tracking and Querying Provenance over Web Data