LDIF – A Framework for Large-Scale Linked Data Integration

Publication Date: Monday, 16 April, 2012
Published in: World Wide Web Conference (WWW2012)
Andreas Schultz, Andrea Matteini, Robert Isele, Christian Bizer and Christian Becker

This paper was published in the Developers Track, co-located with the World Wide Web Conference (WWW2012), on April 16th, 2012 in Lyon, France.


While the Web of Linked Data grows rapidly, developing Linked Data applications is still cumbersome, hampered by a lack of software libraries for accessing, integrating and cleansing Linked Data from the Web. To make such applications easier to develop, we provide LDIF, the Linked Data Integration Framework. LDIF can be used as a component within Linked Data applications to gather Linked Data from the Web and to translate the gathered data into a clean local target representation while keeping track of data provenance. LDIF provides a Linked Data crawler as well as components for accessing SPARQL endpoints and remote RDF dumps. It provides an expressive mapping language for translating data from the various vocabularies used on the Web into a consistent, local target vocabulary. LDIF includes an identity resolution component that discovers URI aliases in the input data and replaces them with a single target URI based on flexible, user-provided matching heuristics. For provenance tracking, LDIF employs the Named Graphs data model. LDIF contains a data quality assessment module and a data fusion module, which allow Web data to be filtered according to different data quality assessment policies and fused using different conflict resolution methods. To handle use cases of different sizes, we provide an in-memory implementation of LDIF as well as an RDF-store-backed implementation and a Hadoop implementation that can be deployed on Amazon EC2.
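
To make the identity resolution step in the abstract more concrete, the following is a minimal sketch, not LDIF's actual implementation or API. It assumes that the matching heuristics have already produced pairs of URIs judged to be aliases, that data is held as plain (subject, predicate, object, graph) tuples of strings, and that the lexicographically smallest URI of each alias cluster serves as the target URI. The function names `find`, `union` and `rewrite_aliases` are hypothetical. The graph component of each quad is left untouched, which illustrates how Named Graphs can preserve provenance while URIs are rewritten.

```python
# Sketch only: alias rewriting over quads using a union-find structure.
# All names here are illustrative assumptions, not part of LDIF.

def find(parent, uri):
    """Return the representative URI of the alias cluster containing uri."""
    parent.setdefault(uri, uri)
    root = uri
    while parent[root] != root:
        root = parent[root]
    # Path compression: point every visited node directly at the root.
    while parent[uri] != root:
        parent[uri], uri = root, parent[uri]
    return root


def union(parent, a, b):
    """Merge the clusters of a and b, keeping the smallest URI as the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        keep, drop = sorted((ra, rb))  # arbitrary but deterministic choice
        parent[drop] = keep


def rewrite_aliases(quads, alias_links):
    """Replace every aliased URI in subject/object position by its target URI.

    quads:       iterable of (subject, predicate, object, graph) string tuples
    alias_links: iterable of (uri_a, uri_b) pairs assumed to be aliases
    """
    parent = {}
    for a, b in alias_links:
        union(parent, a, b)

    def canon(term):
        # Terms that never appeared in an alias link are returned unchanged.
        return find(parent, term) if term in parent else term

    # The graph name g is carried over as-is, so provenance is retained.
    return [(canon(s), p, canon(o), g) for s, p, o, g in quads]


quads = [
    ("http://a.example/berlin", "name", "Berlin", "graph:sourceA"),
    ("http://b.example/Berlin", "population", "3500000", "graph:sourceB"),
]
links = [("http://b.example/Berlin", "http://a.example/berlin")]
resolved = rewrite_aliases(quads, links)
```

After the call, both quads share the same subject URI while each still names the graph it came from, so a later quality assessment or fusion step could, under the same assumptions, decide between conflicting values per source.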

DevTrack_017.pdf (152.68 KB)