Web Data Commons - Extracting Structured Data from the Common Web Crawl

News Type: Data set and Tool


The provisioning of the Common Crawl dataset and the metadata extraction by the Web Data Commons project enable researchers and web enthusiasts worldwide to experiment with very large web corpora. PlanetData partners have finished developing the software infrastructure for the extraction and will start an extraction run over the complete Common Crawl corpus once the new 2012 version of the corpus becomes available in February. Web Data Commons is a joint effort of the Web-based Systems Group at Freie Universität Berlin (Christian Bizer, Hannes Mühleisen) and the Institute AIFB at the Karlsruhe Institute of Technology (Andreas Harth, Steffen Stadtmüller).

Extraction statistics and some of the extracted structured data can be found on the project website. Please visit the website for further information.