Corpus of 147 million relational Web tables published!
Work package 4 is happy to announce the release of a corpus containing 147 million quasi-relational Web tables.
The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a fraction of them are quasi-relational, meaning that they contain structured data describing a set of entities. A corpus of Web tables can be useful for research and applications in areas such as data search, table augmentation, and knowledge base construction, as well as for various NLP tasks.
The WDC Web Tables corpus has been extracted from the 2012 version of the Common Crawl, the largest Web crawl that is available to the public. The corpus contains the subset of the 11 billion HTML tables found in the Common Crawl that are likely quasi-relational. More information about the corpus and its application domains, as well as instructions for downloading it, can be found at: http://webdatacommons.org/webtables/
We want to thank the Common Crawl Foundation for providing their great web crawl and thus enabling the creation of the WDC Web Tables corpus.
The creation of the WDC Web Tables corpus was supported by the German Research Foundation (DFG), the EU FP7 project PlanetData, and Amazon Web Services. We are very grateful to our sponsors.