Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data
This talk was given at the Astronomical Data Analysis Software and Systems (ADASS) XXI conference, held on November 6-10, 2011 in Paris, France, by Ying Zhang (CWI).
The ever growing use of high precision experimental instruments in astronomical projects, e.g., SDSS, LSST and LOFAR, amounts to an avalanche of data to be stored, curated and analysed. Ingestion of gigabytes and even terabytes of data on a daily basis is taking place in many projects, while planned experimental devices are expected to scale ingestion up to petabytes soon. Efficient data management as part of a data exploration infrastructure has become a discriminative factor for scientific progress. Relational database management systems (RDBMSs) are the prime means to fulfill the role of application mediator for data exchange and data persistence.
Nevertheless, scientific applications are still poorly served by contemporary RDBMSs. At best, the system provides a bridge towards an external library using user-defined functions, explicit import/export facilities or linked-in Java/C# interpreters. To bridge the gap between the needs of data-intensive scientific research fields like astronomy and current DBMS technologies, we introduce SciQL (pronounced 'cycle'), the first SQL-based query language for scientific applications with both tables and arrays as first-class citizens. SciQL provides a seamless symbiosis of array, set and sequence interpretations. A key innovation is the extension of value-based grouping in SQL:2003 with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between the dimensional attributes of array cells. This leads to a generalization of window-based query processing with wide applicability in science domains.
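To make the idea of structural grouping concrete, the following is a minimal sketch in plain Python, not SciQL: it forms fixed-sized groups of array cells purely from their positions (the dimensional attributes) rather than from their values, and averages each group. The function name `tile_averages`, the 2 x 2 tile size and the sample `image` are hypothetical choices made for illustration only; they do not come from the talk.

```python
def tile_averages(grid, height=2, width=2):
    """Average every non-overlapping height x width tile of a 2-D list.

    Illustrative only: group membership depends on cell indices
    (structure), never on the values stored in the cells.
    """
    rows, cols = len(grid), len(grid[0])
    result = []
    for top in range(0, rows, height):
        out_row = []
        for left in range(0, cols, width):
            # Collect the cells whose indices fall inside this tile.
            cells = [
                grid[r][c]
                for r in range(top, min(top + height, rows))
                for c in range(left, min(left + width, cols))
            ]
            out_row.append(sum(cells) / len(cells))
        result.append(out_row)
    return result


# Hypothetical 4 x 4 "image" used only to exercise the function.
image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
tiles = tile_averages(image)
```

Value-based grouping would instead partition the cells by what they contain; here each group is a fixed-sized window anchored at a position in the array, which is the kind of window-based processing the paragraph above refers to.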
In this talk, I will demonstrate the usefulness of SciQL for astronomical data processing with examples from transient radio phenomena detection. The Transients Key Science Project (KSP) of the LOw Frequency ARray (LOFAR) focuses on exploring and understanding the explosive and dynamic universe by observing transient and variable radio sources. With its 2688 dipoles (LBA), 200 MHz sampling, 2 polarisations and 12-bit digitisation, the LOFAR antennas are capable of producing 138 petabytes of raw data per day. To process this massive volume of data, extraordinarily efficient query plans and algorithms are a must. The key operations in the Transients KSP include cross-catalogue correlation and radio pulsar detection. In traditional RDBMSs, these operations are extremely hard to express in SQL and to optimise for query execution. With SciQL, however, such array-oriented operations can be expressed easily and concisely. Furthermore, by revealing the properties of array data, SciQL enables RDBMSs to better optimise query plans.
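The raw data rate quoted above can be checked with a back-of-envelope calculation. The sketch below assumes that every dipole is sampled continuously at 200 MHz in both polarisations, that each sample occupies exactly 12 bits, and that a petabyte means 10^15 bytes; none of these conventions is stated explicitly in the abstract.

```python
# Figures taken from the abstract.
DIPOLES = 2688
SAMPLE_RATE_HZ = 200_000_000
POLARISATIONS = 2
BITS_PER_SAMPLE = 12

# Assumptions for this estimate: continuous sampling, decimal petabytes.
SECONDS_PER_DAY = 86_400
BYTES_PER_PETABYTE = 10**15

bits_per_second = DIPOLES * SAMPLE_RATE_HZ * POLARISATIONS * BITS_PER_SAMPLE
bytes_per_day = bits_per_second // 8 * SECONDS_PER_DAY
petabytes_per_day = bytes_per_day / BYTES_PER_PETABYTE

print(f"{petabytes_per_day:.1f} PB per day")
```

Under these assumptions the estimate comes to roughly 139 petabytes per day, in line with the figure quoted in the abstract; the small difference may come from rounding or unit conventions that the abstract does not spell out.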