WP2: Big Data Science Software

The goal of this work package is, first, to improve the basic software of the KCDC data centre. We will then adapt this software to move the data centre to the Big Data environment provided by KIT-SCC and MSU-SINP. The experience gained in both steps will be used to provide a concept and a standard software package for scientific data centres. The project will thus serve as a general software solution for providing open access to (astroparticle) data, independent of the exact kind of data. The software needs to be built as a modular, flexible framework with good scalability (e.g. to large computing centres). The configuration is to be kept simple and should also be possible via a web interface. The entire software will be based solely on Open Source Software (Python, Django, MongoDB, HDF5, ROOT, etc.). Once this step is completed, the result will make it possible to install and publish a dedicated Data Life Cycle Lab for the astroparticle physics community, which is the aim of this project. Dedicated access, storage, and interface software has to be developed.

The requirements, their current implementation in KCDC, and the planned improvements are summarized in the following.

When publishing data, it is not enough to put some ASCII files on a plain webpage. To ensure that the user can actually use the data, extensive documentation on how the data has been obtained is needed. Depending on the kind of data, this comprises at least a description of the detector and of the reconstruction procedures employed. Since this information, like the license agreement, is not expected to change often, creating static pages is a viable option.

Another important aspect is user and access management. For KCDC it was enough to ensure that only registered users get access to the data and to certain parts of the portal, such as the data shop. It was therefore sufficient to check whether an account is active and the user is authenticated. A basic implementation of permission-based access limitation already exists, but to use it effectively, a useful categorization of the users into (possibly hierarchical) groups is needed, since no administrator should have to manage the privileges of single users manually.

The heart of the data centre is the data shop. Its design has to ensure easy access to a user-defined subset of the whole dataset, a natural way to configure additional detector components or observables without changing the codebase, and a clear overview of previous requests with the possibility to use these selections as templates for future requests. For KCDC, the data is stored in MongoDB. Its schema-less design allows us to collect all available information on an event, although the available detector components may vary from event to event. In principle, KCDC can use any kind of input format, as these are implemented as plugins. Currently three output formats are supported: HDF5, ROOT and ASCII. These can likewise be extended by adding further plugins.
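The combination of schema-less event documents and output plugins described above can be illustrated with a minimal sketch. It is not the KCDC implementation: the component names (`detector_a`, `detector_b`), the observables, the registry `OUTPUT_PLUGINS` and the helper functions are all hypothetical, and a JSON writer stands in for HDF5 and ROOT so that the example needs only the Python standard library.

```python
import json
from typing import Callable, Dict, List

# Hypothetical event documents. As in a schema-less store, each event
# carries only the detector components that were available for it.
EVENTS = [
    {"event_id": 1, "detector_a": {"energy": 15.2, "zenith": 12.0}},
    {"event_id": 2, "detector_a": {"energy": 16.1, "zenith": 30.5},
     "detector_b": {"n_charged": 48213}},
]

# Registry of output plugins: format name -> writer function (assumed design).
OUTPUT_PLUGINS: Dict[str, Callable[[List[dict]], str]] = {}


def register_output(name: str):
    """Decorator that adds a writer to the registry under the given name."""
    def decorator(func):
        OUTPUT_PLUGINS[name] = func
        return func
    return decorator


def select(events: List[dict], components: List[str]) -> List[dict]:
    """Keep the event id plus those requested components an event actually has."""
    rows = []
    for event in events:
        row = {"event_id": event["event_id"]}
        for component in components:
            if component in event:
                row[component] = event[component]
        rows.append(row)
    return rows


@register_output("ascii")
def write_ascii(rows: List[dict]) -> str:
    """One line per event, observables written as component.name=value."""
    lines = []
    for row in rows:
        fields = [f"event_id={row['event_id']}"]
        for component, observables in row.items():
            if component == "event_id":
                continue
            for name, value in sorted(observables.items()):
                fields.append(f"{component}.{name}={value}")
        lines.append(" ".join(fields))
    return "\n".join(lines)


@register_output("json")
def write_json(rows: List[dict]) -> str:
    """Stand-in for a binary format such as HDF5 or ROOT."""
    return json.dumps(rows, sort_keys=True)


def export(events: List[dict], components: List[str], fmt: str) -> str:
    """Apply the user's selection and hand the rows to the chosen plugin."""
    if fmt not in OUTPUT_PLUGINS:
        raise ValueError(f"unknown output format: {fmt}")
    return OUTPUT_PLUGINS[fmt](select(events, components))
```

In this sketch, supporting a further format only requires registering one more writer function; `select` and `export` stay unchanged, which mirrors the requirement that new components or formats should not force changes to the codebase.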
The requests are processed using a Celery-based task queue. Requests can be processed simultaneously by adding more worker processes, which can be distributed among several machines. Together with the possibility to run MongoDB on a sharded cluster, this allows the processing power to scale with the demand. The use and configuration of a sharded cluster has yet to be studied, however. Not yet included in KCDC, but a nice addition, would be the possibility to publish results such as energy spectra, together with the option to select multiple spectra from which plots can be compiled and downloaded. Adding a way to select the data visually would also be possible. In addition, the appropriate hardware has to be installed and commissioned. The described concept and working plan also apply to the next steps, i.e. generalizing and consolidating the software package for a global astroparticle data centre for public and scientific use.
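The idea of processing requests through a queue whose throughput grows with the number of workers can be sketched as follows. This is only an illustration: a thread pool from the Python standard library is used in place of Celery so that the example stays self-contained, and the request fields, the energy values and the function names are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Hypothetical data requests: each names a user and a minimum-energy cut.
REQUESTS = [
    {"user": "alice", "min_energy": 15.0},
    {"user": "bob", "min_energy": 16.0},
    {"user": "carol", "min_energy": 17.0},
]

# Made-up energy values standing in for the stored events.
EVENT_ENERGIES = [14.8, 15.2, 16.1, 16.7, 17.3]


def process_request(request: dict) -> dict:
    """Apply the request's cut and report how many events were selected."""
    selected = [e for e in EVENT_ENERGIES if e >= request["min_energy"]]
    return {"user": request["user"], "n_events": len(selected)}


def run_queue(requests: List[dict], n_workers: int) -> List[dict]:
    """Process all requests with n_workers workers running concurrently.

    The results come back in the order of the requests, whatever the
    number of workers.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(process_request, requests))
```

The number of workers changes only how many requests are handled at the same time, not the result of any single request. In a Celery setup the workers would additionally be separate processes that can run on different machines, which a thread pool does not reproduce.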

Specific tasks:

Moving KCDC to a large-scale computing facility and adapting it to the new environment (KIT-SCC, KIT-IKP, MSU-SINP)
Optimizing the database and access interfaces (MSU-SINP, ISDCT, KIT-SCC)
Developing a distributed system for storing and archiving the data (MSU-SINP, KIT-SCC)
Installation of appropriate hardware (KIT-SCC)
Installing the Data Life Cycle Lab (KIT-SCC)