archive:appds:wp2:main

archive:appds:wp2:main [05/07/2024 22:03] – removed - external edit (Unknown date) 127.0.0.1
archive:appds:wp2:main [05/07/2024 22:03] (current) – ↷ Page moved from grants:archive:appds:wp2:main to archive:appds:wp2:main admin
Line 1: Line 1:
 +====== WP2: Big Data Science Software ======
  
 +The goal of this work package is, first, to improve the basic software of the KCDC data centre. We
 +will then adapt it to move the data centre to the Big Data environment provided by KIT-SCC and
 +MSU-SINP. Both experiences will be used to produce a concept and a standard software
 +package for scientific data centres. The project will thus serve as a general software solution
 +for providing open access to (astroparticle) data, independent of the exact kind of data. The
 +software needs to be built as a modular, flexible framework that scales well (e.g. to large
 +computing centres). Configuration should be kept simple and should also be possible via a web interface.
 +The entire software will be based solely on Open Source Software (Python, Django, MongoDB,
 +HDF5, ROOT, etc.). Once this step is complete, the result will make it possible to install and publish a
 +dedicated Data Life Cycle Lab for the astroparticle physics community, which is the aim of this project.
 +Dedicated access, storage, and interface software has to be developed.
 +
 +The following is a brief summary of the requirements, how they are implemented in KCDC, and how
 +they will be improved.
 +
 +When publishing data, it is not enough to put some ASCII files on a plain webpage. To ensure that
 +users can actually work with the data, extensive documentation on how the data was obtained is
 +needed. Depending on the kind of data, this includes at least a description of the detector and of the
 +reconstruction procedures employed. Since this information, like the license agreement, is not
 +expected to change often, creating static pages is a viable option.
 +
 +Another important aspect is user and access management. For KCDC it was enough to ensure that
 +only registered users get access to the data and to certain parts of the portal, such as the data shop.
 +It was therefore sufficient to check whether an account is active and the user is authenticated. A basic
 +implementation of permission-based access limitation already exists, but to use it effectively the users
 +need to be sorted into meaningful (possibly hierarchical) groups, so that no administrator has to
 +manage the privileges of individual users by hand.
 +
 +The heart of the data centre is the data shop. Its design has to provide easy access to a user-defined
 +subset of the whole dataset, a natural way to configure additional detector components or
 +observables without changing the codebase, and a clear overview of previous requests with the
 +possibility to use these selections as templates for future requests.
 +
 +For KCDC, the data is stored in MongoDB. Its schema-less design allows us to collect all available
 +information about an event, even though the available detector components may vary from event to
 +event. In principle, KCDC can use any kind of input format, as these are implemented as plugins.
 +Currently three output formats are supported: HDF5, ROOT and ASCII. These can also be extended
 +by adding further plugins.
 +
 +The requests are processed using a Celery-based task queue. Requests can be processed
 +simultaneously by adding more worker processes, which can be distributed among several machines.
 +Together with the possibility of running MongoDB on a sharded cluster, this allows the processing
 +power to scale with demand. The use and configuration of a sharded cluster has yet to be studied,
 +however.
 +
 +Not yet included in KCDC, but a useful addition, would be the possibility to publish results such as
 +energy spectra, together with the option to select multiple spectra in order to compile and download
 +plots. Adding a way to select the data visually would also be possible. In addition, the appropriate
 +hardware has to be installed and commissioned. The described concept and working plan also apply
 +to the next steps, i.e. generalizing and consolidating the software package of a global astroparticle
 +data centre for public and scientific use.
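
The hierarchical user groups called for above could look roughly like the following. This is a minimal plain-Python sketch, not the KCDC implementation: the `Group` class, the `has_permission` helper, and all group and permission names are hypothetical and chosen only for illustration. A group inherits every permission of its parent, so privileges are assigned once per group instead of per user.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class Group:
    """Hypothetical user group that inherits permissions from its parent."""
    name: str
    permissions: Set[str] = field(default_factory=set)
    parent: Optional["Group"] = None

    def effective_permissions(self) -> Set[str]:
        # Walk up the parent chain and collect permissions along the way.
        # The `seen` set guards against an accidental cycle in the hierarchy.
        perms: Set[str] = set()
        seen = set()
        node = self
        while node is not None and id(node) not in seen:
            seen.add(id(node))
            perms |= node.permissions
            node = node.parent
        return perms


def has_permission(groups: Iterable[Group], permission: str) -> bool:
    """Return True if any of the user's groups grants the permission."""
    return any(permission in g.effective_permissions() for g in groups)


# Illustrative hierarchy (names are assumptions, not taken from KCDC).
registered = Group("registered", {"view_documentation"})
data_user = Group("data_user", {"use_data_shop"}, parent=registered)
operator = Group("operator", {"manage_requests"}, parent=data_user)
```

With this layout, `has_permission([operator], "view_documentation")` is true because the permission is inherited through two levels, while a user who is only in `registered` cannot use the data shop.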
 +
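
The combination of schema-less events and pluggable output formats described above can be illustrated with a small registry. This is a sketch under stated assumptions, not KCDC code: `OUTPUT_PLUGINS`, `register_output`, `export`, and the event fields are hypothetical names, and a JSON writer stands in for HDF5 and ROOT, which would need third-party libraries. Events are plain dictionaries, so an event may lack an observable that another event has.

```python
import csv
import io
import json
from typing import Callable, Dict, List

# Hypothetical registry: format name -> writer function.
OUTPUT_PLUGINS: Dict[str, Callable[[List[dict], List[str]], str]] = {}


def register_output(name: str):
    """Decorator that adds a writer to the registry under `name`."""
    def decorator(func):
        OUTPUT_PLUGINS[name] = func
        return func
    return decorator


@register_output("ascii")
def write_ascii(events: List[dict], observables: List[str]) -> str:
    # One header row, then one space-separated row per event.
    # Observables missing from an event are written as "nan".
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=" ", lineterminator="\n")
    writer.writerow(observables)
    for event in events:
        writer.writerow([event.get(name, "nan") for name in observables])
    return buf.getvalue()


@register_output("json")
def write_json(events: List[dict], observables: List[str]) -> str:
    # Stand-in for a binary format: keep only the requested observables.
    rows = [{k: ev[k] for k in observables if k in ev} for ev in events]
    return json.dumps(rows)


def export(events: List[dict], observables: List[str], fmt: str) -> str:
    """Dispatch to the writer registered for `fmt`."""
    if fmt not in OUTPUT_PLUGINS:
        raise ValueError(f"no output plugin registered for {fmt!r}")
    return OUTPUT_PLUGINS[fmt](events, observables)


# Illustrative events; the second one has no zenith measurement.
EVENTS = [
    {"run": 1, "energy": 15.2, "zenith": 12.0},
    {"run": 1, "energy": 16.1},
]
```

Adding a new format then means registering one more function; `export` and the stored events do not change. A user-defined selection such as `export(EVENTS, ["energy", "zenith"], "ascii")` picks the subset of observables at request time.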
 +Specific tasks:
 +  * Moving KCDC to a large-scale computing facility and adapting it to the new environment (KIT-SCC, KIT-IKP, MSU-SINP)
 +  * Optimizing the database and access interfaces (MSU-SINP, ISDCT, KIT-SCC)
 +  * Developing a distributed system for storing and archiving the data (MSU-SINP, KIT-SCC)
 +  * Installing the appropriate hardware (KIT-SCC)
 +  * Installing the “Data Life Cycle Lab” (KIT-SCC)