archive:appds:wp2:main
Revision 05/07/2024 22:03 (current): page moved from grants:archive:appds:wp2:main to archive:appds:wp2:main.
====== WP2: Big Data Science Software ======
The goal of this work package is, first, to improve the basic software of the KCDC data centre. We
will then adapt it to move the data centre to the Big Data environment provided by KIT-SCC and
MSU-SINP. Both experiences will finally be used to provide a concept and a standard software
package for scientific data centres. The project will therefore serve as a general software solution
for providing open access to (astroparticle) data, independent of the exact kind of data. The
software needs to be built as a modular, flexible framework with good scalability (e.g. to large
computing centres). The configuration is to be kept simple and should also be possible via a web interface.
The entire software will be based solely on Open Source Software (Python, Django, MongoDB,
HDF5, ROOT, etc.). Once this step is done, the result makes it possible to install and publish a dedicated
Data Life Cycle Lab for the astroparticle physics community, which is the aim of this project.
Dedicated access, storage, and interface software has to be developed.

A brief summary of the requirements,
improved is compiled in the following: When publishing data, it is not enough to put some ASCII
files on a plain webpage. To ensure that the user can actually use the data, extensive
documentation on how the data were obtained is needed. Depending on the kind of data, this
includes at least a description of the detector and the reconstruction procedures employed. Since this
information,
static pages is a viable option. Another important aspect is user and access management. For
KCDC it was enough to ensure that only registered users get access to the data and to certain
parts of the portal, such as the data shop. It was therefore sufficient to check whether an account is
active and the user is authenticated. While a basic implementation of permission-based access
limitation already exists, using it effectively requires a useful categorization of users into
(possibly hierarchical) groups, since no administrator should have to manage the privileges of single users manually.
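The group-based access scheme described above can be sketched in a few lines. This is only an illustration in plain Python, not the KCDC implementation and not Django's permission API: the classes `Group` and `User`, the group names and the permission strings are all hypothetical. It assumes that a group inherits every permission of its parent group and that users receive privileges only through group membership.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Group:
    """A user group; permissions are inherited from the parent group."""
    name: str
    permissions: set = field(default_factory=set)
    parent: Optional["Group"] = None

    def effective_permissions(self) -> set:
        # Walk up the hierarchy and collect every inherited permission.
        perms = set(self.permissions)
        node = self.parent
        while node is not None:
            perms |= node.permissions
            node = node.parent
        return perms


@dataclass
class User:
    name: str
    is_active: bool = True
    is_authenticated: bool = False
    groups: list = field(default_factory=list)

    def has_permission(self, permission: str) -> bool:
        # The basic check named in the text: active and authenticated.
        if not (self.is_active and self.is_authenticated):
            return False
        # Privileges come only from group membership, never per user.
        return any(permission in g.effective_permissions() for g in self.groups)


# Hypothetical hierarchy: "collaboration" extends "registered".
registered = Group("registered", {"portal.view", "datashop.request"})
collaboration = Group("collaboration", {"datashop.preview"}, parent=registered)

alice = User("alice", is_authenticated=True, groups=[collaboration])
bob = User("bob", is_authenticated=True, groups=[registered])
guest = User("guest")  # active, but not authenticated

print(alice.has_permission("datashop.request"))  # inherited via the parent group
print(bob.has_permission("datashop.preview"))
print(guest.has_permission("portal.view"))
```

With such a structure an administrator changes a privilege once, on the group, and every member (including members of child groups) is affected.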
The heart of the data centre is the data shop. The design goals have to ensure easy access to
a user-defined subset of the whole dataset, a natural way to configure additional detector
components or observables without the need to change the codebase, and a clear overview of
previous requests with the possibility to use these selections as templates for future requests. For
KCDC, the data are stored in MongoDB. Its schema-less design allows us to collect all available
information of an event, although the available detector components may vary from event to event. In
principle, KCDC can use any kind of input format, as these are implemented as plugins. Currently,
three output formats are supported: HDF5, ROOT and ASCII. These can also be extended by adding
additional plugins. The requests are processed using a Celery-based task queue. Requests can be
processed simultaneously by adding more worker processes, which
can be distributed among several machines. Together with the possibility to run MongoDB on a
sharded cluster, this allows the processing power to scale with demand. The
use and configuration of a sharded cluster has yet to be studied, however. Although not included in
KCDC, a nice addition would be the possibility to publish results, such as energy spectra,
together with the option to select multiple spectra to compile and download plots. Adding a way to
visually select the data would be possible. In addition, the appropriate hardware has to be installed
and commissioned. The described concept and working plan is also valid for the next steps, i.e. to
generalize and consolidate the software package of a global astroparticle data centre for public
and scientific use.
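The plugin idea for output formats, combined with events whose detector components vary, can be illustrated as follows. This is a minimal sketch under stated assumptions and not the KCDC code: the registry `OUTPUT_PLUGINS`, the decorator `output_plugin`, the function `process_request` and the event fields are hypothetical, and plain dictionaries stand in for MongoDB documents. In a real deployment the body of `process_request` would be what a task-queue worker executes.

```python
import io
import json

# Registry mapping a format name to a writer function (hypothetical design).
OUTPUT_PLUGINS = {}


def output_plugin(name):
    """Decorator that registers a writer under the given format name."""
    def register(func):
        OUTPUT_PLUGINS[name] = func
        return func
    return register


@output_plugin("ascii")
def write_ascii(events, columns, stream):
    # One space-separated row per event; observables an event lacks are
    # written as "nan", since the documents need not share a schema.
    stream.write(" ".join(columns) + "\n")
    for event in events:
        stream.write(" ".join(str(event.get(c, "nan")) for c in columns) + "\n")


@output_plugin("json")
def write_json(events, columns, stream):
    # A second plugin, added without touching process_request().
    rows = [{c: e[c] for c in columns if c in e} for e in events]
    json.dump(rows, stream)


def process_request(events, columns, fmt):
    """Serialize the requested observables with the plugin for `fmt`."""
    try:
        writer = OUTPUT_PLUGINS[fmt]
    except KeyError:
        raise ValueError(f"no output plugin registered for {fmt!r}") from None
    stream = io.StringIO()
    writer(events, columns, stream)
    return stream.getvalue()


# Made-up events: the second one has no "muon_count" entry.
events = [
    {"energy": 15.2, "zenith": 12.0, "muon_count": 3},
    {"energy": 16.1, "zenith": 25.5},
]
result = process_request(events, ["energy", "muon_count"], "ascii")
print(result)
```

Supporting a further format then means registering one more writer function; the request-handling code stays unchanged.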

Specific tasks:
  * Moving KCDC to a large-scale computing facility and adapting it to the new environment (KIT-SCC, KIT-IKP, MSU-SINP)
  * Optimizing the database and access interfaces (MSU-SINP, ISDCT, KIT-SCC)
  * Developing a distributed system for storing and archiving the data (MSU-SINP, KIT-SCC)
  * Installation of appropriate hardware (KIT-SCC)
  * Installing the "Data Life Cycle Lab" (KIT-SCC)