![[ Sponsor logos: Förderverein CampusSource e.V., Helmholtz Open Science Office and de-RSE e.V. ]](/publikationen/csa2022/bilder/sponsoren.png)
In 2022, campusSOURCE, in cooperation with the Helmholtz Open Science Office and de-RSE e.V., presents the
campusSOURCE Award 2022
1st Prize:
cookietemple
Lukas Heumos
lukas.heumos@helmholtz-muenchen.de
Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
Comprehensive Pneumology Center, Helmholtz Zentrum München, Munich, Germany
School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
Philipp Ehmele
philipp.ehmele@helmholtz-munich.de
Heinrich-Pette-Institut, Leibniz Institute for Experimental Virology, Hamburg, Germany
Department of Informatics, University of Hamburg, Hamburg, Germany
Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
Tobias Langes
Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
Abstract
Software written in academic settings is often of low quality due to a lack of programming education, time and experience. As a result, useful software is frequently abandoned. Furthermore, very different coding standards hinder collaborations between academia and industry. Here, we introduce cookietemple, a best-practice programming template collection which tackles the lack of professional standardization in academic software. cookietemple further provides utility tools, such as syncing existing cookietemple based projects to the latest standards. Several major scientific tools are based on cookietemple's templates and are briefly introduced in this work.
Introduction
Software written in predominantly academic settings is often of poor quality for manifold reasons. In academia, the key performance indicator is not necessarily a low number of bugs, but paper publications and citations, as expressed in the "publish or perish" mantra. This goal is best achieved with quick proofs of concept and runs contrary to rigorous planning and implementation of complex software. Software implementations of academic results are often mere side tracks to bump up the citation count. Moreover, most researchers have little programming experience and perceive programming solely as a tool to achieve a desired result, a tool they taught themselves when confronted with the problem at hand. Superficially learning a specific programming language is not inherently an issue, but a lack of knowledge of how to produce and maintain quality software is. This is a hurdle for self-taught programmers without a computer science background.
As researchers with a stronger programming background, we identified several additional problems with academic software throughout our studies and our continuing academic and industry careers. Unfortunately, many researchers do not release their code alongside the corresponding publications, which not only makes it hard to reproduce the results, but also difficult to contribute to and adapt the software. A further serious issue is that even when the code is published online, some software packages do not accompany the code with a license, effectively preventing users from using and contributing to the software. When we contacted the authors of such software, the responses shared a common theme. A lack of knowledge on how and where to share software was frequently named as the reason for non-public code. Many researchers also named a lack of time as their reason for not uploading the code to an easily accessible repository. A few researchers also mentioned that they perceived their code as "being messy" and therefore not ready for public sharing. Moreover, researchers were not aware of the requirement to publish a license alongside the code even if they never intended to protect their code.
Due to wildly heterogeneous programming skill sets, not only between academic labs but even between members of the same group, there is a strong need for common software standards. Academic code is rarely rigorously tested with continuous integration (CI), a DevOps software development practice where developers regularly merge their changes into a central repository, which triggers automated builds and test runs. Often it is unclear whether the academic software still builds at all. Additionally, academic software is complex due to the nature of scientific research, but rarely well documented.
All of these issues hinder fruitful and efficient collaborations between academia and industry, which are necessary to bring research into practice.
Here we introduce cookietemple, a Python based command-line tool providing best-practice project templates for several domains and programming languages which aims at solving all of the aforementioned issues.
Cookietemple as the solution
Initially, we tried distributing project templates among peers based on our existing projects, but quickly noticed several issues. First, these templates were not necessarily customizable and second, these templates were quickly out of date and there was no easy way to update all existing projects to the new standards.
- Figure 1: The key concepts of cookietemple
To ensure that our very own research software tools are of the highest quality and can be developed in academic and industry collaborations, we founded the cookiejar organization in 2020. The first major project of the cookiejar organization is cookietemple. cookietemple is a Python based command line tool providing best-practice project templates for various domains and languages with additional helper tools. As of November 2021 cookietemple has been implemented as a proof of concept with best-practice templates for Python packages, C++ libraries, Java based command line tools and graphical user interfaces. With more than 100 GitHub stars and 28000 downloads [cookietemple-dl] cookietemple enjoys some popularity and is well received in the community. Further, cookietemple was able to attract several developers beyond the two initiators of the project who contribute to the project regularly through bug reports and code contributions. We want to highlight that, for example, the support for the Microsoft Windows operating system was contributed solely by external contributors. Extensive documentation is available on Read The Docs. Currently, the community around cookietemple is organized through a Discord server as well as GitHub issues. A website showcasing cookietemple and the templates represents the public profile.
In the following subsections we will introduce the core ideas of cookietemple in more detail and outline how cookietemple will be developed further.
Philosophy
cookietemple is supposed to decrease the complexity and onboarding time for moderately experienced developers, such as academics, when working with highly modern best-practice projects. Therefore, many processes, such as increasing a version across several files or writing release notes, are automated. Further, to facilitate contributions to different projects, even from different domains or programming languages, the cookietemple based projects feature common designs. This is especially important in academic settings where collaborations and specific contributions to a high number of different projects are imperative and encouraged. As a result we also put an emphasis on ensuring that contributions to cookietemple based projects are correctly credited. Finally, developing is supposed to feel modern and be fun while teaching less experienced users throughout the process. Developers are not supposed to fight against the tooling, but to actually embrace it.
- Figure 2: a) Project development without cookietemple. Developers work independently on a project with their own standards and workflows, which may not be compatible with those of the other developers. b) Project development using cookietemple. When developers create a cookietemple project and work with it, static code analysis (linting), sync and other template features such as GitHub Actions ensure adherence to common standards, which simplifies collaboration between institutions.
Creating a project with cookietemple
cookietemple guides users interactively through the creation of a new project (Figure 3). Internally, cookietemple populates the template with user selected values using cookiecutter [cookiecutter] and copies the files shared by all templates into the template to finally create a user defined project. The created project is then automatically pushed to a newly created GitHub repository on several branches (main/master, development and TEMPLATE). This facilitates and encourages code sharing for research software from the very beginning.
- Figure 3: cookietemple interactively guides users to create projects. First, the user selects the template domain (e.g. cli or lib) with the arrow keys and then the primary language (e.g. Python or C++). Next, the user is asked for the project name, a short description of the project and the initial version. After the user has selected one of 13 licenses, template specific settings such as specific library choices are requested. The project creation process ends with a newly created GitHub repository.
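The templating step can be illustrated with a minimal sketch that fills placeholders with user-selected values. It uses Python's standard `string.Template` as a simplified stand-in for cookiecutter's rendering; the function `render_project_file`, the placeholder names and the example answers are hypothetical and are not taken from cookietemple.

```python
from string import Template

# Hypothetical README template; real cookietemple templates are rendered by cookiecutter.
README_TEMPLATE = Template(
    "$project_name\n"
    "Version: $version\n"
    "License: $license\n"
    "$description\n"
)


def render_project_file(template: Template, answers: dict) -> str:
    """Fill every placeholder; raises KeyError if an answer is missing."""
    return template.substitute(answers)


# Example answers as they might be collected by the interactive prompts (assumed values).
answers = {
    "project_name": "exampleproject",
    "version": "0.1.0",
    "license": "MIT",
    "description": "A short description of the project.",
}
print(render_project_file(README_TEMPLATE, answers))
```

Using `substitute` rather than `safe_substitute` makes a forgotten answer fail loudly instead of leaving a raw placeholder in the generated project.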
Templates
The best-practice templates are at the core of cookietemple. The currently available templates are:
- cli-python: Python based libraries or command line tools
- cli-java: Native Java command line interface tools
- lib-cpp: C++ based libraries
- gui-java: Java based graphical user interfaces with JavaFX
- web-website: Websites with a Python based backend
One of the key ideas behind all of cookietemple's templates is that they share a common design and feature set. All templates feature a Makefile unifying and abstracting common tasks such as installing the tool or library (make install) or building the documentation (make docs). Therefore, simple tasks do not require deep knowledge of the build tools. The common files for all templates also feature a consolidated Sphinx documentation setup. Hence, developers of cookietemple based projects can easily update the documentation of their projects and keep it up to date. Modern development entails continuous integration and testing, which the templates provide in the form of pre-configured GitHub Actions. Although some of the template specific GitHub Actions workflows differ due to the language specific setup, all of them are based on the same service. Hence, learning a new CI service when contributing to other projects is not necessary. Here, we want to highlight one particular GitHub Actions setup which all templates share. In academic settings it is especially important to highlight contributions. cookietemple facilitates this through the integration of pre-configured release-drafter [release-drafter] and labeler [labeler]. Whenever pull requests are made against the development branch, the type of contribution (feature, bug fix, etc.) is automatically detected from the branch name and the pull request gets the corresponding label. Later, when the pull request is merged, the draft release notes are automatically updated with a link to the pull request and the GitHub username of the contributor in the corresponding section (e.g. bugs for a bug fix).
As a result, users who are familiar with a single cookietemple template can much more easily contribute to projects from other domains and languages. Collaborations between developers who do not write software full-time are therefore strengthened, because the start-up time of often stressed academics is greatly reduced and contributions are fully credited (Figure 4).
- Figure 4: Example of a new version released with release-drafter. All contributions clearly credit the respective contributors with links to the respective pull requests.
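The branch-name based labeling and the credited release-note entries described above can be sketched as follows. The prefix-to-label mapping, the function names and the line format are hypothetical illustrations; the actual rules live in each template's labeler and release-drafter configuration and may differ.

```python
from typing import Optional

# Hypothetical mapping from branch-name prefix to pull request label.
PREFIX_TO_LABEL = {
    "feature/": "enhancement",
    "fix/": "bug",
    "docs/": "documentation",
}


def label_for_branch(branch: str) -> Optional[str]:
    """Return the label of the first matching prefix, or None if nothing matches."""
    for prefix, label in PREFIX_TO_LABEL.items():
        if branch.startswith(prefix):
            return label
    return None


def release_note_entry(title: str, pr_number: int, author: str) -> str:
    """Format one release-note line that links the pull request and credits the author."""
    return f"- {title} (#{pr_number}) @{author}"


print(label_for_branch("fix/version-check"))  # bug
print(release_note_entry("Fix version check", 42, "octocat"))
```

Keeping the mapping in one table means that adding a new contribution type only requires one new entry, which mirrors how such rules are usually kept in a configuration file rather than in code.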
We will now introduce two templates in greater detail to highlight the extensive design of the templates.
cli-python
cli-python is the most popular cookietemple template. It is based on Claudio Jolowicz's [cookiecutter-hypermodern-python] but was slightly adapted to fit into cookietemple's design and to use the files shared by all templates. It should be noted that cookietemple itself is bootstrapped from cookietemple's cli-python template.
Next to an extensive setup of GitHub Actions, Sphinx documentation, cookietemple sync and bump-version (see below) and a Makefile, cli-python makes heavy use of Poetry [poetry] and nox [nox]. Poetry is the modern way of building Python packages with deterministic builds, lock files and an intuitive command line interface to build and publish packages. The improved dependency solver ensures that the package builds and runs at every stage of development, which is not guaranteed with the still commonly used Setuptools [setuptools] based build process. Further, the complete configuration is stored and managed in a single pyproject.toml file and not across several files such as the formerly required setup.py, setup.cfg, MANIFEST.in and requirements.txt. To allow tests to run in multiple environments, cli-python uses nox. As pre-configured in the template, a single nox call will format the code with black [black], lint the code with flake8 [flake8], run various pre-commit [pre-commit] checks such as Check Yaml, sort imports with isort [isort], verify type hints with mypy [mypy], run tests with pytest [pytest], report unit test coverage with Codecov [codecov] and more. Most style issues and code smells are fixed automatically with a single nox run, and for the ones that cannot be fixed automatically, the issue and a proposed fix are clearly explained to the user. This ensures that the code is always of high quality with a consistent style.
lib-cpp
The second most popular cookietemple template is lib-cpp, which is based on Filip-Ioan Dutescu's modern-cpp-template [modern-cpp] and was contributed by him. Based on CMake [CMake], the project can be built as a header-only library, an executable or even as a static or shared library. The template further provides the option to use C++ package managers such as Conan and Vcpkg by default. While integrating all cookietemple common files, the GitHub Actions CI workflows for Windows, Linux and macOS are cache optimized to ensure a minimal run time. Easily maintainable code is ensured with a Clang-Format [clang-format] configuration inspired by the base Google model. This is complemented by static analyzers such as Clang-Tidy [clang-tidy] and Cppcheck [cpp-check]. Pre-configured GoogleTest [google-test] and GoogleMock provide unit testing support complemented with Codecov [codecov] support. The complete setup allows for multi-platform development with high maintainability to ensure that researchers and industry in all environments can make use of the developed software, which is unfortunately still a challenge in the C++ ecosystem.
Helper tools
Beyond the creation of projects from templates, the cookietemple command line interface provides several helper tools.
lint
To ensure, among other things, that no essential files of the project are deleted and that the version is consistent across all configuration files, cookietemple provides a cookietemple lint command which checks an existing project against a set of pre-defined rules. If any of the rules are found to be violated, the user is made aware of it together with a link that helps them resolve the issue (Figure 5). cookietemple lint is also a part of every template's rich CI setup.
- Figure 5: To force adherence to modern standards for project development, cookietemple provides detailed linting for each project that will ensure a defined set of conditions is always met.
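The two kinds of rules named above (required files and consistent versions) can be sketched as a small checker. This is an illustrative simplification, not cookietemple's implementation: the file list `REQUIRED_FILES`, the regular expression and the function `lint_project` are assumptions made for this example.

```python
import re
from pathlib import Path

# Hypothetical rule set; the real rules are defined by cookietemple per template.
REQUIRED_FILES = ["README.rst", "LICENSE", "Makefile"]
VERSION_PATTERN = re.compile(r"version\s*=\s*[\"']?(\d+\.\d+\.\d+)")


def lint_project(root: Path, versioned_files: list) -> list:
    """Return human-readable rule violations (an empty list means the project passes)."""
    problems = []

    # Rule 1: essential files must not have been deleted.
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing required file: {name}")

    # Rule 2: every versioned file must declare the same version.
    found = {}
    for name in versioned_files:
        path = root / name
        if not path.is_file():
            problems.append(f"missing versioned file: {name}")
            continue
        match = VERSION_PATTERN.search(path.read_text(encoding="utf-8"))
        if match:
            found[name] = match.group(1)
    if len(set(found.values())) > 1:
        problems.append(f"inconsistent versions: {found}")

    return problems
```

Returning a list of violations instead of raising on the first one lets a CI job report every problem in a single run.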
sync
Users usually create their projects with the latest version of cookietemple, in which every template is itself versioned. However, if we, as developers of cookietemple, develop further features for a template or fix a critical bug, we want to update all existing projects based on that template with these additions. This is possible with cookietemple sync, which is automatically triggered every night by a GitHub Actions workflow. If cookietemple sync detects that the template version the project was created with is lower than the corresponding template version of the latest cookietemple release, a pull request is made against the development branch. The pull request only contains the changes between the two respective template versions. This is made possible by the TEMPLATE branch created during project creation, which is used for the git diff. All in all, cookietemple sync ensures that all existing projects benefit from the continuous development of cookietemple and that all projects always use the same basis, a central promise of cookietemple.
- Figure 6: Whenever an updated existing template is released in cookietemple, existing projects using this template get a GitHub pull request with the latest changes. It's then up to the developers whether to integrate those changes or not.
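The version check that decides whether a sync pull request is needed can be sketched in a few lines. The function names `parse_version` and `needs_sync` are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH version strings.

```python
def parse_version(version: str) -> tuple:
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def needs_sync(project_template_version: str, latest_template_version: str) -> bool:
    """True if the project was created from an older template version than the latest one."""
    return parse_version(project_template_version) < parse_version(latest_template_version)


print(needs_sync("1.2.9", "1.2.10"))  # True: 9 < 10 numerically
print(needs_sync("1.3.0", "1.3.0"))   # False: already up to date
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.2.9" would sort after "1.2.10" and the update would be missed.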
bump-version
Increasing the version across several configuration files is not only a cumbersome, but also an error-prone process. Hence, cookietemple provides an easily configurable cookietemple bump-version command which increases the version in all configuration files (Figure 7). When doing so, cookietemple ensures that the new version adheres to semantic versioning [semantic-versioning]. Every template ships with a pre-configured cookietemple.cfg configuration file, which allows configuration files to be blacklisted (don't increase any version in this file except for specifically marked code lines) or whitelisted (increase all matching versions in this file except specifically marked code lines) for minimal configuration overhead.
- Figure 7: To facilitate the tedious task of updating project versions across the whole project, cookietemple provides a customizable bump-version command with integrated version checks.
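The whitelist and blacklist behavior can be illustrated with a minimal sketch. The marker strings `SKIP_MARKER` and `FORCE_MARKER`, the function names and the restriction to plain MAJOR.MINOR.PATCH versions (without pre-release or build metadata) are assumptions for this example; cookietemple's actual marker syntax and checks may differ.

```python
import re

SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")

# Hypothetical line markers; cookietemple's real marker syntax may differ.
SKIP_MARKER = "<<SKIP_BUMP>>"
FORCE_MARKER = "<<FORCE_BUMP>>"


def parse(version: str) -> tuple:
    """Parse a MAJOR.MINOR.PATCH string into a tuple of ints, or raise ValueError."""
    match = SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(group) for group in match.groups())


def bump_text(text: str, old: str, new: str, whitelisted: bool) -> str:
    """Replace old with new line by line, honoring the per-line markers."""
    if parse(new) <= parse(old):
        raise ValueError(f"{new} is not greater than {old}")
    result = []
    for line in text.splitlines():
        if whitelisted:
            # Whitelisted file: bump every line unless it is explicitly excluded.
            bump = SKIP_MARKER not in line
        else:
            # Blacklisted file: bump only lines that are explicitly marked.
            bump = FORCE_MARKER in line
        result.append(line.replace(old, new) if bump else line)
    return "\n".join(result)


config = 'version = "1.0.0"\nmin_dependency = "1.0.0"  # <<SKIP_BUMP>>'
print(bump_text(config, "1.0.0", "1.1.0", whitelisted=True))
```

In the example, only the first line is bumped to 1.1.0; the marked dependency pin keeps its value.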
Example use-cases
cookietemple effectively guides academic researchers to write high quality software, which has led to several releases of impactful scientific software tools. A small selection is introduced in the following section.
Deterministic machine learning with mlf-core
Although machine learning has shown huge growth in popularity in recent years, previous studies highlighted a reproducibility crisis in machine learning [ai_crisis]. In a collaboration across several academic groups, the researchers identified the technical reasons for non-deterministic machine learning and developed the first complete solution for deterministic machine learning termed mlf-core [mlf-core]. mlf-core has been downloaded more than 35000 times on PyPI and enjoys users from academia as well as industry [mlf-core-dl]. Since the first release in August 2020 more than 30 new versions have been released with contributions from 5 researchers demonstrating the effectiveness of automation. Moreover, every release clearly highlighted contributions from individuals ensuring that the work is correctly credited.
ncem - learning cell communication from spatial graphs of cells
Cellular variation in tissue niches is key to understanding tissue phenotypes in humans and other species. Cell-cell communication events can be examined by observing the interaction of a cell with its niche via molecular profiling assays of single cells. Node-centric expression modeling (NCEM) [ncem], a computational method based on graph neural networks that reconciles variance attribution and communication modeling in a single model of tissue niches, was developed rapidly on the basis of the cookietemple cli-python template. This was only possible due to the familiarity of the lab members with cookietemple and the highly automated processes.
Plans and timeline
- Figure 8: Development plan for cookietemple in 2022.
Although cookietemple already works and enjoys a growing user base, development has only just begun. cookietemple is designed to be easily extendable with additional project templates. Hence, one of our next goals in the first quarter of 2022 is to introduce further templates to cookietemple for programming languages such as Rust for memory-safe, high-performance software, Julia for scientific computing and TypeScript for scientific visualizations. To ensure that the templates are of the highest quality we will work together with experts for these languages and borrow from already existing popular templates if the licenses allow.
Moreover, we are aware that many companies use CI services other than GitHub Actions, which might even be self-hosted. Hence, in the second quarter of 2022 we want to provide additional configurations for Azure, Jenkins and other popular CI services to ensure that industry and academia can continue to use a common set of tools. Additionally, companies tend to use different git hosting services such as (self-hosted) GitLab, which we also want to support. Users will be able to select whether they want to push their newly created project to GitHub, GitLab or any other git hosting service.
A web service with an intuitive web interface will allow for the generation of new projects based on cookietemple even for users who only work in the cloud. This programming paradigm is becoming more and more common since the introduction of Github Codespaces with VSCode.
To increase the visibility of scientific software in the research community, we want to implement, in the second quarter of 2022, the automatic creation of a Zenodo DOI [zenodo] for every new project if the user so desires. This DOI will be automatically included in the project's README and therefore in the (rendered) documentation. Since Zenodo DOIs automatically include all contributors to a project, this will ensure that all commits are credited to reward contributions.
We will further increase the visibility of cookietemple by giving talks at scientific and programming venues like PyCon 2022 and others. In addition we are planning to release a blog series on cookietemple.
Summary
In summary, we present cookietemple, a Python based command line tool to create best-practice projects based on high quality templates for several domains and programming languages. The extensive automation and incorporation of code quality checks passively improve the quality and, just as importantly, the maintainability of research software. The use of modern standard tools and high quality code enables collaborations with full-time software developers from academia and industry. Since all new projects are automatically pushed to GitHub with a standard open-source license, the developed research software is openly available for reuse, modification and contributions. The already existing release-drafter support and the planned support for Zenodo DOIs will ensure that contributions are correctly credited, which will encourage continuous contributions and maintenance from busy academics and industry researchers.
Code availability
The code for cookietemple is available at https://github.com/cookiejar/cookietemple under the Apache 2.0 license with the corresponding documentation at https://cookietemple.readthedocs.io/en/latest/. The use-cases can be found at https://github.com/mlf-core/mlf-core/ and https://github.com/theislab/ncem respectively.
References
[ncem] Fischer, David S., Schaar, Anna C. and Theis, Fabian J.. Learning cell communication from spatial graphs of cells. In: bioRxiv, Cold Spring Harbor Laboratory, 2021. doi: 10.1101/2021.07.11.451750, https://www.biorxiv.org/content/early/2021/07/12/2021.07.11.451750
[ai_crisis] Hutson, Matthew. Artificial intelligence faces reproducibility crisis. In: Science, volume 359, number 6377, pages 725-726, 2018. doi: 10.1126/science.359.6377.725, https://www.science.org/doi/abs/10.1126/science.359.6377.725
[mlf-core] Heumos, Lukas, Ehmele, Philipp, Menden, Kevin, Cuellar, Luis Kuhn, Miller, Edmund, Lemke, Steffen, Gabernet, Gisela and Nahnsen, Sven. mlf-core: a framework for deterministic machine learning. 2021. arXiv: 2104.07651 [cs.MS]
[mlf-core-dl] mlf-core download numbers. https://pepy.tech/project/mlf-core (Last Accessed: 2021-11-22)
[cookietemple-dl] cookietemple download numbers. https://pepy.tech/project/cookietemple (Last Accessed: 2021-11-22)
[release-drafter] release-drafter. https://github.com/release-drafter/release-drafter (Last Accessed: 2021-11-22)
[labeler] labeler. https://github.com/actions/labeler (Last Accessed: 2021-11-22)
[cookiecutter] cookiecutter. https://github.com/cookiecutter/cookiecutter (Last Accessed: 2021-11-22)
[cookiecutter-hypermodern-python] cookiecutter-hypermodern-python. https://github.com/cjolowicz/cookiecutter-hypermodern-python (Last Accessed: 2021-11-22)
[poetry] Poetry. https://python-poetry.org/ (Last Accessed: 2021-11-22)
[nox] nox. https://nox.thea.codes/en/stable/ (Last Accessed: 2021-11-22)
[setuptools] setuptools. https://setuptools.pypa.io/en/latest/ (Last Accessed: 2021-11-22)
[black] black. https://black.readthedocs.io/en/stable/ (Last Accessed: 2021-11-22)
[flake8] flake8. https://flake8.pycqa.org/en/latest/ (Last Accessed: 2021-11-22)
[pre-commit] pre-commit. https://pre-commit.com/ (Last Accessed: 2021-11-22)
[isort] isort. https://github.com/PyCQA/isort (Last Accessed: 2021-11-22)
[mypy] mypy. http://mypy-lang.org/ (Last Accessed: 2021-11-22)
[pytest] pytest. https://docs.pytest.org/en/6.2.x/ (Last Accessed: 2021-11-22)
[modern-cpp] modern-cpp-template. https://github.com/filipdutescu/modern-cpp-template (Last Accessed: 2021-11-22)
[CMake] CMake. https://cmake.org/ (Last Accessed: 2021-11-22)
[clang-format] Clang-Format. https://clang.llvm.org/docs/ClangFormat.html (Last Accessed: 2021-11-22)
[clang-tidy] Clang-Tidy. https://clang.llvm.org/extra/clang-tidy/ (Last Accessed: 2021-11-22)
[cpp-check] cpp-check. https://cppcheck.sourceforge.io/ (Last Accessed: 2021-11-22)
[google-test] GoogleTest. https://google.github.io/googletest/ (Last Accessed: 2021-11-22)
[codecov] Codecov. https://about.codecov.io/ (Last Accessed: 2021-11-22)
[semantic-versioning] Semantic-Versioning. https://semver.org/ (Last Accessed: 2021-11-22)
[zenodo] Zenodo DOI. https://zenodo.org/ (Last Accessed: 2021-11-22)