campusSOURCE Award 2022
The Citation File Format
Deutsches Zentrum für Luft- und Raumfahrt
Software is at the heart of modern research, as tooling for the creation, processing, analysis, visualization, and simulation of research data, as an enabler of new research methods, and as research output that embeds or implements research knowledge.
Despite research's already great, and still growing, dependence on software, software plays a subordinate role in the academic ecosystem, where journal articles are still considered the be-all and end-all of any research activity. Citations to journal articles, publication counts, and journal impact factors feature prominently in the evaluation of researchers, research groups and whole institutions, and careers have been built on these metrics.
As a consequence, the people who are involved in the development and maintenance of research software – together making up the community of research software engineers (RSEs) – often lack equal, comparable, or indeed any career opportunities, simply because they write code, documentation, bug reports, requirements, or design documents, and not journal articles or monographs.
At the heart of this issue is the fact that software is an altogether different product from a written document, on a number of levels. There is hardly ever a final, “camera-ready” version of software; instead, it is continuously developed over many different versioned releases, and even more incremental changes. The formal – or even peer-reviewed – publication of software is not yet an established practice, and lacks both the large number of volunteer reviewers that work at the behest of academic journals, and established frameworks and processes. And last but not least, there is little incentive to publish software, as it is a work-intensive process that would have to be undertaken frequently, and because there is little to no reward for doing the work, due to a subpar practice of software citation (Howison and Bullard 2015).
This is highly problematic, because when software is not published and made citable on the one hand, and not properly cited on par with other research outputs on the other hand, this endangers the research endeavour as such: Computationally yielded research results cannot be evaluated, replicated and reproduced, as part of their genesis is obscured; existing software that has been successfully used for research cannot be identified, found, accessed, reused, and further developed to enable further research; researchers that create and maintain software cannot be attributed, credited and rewarded for their software work, and therefore capable people will be forced out of the academic sector; the provenance of research cannot be fully understood as software does not feature in research citation and knowledge graphs (Druskat 2020); and ultimately, research software cannot become sustainable (Druskat, Katz, and Todorov 2021).
To overcome these obstacles, a better practice of software citation is needed on all sides. Research software developers must make their software citable and publish it for citation, and researchers must properly cite the software they use in their research. To support better practice, the principles of software citation have been laid out in a paper published by a group of research software and scholarly communications experts (Smith et al. 2016).
The implementation of the software citation principles, however, is challenging (Katz et al. 2019). One of the core challenges is the provision and retrieval of the correct and complete metadata that is needed to cite software meaningfully. This is because, unlike papers, software does not wear the respective metadata on its sleeve: it does not have a title page or article website provided by a publisher that conveniently includes all relevant metadata in one place. Also, software citation needs some metadata that goes beyond what is provided as standard for text outputs, such as version information to support reproducibility. The Citation File Format (Druskat, Spaaks, et al. 2021) has been developed to alleviate these challenges.
What is the Citation File Format?
The Citation File Format (CFF) is an open community project that provides a format specification and schema, an implementation, and tooling, for software citation metadata. It enables researchers who create research software to receive credit for their software work, allows software to be cited in a way that fulfills the functions of citation (Druskat 2020), and enables reuse, reproducibility, and sustainability for research software.
The project is hosted publicly on GitHub, and its sub-projects are all licensed under open licenses and developed as FLOSS. For an overview of the project, see the project website.
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Druskat
title: "My Research Software"
Listing 1: Example of a simple CITATION.cff file.
The Citation File Format is specified in JSON Schema and implemented in YAML 1.2. The latter is a well-known language that is already widely used in software development, e.g., for configuration files. CFF files are named CITATION.cff and provide citation-relevant metadata for software. CFF version 1.2.0 (Druskat, Spaaks, et al. 2021) has also added experimental support for research datasets. A minimal example of a CITATION.cff file is given in Listing 1. The machine-readable YAML format is also relatively easy for humans to read. It is therefore feasible for developers to write CITATION.cff files manually, although tooling for this exists (see The Citation File Format – an ecosystem for software citation).
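The minimal structure of a CITATION.cff file can also be checked programmatically. The following is a minimal sketch in Python and not the project's own validator: it assumes that the file has already been parsed into a dictionary (for example by a YAML library), and the helper name `missing_required_keys` is hypothetical. The four keys it checks are the ones that the CFF 1.2.0 schema marks as required.

```python
# Keys that a CITATION.cff file must contain according to the CFF 1.2.0 schema.
REQUIRED_KEYS = ("cff-version", "message", "authors", "title")


def missing_required_keys(metadata: dict) -> list[str]:
    """Return the required CFF keys that are absent or empty in `metadata`.

    `metadata` is assumed to be the already-parsed content of a CITATION.cff
    file; parsing the YAML itself is outside the scope of this sketch.
    """
    return [key for key in REQUIRED_KEYS if not metadata.get(key)]


# Parsed equivalent of a minimal CITATION.cff file.
minimal = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite it as below.",
    "authors": [{"family-names": "Druskat", "given-names": "Stephan"}],
    "title": "My Research Software",
}

print(missing_required_keys(minimal))             # []
print(missing_required_keys({"title": "Draft"}))  # ['cff-version', 'message', 'authors']
```

A check like this only covers the presence of keys. Full validation against the schema, including value formats, is what the dedicated validation tools are for.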
When developers provide a CFF file with their software, e.g., in the source code repository or a downloadable distribution, users of the software have all the information they need to cite the software version they are using. The use of CFF files thus helps solve the fundamental problem of providing citation metadata for software near the citable object itself. In addition, the fact that these files are written by the software authors themselves solves other issues around software citation. The question of who the authors of a piece of software are, for example, cannot be reliably answered by collating version control metadata alone: there may be authors who have not pushed any changes to a repository, and there may be contributors to a repository who do not qualify as authors. Only the software authors themselves can determine the correct and complete list of software authors. They must therefore provide this information themselves, which CFF enables.
The Citation File Format aims to cover common software citation use cases. Therefore, it also allows software developers to provide metadata for a research output other than the software the CFF file describes, for example for a journal article that describes the software or its use. This is done in a specific field called preferred-citation that is added to CITATION.cff in addition to the basic software metadata. The use of this field allows software authors to receive attribution, credit and acknowledgment even in cases where the citation of the software itself is not yet common practice or even understood at all. The necessary culture change that will enable a complete implementation of the software citation principles (Smith et al. 2016) will take a considerable amount of time, but in the meantime, developers and researchers can get credit for their software right now.
While the preferred-citation option is a pragmatic way for RSEs to operate under and benefit from the current system of academic credit, CFF also provides features that aim to implement better software citation practice, even beyond the baseline of the software citation principles. Research software that is used directly by researchers is an obvious citable object. However, software often reuses other software by taking it as a dependency. These dependencies may not offer a user interface, and although they provide research-relevant functionality and are actually used to yield research results, they are unlikely to be cited in publications. This in turn means that their authors are also unlikely to receive credit for their software work, and the dependencies effectively remain absent from citation and knowledge graphs (Druskat 2020). Therefore, to help make hitherto hidden research software visible, CITATION.cff files can contain a list of references, in analogy to a list of references in a paper. This list can (and arguably should) contain the dependencies of a research software, so that these become visible and citable. This CFF feature also makes it possible to recursively build a complete citation graph for a software and its ecosystem, as dependencies can in turn make their own dependencies visible by using a CFF file, and so on. The process of populating the references section in a CFF file can also be automated, and tooling exists already, e.g., for Java’s build system Maven (Krause and Druskat 2021). The Citation File Format provides other features beyond those described here. For a full list of available fields, refer to the schema guide.
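How a consumer of a CITATION.cff file might act on the preferred-citation field can be sketched in Python as follows. This is an illustration under assumptions, not the behaviour of any particular tool: the metadata is assumed to be already parsed into a dictionary, the function names `citation_target` and `format_citation` are hypothetical, the example author and journal are invented, and the output string follows no real citation style.

```python
def citation_target(metadata: dict) -> dict:
    """Pick the work that should be cited for a piece of software.

    If the authors supplied a `preferred-citation`, that entry is returned;
    otherwise the software's own metadata is the thing to cite.
    """
    return metadata.get("preferred-citation") or metadata


def format_citation(work: dict) -> str:
    """Render a deliberately simple, style-agnostic citation string."""
    authors = "; ".join(
        " ".join(p for p in (a.get("given-names"), a.get("family-names")) if p)
        for a in work.get("authors", [])
    )
    if "journal" in work:
        detail = f"{work['journal']} ({work.get('year', 'n.d.')})"
    elif "version" in work:
        detail = f"version {work['version']}"
    else:
        detail = ""
    parts = [authors, work.get("title", ""), detail]
    return ". ".join(p for p in parts if p) + "."


# Hypothetical, already-parsed CITATION.cff content.
software = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite the article below.",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    "title": "My Research Software",
    "version": "2.0.4",
    "preferred-citation": {
        "type": "article",
        "authors": [{"family-names": "Doe", "given-names": "Jane"}],
        "title": "My Research Software: a description",
        "journal": "Journal of Hypothetical Software",
        "year": 2021,
    },
}

print(format_citation(citation_target(software)))
```

Without the preferred-citation entry, the same call falls back to the software metadata and includes the version, which is the case the software citation principles aim for.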
The Citation File Format as community project – a brief history
The Citation File Format was started based on a lightning talk (Druskat 2017b) presented at the Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE5.1) at the University of Manchester in September 2017, and its subsequent discussion (Druskat, Bast, et al. 2017) during the workshop.
The first version of the format specifications (Druskat 2017c) was subsequently published in October 2017. This was followed up by a series of community consultations and collaboration events that helped refine the format requirements and informed further development: a hackathon at FORCE11’s FORCE2017 conference (Druskat 2017a) in Berlin, Germany; a mini-workshop and hack day group at the Software Sustainability Institute’s Collaborations Workshop 2018 (Costa da Silva, Sufi, and Aragon Camarasa 2019) in Cardiff, UK; a dedicated hack day co-located with the 3rd Conference of Research Software Engineers (RSE18) in Birmingham, UK in 2018; a hack session during the Scientific Software Registry Collaboration Workshop 2019 in College Park, MD, USA; and a two-day hackathon during the FORCE11 online conference FORCE2021 in December 2021.
Ever since the start of the Citation File Format project, it has been run as an open source, open access project, actively seeking – and receiving – contributions from the community. These included not only contributions to the format schema and specifications, but notably also new projects providing tooling to work with CFF (see The Citation File Format – an ecosystem for software citation below), both from individuals and from groups. The Netherlands eScience Center was among the first institutions to use CFF as source format in one of their projects, the flagship Research Software Directory (RSD; Spaaks et al. 2020), and in 2018, Jurriaan H. Spaaks, lead developer of the RSD, joined the Citation File Format project as a co-lead. Further adoption included the provision of record metadata in CFF by the Astrophysics Source Code Library (Allen et al. 2020) from 2019.
Until July 2021, user uptake of CFF grew steadily, with an estimated 500 CITATION.cff files pushed to public GitHub repositories by then. This estimate was made together with Jurriaan Spaaks, based on previous searches for CITATION.cff files in public GitHub repositories. Because of restrictions in API functionality for, e.g., GitLab, Bitbucket and others, GitHub is the only platform for which a global search for filenames is available, and therefore the only platform for which uptake can be measured. In July 2021, after discussion between GitHub, the FORCE11 Software Citation Implementation Working Group and the CFF project, GitHub announced support for the Citation File Format to provide software citation functionality through its user interface. In combination with other tools and platforms announcing support for CFF shortly after (see The Citation File Format as a driver for better visibility of research software and its authors below), the uptake of CFF files accelerated considerably (see Figure 1), with more than 4,300 files available in public GitHub repositories at the time of writing.
- Figure 1: Number of CITATION.cff files in public GitHub repositories from October 2017. Daily counts were tracked from July 2021.
As a consequence, to reflect the recent impact and growth in community, and to safeguard the CFF project’s openness and inclusivity, as well as its vision, mission and scope, we are currently developing a new governance model for the project and its sub-projects, as part of the first cohort of the Code for Science & Society Digital Infrastructure Incubator.
The Citation File Format – an ecosystem for software citation
While the CFF project’s core is the format schema and specifications (Druskat, Spaaks, et al. 2021), it also provides tooling to work with CITATION.cff files, with the aim of covering their whole lifecycle. To this end, there is an initializer implemented as a web form for manual creation, metadata retrieval tooling for creation and updating from existing metadata, and a number of validation and conversion tools. The table gives an overview of existing tools provided by the CFF project and external projects.
|Command line||cffconvert (Spaaks, Klaver, et al. 2021)||cffconvert (Spaaks, Klaver, et al. 2021); bibtex-to-cff (Monperrus n.d.)|
|GitHub Action||cff-validator (Hernangómez 2021)||cffconvert (Spaaks, Klaver, et al. 2021); codemeta2cff (Morrell 2021)|
|Docker||cffconvert (Spaaks, Klaver, et al. 2021)||cffconvert (Spaaks, Klaver, et al. 2021)|
|Go||datatools (Doiel 2021)|
|Java||CFF Maven plugin (Krause and Druskat 2021)||CFF Maven plugin (Krause and Druskat 2021)||CFF Maven plugin (Krause and Druskat 2021)|
|JavaScript||Citation.js plugin (Willighagen 2021)|
|Julia||Bibliography.jl (The Bibliography.jl authors 2021)||Bibliography.jl (The Bibliography.jl authors 2021)|
|PHP||bibtex-to-cff (Monperrus n.d.)|
|Python||doi2cff (Verhoeven and Spaaks 2018)||cffconvert (Spaaks, Klaver, et al. 2021)||cffconvert (Spaaks, Klaver, et al. 2021); doi2cff (Verhoeven and Spaaks 2018)|
|R||citation (Dietrich 2020); handlr (Chamberlain 2020); cffr (Hernangómez, Martins, and Chamberlain 2021)|
|Ruby||ruby-cff (Haines and The Ruby Citation File Format Developers 2021)||ruby-cff (Haines and The Ruby Citation File Format Developers 2021)||ruby-cff (Haines and The Ruby Citation File Format Developers 2021)||ruby-cff (Haines and The Ruby Citation File Format Developers 2021)|
|Website||cffinit (Spaaks, Verhoeven, et al. 2021)|
Tools that support the work with CITATION.cff files. Tools provided under the umbrella of the Citation File Format project are in bold print.
In addition, CFF is convertible to many other formats via CodeMeta (Jones et al. 2017), which provides an exchange schema and crosswalk for software metadata. Further tools are currently under development, e.g., for automated updates to CITATION.cff files in repositories via continuous integration workflows. The existing tools, together with integrations into some major platforms in the open source, open access, research software and more generally academic ecosystem (see The Citation File Format as a driver for better visibility of research software and its authors), make it possible for the Citation File Format to support the complete software citation lifecycle.
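To illustrate what a conversion from CFF to another citation format involves, the following Python sketch maps a few CFF keys to a BibTeX entry. It is a simplified illustration and does not reproduce the output of cffconvert or any other tool listed above. It assumes that the CITATION.cff content is already parsed into a dictionary, the function names are hypothetical, the version and release date in the example are invented, and the `version` field is not rendered by every BibTeX style.

```python
def bibtex_author(author: dict) -> str:
    """Format one CFF author (person or entity) as a BibTeX name."""
    if "name" in author:  # entities such as teams or institutions
        return "{" + author["name"] + "}"
    parts = (author.get("family-names"), author.get("given-names"))
    return ", ".join(part for part in parts if part)


def cff_to_bibtex(metadata: dict, key: str = "software") -> str:
    """Build a BibTeX @misc entry from a small subset of CFF keys."""
    fields = {
        "author": " and ".join(bibtex_author(a) for a in metadata["authors"]),
        "title": metadata["title"],
        "version": metadata.get("version"),
        "doi": metadata.get("doi"),
        # date-released is an ISO date (YYYY-MM-DD); keep only the year.
        "year": str(metadata.get("date-released", ""))[:4],
    }
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items() if value
    )
    return f"@misc{{{key},\n{body}\n}}"


# Hypothetical, already-parsed CITATION.cff content.
example = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite it as below.",
    "authors": [{"family-names": "Druskat", "given-names": "Stephan"}],
    "title": "My Research Software",
    "version": "2.0.4",
    "date-released": "2021-08-11",
}

print(cff_to_bibtex(example))
```

Keys without a value, such as the missing `doi` in the example, are simply left out of the entry.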
The Citation File Format as a driver for better visibility of research software and its authors
Research software, and the central role that it plays in computational research, can only become sustainably more visible when it features in the places where people look for it and use it. Such places include software development platforms where researchers and RSEs collaborate on their code. They also include open access repositories where software can be published in its original state, i.e., as source code or binaries rather than as an abstract description in a text document. And they include the tools that research software users employ regularly in their workflows, such as reference managers.
The Citation File Format is integrated in one of the world’s largest software development platforms, GitHub, where it is used to provide citation information for software that is hosted there: when users add a CITATION.cff file to their repository, the GitHub repository landing page displays a drop-down widget from which users can copy the software citation as a pre-formatted citation string or snippet (see Figure 2).
- Figure 2: The citation widget on GitHub showing the rendered metadata from the CITATION.cff file in the main CFF repository.
The open access repository Zenodo (Research and OpenAIRE 2013), which currently holds more than 66,000 research software records, supports automated publication of research software from GitHub. When the respective integration is activated by the owner of the source code repository on GitHub, Zenodo automatically pulls a snapshot of the repository contents each time a release is created on GitHub. In the process, the correct and complete citation metadata is read from the CITATION.cff file, if the GitHub repository contains one. This makes software publication for authors easier, as no manual editing of the Zenodo record metadata is necessary. It also supports better credit for software authors, who are determined by the authors themselves in the CFF files, not by the list of contributors on GitHub (see the brief discussion of authorship in What is the Citation File Format?).
And finally, when a source code repository on, e.g., GitHub, contains a CFF file, the respective metadata can be automatically inserted into reference managers, such as the open source software Zotero (Corporation for Digital Scholarship, n.d.) and JabRef (JabRef Development Team 2021), via their automatic import functionality. This works because both support the Citation File Format as input format.
In summary, the Citation File Format is used to cover the complete software citation workflow based on the software citation principles through the format itself, its tools ecosystem, and its integrations.
Current challenges and outlook
The CFF project has attracted increased interest in 2021, mostly due to its integrations with popular platforms. The community of users, but also of contributors, is growing organically. However, the fact that there is no formal financial institutional support is becoming challenging, especially as the leadership team will scale down in the imminent future. One measure to counteract this is the setup of a stable governance structure that will safeguard the survival of the project, as part of the project’s inclusion in the Digital Infrastructure Incubator (see above). Nevertheless, resources for development time and community building are needed to further develop both the project itself and its community. Therefore, should any prize be awarded to the Citation File Format, it will be used to directly fund further development of the Citation File Format.
In the future, the CFF schema and specifications will be further developed to meet existing and upcoming challenges for software citation. One of the upcoming features will, for example, include support for the diverse range of contribution roles to software in general, and research software in particular. This issue has been identified by both the project leadership and the community as one of the most pressing ones.
We will also work to grow the tool and integration ecosystem to better support developers and users of research software, as well as development, publication and indexing platforms and external open source projects, in software citation. Support for software citation via CFF in GitLab, similar to that in GitHub, is currently being worked on. New tools for programming language and metadata support (e.g., for Ruby and scholarly metadata via briard (Fenner 2021)), as well as for automation, are currently under development.
CFF also works with research projects in the research metadata space to better support research output publication with richer metadata, e.g., in the recently started HERMES project, funded by the Helmholtz Metadata Collaboration.
The Citation File Format and the campusSOURCE Award 2022
This section briefly summarizes how the Citation File Format supports the aims of the campusSOURCE Award 2022.
The Citation File Format improves the visibility of research software and of its creators and maintainers by enabling software citation that is based on the software citation principles, i.e., acknowledging the importance of research software by making the software itself citable, enabling credit and attribution for its authors, and supporting the principles of specificity and accessibility through the provision of relevant metadata fields. Furthermore, it increases the visibility of software, and the correctness and completeness of its citation metadata, through integrations in major infrastructure platforms. CFF also creates a favourable environment for the development of open research software by providing a way to receive credit for software work, thereby boosting incentives to develop research software in the open. The format further fosters research software sustainability, as research software citation and sustainability are interlinked through increased visibility, findability, and consequently potential for reuse, and opportunities for funding and further development. Thereby CFF – albeit indirectly – creates potential to improve the quality of research software, as cited research software may attract reuse and additional resources to improve the quality of the software.
References
Allen, A., R. Nemiroff, P. Ryan, J. Schmidt, and P. Teuben. 2020. “Best Ways to Let Others Know How to Cite Your Research Software.” In American Astronomical Society Meeting Abstracts #235, 235:109.12. Honolulu, Hawai‘i, USA. https://ui.adsabs.harvard.edu/abs/2020AAS...23510912A
Chamberlain, Scott. 2020. “Handlr: Convert Among Citation Formats.” https://CRAN.R-project.org/package=handlr.
Corporation for Digital Scholarship. n.d. “Zotero.” https://www.zotero.org/.
Costa da Silva, Raniere Gaia, Shoaib Sufi, and Selina Aragon Camarasa. 2019. “Collaborations Workshop 2018 (CW18) Report , Productivity and Sustainability.” Research Ideas and Outcomes 5 (January): e30250. https://doi.org/10.3897/rio.5.e30250.
Dietrich, Jan Philipp. 2020. “Citation: Software Citation Tools.” https://CRAN.R-project.org/package=citation.
Doiel, Robert. 2021. “Datatools.” https://github.com/caltechlibrary/datatools.
Druskat, Stephan. 2017a. “Hacking the Future of Software Citation.” Software and Research. https://www.software.ac.uk/blog/2017-11-09-hacking-future-software-citation.
———. 2017b. “Should CITATION Files Be Standardized?” In Proceedings of the Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE5.1), edited by Neil Chue Hong, Stephan Druskat, Robert Haines, Caroline Jay, Daniel S. Katz, and Shoaib Sufi. Manchester, UK: figshare. https://doi.org/10.6084/m9.figshare.3827058.v4.
———. 2017c. “Citation File Format 0.9-RC1 (CFF),” October. https://doi.org/10.5281/ZENODO.1003150.
———. 2020. “Software and Dependencies in Research Citation Graphs.” Computing in Science & Engineering 22 (2): 8–21. https://doi.org/10.1109/MCSE.2019.2952840.
Druskat, Stephan, Radovan Bast, Neil Chue Hong, Alexander Konovalov, Andrew Rowley, and Raniere Silva. 2017. “A Standard Format for CITATION Files.” Software and Research. https://www.software.ac.uk/index.php/blog/2017-12-12-standard-format-citation-files.
Druskat, Stephan, Daniel S. Katz, and Ilian T. Todorov. 2021. “Research Software Sustainability and Citation,” March. https://doi.org/10.1109/BoKSS52540.2021.00008.
Druskat, Stephan, Jurriaan H. Spaaks, Neil Chue Hong, Robert Haines, James Baker, Spencer Bliven, Egon Willighagen, David Pérez-Suárez, and Alexander Konovalov. 2021. “Citation File Format.” https://doi.org/10.5281/zenodo.5171937.
Fenner, Martin. 2021. “Briard.” https://github.com/front-matter/briard.
Haines, Robert, and The Ruby Citation File Format Developers. 2021. “Ruby CFF Library.” https://doi.org/10.5281/zenodo.1184077.
Hernangómez, Diego. 2021. “Cff-Validator.” https://doi.org/10.5281/zenodo.5348444.
Hernangómez, Diego, João Martins, and Scott Chamberlain. 2021. “Cffr: Generate Citation File Format (’Cff’) Metadata for R Packages.” https://CRAN.R-project.org/package=cffr.
Howison, James, and Julia Bullard. 2015. “Software in the Scientific Literature: Problems with Seeing, Finding, and Using Software Mentioned in the Biology Literature.” Journal of the Association for Information Science and Technology 67 (9): 2137–55. https://doi.org/10.1002/asi.23538.
JabRef Development Team. 2021. “JabRef Open-Source, Cross-Platform Citation and Reference Management Software.” https://www.jabref.org/.
Jones, Matthew B., Carl Boettiger, Abby Cabunoc Mayes, Arfon Smith, Peter Slaughter, Kyle Niemeyer, Yolanda Gil, et al. 2017. CodeMeta: An Exchange Schema for Software Metadata. Version 2.0. https://doi.org/10.5063/schema/codemeta-2.0.
Katz, Daniel S., Daina Bouquin, Neil P. Chue Hong, Jessica Hausman, Catherine Jones, Daniel Chivvis, Tim Clark, et al. 2019. “Software Citation Implementation Challenges.” http://arxiv.org/abs/1905.08674.
Krause, Thomas, and Stephan Druskat. 2021. “CFF Maven Plugin.” https://github.com/hexatomic/cff-maven-plugin.
Monperrus, Martin. n.d. “Bibtexbrowser: Publication Lists with Bibtex and PHP.” Accessed December 15, 2021. https://www.monperrus.net/martin/bibtexbrowser/.
Morrell, Thomas E. 2021. “Codemeta2cff.” https://github.com/caltechlibrary/codemeta2cff.
Research, European Organization For Nuclear, and OpenAIRE. 2013. “Zenodo: Research. Shared.” https://doi.org/10.25495/7GXK-RD71.
Smith, Arfon M., Daniel S. Katz, Kyle E. Niemeyer, and FORCE11 Software Citation Working Group. 2016. “Software Citation Principles.” PeerJ Computer Science 2 (e86). https://doi.org/10.7717/peerj-cs.86.
Spaaks, Jurriaan H., Tom Klaver, Stefan Verhoeven, Faruk Diblen, Jason Maassen, Erik Tjong Kim Sang, Pushpanjali Pawar, et al. 2020. “Research Software Directory.” Zenodo. https://doi.org/10.5281/ZENODO.1154130.
Spaaks, Jurriaan H., Tom Klaver, Stefan Verhoeven, Stephan Druskat, and Waldir Leoncio Netto. 2021. “Cffconvert.” Zenodo. https://doi.org/10.5281/zenodo.5521767.
The Bibliography.jl authors. 2021. “Bibliography.Jl.” https://github.com/Humans-of-Julia/Bibliography.jl.
Verhoeven, Stefan, and Jurriaan H. Spaaks. 2018. “DOI 2 Citation Format File Generator.” https://doi.org/10.5281/zenodo.1206049.
Willighagen, Lars. 2021. “@Citation-Js/Plugin-Software-Formats.” https://www.npmjs.com/package/@citation-js/plugin-software-formats.