Data hygiene: Producing clean, well-organized data from the start makes it easier to reuse it later. Illustration by Daniel Barreto.

This series of scientist-written essays explores the benefits and challenges of data-sharing and open-source technologies in neuroscience.

The field of neuroscience has witnessed a sea change in its attitude toward open science over the past 10 years. Thanks to mandates from journals and funders, the establishment of large-scale public repositories, and broader shifts in academic culture, it is now routine for many researchers to deposit data for use by anyone, anywhere. This practice has numerous benefits—including secondary analyses of data, the discovery of errors and the development of hands-on pedagogical materials. But current practices for data-sharing often fall short.

As a user of deposited data, I frequently find myself poring over repositories that are extremely challenging to navigate. Many repositories lack sufficient metadata, and it takes a significant amount of sleuthing—and usually multiple emails with the authors—to decipher how the data are formatted. Also, the organization of the data is often idiosyncratic to a particular repository, making it difficult to apply standardized software tools. Most of my time is sunk into preprocessing.

There are several explanations for this messy state of affairs: the trainees who produce the data have not been taught this aspect of their work, nor does academic culture value it.

Here I want to make the case that useful open neuroscience is something everyone can and should strive for—even if only for their own benefit.

Consider this scenario I’ve found myself in multiple times: In my rush to get experiments done and submit a paper, I did not note why I organized my data in a particular way, why I added a certain preprocessing step or what the various variables were. Six months after submitting the paper, perhaps after several rejections, I’m preparing a revision. Now my most important collaborator is me six months ago, and sadly, past-me doesn’t answer emails.

From a purely selfish perspective, I should have produced hygienic data—clean, well-organized, comprehensible—to spare myself the subsequent pain. Had past-me spent the time to properly organize my data, not only would revisions go more smoothly, but it would also be easier to handle requests years down the line from researchers who want to use the data. Making data (and code) useful for yourself automatically makes it useful for others. It also leads to broader dissemination and citation of your work, new scientific insights and good will among the community of researchers.
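What does hygienic data look like in practice? One lightweight habit is to never save a data file without a small, human-readable sidecar that records what the dimensions, units and preprocessing steps are. The sketch below illustrates the idea in Python; the file names, variables and preprocessing notes are hypothetical, not a prescribed standard.

```python
import json
from datetime import date

import numpy as np

# Hypothetical example: trial-by-neuron spike counts from an experiment.
spike_counts = np.random.poisson(lam=2.0, size=(120, 64))  # 120 trials x 64 neurons
np.save("spike_counts.npy", spike_counts)

# A sidecar data dictionary answering the questions future-you will ask:
# what the axes mean, which units were used and what preprocessing was applied.
metadata = {
    "file": "spike_counts.npy",
    "description": "Spike counts per trial and neuron",
    "dimensions": {"axis_0": "trial (n=120)", "axis_1": "neuron (n=64)"},
    "units": "spikes per 500-ms response window",
    "preprocessing": [
        "excluded trials with movement artifacts",
        "excluded units with mean firing rate below 0.5 Hz",
    ],
    "date_generated": date.today().isoformat(),
}
with open("spike_counts.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Writing such a file takes a few minutes when the data are generated and can spare hours of reverse-engineering during a revision.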

Given the clear benefits, why aren't hygienic data approaches the norm? Spending time on getting a paper published has a much greater impact on hiring and promotion than spending time on data formatting. One way to remedy this is to change our approach to hiring and promotion.

Committees could make open-science practices part of the evaluation process, such as by requesting a statement on open science. Importantly, this must be about more than just dumping data into repositories; researchers should be expected to engage comprehensively and systematically with the spirit of open science, which means useful open science.

Journals and funders can also be part of the solution, though evaluation and enforcement are labor-intensive. Perhaps the most important tool for promoting useful open science is through training: Students should take courses that teach these skills, which of course means that the courses need to exist. Professors can do their part by teaching those courses and setting standards for data-sharing practices in their labs. (For more on the need for data literacy training, see “Neuroscience graduate students deserve comprehensive data-literacy education.”)

There are many useful resources to build on: Patrick Mineault has written “The Good Research Code Handbook,” which elaborates on some of the points made above. The Center for Open Science offers a modular training course for openness and reproducibility. There have been several efforts to develop standard data formats, such as the Brain Imaging Data Structure and Neurodata Without Borders. The International Neuroinformatics Coordinating Facility supports a broad range of activities in the service of “findable, accessible, interoperable and reusable” (FAIR) practices. These efforts provide reasons to be optimistic about the future of open neuroscience. And I encourage principal investigators to develop their own guidelines to share with their team, tailored to the lab’s needs and constraints.
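To give a concrete flavor of what a standard format buys you, here is a minimal sketch using pynwb, the Python reference API for Neurodata Without Borders. The session details and the toy signal are hypothetical, and a real dataset would carry much richer metadata (subject, devices, electrodes and so on), but even this skeleton forces the essential context to travel with the data.

```python
from datetime import datetime, timezone

import numpy as np
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

# Minimal NWB file: the required session metadata is stored with the data itself.
nwbfile = NWBFile(
    session_description="Hypothetical example recording session",
    identifier="session-001",  # hypothetical session identifier
    session_start_time=datetime.now(timezone.utc),
)

# A toy signal standing in for recorded data.
signal = TimeSeries(
    name="example_signal",
    data=np.random.randn(1000),
    unit="volts",
    starting_time=0.0,
    rate=1000.0,  # sampling rate in Hz
)
nwbfile.add_acquisition(signal)

# Write a self-describing file that standard tools can open without guesswork.
with NWBHDF5IO("session-001.nwb", mode="w") as io:
    io.write(nwbfile)
```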

In my own lab, I created “Lightweight guidelines for code and data sharing,” a manual that helps lab members develop workflows that are user-friendly and reproducible both for ourselves and others. Although I try not to micromanage my trainees, I do make sure that they observe basic standards for depositing data, and I discuss these issues whenever confusion arises. Because open science has become a norm in my lab, trainees can also receive guidance from each other. My hope is that as these practices become more widely accepted, trainees will become less dependent on top-down enforcement by principal investigators.
