To keep or not to keep: Neurophysiology’s data dilemma

An exponential growth in data size presents neuroscientists with a significant challenge: Should we be keeping all raw data or focusing on processed datasets? I asked experimentalists and theorists for their thoughts.

By Nima Dehghani
25 November 2024 | 5 min read
Challenging choices: Though the ideal scenario might involve retaining both raw and processed data, storage costs and access limitations mean that many labs are forced to make difficult choices.
Illustrations by Daniel Barreto

Neuroscience is at a crossroads. The latest advances in electrophysiology and optophysiology, such as Neuropixels probes and light-sheet microscopy, have pushed the boundaries of what we can record from the brain. These technologies are generating vast amounts of data—single experiments can produce petabytes’ worth—far more than we have ever dealt with before, sparking a critical discussion: How do we store and access all this information? Should we be keeping all raw data or focusing on processed datasets? And if we can’t keep everything, how do we decide what to discard?

Raw data are the most complete and unfiltered record of an experiment, capturing every detail, including those that might seem irrelevant at first. They are indispensable for certain types of research, particularly when it comes to developing new methodologies or uncovering novel insights. Refined spike-sorting algorithms, for example, might extract meaningful patterns from what currently appears to be background activity.

Keeping raw data also enhances transparency and reproducibility, two pillars of rigorous scientific research. By preserving the original data, we enable other researchers to validate our findings and even uncover new insights that might not have been apparent initially. More recently, raw data have become an important training ground for artificial-intelligence models, an increasingly widespread tool in neuroscience research.

Raw data are incredibly valuable, but processed data play an equally important role in the research ecosystem. Data that have undergone some type of preprocessing, such as spike sorting, filtering or deconvolution, are often much easier to share and work with.

Sharing processed data can also reduce the burden on those looking to reuse datasets. Rather than redoing all the preprocessing steps, researchers can build on the work of others and focus their efforts on new analyses or interpretations. This efficiency is particularly valuable in collaborative fields such as neuroscience, in which different experts may contribute to different stages of the research process, and for researchers who may not be experts in the nuances of data preprocessing, such as theorists focused on modeling rather than data acquisition.

Both processed and raw data have their unique advantages and challenges. Understanding the trade-offs between the two is crucial for determining what to keep and how to make the most of the data we generate. Though the ideal scenario might involve retaining both, storage costs and access limitations mean that many labs are forced to make difficult choices.

Storing raw data can be costly, both in terms of physical storage infrastructure and the complexity of managing such large datasets. Cloud storage solutions may eventually grow with our data needs, but the expense and challenges of ensuring data integrity over time are not trivial. On the access front, the sheer size of these datasets can make it difficult for researchers to download and analyze data efficiently. This barrier has led to the development of strategies such as “lazy loading,” in which only the parts of the data necessary for a specific analysis are accessed at any given time. This approach, though effective, requires sophisticated data management infrastructure and presents a learning curve for researchers who are used to more traditional methods of data access.
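As a rough illustration of the idea, here is a minimal Python sketch of lazy loading using the h5py library; the file name and dataset path are hypothetical placeholders, but the pattern is the point: opening the file touches only metadata, and data are read from disk only for the slice an analysis actually requests.

```python
# Minimal sketch of lazy loading from an HDF5-style recording.
# "session_001.h5" and "acquisition/raw_traces" are hypothetical placeholders.
import h5py

with h5py.File("session_001.h5", "r") as f:
    traces = f["acquisition/raw_traces"]  # a handle to data still on disk

    # Only this slice (the first 30,000 samples of channels 0-63)
    # is read into memory for the analysis at hand.
    chunk = traces[:30_000, 0:64]
    print(chunk.shape, chunk.dtype)
```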

Quality control: To encourage reuse, stored data should be of the highest quality. But the community has yet to converge on methods for quality assessment and control.

Neuroscience can learn from particle physics and astronomy in managing large datasets. For decades, CERN and NASA have handled massive data volumes, preserving necessary raw data from particle collisions and space missions for future analysis. Their success in supporting long-term scientific research is rooted in data collection via sophisticated centralized instruments and dedicated data-processing teams—infrastructure that makes it possible to manage and preserve even the largest datasets effectively.

Neuroscience might follow a similar path, moving toward shared advanced experimental resources and centralized data-processing teams. The Allen Institute’s OpenScope, the first neuroscience observatory of its kind, exemplifies this trend. Here, standardized data collection and processing provide broad access to high-quality datasets, enabling researchers to focus on specific scientific questions without getting bogged down in the details of data acquisition and preprocessing.

Despite these challenges, the retention of raw data is becoming increasingly crucial because of its potential to fuel future technological advancements. As we grapple with the practicalities of data management, we must also consider the opportunities that vast, rich datasets present.

I asked some experimentalist and theorist colleagues to share their thoughts on the pros and cons of keeping versus not keeping raw data.
