Automating data integration and publishing for neuroimaging via LSLAutoBIDS

Manpa Barman; Jan Range; Benedikt Valerian Ehinger

doi:10.52294/001c.159415

Introduction

Modern neuroscience research often relies on collecting complex, multimodal datasets, for example, combining behavioral, eye-tracking, and neuroimaging (e.g., electroencephalography (EEG)) data. Such datasets are valuable for understanding cognitive processes, but they are challenging to collect because experiments are resource-intensive and logistically demanding. They are often collected by trainees or research assistants with limited prior experience or training. Even to more experienced data collectors, the following examples of what can go wrong will sound very familiar: 1) The same participant ID was used twice in a row, and data was irreversibly overwritten. 2) While running a study, the experiment was modified to fix a bug, but it was not documented after which subject the fix was applied; later, it remained unclear whether the early subjects' data had been retroactively fixed or not. 3) A master’s student left the lab without documenting their last dataset, leaving a mess of various files on various computers. 4) To move files from the various recording computers to their analysis system, every lab user seems to have their own idiosyncratic workflow, from USB sticks to WiFi hotspots. In this paper, we show that these issues can be prevented while at the same time saving time and training effort.

To increase the robustness of data recording pipelines, as well as to encourage data sharing, we present an approach to automate parts of this workflow. The automation we propose can be understood under the umbrella of open science by design.¹ The central idea is to change infrastructure and workflows in a way that they are open from the point of creation, making open science a by-product and not an afterthought.¹ In our case, the new data is curated and inserted into an existing dataset, versioned, and (privately) archived in a remote repository immediately after the recording. Additionally, many open-data checkboxes are ticked immediately, without additional effort by the experimenter.

Our proposed workflow automates four main aspects of such data recording pipelines: data integration, data curation, data versioning, and data publishing.

Data Integration: Collected data often comes from many sources. LabStreamingLayer (LSL)² has become a standard way to record multiple data streams and synchronize their time series. In addition, other data sources need to be collected from devices that cannot send them via LSL (e.g., experimental logging files, proprietary eye-tracking data).

Data Curation: To allow machines and humans to navigate such a diverse data collection, we want to structure them. The Brain Imaging Data Structure (BIDS)³ standard provides a well-described data curation environment, offering consistent directory structure, file naming conventions, and metadata descriptors for neuroimaging data. Once data is structured in BIDS format, the dataset can be seamlessly integrated into BIDS-compliant analysis pipelines.
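The naming conventions mentioned above can be illustrated with a minimal sketch. The helper below is hypothetical (it is not part of LSLAutoBIDS or mne-bids) and only assembles the relative path that BIDS prescribes for an EEG recording from its subject, session, task, and optional run labels. The example labels and the BrainVision extension are assumptions chosen for illustration.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# BIDS labels may only contain letters and digits.
_LABEL = re.compile(r"[A-Za-z0-9]+")


def bids_eeg_path(subject: str, session: str, task: str,
                  run: Optional[int] = None,
                  extension: str = ".vhdr") -> PurePosixPath:
    """Return the relative BIDS path of one EEG recording (hypothetical helper)."""
    for label in (subject, session, task):
        if not _LABEL.fullmatch(label):
            raise ValueError(f"invalid BIDS label: {label!r}")
    # Entities are joined with underscores, in the order required by BIDS.
    entities = [f"sub-{subject}", f"ses-{session}", f"task-{task}"]
    if run is not None:
        entities.append(f"run-{run}")
    filename = "_".join(entities) + "_eeg" + extension
    # EEG files live in sub-<label>/ses-<label>/eeg/.
    return PurePosixPath(f"sub-{subject}", f"ses-{session}", "eeg", filename)


print(bids_eeg_path("001", "001", "oddball"))
```

Because every file location follows from a handful of labels, both people and tools can locate a recording without consulting lab-specific documentation.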

Data Versioning: The next aspect is to place datasets under version control. While much progress has been made in version control for analysis code, datasets are only rarely versioned by default. DataLad⁴ combines git and git-annex, allowing for efficient versioning of not only code, but also binary files. Crucially, adding new subjects, annotations, or corrections will leave a version trace, making the data collection robust to accidentally deleting or overwriting data and improving transparency of data provenance.⁵

Data Publishing: Our last step, which traditionally is often taken only after the data has been fully analyzed, is long-term archiving and (access-restricted) publishing. Immediately publishing the raw data has the benefit of frontloading (and externalizing) the additional effort required to publish the data. Automating this is made possible by linking the versioned DataLad⁴ datasets directly with data repositories that support the FAIR⁶ principles (findability, accessibility, interoperability, and reusability), such as Dataverse,⁷ which provides Digital Object Identifiers (DOI), customizable access controls, archival service, and further automatic versioning.

In our work, we addressed all four aspects in a single automated workflow. Our contributions are: 1) we discuss a workflow to integrate community standards for data curation, automated publishing, and version control, with the potential to generalize across modalities and experimental paradigms, and 2) we introduce LSLAutoBIDS, a Python package as an example implementation of such a workflow, using LabStreamingLayer,² converting to BIDS, versioning via DataLad,⁴ and publishing via Dataverse.

Related Works

The pioneering work of Dobson et al.⁸ should be highlighted: They present ReproIn, a software package quite similar to ours, which converts DICOM (Digital Imaging and Communications in Medicine) data from a magnetic resonance (MR) scanner to BIDS, and finally saves it in a DataLad repository. Compared to our implementation, ReproIn is optimized to work with different MR scanners, whereas we use LSL to integrate time-series data more directly. Instead of heudiconv, we use mnelab, pylsl, and mne-bids for the data curation stage. While ReproIn does not automatically link the DataLad repository to a Dataverse, this additional step could be readily implemented for ReproIn as well.

Another pioneering approach is taken at the Donders Institute for Brain and Cognition.⁹ There, all functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) data are automatically archived in a data acquisition collection (DAC). Other data needs to be added manually to the DAC. While the DAC is centralized, it is typically only used internally, and later, a data sharing collection (DSC) is created, which can be publicly shared.

A different example of a data curation workflow, incorporating standardization and publishing, i.e., from raw data acquisition through standardization to analysis, is presented in Stawiski et al.¹⁰ In this project, they collected heterogeneous raw data of more than 100 participants in two clinics, transformed them to BIDS, and used SQLite to manage the resulting dataset. Even though they do not discuss versioning and publishing, it is easily conceivable how these steps could be added to their workflow.

The previously introduced Brain Imaging Data Structure,³ originally proposed for magnetic resonance imaging (MRI), is a community standard for data curation and sharing of brain data within research communities. This standard is designed using the FAIR principles,⁶ contributing to efficient scientific data curation and stewardship. Following the release of the BIDS standard, numerous extensions were developed, concerning integration of BIDS with different neuroimaging techniques like MEG,¹¹ intracranial electroencephalography (iEEG),¹² EEG,¹³ positron emission tomography (PET),¹⁴ and others. Data sharing and reuse of these BIDS-compliant datasets is even further simplified using open-source sharing platforms, for instance, OpenNeuro,¹⁵ which currently hosts more than 600 datasets in compliance with BIDS.

Packages like mne-bids¹⁶,¹⁷ or BIDSCoin¹⁸ (using the sova2coin plugin) are essential for moving and converting source data into BIDS-compliant datasets; subsequently, the BIDS validator³ can be used to validate the data collection. To make the data conversion even more user-friendly for researchers, BIDSCoin¹⁸ offers a graphical user interface, making it accessible even to those without programming experience. As far as we know, there is no dedicated LSL-to-BIDS converter available.

Unlike software and code versioning, data versioning is rarely practiced, given the practical difficulties of versioning large binary files. Nevertheless, version control is an important aspect in data collection to ensure data traceability and an error-free workflow. DataLad,⁴ based on git-annex, nicely extends the capabilities of classical git versioning tools to large, binary datasets. DataLad is actively used in hundreds of studies and underlies all datasets on OpenNeuro.

LSLAutoBIDS

Figure 1. Flowchart of LSLAutoBIDS. We start with different data streams recorded using LabStreamingLayer and with data stored in other proprietary file formats. (1) We use LSLAutoBIDS to integrate data that is collected across different computers and synchronize the data streams temporally. (2) Next, we organize these data into the BIDS structure. (3) Using DataLad, we version them, and (4) finally upload them to an open repository like Dataverse or OpenNeuro.

LSLAutoBIDS is an open-source Python package developed and actively used by the Computational Cognitive Science Lab at the University of Stuttgart. It offers a modular and reproducible workflow tailored for studies using LSL-based data acquisition, specifically targeting the integration of EEG and eye-tracking modalities.

In this setup, participant-level EEG data streams and other LSL streams (e.g., motion tracking, mobile eye-tracking) are recorded using the LSL protocol,¹⁹ most commonly via the LSLRecorder²⁰ package. LSL allows for sub-millisecond time-synchronization of heterogeneously sampled streams, and the recording results in an XDF container file. Project metadata such as authors, license, and experimental description, but also Dataverse details, are specified in one central TOML configuration file, from which these project-specific metadata are retrieved during the conversion process. The recorded raw data streams are then converted into the BIDS³ standard using the mnelab,²¹ pylsl,¹⁹ mne,²² and mne_bids¹⁶ packages. The converted EEG data are then validated with the BIDSValidator, while non-BIDS files like project metadata, configuration files, and additional modalities such as eye-tracking data are listed in a .bidsignore file to exclude them from validation. Our output is thus compliant with the BIDSValidator. Once validated, the BIDS-compliant dataset is automatically deposited into a Dataverse repository, along with the experiment stimulus files and the raw source data.
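To make the role of the central configuration file more concrete, the following is a minimal sketch of how such a TOML file could be read and checked, and how a .bidsignore file could be derived from it. All section names, key names, and values are assumptions made for illustration; they do not reproduce the actual LSLAutoBIDS configuration schema.

```python
try:
    import tomllib  # part of the standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters: use the tomli backport
    import tomli as tomllib

# Hypothetical project configuration; all keys are illustrative only.
EXAMPLE_CONFIG = """
[project]
title = "Example EEG and eye-tracking study"
authors = ["First Author", "Second Author"]
license = "CC-BY-4.0"
task = "oddball"

[dataverse]
base_url = "https://dataverse.example.org"
parent_collection = "example-lab"

[bids]
ignore = ["*_eyetrack.edf", "project.toml"]
"""

# Keys that this sketch treats as mandatory (an assumption).
REQUIRED = {
    "project": ("title", "authors", "license", "task"),
    "dataverse": ("base_url", "parent_collection"),
}


def load_project_config(text: str) -> dict:
    """Parse the TOML text and fail early if a required key is missing."""
    config = tomllib.loads(text)
    for section, keys in REQUIRED.items():
        missing = [key for key in keys if key not in config.get(section, {})]
        if missing:
            raise KeyError(f"[{section}] is missing: {', '.join(missing)}")
    return config


def bidsignore_text(config: dict) -> str:
    """Return one ignore pattern per line, as expected in a .bidsignore file."""
    patterns = config.get("bids", {}).get("ignore", [])
    return "".join(pattern + "\n" for pattern in patterns)


config = load_project_config(EXAMPLE_CONFIG)
print(config["project"]["title"])
print(bidsignore_text(config), end="")
```

Checking the configuration before any conversion starts means that a missing license or Dataverse address is reported once, at setup time, rather than after a recording session.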

In addition to LSL-based EEG data acquisition, LSLAutoBIDS supports the incorporation of non-LSL data. For example, in our use case, eye-tracking data are collected using the EyeLink 1000 Plus eye tracker simultaneously with EEG data collection, which produces proprietary EyeLink data format (EDF) files that cannot easily be streamed via LSL. Other examples are an electronic lab notebook record, log files of the experiment, and a compressed archive of the experimental code used in that individual session. LSLAutoBIDS accommodates these files as a secondary data modality. This is currently implemented via user-defined folders, filename specifications, and regular expression matching, which allows the files to be organized within the appropriate BIDS subdirectories and published alongside the EEG data.
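The regular expression matching described above could look roughly like the following sketch. The rules, target subdirectories (beh and misc), and suffixes are assumptions chosen for illustration and are not taken from the LSLAutoBIDS source; the function only plans target paths and does not copy any files.

```python
import re
from pathlib import PurePosixPath

# Hypothetical rules: (filename pattern, BIDS subdirectory, filename suffix).
RULES = [
    (re.compile(r"^.+(?P<ext>\.edf)$", re.IGNORECASE), "beh", "eyetrack"),
    (re.compile(r"^.+(?P<ext>\.log)$", re.IGNORECASE), "misc", "experimentlog"),
    (re.compile(r"^.+(?P<ext>\.zip|\.tar\.gz)$", re.IGNORECASE), "misc", "experimentcode"),
]


def plan_targets(filenames, subject, session, task):
    """Map each source filename to a BIDS-style target path.

    Returns a dict of planned targets and a list of filenames that
    matched no rule, so that nothing is silently dropped.
    """
    prefix = f"sub-{subject}_ses-{session}_task-{task}"
    planned, unmatched = {}, []
    for name in filenames:
        for pattern, subdir, suffix in RULES:
            match = pattern.match(name)
            if match:
                ext = match.group("ext").lower()
                planned[name] = PurePosixPath(
                    f"sub-{subject}", f"ses-{session}", subdir,
                    f"{prefix}_{suffix}{ext}",
                )
                break  # the first matching rule wins
        else:
            unmatched.append(name)
    return planned, unmatched


planned, unmatched = plan_targets(
    ["P001.EDF", "run.log", "experiment_code.tar.gz", "notes.txt"],
    subject="001", session="001", task="oddball",
)
for source, target in planned.items():
    print(f"{source} -> {target}")
print("unmatched:", unmatched)
```

Reporting unmatched files explicitly is a deliberate choice in this sketch: a file that fits no rule is more likely a naming mistake than something that should be ignored.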

Version control is integrated into the pipeline using DataLad,⁴ enabling precise tracking of all data and metadata changes across the research lifecycle, including modification or re-transformation of the individual files. DataLad is very powerful, but we use only the basic functionalities to simplify adoption for new users. We are not using the datalad run or datalad rerun capabilities, but are only concerned with the versioning of large files (datalad save and datalad push). The two minimal commands a user needs to know are datalad clone to check out a repository and datalad get to download large files. This split between providing only symbolic links to all binary files first and only then explicitly downloading, e.g., a subset of the dataset, allows for easy exploration of terabyte-sized datasets.
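The four DataLad commands named above can be sketched as follows. This is an illustrative wrapper around the DataLad command line, not the way LSLAutoBIDS itself invokes DataLad. The sibling name, repository URL, and paths are placeholders, and the commands are only printed unless dry_run is set to False, in which case an installed DataLad is assumed.

```python
import shlex
import subprocess


def recording_commands(dataset_dir, message, sibling):
    """Commands for the recording side: record a new version, then upload it."""
    return [
        (["datalad", "save", "-m", message], dataset_dir),
        (["datalad", "push", "--to", sibling], dataset_dir),
    ]


def reader_commands(url, target_dir, subpath):
    """Commands for a data user: clone the lightweight dataset, then fetch a subset."""
    return [
        (["datalad", "clone", url, target_dir], "."),
        (["datalad", "get", subpath], target_dir),
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it in its working directory."""
    for argv, cwd in commands:
        if dry_run:
            print(f"(in {cwd}) {shlex.join(argv)}")
        else:
            subprocess.run(argv, cwd=cwd, check=True)


# Placeholder values for illustration only.
run_all(recording_commands("my_dataset", "Add sub-001 ses-001", "dataverse"))
run_all(reader_commands("https://example.org/my_dataset.git", "my_dataset", "sub-001"))
```

In the reader example, only the files below sub-001 would be downloaded; all other binary files stay available as links until they are requested.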

After each recording session, the dataset is then uploaded to an initially private Dataverse repository, where it is persistently archived with the specified metadata under a DOI. After data collection is finished, the dataset can be publicly released and shared, potentially under a data sharing agreement.

The architecture of LSLAutoBIDS is deliberately designed to be extensible. While we developed this tool with our setup in mind (EyeLink 1000 Plus, ANT Neuro EEG, VR-setups), we will continuously add new devices and allow for further customizability. Additional files, such as behavioral data, audio recordings, or physiological signals, can already be integrated by extending configuration templates and adding corresponding processing steps. This generalizability makes LSLAutoBIDS not just a tool for EEG and eye-tracking studies, but a proof of concept for broader multimodal data workflows in cognitive neuroscience.

How do workflows like LSLAutoBIDS address the examples raised in the introduction? 1) If data is immediately versioned and archived, then overwritten data can always be recovered. 2) As experimental files are included in the archive for each subject, it is easy to find any changes in the experimental code during data recording. 3) Even if a student leaves without further documenting the dataset, it is already in a cleaned-up state, including metadata, greatly simplifying further use of the data. 4) Training of new users is greatly simplified, because manual copying of data via USB sticks and other idiosyncratic workflows is no longer necessary.

Conclusion

Practicing open science by design and frontloading the conversion to a citable data publication efficiently addresses many hurdles researchers face in data collection and publishing. Not only is any recorded dataset immediately converted to BIDS, it is also archived, findable, shareable, backed up, and versioned.

Whether the final dataset is publicly available or only used internally (e.g., due to privacy issues in non-defaced MRIs), we think such a workflow will be helpful to many laboratories.

Code and Data Availability

The LSLAutoBIDS package is continuously developed and freely available: https://github.com/s-ccs/LSLAutoBIDS or via Zenodo https://zenodo.org/records/15525822.

Conflict of Interest

The authors declare no conflicts of interest that may bias or could be perceived to bias this work.

Funding Sources

Funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) in the Emmy Noether Programme - Project-ID 538578433 - and Germany's Excellence Strategy EXC 2075 - 390740016.
