Database Credentialed Access

Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information

Alistair Johnson, Jean-Christophe Bélisle-Pipon, David Dorr, Satrajit Ghosh, Philip Payne, Maria Powell, Anaïs Rameau, Vardit Ravitsky, Alexandros Sigaras, Olivier Elemento, Yael Bensoussan

Published: Nov. 27, 2024. Version: 1.0


When using this resource, please cite:
Johnson, A., Bélisle-Pipon, J., Dorr, D., Ghosh, S., Payne, P., Powell, M., Rameau, A., Ravitsky, V., Sigaras, A., Elemento, O., & Bensoussan, Y. (2024). Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (version 1.0). Health Data Nexus. https://doi.org/10.57764/qb6h-em84.

Abstract

The human voice contains complex acoustic markers which have been linked to important health conditions including dementia, mood disorders, and cancer. When viewed as a biomarker, voice is a promising characteristic to measure as it is simple to collect, cost-effective, and has broad clinical utility. Recent advances in artificial intelligence have provided techniques to extract previously unknown prognostically useful information from dense data elements such as images. The Bridge2AI-Voice project seeks to create an ethically sourced flagship dataset to enable future research in artificial intelligence and support critical insights into the use of voice as a biomarker of health. Here we present Bridge2AI-Voice, a comprehensive collection of data derived from voice recordings with corresponding clinical information. Bridge2AI-Voice v1.0, the initial release, provides 12,523 recordings for 306 participants collected across five sites in North America. Participants were selected based on known conditions which manifest within the voice waveform including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The initial release contains data considered low risk, including derivations such as spectrograms but not the original voice recordings. Detailed demographic, clinical, and validated questionnaire data are also made available.


Background

The production of human voice involves the complex interaction among respiration, phonation, resonation, and articulation. The respiratory system provides the air flow and pressure to initiate and maintain vocal fold vibration. The vocal folds generate the sound source which is then modified within the vocal tract by the oral and nasal cavities and the articulators involved in speech production. Each of these processes is influenced by the speaker’s ability to adjust and shape these interacting systems.    

Although many use the terms voice and speech interchangeably, it is important to understand the distinction between the different terms used to describe human sounds:

Voice: In the voice research field, voice refers to sound production and is the phonatory aspect of speech; in other words, it is the sound produced by the larynx and the resonators. For example, voice can be assessed by asking someone to produce a prolonged vowel sound such as /e/.

Speech: Speech is the result of the voice being modified by the articulators and is produced with intonation and prosody. For example, a patient having a stroke can have abnormal speech production due to difficulty articulating words, yet have a normal voice. For this project, the term Voice as a Biomarker of Health includes speech in its definition.

For voice to emerge as a biomarker of health, there is a pressing need for large, high-quality, multi-institutional, and diverse voice databases linked to other health biomarkers from data of different modalities (demographics, imaging, genomics, risk factors, etc.) to fuel voice AI research and answer tangible clinical questions. Such an endeavor is only achievable through multi-institutional collaborations between voice experts and AI engineers, supported by bioethicists and social scientists, to ensure the creation of ethically sourced voice databases representing our populations.

Based on the existing literature and ongoing research in different fields of voice research, our group identified five disease cohort categories for which voice changes have been associated with specific diseases with well-recognized unmet needs. These categories were:

  1. Voice Disorders

    Laryngeal disorders are the most studied pathologies linked to vocal changes. Benign and malignant lesions can affect the shape, mass, density, and tension of the vocal folds, altering their vibratory function and thereby changing phonation.

  2. Neurological and Neurodegenerative Disorders

    Voice and speech are altered in many neurological and neurodegenerative conditions. Acute strokes can present with slurred speech (dysarthria) or expressive language deficits (aphasia). Voice and speech changes can also be the presenting symptoms of many neurodegenerative conditions, such as Parkinson’s disease and ALS, with changes such as slowed, low-frequency, monotonous speech as well as vocal tremor.

  3. Mood and Psychiatric Disorders

    Changes in voice have been linked to depression and other mood disorders. Individuals with depression have been found to have a decreased fundamental frequency (f0) as well as monotonous speech, while individuals with anxiety disorders show a significant increase in f0. Regrettably, much of the literature examining voice and speech changes in psychiatric conditions has used small datasets with limited demographic diversity reporting, a lack of standardized data collection protocols precluding meta-analysis, and possible confounders, all of which limit external validity and clinical usability.

  4. Respiratory disorders 

    Respiratory sounds, including breath, cough, and voice, have long been used for diagnostic purposes. For instance, pediatric croup can be suspected based on the presence of a barking cough, stridor, and dysphonia. With advances in acoustic recording and analysis in the second half of the twentieth century, increasing interest has emerged in the use of respiratory sounds for disease screening and therapeutic monitoring, especially cough sounds.

  5. Pediatric Voice and Speech Disorders

    The literature is sparser in terms of pediatric voice and speech analysis, partly due to ethical concerns and challenges in data acquisition for this cohort. However, many studies have investigated the use of machine learning models for voice and speech analysis to detect autism and speech delays in the pediatric population.

The protocols used for data collection in this study have been extensively described [1].


Methods

Patients presenting at specialty clinics and institutions were considered for enrolment. Patients were selected based on membership in five predetermined groups (Respiratory disorders, Voice disorders, Neurological disorders, Mood disorders, Pediatric). Patients presenting at the given clinic were screened against inclusion and exclusion criteria prior to their visit by the project investigators. If a patient was eligible for enrolment, consent was sought for the data collection initiative and for sharing the acquired research data. Once consented, a standardized protocol for data collection was adopted. This protocol involved the collection of demographic information, health questionnaires, targeted questionnaires inquiring about known confounders for voice, disease-specific information, and voice recording tasks such as sustained phonation of a vowel sound. Data collection was conducted using a custom application on a tablet, with a headset used when possible. For most participants a single session was sufficient to collect all relevant data; however, a subset of participants required multiple sessions to complete the data collection. As a result, there may be more than one session per participant in the current dataset. Data were exported and converted from REDCap using an open-source library developed by our team [2].

Raw audio was preprocessed by converting to monaural and resampling to 16 kHz with a Butterworth anti-aliasing filter applied. From this standardized audio, we extracted four types of derived data:

  • Spectrograms - Time-frequency representations were computed using the short-time Fourier transform (STFT) with a 25 ms window size, 10 ms hop length, and a 512-point FFT (a minimal code sketch follows this list).
  • Acoustic features were extracted using OpenSMILE, capturing temporal dynamics and acoustic characteristics.
  • Phonetic and prosodic features were computed using Parselmouth and Praat, providing measures of fundamental frequency, formants, and voice quality.
  • Transcriptions were generated using OpenAI's Whisper Large model.
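
As a rough illustration of the preprocessing described above, the sketch below resamples a recording to 16 kHz mono and computes a power spectrogram using torchaudio [6, 7]. The file name is hypothetical, and the exact implementation (including the Butterworth anti-aliasing filter) lives in the b2aiprep library [8], so treat this as an approximation rather than the project's pipeline. Note that the released spectrograms are 513xN (see Data Description), which corresponds to a 1024-point FFT; that value is used here so the output shape matches the data.

import torchaudio
import torchaudio.transforms as T

# Hypothetical input file; any WAV recording will do for illustration.
waveform, sr = torchaudio.load("example_recording.wav")

# Convert to monaural by averaging channels, then resample to 16 kHz.
# torchaudio's resampler uses windowed-sinc filtering rather than the
# Butterworth anti-aliasing filter described in the text.
waveform = waveform.mean(dim=0, keepdim=True)
waveform = T.Resample(orig_freq=sr, new_freq=16_000)(waveform)

# Power spectrogram with a 25 ms window and 10 ms hop. n_fft=1024 yields
# 1024 // 2 + 1 = 513 frequency bins, matching the 513xN shape of the
# released spectrograms.
spectrogram = T.Spectrogram(
    n_fft=1024,
    win_length=int(0.025 * 16_000),  # 400 samples
    hop_length=int(0.010 * 16_000),  # 160 samples
    power=2.0,
)(waveform)
print(spectrogram.shape)  # (1, 513, N)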

The following de-identification steps were taken in the process of preparing the dataset:

  • HIPAA Safe Harbor identifiers were removed.
    • While not all relevant to this dataset, these identifiers include: names, geographic locators, date information (at resolution finer than years), phone/fax numbers, email addresses, IP addresses, Social Security Numbers, medical record numbers, health plan beneficiary numbers, device identifiers, license numbers, account numbers, vehicle identifiers, website URLs, full face photos, biometric identifiers, and any unique identifiers.
      • State and province were removed. Country of data collection was retained.
  • Transcripts of free speech audio were removed.
  • In this release, audio waveforms were omitted; only spectrogram data and other derived features are made available.

We aim to include voice data in future releases, with additional precautions taken to ensure data security.


Data Description

As of v1.0, only data from the adult cohort is available.

The dataset has been made available in five files:

  • spectrograms.parquet - A Parquet file storing dense data derived from voice waveforms.
  • phenotype.tsv - Information collected during the visit including demographics, acoustic confounders, and responses to validated questionnaires.
  • phenotype.json - A data dictionary for the phenotype data.
  • static_features.tsv - Features derived from the raw audio, with one row per audio recording.
  • static_features.json - A data dictionary for the features data.

The above data dictionaries have the same overall structure: a dictionary whose keys are the column names in the associated data file and whose values are dictionaries with further detail. The description value in each entry provides a one-sentence summary of the respective column.
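
As a minimal sketch, one of these dictionaries can be read in Python as follows; only the description field named above is assumed, and other keys may also be present:

import json

# Load the phenotype data dictionary and print each column's one-sentence description.
with open("phenotype.json") as f:
    data_dict = json.load(f)

for column, detail in data_dict.items():
    print(column, "-", detail.get("description", ""))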

The spectrograms.parquet file contains the majority of the data derived from the raw audio. Each element of the Parquet-formatted dataset contains a unique identifier for the participant (participant_id), a unique identifier for the recording session (session_id), the task performed (task_name), and a 513xN spectrogram of the raw audio waveform.
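
For example, a single element can be inspected with the Hugging Face datasets library (the same loading approach is shown in the Usage Notes below):

from datasets import Dataset

ds = Dataset.from_parquet("spectrograms.parquet")
row = ds[0]
# Each element carries the identifiers and task name alongside the spectrogram.
print(row["participant_id"], row["session_id"], row["task_name"])
print(len(row["spectrogram"]))  # expected to be 513, per the shape described above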

Features derived from the open-source Speech and Music Interpretation by Large-space Extraction toolkit (openSMILE [3]), Praat [4], Parselmouth [5], and torchaudio [6, 7] are provided in the static_features.tsv file, which has one row per unique recording and one column per feature; the data dictionary provides a description of each feature. The phenotype.tsv file is similarly a tab-delimited file with one row per unique participant. Each column is the response to a question asked during clinical data collection within the custom data collection app. The phenotype.json file provides a description of each column of data.
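
As a sketch, the two tabular files can be loaded and joined with pandas. The join key below is assumed to be participant_id, matching the identifier used in spectrograms.parquet; check the data dictionaries for the actual column names before merging.

import pandas as pd

# Per-recording acoustic features and per-participant phenotype data.
features = pd.read_csv("static_features.tsv", sep="\t")
phenotype = pd.read_csv("phenotype.tsv", sep="\t")

# Assumed join key: participant_id (verify against the data dictionaries).
merged = features.merge(phenotype, on="participant_id", how="left")
print(merged.shape)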

The code used to preprocess the raw audio waveforms into the parquet file and to merge the source data into the phenotype files has been made open source in the b2aiprep library [8].


Usage Notes

If using Python, the parquet dataset can be loaded with the Hugging Face datasets library:

from datasets import Dataset
ds = Dataset.from_parquet("spectrograms.parquet")
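
If needed, the loaded dataset can then be filtered to a subset of recordings, for example by task. The task name below is purely illustrative; inspect the task_name column for the actual values:

# List the available tasks, then keep only one of them.
# "prolonged-vowel" is a placeholder; substitute a real value from task_name.
print(set(ds["task_name"]))
vowel_ds = ds.filter(lambda row: row["task_name"] == "prolonged-vowel")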

A spectrogram can be plotted in decibels by converting it from its original power representation:

import librosa
import matplotlib.pyplot as plt

# Convert the power spectrogram to decibels for visualization.
spectrogram = librosa.power_to_db(ds[0]['spectrogram'])

plt.figure(figsize=(10, 4))
plt.imshow(spectrogram, aspect='auto', origin='lower')
plt.title('Spectrogram')
plt.xlabel('Time (frames)')
plt.ylabel('Frequency (bins)')
plt.colorbar()
plt.show()

The phenotype file can be loaded with any statistical analysis tool. For example, the pandas library in Python can read the data:

import pandas as pd
df = pd.read_csv("phenotype.tsv", sep="\t", header=0)

Release Notes

Bridge2AI-Voice v1.0 is the first release of the Bridge2AI Voice as a Biomarker of Health dataset.


Ethics

Data collection and sharing was approved by the University of South Florida Institutional Review Board and has been submitted for review to the University of Toronto Research Ethics Board.


Acknowledgements

This project was funded by NIH project number 3OT2OD032720-01S1: Bridge2AI: Voice as a Biomarker of Health - Building an ethically sourced, bioacoustic database to understand disease like never before. We would like to acknowledge that this release would not be possible without the generous contribution of data from all the participants of the study. We would also like to thank the NIH for their continued support of the project.


Conflicts of Interest

None to declare.


References

  1. Rameau, A., Ghosh, S., Sigaras, A., Elemento, O., Belisle-Pipon, J.-C., Ravitsky, V., Powell, M., Johnson, A., Dorr, D., Payne, P., Boyer, M., Watts, S., Bahr, R., Rudzicz, F., Lerner-Ellis, J., Awan, S., Bolser, D., Bensoussan, Y. (2024). Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEI. Proc. Interspeech 2024, 1445-1449, doi: 10.21437/Interspeech.2024-1926
  2. Bensoussan, Y., Ghosh, S. S., Rameau, A., Boyer, M., Bahr, R., Watts, S., Rudzicz, F., Bolser, D., Lerner-Ellis, J., Awan, S., Powell, M. E., Belisle-Pipon, J.-C., Ravitsky, V., Johnson, A., Zisimopoulos, P., Tang, J., Sigaras, A., Elemento, O., Dorr, D., … Bridge2AI-Voice. (2024). Bridge2AI Voice REDCap (v3.20.0). Zenodo. https://doi.org/10.5281/zenodo.14148755
  3. Florian Eyben, Martin Wöllmer, Björn Schuller: "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor", Proc. ACM Multimedia (MM), ACM, Florence, Italy, ISBN 978-1-60558-933-6, pp. 1459-1462, 25.-29.10.2010.
  4. Boersma P, Van Heuven V. Speak and unSpeak with PRAAT. Glot International. 2001 Nov;5(9/10):341-7.
  5. Jadoul Y, Thompson B, De Boer B. Introducing parselmouth: A python interface to praat. Journal of Phonetics. 2018 Nov 1;71:1-5.
  6. Hwang, J., Hira, M., Chen, C., Zhang, X., Ni, Z., Sun, G., Ma, P., Huang, R., Pratap, V., Zhang, Y., Kumar, A., Yu, C.-Y., Zhu, C., Liu, C., Kahn, J., Ravanelli, M., Sun, P., Watanabe, S., Shi, Y., Tao, T., Scheibler, R., Cornell, S., Kim, S., & Petridis, S. (2023). TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch. arXiv preprint arXiv:2310.17864
  7. Yang, Y.-Y., Hira, M., Ni, Z., Chourdia, A., Astafurov, A., Chen, C., Yeh, C.-F., Puhrsch, C., Pollack, D., Genzel, D., Greenberg, D., Yang, E. Z., Lian, J., Mahadeokar, J., Hwang, J., Chen, J., Goldsborough, P., Roy, P., Narenthiran, S., Watanabe, S., Chintala, S., Quenneville-Bélair, V, & Shi, Y. (2021). TorchAudio: Building Blocks for Audio and Speech Processing. arXiv preprint arXiv:2110.15018.
  8. Bevers, I., Ghosh, S., Johnson, A., Brito, R., Bedrick, S., Catania, F., & Ng, E. b2aiprep (Version 0.21.0) [Computer software]. https://github.com/sensein/b2aiprep

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
Bridge2AI Voice Registered Access License

Data Use Agreement:
Bridge2AI Voice Registered Access Agreement

Required training:
TCPS 2: CORE 2022
