MooreSpeechCorpora Toolkit, Collecting Mooré Data from the Bible

From data collection to a Hugging Face-ready dataset.

Working with under-resourced languages like Mooré is both exciting and challenging. In this tutorial, I’ll walk you through the full pipeline I used to collect and prepare Mooré speech data scraped from the Bible, ready to be pushed to the Hugging Face Hub.


1. Environment Setup

Let’s kick things off by setting up the Python environment.

Create a new virtual environment, preferably with Python 3.10.11:

conda create -n mooredata python=3.10.11
conda activate mooredata

Also, I recommend downgrading pip to 24.0 (or below):

pip install --upgrade "pip<24.1"

You may be asking yourself why these specific constraints right from the start. Well, I ran into many errors while trying to work on this with Python 3.12.0 and 3.11.13. The error raised was:

ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'>
for field common is not allowed: use default_factory

After quite a bit of investigation, I found out that the ValueError occurs when using Fairseq v0.12.2 with Python 3.11 or 3.12, because it uses mutable dataclass defaults, which are disallowed in those Python versions. That’s why, to fix it, I downgraded to Python 3.10 and pip < 24.1.
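
If you’re curious about what’s actually going on, here is a minimal, self-contained reproduction of the same pattern. The config classes below are simplified stand-ins, not Fairseq’s real definitions: on Python 3.11+ a dataclass instance used as a field default raises exactly this ValueError, and default_factory is the fix the message points to.

from dataclasses import dataclass, field

@dataclass
class CommonConfig:
    log_interval: int = 100

# On Python 3.11/3.12 this class definition raises:
# ValueError: mutable default <class 'CommonConfig'> for field common is not allowed: use default_factory
# On Python 3.10 it is accepted, which is why the downgrade works.
@dataclass
class BrokenConfig:
    common: CommonConfig = CommonConfig()

# The pattern the error message asks for instead:
@dataclass
class FixedConfig:
    common: CommonConfig = field(default_factory=CommonConfig)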

Trust me, just make sure to do the same so you don’t waste a lot of time resolving package conflicts like I did.


2. Install Audio & Text Processing Dependencies

Next, let’s ensure you have all required system-level packages.

Update the package list:

sudo apt-get update -y

Normally, you should already have the necessary system-level packages:

  • for audio processing: libsox-fmt-all, sox, and ffmpeg
  • for text processing and Unicode symbols: libicu-dev and pkg-config

If not, install them:

sudo apt-get install libsox-fmt-all sox ffmpeg
sudo apt install libicu-dev pkg-config
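
If you want to double-check that everything is on your PATH before going further, a tiny Python sanity check like this one (just a convenience snippet, not part of the toolkit) does the job:

import shutil

# Each tool should resolve to a path; "NOT FOUND" means the apt install above didn't take.
for tool in ("sox", "ffmpeg", "pkg-config"):
    print(f"{tool}: {shutil.which(tool) or 'NOT FOUND'}")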

3. Install Torch (Nightly Version)

Now it’s time to install PyTorch. Make sure to get the nightly version from the official PyTorch website.

If you have an Nvidia GPU, select the CUDA version. Otherwise, stick with CPU.

[Image: Torch Install]

For me, the command was:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

Then install additional packages:

pip3 install -r requirements.txt
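
Before moving on, it’s worth confirming that the nightly build is installed and that it can actually see your GPU. This is a quick interactive check, not part of the toolkit:

import torch

print(torch.__version__)          # should report a dev/nightly build
print(torch.cuda.is_available())  # True if the CUDA wheel matches your driver
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))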

4. Clone Required Repos

We’ll need two external repositories to proceed.

Fairseq

git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
pip install --editable ./ 
cd ..

Uroman

git clone https://github.com/isi-nlp/uroman.git

Make these changes to match what Fairseq expects:

mkdir -p uroman/bin && mv uroman/uroman/uroman.pl uroman/bin/uroman.pl && mv uroman/uroman/data uroman/data

Your folder structure should now look like:

moore-data-scraping-bible/
├── fairseq/
└── uroman/
    ├── bin/
    │   └── uroman.pl
    ├── data/
    └── (other folders)
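
If you’d like to verify the layout programmatically, a short check like this (purely optional, run from the repository root) confirms the paths the alignment step will rely on later:

from pathlib import Path

# These are the paths the rest of the pipeline expects to find.
for expected in ("fairseq", "uroman/bin/uroman.pl", "uroman/data"):
    status = "OK" if Path(expected).exists() else "MISSING"
    print(f"{expected}: {status}")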

5. Scraping Bible Data

With everything in place, let’s scrape the Mooré Bible data.

Execute:

sh ./crawlers/bible/crawl.sh

If you run into this error:

ImportError: /home/anyantudre/miniconda3/envs/mooredata/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /lib/x86_64-linux-gnu/libicuuc.so.74)

Fix it with:

conda install -c conda-forge gcc=12.1.0

[Image: Crawling Process]


6. Resampling

Now let’s resample the scraped audio files to 16kHz.

Run:

bash preprocessing/resample.sh --input_folder datasets/moore/bible/raw --output_folder datasets/moore/bible/resampled

If the script fails because of Windows-style (CRLF) line endings, convert it to Unix format:

sudo apt install dos2unix
dos2unix resample.sh
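
Once the resampling finishes, you can spot-check a few files to confirm they really are 16 kHz. This assumes the resampled audio ends up as .wav files under the output folder; adjust the glob if your setup differs:

from pathlib import Path
import torchaudio

resampled = Path("datasets/moore/bible/resampled")
# Print the sample rate of the first few files; every one should be 16000 Hz.
for wav in sorted(resampled.rglob("*.wav"))[:5]:
    info = torchaudio.info(str(wav))
    print(f"{wav.name}: {info.sample_rate} Hz, {info.num_frames / info.sample_rate:.1f} s")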

7. Forced Alignment

This is a crucial step: segmenting long audio clips into short, clean utterances.

We’ll use the MMS Forced Aligner. You can check if your language is supported here. For Mooré, the model code is mos.

Run the alignment script:

bash forced_alignement/align_and_segment.sh \
  --audio_folder datasets/moore/bible/resampled \
  --text_folder datasets/moore/bible/resampled \
  --output_folder datasets/moore/bible/aligned \
  --lang mos \
  --uroman_path ../uroman/bin

On an Nvidia RTX 4070 (8GB), this took ~5 hours.
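
I won’t assume a particular output layout here, but a quick count of the segmented clips gives you a feel for how much data came out of the alignment. This assumes the segments are written as .wav files somewhere under the output folder:

from pathlib import Path

aligned = Path("datasets/moore/bible/aligned")
segments = list(aligned.rglob("*.wav"))  # assumption: one .wav file per aligned segment
print(f"{len(segments)} segmented audio files under {aligned}")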


8. Export to Hugging Face Format

Last step!

Log in to your Hugging Face account:

huggingface-cli login

Then push your dataset:

python data_export/prepare_hf_dataset.py \
  --input_folder datasets/moore/bible/aligned \
  --repo_id anyantudre/moore-speech-bible \
  --hf_token hf_xxxx

⚠️ Note: By default, datasets pushed to the Hub are public. You can update this later from your Hugging Face dashboard.
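
Once the push completes, you can sanity-check the result by loading the dataset back with the datasets library. The split name here is an assumption; use whatever split the export script created:

from datasets import load_dataset

# Load the freshly pushed dataset back from the Hub and inspect its shape.
ds = load_dataset("anyantudre/moore-speech-bible", split="train")
print(ds)               # row count and features
print(ds.column_names)  # column names as stored on the Hub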


That’s it! 🎉 You now have a fully processed, aligned, and ready-to-train Mooré speech dataset, directly from Bible text and audio. Hope this helps you save some setup time and avoid the many little snags I ran into.

Feel free to fork the repo, tweak things for your own language, and contribute improvements.

This post is licensed under CC BY 4.0 by the author.