Transcribe spoken Classical Latin, for the first time!
Try it out at https://huggingface.co/lsb/wav2vec2-base-pemlsb-la!
This is the first Latin speech recognition system! It is powered by a new voice dataset of 88.3 hours of Latin speech, mostly synthetically generated with Poeta ex Machina. The self-supervision of using speech synthesis to train speech recognition (like SynthASR) offers a few exciting new directions (like examining the inductive biases of neural language models with artificial languages).
Self-supervised Speech Recognition
Modern deep neural network statistical modeling relies on fewer hand-engineered features and larger piles of data. Self-supervised learning is increasingly useful in many applications, where a task can be framed as learning mechanically-generated labels. These labels are generated usually at much lower cost and much greater scale than human-generated labels. Self-supervised learning often amounts to learning the inverse of a mechanical process: image recoloring for black-and-white photographs is learned as the inverse of stripping images of their color. Super-resolution is learned as the inverse of downsampling images. Language modeling is learned as the inverse of deleting a word in a sequence (at the end (‘causal’) or in the middle (‘masked’)). A self-supervised speech recognition approach would be to start with a pile of text, generate synthetic speech, and learn to recognize human speech based on that synthetic speech, similar to SynthASR.
Many speech recognition systems rely on meticulously labeled sound files, with accurate timing data for each letter. The relatively new wav2vec instead learns from (text, sound) pairs with no timing annotation, which lets it consume much more data per dollar of acquired sample data (as in Common Voice). However, spoken Latin is rare, and much more challenging¹ to acquire than (say) Spanish or Japanese, so this self-supervised approach is crucial.
So: can Latin speech recognition learn from Latin speech synthesis? We can first create a dataset of Latin text, then synthesize that text into speech, and try.
High-quality synthetic Latin speech in a classical pronunciation comes from Poeta ex Machina. Poeta ex Machina maintains a full database of scansions of single words, and we can use it to synthesize lines of (for example) Vergil for a multi-word corpus; Vergil’s extant works are all in dactylic hexameter, comprising over 21 hours of recited text.
The ancient-latin-passages dataset is a compendium of 19MB of Latin text, written roughly between 50BC and 150AD by a wide variety of Classical authors, collected from the Latin Library. This dataset was used to create poetaexmachina-mp3-recitations, and we can synthesize much more poetry to add to that dataset. It is publicly available at https://huggingface.co/datasets/lsb/ancient-latin-passages .
poetaexmachina-mp3-recitations is divided into three parts: the 1-grams (individual words from Poeta ex Machina’s internal database of word scansions), comprising 66.9 hours of recited speech; the lines of dactylic hexameter, all from Vergil, comprising 21.4 hours of recited speech; and recitations by yours truly of Cicero and Catullus, comprising half a minute of recited speech. This is publicly available at https://github.com/lsb/poetaexmachina-mp3-recitations, with one text file and one mp3 file per recitation.
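With one text file and one mp3 file per recitation, loading the dataset amounts to pairing transcripts with their sibling audio files. A minimal sketch (the directory layout and basenames here are hypothetical; the repository's actual structure may differ):

```python
from pathlib import Path

def load_recitation_pairs(root):
    """Pair each transcript .txt with the .mp3 sharing its basename.

    Assumes the one-recitation-per-(txt, mp3)-pair layout described above,
    e.g. aeneid-1-1.txt next to aeneid-1-1.mp3 (hypothetical names).
    Transcripts with no matching audio are skipped.
    """
    pairs = []
    for txt in sorted(Path(root).rglob("*.txt")):
        mp3 = txt.with_suffix(".mp3")
        if mp3.exists():
            pairs.append((txt.read_text(encoding="utf-8").strip(), mp3))
    return pairs
```

Each returned pair is ready to feed a wav2vec2-style (text, sound) training loop.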
wav2vec2 model, trained on Italian
In contrast to older speech recognition systems that require speech waveforms expensively annotated with per-letter timing data, wav2vec2 is designed to learn timing from unannotated pairs of an entire waveform and an entire text (usually under 10 seconds of audio).
The community and infrastructure around wav2vec2 mean that there are many wav2vec2 models trained on various modern languages. We can take a large pre-trained model whose training data is close to the target data distribution and use it as a starting point, instead of training from scratch. Poeta ex Machina uses an Italian voice, partly for its phonetic inventory (English, for instance, does not have a sufficient phonetic inventory: we believe that ancient Latin trilled or flapped its Rs (medi(us)-dies = meridies, like British English rhyming “edible” with “terrible”)), and partly for sentimental/aesthetic reasons (would Spanish work? Russian? Xhosa?). For similar phonetic and sentimental reasons, and for availability, we use a wav2vec2 model trained on the Italian portion of VoxPopuli, and fine-tune from there. Informal tests found that the word error rate improved faster when fine-tuning from this Italian-trained model than from an English-trained one. An obvious future direction is starting from other initial monolingual models, or multilingual models. Another is upgrading from the 5-gram post-processing model to other text models (transformers? sub-word tokenization strategies?).
We can make our prediction task by normalizing the orthography of the Latin text: stripping punctuation and macrons, using only lower case, and undoing letters invented after 500AD (“j”, “v”, “w”) by substituting “i” for “j” and “u” for “v”.
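A minimal sketch of this normalization in Python. Stripping macrons via Unicode decomposition is my assumed implementation (a macron is a combining mark after NFD normalization); the model's actual preprocessing may differ in details:

```python
import re
import unicodedata

def normalize_latin(text):
    """Normalize Latin orthography as described above: lowercase,
    strip macrons, drop punctuation, and map j -> i, v -> u.
    ("w" simply never survives the final letters-only filter.)"""
    text = unicodedata.normalize("NFD", text.lower())
    # After NFD, a macron is a separate combining character we can drop.
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.replace("j", "i").replace("v", "u")
    # Keep only the 500AD-safe lowercase letters and spaces.
    return re.sub(r"[^a-z ]", "", text)
```

For example, `normalize_latin("Arma virumque canō")` yields `"arma uirumque cano"`.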
Wav2vec2 uses Connectionist Temporal Classification to infer its transcription: at each 20ms timestep we predict a token, either a letter or a special character like a break, and we merge identical predictions between breaks. The Huggingface wav2vec2 library has built-in support for an additional 1-to-5-gram language model, for post-processing the audio predictions with a stochastic 🦜. Tuning the post-processing model is very much an open question, especially for Latin.
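The merge step can be sketched as a greedy CTC collapse. This is a simplification of what the Huggingface decoder actually does (no beam search, no n-gram language model), shown only to make the merge-between-breaks rule concrete; the `-` break token is an arbitrary choice here:

```python
def ctc_collapse(frame_tokens, blank="-"):
    """Greedy CTC decoding sketch: given one predicted token per 20ms
    frame, merge consecutive duplicates, then drop the break token."""
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != blank:
            out.append(tok)  # a new non-break token starts here
        prev = tok
    return "".join(out)
```

For example, the frame-level prediction `"cc-aa-n-oo"` collapses to `"cano"`; the break token is what lets a doubled letter like the `aa` of `"aa--a"` survive as `"aa"`.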
Results: Word Error Rate of 4.13%
At the end of initial training, the word error rate was 4.13% on the validation set of data, only slightly more than 1 in every 25 words incorrect.
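For concreteness, word error rate is the word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A self-contained sketch of the standard computation (libraries like jiwer do this for you in practice):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance between the
    reference and hypothesis transcripts, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

So a 4.13% WER means roughly one such word-level error per 25 reference words.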
Given the rate of improvement when including even a small amount of human-generated training data, this is very much a work in progress, especially when experimenting with data augmentation.
Text-To-Speech self-supervision can analyze inductive biases of speech recognition systems
There are examples of using artificial languages to examine the inductive biases of neural language models (https://arxiv.org/pdf/2106.01044.pdf), and artificially generated speech can be similarly useful here. By varying the voice pitch or timbre, experimenting with background acoustics, or introducing speech disfluencies, it would be possible to compare the inductive biases of speech recognition for different types of speakers (and then (ideally!) engineer those biases away). Using generated speech also removes one variable from trans-linguistic comparison of the model (“how well does this perform against English versus against Polish/Sanskrit/Tagalog/Toki Pona/etc”).
Thanks for reading, let me know what you think!
The careful observer will ask why one needs speech recognition at all, if spoken Latin is so rare. I have a truly marvelous rationale which this endnote is too space-constrained to contain. ↩