We propose Lux-ASR, our speech recognition system for the Luxembourgish language. A first version was presented in 2022, and the system has been developed continuously since then. The project is supported by the Chambre des Députés du Grand-Duché de Luxembourg.

Try Lux-ASR here

Luxembourgish, known as „Lëtzebuergesch“ to its speakers, is a West Central Germanic language spoken predominantly in the small but prosperous nation of Luxembourg. With approximately 390,000 native speakers, this lesser-known language occupies a unique position in Europe’s linguistic landscape. Unlike major languages such as English, Spanish, or Mandarin, Luxembourgish is considered a low-resource language because of its limited language resources, especially in digital form. This scarcity has traditionally posed a significant hurdle for building high-accuracy automatic speech recognition (ASR) systems, which depend on comprehensive training datasets.

Challenges of Developing ASR for Low-Resource Languages

Building an ASR system for low-resource languages like Luxembourgish is a significant challenge, principally because of the scarcity of training data. Machine learning models used in ASR require extensive data to capture the nuances of a language, including its phonetics, grammar, and colloquialisms. With a large dataset, a system can efficiently learn and predict speech patterns. However, in the case of low-resource languages, these extensive datasets are often unavailable or non-existent, leading to suboptimal model performance.

ASR Algorithms: wav2vec 2.0 and Whisper

In the realm of ASR technology, two algorithms have shown exceptional promise: Facebook’s wav2vec 2.0 and OpenAI’s Whisper. Wav2vec 2.0 utilizes a self-supervised approach, training on raw audio data to capture linguistic information directly from waveforms. This method has proven effective in various applications, particularly when sufficient training data is available.

On the other hand, Whisper, an ASR system developed by OpenAI, has recently emerged as a frontrunner in the field. Whisper is trained on vast amounts of multilingual and multitask supervised data collected from the web, making it especially suitable for low-resource languages. After conducting initial experiments with both algorithms, we found that Whisper delivered superior performance in recognizing Luxembourgish, despite the limited available data. Whisper builds on sequence-to-sequence learning, a technique widely employed in machine learning tasks involving sequential data. This approach, which transforms input sequences (speech signals) into output sequences (text), is particularly well suited to speech recognition.

In the context of Whisper, this sequence-to-sequence technique is bolstered by the use of mel spectrograms. A mel spectrogram is a representation of the short-term power spectrum of sound, which better mirrors human speech perception than linear-scaled spectrograms. By converting raw audio into mel spectrograms, the Whisper system can capture more intricate auditory details, thus enhancing its ability to understand and transcribe spoken Luxembourgish.
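The mel scale itself is a simple logarithmic mapping of frequency. As a rough illustration (the exact constants vary between implementations; the HTK-style formula below is one common convention, not necessarily the one Whisper uses internally), the hertz-to-mel conversion can be sketched in a few lines of Python:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in hertz to the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mel back to hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is roughly linear below 1 kHz and compresses higher frequencies,
# mirroring how human hearing resolves pitch differences.
print(round(hz_to_mel(1000.0)))  # 1000: 1 kHz maps to about 1000 mel
```

A mel spectrogram then bins an ordinary spectrogram's frequency axis according to this scale, so more resolution is spent where human listeners discriminate best.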

Neither system, however, can be used for Luxembourgish out of the box: the models have to be fine-tuned for this new language. Although Whisper nominally already supports Luxembourgish, the quality of its off-the-shelf output is very poor. Fine-tuning requires a language-specific training dataset.

Compilation of a Training Data Set for Luxembourgish ASR

Collecting a substantial and diverse set of training data was a key priority in our Luxembourgish ASR development project. We sourced a variety of Luxembourgish recordings, primarily focusing on publicly available media and parliamentary debates. These sources offer a rich variety of vocabulary, accents, and speaking styles, aiding the model’s ability to generalize across different speakers and contexts. Furthermore, we supplemented these public sources with our own recordings, aiming to capture more colloquial, day-to-day usage of the language. Our complete dataset, though small by the standards of larger languages, totals approximately 90 hours of Luxembourgish speech. The dataset consists of chunks of audio with their orthographic transcriptions:

Chamber_2020_11_18_0178.wav
lux_train/Chamber/Chamber_2020_11_18_0178.wav
Bon, et ass esou, dass bei verschiddenen Aktivitéiten, wa mer déi da
Chamber_2020_11_18_0179.wav
lux_train/Chamber/Chamber_2020_11_18_0179.wav
maachen am Kader vum Artemis-Programm, mussen déi eventuell iwwer Gesetz
Chamber_2020_11_18_0180.wav
lux_train/Chamber/Chamber_2020_11_18_0180.wav
autoriséiert ginn, jee no der Natur vun der Aktivitéit. Mee Dir wësst jo
Chamber_2020_11_18_0181.wav
lux_train/Chamber/Chamber_2020_11_18_0181.wav
och, dass mer do amgaange sinn e Space-Gesetz ze adoptéieren, wat scho ganz
Chamber_2020_11_18_0182.wav
lux_train/Chamber/Chamber_2020_11_18_0182.wav
wäit fortgeschratt, ass an der Chamberskommissioun, wat deemnächst ee méi
Chamber_2020_11_18_0183.wav
lux_train/Chamber/Chamber_2020_11_18_0183.wav
largë Kader gëtt fir Space-Aktivitéiten.
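The listing above repeats three pieces of information per chunk: the wav filename, its relative path, and the transcript. Assuming exactly that layout (the actual training pipeline may well use a different manifest format), grouping it into records is a few lines of Python:

```python
from typing import Dict, List

def parse_manifest(lines: List[str]) -> List[Dict[str, str]]:
    """Group a flat manifest (filename, path, transcript per entry) into records."""
    entries = [ln.strip() for ln in lines if ln.strip()]
    if len(entries) % 3 != 0:
        raise ValueError("manifest length is not a multiple of 3 lines")
    return [
        {"file": entries[i], "path": entries[i + 1], "text": entries[i + 2]}
        for i in range(0, len(entries), 3)
    ]

sample = [
    "Chamber_2020_11_18_0178.wav",
    "lux_train/Chamber/Chamber_2020_11_18_0178.wav",
    "Bon, et ass esou, dass bei verschiddenen Aktivitéiten, wa mer déi da",
]
records = parse_manifest(sample)
print(records[0]["path"])  # lux_train/Chamber/Chamber_2020_11_18_0178.wav
```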


Fine-tuning of the models

For the fine-tuning itself, we chose recipes developed by Hugging Face, whose transformers framework offers state-of-the-art scripts for fine-tuning wav2vec 2.0 and Whisper. The dataset was split into a training set (85 %) and a test set (15 %), and early stopping was applied to prevent overfitting. Fine-tuning these highly parameterized models remains demanding in terms of computing power: while the smaller wav2vec 2.0 checkpoint could be trained on four V100 GPUs on the University’s HPC in approximately 30 hours, the large wav2vec 2.0 and Whisper-large checkpoints could only be trained in reasonable time on A100 GPUs with 40 GB of graphics memory on Google Colab.
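An 85/15 split like the one above can be reproduced with a deterministic seeded shuffle. A minimal stdlib sketch (the Hugging Face datasets library provides an equivalent train_test_split method, which the actual recipes would typically use):

```python
import random

def split_dataset(items, test_fraction=0.15, seed=42):
    """Shuffle deterministically, then split into (train, test) portions."""
    shuffled = items[:]                      # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical clip names, purely for illustration
clips = [f"Chamber_clip_{i:04d}.wav" for i in range(100)]
train, test = split_dataset(clips)
print(len(train), len(test))  # 85 15
```

Fixing the seed keeps the test set stable across training runs, so WER numbers from different checkpoints remain comparable.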

Examples and evaluation

After training, both systems yielded surprisingly good results, with a WER (word error rate) ranging between 10 % and 13 %, i.e. out of 100 words, 10 to 13 are missed or recognized erroneously. As we are constantly expanding the dataset, new models are being trained to further reduce the WER. Some examples from the two models follow. Note that the wav2vec 2.0 output contains neither capitalization of nouns nor punctuation; one immediate advantage of Whisper is that both features are part of its training data and therefore also appear in the output.

Example 1

Ground truth:
Villmools merci, Här President. Den Avis vun den Experten huet kloer gewisen, datt d’Covid-Kris nach net eriwwer ass, dass de Risk nach ëmmer do ass an datt d’Expektative fir September, zumindest wat d’Experten ugeet, déi sinn, datt mer eventuell virun enger neier Well kënne stoen. Op wellechem Datum, wellech Variant, mat wellecher Virulenz, welleger Ustiechlegkeet, dat wësse mer an dësem Moment selbstverständlech net. Déi Zuelen, déi mer haut kennen, soen dat selwecht. De Staatsminister huet gëschter zitéiert: 1200 Infektiounen den Dag. Dat ass eppes, muss ech soen, wat eis virun enger Rei Joer, virun enger Rei Méint jo weesentlech méi erschreckt huet, wéi et eis haut erschreckt, well d’Situatioun …

wav2vec 2.0 (1 B):
villmools merci här president den avis vun den experten huet kloer gewisen datt d’covidkris nach net eriwwer ass dass de risk nach ëmmer do ass an datt d’expektative fir september zumindest wat d’experten ugeet déi sinn datt mer eventuell virun enger neier well kënne stoen op wellechem datum well ech variant mat wellecher virulenz wellecher ustieche keet dat wësse mer an dësem moment selbstverständlech net déi zuelen déi mer haut kennen soen dat selwecht de staatsminister huet gëschter zitéiert den auszweehonnert infektiounen den dag dat ass eppes muss ech soen wat eis virun enger rei  t eis haut erschreckt war d’situatioun

Whisper (large-v2):
Villmools merci, Här President. Den Avis vun den Experten huet kloer gewisen, datt d’Covidkris nach net eriwwer ass, dass de Risk nach ëmmer do ass an datt d’Expektative fir September zumindest wat d’Experten ugeet, déi sinn, datt mer eventuell virun enger neier Well kënne stoen. Op wellegem Datum, welleg Variant, mat welleger Virulenz, welleger Ustiechtegkeet, dat wësse mer an dësem Moment selbstverständlech net. Déi Zuelen, déi mer haut kennen, soen dat selwecht. De Staatsminister huet gëschter zitéiert: 1200 Infektiounen den Dag. Dat ass eppes, muss ech soen, wat eis virun enger Rei Joer, virun enger Rei Méint jo weesentlech méi erschreckt huet, wéi et eis haut erschreckt, well d’Situatioun

Example 2

Ground truth:
Se si gesond, gesi schéin aus, schmaache gutt a schéin Nimm hunn se och nach! Ech schwätze vun den Uebst- a Geméiszorten, vun de Kraider, de Gewierzer an den Nëss. An dësem Buch huet den Zenter fir d’Lëtzebuerger Sprooch ronn 300 Nimm fir déi geleefegst Produite vun der Natur gesammelt. Dat ass natierlech just eng Selektioun, mee déi meescht Zorten, déi kann een iessen an déi et hei am Gaart oder am Buttek gëtt, fënnt een an dësem drëtte Band vun der beléifter Serie Lëtzebuerger Wuertschatz.

wav2vec 2.0 (1 B):
se si gesond gesi schéin a mech maachen a se nach ech schwätze vun den uebstzorten deezer an den nëss an dësem buch huet den zenter fir d’lëtzebuerger sprooch ronn dräihonnert nimmt fir déi geleeft produite vun der natur gesammelt dat ass natierlech just eng selektioun mee déi meescht zorten déi kann een iessen an déi déi et hei am gaard uerder am buttek gëtt fënnt een an dësem drëtte bann vun der beléifter seng lëtzebuerger wirtschaft

Whisper (large-v2):
se si gesond, gesi schéin aus, schmaache gutt a schéin Nimm hu se och nach ech schwätze vun den Uebes a Geméiszorte vun de Kräider, de Gewierzer an den Nëssen an dësem Buch huet de Centre fir d’Lëtzebuerger Sprooch ronn dräihonnert Nimm fir déi geleeftegst Produite vun der Natur gesammelt dat ass natierlech net eng Selektioun wéi déi meescht Zorten, déi kann een iessen an déi déi et hei am Gaart oder am Buttek gëtt, fënnt een an dësem drëtte Band vun der beléifter Serie Lëtzebuerger Wuertschatz.

Multilingual Speech Recognition

An ASR system for Luxembourgish also requires the ability to recognize switches into other languages, as multilingualism is at the very core of the speech community. French plays an especially important role here, and Lux-ASR already recognizes switches from Luxembourgish to French in quite a promising way. In the following example, the speaker starts in Luxembourgish, switches to French for a few sentences, and then switches back to Luxembourgish.

… Mir sollen houfreg sinn op d’Diversitéit an den Zesummenhalt an eise Gesellschaft. A cet endroit, je voudrais remercier les non luxembourgeois qui resident ou qui travaillent à notre pays pour leur contribution précieuse à notre société. cohésion économique, mais aussi la cohésion sociale de notre pays sont des atouts qui nous appartiennent de défendre à tout prix. Ils sont au coeur de notre projet et de notre réussite. C‘ est notre bien commun à tous. Haut, op dësem chrëschtel Wënd, wëll ech meng Unerkennung awer net nëmmen op de politesche Plang begrenzen. …


This challenge of multilingual ASR is already handled quite well by Lux-ASR for French and also for English. Due to the closeness of German and Luxembourgish, however, the model performs poorly in distinguishing between the two languages.


Demonstration and Test Systems

The demonstration systems are freely available via Hugging Face Spaces, both for our wav2vec 2.0 and our Whisper models. There, you can also test the quality of the recognition on your own voice or on an uploaded recording (the Space may have to be restarted, which can take two minutes). Although still under development, Lux-ASR can already be used to recognize interviews, podcasts, videos, or other recordings. As our system is based on the large (v2) model of Whisper, it does not yet run inference in real time; instead, it takes around 30 to 40 % of the duration of the recording to deliver the transcription (running on GPUs).
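A real-time factor of roughly 0.3 to 0.4 translates directly into expected turnaround times. A trivial, purely illustrative helper:

```python
def expected_transcription_seconds(recording_seconds: float,
                                   rtf_low: float = 0.3,
                                   rtf_high: float = 0.4) -> tuple:
    """Estimate the GPU inference time range for a recording,
    given a real-time factor between rtf_low and rtf_high."""
    return (recording_seconds * rtf_low, recording_seconds * rtf_high)

# A one-hour interview (3600 s) should transcribe in roughly 18 to 24 minutes
low, high = expected_transcription_seconds(3600)
print(low / 60, high / 60)  # 18.0 24.0
```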

Contact

Lux-ASR is under constant development by Peter Gilles, Nina Hosseini-Kivanani, and Léopold Hillah at the University of Luxembourg and is supported by the Chambre des Députés du Grand-Duché de Luxembourg. Contact us for more information.

References

Gilles, Peter, Nina Hosseini Kivanani & Léopold Edem Ayité Hillah. 2023. LUX-ASR: Building an ASR system for the Luxembourgish language. In 2022 IEEE Spoken Language Technology Workshop (SLT), 1147–1149. Doha, Qatar. https://orbilu.uni.lu/handle/10993/55105.
Gilles, Peter, Nina Hosseini Kivanani & Léopold Edem Ayité Hillah. 2023. ASRLux: Automatic speech recognition for the low-resource language Luxembourgish. In International Conference of Phonetic Sciences (ICPhS), August 7-11, 2023, 3091–3095. Prague: Guarant International. https://hdl.handle.net/10993/55819.