Based on two available tools for automatic phonetic transcription, an adaptation for the Luxembourgish language has recently been developed. The tool converts any written input – orthographically correct or not – into the corresponding phonetic transcription (grapheme-to-phoneme conversion, g2p). Existing g2p systems like Sequitur g2p or the more recent Gramophone require a certain amount of training data to generate conversion models. During statistical modeling, all possible grapheme combinations are extracted and matched with the corresponding phonemes. Once training is complete, the transcription model can be applied to any text.
To adapt these systems for Luxembourgish, it is first necessary to set up appropriate training data. For Luxembourgish, a list of some 7000 manually transcribed words serves as the basis (see some examples below). These pairs of orthographic word forms and their corresponding phonetic transcriptions then form the input for the training process. The manual transcriptions are canonical, 'phonological' transcriptions based on the idea of an 'underlying form'. No phonetic variants, stylistic variation or connected speech processes are taken into account. For the system to perform successfully, it is imperative to produce a highly consistent set of input data.
anzwousch        ɑntswəʊʃ
apaart           ɑpaːʀt
apaken           ɑpaːkən
apdikt           ɑpdikt
apdikter         ɑpdiktɐ
apel             aːpəl
apostroph        ɑpostʀof
apparat          ɑpɑʀaːt
äppeljus         æpəlʒyː
appetit          ɑpətit
appetitlech      ɑpətitleɕ
applaus          ɑplɑʊs
approuvéieren    ɑpʀuvɜɪəʀən
approximatiivt   ɑpʀoksimɑtiːft
approximative    ɑpʀoksimaːtiːvə
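Since consistency of the input data matters so much, a small check over the word/transcription pairs can help before training. The following is a minimal sketch, not the procedure actually used for this project: the function names are hypothetical, the segmentation rule (length marks and combining diacritics attach to the preceding symbol, diphthongs stay split) is a simplifying assumption, and the tab-separated output with space-separated phoneme symbols is only assumed to resemble what a g2p trainer expects.

```python
import unicodedata
from collections import defaultdict

LENGTH_MARK = "\u02d0"  # IPA length mark "ː"


def segment(ipa):
    """Split an IPA string into symbols.

    Assumption: a length mark or combining diacritic belongs to the
    preceding character; every other character starts a new symbol.
    """
    symbols = []
    for ch in ipa:
        if symbols and (ch == LENGTH_MARK or unicodedata.combining(ch)):
            symbols[-1] += ch
        else:
            symbols.append(ch)
    return symbols


def find_conflicts(pairs):
    """Return words (case-insensitive) listed with more than one transcription."""
    seen = defaultdict(set)
    for word, ipa in pairs:
        seen[word.lower()].add(ipa)
    return {w: sorted(t) for w, t in seen.items() if len(t) > 1}


def to_lexicon_lines(pairs):
    """Format pairs as 'word<TAB>space-separated symbols' (assumed format)."""
    return [f"{word}\t{' '.join(segment(ipa))}" for word, ipa in pairs]


# Three entries taken from the example list above.
pairs = [("apel", "aːpəl"), ("apdikt", "ɑpdikt"), ("applaus", "ɑplɑʊs")]
conflicts = find_conflicts(pairs)
lexicon_lines = to_lexicon_lines(pairs)

for line in lexicon_lines:
    print(line)
print("conflicting entries:", conflicts)
```

A non-empty result from find_conflicts points to words that were transcribed in more than one way and should be reviewed by hand.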
Try the online tool: Automatesch foneetesch Transkription
The system can be used to generate pronunciation dictionaries, e.g. for speech synthesis systems (such as our MarryLux TTS).
As an example, a recent news item from RTL.lu is transcribed below (note that numbers are treated as words and transcribed as well).
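Generating a pronunciation dictionary from running text could look roughly like the sketch below. It assumes the trained model is available as a plain Python callable that maps one word to its transcription; toy_g2p is only a hypothetical stand-in backed by two entries from the example list, and build_pronunciation_dictionary is an illustrative name, not part of Sequitur g2p or Gramophone.

```python
import re
import unicodedata


def build_pronunciation_dictionary(text, g2p):
    """Map every distinct word in `text` to the transcription returned by `g2p`.

    The text is normalised to NFC and lower-cased so that accented letters
    are single characters and each word type is transcribed only once.
    """
    normalized = unicodedata.normalize("NFC", text).lower()
    words = sorted(set(re.findall(r"\w+", normalized)))
    return {word: g2p(word) for word in words}


# Hypothetical stand-in for a trained g2p model (assumption: one word in,
# one transcription out). Unknown words are marked with "?".
TOY_MODEL = {"apel": "aːpəl", "applaus": "ɑplɑʊs"}


def toy_g2p(word):
    return TOY_MODEL.get(word, "?")


lexicon = build_pronunciation_dictionary("Apel, Applaus, apel!", toy_g2p)
for word, transcription in lexicon.items():
    print(word, transcription)
```

Passing the model in as a callable keeps the dictionary-building step independent of whichever g2p system produced the model.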
Victoire fir de Peter Sagan, Drucker gëtt 19.
Um Sonndeg war den 100. Tour de Flandres an der Belsch.
De Weltmeeschter Peter Sagan wënnt déi 100. Tour de Flandres.
Bei den Hären ass dem Jempy Drucker säi Coequipier a Leader Greg van Avermaet 100 Kilometer virun der Arrivée gefall an huet missen opginn.
Transcription by Sequitur g2p
[viktwaːʀ fiɐ də peːtɐ zɑgaːn dʀukɐ gət nontseŋ
um zondeɕ vaːʀ dən eːnhonɐt tuːʀ də flɑndʀəs ɑn dɐ bælʃ
də væltmeːʃtɐ peːtɐ zɑgaːn vənt dɜɪ eːnhonɐt tuːʀ də flɑndʀəs
bɑɪ dən hɛːʀən ɑs dəm ʑəmpiː dʀukɐ zæːɪ koekipjeː aː leaːdɐ gʀeɕ vaːn ɑfɐmɑeː eːnhonɐt kilomeːtɐ fiːʀun dɐ ɑʀiveː gəfɑl ɑn huət misən opgin]
Transcription by Gramophone
[viktwɑʀ fiɐ də peːtɐ zɑgaːn dʀukɐ gət nontseŋ
um zondeɕ vaːʀ dən eːnhonɐt tuːʀ də flɑndʀəs ɑn dɐ bælʃ
də væltmeːʃtɐ peːtɐ zɑgaːn vənt dɜɪ eːnhonɐt tuːʀ də flɑndʀəs
bɑɪ dən hɛːʀən ɑs dəm ʒɔ̃ːpiː dʀukɐ zæːɪ koːəkipjeː aː leaːdɐ gʀeɕ vaːn ɑvɐmɑeː eːnhonɐt kilomeːtɐ fiːʀun dɐ ɑʀiveː gəfɑl ɑn huət misən opgin]
Both systems perform quite well. Nearly all Germanic-Luxembourgish words in particular are transcribed correctly. This also applies to inflected word forms, where e.g. final devoicing is handled correctly. Of course, transcription errors remain; they mostly concern loan words from French (*[flɑndʀəs] > [flɔ̃ːndʀə]), German and English (*[leaːdɐ] > [liːdɐ]), which are very frequent in the lexicon of this multilingual speech community, and names (*[ʑəmpiː], *[ʒɔ̃ːpiː] > [ʑæmpiː]). Extending the training data with many more French words should lead to improvement.
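To see where the two systems disagree, their outputs can be compared token by token. The sketch below does this for the first line of the two transcriptions above. The function name token_differences is hypothetical, and the comparison assumes both outputs contain the same words in the same order.

```python
from itertools import zip_longest


def token_differences(a, b):
    """Return (position, token_a, token_b) wherever the two outputs differ.

    Tokens are whitespace-separated; a missing token is reported as "".
    """
    tokens_a = a.split()
    tokens_b = b.split()
    return [
        (i, ta, tb)
        for i, (ta, tb) in enumerate(zip_longest(tokens_a, tokens_b, fillvalue=""))
        if ta != tb
    ]


# First line of each transcription, copied from the outputs above.
sequitur = "viktwaːʀ fiɐ də peːtɐ zɑgaːn dʀukɐ gət nontseŋ"
gramophone = "viktwɑʀ fiɐ də peːtɐ zɑgaːn dʀukɐ gət nontseŋ"

differences = token_differences(sequitur, gramophone)
for position, seq_token, gram_token in differences:
    print(f"token {position}: Sequitur {seq_token} vs Gramophone {gram_token}")
```

Applied to all four lines, such a comparison lists only the tokens that deserve a closer look, here mostly French loan words and names.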