Promoting fairness for under-represented languages in multilingual LLMs (2025-2026)

This project is funded by a University of Washington (UW) Global Innovation Fund (GIF) Research Award ($20,000) supporting a UPCité/UW collaboration to promote fairness for under-represented languages in multilingual LLMs.

PIs for University of Washington (UW) : Gina-Anne Levow and Richard Wright

PI for Université Paris Cité (UPCité) : Nicolas Ballier

Of the roughly 7,000 languages in the world, only a handful are well represented in language technologies such as LLMs. Among those, English is vastly over-represented, and only a few of its dialects are covered. This sampling imbalance is especially acute for spoken language models such as Whisper (Radford et al., 2023), where English alone accounts for more than ¾ of the training data. Under-representation in models leads to poor performance, leaving recent innovations in generative AI out of reach for most of the world’s population. Our collaboration will measure and minimize training-data bias in multilingual language models such as ChatGPT (OpenAI, 2023), Whisper (Radford et al., 2023), and XLM-R (Conneau et al., 2019). We want to examine how the lack of training data adversely affects the representations of lower-resource languages, and we seek to mitigate these harms through improved representations that enhance both task accuracy and model efficiency.

This first event focuses on audio large language models and highlights the role of speech tokenisers. We develop our case studies on Whisper and Wav2vec. A second event, entitled “Speech and text LLMs revisited: under the hood investigation of multilingual models”, will be held in person at UW in fall 2025 for a mixed audience of linguists, engineers and computational scientists.

FIRST EVENT:

Zoom session : 26 May 2025

zoom : https://u-paris.zoom.us/j/86356943682?pwd=llaCqKu8aOajFauHBU36X6DDMddZcb.1
room : tba (first floor)

Bâtiment Olympe de Gouges

8 Place Paul Ricoeur

75013 PARIS

SECOND EVENT : to be announced

Access to the Olympe de Gouges building

This second workshop is intended to discuss various approaches to probing audio LLMs such as Whisper (Radford et al., 2023) or Wav2vec (Baevski et al., 2020).

PROVISIONAL PROGRAMME

16h Nicolas Ballier (UPCité), Gina Levow & Richard Wright (UW) : introduction
We briefly present current research in the making on speech language models.

16h30 Guillaume Wisniewski (UPCité) : Measuring distance with wav2vec & Whisper
In relation to the Deeptypo project.

17h00 Taylor Arnold (Richmond) : Analysing LLM outputs with Unicode

17h30 Behnoosh Namdarzadeh (UPCité) : Whisper transcriptions for lesser-resourced languages : the case of Persian
This talk reports our findings on the transcription task for a language represented by only 24 hours of speech in the training data.

18h15 Discussion and coffee break
18h30 Sara Ngang (Western Washington University)
19h00 The case of automatic alignment for Arabic (tbc)
19h30 TBA
20h closing remarks and end
