ALOES 2024 pre-conference workshop

Pre-conference Workshop on Internet Spoken Corpora of English

Thursday 28 March 2024

14h 00 Opening
Session 1. YouTube scraping: three methods

14h 15 Adrien Méli: The PEASYV pipeline

14h 45 Peter Uhrig (Erlangen): A pipeline for the creation of multimodal corpora from YouTube (Zoom presentation)


15h 15 Coffee break

Session 2. Hands-on session: Using Python notebooks to stream your data
15h 45 Convenor: Steven Coats (University of Oulu)

Participants need to bring their own laptops and have a Google account, as we will use a Google Colaboratory notebook.


Session 3. Post-processing and corpus curation

17h 00 Steven Coats: CoANZSE/CoANZSE Audio: The Corpus of Australian and New Zealand Spoken English on CLARIN



Session 4. Round table and conclusion

17h 30 Richard Wright: Some extra requirements for corpora: the ATAROS corpus
Discussants: Sylvain Navarro (UPCité), Rory Turnbull (Newcastle) & Richard Wright (Seattle)

For inquiries, please contact sylvain.navarro@u-paris.f

Participation in the workshop is free of charge, but participants must register.

Getting here: Bâtiment Olympe de Gouges, Place Paul Ricoeur, 75013 Paris (building 10 on the map)

Room 720 (7th floor)

Ask for a badge at the information desk (“accueil”)


Zoom link:




Access to the Olympe de Gouges building



Main pointers:
  • Steven Coats. 2023b. A new corpus of geolocated ASR transcripts from Germany. Language Resources and Evaluation.
  • Steven Coats. 2023c. A pipeline for the large-scale acoustic analysis of streamed content. In Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023), pages 51–54. Mannheim: Leibniz-Institut für Deutsche Sprache.
  • Adrien Méli, Steven Coats and Nicolas Ballier. 2023. Methods for phonetic scraping of Youtube videos. In Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), pages 244–249.
  • Adrien Méli and Nicolas Ballier. 2023. PEASYV: A procedure to obtain phonetic data from subtitled videos. In Proceedings of the International Congress of Phonetic Sciences, pages 3211–3215.
  • Adrien’s presentation:


  • Dykes, N., Wilson, A., & Uhrig, P. (2023, September). A Pipeline for the Creation of Multimodal Corpora from YouTube Videos. In Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing (pp. 1-5).



Contact person: Nicolas Ballier
