ALOES 2024 pre-conference workshop

Pre-conference Workshop on Internet Spoken Corpora of English

Thursday 28 March 2024

Provisional Programme
14h00 Opening
Session 1. YouTube scraping: three methods

14h15 Adrien Méli: The PEASYV pipeline

14h45 Peter Uhrig (Dresden): Using YouTube for multimodal data
(Zoom presentation)

15h15 Steven Coats (University of Oulu): Python pipeline

15h45 Coffee break

Session 2. Hands-on session: Using Python notebooks to stream your data
16h15 Convenor: Steven Coats (University of Oulu)

Bring your own laptop: we will see how to use a Jupyter notebook to collect data.

Session 3. Post-processing and corpus curation

17h00 Steven Coats: CoANZSE Audio: The Corpus of Australian and New Zealand Spoken English on CLARIN

Session 4. Round table and conclusion

17h30 Richard Wright: Some extra requirements for corpora: the ATAROS corpus
Discussants: Sylvain Navarro (UPCité), Rory Turnbull (Newcastle) & Richard Wright (Seattle)

For inquiries please contact and sylvain.navarro@u-paris.f

Participation in the workshop is free of charge, but participants must register.

Getting here: Bâtiment Olympe de Gouges, Place Paul Ricoeur, 75013 Paris (building 10 on the map)

Access to the Olympe de Gouges building


Call for Papers
Deadline for Submissions: January 8, 2024
The vast amount of spoken data available on internet platforms such as YouTube and Dailymotion constitutes an invaluable source of raw material for linguistic research and language teaching. However, this type of data generally requires some processing, such as segmentation, annotation, transcription, alignment, outlier detection, and efficient data management, before it can be properly exploited.
The pre-conference workshop seeks to bring together the latest research and innovative approaches in the field of Internet spoken corpora of English. We invite papers that address various aspects of data collection, processing pipelines, outlier identification, and data management in the context of Internet spoken corpora of English.
We welcome submissions on a wide range of topics related to Internet corpora of English, including but not limited to:
    Data Collection Methodologies: Papers discussing novel approaches and tools for collecting spoken data from the internet, taking ethical and legal considerations into account.
    Data Preprocessing Pipelines: Research on data cleaning, normalization, tokenization, and other preprocessing steps tailored to Internet spoken corpora.
    Outlier Detection: Methods and techniques for identifying and handling outliers, anomalies, and noise in Internet spoken data.
    Data Management Strategies: Effective strategies for storing, indexing, and managing large-scale Internet corpora for efficient access and analysis.
    Quality Control and Annotation: Approaches for ensuring data quality, reliability, and automatic annotation of Internet spoken corpora.
    Use Cases and Applications: Research on practical applications of Internet spoken corpora in linguistics, NLP, the social sciences, and the design of pedagogical resources.
Submission Guidelines:
    One-page abstracts should be sent to and by January 8, 2024.
This pre-conference workshop will take place at Université Paris Cité, Olympe de Gouges building, room 730 (seventh floor, to be confirmed). Please note that the ALOES 2024 conference itself takes place at Université Sorbonne Paris Nord in Villetaneuse on Friday 29 and Saturday 30 March.
    Submission Deadline: January 8, 2024
    Decision sent: January 15, 2024
    Pre-conference Workshop: March 28, 2024



Contact person: Nicolas Ballier

