Investigating criterial features of learner English and AI-driven automatic language level assessment.
Funding
PHC Hubert Curien Ulysse 2019 (ref 43121RJ) (5 000 Euros)
With the financial support of the French Ministry for Europe and Foreign Affairs (Ministère de l’Europe et des Affaires étrangères, MEAE) and the French Ministry of Higher Education, Research and Innovation (Ministère de l’Enseignement supérieur, de la Recherche et de l’Innovation, MESRI).
Partners
- Irish Partner: Insight
- PI: Manel Zarrouk
- Bernardo Stearns
- Annanda Sousa
- Andrew Simpkin – Insight Centre for Data Analytics/NUI Galway
- French Partner: CLILLAC-ARP
- PI: Nicolas Ballier
- Thomas Gaillat
- Manon Bouyé
Summary
One of the humanities and social sciences (SHS) projects selected for the PHC Ulysses 2019 (24 projects selected in the end out of 78 submissions). On the French side, this binational project involves a former PhD student (Thomas Gaillat), a current PhD student of the research unit (Manon Bouyé) and Nicolas Ballier, coordinator of the French part of the project.
Aim
This project aims to investigate criterial features in learner English and to build a proof-of-concept system for language level assessment in English. Our research focus is to identify linguistic features and to integrate them within a system based on Artificial Intelligence (AI). The purpose is to create a system that analyses essays written by learners of English and maps them to specific levels of the Common European Framework of Reference for Languages (CEFRL).
Background Research
Benefiting from Xiaofei Lu’s invitation to Paris, we have experimented with the tools he designed (Lexical Complexity Analyzer and L2 Syntactic Complexity Analyzer) as well as R packages such as {zipfR}, {koRpus} and {quanteda} for the investigation of complexity metrics for learner data.
TALN2016: vocabulary growth curve (VGC) paper
This paper (in French) discussed the possibility of assigning levels to productions in the ANGLISH corpus (Tortel, 2009) based on vocabulary growth curves and complexity metrics. Metrics were applied to intermediate, advanced and native spontaneous productions of the ANGLISH corpus. These metrics classified the productions much less precisely than syllable timing measured by syllable type (stressed, reduced), as shown by Ballier, Martin and Amand (2016).
Ballier, Nicolas, Thomas Gaillat. 2016. “Classification d’apprenants francophones de l’anglais sur la base des métriques de complexité lexicale et syntaxique”. JEP-TALN-RECITAL 2016, Jul 2016, Paris, France. Actes de la conférence conjointe JEP-TALN-RECITAL 2016, 9, pp.1-14, 2016, ELTAL. PDF
Ballier, Nicolas ; Martin, Philippe & Amand, Maelle. 2016. Variabilité des syllabes réalisées par des apprenants de l’anglais. Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 1 : JEP 2016, Paris, France. pp.732-740. PDF
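As an illustration of the vocabulary growth curves mentioned above, here is a minimal sketch in Python. It is not the code used in the paper (which relied on R packages): the function names, the simple regular-expression tokenizer and the sample sentence are all assumptions made for this example. The curve records how many distinct word types have been seen after each token of a text.

```python
import re


def tokenize(text):
    # Hypothetical, deliberately simple tokenizer: lowercase alphabetic words.
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_growth_curve(text):
    """Return a list of (tokens_seen, types_seen) pairs, one per token."""
    seen = set()
    curve = []
    for i, token in enumerate(tokenize(text), start=1):
        seen.add(token)
        curve.append((i, len(seen)))
    return curve


def type_token_ratio(text):
    """Number of distinct types divided by number of tokens (0.0 if empty)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented sample sentence, for illustration only.
sample = "the cat sat on the mat and the dog sat on the rug"
curve = vocabulary_growth_curve(sample)
print(curve[-1])                 # (tokens, types) at the end of the text
print(type_token_ratio(sample))
```

A curve that flattens early indicates that a speaker or writer keeps reusing the same vocabulary. Comparing curves across texts of different lengths requires care, since type counts depend on text length.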
LCR2017, CILC2017 papers
For Paula Lissón’s M1 MA thesis, we experimented with metrics to measure progression between first-year and second-year students in the longitudinal corpus Diderot-Longdale (Goutéreaux 2013).
Ballier, N. & Lissón, P. (2017) A corpus-based evaluation of readability metrics as indices of syntactic complexity in EFL learners’ written productions, Bolzano, LCR2017, 5-7 October 2017.
Ballier, N. & Lissón, P. (2017). Estudio de la aplicabilidad de la ley de Zipf y de la ley de Heaps en los corpus de aprendientes de inglés. CILC2017, Paris, 30-31 May 2017.
CAp 2018 dataset and paper
We designed the dataset for the data challenge of CAp2018, the yearly conference of the French machine learning community. The competition was entitled "My Tailor is rich! Predicting English level by analyzing writing styles". We explained the metrics applied to the French component of the EFCAMDAT corpus. The training and testing datasets were hosted by the owner of the corpus for copyright reasons. A paper summarizing the mutual benefits of this competition has been submitted to an international journal.
A paper adding syntactic complexity metrics to the CAp2018 dataset was also accepted at the same conference. Arnold, T., Ballier, N., Gaillat, T. & Lissón, P. (2018) Predicting CEFRL levels in learners of English on the basis of metrics and full texts, CAp2018 conference. Université de Rouen. 19-21 June 2018. Paper 31 in the proceedings of the conference. ARXIV
Discours : we have applied random forest techniques to classify three stages of the first year of Spanish learners of French at university.
Paula Lissón & Nicolas Ballier (2018) ‘Investigating lexical progression through lexical diversity metrics in a corpus of French L3’, Discours, 23, 2. PDF
The main results of Paula Lissón’s M2 MA thesis, which investigated the complexity variable(s) behind French learners’ choices between that and zero in restrictive relative clauses in English, were presented at LCR2019. Lissón, P., Ballier, N. & Gerdes, K. (2019) On relativizer use in learner English: a corpus-based study, LCR2019, Warsaw, 12-14 September 2019.
Meetings
Galway meeting (May 7-10)
- papers submitted to EC-TEL2019
Paris meeting (Oct 27-31)
- Workshop on CEFR level prediction of texts in learner corpora, feedback to learners and learning analytics. Save the date : Oct 30th
Deliverables
- EIAH2019 This poster presents a prototype for Data visualisation based on learner texts to be used for formative feedback.
- EUROCALL2019 : 2 papers accepted : research track / project track:
- Nicolas Ballier, Thomas Gaillat, Manel Zarrouk, Andrew Simpkin, Manon Bouyé, Annanda Sousa, Bernardo Stearns (2019) A Franco-Irish project for the automatic identification of criterial features in learners of English, Eurocall2019, project track, Louvain, 28-31 Aug 2019.
- Thomas Gaillat, Nicolas Ballier, Manel Zarrouk, Andrew Simpkin, Manon Bouyé, Annanda Sousa, Bernardo Stearns (2019) Investigating criterial features of learner English and predicting CEFR levels in French learners of English, Eurocall2019, research track, Louvain, 28-31 Aug 2019.
- Gaillat, T. & Ballier, N. (2019) Investigating the scope of textual metrics for learner level discrimination and learner analytics, LCR2019, Warsaw, 12-14 September 2019. (This paper discusses random forest techniques for CEFR classification of ESP learners.) Abstract
- Ballier, N., Gaillat, T., Simpkin, A., Stearns, B., Bouyé, M., & Zarrouk, M. (2019). A Supervised Learning Model for the Automatic Assessment of Language Levels Based on Learner Errors. In European Conference on Technology Enhanced Learning (pp. 308-320). Springer. EC-TEL 2019 Proceedings
- a Research paper (submitted) describing the micro-systems and the precision rate
- a conference paper describing the architecture : Annanda Sousa, Nicolas Ballier, Thomas Gaillat, Bernardo Stearns, Manel Zarrouk, Andrew Simpkin and Manon Bouyé (2020) From Linguistic Research Projects to Language Technology Platforms : A Case Study in Learner Data . ACL / LREC2020 1st International Workshop on Language Technology Platforms. Marseilles, 16 May 2020, 112-120. PDF.
- a demo paper summarising our project (in French) : Thomas Gaillat, Nicolas Ballier, Annanda Sousa, Manon Bouyé, Andrew Simpkin, et al. (2020) Un prototype en ligne pour la prédiction du niveau de compétence en anglais des productions écrites. 6e conférence conjointe Journées d’Études sur la Parole (JEP, 31e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition), 2020, Nancy, France. pp.30-33. ⟨hal-02768504⟩. PDF. See the presentation on the conference website.
- User interface (work still in progress) to be hosted on HUMA-NUM
- Link to the private gitlab (password protected).
- datasets (on request)
- film demo of the interface
Resources
- pipeline
- prototype demo UI thanks to HUMA-NUM!
- Inventory of features
- Dataset to be hosted by Cambridge
In the making (2020-2021)
- a comparison of our pipeline with the REALEC interface for Russian learners of English
- a version for Arabic to be tested for a chapter on bilingual writer skills to be submitted to Routledge
- Visualisation for automatic feedback: Follow this project on ResearchGate
- a version for French to be tested with linguistic data on Simple English / ‘style clair’ in French
- using the pipeline to investigate the workings of neural networks for neural translation.