PHC Ulysse (2019)

Investigating criterial features of learner English and AI-driven automatic language level assessment.

Funding

Partenariat Hubert Curien (PHC) Ulysse 2019 (ref. 43121RJ), 5,000 euros

With the financial support of the French Ministry for Europe and Foreign Affairs (Ministère de l’Europe et des Affaires étrangères, MEAE) and the French Ministry of Higher Education, Research and Innovation (Ministère de l’Enseignement supérieur, de la Recherche et de l’Innovation, MESRI).

Partners

Irish Partner: Insight
- PI: Manel Zarrouk
- Bernardo Stearns
- Annanda Sousa
- Andrew Simpkin – Insight Centre for Data Analytics/NUI Galway
French Partner: CLILLAC-ARP
- PI: Nicolas Ballier
- Thomas Gaillat
- Manon Bouyé

Summary

One of the social sciences and humanities (SHS) projects selected for PHC Ulysses 2019 (24 projects were ultimately selected out of 78 submissions). On the French side, this binational project involves a former PhD student (Thomas Gaillat), a current PhD student of the laboratory (Manon Bouyé) and Nicolas Ballier, coordinator of the French part of the project.

Aim

This project aims to investigate criterial features in learner English and to build a proof-of-concept system for language level assessment in English. Our research focus is to identify linguistic features and to integrate them within a system based on Artificial Intelligence (AI). The purpose is to create a system that analyses essays written by learners of English and maps them to the levels of the Common European Framework of Reference for Languages (CEFRL).

Background Research

Benefiting from Xiaofei Lu’s invitation to Paris, we experimented with the tools he designed (Lexical Complexity Analyzer and L2 Syntactic Complexity Analyzer) as well as R packages such as {zipfR}, {koRpus} and {quanteda} to investigate complexity metrics for learner data.

TALN2016: VGC paper

This paper (in French) discussed the possibility of assigning levels to the ANGLISH corpus (Tortel, 2009) based on vocabulary growth curves (VGC) and complexity metrics. The metrics were applied to the intermediate, advanced and native spontaneous productions of the corpus. They yielded a much less precise classification than the timing of syllables measured by syllable type (stressed, reduced), as shown by Ballier, Martin and Amand (2016).

Ballier, Nicolas, Thomas Gaillat. 2016. “Classification d’apprenants francophones de l’anglais sur la base des métriques de complexité lexicale et syntaxique”. JEP-TALN-RECITAL 2016, Jul 2016, Paris, France. Actes de la conférence conjointe JEP-TALN-RECITAL 2016, 9, pp.1-14, 2016, ELTAL. PDF

Ballier, Nicolas ; Martin, Philippe & Amand, Maelle. 2016. Variabilité des syllabes réalisées par des apprenants de l’anglais. Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 1 : JEP 2016, Paris, France. pp.732-740. PDF
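
As an illustration of the vocabulary growth curves mentioned for the TALN2016 paper, here is a minimal Python sketch. It is not the code used in the paper, which relied on the R packages listed above. The function names, the whitespace tokenisation and the toy sentence are assumptions made for this example only. A vocabulary growth curve records how many distinct word types have been seen after reading the first N tokens of a text; the type-token ratio is the simplest lexical diversity metric derived from the same counts.

```python
from typing import List, Tuple


def vocabulary_growth_curve(tokens: List[str], step: int = 1) -> List[Tuple[int, int]]:
    """Return (tokens read, distinct types seen) pairs, sampled every `step` tokens.

    The final token is always sampled so that the curve ends at the text length.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token.lower())  # case-insensitive types (an assumption)
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(seen)))
    return curve


def type_token_ratio(tokens: List[str]) -> float:
    """Number of distinct types divided by number of tokens (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)


# Toy input, tokenised on whitespace (hypothetical data, not from ANGLISH).
sample = "the cat sat on the mat and the dog sat on the rug".split()
curve = vocabulary_growth_curve(sample, step=4)
print(curve)
print(round(type_token_ratio(sample), 3))
```

A flatter curve indicates that a speaker or writer keeps reusing the same vocabulary. Because the raw type-token ratio depends on text length, curves or length-corrected metrics are generally preferred when comparing productions of different sizes.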

LCR2017, CILC2017 papers

For Paula Lissón’s M1 MA thesis, we experimented with metrics to measure progression between first-year and second-year students in the longitudinal corpus Diderot-Longdale (Goutéreaux 2013).

Ballier, N. & Lissón, P. (2017) A corpus-based evaluation of readability metrics as indices of syntactic complexity in EFL learners’ written productions, Bolzano, LCR2017, 5-7 October 2017.

Ballier, N. & Lissón, P. (2017) Estudio de la aplicabilidad de la ley de Zipf y de la ley de Heaps en los corpus de aprendientes de inglés. CILC2017, Paris, 30-31 May 2017.

CAp 2018 dataset and paper

We designed the dataset for the data challenge of CAp2018, the yearly conference of the French machine learning community. The conference competition was entitled “My Tailor is rich! Predicting English level by analyzing writing styles”. We explained the metrics applied to the French component of the EFCAMDAT corpus. The training and testing datasets were hosted by the owner of the corpus for copyright reasons. A paper summarizing the mutual benefits of this competition has been submitted to an international journal.

A paper adding syntactic complexity metrics to the CAp2018 dataset was also accepted at this French machine learning conference. Arnold, T., Ballier, N., Gaillat, T. & Lissón, P. (2018) Predicting CEFRL levels in learners of English on the basis of metrics and full texts, CAp2018 conference, Université de Rouen, 19-21 June 2018. Paper 31 in the proceedings of the conference. ARXIV

Discours: we applied random forest techniques to classify three stages of the first year of Spanish learners of French at university.

Paula Lissón & Nicolas Ballier (2018) ‘Investigating lexical progression through lexical diversity metrics in a corpus of French L3’, Discours, 23, 2. PDF

The main results of Paula Lissón’s M2 MA thesis, which investigated the complexity variable(s) behind French learners’ choices between that and zero in restrictive relative clauses in English, were presented at LCR2019. Lissón, P., Ballier, N. & Gerdes, K. (2019) On relativizer use in learner English: a corpus-based study, LCR2019, Warsaw, 12-14 September 2019.

Meetings

Galway meeting (May 7-10)

Papers submitted to EC-TEL2019

Paris meeting (Oct 27-31)

Workshop on CEFR level prediction of texts in learner corpora, feedback to learners and learning analytics. Save the date: Oct 30th.

Deliverables

EIAH2019: this poster presents a prototype for data visualisation based on learner texts, to be used for formative feedback.
EUROCALL2019: 2 papers accepted (research track and project track):
- Nicolas Ballier, Thomas Gaillat, Manel Zarrouk, Andrew Simpkin, Manon Bouyé, Annanda Sousa, Bernardo Stearns (2019) A Franco-Irish project for the automatic identification of criterial features in learners of English, Eurocall2019, project track, Louvain, 28-31 Aug 2019.
- Thomas Gaillat, Nicolas Ballier, Manel Zarrouk, Andrew Simpkin, Manon Bouyé, Annanda Sousa, Bernardo Stearns (2019) Investigating criterial features of learner English and predicting CEFR levels in French learners of English, Eurocall2019, research track, Louvain, 28-31 Aug 2019.
Gaillat, T. & Ballier, N. (2019) Investigating the scope of textual metrics for learner level discrimination and learner analytics, LCR2019, Warsaw, 12-14 September 2019. (This paper discusses random forest techniques for CEFR classification of ESP learners.) Abstract
Ballier, N., Gaillat, T., Simpkin, A., Stearns, B., Bouyé, M., & Zarrouk, M. (2019). A Supervised Learning Model for the Automatic Assessment of Language Levels Based on Learner Errors. In European Conference on Technology Enhanced Learning (pp. 308-320). Springer. EC-TEL 2019 Proceedings
A research paper (submitted) describing the micro-systems and the precision rate
A conference paper describing the architecture: Annanda Sousa, Nicolas Ballier, Thomas Gaillat, Bernardo Stearns, Manel Zarrouk, Andrew Simpkin and Manon Bouyé (2020) From Linguistic Research Projects to Language Technology Platforms: A Case Study in Learner Data. ACL / LREC2020 1st International Workshop on Language Technology Platforms. Marseilles, 16 May 2020, 112-120. PDF.
a demo paper summarising our project (in French) : Thomas Gaillat, Nicolas Ballier, Annanda Sousa, Manon Bouyé, Andrew Simpkin, et al. (2020) Un prototype en ligne pour la prédiction du niveau de compétence en anglais des productions écrites. 6e conférence conjointe Journées d’Études sur la Parole (JEP, 31e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition), 2020, Nancy, France. pp.30-33. ⟨hal-02768504⟩. PDF. See the presentation on the conference website.
User interface (work still in progress) to be hosted on HUMA-NUM
Link to the private gitlab (password protected).
datasets (on request)
film demo of the interface

Resources

pipeline
prototype demo UI thanks to HUMA-NUM!
Inventory of features
Dataset to be hosted by Cambridge

In the making (2020-2021)

a comparison of our pipeline with the REALEC interface for Russian learners of English
a version for Arabic to be tested for a chapter on bilingual writer skills to be submitted to Routledge
Visualisation for automatic feedback: follow this project on ResearchGate
a version for French to be tested with linguistic data on Simple English / ‘style clair’ in French
using the pipeline to investigate the workings of neural networks for neural machine translation.