Neurocorpus: neural translations and augmented corpora (aspects of a roadmap for corpus linguists in NMT). 

PI

Summary

This exploratory project is intended as a proof of concept for a resubmission to the Université de Paris émergence 2020 call for projects. The Neurocorpus project received $5,000 in credits on Google Cloud Platform.

Background

Neural machine translation (NMT) is a fairly recent approach (Kalchbrenner & Blunsom, 2013; Bahdanau et al., 2014; Sutskever et al., 2014) that has not only significantly improved machine translation performance but also radically changed the intellectual landscape of the field. By reviving a formalism from the 1980s, it all but buried statistical machine translation, which was too technical to be accessible to most linguists. This considerably changes the situation for corpus linguists, who have a long tradition of building and using tools. Provided they learn a programming language, corpus linguists can once again be actors in the field rather than expert spectators of machine translation errors: they can contribute to the analysis of neural processes and potentially improve results by working on the systems' input and output data. Since no one yet fully understands what the neural network does, exploring machine translation from a linguistic standpoint makes sense again, if only through a behaviourist approach to neural networks. In short, the generalization of deep learning restores the full value of linguists' intuitions, especially in the preparation of input data: Valette & Eensoo (2015) showed for sentiment analysis that excluding pronouns, a standard preprocessing step among computer scientists, considerably degrades the results. Around this project, we have formed a team that brings together experts in language and in statistical learning. Beyond scientific production, the objective is to develop a demonstrator: a specialized translation engine for medical English that integrates the project's progress.
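
To make the pronoun example concrete, here is a minimal Python sketch; the stop list and the sentence are illustrative assumptions of ours, not Valette & Eensoo's actual pipeline. It shows how the standard stop-word filter silently deletes the first- and second-person markers that carry the enunciative cues a sentiment classifier needs.

    # Illustrative only: a toy stop list standing in for the pronoun-exclusion
    # step that Valette & Eensoo (2015) argue against.
    FRENCH_PRONOUNS = {"je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles"}

    def preprocess(tokens, drop_pronouns=True):
        """Lowercase tokens; optionally filter out pronouns (the contested step)."""
        tokens = [t.lower() for t in tokens]
        if drop_pronouns:
            tokens = [t for t in tokens if t not in FRENCH_PRONOUNS]
        return tokens

    sentence = "Je vous recommande vivement ce produit".split()
    print(preprocess(sentence))                       # person markers lost
    print(preprocess(sentence, drop_pronouns=False))  # pronouns kept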

Work packages

We have structured our project around two work packages.

  1. The exploration of the black box
    This fundamental research strand seeks to understand and interpret (Montavon et al., 2018) what the neural network does: how does the system learn to translate from the input corpora and the algorithms? The most ambitious and risky objective is to contribute to the study of the impact of neural network architecture on NMT performance and to develop a behaviourist approach to RNNs (a minimal probing sketch follows this list). Deliverables: articles will be submitted to conferences in the field, such as ACL, EMNLP or WMT, or to their satellite events (e.g. the BlackboxNLP workshop).
  2. Augmented corpora (case studies)
    We have selected several types of corpora to study annotation-related input issues, including corpora "augmented" with several layers of syntactic annotation, in collaboration with other linguists (see the input-formatting sketch after this list).
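
As an illustration of what a behaviourist probe in work package 1 might look like, here is a minimal, hypothetical Python sketch: the translate placeholder and the stimuli are assumptions of ours, not a committed design; in practice translate would call the trained engine (e.g. an OpenNMT model behind its translation server). The idea is to feed the network minimal pairs and log output divergences for linguistic inspection.

    def translate(src: str) -> str:
        # Hypothetical stand-in so the sketch runs; replace with a call to the
        # trained NMT engine.
        return f"<hypothesis for: {src}>"

    def probe(base, variants):
        """Translate a base sentence and minimally contrasting variants,
        flagging which manipulations change the output."""
        ref = translate(base)
        print(f"BASE    {base!r} -> {ref!r}")
        for label, variant in variants.items():
            hyp = translate(variant)
            flag = "unchanged" if hyp == ref else "changed"
            print(f"{label:<8}{variant!r} -> {hyp!r} [{flag}]")

    # Illustrative stimuli: tense and number manipulations on a medical sentence.
    probe("The patient reports chest pain.",
          {"PAST": "The patient reported chest pain.",
           "PLURAL": "The patients report chest pain."})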
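
For work package 2, the sketch below shows one possible input format for an "augmented corpus": each source token is fused with one annotation layer before being handed to the NMT toolkit, in the spirit of the Vanmassenhove & Way (2018) replication with OpenNMT listed under Accepted Papers. The separator and the part-of-speech tags are illustrative assumptions, not the project's fixed scheme.

    def augment(tokens, tags, sep="|"):
        """Fuse each token with its annotation into a single source symbol."""
        assert len(tokens) == len(tags), "one tag per token expected"
        return " ".join(f"{tok}{sep}{tag}" for tok, tag in zip(tokens, tags))

    tokens = ["The", "patient", "reports", "chest", "pain", "."]
    pos    = ["DET", "NOUN", "VERB", "NOUN", "NOUN", "PUNCT"]
    print(augment(tokens, pos))
    # -> The|DET patient|NOUN reports|VERB chest|NOUN pain|NOUN .|PUNCT

Additional layers (e.g. dependency labels) could be stacked the same way; whether the network can exploit them is precisely the learnability question raised in the REPROLANG replication.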

References

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Kalchbrenner, N., & Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1700-1709).
Montavon, G., Samek, W., & Müller, K. R. (2018). Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73, 1-15.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
Valette, M., & Eensoo, E. (2015). Une sémantique de corpus pour la fouille de textes. In A. Rabatel, A. Ferrara-Léturgie, & A. Léturgie (Eds.), La sémantique et ses interfaces. Actes du colloque 2013 de l'Association des Sciences du Langage (pp. 205-224). Limoges: Lambert-Lucas. ISBN 978-2-35935-129-3.

Accepted Papers

  • Nicolas Ballier, Nabil Amari, Laure Merat & Jean-Baptiste Yunès (accepted). The Learnability of the Annotated Input in NMT: Replicating (Vanmassenhove & Way, 2018) with OpenNMT. LREC 2020 REPROLANG workshop, Marseille, 11-16 May 2020, 8 p. (This replication study outlines our roadmap for corpus annotation of the target texts.)

Github
