Multilingual Natural Language Processing: From Upstream to Downstream Tasks
Minh Nguyen
Committee: Thien Huu Nguyen (chair), Thanh Hong Nguyen, Humphrey Shi
Directed Research Project (May 2021)
Keywords: Multilingual Natural Language Processing, Information Extraction, FourIE, Trankit

We present our recent work on developing multilingual Natural Language Processing (NLP) systems for a range of upstream and downstream tasks.

For the upstream tasks, we introduce Trankit, a lightweight Transformer-based toolkit for multilingual NLP. It provides trainable pipelines for fundamental NLP tasks in over 100 languages, along with 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines on sentence segmentation, part-of-speech tagging, morphological tagging, and dependency parsing, while maintaining competitive performance on tokenization, multi-word token expansion, and lemmatization across 90 Universal Dependencies treebanks. Despite using a large pretrained transformer, our toolkit remains efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters, in which a single multilingual pretrained transformer is shared across the pipelines for different languages.
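To illustrate the plug-and-play design, below is a minimal usage sketch based on Trankit's documented API (Pipeline, add, set_active); the example sentences are our own, and exact behavior may vary slightly across versions.

```python
from trankit import Pipeline

# Initialize a pretrained English pipeline; the shared multilingual
# transformer encoder is downloaded and loaded once.
p = Pipeline('english')

# One call runs sentence segmentation, tokenization, POS/morphological
# tagging, lemmatization, and dependency parsing on raw text.
doc_en = p('Trankit is a lightweight multilingual NLP toolkit.')

# Plug in another language: only a small set of adapter weights is
# added, while the large pretrained transformer stays shared in memory.
p.add('chinese')
p.set_active('chinese')
doc_zh = p('你好，世界。')
```

Because each added language contributes only a small set of adapter weights, serving pipelines for many languages costs little more memory than serving one.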

For the downstream tasks, we introduce FourIE, a novel deep learning model that jointly solves the four main tasks of Information Extraction (IE) in a single model: entity mention recognition, relation extraction, event trigger detection, and argument extraction. Existing work on IE has mainly addressed these tasks separately, thus failing to benefit from their inter-dependencies. Compared to the few prior works on jointly performing the four IE tasks, FourIE features two novel contributions for capturing inter-dependencies between tasks. First, at the representation level, we introduce an interaction graph between instances of the four tasks, which is used to enrich the prediction representation of one instance with those of related instances from the other tasks. Second, at the label level, we propose a dependency graph over the information types of the four IE tasks that captures the connections between the types expressed in an input sentence. A new regularization mechanism enforces consistency between the gold and predicted type dependency graphs to improve representation learning (see the sketch below). We show that the proposed model achieves state-of-the-art performance for joint IE in both monolingual and multilingual learning settings across English, Chinese, and Spanish.
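To make the label-level regularization concrete, the sketch below shows one simple way such a consistency term could be realized in PyTorch: both type dependency graphs are represented as adjacency matrices over the type set, and the penalty is the squared difference between the gold graph and the model's predicted edge probabilities. This is an illustrative assumption rather than FourIE's exact formulation; the names build_dependency_graph, consistency_loss, and NUM_TYPES are hypothetical.

```python
import torch

NUM_TYPES = 10  # hypothetical: total number of entity/relation/event/argument types

def build_dependency_graph(connected_type_pairs):
    """Binary adjacency matrix over information types, with an edge for
    each pair of types connected in the sentence (e.g., an event type
    linked to the entity type of one of its arguments)."""
    adj = torch.zeros(NUM_TYPES, NUM_TYPES)
    for i, j in connected_type_pairs:
        adj[i, j] = adj[j, i] = 1.0
    return adj

def consistency_loss(gold_adj, pred_edge_probs):
    """One simple consistency regularizer: squared difference between the
    gold graph and the predicted edge probabilities. Using probabilities
    in [0, 1] keeps the term differentiable for end-to-end training."""
    return ((gold_adj - pred_edge_probs) ** 2).sum()

# Example: the gold graph connects types 2-5 and 5-7; the model predicts
# soft edge probabilities, and the regularizer pushes them to agree.
gold = build_dependency_graph([(2, 5), (5, 7)])
pred = torch.sigmoid(torch.randn(NUM_TYPES, NUM_TYPES))
reg = consistency_loss(gold, pred)  # added to the main training loss
```

In training, a term like reg would be weighted and added to the joint task losses, so that representations yielding type predictions inconsistent with the gold dependency structure are penalized.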