NoDaLiDa 2023 - May 22-24, 2023


SESSION 1 - LANGUAGE MODELS

BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer

Lucas Georges Gabriel Charpentier, Sondre Wold, David Samuel, Egil Rønningstad



Abstract
Retrieval-based language models are increasingly employed in question-answering tasks. These models search a corpus of documents for relevant information instead of storing all factual knowledge in their parameters, thereby enhancing efficiency, transparency, and adaptability. We develop the first Norwegian retrieval-based model by adapting the REALM framework and evaluate it on various tasks. After training, we also separate the language model, which we call the "reader", from the retriever components, and show that the reader can be fine-tuned on a range of downstream tasks. Results show that retrieval-augmented language modeling improves the reader's performance on extractive question-answering, suggesting that this type of training improves language models' general ability to use context, and that this does not come at the expense of other abilities such as part-of-speech tagging, dependency parsing, named entity recognition, and lemmatization. Code, trained models, and data are made publicly available.
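The core operation of a REALM-style retriever, scoring candidate documents against a query embedding and keeping the best matches, can be illustrated with a minimal sketch. This is a toy inner-product retriever of our own, not the authors' implementation; the function name and the use of plain NumPy vectors in place of learned encoder outputs are illustrative assumptions.

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vecs, k=2):
    """Toy dense retrieval sketch (not the BRENT implementation):
    score each document embedding by its inner product with the
    query embedding and return the indices of the k best matches."""
    scores = doc_vecs @ query_vec          # (num_docs,) similarity scores
    return np.argsort(scores)[::-1][:k]    # indices, highest score first

# Example: document 0 matches the query exactly, document 2 is close.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
top = retrieve_top_k(query, docs, k=2)
```

In REALM-style training the retrieved passages are then concatenated with the question and fed to the reader, whose gradients also flow back into the retriever; this sketch covers only the scoring step.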

Training and Evaluating Norwegian Sentence Embedding Models

Bernt Ivar Utstøl Nødland


Abstract
We train and evaluate Norwegian sentence embedding models using the SimCSE contrastive learning methodology. Starting from pre-trained Norwegian encoder models, we train both unsupervised and supervised variants. The models are evaluated on a machine-translated version of semantic textual similarity datasets, as well as on binary classification tasks. We show that we can train good Norwegian sentence embedding models that clearly outperform both the pre-trained encoder models and multilingual mBERT on the task of sentence similarity.
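The SimCSE objective the abstract refers to is an InfoNCE loss over in-batch negatives: in the unsupervised setting, each sentence is encoded twice under independent dropout masks, and the two views of the same sentence form the positive pair. A minimal NumPy sketch of that loss follows; the function name, the default temperature, and the omission of the encoder itself are our simplifications, not the paper's setup.

```python
import numpy as np

def simcse_loss(z1, z2, temperature=0.05):
    """InfoNCE loss over in-batch negatives, as in unsupervised SimCSE.
    z1, z2: (batch, dim) embeddings of the same sentences under two
    independent dropout masks; matching rows are the positive pairs."""
    # Cosine similarity matrix between the two views, scaled by temperature
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / temperature              # (batch, batch)
    # Cross-entropy with the diagonal (the aligned pair) as the correct class
    logits = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls the two dropout views of a sentence together while pushing apart the other sentences in the batch; the supervised variant replaces the second view with an entailed sentence from an NLI pair.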

Probing structural constraints of negation in Pretrained Language Models

David Kletz, Marie Candito, Pascal Amsili

Abstract
Contradictory results about how pretrained language models (PLMs) encode the semantic impact of negation have been reported recently (e.g. Kassner and Schütze (2020); Gubelmann and Handschuh (2022)).

In this paper we focus instead on the way PLMs encode negation and its formal impact, through the phenomenon of Negative Polarity Item (NPI) licensing in English.

More precisely, we use probes to identify which contextual representations best encode 1) the presence of negation in a sentence, and 2) the polarity of a neighboring masked polarity item.

We find that contextual representations of tokens inside the negation scope do allow for (i) a better prediction of the presence of "not" compared to those outside the scope, and (ii) a better prediction of the right polarity of a masked polarity item licensed by "not", although the magnitude of the difference varies from PLM to PLM. Importantly, in both cases the trend holds even when controlling for distance to "not".

This tends to indicate that the embeddings of these models do reflect the notion of negation scope, and do encode the impact of negation on NPI licensing. 

Yet, further control experiments reveal that the presence of other lexical items is also better captured from the contextual representation of a token within the same syntactic clause than from one outside of it, suggesting that PLMs simply capture the more general notion of syntactic clause.
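The probing methodology described here, training a lightweight classifier on frozen contextual representations to predict a linguistic property, can be sketched roughly as follows. This is a from-scratch logistic-regression probe of our own; the function names, hyperparameters, and toy data are illustrative, not the paper's experimental setup.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe on frozen contextual
    representations X (n, dim) to predict a binary property y (n,),
    e.g. whether a token lies inside a negation scope.
    Toy from-scratch sketch; the paper's probe setup may differ."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        grad = p - y                             # gradient of cross-entropy wrt logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, X, y):
    """Accuracy of the probe's hard decisions on (X, y)."""
    return float(((X @ w + b > 0).astype(int) == y).mean())
```

Comparing probe accuracy on representations taken inside versus outside the negation scope (while controlling for confounds such as distance to "not") is what licenses the paper's conclusions about what the embeddings encode.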

Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

Katya Artemova, Barbara Plank


Abstract
Bilingual word lexicons map words in one language to their synonyms in another language. Numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, framing a typical pipeline that consists of two steps: (i) unsupervised bitext mining and (ii) unsupervised word alignment. At the core of those steps are pre-trained large language models (LLMs).

In this paper we present an analysis of the BLI pipeline for German and two of its dialects, Bavarian and Alemannic. This setup poses a number of unique challenges owing to the scarcity of resources, the relatedness of the languages, and the lack of standardized orthography in the dialects. We analyze the BLI outputs with respect to word frequency and pairwise edit distance. Finally, we release an evaluation dataset consisting of manual annotations for 1K bilingual word pairs labeled according to their semantic similarity.
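The pairwise edit distance used to analyze the BLI outputs is the standard Levenshtein distance, which counts the minimum number of character insertions, deletions, and substitutions between two spellings. A self-contained dynamic-programming sketch (our own minimal version; the paper may use a library implementation or a normalized variant):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming: a simple way to
    quantify orthographic divergence between dialect word pairs.
    Uses a single rolling row of the DP table for O(min-row) memory."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))                  # distances from empty prefix of a
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i               # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,           # delete a[i-1]
                        dp[j - 1] + 1,       # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # substitute/match
            prev = cur
    return dp[n]
```

For dialect pairs with non-standardized orthography, low edit distance between a German word and its dialect counterpart (e.g. a single vowel substitution) signals cognates that the pipeline should find easier than semantically equivalent but orthographically unrelated pairs.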