NoDaLiDa 2023 - May 22-24, 2023
Multi-CrossRE: A Multi-lingual Multi-Domain Dataset for Relation Extraction
Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank
Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources. We propose Multi-CrossRE, the broadest multi-lingual dataset for RE, including 26 languages in addition to English and covering six text domains. Multi-CrossRE is a machine-translated version of CrossRE (Bassignana and Plank, 2022), with a sub-portion of more than 200 sentences in seven diverse languages checked by native speakers. We run a baseline model over the 26 new datasets and, as a sanity check, over the 26 back-translations to English. Results on the back-translated data are consistent with those on the original English CrossRE, indicating high quality of the translation and of the resulting dataset.
Named Entity layer in Estonian UD treebanks
Kadri Muischnek, Kaili Müürisep
In this paper we introduce two new language resources: NE-annotated versions of two Estonian corpora, the Estonian Universal Dependencies Treebank (EDT, 440,000 tokens) and the Estonian Universal Dependencies Web Treebank (EWT, 90,000 tokens). Together they make up the largest publicly available gold-standard named entity dataset for Estonian. Eight NE categories are manually annotated, and the dataset is made all the more valuable by its existing annotation for lemmas, POS, morphological features and dependency relations. We also show that dividing the set of named entities into clear-cut categories is not always easy.
Generating Errors: OCR post-processing for Icelandic
Atli Jasonarson, Steinþór Steingrímsson, Einar Freyr Sigurðsson, Árni Davíð Magnússon, Finnur Ágúst Ingimundarson
We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data is scarce. We trained six models, four from scratch and two fine-tuned versions of Google's ByT5, on a combination of real data and texts populated with artificially-generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.
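The abstract does not specify how the artificial errors were generated, so the following is only a minimal, hypothetical sketch of the general technique: populating clean text with OCR-like noise (confusion-pair substitutions plus random character deletions and duplications) to obtain (noisy, clean) training pairs for an encoder-decoder post-correction model. The confusion pairs and rates are illustrative placeholders, not taken from the paper.

```python
import random

# Illustrative OCR confusion pairs; a real system would derive these
# from observed OCR output on Icelandic text.
CONFUSIONS = {"rn": "m", "é": "e", "ð": "d", "þ": "p", "li": "h"}

def corrupt(text, error_rate=0.05, seed=0):
    """Populate clean text with artificial OCR-like errors.

    Pairing the returned noisy string with the original clean text
    yields a synthetic training example for post-correction.
    """
    rng = random.Random(seed)
    # Apply confusion-pair substitutions probabilistically.
    for src, tgt in CONFUSIONS.items():
        if src in text and rng.random() < 0.5:
            text = text.replace(src, tgt, 1)
    # Apply random character-level deletions and duplications.
    chars = []
    for ch in text:
        r = rng.random()
        if r < error_rate:         # delete this character
            continue
        chars.append(ch)
        if r > 1 - error_rate:     # duplicate it (insertion noise)
            chars.append(ch)
    return "".join(chars)
```

Seeding makes the corruption reproducible, so the same clean corpus always yields the same synthetic training set.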
Alignment of Wikidata lexemes and Det Centrale Ordregister
Finn Årup Nielsen
Two Danish open access lexicographic resources have appeared in recent years: lexemes in Wikidata and Det Centrale Ordregister (COR). The lexeme part of Wikidata describes words in different languages, and COR associates an identifier with each form of a Danish lexeme. Here I describe the current state of the linking of Wikidata lexemes to COR and some of the problems encountered.
Constructing a Knowledge Graph from Textual Descriptions of Software Vulnerabilities in the National Vulnerability Database
Anders Mølmen Høst, Pierre Lison, Leon Moonen
Knowledge graphs have shown promise for several tasks in cybersecurity, such as vulnerability assessment and threat analysis. In this work, we present a new method for constructing a vulnerability knowledge graph from information in the National Vulnerability Database (NVD). Our approach combines named entity recognition (NER), relation extraction (RE), and entity prediction using state-of-the-art approaches. With an average performance of 0.76 on Hits@10, our method helps to fix missing entities in knowledge graphs used for cybersecurity.
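The reported average of 0.76 Hits@10 refers to a standard ranking metric for link and entity prediction. As background only (the names and data here are illustrative, not from the paper), Hits@k is commonly computed as the fraction of queries whose gold entity appears among the top-k ranked candidates:

```python
def hits_at_k(ranked_candidates, gold_entities, k=10):
    """Fraction of queries whose gold entity appears in the top-k
    ranked predictions; a standard link-prediction metric."""
    hits = sum(
        1 for ranking, gold in zip(ranked_candidates, gold_entities)
        if gold in ranking[:k]
    )
    return hits / len(gold_entities)

# Toy example: the gold entity is found in the top-k for one of two queries.
score = hits_at_k([["CVE-A", "CVE-B"], ["CVE-C"]], ["CVE-B", "CVE-Z"], k=10)
```

A Hits@10 of 0.76 thus means that for roughly three out of four prediction queries, the correct entity was ranked among the model's ten best candidates.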
The Effect of Data Encoding on Relation Triplet Identification
Steinunn Rut Friðriksdóttir, Hafsteinn Einarsson
This paper presents a novel method for creating relation extraction data for low-resource languages. Relation extraction (RE) is a natural language processing task that involves identifying and extracting meaningful relationships between entities in text. Despite the increasing need to extract relationships from unstructured text, the limited availability of annotated data in low-resource languages presents a significant challenge to the development of high-quality relation extraction models. Our approach leverages existing methods for high-resource languages to create training data for low-resource languages. The proposed method is simple and efficient, and it has the potential to significantly improve the performance of relation extraction models for low-resource languages, making it a promising avenue for future research.
NoCoLA: The Norwegian Corpus of Linguistic Acceptability
David Samuel, Matias Jentoft
While there has been a surge of large language models for Norwegian in recent years, we lack any tool to evaluate their understanding of grammaticality. We present two new Norwegian datasets for this task. NoCoLA-class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences. On the other hand, NoCoLA-zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner, i.e. without any further training. In this paper, we describe both datasets in detail, show how to use them for different flavors of language models, and conduct a comparative study of the existing Norwegian language models.
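The abstract does not spell out the NoCoLA-zero protocol, but a common zero-shot formulation of grammatical judgement scores both members of an (acceptable, unacceptable) sentence pair with the language model and counts how often the acceptable sentence receives the higher score. A minimal, model-agnostic sketch, where `score` stands in for an assumed sentence log-probability function (e.g. from a pretrained LM):

```python
def zero_shot_accuracy(pairs, score):
    """Given (acceptable, unacceptable) sentence pairs and a function
    `score` returning a language model's log-probability for a
    sentence, count how often the model prefers the acceptable one.
    No training is involved, so the evaluation is fully zero-shot."""
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)

# Toy stand-in scorer for demonstration only: sentence length.
accuracy = zero_shot_accuracy([("aa", "a"), ("b", "bbb")], len)
```

Because the metric only compares relative scores, it can be applied to any model that assigns probabilities to sentences, whether causal or masked.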
GiellaLT---a stable infrastructure for Nordic minority languages and beyond
Flammie A Pirinen, Sjur N. Moshagen, Katri Hiovain-Asikainen
Long-term language technology infrastructures are critical for the continued maintenance of language technology software that supports the use of languages in the digital world. The Nordic area has languages ranging from well-resourced national majority languages, such as Norwegian, Swedish and Finnish, to minoritised, under-resourced and indigenous languages, such as the Sámi languages. We present an infrastructure, built over more than 20 years, that supports building language technology and tools for most of the Nordic languages as well as many languages worldwide, with a focus on Sámi and other indigenous, minoritised and under-resourced languages. We show that one common infrastructure can be used to build tools ranging from keyboards and spell-checkers to machine translators, grammar checkers, text-to-speech and automatic speech recognition.
Adapting an Icelandic morphological database to Faroese
Kristján Rúnarsson, Kristín Bjarnadóttir
This paper describes the adaptation of the database system developed for the Database of Icelandic Morphology (DIM) to the Faroese language and the creation of the Faroese Morphological Database using that system from lexicographical data collected for a Faroese spellchecker project.
Scaling-up the Resources for a Freely Available Swedish VADER (svVADER)
Dimitrios Kokkinakis, Ricardo Muñoz Sánchez
With widespread commercial applications in various domains, sentiment analysis has become a success story for Natural Language Processing (NLP). Still, although sentiment analysis has progressed rapidly in recent years, mainly due to the application of modern AI technologies, many approaches apply knowledge-based strategies, such as lexicon-based ones, to the task. This is particularly true for analyzing short social media content, e.g., tweets. Moreover, lexicon-based sentiment analysis approaches are usually preferred over learning-based methods when training data is unavailable or insufficient. Therefore, our main goal is to scale up and apply a lexicon-based approach that can serve as a strong baseline for Swedish sentiment analysis. All scaled-up resources are made available, and the performance of the enhanced tool is evaluated on two short-text datasets, achieving adequate results.
Translated Benchmarks Can Be Misleading: the Case of Estonian Question Answering
Hele-Andra Kuulmets, Mark Fishel
Translated test datasets are a popular and cheaper alternative to native test datasets. However, one property of translated data is that it can contain cultural knowledge unfamiliar to target-language speakers. This can make translated test datasets differ significantly from native target datasets. As a result, we might inaccurately estimate the performance of models in the target language. In this paper, we use both native and translated Estonian QA datasets to study this topic more closely. We discover that relying on the translated test dataset results in an overestimation of the model's performance on native Estonian data.
Predicting the presence of inline citations in academic text using binary classification
Peter Vajdecka, Elena Callegari, Desara Xhura, Atli Snær Ásmundsson
Properly citing sources is a crucial component of any good-quality academic paper. The goal of this study was to determine what kind of accuracy we could reach in predicting whether or not a sentence should contain an inline citation using a simple binary classification model. To that end, we fine-tuned SciBERT on both an imbalanced and a balanced dataset containing sentences with and without inline citations. We achieved an overall accuracy of over 0.92, suggesting that language patterns alone could be used to predict where inline citations should appear with some degree of accuracy.
DanSumT5: Automatic Abstractive Summarization for Danish
Sara Kolding, Katrine Nymann, Ida Bang Hansen, Kenneth C. Enevoldsen, Ross Deans Kristensen-McLachlan
Automatic abstractive text summarization is a challenging task in the field of natural language processing. This paper presents a model for domain-specific summarization for Danish news articles, DanSumT5: an mT5 model fine-tuned on a cleaned subset of the DaNewsroom dataset consisting of abstractive summary-article pairs. The resulting state-of-the-art model is evaluated both quantitatively and qualitatively, using ROUGE and BERTScore metrics and human rankings of the summaries. We find that although model refinements increase quantitative and qualitative performance, the model is still prone to factual errors. We discuss the limitations of current evaluation methods for automatic abstractive summarization and underline the need for improved metrics and transparency within the field. We suggest that future work should employ methods for detecting and reducing errors in model output and methods for referenceless evaluation of summaries.
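As background on the quantitative evaluation mentioned above, ROUGE measures n-gram overlap between a generated summary and a reference. The following is a simplified ROUGE-1 F1 sketch for illustration; real evaluations (including, presumably, the paper's) use the standard ROUGE tooling with stemming and multiple ROUGE variants:

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between a generated summary and a
    reference summary, using plain whitespace tokenisation."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A key limitation, which motivates the paper's call for better metrics, is that such overlap scores cannot detect fluent summaries that are factually wrong.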
ScandEval: A Benchmark for Scandinavian Natural Language Processing
Dan Saattrup Nielsen
This paper introduces a Scandinavian benchmarking platform, ScandEval, which can benchmark any pretrained model on four different tasks in the Scandinavian languages. The datasets used in two of the tasks, linguistic acceptability and question answering, are new. We develop and release a Python package and command-line interface, scandeval, which can benchmark any model that has been uploaded to the Hugging Face Hub, with reproducible results. Using this package, we benchmark more than 80 Scandinavian or multilingual models and present the results of these in an interactive online leaderboard, as well as provide an analysis of the results. The analysis shows that there is substantial cross-lingual transfer among the Mainland Scandinavian languages (Danish, Swedish and Norwegian), with limited cross-lingual transfer between the group of Mainland Scandinavian languages and the group of Insular Scandinavian languages (Icelandic and Faroese). The benchmarking results also show that the investment in language technology in Norway and Sweden has led to language models that outperform massively multilingual models such as XLM-RoBERTa and mDeBERTaV3. We release the source code for both the package and leaderboard.
Microservices at Your Service: Bridging the Gap between NLP Research and Industry
Tiina Lindh-Knuutila, Hrafn Loftsson, Pedro Alonso Doval, Sebastian Andersson, Bjarni Barkarson, Héctor Cerezo-Costas, Jon Gudnason, Jökull Snær Gylfason, Jarmo Hemminki, Heiki-Jaan Kaalep
This paper describes a collaborative European project whose aim was to gather open source Natural Language Processing (NLP) tools and make them accessible as running services that are easy to try out in the European Language Grid (ELG). The motivation of the project was to increase inclusiveness and accessibility for more European languages and to make it easier for developers to use the underlying tools in their own applications. The project resulted in the containerization of 60 existing NLP tools for 16 languages, all of which are now running as services in the ELG platform.