DaNLP is a repository for Natural Language Processing resources for the Danish Language. It is a collection of available datasets and models for a variety of NLP tasks. The aim is to make Danish NLP easier for industry practitioners to use, which is why the project is licensed to allow commercial use. The project features code examples on how to use the datasets and models in popular NLP frameworks such as spaCy, Transformers and Flair, as well as Deep Learning frameworks such as PyTorch. See our documentation pages for more details about our models and datasets, and for definitions of the modules provided through the DaNLP package.

If you are new to NLP or want to know more about the project in a broader perspective, you can start on our microsite.


Help us improve DaNLP

  • 🙋 Have you tried the DaNLP package? Then we would love to chat with you about your experiences from a company perspective. It will take approx. 20-30 minutes and requires no preparation. English or Danish, as you prefer. Please leave your details here and we will reach out to arrange a call.

News

  • 🎉 Version 0.1.2 has been released with
    • 2 new models for hate speech detection (Hatespeech) based on BERT and ELECTRA
    • 1 new model for hate speech classification

Next up

  • new model and data for discourse coherence

Installation

To get started using DaNLP in your Python project, simply install the pip package. Note that the default pip package does not install every NLP library, so that you are free to limit your dependencies to what you actually use. A separate installation option is provided if you want to install all the required dependencies.

Install with pip

To get started using DaNLP simply install the project with pip:

pip install danlp 

Note that the default installation of DaNLP does not install other NLP libraries such as Gensim, spaCy, flair or Transformers. This keeps the installation as minimal as possible and lets you choose, for example, whether to load word embeddings with spaCy, flair or Gensim. Depending on the function you need, you should therefore install one or several of the following: pip install flair, pip install spacy and/or pip install gensim.

Alternatively, if you want to install all the required dependencies, including the packages mentioned above, you can run:

pip install danlp[all]

You can check the requirements.txt file to see which versions the packages have been tested with.
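
Because the default installation leaves out the optional libraries, it can be useful to check which of them are importable before calling a function that needs one. The following is a minimal sketch using only the Python standard library; the helper name `available_backends` and its default list of module names are illustrative assumptions and are not part of the DaNLP package.

```python
from importlib.util import find_spec


def available_backends(modules=("gensim", "spacy", "flair", "transformers")):
    """Map each top-level module name to True if it can be imported here."""
    # find_spec returns None for a top-level module that is not installed,
    # and it does not import the module itself.
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Hypothetical report; the pip hint assumes the PyPI name matches the module name.
    for name, installed in available_backends().items():
        status = "installed" if installed else f"missing (pip install {name})"
        print(f"{name:<14}{status}")
```

Running the script prints one line per library, which makes it easy to see what a plain `pip install danlp` environment still lacks.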

Install from source

If you want to be able to use the latest developments before they are released in a new pip package, or you want to modify the code yourself, then clone this repo and install from source.

git clone https://github.com/alexandrainst/danlp.git
cd danlp
# minimum installation
pip install .
# or install all the packages
pip install .[all]

To install the dependencies used in the package with the tested versions:

pip install -r requirements.txt
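
If you prefer to keep your existing environment, you can instead compare what is already installed against the tested versions. The sketch below assumes that requirements.txt contains exact pins of the form `name==version` (other lines are skipped); the function names `parse_pins` and `compare_with_installed` are hypothetical helpers, not part of DaNLP.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and requirements that are not exact pins
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def compare_with_installed(pins):
    """Return {package: (pinned, installed or None)} for every mismatch."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    path = Path("requirements.txt")  # assumed to be in the current directory
    if path.exists():
        pins = parse_pins(path.read_text(encoding="utf-8"))
        for name, (pinned, installed) in compare_with_installed(pins).items():
            print(f"{name}: tested with {pinned}, found {installed or 'not installed'}")
```

An empty result means every pinned package is installed at the tested version.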

Install from github

Alternatively you can install the latest version from github using:

pip install git+https://github.com/alexandrainst/danlp.git

Install with Docker

To quickly get started with DaNLP and try out the models, you can use our Docker image. To start an IPython session, simply run:

docker run -it --rm alexandrainst/danlp ipython

If you want to run a <script.py> in your current working directory you can run:

docker run -it --rm -v "$PWD":/usr/src/app -w /usr/src/app alexandrainst/danlp python <script.py>
                  

Quick Start

Read more in our documentation pages.

NLP Models

Natural Language Processing is an active area of research and it consists of many different tasks. The DaNLP repository provides an overview of Danish models for some of the most common NLP tasks (and is continuously evolving).

Here is the list of NLP tasks we currently cover in the repository.

You can also find some of our transformers models on HuggingFace.

If you are interested in Danish support for any specific NLP task you are welcome to get in contact with us.

We also recommend checking out the list of Danish NLP corpora/tools/models maintained by Finn Årup Nielsen (Warning: not all items are available for commercial use, check the licence).

Datasets

The number of datasets in the Danish language is limited. The DaNLP repository provides an overview of the available Danish datasets that can be used for commercial purposes.

The DaNLP package allows you to download and preprocess datasets.

Examples

You will find examples that show how to use NLP in Danish (using our models or others) in our benchmark scripts and Jupyter notebook tutorials.

This project keeps a Danish-language blog on Medium where we write about Danish NLP, and in time we will also provide some real cases of how NLP is applied in Danish companies.

Structure of the repo

To help you navigate, here is an overview of the structure of the GitHub repository:

.
├── danlp                    # Source files
│   ├── datasets             # Code to load datasets with different frameworks
│   └── models               # Code to load models with different frameworks
├── docker                   # Docker image
├── docs                     # Documentation and files for setting up Read the Docs
│   ├── docs                 # Documentation for tasks, datasets and frameworks
│   │   ├── tasks            # Documentation for NLP tasks with benchmark results
│   │   ├── frameworks       # Overview of the different frameworks used
│   │   ├── gettingstarted   # Guides for installation and getting started
│   │   └── imgs             # Images used in documentation
│   └── library              # Files used for Read the Docs
├── examples                 # Examples, tutorials and benchmark scripts
│   ├── benchmarks           # Scripts for reproducing benchmark results
│   └── tutorials            # Jupyter notebook tutorials
└── tests                    # Tests for continuous integration with Travis

How do I contribute?

If you want to contribute to the DaNLP repository and make it better, your help is very welcome. You can contribute to the project in many ways:

  • Help us write good tutorials on Danish NLP use-cases
  • Contribute with your own pretrained NLP models or datasets in Danish (see our contributing guidelines for more details on how to contribute to this repository)
  • Create GitHub issues with questions and bug reports
  • Notify us of other Danish NLP resources or tell us about any good ideas that you have for improving the project through the Discussions section of this repository.

Who is behind?

The DaNLP repository is maintained by the Alexandra Institute which is a Danish non-profit company with a mission to create value, growth and welfare in society. The Alexandra Institute is a member of GTS, a network of independent Danish research and technology organisations.

Between 2019 and 2020, the work on this repository was part of the Dansk For Alle performance contract (RK) allocated to the Alexandra Institute by the Danish Ministry of Higher Education and Science. Since 2021, the project has been funded through the Dansk NLP activity plan, which is part of the Digital sikkerhed, tillid og dataetik performance contract.

An overview of the project can be found on our microsite.

Cite

If you want to cite this project, please use the following BibTeX entry:

@inproceedings{danlp2021,
    title = {{DaNLP}: An open-source toolkit for Danish Natural Language Processing},
    author = {Brogaard Pauli, Amalie  and
      Barrett, Maria  and
      Lacroix, Ophélie  and
      Hvingelby, Rasmus},
    booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021)},
    month = jun,
    year = "2021"
}

Read the paper here.

See our documentation pages for references to specific models or datasets.
