  • Language: Jupyter Notebook
  • License: Apache License 2.0

If you want to use Python for text analysis, this course is for you!

Python for text analysis

As taught at the Vrije Universiteit Amsterdam in the Humanities Research Master: Linguistics (track Human Language Technology) and the Minor Digital Humanities and Social Analytics (BA).

In case you have questions about exemption, please first read Python-test.md.

This is a practical course in Python, geared towards those who want to get some hands-on experience working with language data. No knowledge of programming is required or presupposed. We will work with Python 3.9. We highly recommend installing Anaconda for this course.

(If you have worked with Python 3 before, be sure to check if Jupyter Notebook is installed on your machine. We will work extensively with notebooks. Make sure you are working with Python 3.9.)

This course is based on the material used in previous years and in this course.

Goals

This course is meant to introduce you to the basics of the Python programming language. There is a lot to discover about Python and programming in general, and you will probably learn something new every day if you continue programming after this course. Our goal for you is to become an independent programmer who is able to find solutions to new problems.

You will

  • learn how to work with the standard library of Python
  • learn how to deal with different file types (e.g., plain text, CSV/TSV, JSON)
  • learn how to use some external libraries (e.g., to analyze texts)
  • learn how to document and share your code and results

We will focus on readability and understandability, so that you will be able to share your code and results with others, and re-use your code in the future. This is a practical course, in which you will get a lot of hands-on experience. Due to the nature of this course, active participation is required.

As of 2021-22, we are offering a bachelor-level and a master-level version of this course. For the bachelor-level version, we emphasize applications in Digital Humanities. For the master-level version, we emphasize a more thorough understanding of the fundamentals of Python and independent problem-solving. These differences are reflected in the material of the second half of the course (Blocks III and IV).

Core principles

We strongly believe in the set of principles that Mike Bostock outlines in his article "What makes software good?" We have designed our course around those principles and summarized them for you below:

  • Good software is approachable. It can be understood completely in independent, easy pieces. You don’t need to understand everything before you can understand anything.

  • Good software is consistent. It lets you take what you’ve learned about one part and extrapolate it to the rest. It doesn’t self-contradict. It is parsimonious, avoiding superfluous elements.

  • Good software explains itself. It has affordances for learning and discovery. It is role-expressive and minimizes hidden magic.

  • Good software teaches. It doesn’t just automate an existing task, but provides insight or imparts knowledge, such as a best practice or a new perspective on a problem.

  • Good software is for humans. It is cognizant of people and the reality in which they live. It does not expect elaborate and arbitrary rules to be memorized. It anticipates the need for learning and debugging.

What to do if you get stuck

Programming almost always involves running into problems and getting stuck. This is normal and even happens to very experienced programmers. We are trying to offer support to all students, but this means we have to prioritize and manage our time well. In order for this to work, please try to follow these strategies when you get stuck:

  • Check the class material for solutions. The chapters treated in the assignment are usually a good start. As the course progresses, you may have to also check the material from earlier blocks.
  • If you get error messages, read them carefully: they are informative! In particular, check the line in which the error occurs and the line immediately preceding it. If you don't understand what the message says, try searching for it online (you will most likely find an explanation on Stack Overflow).
  • Break down your task into small steps using pen and paper. Sometimes, you lose sight of the bigger picture when dealing with complicated code. Breaking down a big task into small tasks helps you identify the problem.
  • Explain the problem to someone else (e.g., a classmate). Go through the code line by line and explain what it does (see pair programming and rubber duck debugging).
  • Finally, take a break! Very often, just having a fresh look at the code helps!
  • If none of these steps helped, please try to ask for help well before the assignment deadline. Please start by posting your questions on Piazza. If you email the teachers or TAs, please always email your code rather than a screenshot.
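As a small illustration of the advice on reading error messages, the sketch below triggers an error on purpose and then prints the parts of the message that are usually most useful: the error type, the function it occurred in, and the explanation. The function average_word_length is a made-up example and not part of the course material.

```python
import traceback


def average_word_length(words):
    """Return the mean length of the words in a list."""
    total = sum(len(word) for word in words)
    return total / len(words)  # fails when the list is empty


try:
    average_word_length([])
except ZeroDivisionError as error:
    # The last traceback entry points at the place where the error occurred.
    last_frame = traceback.extract_tb(error.__traceback__)[-1]
    error_summary = f"{type(error).__name__} in {last_frame.name}: {error}"
    print(error_summary)
    print("Reported line number:", last_frame.lineno)
```

In a notebook you would normally just read the traceback that Python prints; the same information (error type, location, explanation) appears at the bottom of that message.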

Learning how to help yourself is a valuable skill and will be very useful in your future programming projects.

Courseware structure

Our materials are structured as follows:

  • The Chapters folder contains our primary teaching material. Every week, you will work through a subset of these interactive notebooks. It is highly recommended to start looking at the material in preparation for the lectures. If you get stuck with an assignment, first see if you can find the solution in the Chapters.

  • The Assignments folder contains the assignments that you will be asked to submit during the course.

  • The Exam folder contains sample exams from previous years.

  • The Extra_Material folder contains some extra reading about Python theory, which you may use for future reference. It also contains some information specifically related to natural language processing, and examples of how to organize your code and how to create a Flask website.

  • The Data folder contains all data used in this course and more, as well as the scripts used to obtain this data. (So you can see what techniques we used.)

This file serves as the syllabus and a general reference for this course.

Assignments and Grading

The course is worth 6 ECTS and will consist of 4 assignments and a final exam. The assignments have to be submitted after the content and tutorial session of each block (4 in total). In the third session of each block, we will discuss the assignment in class and provide feedback. The assignments and the exam are weighted as follows.

Part | Weight (%)
Assignment 1 | 0*
Assignment 2 | 10
Assignment 3 | 20
Assignment 4 | 30
Total Assignments | 60
Exam | 40
Total | 100

*Assignment 1 is not graded, but you are required to submit a serious attempt by the deadline to pass the course.
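To see how the weights combine, here is a minimal sketch that computes a weighted average from the percentages in the table. It assumes the final grade is a plain weighted average on a 1-10 scale; rounding and the separate passing requirements (for the assignments and for the exam) are not modeled, and the grades used are invented for illustration.

```python
# Weights in percent, copied from the table above.
WEIGHTS = {
    "Assignment 1": 0,   # not graded, but a serious attempt is required
    "Assignment 2": 10,
    "Assignment 3": 20,
    "Assignment 4": 30,
    "Exam": 40,
}


def weighted_final_grade(grades):
    """Return the weighted average of the grades (assumed 1-10 scale).

    Parts missing from `grades` count as 0 in this sketch.
    """
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(
        weight * grades.get(part, 0) for part, weight in WEIGHTS.items()
    )
    return weighted_sum / total_weight


# Hypothetical grades, for illustration only.
example_grades = {
    "Assignment 2": 8,
    "Assignment 3": 7,
    "Assignment 4": 6,
    "Exam": 7,
}
print(weighted_final_grade(example_grades))  # 6.8
```

Because Assignment 4 and the exam together carry 70 % of the weight, they dominate the result.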

Course assignments

You are asked to hand in 4 assignments in total. The deadlines are indicated below. A submission made 1 day after the deadline and before the feedback session lowers your grade by 2 points (e.g., a 9 will result in a 7). After that (i.e., on day 2 or after the feedback session has started), the resulting grade is a 1. We have to be strict about this because we will discuss the assignments in class, and we need time to look at your submissions. In addition, the solutions will be discussed in the feedback session, and we cannot award points after the solutions have been discussed.
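The late-submission rule can be written down as a small function. This is only a sketch of the rule as described above: counting lateness in whole days and never letting the result drop below 1 are assumptions, and the function name is made up for this example.

```python
def apply_late_penalty(grade, days_late, feedback_started=False):
    """Return the grade after applying the late-submission rule.

    days_late: whole days after the deadline (0 means on time).
    feedback_started: True once the feedback session has begun.
    """
    if days_late <= 0:
        return grade  # submitted on time: no penalty
    if days_late == 1 and not feedback_started:
        # Assumption: the result never drops below 1, the grade the
        # text gives for even later submissions.
        return max(1, grade - 2)
    return 1  # day 2 or later, or after the feedback session started


print(apply_late_penalty(9, days_late=1))  # 7, the example from the text
```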

Please note that a passing grade for the assignments (in total) is a requirement for passing the course and you need at least a 5.0 for the final assignment (regular assignment or resit assignment).

Assignment submissions

Submission is made through Canvas. Please submit your assignments using the corresponding assignment submission on Canvas (e.g. labeled 'Assignment 1' for Assignment 1).

Please note that we cannot accept assignments submitted via email.

Resits for Assignments

We provide a resit opportunity for each assignment at the end of the course (deadline will be announced). The maximum grade for the resit assignments is 7.5.

The following assignments have to be completed as resits for the regular assignments.

Assignment | Resit
Assignment 1 | Resit A1 (serious attempt)
Assignment 2 | Resit A2
Assignment 3 | Resit A3
Assignment 4 | Resit A4

It is highly recommended to aim for passing grades at the first attempt. Please only make use of the resit opportunities in case you failed an assignment or were dealing with exceptional circumstances (e.g. illness).

Final exam

The exam tests your knowledge of the syntax of Python and of the standard library. It serves as an opportunity to show what you've learned and will ensure that you have sufficient knowledge to tackle your own code projects and continue improving your Python skills by yourself. You cannot pass the course without a passing grade on the exam. But don't worry: if you are able to finish the assignments, you will be fine on the exam.

Resit exam

There will be an opportunity to take a resit exam. The exact date will be announced.

It is highly recommended to aim for a passing grade at the regular exam date. Please only make use of the resit exam if you fail the regular exam or were dealing with exceptional circumstances (e.g. exam date conflict, illness).

Planning

There are 4 Blocks with associated chapters and assignments:

Block | Chapters BA | Chapters MA | Assignment BA | Assignment MA
I | Chapters 1-4 | Chapters 1-4 | Assignment 1 | Assignment 1
II | Chapters 5-11 | Chapters 5-11 | Assignment 2 | Assignment 2
III | Chapters 12-15 | Chapters 12-15 | Assignments 3a and 3b (Exercises 3 and 4 of Assignment 3b are excluded for BA students) | Assignments 3a and 3b
IV | Chapters 16, 17, 22 | Chapters 16-18 | Assignments 4a and 4b-BA | Assignments 4a and 4b-MA

The schedule for the entire course follows the same structure, illustrated below.

All blocks except block IV will consist of three lectures. There is one additional lecture for block IV.

Lecture 1

In the first lecture, we introduce some of the new topics. It is highly recommended to go through the chapter notebooks in preparation for the classes. After the first lecture, you are expected to start working on the assignment and consult the chapters for things that are unclear to you. Please be aware that the assignments can most likely not be completed in a single day. Also, solving code problems is much easier if you have sufficient time for breaks.

Lecture 2

In the second lecture, we will further highlight some of the theory, and you will have time to work on the assignment in class. Support will be provided by the teachers and student assistants. It is highly recommended to prepare any questions you have about the assignment; we will dedicate time to them during this lecture. You will finish the assignment between the second and third lecture and hand it in on either Tuesday or Friday.

Lecture 3

Finally, the third lecture is a feedback session where we will discuss some of the main problems that were encountered in the assignments. We will repeat this cycle 4 times (for each assignment).

week | what | when | description
36 | lecture | Monday 2022-09-05, 15:30 - 17:15 | BLOCK 1: Introduction theory
36 | lecture | Thursday 2022-09-08, 13:30 - 15:15 | BLOCK 1: Theory and work time
36 | DEADLINE | Friday 2022-09-09, before 17:00 | SUBMIT ASSIGNMENT 1
37 | lecture | Monday 2022-09-12, 15:30 - 17:15 | BLOCK 1: Feedback assignment
37 | lecture | Thursday 2022-09-15, 13:30 - 15:15 | BLOCK 2: Introduction theory
38 | lecture | Monday 2022-09-19, 15:30 - 17:15 | BLOCK 2: Theory and work time
38 | DEADLINE | Tuesday 2022-09-20, before 17:00 | SUBMIT ASSIGNMENT 2
38 | lecture | Thursday 2022-09-22, 13:30 - 15:15 | BLOCK 2: Feedback assignment
39 | lecture | Monday 2022-09-26, 15:30 - 17:15 | BLOCK 3: Introduction theory
39 | lecture | Thursday 2022-09-29, 13:30 - 15:15 | BLOCK 3: Theory and work time
39 | DEADLINE | Friday 2022-09-30, before 17:00 | SUBMIT ASSIGNMENT 3
40 | lecture | Monday 2022-10-03, 15:30 - 17:15 | BLOCK 3: Feedback assignment
40 | lecture | Thursday 2022-10-06, 13:30 - 15:15 | BLOCK 4: Introduction theory
41 | lecture | Monday 2022-10-10, 15:30 - 17:15 | BLOCK 4: Theory and work time
41 | lecture | Thursday 2022-10-13, 13:30 - 15:15 | BLOCK 4: Theory and work time
41 | DEADLINE | Friday 2022-10-14, before 17:00 | SUBMIT ASSIGNMENT 4
42 | lecture | Monday 2022-10-17, 15:30 - 17:15 | BLOCK 4: Feedback assignment
42 | lecture | Thursday 2022-10-20, 13:30 - 15:15 | Exam preparation
43 | EXAM | Tuesday 2022-10-25, 8:30 - 11:15 (11:45, extra time) | EXAM

Plagiarism

Cheating is serious: it is considered fraud and can lead to being excluded from your studies (https://vu.nl/en/student/your-faculty/examination-board). It is also harmful: not only for yourself (you can fool yourself and fail to learn this useful skill), but also for other students (if multiple students do better because of cheating, teachers may think a grading scheme is fair, even though it needs to be adjusted).

How to avoid this, while making use of online sources and working together:

For the weekly assignments, let us know in the comments if you have worked together with someone or if you used code from online sources, such as Stack Overflow. If you found some useful code online, do try to understand what that piece of code does. If it looks 'complicated', we expect you to provide comments in the code explaining what it does.

If you use code you found online in an assignment, please indicate it in the following way:

### Taken from [link] [date]

[code]

###

Please use a similar format to indicate that you have worked with a classmate (e.g. mention the name instead of the link). Make sure to provide your own comments to show that you understood the code, and indicate what you did yourself, e.g. by commenting out your partial solution. We cannot give credit for copy-pasting full answers from classmates, but you may get full points for a partial individual solution combined with a well-commented working solution that includes components from a classmate. If you work on a solution together, also indicate this and make sure to provide individual explanations.
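As an illustration, an attributed snippet in a Python assignment could look like the sketch below. The link, the date, and the function are placeholders invented for this example; the own-words comment after the closing marker shows the kind of explanation we expect.

```python
### Taken from https://example.com/placeholder-link 2022-09-07

def count_tokens(text):
    return len(text.split())

###

# My own explanation: split() without arguments cuts the string at any
# whitespace, so the length of the resulting list is the number of
# whitespace-separated tokens.
number_of_tokens = count_tokens("Python for text analysis")
print(number_of_tokens)  # 4
```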
