Kielipankki & CSC

Corpora, language technology, computational linguistics, corpus linguistics, supercomputers and all that

Sam Hardwick

Kielipankki

AKA The Language Bank of Finland

We create, solicit and receive corpora
We enrich the corpora with NLP tools
We make them available through browsing interfaces, downloads and computing environments
We make available our enriching tools, plus others
Hundreds of corpora, tens of billions of words

CSC

Owns and operates computational resources and services
Our own cloud, network, supercomputers, storage solutions
Kielipankki uses CSC resources, and helps language-oriented users use them

What can we do with language?

Computational linguistics

Linguistics from a computer science perspective: formalisms, parsing, rule-writing
We have and use tools for this
We will not focus on this in this course

What can we do with language?

Language technology

Arguably a rebrading of computational linguistics for the 90’s-2000’s
More application-oriented and uses statistics
More computation
We have a bonus problem set on topic modeling in this course

What can we do with language?

Corpus linguistics

Statistics on a body of text
Can use tool-generated information: morphology, syntax, semantics etc.
Arguably Kielipankki’s main focus, we will do a project in this course

What can we do with language?

Modern large-scale machine learning

This is what you read about in the headlines
Generic statistical approaches, no linguistic content
Huge amounts of data and computation, cutting-edge applications
A lot of this goes on at CSC, but we won’t do it in this course

Kielipankki resources: Korp

A browser for corpora
Developed by Språkbanken (our Swedish sister organisation)
Advanced search features, supports some statistics views

Kielipankki resources: Corpora

https://www.kielipankki.fi/tools/
Many modalities (text, speech, video, structured, scanned pages)
Some available on Korp, some as downloads, some as both
From completely open to very restricted
Text usually annotated with metadata, part of speech, base form, morphology, dependency syntax
We can also annotate named entities, sentiment, geospatial information etc.

Kielipankki resources: Tools

https://www.kielipankki.fi/tools/
Hosted tools: run on our machines, interact via the web (eg. WebAnno for manually annotating text)
Downloadable tools: tools maintained by us that you can run on your own machine (eg. finnish-tagtools)
Software installed on our HPC platforms, need to log in there to use them

CSC resources: Data storage

HPC has a lot of capacity for temporary data
IDA, long-term storage for FAIR research data is hosted at CSC
Users have access to:
- an object storage system (Allas)
- a hosted database service (Kaivos)
- encrypted data storage via SD desktop

CSC resources: Cloud computing

Virtual servers (Pouta), you can host long-running services and websites here
Container cloud (Rahti), same but with containers
Notebooks, an instant programming environment

CSC resources: HPC (High-Performance Computing)

Access usually via a command line, or a browser-based app
Puhti – General purpose. 682 CPU nodes, each with 40 cores + 80 Nvidia GPU nodes - we’ll be using this through Jupyter in the browser!
Mahti – Larger-scale jobs. 1404 CPU nodes, each with 128 cores, 180K CPU cores total! 24 GPU nodes, but very beefy ones (4 Nvidia A100’s).
Lumi – GPU-specialised jobs. Over 10K AMD GPUs. That’s a lot of compute (#3 in the world!).

Finally

What do I hope that you will take home from this couse?

An awareness of the existence of enriched data, and how easily you can use it to study interesting questions about the data
An awareness that the environment for doing computational tasks is there for you, and you can learn to use it
If you haven’t already had it, a taste of programming - and if you have, do the bonus problem and get a taste of parallelism!

License

CSC’s contributions are licensed under a Creative Commons Attribution-ShareAlike 4.0 Unported License, http://creativecommons.org/licenses/by-sa/4.0/.

Image credits:

Lexc screenshot from Lindén, Silfverberg, Pirinen, “HFST Tools for Morphology – An Efficient Open-Source Package for Construction of Morphological Analyzers”

Shift-reduce screenshot from D. Fernández-González, C. Gómez-Rodríguez, “Faster shift-reduce constituent parsing with a non-binary, bottom-up strategy”

Praat screenshot used with permission from Joey Stanley

Unattributed Shoggoth image downloaded from https://knowyourmeme.com/memes/shoggoth-with-smiley-face-artificial-intelligence

Other images (c) CSC