Exercise - Fetch text in VRT format and do topic modeling
💬 In this exercise we will experiment with a common task for text, topic modeling. You can go through the tutorial step by step and do all the exercises, or just read it. The exercises are in the form of Python code that you can edit to make it run faster. Solutions are included.
A processing node and workspaces
☝🏻 First, hop into an interactive computing node with sinteractive --time 08:00:00 --mem 32000 --cores 4. It will prompt you to select a CSC project. --time 08:00:00 means that the node will kick you out after 8 hours; if you exit the node before then, you will save on billing units, though the reservation is more for scheduling than billing purposes. --mem 32000 means 32000 megabytes of memory, and --cores 4 means you will be able to run that many processes simultaneously (enough for this small example).
We will organise workspaces as follows: data archives, which we don't want to download many times, go to a non-temporary location like /scratch/<project>/<your_username>/. The same goes for dependencies and code. But we'll unpack the data into $TMPDIR. The reason for this is that $TMPDIR is a disk local to the computing node, so it is fast for reading and writing. In this particular case, with just a few files, it barely matters, but it's a good habit to learn.
For dependencies and code, making a directory for this under /scratch/<project>/<your_username> is a good choice, since we're just trying things out. You can make sure that this directory exists with mkdir -p /scratch/<project>/$USER. In that directory, fetch some starter code into a new directory with wget https://a3s.fi/hardwick-clarin-pub/lda.zip; unzip lda.zip (assuming you don't already have a directory called lda). Then cd lda into the directory. This will be our workspace. The Python scripts ending with _solution.py contain solutions to the exercises.
Dependencies
☝🏻 We need to install some dependencies. This can be done in many ways, some simpler than others, and some more efficient than others.
There are essentially three alternatives for installing Python dependencies:
- Installing them in your home directory with pip install --user. This quickly becomes unmaintainable with many projects and library versions.
- Installing in a virtual environment with venv or conda. This has some downsides on the HPC systems, causing slow startup times and unnecessary IO load on the whole system.
- An Apptainer container, for which we have our custom tool tykky, which is usually the ideal option.
If you have a requirements.txt file, as we do here, installing the dependencies into a tykky environment is in principle simple, as long as your libraries support the default Python version, which at the time of writing is 3.6. Unfortunately, that's too old for us, so we'll first make a temporary venv with Python 3.9, in which to build the tykky container. So we do:
$ mkdir tykky-env # the tykky environment will go here
$ python3.9 -m venv tmp-venv # create a temporary venv with the correct Python version
$ source tmp-venv/bin/activate # step into the venv
$ module load tykky # load the tykky module
$ pip-containerize new --prefix /scratch/<project>/$USER/lda/tykky-env requirements.txt # or whatever directory you chose
$ deactivate # exit the temporary venv
$ rm -rf tmp-venv # not needed anymore
$ export PATH="/scratch/<project>/$USER/lda/tykky-env/bin:$PATH" # make the tykky environment visible
For the rest of this session, your default Python environment will have the packages from requirements.txt installed. After logging out, things will be back to the way they were before; then you can export PATH again, or set the path on every login in e.g. .bash_profile.
Data
The Language Bank of Finland keeps its analyzed text data in a format called VRT. VRT is used because it's the format of the IMS Open Corpus Workbench (CWB), so it's not exactly a common standard, but it's easy enough to use for many purposes. We will fetch some VRT files, extract the lemmas, and use the lemmas as input to a topic modeling package.
The Language Bank maintains a directory of corpora which you can browse for corpora available to you. Each corpus is listed with license information: PUB means available to everyone, ACA means available for users affiliated with an academic institution, RES means you have to apply for access.
💡 If you are a member of the kieli group on puhti, you can find read-only VRT data under /appl/data/kielipankki/. Otherwise, you can follow download links from the corpus directory.
The rest of this example will use the YLE news in Finnish corpus, which can be downloaded as shown below.
$ wget https://korp.csc.fi/download/YLE/fi/2019-2021-s-vrt/ylenews-fi-2019-2021-s-vrt.zip
$ unzip ylenews-fi-2019-2021-s-vrt.zip -d $TMPDIR
  creating: /local_scratch/<username>/ylenews-fi-2019-2021-s-vrt/
...
We should now have three VRT files under $TMPDIR/ylenews-fi-2019-2021-s-vrt/vrt, each roughly two gigabytes in size.
Data format
💭 Let’s take a quick look at the files so we have some idea of what we’re dealing with:
$ head $TMPDIR/ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2019_s.vrt
<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd dephead deprel lex/ -->
<!-- #vrt info: VRT generated from CWB data for corpus "ylenews_fi_2019_s" (2022-08-24 11:38:39 +0300) -->
<!-- #vrt info: A processing log at the end of file -->
<text datetime_content_modified="2018-12-31T23:46:26+0200" datetime_json_modified="2020-09-23T12:41:54+0300" datetime_published="2018-12-31T23:46:26+0200" datefrom="20181231" dateto="20181231" departments="|Näkökulmat|" id="20-280865" main_department="Näkökulmat" publisher="yle-aihe" timefrom="234626" timeto="234626" url="https://yle.fi/aihe/artikkeli/2018/12/31/matkakertomuksia-osa-vi-v-tilanne-verhon-edessa-osa-a">
<sentence id="1" type="text" paragraph_type="text">
Kun 1 kun kun C SUBCAT_CS|CASECHANGE_Up 2 mark |kun..kn.1|
käännyin 2 kääntyä kääntyä V PRS_Sg1|VOICE_Act|TENSE_Prt|MOOD_Ind 5 advcl |kääntyä..vb.1|
katsomaan 3 katsoa katsoa V NUM_Sg|CASE_Ill|VOICE_Act|INF_Inf3 2 xcomp |katsoa..vb.1|
, 4 , , Punct _ 2 punct |,..xx.1|
huomasin 5 huomata huomata V PRS_Sg1|VOICE_Act|TENSE_Prt|MOOD_Ind 0 ROOT |huomata..vb.1|
VRT is a pseudo-XML format. By pseudo I mean that it doesn't have a root node, but is instead a sequence of text elements. (There are some other differences, but that's not important right now.) The leaf nodes which contain text (here, sentence elements) have one token per line, with fields separated by tabs. So it's a TSV (tab-separated values) format inside an XML-like format. The first comment line indicates what the fields mean: the first field is word, for word form, the second is ref, for token number, the third is lemma, and so on.
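To make the format concrete, here is a minimal sketch of reading it by hand. The field positions follow the positional-attributes comment above, but the function itself is just an illustration, not the parse_vrt.py shipped with the exercise:
# Illustrative VRT reader: collect the lemma (third field) of every token,
# grouped by <text> element. Not the parse_vrt.py from the starter code.
def vrt_lemmas_per_text(path):
    texts = []                      # one list of lemmas per <text> element
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("<text"):
                current = []
            elif line.startswith("</text"):
                if current is not None:
                    texts.append(current)
                current = None
            elif line.startswith("<") or not line:
                continue            # comments, sentence tags and other structure
            elif current is not None:
                fields = line.split("\t")
                if len(fields) > 2:
                    current.append(fields[2])   # word ref lemma ...
    return texts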
💭 You may notice that the text element has some interesting attributes, like departments, main_department and publisher. Unfortunately the main_department is usually empty (the commands below are standard Unix tools, available on every system):
$ grep --only-matching 'main_department="[^"]*' $TMPDIR/ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2019_s.vrt | sed 's/main_department="//' | sort | uniq -c | sort -nr
  62104
319 Yle TV1
263 Klassinen
184 Yleisradio
164 Luonto
160 Strömsö
160 Kulttuuricocktail
...
The publisher never is:
$ grep --only-matching 'publisher="[^"]*' $TMPDIR/ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2019_s.vrt | sed 's/publisher="//' | sort | uniq -c | sort -nr
  38110 Yle Uutiset
14104 Yle Urheilu
8882 Yle Uutiset - lyhyet
1301 yle-aihe
...
The attributes come from the data source, and there's no general rule as to what you can rely on. Clearly here publisher is somewhat meaningful and very reliable, while main_department has more detail but is very sparse (perhaps we could fill it in ourselves!).
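If you prefer doing this kind of counting in Python rather than with the shell pipeline, a rough equivalent could look like this (a sketch; the file name is a placeholder for one of the .vrt files under $TMPDIR):
# Count publisher attribute values, roughly equivalent to the
# grep | sed | sort | uniq -c | sort -nr pipeline above.
import re
from collections import Counter

vrt_path = "ylenews_fi_2019_s.vrt"     # substitute the full path under $TMPDIR
counts = Counter()
with open(vrt_path, encoding="utf-8") as f:
    for line in f:
        if line.startswith("<text"):
            m = re.search(r'publisher="([^"]*)"', line)
            if m:
                counts[m.group(1)] += 1
for value, count in counts.most_common():
    print(count, value)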
Data processing
💬 Moving on, we can try to run parse_vrt.py, which by default builds lists of lemmas for each text, and then does nothing with them. It should look something like this:
$ python3 parse_vrt.py $TMPDIR/ylenews-fi-2019-2021-s-vrt/vrt
Running parse_vrt_in_dir...
Reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2019_s.vrt
Finished reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2019_s.vrt, 65811 texts and 25772447 tokens
Reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2020_s.vrt
Finished reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2020_s.vrt, 63004 texts and 27871609 tokens
Reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2021_s.vrt
Finished reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2021_s.vrt, 56543 texts and 25374938 tokens
...finished in 136.04 s
First task
☝🏻 Your first task, should you choose to accept it, is to replace the sequential processing of VRT files in parse_vrt.py with parallel processing, and then verify that the parallel version really is faster than the sequential one.
Hint 1
There are several files to read and process, so you can process different files separately and combine the results afterwards.
Hint 2
The standard library module multiprocessing has helpful facilities for this, such as multiprocessing.Pool, which can be used to map inputs to outputs in parallel.
One possible solution for this is included in parse_vrt_solution.py, or you can expand this line to see some code.
In parse_vrt.py:
# Exercise 1: parallelise parsing the corpora
# Hint: you can use the Python standard library for this
retval = []
for filename in os.listdir(dirname):
    if not filename.endswith('.vrt'):
        continue
    retval += vrt2lemmalists(os.path.join(dirname, filename))
Solution:
# Exercise 1 solution (one possible one): we map each filename to a
# vrt2lemmalists call using multiprocessing.Pool
from multiprocessing import Pool
retval = []
# First we get the valid file names
filenames = [os.path.join(dirname, filename) for filename in os.listdir(dirname) if filename.endswith('.vrt')]
# Then we initialize a Pool object
with Pool() as pool: # by default, processes = number of cores
    for result in pool.map(vrt2lemmalists, filenames):
        # We add the result lists together
        retval += result
Topic modelling
💬 Next we will use gensim to do some topic modeling. The Python script topics.py uses parse_vrt.py to get data, and processes it in various ways. Try running it with the same argument:
$ python3 topics.py $TMPDIR/ylenews-fi-2019-2021-s-vrt/vrt
Running parse_vrt_in_dir...
Reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2019_s.vrt
Finished reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2019_s.vrt, 65811 texts and 25772447 tokens
Reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2020_s.vrt
Finished reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2020_s.vrt, 63004 texts and 27871609 tokens
Reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2021_s.vrt
Finished reading ylenews-fi-2019-2021-s-vrt/vrt/ylenews_fi_2021_s.vrt, 56543 texts and 25374938 tokens
...finished in 133.57 s
Building gensim dictionary... Done in 21.49 s
Computing BOW corpus... Done in 14.11 s
Computing LDA model... Done in 114.99 s
Computing model coherence...
Coherence with 10 topics was -1.8115932174225207
Done in 1.53 s
[topic printout]
After the single parsing step from the previous section, we have added three more steps. All of them can be parallelised, but not all of them offer the same potential. If you are interested in parallelising code, they are all interesting examples, but the most important practical skill at this point is to recognise that these steps represent roughly 47%, 7%, 5% and 41% of the runtime respectively, and those shares set the ceiling on how much can be gained by speeding each one up. It is also often the case that relatively fast tasks have relatively little to gain from parallelisation. We will tackle them in reverse order, from most to least useful.
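To make that ceiling concrete, here is a small back-of-the-envelope calculation (illustrative only, using the rounded shares from the run above): even if the LDA step scaled perfectly over 4 cores, the total runtime would only drop to about 70% of the original.
# Rough ceiling for parallelising only the LDA step, using the shares above.
fractions = {"parse": 0.47, "dictionary": 0.07, "bow": 0.05, "lda": 0.41}
speedup = 4                          # assume perfect scaling over 4 cores
new_total = sum(fractions.values()) - fractions["lda"] + fractions["lda"] / speedup
print(f"runtime relative to the original: {new_total:.2f}")   # about 0.69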
Second task
☝🏻 Replace the LdaModel class with something else that accomplishes the same result, but quicker.
Hint 1
Have a look at the gensim API reference and search the page for “models.lda”.
Hint 2
Have a look at the gensim.models.LdaMulticore class.
Solution
Change this:
lda = gensim.models.LdaModel(bow_corpus, num_topics = n_topics)
To this:
lda = gensim.models.LdaMulticore(bow_corpus, num_topics = n_topics, workers = n_workers)
Third task
☝🏻 Parallelise computing the BOW corpus. This means replacing the texts (or in this case, lists of lemmas) with bag-of-words representations. Each text will undergo the same transformation, so this should be possible to parallelise.
Hint 1
You can use multiprocessing.Pool.map like in the first exercise, but in an even simpler way: the result is simply the map.
Solution
Change this:
bow_corpus = [dictionary.doc2bow(text) for text in corpus_lemmalists]
To this:
with Pool(processes = n_workers) as pool:
    bow_corpus = pool.map(dictionary.doc2bow, corpus_lemmalists)
You can experiment with n_workers, and also with chunksize, an optional argument to map().
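For example (a sketch; the best chunksize depends on the data and the number of workers):
# Same parallel doc2bow call, but with an explicit chunksize for Pool.map.
with Pool(processes = n_workers) as pool:
    bow_corpus = pool.map(dictionary.doc2bow, corpus_lemmalists, chunksize = 1000)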
Fourth task
☝🏻 Parallelise computing the gensim dictionary. This exercise is the trickiest one, and the least useful to implement.
Hint 1
The gensim Dictionary object has a method .merge_with(other), which we can use to turn a collection of dictionaries into one. But to do this we also need to split the source data, which is a list, into sublists.
Hint 2
You can make sublists with a generator comprehension like this:
def split_list(l, n):
    return (l[i:i+n] for i in range(0, len(l), n))
Solution
def split_list(l, n):
    return (l[i:i+n] for i in range(0, len(l), n))

dictionary = None
with Pool(processes = n_workers) as pool:
    for sub_dictionary in pool.map(gensim.corpora.Dictionary,
                                   split_list(corpus_lemmalists, 5000)):
        if dictionary is None:
            dictionary = sub_dictionary
        else:
            dictionary.merge_with(sub_dictionary)
Finally
We have focused on getting dependencies installed on CSC's HPC systems and on parallelism, but of course there are more general things that could be done to speed up a process like this. If a pipeline like this is run many times with a lot of data, the data preprocessing need not be redone every time: you can do it once and use the result as a cache. Later steps depend on earlier steps, so you could set up a Makefile-like system that only redoes the steps whose inputs have changed. The more times something is run, the more it pays to optimize it.
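As a small illustration of the caching idea, and assuming parse_vrt.py exposes a function like parse_vrt_in_dir (as the log output above suggests), a sketch could look like this; the cache file name is made up:
# Parse the VRT files once, save the lemma lists to disk, and reuse them later.
import os
import pickle

CACHE = "lemmalists.pkl"                 # hypothetical cache file name

def load_or_parse(vrt_dir):
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    corpus_lemmalists = parse_vrt_in_dir(vrt_dir)   # assumed helper from parse_vrt.py
    with open(CACHE, "wb") as f:
        pickle.dump(corpus_lemmalists, f)
    return corpus_lemmalists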