Linguists and data scientists against COVID-19
Over the last few days I have seen several posts from individuals or institutions – such as the Terminology Coordination Unit – sharing useful resources (glossaries, dictionaries) about COVID-19. Each of these resources is helpful in its own way and is as comprehensive as the enthusiastic team creating it could manage in the short time available. At the same time, over on Kaggle, a growing group of natural language processing (NLP), machine learning (ML), and language modelling (LM) experts and enthusiasts were also hard at work throwing all manner of neural networks at the dataset gathered by the Allen Institute for AI and their partners for the COVID-19 Open Research Dataset Challenge (CORD-19). This dataset is “a free resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community”.
The discussions in the two ‘camps’ are quite different: on one side, there is talk of terms, contexts, equivalents in other languages, and making sure that human translators get the translation tasks right, while on the other it’s more about word embeddings, language modelling, text mining and automatic summarisation of relevant paper sections in order to answer one of the 10 questions set by this challenge. There isn’t a lot of overlap, because traditionally the former camp believes in manual, case-by-case solutions, while the latter is happier with big data, algorithms and automatic metrics, occasionally growing a little disenchanted with how slowly things move on the other side. On a good day, though, both sides acknowledge the different, but equally important, contributions each camp brings to the challenge of human communication.
To find some middle ground, as well as highlight a way to visualise the information available without needing to know word2vec, Python and the command line, I have done what others no doubt have also done: while still donating a few CPUs to the folding@home initiative, I downloaded the COVID-19 dataset (it is made up of four subsets of different sizes), converted the 44k JSON files to XML (thank you, @gridinoc for introducing me to Node JS, even if it made my CPU sigh a few times), and then uploaded the result to the best platform for terminology work that I know: the SketchEngine. The result, after some time uploading, unarchiving, lemmatising and part-of-speech tagging, is a 206,894,540-word corpus (270mil+ tokens in 6mil+ sentences) on which you can set the various excellent SketchEngine tools to work in order to extract single and multi-word term candidates, see them in context, look for words which appear in similar contexts, learn which other words they collocate with, which modifiers are used in the specialised literature with some terms, but not with others, etc. For a translator looking to gain more insights into this specialised terminology, or for a terminologist looking out for new terms in this area, or investigating how consistently existing terms are being used, I have always found the SketchEngine to be the most powerful and user-friendly platform currently available (but please let me know if you have other preferences).
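If you would rather script the conversion step yourself, here is a minimal Python sketch of the JSON-to-XML conversion (I went the Node JS route, as mentioned above, but the idea is the same). The input field names (`paper_id`, `metadata.title`, `abstract`, `body_text`) follow the CORD-19 JSON schema as I found it – do check them against the files you actually download – and the output element names are simply my own choice, not anything the SketchEngine requires:

```python
# Sketch: convert one CORD-19 JSON record (and then a folder of them) to
# simple XML documents ready for corpus upload. Field names are assumptions
# based on the CORD-19 schema; verify them against your downloaded files.
import json
from pathlib import Path
from xml.sax.saxutils import escape

def paper_to_xml(paper: dict) -> str:
    """Turn one parsed CORD-19 JSON record into a simple XML document."""
    title = escape(paper.get("metadata", {}).get("title", ""))
    abstract = " ".join(p["text"] for p in paper.get("abstract", []))
    body = " ".join(p["text"] for p in paper.get("body_text", []))
    return (
        f'<doc id="{escape(paper.get("paper_id", ""))}">\n'
        f"  <title>{title}</title>\n"
        f"  <abstract>{escape(abstract)}</abstract>\n"
        f"  <text>{escape(body)}</text>\n"
        f"</doc>\n"
    )

def convert_folder(src: Path, dst: Path) -> None:
    """Convert every .json file in src into a matching .xml file in dst."""
    dst.mkdir(parents=True, exist_ok=True)
    for json_file in src.glob("*.json"):
        paper = json.loads(json_file.read_text(encoding="utf-8"))
        out = dst / f"{json_file.stem}.xml"
        out.write_text(paper_to_xml(paper), encoding="utf-8")
```

The `escape()` calls matter: article text is full of `<`, `>` and `&` characters (gene names, inequalities) which would otherwise break the XML.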
UPDATE: since I started writing this post this morning, the fine folks at the SketchEngine have also released a slightly trimmed version of the dataset as a public corpus (called ‘Covid-19’), complete with sub-corpora for abstracts and full article texts. If you work for or study at an EU university, you are very likely to have SketchEngine access already through your institution. Otherwise, you can easily create a trial account, build your own resources, and explore existing ones. Your third option is not to create an account at all, but still do some work on open resources – try it!
If, on the other hand, you would like to work with the full dataset I have been looking at (called ‘CORD-19 corpus EN’, and also containing article metadata, references and bibliography to the tune of about 60 million words more), let me know at elearningbakery [at] gmail.com which e-mail address you have signed up to the SketchEngine with and I will share my corpus with you.
So, specifically and in pictures, what are at least some of the things you can do to study this subject matter in the SketchEngine (SE) with either of these corpora?
- Ask the SE to present you with single-word and multi-word term candidates (the emphasis here is on ‘candidates’ because you still need to use human judgement to decide which is a genuine term and which is not). Fortunately, in the term candidate interface, you can click on the … next to each candidate in order to see it in context, so deciding is not hard, just suitably time-consuming.
As you can see, because I have not filtered the data from the dataset, quite a few of the single-word term candidates are bibliographical reference placeholders, so something as simple as a find/replace with a regular expression will be enough to get rid of them when downloading this list for further processing. SketchEngine’s filtered corpus, on the other hand, does not suffer from this issue. Having said that, the problem of reference placeholders in the corpus I compiled goes away when it comes to multi-word expressions:
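The find/replace step mentioned above can be sketched in a few lines of Python. The pattern below is only a guess at typical reference residue (numeric brackets like “[12]” and author–year citations like “(Smith et al., 2020)”); adapt it to whatever your own conversion actually left behind:

```python
# Sketch: strip citation placeholders from a downloaded term-candidate list.
# The regular expression is an assumption about the residue's shape - check
# a sample of your own list and adjust it before trusting the output.
import re

REF_PATTERN = re.compile(
    r"\[\d+(?:,\s*\d+)*\]"                      # e.g. [12] or [3, 7]
    r"|\([A-Z][A-Za-z-]+ et al\.,? \d{4}\)"     # e.g. (Smith et al., 2020)
)

def strip_references(line: str) -> str:
    """Remove reference placeholders from one candidate line."""
    return REF_PATTERN.sub("", line).strip()
```

For example, `strip_references("spike protein [12]")` returns just `"spike protein"`.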
What else apart from this? Well, lots, actually: for starters, you can do ‘word sketches’ in order to see, among others, which modifiers are used with the word you are investigating, as well as which parts of speech it modifies – here is an example for ‘interferon’:
Next (in a pretty random order, in fact), you can see which other words appear in similar contexts to the one you are investigating (and which, therefore, have a good chance of being its synonyms or antonyms, or of having some other relationship with it) – again, it is still up to a human to work out what that exact relationship is.
To help you out in your examination of the difference between the word you are investigating and another one suggested by the Thesaurus function above, the SE has the Sketch Difference function. Again, thanks to the part-of-speech tagging element in the corpus compilation process, the SE produces very useful insights into which other words, but also, more generally, which nouns/verbs/adjectives/etc. are used more with the first word (in green), the second word (in red), or fairly equally with both (the white-ish area in between).
Just like all respectable terminology tools, the SE allows you to create concordances for individual words, phrases, lemmas, or pretty much whatever you want (learning the corpus query language CQL will prove particularly rewarding in the long term if you are serious about terminology mining). Moreover, depending on how the corpus you are working on has been created, you can Filter the contexts which will be looked at, as well as the corpus subfolders, subcorpora, or even individual files. Neat!
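To give you a flavour of what CQL looks like, here are two illustrative queries; the attribute names (`lemma`, `tag`) and the tagset they assume are common SketchEngine defaults for English corpora, but do check them against the SE’s own CQL documentation before relying on them:

```
[lemma="receptor"]
[tag="J.*"] [lemma="protein"]
```

The first matches all inflected forms of ‘receptor’; the second matches any adjective immediately followed by a form of ‘protein’ – handy for spotting which modifiers the specialised literature actually uses.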
The result of the concordance search is a series of examples which will help you understand much better what your search word means and in which natural-language contexts it is being used. Using the buttons at the top of the concordance window, you can specify which XML tags you would like your concordances to come from – so far I have been mainly working with the content of the titles, abstracts, and article text, but you can choose any of the other XML elements available to fine-tune your searches. You can always click to expand these contexts, and if you want more information about the source of these contexts you can click on the doc# hyperlink to the left of each example in order to access metadata about the file from which the example comes (in CORD-19 corpus EN you can see the filename, too, if you wish to read the entire research paper containing that concordance example).
The SE team appreciates that some of these contexts are still quite challenging for non-specialists to understand fully, and that’s why they also implemented the Good Dictionary Examples function which selects the most readable/accessible/plain-language concordance examples containing your search term(s). This is really handy if you are working on a dictionary under time pressure (well, when aren’t you?).
Last, but by no means least, the SE has not forgotten that, when you are investigating a particular term, you will find it very useful to also know the words with which your term frequently appears. This is the role of the Collocation function and, just like with pretty much all the other functions, clicking on … to the right of a collocation candidate will open up a concordance window so that you can examine the contexts and establish the nature of the relationships yourself.
I hope that this brief illustrated overview of how you can use the SketchEngine to investigate the emerging English terminology connected with COVID-19 helps – head over to the SE interface and start playing with the tools yourself. If you want to compare the public SE Covid-19 corpus with the one I uploaded, let me know as mentioned above. Otherwise, stay well and follow the safety guidance. If you can answer some of the Kaggle questions with information gleaned from these corpora, that would be fantastic! May your ‘distancing’ be ‘rigorous’ and ‘physical’ rather than ‘social’!