About TermiKnowledge

TermiKnowledge is an educational project within the framework of the 4EU+ Alliance running in the academic year 2021–2022 for students of the University of Warsaw, Charles University in Prague, Heidelberg University and Milan University. The project was co-funded by the Erasmus+ Programme of the European Union.

The website of our sister project on refugee crisis-related terminology can be found here.

About the Project

Our team consisted of 7 teachers and more than 30 student participants (7–8 students per university + 1 Student Assistant per university).

Our goal in the first semester was to compile a multilingual knowledge base on COVID-19. Following a series of six weekly lectures delivered by the teachers on topics such as terminology science, corpus linguistics and lexicography, we agreed on a list of 40 key concepts and worked to provide terms in the four languages of the countries where our universities are located (Czech, German, Italian and Polish) and in English. Knowledge base entries were based exclusively on text excerpted from corpora in the five languages, nearly all of which were compiled by project participants.

Four corpora were created for each language, each comprising a different type of text related to COVID-19: normative texts (legal regulations such as EU directives, national regulations, and medical guidelines), research texts (published research papers), press texts (general press, not specialized medical journals) and comments posted online under press articles. The purpose was to take a multifaceted “snapshot” of the linguistic landscape of COVID-19. With four text types in each of five languages, there were 20 corpora in total. Not all concepts were represented by terms in all types of corpora.

Of the 20 corpora, 19 were compiled by our participants. However, instead of compiling our own corpus of research texts in English, we used a huge corpus available on the Sketch Engine website, consisting of texts that were released as part of the COVID-19 Open Research Dataset (CORD-19).

Reference: COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-05-02. Retrieved from https://pages.semanticscholar.org/coronavirus-research. doi:10.5281/zenodo.3715505.

Some technical parameters regarding our corpora can be found here.

Entry Structure

Each entry in a given language combines information from up to four corpora. Entries are therefore generally subdivided into four sections (Normative, Press, Research and Comments) corresponding to the different corpora. If a corpus did not contain any single- or multiword terms or expressions (headword, variants and synonyms) reflecting a given concept, the section shows a note to that effect instead of the usual field headings.

While most headwords are terms, i.e. units of language naming concepts that are part of the conceptual system of a given field of specialized knowledge or activity, some (such as anti-vaxxer) are obviously non-terminological expressions used in general language.

Below the headword there is a list of related terms with a specification of the relation in each case. The specifications are written in natural language. We did not use a semiotic code.

The fields within each section relate to conceptual and linguistic aspects of the headword.

The conceptual domain is reflected in definitions (which may include encyclopaedic information) and descriptions. Definitions, with encyclopaedic information added where possible, are given in the Normative section. It was assumed that any description of the entry term following the rules prescribed for intensional (hypernym + distinctive features) or extensional (list of types of objects covered) definitions would be good material for the Definition field, while all other information describing a term would be considered encyclopaedic information. The Research section includes a Definition/Description field, because we expected that it would be difficult to find formal definitions in research papers, but that scientific descriptions of concepts could be excerpted from those corpora. The Press and Comments sections include just a Description field, where entry authors would place any definition-like quotations. You will notice that the conceptual fields are sometimes skipped; this is because entry authors could not find suitable quotations to use as a definition or description. This may be due partly to the rather limited size of a number of our corpora and partly to the nature of a particular entry headword.

All definitions, descriptions and pieces of encyclopaedic information are direct quotations from texts in the respective corpora. The quotations were not altered in any way. We also did not prepare our own definitions that summarized what we had read. Thus, our approach to definition mining was exclusively corpus-driven.

Some of our headwords were obviously not terms, for example emotion-laden units like anti-vaxxer or covidiot, but this was not an obstacle to finding quotations that described their meaning. By way of experiment, we also included negative and positive as headwords, even though these adjectives are classified as term elements rather than standalone terms. The compilers noted that these two behaved differently in texts and not all uses were related to COVID-19, but this proved no obstacle to the successful generation of entries.

The linguistic fields begin with Variants and Synonyms, which are found above the Definition/Description field. We decided not to distinguish between (spelling, morphological, etc.) variants and true synonyms.

All synonyms and variants can be your starting point for accessing entries via the search box.

The remaining fields providing linguistic information comprise Examples, followed by Collocations and an indication of the Keyness (see below) of the headword and any variants or synonyms attested in a given corpus.
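The entry layout described in this section can be summarized as a small data structure. The following is only an illustrative sketch, not the way the knowledge base is actually stored: the class names, attribute names and the sample values (placeholder strings and the keyness position) are assumptions made for the example. A section set to None stands for a corpus in which no term or expression for the concept was attested.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Label of the conceptual field in each section, as described in the text.
CONCEPTUAL_FIELD = {
    "Normative": "Definition",
    "Research": "Definition/Description",
    "Press": "Description",
    "Comments": "Description",
}


@dataclass
class Section:
    """One section of an entry, built from one corpus (hypothetical model)."""
    corpus_type: str  # "Normative", "Press", "Research" or "Comments"
    variants_and_synonyms: List[str] = field(default_factory=list)
    conceptual_quotations: List[str] = field(default_factory=list)  # direct quotations only
    examples: List[str] = field(default_factory=list)
    collocations: List[str] = field(default_factory=list)
    keyness: Dict[str, int] = field(default_factory=dict)  # expression -> list position

    @property
    def conceptual_label(self) -> str:
        return CONCEPTUAL_FIELD[self.corpus_type]


@dataclass
class Entry:
    """An entry for one headword in one language (hypothetical model)."""
    headword: str
    language: str
    related_terms: List[Tuple[str, str]] = field(default_factory=list)  # (term, relation)
    # None means the corpus contained no expression for the concept.
    sections: Dict[str, Optional[Section]] = field(default_factory=dict)


def attested_sections(entry: Entry) -> List[str]:
    """Return the names of the sections that actually carry fields."""
    return [name for name, section in entry.sections.items() if section is not None]


# Invented sample values, for illustration only.
press = Section(
    corpus_type="Press",
    conceptual_quotations=["<quotation from the press corpus>"],
    keyness={"anti-vaxxer": 12},
)
entry = Entry(
    headword="anti-vaxxer",
    language="English",
    sections={"Normative": None, "Press": press, "Research": None, "Comments": None},
)
print(attested_sections(entry), press.conceptual_label)
```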

Sentences that served as Examples were selected from among those that did not quite qualify as definitions/descriptions, for example, because they did not contain general information. The presence of frequent collocations also played a role.

Collocations were included on the basis of their frequency. Although steps were taken to ensure a unified layout, particularly of collocations with verbs to distinguish subject and object collocations, some liberty was allowed in that respect, too.

The Keyness Field

The values in the Keyness field refer to the position of a particular term or expression on the keyness list for a particular corpus. Note that Sketch Engine generates two keyword lists, one for single words and one for potential multiword expressions, and thus there were two keyness lists for each corpus. That is why the keyness figure given for a single-word keyword may be higher (which generally means that its frequency was lower) than the figure for a multiword expression including this headword: the two figures were derived from different lists.

The reference corpora were the default national web-crawl corpora suggested by Sketch Engine.
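To illustrate how a position on a keyness list comes about, the sketch below assumes the "simple maths" score described in the Sketch Engine documentation: the frequency per million in the focus corpus plus a smoothing constant, divided by the frequency per million in the reference corpus plus the same constant. The function names, the smoothing value of 1 and all counts and corpus sizes are assumptions chosen for the example; they are not data from our corpora.

```python
def per_million(count, corpus_size):
    """Relative frequency of an item per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness_score(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Simple-maths style score: (fpm_focus + n) / (fpm_ref + n)."""
    focus_fpm = per_million(focus_count, focus_size)
    ref_fpm = per_million(ref_count, ref_size)
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)


def keyness_positions(focus_counts, ref_counts, focus_size, ref_size, smoothing=1.0):
    """Rank items by descending score; position 1 is the most 'key' item."""
    scores = {
        item: keyness_score(count, focus_size, ref_counts.get(item, 0), ref_size, smoothing)
        for item, count in focus_counts.items()
    }
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return {item: position for position, item in enumerate(ranked, start=1)}


# Invented counts: a 1-million-token focus corpus and a 100-million-token reference corpus.
focus_counts = {"quarantine": 500, "virus": 2000, "people": 3000}
ref_counts = {"quarantine": 100, "virus": 5000, "people": 400000}
positions = keyness_positions(focus_counts, ref_counts, 1_000_000, 100_000_000)
print(positions)
```

In this invented example the most frequent word in the focus corpus ("people") ends up last, because it is even more frequent, relatively speaking, in the reference corpus.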

We encountered a number of problems with keyness lists. In the case of highly inflected languages (Czech and Polish), some lexemes were not lemmatized by Sketch Engine, and so the occurrences of the different paradigm forms (cases, tense-person-number forms, etc.) were counted separately. As a result, the associated keyness figures were rather low (i.e. high positions on the keyness lists). Keyness positions are nevertheless provided in view of their potential utility for making comparisons between synonyms, terms or languages.
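The effect of missing lemmatization on counts can be shown with a toy example. This is a minimal sketch and not a description of Sketch Engine's internals: the token list and the lemma mapping are hypothetical, using inflected forms of the Polish noun szczepionka ("vaccine").

```python
from collections import Counter


def count_by_form(tokens):
    """Count every word form as a separate item (no lemmatization)."""
    return Counter(tokens)


def count_by_lemma(tokens, lemma_of):
    """Count tokens under their lemma; unknown forms are kept as they are."""
    return Counter(lemma_of.get(token, token) for token in tokens)


# Hypothetical tokens and lemma mapping, for illustration only.
tokens = ["szczepionka", "szczepionki", "szczepionkę", "szczepionki", "szczepionką"]
lemma_of = {form: "szczepionka" for form in tokens}

by_form = count_by_form(tokens)
by_lemma = count_by_lemma(tokens, lemma_of)
print(by_form)
print(by_lemma)
```

Without lemmatization the five occurrences are spread over four separate items; with it they are counted as a single lexeme, which is why list positions for unlemmatized forms are not directly comparable with those of properly lemmatized headwords.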