Materials and their use

 

The Research Unit for Volgaic Languages has a collection of electronic language corpora that is constantly being developed. The materials can be divided into the following types:

  • uncoded texts
  • grammatically coded texts
  • parallel texts
  • word lists

 

Uncoded texts

The goal of the Research Unit is to collect texts totalling at least one million words for each of the languages spoken in the Volga-Kama region. So far, large collections of texts have been acquired for Udmurt, Mari, Mordvin, and Chuvash. Some of the materials are accessible through the Internet. Since these corpora have no morphosyntactic coding, the material can only be searched using character strings.
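
As an illustration of what a character string search involves, here is a minimal sketch in Python. It is not the Research Unit's own search program; the function name `find_string` and the sample lines are hypothetical placeholders, not corpus data.

```python
def find_string(lines, needle):
    """Return (line_number, line) pairs for every line containing the string."""
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if needle in line]

# Invented placeholder lines standing in for an uncoded text.
sample = ["alpha beta", "gamma betas", "delta"]

# The search matches raw characters only, so "betas" is also found:
# without grammatical coding the program cannot tell word forms apart.
hits = find_string(sample, "beta")
```

A search of this kind finds any line that contains the given characters, which is why the results of string searches on uncoded texts usually need manual checking.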

Grammatically coded texts

To date, the only grammatically coded corpus available contains Erzya and Moksha texts. The corpus contains about 240 000 words and consists of folkloristic and literary texts. Information about the word class and declension has been coded for every word. The user of the corpus can easily obtain, for instance, all words in the inessive case.
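
A query of this kind could look roughly like the following Python sketch. The actual coding format of the corpus is not described here, so the tab-separated layout (form, word class, case), the tag `INE`, and the word forms are all assumptions made for illustration.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str        # the word as it appears in the text
    word_class: str  # hypothetical tag, e.g. "N" for noun
    case: str        # hypothetical tag, e.g. "INE" for inessive, "-" if none

def parse_coded_line(line):
    """Split one assumed 'form<TAB>word class<TAB>case' line into a Token."""
    form, word_class, case = line.rstrip("\n").split("\t")
    return Token(form, word_class, case)

def words_in_case(tokens, case):
    """Return the forms of all tokens coded with the given case tag."""
    return [token.form for token in tokens if token.case == case]

# Invented placeholder entries, not real Erzya or Moksha data.
coded = ["form1\tN\tINE", "form2\tV\t-", "form3\tN\tNOM", "form4\tN\tINE"]
tokens = [parse_coded_line(line) for line in coded]
inessive = words_in_case(tokens, "INE")
```

Because the search runs on the coding rather than on the characters of the word, it finds every inessive form regardless of how the ending is spelled.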

Parallel texts

To facilitate the morphosyntactic and semantic comparison of languages, parallel text corpora have been created. They contain the same text in many different languages, and the sentences that correspond to each other in different languages have been labeled with the same number. Thus the user can easily compare expressions of the same semantic content in different languages. Character string searches can, of course, also be run on the parallel text corpora.
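
The numbering scheme makes alignment a simple lookup, as the following Python sketch suggests. The data structure (a dictionary from sentence number to sentence for each language), the function name `align`, and the language labels are assumptions for illustration, not the format actually used in the corpora.

```python
def align(texts):
    """Group sentences by their shared number.

    texts maps a language name to a dict of {sentence number: sentence}.
    Only numbers present in every language are returned.
    """
    shared = set.intersection(*(set(sentences) for sentences in texts.values()))
    return {number: {language: sentences[number]
                     for language, sentences in texts.items()}
            for number in sorted(shared)}

# Invented placeholder sentences; "lang_a" and "lang_b" are hypothetical labels.
texts = {
    "lang_a": {1: "a-1", 2: "a-2", 3: "a-3"},
    "lang_b": {1: "b-1", 2: "b-2"},
}
aligned = align(texts)
```

Here sentence 3 is left out because it has no counterpart in the second language; a real tool might instead report such gaps to the user.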

Word lists

The goal of the Research Unit is to create a large electronic word list for every language in the Volga-Kama region. The word lists are primarily intended for the study of derivation and word structure. Since the word lists of different languages have the same format, and because they can be simultaneously accessed using a special computer program, it is possible to compare word formation in several languages. Word lists of tens of thousands of words exist for Mari, Mordvin, Udmurt, and Chuvash. In these lists, the language form (e.g. Erzya or Moksha), the word class, and the source from which the word has been taken into the list are given. There is, however, no information about the meaning of the word in the word lists.
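
Because the lists share one format, a single routine can be applied to several languages at once. The Python sketch below counts, per language form, the entries ending in a given character string. The field names (`word`, `variety`, `word_class`, `source`) mirror the information listed above but are otherwise assumed, and the entries are invented placeholders.

```python
from collections import Counter

def suffix_counts(entries, suffix):
    """Count, per language form, the entries whose word ends with the suffix."""
    return Counter(entry["variety"]
                   for entry in entries
                   if entry["word"].endswith(suffix))

# Invented placeholder entries, not taken from any actual word list.
entries = [
    {"word": "stemka", "variety": "A", "word_class": "N", "source": "dict1"},
    {"word": "rootka", "variety": "B", "word_class": "N", "source": "dict2"},
    {"word": "stem",   "variety": "A", "word_class": "N", "source": "dict1"},
]
counts = suffix_counts(entries, "ka")
```

A comparison of this kind only looks at word form, which is consistent with the lists carrying no information about meaning.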

Search programs

Special search programs have been created for each type of corpus. These help users easily find the linguistic elements they are interested in. The programs can be used on the premises of the Finno-Ugric Languages Department of the University of Turku. Some uncoded texts may also be accessed through the Internet, and a special tool for handling word lists will be distributed by the Finno-Ugrian Society.

User rights

The corpus materials can be used free of charge by the personnel and students of the Finno-Ugric Languages Department of the University of Turku, as well as by researchers who are connected to the Research Unit projects or who otherwise collaborate with the Department or Research Unit. The right to use the materials is always bound to a clearly defined research project or theme. If you wish to use the language materials of the Research Unit, please contact Dr. Jorma Luutonen (Jorma.Luutonen_at_utu.fi).

 

23.03.2007 12:10 Ilmari Vakkala