Computing technologies
Volume 12, Special Issue 2, 2007
TECHNOLOGY OF CREATING A THESAURUS OF THE SUBJECT AREA ON THE BASIS OF THE SUBJECT INDEX OF THE ENCYCLOPEDIA
V. B. Barakhnin
Institute of Computational Technologies SB RAS, Novosibirsk, Russia
e-mail: [email protected]
V. A. Nekhaeva
Novosibirsk State University, Russia
e-mail: [email protected]
This paper describes a technology for creating a subject-area thesaurus based on the subject index of a specialized encyclopedia. The technology provides a high-quality description of the subject area using reliably verified terms, which makes it possible to build the first stage of a thesaurus with minimal involvement of experts in the field. The proposed technology comprises a thesaurus-building algorithm and a web application that implements it.
Introduction
One of the most important factors in the successful implementation of integrated (interdisciplinary) research projects is effective scientific and information support. In particular, the joint work of researchers from several (and not always related) specialties requires careful coordination of the terminology used, because the same concept can be denoted by different terms in different fields of science, and one term can denote different concepts.
Another task of information support for projects is the creation of an integrated card file of bibliographic descriptions of documents (articles, books, etc.) on the subject of the project. It is compiled by combining the resources of the collaborating researchers, each of whom has already accumulated a card file on some topic (at present such card files are, as a rule, stored on electronic media). To facilitate searching the card file, it is desirable that the keywords characterizing the documents be selected, where possible, from a single dictionary. For automatic classification of documents that are included in the card file, or that could be added to it from electronic databases of scientific publications such as abstract-journal databases, "Current Contents", etc., it seems appropriate to use the coordinate indexing algorithm. This algorithm takes into account the classification features of the terms (words and phrases) that occur in the text and characterize a particular subject area.
© Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, 2007.
None of the tasks listed above can be solved without a dictionary of the terms of the subject area in which links between terms are established and the terms are classified. Such a dictionary is called a thesaurus (for details, see the bibliography). A thesaurus (or normative thesaurus) is a reference dictionary containing all lexical units of an information retrieval language, namely descriptors (together with the keywords that are treated as synonyms of these descriptors within a given information retrieval system); the descriptors in the dictionary must be systematized by meaning, and the semantic connections between them must be expressed explicitly.
However, compiling a thesaurus "from scratch" can require a very significant amount of work from experts, who must collect terms that cover the subject area fairly completely, agree on their meanings, establish relationships between them, and classify them. Such difficulties, arising in a task that is important but still auxiliary, can negatively affect the prospects for its solution.
We have developed and implemented a technology for creating a thesaurus based on the subject index of specialized encyclopedias. This technology provides a high-quality description of the subject area using reliably verified terms, which makes it possible to carry out the initial stage of building a thesaurus with minimal involvement of experts in the subject area. A detailed presentation and justification of the algorithm are given in a separate work (see the bibliography). Below is a brief description of the algorithm and of the web application that implements it.
1. Algorithm for creating a thesaurus
It is proposed to use the subject index of a specialized encyclopedia (or of several encyclopedias) as the list of keywords and phrases for the thesaurus. The choice of a particular encyclopedia is made by a specialist in the subject area and depends on the goals pursued in creating the thesaurus. For example, to solve complex environmental problems, it is advisable to use encyclopedias (or, in their absence, encyclopedic dictionaries) on physics, chemistry, geology, biology, medicine, mathematics, etc. With a proper choice, the subject index is quite suitable, if not as a complete list of keywords, then at least as a basic one that can be extended as necessary.
The subject indexes of most encyclopedias are arranged in a similar way - they contain terms that are the names of encyclopedia articles, terms whose definitions are given in the articles, as well as the most important results mentioned in the articles.
The titles of encyclopedia articles are taken as descriptors (i.e., terms that name classes of concepts that are close in meaning), and the words from the subject index that occur in the corresponding articles are taken as the keywords associated with them. The main advantage of this method is that one does not need to be an expert in the subject area to establish the types of relationships between terms. General knowledge sufficient to understand the text of the encyclopedia is enough, and any more specific information needed when classifying concepts can always be found in the relevant article.
Since the thesaurus being created is designed to work over the Z39.50 protocol, the link types are set in accordance with the recommendations of the Zthes schema, which distinguishes the following types:
BT - a link to the parent term, i.e. to a term with a broader meaning;
NT - a link to a child term, i.e. to a term with a narrower meaning. The BT and NT links are mutually inverse;
USE - a link to the term that is to be used instead;
UF - the inverse of USE;
RT - a link that defines a related term;
LE - relationship between linguistically equivalent terms;
FE - completely identical terms.
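The link types listed above can be illustrated with a small in-memory store that records the inverse link whenever a link is added. This is only a sketch under stated assumptions: the class name `Thesaurus` and its methods are hypothetical, and treating RT, LE and FE as symmetric is an assumption (the text states inverses only for BT/NT and USE/UF).

```python
from collections import defaultdict

# Inverse of each link type. BT/NT and USE/UF are described as mutually
# inverse in the text; treating RT, LE and FE as symmetric is an assumption.
INVERSE = {"BT": "NT", "NT": "BT", "USE": "UF", "UF": "USE",
           "RT": "RT", "LE": "LE", "FE": "FE"}


class Thesaurus:
    """Hypothetical in-memory store of typed links between terms."""

    def __init__(self):
        # term -> set of (link_type, other_term)
        self.links = defaultdict(set)

    def add_link(self, term, link_type, other):
        """Record a link and its inverse so both directions stay in sync."""
        if link_type not in INVERSE:
            raise ValueError(f"unknown link type: {link_type}")
        self.links[term].add((link_type, other))
        self.links[other].add((INVERSE[link_type], term))

    def related(self, term, link_type):
        """Return the terms linked to `term` by `link_type`, sorted."""
        return sorted(o for t, o in self.links[term] if t == link_type)
```

For instance, after `add_link("Divisible Abelian group", "BT", "Abelian group")`, the call `related("Abelian group", "NT")` would list "Divisible Abelian group"; the pairing of these two index terms is chosen purely for illustration.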
Further, the classification of descriptors is carried out in accordance with the sections of this subject area. The choice of a specific classifier, as well as the choice of an encyclopedia, is carried out by an expert, and in the case of using several encyclopedias from different subject areas, it is possible to use several specialized classifiers. Links of the form NT, RT, LE (FE) are established between descriptors and sections of the classifier, while the classification should use, if possible, sections of the lowest possible level.
After that, the keywords associated with the descriptor by the relationships BT, USE, RT, LE, and FE are assigned the same classification number as the descriptor. However, if the descriptor is assigned to a class that is not of the lowest level, then in the course of the expert's subsequent work the terms associated with the descriptor by the relations BT and USE may be assigned to a class of a lower level. In this case, these terms themselves become descriptors.
As a result, all terms included in the subject index are classified according to the sections of this subject area.
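The propagation of classification numbers from descriptors to their linked keywords can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the data layout (triples of descriptor, link type, keyword) and the classification code used in the usage note are all hypothetical.

```python
# Link types through which a keyword inherits the descriptor's class,
# as listed in the text.
INHERITING = {"BT", "USE", "RT", "LE", "FE"}


def propagate_classes(descriptor_class, links):
    """Assign each linked keyword the classification code of its descriptor.

    descriptor_class: dict mapping descriptor -> classification code
    links: iterable of (descriptor, link_type, keyword) triples
    Terms that already have a code (i.e. are descriptors themselves)
    keep their own code.
    """
    result = dict(descriptor_class)
    for descriptor, link_type, keyword in links:
        if link_type in INHERITING and descriptor in descriptor_class:
            # setdefault leaves an existing code untouched.
            result.setdefault(keyword, descriptor_class[descriptor])
    return result
```

With `descriptor_class = {"Abelian group": "C1"}` (a made-up code) and a single RT link to "Divisible Abelian group", the keyword would receive the code "C1" as well. A term that an expert later moves to a lower-level class would simply be added to `descriptor_class` with its own code.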
2. Description of the web application
Nevertheless, the process of constructing a thesaurus in accordance with this technique involves a large amount of routine work and, in addition, requires the participation of a person with programming skills. Therefore, in addition to the methodology, a web application was developed that has a user-friendly interface and supports the following functions:
1) automatic translation of information from digitized pages of the subject index into a database table;
2) selection of descriptors in the general list of terms;
3) search for terms associated with a given descriptor and setting the types of links in accordance with the Zthes schema.
It is important to note that no programmer skills are required to perform all the operations mentioned above.
The developed application is universal, i.e. it can be used to create thesauri for various subject areas. At present, reconfiguring the program from the subject index of one encyclopedia to that of another (the only stage at which the construction of thesauri for different subject areas may differ) is performed by a programmer. However, work is under way to add functions that will allow a user with no programming skills to perform this operation.
The application functions as follows. Processing of digitized index pages is performed automatically. The user specifies the location of the text file with data, after which it is read line by line and the terms themselves are entered into the database, as well as information about the encyclopedia page numbers where they are located (Fig. 1).
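The line-by-line loading step described above might look roughly like the sketch below. The line format `term | volume, first page[-last page]`, the table name `terms` and both function names are assumptions made for illustration; the paper does not specify the actual file layout or database schema.

```python
import re
import sqlite3

# Assumed (hypothetical) line format:
#   "<term> | <volume>, <first page>[-<last page>]"
LINE_RE = re.compile(
    r"^(?P<term>.+?)\s*\|\s*(?P<vol>\d+),\s*(?P<first>\d+)(?:-(?P<last>\d+))?\s*$"
)


def parse_line(line):
    """Return (term, volume, first_page, last_page), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    first = int(m["first"])
    last = int(m["last"]) if m["last"] else first
    return m["term"], int(m["vol"]), first, last


def load_index(lines, conn):
    """Insert every parsable line into `terms`; return the skipped lines."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS terms "
        "(term TEXT, volume INTEGER, first_page INTEGER, last_page INTEGER)"
    )
    skipped = []
    for line in lines:
        row = parse_line(line)
        if row is None:
            if line.strip():  # ignore blank lines, report garbled ones
                skipped.append(line)
            continue
        conn.execute("INSERT INTO terms VALUES (?, ?, ?, ?)", row)
    conn.commit()
    return skipped
```

Returning the unparsable lines instead of discarding them leaves room for the manual error correction that the application is said to support.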
Descriptors are selected from the general list of keywords by the user, who marks the required terms in the list displayed on the screen. The web application also supports correction of possible errors (Fig. 2). Recall that all terms found in the encyclopedia article devoted to a descriptor are considered to be associated with that descriptor.
To facilitate the search for related terms, the user is shown only the keywords located on the same pages as the chosen descriptor (this is why the database stores, in addition to the terms themselves, information about page boundaries). Of course, since an article may not occupy an entire page, extra terms will be included in the list, and the user, when establishing links, selects only some of the keywords from the proposed list. Nevertheless, this automation significantly reduces the amount of routine work (Fig. 3).
Fig. 1. Entering text files with terms from the subject index
Fig. 2. List of keywords and selection of descriptors
Fig. 3. Choice of related terms
Fig. 4. Establishment of types of links
The type of connection between the descriptor and the keyword is specified by filling out the appropriate form (Fig. 4).
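The page-based narrowing of candidate terms can be sketched as an overlap test on page ranges. The function name and the index layout (a dict from term to volume, first page, last page) are hypothetical, and the page numbers in the usage note are illustrative only.

```python
def candidates_for(descriptor, index):
    """Return terms that share at least one page with the descriptor.

    index: dict mapping term -> (volume, first_page, last_page)
    """
    vol, first, last = index[descriptor]
    found = []
    for term, (t_vol, t_first, t_last) in index.items():
        if term == descriptor:
            continue
        # Same volume and overlapping page ranges.
        if t_vol == vol and t_first <= last and t_last >= first:
            found.append(term)
    return sorted(found)
```

For a descriptor occupying pages 15-17 of volume 1, a term indexed on page 16 of the same volume would be offered as a candidate, while a term from volume 2 would not. As the text notes, such a list can still contain extra terms, so the final selection remains with the user.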
Conclusion
The algorithm and the web application were tested by creating a thesaurus for a number of sections of the subject area "Mathematics" ("Differential Equations", "Partial Differential Equations", "Numerical Analysis", "Fluid Mechanics", etc.) based on the subject index of the "Mathematical Encyclopedia". It was established that a bachelor's-level qualification is sufficient for classifying terms and establishing relationships between them (provided that, in rare cases, an expert with a scientific degree is consulted). This demonstrates the high efficiency of the developed algorithm.
Bibliography
Mikhailov A.I., Cherny A.I., Gilyarevsky R.S. Fundamentals of informatics. Moscow: Nauka, 1968.
Barakhnin V.B. Development of the thesaurus of the subject area "Mathematics" // Proc. Conf. "Computing and information technologies in science, technology and education". Part 1. Novosibirsk; Almaty; Ust-Kamenogorsk, 2003, pp. 111-115.
Zthes: a Z39.50 Profile for Thesaurus Navigation. http://lcweb.loc.gov/z3950/agency/profiles/zthes-04.html
The first step in creating the thesaurus was to search for information about the structure of thesauri, their types, and existing programs. The second stage was choosing a language and a scheme for building the future thesaurus. The third stage was searching for content to fill it; for this I used the "Educational and methodological complex Computer networks".
Here are a couple of examples of thesauri (see Figure 1.1 and Figure 1.2):
Figure 1.1 - Information retrieval system "Thesaurus.com"
Figure 1.2 - Glossary of gender terms
After the necessary information had been collected, work on the thesaurus began. HTML (HyperText Markup Language) was chosen as the language for creating it. Strictly speaking, HTML is a markup language rather than a programming language, and the notion of HTML now covers various methods of designing hypertext documents, hypertext editors, browsers, and much more. A user who has mastered this language can produce substantial results quickly and by simple means.
In the HTML language, you can create your own multimedia products and distribute them on any media, and all these products, made in the form of sets of HTML pages, do not require the development of specialized software tools, since everything necessary for working with data (Web browsers) has become part of the standard software of most personal computers.
The code of the future Web page is usually typed in a standard text editor, although other tools and languages exist, for example: Adobe Dreamweaver CS3, JavaScript, Pascal, C, C++, BASIC, Prolog.
To begin with, the thesaurus will consist of three frames: a title frame, a link frame, and a content frame, as shown in Figure 1.3.
Figure 1.3 - Scheme of thesaurus
The following HTML tags and attributes were used to create the thesaurus sketch: