COMPUTER SOFTWARE AND COMPUTING EQUIPMENT
Discriminant analysis of the technical short texts
A. V. Borovsky (a), E. E. Rakovskaya (a), A. L. Bisikalo (b)
(a) Baikal State University
(b) Irkutsk State University
Abstract:
Considerable attention is now paid to processing textual information in order to form thematic groups and systematize documents. The need is driven by the growing popularity of the Internet as a means of communication, and it calls for the categorization of short technical texts. For such texts the traditional stages of the task are difficult: preprocessing and digitizing the documents and identifying the "classifying" features. The specifics of the study at each stage are determined by the characteristics of the texts: small size, similar vocabulary, a large number of highly specialized symbols and signs, and synonymy of terms. A procedure is proposed for preparing texts for analysis and for reducing the dimension of the "term-document" matrix by singular value decomposition, which solves the problem of low-rank approximation of the original matrix. Two classification methods are used, the $k$-nearest neighbors method and discriminant analysis based on Fisher elementary functions, with texts describing the purpose of instruments taken as an example. The Fisher classification procedure uses discriminant variables and maximizes the differences between classes to obtain the classification functions. An object is assigned to the class whose classification function takes the greatest value. The results are assessed, and classification accuracy is found to be inadequate when the TF-IDF measure is applied under the experimental conditions. To improve the quality of classification, a combined method is proposed: at the first step, words are selected using the TF-IDF measure; at the second step, a dictionary of terms and phrases is used for classifying the texts. On the resulting data, classification is carried out by discriminant analysis and the $k$-nearest neighbors method. The combined method is to be refined in future work.
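As an illustration of two of the building blocks named in the abstract, TF-IDF term weighting and $k$-nearest neighbors classification, the following is a minimal sketch using only the Python standard library. It is not the authors' procedure: the toy instrument descriptions, the class labels, the function names (`idf_weights`, `tfidf`, `cosine`, `knn_classify`), the smoothed IDF formula, and the choice of cosine similarity are all assumptions made for the example, and the singular value decomposition and discriminant analysis steps are omitted.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency per term (with +1 so common terms keep some weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf(doc, idf):
    """Sparse TF-IDF vector {term: weight}; terms unseen in training are dropped."""
    tf = Counter(term for term in doc if term in idf)
    total = sum(tf.values()) or 1
    return {term: count / total * idf[term] for term, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(query, train_docs, train_labels, k=3):
    """Majority label among the k training texts most similar to the query."""
    idf = idf_weights(train_docs)
    vectors = [tfidf(doc, idf) for doc in train_docs]
    q = tfidf(query, idf)
    ranked = sorted(zip(vectors, train_labels),
                    key=lambda pair: cosine(q, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical short texts describing the purpose of instruments (invented data).
train_docs = [
    "voltmeter measures voltage in electrical circuit".split(),
    "ammeter measures current in electrical circuit".split(),
    "ohmmeter measures resistance of electrical conductor".split(),
    "thermometer measures temperature of liquid".split(),
    "pyrometer measures temperature of heated surface".split(),
    "thermocouple measures temperature of gas".split(),
]
train_labels = ["electrical"] * 3 + ["thermal"] * 3

print(knn_classify("device measures temperature of metal surface".split(),
                   train_docs, train_labels))
```

Even in this toy setting the texts share much of their vocabulary ("measures", "of"), which is the property the abstract identifies as making short technical texts hard to separate by term weights alone.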
Keywords:
classification of short texts, defining the weight of terms, singular decomposition, discriminant analysis, Fisher elementary functions, $k$-nearest neighbor method.
Received: 05.03.2018
Citation:
A. V. Borovsky, E. E. Rakovskaya, A. L. Bisikalo, “Discriminant analysis of the technical short texts”, Vestn. Astrakhan State Technical Univ. Ser. Management, Computer Sciences and Informatics, 2018, no. 2, 53–60
Linking options:
https://www.mathnet.ru/eng/vagtu530 https://www.mathnet.ru/eng/vagtu/y2018/i2/p53