This article is cited in 3 scientific papers (total in 3 papers)
Mathematical Modeling, Numerical Methods
About similarity measures of components arrangement of naturally ordered data arrays
A. S. Gumenyuk (a), A. A. Skiba (b), N. N. Pozdnichenko (a), S. N. Shpynov (c)
a Omsk State Technical University (OmSTU)
b Company Elmis
c N. F. Gamaleya Federal Research Center for Epidemiology & Microbiology
Abstract:
At present, adequate mathematical tools are not used to analyze the arrangement of components in arrays of naturally ordered data of different nature, including words or letters in texts, notes in musical compositions, symbols in sign sequences, monitoring data, numbers representing ordered measurement results, and components in genetic texts. Therefore, it is difficult or impossible to measure and compare the order of messages singled out in long information chains. The main approaches to comparing symbol sequences use probabilistic models and statistical tools, as well as pairwise and multiple alignment, which makes it possible to determine the degree of similarity of sequences using edit distance measures. The application of pseudospectral and fractal representations of symbolic sequences is somewhat exotic. "The curse of a priori unconscious knowledge" of the obvious orderliness of a sequence deserves special mention, as it is widespread in mathematical linguistics, bioinformatics (mathematical biology), and other similar fields of science. These approaches pay almost no attention to studying and detecting patterns in the specific arrangement of all the symbols, words, and data components that constitute an individual sequence. The object of study in our works is a specifically organized numerical tuple: the arrangement of components (the order) in a symbolic or numerical sequence. The intervals between the closest identical components of the order are used as the basis for the quantitative representation of the chain arrangement. Multiplying all the intervals or summing their logarithms yields numbers that uniquely reflect the arrangement of components in a particular sequence. These numbers allow us to obtain a whole set of normalized characteristics of the order, among which are the geometric mean interval and its logarithm. Such characteristics reflect the arrangement of the components in symbolic sequences surprisingly accurately.
In this paper, we present an approach for quantitatively comparing the arrangement of arrays of naturally ordered data (information chains) of an arbitrary nature. We propose measures of similarity/distinction and a procedure for comparing the order of chains, based on selecting a list of subsequences (components) whose order characteristics are equal or similar. Rank distributions are used to speed up the selection of the list of matching components. The paper presents a toolkit for comparing the order of information chains and demonstrates some of its applications to the study of the structure of nucleotide sequences.
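The interval-based characteristics mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' toolkit: the function names are hypothetical, and the boundary convention is an assumption, since the abstract does not say how the first occurrence of a component is handled. Here the first interval of each component is counted from a virtual position 0 just before the sequence.

```python
import math
from collections import defaultdict


def intervals_by_symbol(seq):
    """Return, for each distinct component, the intervals between its
    closest identical occurrences.

    Assumption (not stated in the abstract): positions are 1-based and the
    first interval of a component is measured from a virtual position 0,
    so every occurrence contributes exactly one interval.
    """
    last_seen = {}
    intervals = defaultdict(list)
    for pos, sym in enumerate(seq, start=1):
        intervals[sym].append(pos - last_seen.get(sym, 0))
        last_seen[sym] = pos
    return dict(intervals)


def log_interval_sum(intervals):
    """Sum of base-2 logarithms of the intervals (the log of their product)."""
    return sum(math.log2(d) for d in intervals)


def geometric_mean_interval(intervals):
    """Geometric mean of the intervals, computed via the mean logarithm."""
    return 2 ** (log_interval_sum(intervals) / len(intervals))


if __name__ == "__main__":
    # Two chains with the same composition but a different arrangement.
    for chain in ("AABB", "ABAB"):
        for sym, ivs in intervals_by_symbol(chain).items():
            print(chain, sym, ivs, geometric_mean_interval(ivs))
```

Under this convention the chains AABB and ABAB have the same composition, yet component A has intervals 1, 1 in the first and 1, 2 in the second. The geometric mean interval therefore distinguishes the two arrangements, where a plain symbol count would not.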
Keywords:
data array, symbolic sequence, information chain, numeric characteristics of order, depth of order, average remoteness, nucleotide sequence, similarity measures, similarity matrix, alignment-free genome comparison, inter-nucleotide distance.
Received: 22.05.2018
Citation:
A. S. Gumenyuk, A. A. Skiba, N. N. Pozdnichenko, S. N. Shpynov, “About similarity measures of components arrangement of naturally ordered data arrays”, Tr. SPIIRAN, 18:2 (2019), 471–503
Linking options:
https://www.mathnet.ru/eng/trspy1053 https://www.mathnet.ru/eng/trspy/v18/i2/p471