|
Model of statistical dependence of sample distributions using nearest neighbor graphs and Kullback-–Leibler divergence
A. A. Kislitsyn Keldysh Institute of Applied Mathematics of RAS
Abstract:
The paper presents the results of numerical modeling of the distribution of graphs of the
nearest neighbors by the number of connectivity components and vertices by degrees for
the case when the distances between objects are not symmetric. The objects are discrete probability distributions, the distances between which are calculated as Kullback-Leibler
divergence. Estimates of the probability that a set of probability distributions is formed
by statistically dependent objects are numerically obtained. The numerical algorithm for
constructing a benchmark for the probabilities of implementing a certain structure of the
nearest neighbor graph is based on the fact that, up to isomorphism, this structure does
not depend on the distribution of distances between objects. An algorithm for collecting
sample statistics of nearest neighbor graphs for arbitrary random asymmetric matrices,
the elements of which are treated as distances, is described. An example of the analysis
of big data distributions obtained as a result of automatic processing of a corpus of more
than 100 thousand literary texts by 8.5 thousand authors in Russian is considered. The
corpus is structured according to the authors in the form of n-grams of letter combinations. The example is interesting because the n-gram distributions make it possible to accurately identify the author of a single text, so that the reference author distributions can
be considered as a basis. In the exact sense of linear algebra, the vectors of the author's
standards are linearly independent. At the same time, it turned out that these vectors are
statistically dependent with a probability almost equal to 1, which allows additional
structuring of the data array. A comparison was also made with the results obtained in the
L1 histogram norm for the same distributions, and it was shown that the benchmark for
asymmetric distances allows in this example to obtain an answer at a higher level of confidence.
Keywords:
nearest neighbor graph, vertex degree distribution, clustering, Kullback–Leibler divergence, data dependence criterion, text author identification.
Received: 09.09.2024 Revised: 09.09.2024 Accepted: 14.10.2024
Citation:
A. A. Kislitsyn, “Model of statistical dependence of sample distributions using nearest neighbor graphs and Kullback-–Leibler divergence”, Mat. Model., 37:2 (2025), 185–198
Linking options:
https://www.mathnet.ru/eng/mm4606 https://www.mathnet.ru/eng/mm/v37/i2/p185
|
Statistics & downloads: |
Abstract page: | 56 | Full-text PDF : | 1 | References: | 10 | First page: | 4 |
|