|
New approach to searching for string median and visualization of string clusters
D. V. Gorbachev, E. P. Ofitserov Tula State University (Tula)
Abstract:
We consider the following problem of searching the median of a string set:
$$
c=\mathrm{argmin}_{a\in U}\sum_{b\in D}d(a,b),
$$
where $D\subset G^{*}$ is the finite set of strings over the alphabet $G$,
$U\subseteq G^{*}$, $d$ is the Levenshtein edit distance. This problem has
important applications, e.g., in bioinformatics in the analysis of protein
sequences. However, it is known that in the general case for $U=G^{*}$ the
median problem is NP-hard. Therefore, for an approximate solution, heuristics
algorithms were proposed, in particular, the greedy algorithm. We give a new
flexible approach based on smooth Levenshtein distance approximation
$\tilde{d}$. It uses stochastic character encoding of sequences and the
following formula for the edit distance:
$$
d(X_{1},X_{2})=\min_{(X_{1}',X_{2}'')_{e}\subseteq X_{1}\times X_{2}}
\Bigl\{\frac{1}{2}\,\|X_{1}'-X_{2}''\|_{1}+|X_{1}|-|X_{1}'|+
|X_{2}|-|X_{2}''|\Bigr\},
$$
where the minimum is taken over all subsequences $(X_{1}',X_{2}'')_{e}$ of
equal length. On the one hand, stochastic coding extends the class, over which
the extremum is searched. However, our main result shows that the median is the
same. On the other hand, now we can use smooth optimization methods, replacing
the minimum in the definition above with a smooth approximation. We give
implementation details based on the gradient descent, including the recursion
formulas for calculating $\tilde{d}$. Effective calculation of the approximate
median allows, e.g., to use the $k$-means method for string clustering. We
propose an approach to visualize such clusters based on tSNE stochastic nesting
method.
Keywords:
Levenshtein edit distance, string median, smooth minimum, gradient descent, $k$-means, string cluster visualization.
Received: 18.05.2019 Accepted: 12.07.2019
Citation:
D. V. Gorbachev, E. P. Ofitserov, “New approach to searching for string median and visualization of string clusters”, Chebyshevskii Sb., 20:2 (2019), 93–107
Linking options:
https://www.mathnet.ru/eng/cheb755 https://www.mathnet.ru/eng/cheb/v20/i2/p93
|
Statistics & downloads: |
Abstract page: | 188 | Full-text PDF : | 130 | References: | 18 |
|