Chebyshevskii Sbornik
RUS  ENG    JOURNALS   PEOPLE   ORGANISATIONS   CONFERENCES   SEMINARS   VIDEO LIBRARY   PACKAGE AMSBIB  
General information
Latest issue
Archive

Search papers
Search references

RSS
Latest issue
Current issues
Archive issues
What is RSS



Chebyshevskii Sb.:
Year:
Volume:
Issue:
Page:
Find






Personal entry:
Login:
Password:
Save password
Enter
Forgotten password?
Register


Chebyshevskii Sbornik, 2019, Volume 20, Issue 2, Pages 93–107
DOI: https://doi.org/10.22405/2226-8383-2018-20-2-93-107
(Mi cheb755)
 

New approach to searching for string median and visualization of string clusters

D. V. Gorbachev, E. P. Ofitserov

Tula State University (Tula)
References:
Abstract: We consider the following problem of searching the median of a string set:
$$ c=\mathrm{argmin}_{a\in U}\sum_{b\in D}d(a,b), $$
where $D\subset G^{*}$ is the finite set of strings over the alphabet $G$, $U\subseteq G^{*}$, $d$ is the Levenshtein edit distance. This problem has important applications, e.g., in bioinformatics in the analysis of protein sequences. However, it is known that in the general case for $U=G^{*}$ the median problem is NP-hard. Therefore, for an approximate solution, heuristics algorithms were proposed, in particular, the greedy algorithm. We give a new flexible approach based on smooth Levenshtein distance approximation $\tilde{d}$. It uses stochastic character encoding of sequences and the following formula for the edit distance:
$$ d(X_{1},X_{2})=\min_{(X_{1}',X_{2}'')_{e}\subseteq X_{1}\times X_{2}} \Bigl\{\frac{1}{2}\,\|X_{1}'-X_{2}''\|_{1}+|X_{1}|-|X_{1}'|+ |X_{2}|-|X_{2}''|\Bigr\}, $$
where the minimum is taken over all subsequences $(X_{1}',X_{2}'')_{e}$ of equal length. On the one hand, stochastic coding extends the class, over which the extremum is searched. However, our main result shows that the median is the same. On the other hand, now we can use smooth optimization methods, replacing the minimum in the definition above with a smooth approximation. We give implementation details based on the gradient descent, including the recursion formulas for calculating $\tilde{d}$. Effective calculation of the approximate median allows, e.g., to use the $k$-means method for string clustering. We propose an approach to visualize such clusters based on tSNE stochastic nesting method.
Keywords: Levenshtein edit distance, string median, smooth minimum, gradient descent, $k$-means, string cluster visualization.
Funding agency
The results of the study are published with the financial support of Tulsu within the framework of the scientific project №: НИР_2018_28.
Received: 18.05.2019
Accepted: 12.07.2019
Document Type: Article
UDC: 519.72
Language: Russian
Citation: D. V. Gorbachev, E. P. Ofitserov, “New approach to searching for string median and visualization of string clusters”, Chebyshevskii Sb., 20:2 (2019), 93–107
Citation in format AMSBIB
\Bibitem{GorOfi19}
\by D.~V.~Gorbachev, E.~P.~Ofitserov
\paper New approach to searching for string median and visualization of~string clusters
\jour Chebyshevskii Sb.
\yr 2019
\vol 20
\issue 2
\pages 93--107
\mathnet{http://mi.mathnet.ru/cheb755}
\crossref{https://doi.org/10.22405/2226-8383-2018-20-2-93-107}
Linking options:
  • https://www.mathnet.ru/eng/cheb755
  • https://www.mathnet.ru/eng/cheb/v20/i2/p93
  • Citing articles in Google Scholar: Russian citations, English citations
    Related articles in Google Scholar: Russian articles, English articles
    Statistics & downloads:
    Abstract page:188
    Full-text PDF :130
    References:18
     
      Contact us:
     Terms of Use  Registration to the website  Logotypes © Steklov Mathematical Institute RAS, 2024