Sistemy i Sredstva Informatiki [Systems and Means of Informatics]
Sistemy i Sredstva Informatiki [Systems and Means of Informatics], 2022, Volume 32, Issue 4, Pages 59–68
DOI: https://doi.org/10.14357/08696527220406
(Mi ssi856)
 

Tokenization based on the method of functional patterns

Yu. V. Nikitin (a), A. A. Khoroshilov (b, a, c), A. E. Makarova (d)

a Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
b Moscow State Aviation Institute (National Research University), 4 Volokolamskoe Shosse, Moscow 125933, Russian Federation
c 27th Central Research Institute of the Ministry of Defence of the Russian Federation, 5, 1st Khoroshevsky Passage, Moscow 123007, Russian Federation
d Scientific Industrial Joint Stock Company "High Technology and Strategic Systems," 27-9 Elektrozavodskaya Str., Moscow 107023, Russian Federation
Abstract: The article proposes a new method of text tokenization based on generalized functional templates. The method classifies Unicode characters by their role in forming text elements and builds compound patterns from the resulting generalized character classes. It does not use regular expressions, despite their widespread use. A distinctive feature of the method is that a sequence of characters can be used as part of an interval template. The strengths of the method include successful tokenization of complex information objects (numbers, geographic coordinates, names of articles of engineering products, etc.), detailed classification of tokens at the stage of their formation, the ability to turn tokenization of a given token type on and off, and the ability to add new templates from a sample text for additional training of the system.
Keywords: tokenization, segmentation, graphematic analysis, computational linguistics, patterns, substitution, token.
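
The general idea in the abstract (classify characters, then match compound patterns over the classes, without regular expressions) can be illustrated with a minimal sketch. It is not the authors' implementation: the class labels (`L`, `D`, `S`, `W`, `P`), the `PATTERNS` table, and the functions `char_class`, `runs`, and `tokenize` are all hypothetical names chosen for illustration, and the paper's interval templates are not reproduced here.

```python
import unicodedata

def char_class(ch):
    """Map a character to a generalized class (hypothetical scheme)."""
    if ch in ".,":
        return "S"                      # separator that may occur inside a number
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "L"                      # letter
    if cat == "Nd":
        return "D"                      # decimal digit
    if cat.startswith("Z") or ch in "\t\n\r":
        return "W"                      # whitespace
    return "P"                          # any other symbol or punctuation

# Compound patterns as sequences of class runs (assumed, not from the paper).
PATTERNS = {
    "decimal": ("D", "S", "D"),         # digits, one separator, digits
    "word": ("L",),
    "integer": ("D",),
}

def runs(text):
    """Group adjacent characters of one class into (class, start, end) runs."""
    out = []
    for i, ch in enumerate(text):
        c = char_class(ch)
        # Separators and punctuation are never merged: one character per run.
        if out and out[-1][0] == c and c not in "SP":
            out[-1] = (c, out[-1][1], i + 1)
        else:
            out.append((c, i, i + 1))
    return out

def tokenize(text, enabled=("decimal", "word", "integer")):
    """Return (token, type) pairs, using only the enabled patterns."""
    rs = runs(text)
    tokens = []
    i = 0
    while i < len(rs):
        best = None
        for name in enabled:
            pat = PATTERNS[name]
            window = tuple(r[0] for r in rs[i:i + len(pat)])
            # Prefer the longest pattern that matches at this position.
            if window == pat and (best is None or len(pat) > len(PATTERNS[best])):
                best = name
        if best is not None:
            n = len(PATTERNS[best])
            tokens.append((text[rs[i][1]:rs[i + n - 1][2]], best))
            i += n
        else:
            if rs[i][0] != "W":         # whitespace is dropped
                tokens.append((text[rs[i][1]:rs[i][2]], "other"))
            i += 1
    return tokens

print(tokenize("Pi is 3.14, ok."))
print(tokenize("3.14", enabled=("word", "integer")))
```

In this sketch each token receives its type at the moment it is formed, and omitting a pattern name from `enabled` switches that token type off, so `3.14` falls back to two integers and a separator.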
Received: 15.09.2022
Document Type: Article
Language: Russian
Citation: Yu. V. Nikitin, A. A. Khoroshilov, A. E. Makarova, “Tokenization based on the method of functional patterns”, Sistemy i Sredstva Inform., 32:4 (2022), 59–68
Citation in format AMSBIB
\Bibitem{NikKhoMak22}
\by Yu.~V.~Nikitin, A.~A.~Khoroshilov, A.~E.~Makarova
\paper Tokenization based on~the~method of~functional patterns
\jour Sistemy i Sredstva Inform.
\yr 2022
\vol 32
\issue 4
\pages 59--68
\mathnet{http://mi.mathnet.ru/ssi856}
\crossref{https://doi.org/10.14357/08696527220406}
Linking options:
  • https://www.mathnet.ru/eng/ssi856
  • https://www.mathnet.ru/eng/ssi/v32/i4/p59