Tokenization based on the method of functional patterns
Yu. V. Nikitin (a), A. A. Khoroshilov (b, a, c), A. E. Makarova (d)
a Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
b Moscow State Aviation Institute (National Research University), 4 Volokolamskoe Shosse, Moscow 125933, Russian Federation
c 27th Central Research Institute of the Ministry of Defence of the Russian Federation, 5, 1st Khoroshevsky Passage, Moscow 123007, Russian Federation
d Scientific Industrial Joint Stock Company "High Technology and Strategic Systems," 27-9 Elektrozavodskaya Str., Moscow 107023, Russian Federation
Abstract:
The article proposes a new text tokenization method based on generalized functional templates. The method classifies Unicode characters by their role in forming text elements and builds compound patterns from the resulting generalized character classes. It does not rely on the widely used regular expressions. A distinctive feature of the method is that a sequence of characters can serve as part of an interval template. Its strengths include successful tokenization of complex information objects (numbers, geographic coordinates, names of engineering product items, etc.), detailed classification of tokens at the moment they are formed, the ability to switch tokenization of a given token type on and off, and the addition of new templates derived from sample text to further train the system.
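The general idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class letters (`L`, `D`, `S`, `P`), the pattern table `PATTERNS`, the token type names, and the functions `char_class`, `match_pattern`, and `tokenize` are all hypothetical, and the reading of a template as a list of character-class runs with minimum and maximum lengths plus fixed character sequences is an assumption. The sketch only shows how tokens can be formed and typed from Unicode character classes without regular expressions, and how a token type can be switched off.

```python
import unicodedata

def char_class(ch):
    """Map a character to a coarse class (assumed scheme):
    L = letter, D = decimal digit, S = whitespace, P = anything else."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "L"
    if cat == "Nd":
        return "D"
    if cat.startswith("Z") or ch in "\t\n\r":
        return "S"
    return "P"

# Hypothetical templates. Each element is one of:
#   ("class", class_letter, min_len, max_len)  run of characters of one class
#   ("lit", text)                              fixed sequence of characters
# max_len = None means the run is unbounded.
PATTERNS = [
    ("DECIMAL", [("class", "D", 1, None), ("lit", "."), ("class", "D", 1, None)]),
    ("INTEGER", [("class", "D", 1, None)]),
    ("WORD", [("class", "L", 1, None)]),
]

def match_pattern(text, pos, elements):
    """Return the end position if all elements match starting at pos, else None."""
    for element in elements:
        if element[0] == "lit":
            literal = element[1]
            if not text.startswith(literal, pos):
                return None
            pos += len(literal)
        else:
            _, cls, lo, hi = element
            start = pos
            while (pos < len(text)
                   and char_class(text[pos]) == cls
                   and (hi is None or pos - start < hi)):
                pos += 1
            if pos - start < lo:
                return None
    return pos

def tokenize(text, enabled=None):
    """Split text into (type, value) pairs; the first matching template wins.
    `enabled` optionally restricts which token types are active."""
    tokens = []
    pos = 0
    while pos < len(text):
        if char_class(text[pos]) == "S":
            pos += 1  # whitespace separates tokens and is not emitted
            continue
        for name, elements in PATTERNS:
            if enabled is not None and name not in enabled:
                continue
            end = match_pattern(text, pos, elements)
            if end is not None:
                tokens.append((name, text[pos:end]))
                pos = end
                break
        else:
            # No template matched: emit the single character as its own token.
            tokens.append(("CHAR", text[pos]))
            pos += 1
    return tokens
```

In this sketch each token receives its type at the moment it is formed, and passing `enabled={"INTEGER", "WORD"}` turns the hypothetical `DECIMAL` template off, so a string such as `3.14` would then be split into an integer, a period, and another integer.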
Keywords:
tokenization, segmentation, graphematic analysis, computational linguistics, patterns, substitution, token.
Received: 15.09.2022
Citation:
Yu. V. Nikitin, A. A. Khoroshilov, A. E. Makarova, “Tokenization based on the method of functional patterns”, Sistemy i Sredstva Inform., 32:4 (2022), 59–68
Linking options:
https://www.mathnet.ru/eng/ssi856 https://www.mathnet.ru/eng/ssi/v32/i4/p59