SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES
Accessible Russian large language models: open-sourced models and instructive datasets for commercial applications
D. Kosenko^{a,b}, Yu. Kuratov^{a,b,c}, D. Zharikova^{b}
^a Moscow Institute of Physics and Technology (National Research University), Moscow, Russia
^b DeepPavlov, Moscow, Russia
^c Artificial Intelligence Research Institute, Moscow, Russia
Abstract:
This paper presents an approach to developing and fine-tuning large language models for Russian that can follow instructions across domains. XGLM-4.5B, LLaMA-1 7B, LLaMA-1 13B, LLaMA-2 7B, LLaMA-2 13B, and ruGPT-3.5 13B were used as base models. The work compares two main fine-tuning techniques: fine-tuning all model parameters and fine-tuning with LoRA layers. To build the fine-tuning dataset, several open English-language data sources were used, including Databricks Dolly 15k, the OpenAssistant Conversations Dataset (OASST1), and chip2-instruct-alpha-v6a-1, which were translated into Russian with the WMT21 En-X model. The work shows that the quality of the training instructions significantly affects performance on automatic evaluation benchmarks such as MT-Bench and MMLU. At the same time, models trained on the commercially licensed dataset collected in this work achieve results comparable to models fine-tuned on the Saiga dataset, which carries a restrictive license. The fine-tuned language models and the collected Russian-language dataset are released as open source under licenses suitable for commercial use.
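To make the LoRA fine-tuning setup described above concrete, the sketch below shows a minimal training loop on translated Russian instruction data, assuming the Hugging Face transformers/peft/datasets stack. The base model name, LoRA hyperparameters, file name ru_instructions.jsonl, and prompt template are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch of LoRA instruction tuning for a Russian dataset (assumed setup,
# not the authors' released code): Hugging Face transformers + peft + datasets.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, TaskType, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # assumption: any base model listed in the abstract
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# LoRA adapters on the attention projections; rank/alpha/dropout are placeholder values.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Hypothetical JSONL file with instruction/response pairs already translated to Russian
# (e.g. Dolly 15k / OASST1 / chip2 passed through the WMT21 En-X model).
data = load_dataset("json", data_files="ru_instructions.jsonl", split="train")

def to_features(example):
    # Illustrative prompt format: "Инструкция: ... Ответ: ..."
    text = f"Инструкция: {example['instruction']}\nОтвет: {example['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = data.map(to_features, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-7b-ru-lora",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Full-parameter fine-tuning, the other technique compared in the paper, corresponds to the same loop without the LoRA wrapping step; only the adapter configuration changes.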
Keywords:
large language models, language models, language models in Russian.
Citation:
D. Kosenko, Yu. Kuratov, D. Zharikova, “Accessible Russian large language models: open-sourced models and instructive datasets for commercial applications”, Dokl. RAN. Math. Inf. Proc. Upr., 514:2 (2023), 262–269; Dokl. Math., 108:suppl. 2 (2023), S393–S398
Linking options:
https://www.mathnet.ru/eng/danma471
https://www.mathnet.ru/eng/danma/v514/i2/p262