V. Firsanova, “What do text-to-image models know about the languages of the world?”, Investigations on applied mathematics and informatics. Part II–1, Zap. Nauchn. Sem. POMI, 529, POMI, St. Petersburg, 2023, 157

Zapiski Nauchnykh Seminarov POMI


Zapiski Nauchnykh Seminarov POMI, 2023, Volume 529, Pages 157–175 (Mi znsl7425)

What do text-to-image models know about the languages of the world?

V. Firsanova

St. Petersburg State University, St. Petersburg, Russia

Abstract: Text-to-image models produce images from user-written prompts. Models such as DALL-E 2, Imagen, Stable Diffusion, and Midjourney can generate images that are photorealistic or resemble human drawings. Apart from imitating human art, large text-to-image models have learned to produce combinations of pixels reminiscent of captions in natural languages. For example, a generated image might contain a figure of an animal together with a combination of symbols resembling human-readable words that give the biological name of the species. Although the words that occasionally appear in generated images can be human-readable, they are not rooted in natural language vocabularies and make no sense to non-linguists. At the same time, we find that semiotic and linguistic analysis of the so-called hidden vocabulary of text-to-image models will contribute to the fields of explainable AI and prompt engineering. The results of this analysis can be used to reduce the risks of applying such models to real-life problem solving and to detect deepfakes. This study is one of the first attempts to analyze text-to-image models from the point of view of semiotics and linguistics. Our approach involves prompt engineering, image generation, and comparative analysis. The source code, generated images, and prompts are available at https://github.com/vifirsanova/text-to-image-explainable.

Key words and phrases: explainable artificial intelligence, text-to-image synthesis, diffusion models.

Received: 06.09.2023

Document Type: Article

UDC: 81.322.2

Language: English

Citation: V. Firsanova, “What do text-to-image models know about the languages of the world?”, Investigations on applied mathematics and informatics. Part II–1, Zap. Nauchn. Sem. POMI, 529, POMI, St. Petersburg, 2023, 157–175

Citation in format AMSBIB

\Bibitem{Fir23}
\by V.~Firsanova
\paper What do text-to-image models know about the languages of the world?
\inbook Investigations on applied mathematics and informatics. Part~II--1
\serial Zap. Nauchn. Sem. POMI
\yr 2023
\vol 529
\pages 157--175
\publ POMI
\publaddr St.~Petersburg
\mathnet{http://mi.mathnet.ru/znsl7425}

Linking options:

https://www.mathnet.ru/eng/znsl7425

https://www.mathnet.ru/eng/znsl/v529/p157
