IDENTIFICATION OF SIGNIFICANT THEMES USING THE LDA ALGORITHM ON THE MATERIAL OF GERMAN MEDIA DISCOURSE

Authors

  • Mikhail V. Koryshev St. Petersburg State University, Address: 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation https://orcid.org/0000-0001-8946-4431
  • Maria V. Khokhlova St. Petersburg State University, Address: 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation https://orcid.org/0000-0001-9085-0284
  • Liubov A. Kulikova St. Petersburg State University, Address: 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation
  • Konstantin V. Mazin St. Petersburg State University, Address: 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation

DOI:

https://doi.org/10.21638/spbu33.2024.122

Abstract

Topic modeling methods make it possible to gain insight into the thematic content of texts and to identify latent semantic structures. Each text can be represented by several topics, which allows one to measure text similarity and, more broadly, to detect trends characteristic of texts aimed at a particular audience. The goal of this paper is to compile a set of themes that interest readers, based on German discourse covering various aspects of life. We present the results of building several topic models with the Latent Dirichlet Allocation (LDA) algorithm on texts from the German journal “Zeitschrift für Ideengeschichte”, which covers the history and development of political, religious, philosophical, and literary ideas, and from “Moritz.Magazin”, the student periodical of the University of Greifswald. The extracted keywords were evaluated by experts. The results show that over time the “Zeitschrift für Ideengeschichte” shifts from narrower to broader topics. The analysis revealed relatively low similarity between texts of this journal published in different years, whereas texts published in the same year share common features, a similarity identified with the TF-IDF measure. Despite the declared diversity of the two periodicals, a political component is common to both: “Moritz.Magazin” displays it more prominently, while in the “Zeitschrift für Ideengeschichte” articles the political orientation is conveyed indirectly, through references to prominent thinkers and themes. The study thus defines a preliminary set of topics of interest to two significant groups within Germany's educated university community, outlining part of the country's intellectual landscape.
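
The abstract does not reproduce the modeling pipeline itself, but the cited software framework for topic modelling with large corpora (Řehůřek, Sojka 2010) corresponds to the gensim library, so a typical workflow of the kind described above can be sketched as follows. This is a minimal illustration rather than the authors' actual code: the input file name, the preprocessing steps, the number of topics, and all parameter values are assumptions.

from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

# Hypothetical input: one article per line, already cleaned.
with open("zeitschrift_fuer_ideengeschichte.txt", encoding="utf-8") as f:
    raw_docs = [line.strip() for line in f if line.strip()]

# Simple tokenization; a real setup would add German stop-word removal
# and lemmatization (cf. Wartena 2019 in the references).
docs = [simple_preprocess(doc, deacc=False) for doc in raw_docs]

# Vocabulary and bag-of-words corpus.
dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # prune very rare / very frequent terms
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# LDA topic model; the number of topics is an assumption and would normally
# be chosen by comparing coherence scores (cf. Röder et al. 2015).
lda = models.LdaModel(bow_corpus, id2word=dictionary,
                      num_topics=10, passes=10, random_state=42)

# Top keywords per topic, analogous to the keyword lists evaluated by experts.
for topic_id, words in lda.show_topics(num_topics=10, num_words=10, formatted=False):
    print(topic_id, [w for w, _ in words])

# Pairwise document similarity via TF-IDF, the measure used in the paper
# to compare texts from the same and from different publication years.
tfidf = models.TfidfModel(bow_corpus)
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
print(index[tfidf[bow_corpus[0]]][:5])  # similarity of the first article to the first five

In the study itself, the resulting topic keywords were then assessed by experts, and the TF-IDF similarities were used to compare articles within and across publication years.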

Keywords:

topics, keywords, Latent Dirichlet Allocation (LDA), media discourse, German language


Author Biographies

Mikhail V. Koryshev, St. Petersburg State University, Address: 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation

Candidate of Philological Sciences, Associate Professor at the Department of Comparative Study of Languages and Cultures, St. Petersburg State University

Maria V. Khokhlova, St. Petersburg State University, Address: 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation

Candidate of Philological Sciences, Associate Professor at the Department of Mathematical Linguistics, St. Petersburg State University

Liubov A. Kulikova, St. Petersburg State University, Address: 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation

Research Assistant, St. Petersburg State University

Konstantin V. Mazin, St. Petersburg State University, Address: 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation

Research Assistant, St. Petersburg State University

References

Sources

Moritz.Magazin. URL: https://webmoritz.de/moritz-magazin/ (accessed: 18.01.2024).

Zeitschrift für Ideengeschichte. URL: https://www.wiko-berlin.de (accessed: 18.01.2024).


Literature

Kirina M.A. A Comparison of Topic Models Based on LDA, STM and NMF for Qualitative Studies of Russian Short Prose // Vestnik NSU. Series: Linguistics and Intercultural Communication. 2022. Vol. 20. No. 2. P. 93–109. (In Russian)

Blei D.M., Ng A.Y., Jordan M.I. Latent Dirichlet Allocation // Journal of Machine Learning Research. 2003. Vol. 3 (4–5). P. 993–1022.

Deerwester S., Dumais S.T., Furnas G.W., Landauer Th.K., Harshman R. Indexing by latent semantic analysis // Journal of the American Society for Information Science. 1990. Vol. 41 (6). P. 391–407.

Dehler-Holland J., Schumacher K., Fichtner W. Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act // Patterns. 2021. Vol. 2. Article 100169.

Hofmann Th. Probabilistic Latent Semantic Indexing // Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99). New York: Association for Computing Machinery, 1999. P. 50–57.

Küsters A., Garrido E. Mining PIGS. A structural topic model analysis of Southern Europe based on the German newspaper Die Zeit (1946–2009) // Journal of Contemporary European Studies. 2020. Vol. 28 (4). P. 477–493. https://doi.org/10.1080/14782804.2020.1784112

Landauer T.K., Foltz P.W., Laham D. Introduction to Latent Semantic Analysis // Discourse Processes. 1998. Vol. 25. P. 259–284.

Lee D., Seung H. Learning the Parts of Objects by Non-Negative Matrix Factorization // Nature. 1999. Vol. 401. P. 788–791.

Řehůřek R., Sojka P. Software Framework for Topic Modelling with Large Corpora // Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. Valletta: University of Malta, 2010. P. 46–50.

Roberts M., Stewart B., Tingley D., Airoldi E. The Structural Topic Model and Applied Social Science // Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation. 2013. URL: https://projects.iq.harvard.edu/files/wcfia/files/stmnips2013.pdf (accessed: 18.01.2024).

Röder M., Both A., Hinneburg A. Exploring the Space of Topic Coherence Measures // Proceedings of the Eighth International Conference on Web Search and Data Mining, Shanghai, February 2–6. Shanghai: ACM, 2015. P. 399–408.

Wartena Ch. A probabilistic morphology model for German lemmatization // Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers / German Society for Computational Linguistics & Language Technology. Erlangen: Friedrich-Alexander-Universität Erlangen-Nürnberg, 2019. P. 40–49. https://doi.org/10.25968/opus-1527


References

Blei D.M., Ng A.Y., Jordan M.I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003, vol. 3 (4–5), pp. 993–1022.

Deerwester S., Dumais S.T., Furnas G.W., Landauer Th.K., Harshman R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, vol. 41 (6), pp. 391–407.

Dehler-Holland J., Schumacher K., Fichtner W. Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act. Patterns, 2021, vol. 2, article 100169.

Hofmann Th. Probabilistic Latent Semantic Indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99). New York: Association for Computing Machinery, 1999, pp. 50–57.

Kirina M.A. A Comparison of Topic Models Based on LDA, STM and NMF for Qualitative Studies of Russian Short Prose. Vestnik NSU. Series: Linguistics and Intercultural Communication, 2022, vol. 20, no. 2, pp. 93–109. (In Russian)

Küsters A., Garrido E. Mining PIGS. A structural topic model analysis of Southern Europe based on the German newspaper Die Zeit (1946–2009). Journal of Contemporary European Studies, 2020, vol. 28 (4), pp. 477–493. https://doi.org/10.1080/14782804.2020.1784112

Landauer T.K., Foltz P.W., Laham D. Introduction to Latent Semantic Analysis. Discourse Processes, 1998, vol. 25, pp. 259–284.

Lee D., Seung H. Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature, 1999, vol. 401, pp. 788–791.

Řehůřek R., Sojka P. Software Framework for Topic Modelling with Large Corpora. Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. Valletta, University of Malta, 2010, pp. 46–50.

Roberts M., Stewart B., Tingley D., Airoldi E. The Structural Topic Model and Applied Social Science. Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation, 2013. Available at: https://projects.iq.harvard.edu/files/wcfia/files/stmnips2013.pdf (accessed: 18.01.2024).

Röder M., Both A., Hinneburg A. Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth International Conference on Web Search and Data Mining, Shanghai, February 2–6. Shanghai, ACM, 2015, pp. 399–408.

Wartena Ch. A probabilistic morphology model for German lemmatization. Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, German Society for Computational Linguistics & Language Technology. Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 2019, pp. 40–49. https://doi.org/10.25968/opus-1527

Published

2025-03-25

How to Cite

Koryshev, M. V., Khokhlova, M. V., Kulikova, L. A., & Mazin, K. V. (2025). IDENTIFICATION OF SIGNIFICANT THEMES USING THE LDA ALGORITHM ON THE MATERIAL OF GERMAN MEDIA DISCOURSE. German Philology at the St Petersburg State University, 14, 418–433. https://doi.org/10.21638/spbu33.2024.122

Section

III. GERMAN LANGUAGE AND COMMUNICATION AT PRESENT: NEW APPROACHES AND METHODS OF ANALYSIS