An article by a group of authors including Ilyas Aslanov and Evgeny Kotelnikov, staff members of the School of Computational Social Sciences (SCSS) at the European University, has been published in a special issue of the journal Supercomputing Frontiers and Innovations devoted to large language models. The authors assess how well modern open large language models answer questions from the famous game "What? Where? When?"
In this study, the authors introduced a new dataset of 2,600 "What? Where? When?" questions collected from 2018 to 2025. Using structural and thematic clustering, they gave a detailed overview of question types and knowledge areas, and they evaluated 14 modern open LLMs using automatic metrics and the LLM-as-a-Judge approach, in which a separate language model grades the answers.
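The article reports results rather than code, but the LLM-as-a-Judge step can be pictured as a short grading loop. The sketch below is illustrative only and assumes an OpenAI-compatible API; the judge model name, the prompt wording, and the judge_answer and accuracy helpers are hypothetical, not taken from the paper.

```python
# Minimal LLM-as-a-Judge sketch: a judge LLM decides whether a model's
# quiz answer matches the reference answer. Illustrative only; the
# paper's actual prompts, judge model, and scoring rules may differ.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key

JUDGE_PROMPT = """You are grading a quiz answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(question: str, reference: str, candidate: str,
                 judge_model: str = "gpt-4o-mini") -> bool:
    """Ask the judge LLM whether the candidate matches the reference."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  reference=reference,
                                                  candidate=candidate)}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip().upper() == "CORRECT"

def accuracy(triples):
    """Fraction of (question, reference, candidate) triples judged correct."""
    verdicts = [judge_answer(q, ref, cand) for q, ref, cand in triples]
    return sum(verdicts) / len(verdicts)
```

Under such a judge, accuracy is simply the share of questions marked correct, which is one way figures like those quoted below can be computed.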
The strongest open models, such as Qwen3-235B-A22B-Thinking and DeepSeek-R1, approach the average performance of human teams but do not surpass it: the best model achieved only 32.4% accuracy, versus 45.8% for an average team of human experts. Architectures with extensive reasoning capabilities consistently outperformed their non-reasoning counterparts, especially in the "technology," "ancient world," "psychology," and "nature" categories, while questions involving wordplay, assumptions, and common nouns proved difficult for all the models.
These results highlight both the progress of modern open LLMs and their current limitations in intellectual reasoning within a quiz format.
The full findings and test results for all 14 open models are available in the article.