System comparison using automated generation of relevance judgements in multiple languages

  • Douglas W Oard ,
  • Eugene Yang ,
  • Dawn Lawrie ,
  • James Mayfield

2025 International ACM SIGIR Conference on Research and Development in Information Retrieval

Recent work has shown that Large Language Models (LLMs) can produce relevance judgements for English retrieval that are useful as a basis for system comparison, and they do so at vastly reduced cost compared to human assessors. Using relevance judgements and ranked retrieval runs from the TREC NeuCLIR track, this paper shows that LLMs can also produce reliable assessments in other languages, even when the topic description or the prompt is in a language different from that of the documents. Results with Chinese, Persian, and Russian documents show that although document language affects agreement with human assessors on both graded relevance and preference ordering among systems, prompt-language and topic-language effects are negligible. This has implications for the design of multilingual test collections, suggesting that prompts and topic descriptions can be developed in any convenient language.
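
To make the setup concrete, below is a minimal sketch of how an LLM might be prompted for a graded relevance judgement when the topic, prompt, and document are not all in the same language. It assumes an OpenAI-style chat completion API; the prompt wording, model name, and 0-3 relevance scale are illustrative assumptions, not the prompt or protocol actually used in the paper or the TREC NeuCLIR track.

```python
# Sketch of cross-language LLM relevance judgement (illustrative only).
from openai import OpenAI

client = OpenAI()

def judge_relevance(topic: str, document: str, prompt_language: str = "English") -> int:
    """Ask an LLM for a graded relevance judgement (0-3) of a document
    (e.g., in Chinese, Persian, or Russian) against a topic description
    that may be written in a different language than the document."""
    prompt = (
        f"You are assessing document relevance. Instructions are in {prompt_language}.\n"
        f"Topic description:\n{topic}\n\n"
        f"Document (possibly in a different language than the topic):\n{document}\n\n"
        "On a scale of 0 (not relevant) to 3 (highly relevant), how relevant is "
        "the document to the topic? Answer with a single digit."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice, not the paper's
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic judgements aid reproducibility
    )
    return int(response.choices[0].message.content.strip()[0])
```

Judgements produced this way can then be aggregated into qrels, used to score each system run, and compared against scores from human qrels (for example, via rank correlation over the induced system ordering), which is the kind of system-comparison reliability the abstract reports.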