Assessing LLMs for Natural Language Search in Poetry Databases
Ryan Kovatch
Committee: Eric Wills
Honors Bachelor's Thesis (May 2025)
Keywords: large language models, poetry, natural language processing, search algorithms

This project evaluates the performance of two major large language models (LLMs) — Google's Gemini and Meta's Llama — on their ability to identify poems in a database based on abstract characteristics of the text, namely form and technique, major themes, and historical context. Current search engines for poetry databases are primarily keyword-based, which makes the discovery of new poems difficult when the user does not know (or cannot predict) the content of the poem they are looking for. This paper uses a dataset developed by Walsh et al. (2024) and a technique called retrieval-augmented generation (RAG) to implement a natural-language search function, allowing users to ask more advanced questions of a poetry collection and retrieve results relevant to abstract queries. On a benchmark, the LLMs performed poorly at poetry retrieval tasks across the board, scoring low on both precision and recall, but patterns in the data indicated that the models performed better on some types of queries than others, and that the technology could become feasible for this use case in the near future.
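The abstract mentions retrieval-augmented generation: retrieve candidate documents relevant to a query, then pass them to an LLM as context. The thesis's actual pipeline (its dataset fields, embedding model, and prompt) is not described here, so the following is only a minimal illustrative sketch of the RAG pattern: toy poem descriptions, bag-of-words cosine similarity standing in for a real embedding model, and a prompt builder in place of an actual Gemini or Llama API call. All names and data below are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

# Toy stand-in for a poetry metadata dataset (the thesis uses Walsh et al. 2024).
POEMS = {
    "The Raven": "gothic narrative poem, trochaic octameter, themes of grief and loss",
    "Ozymandias": "sonnet on the impermanence of power, irony, desert imagery",
    "Song of Myself": "free verse celebrating the self, nature, and democracy",
}

def _vec(text: str) -> Counter:
    """Bag-of-words vector; a real system would use learned embeddings."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval step: rank poems by similarity to the query, keep the top k."""
    q = _vec(query)
    ranked = sorted(POEMS, key=lambda t: _cosine(q, _vec(POEMS[t])), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Augmentation step: prepend retrieved context before querying the LLM."""
    context = "\n".join(f"- {t}: {POEMS[t]}" for t in retrieve(query))
    return f"Using only these candidate poems:\n{context}\n\nAnswer the query: {query}"

print(retrieve("sonnet about power and irony"))
```

The generation step (sending `build_prompt(...)` to Gemini or Llama and parsing its answer) is omitted since it depends on model-specific APIs not covered in this abstract.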

Email info@cs.uoregon.edu for access to this document.