DEDUCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning

ICLR 2025 - Workshop on Reasoning and Planning for LLMs

Despite strong performance on Olympiad-level reasoning problems, frontier large language models can still struggle on high school math. We study the nature of language models' (LMs') reasoning by analyzing their chain-of-thought traces. To avoid memorization issues, we present a framework that evaluates LM reasoning over novel, perturbed versions of benchmark problems. Formally, we compare LMs to ideal deductive reasoners that, given a set of premises, can derive valid conclusions over any number of reasoning hops. To assess reasoning performance beyond final-answer accuracy, we introduce deductive consistency, a metric that evaluates the correctness of a system's reasoning as the number of input premises and solution hops varies. Using this metric, we examine potential explanations for language models' failures on novel problems. Through experiments on GSM8K and a synthetic dataset, we find that the failures are not primarily due to shifts in language style or the propagation of early errors. Instead, they stem from a more fundamental limitation: as the number of reasoning hops increases, language models exhibit a decline in deductive consistency, an effect that was masked by memorization on existing benchmark problems. Our analysis offers a new way to characterize LM reasoning, as computation over a window of input premises and reasoning hops, that can provide unified evaluation across problem domains.
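
To make the metric concrete, below is a minimal, hypothetical sketch of how one might score deductive consistency by comparing a model's hop-level conclusions against those of an ideal deductive reasoner. The `ReasoningTrace` structure and exact-match scoring are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    # Hypothetical container: premises given to the reasoner and its
    # conclusion at each reasoning hop (assumption, not the paper's format).
    premises: list[str]
    conclusions: list[str]


def deductive_consistency(model_trace: ReasoningTrace,
                          ideal_trace: ReasoningTrace,
                          max_hops: int) -> float:
    """Fraction of hops (up to max_hops) where the model's conclusion
    matches the ideal deductive reasoner's conclusion."""
    hops = min(max_hops, len(ideal_trace.conclusions))
    if hops == 0:
        return 0.0
    # Count hops where the model's conclusion agrees with the ideal one;
    # exact string matching is a simplifying assumption for illustration.
    correct = sum(
        model_trace.conclusions[k].strip() == ideal_trace.conclusions[k].strip()
        for k in range(min(hops, len(model_trace.conclusions)))
    )
    return correct / hops


# Example: score a model trace against the ideal trace for a 3-hop problem.
ideal = ReasoningTrace(premises=["a = 2", "b = a + 3"],
                       conclusions=["b = 5", "c = 2 * b = 10", "answer = 10"])
model = ReasoningTrace(premises=["a = 2", "b = a + 3"],
                       conclusions=["b = 5", "c = 2 * b = 10", "answer = 12"])
print(deductive_consistency(model, ideal, max_hops=3))  # ~0.67
```

In this framing, sweeping `max_hops` (and the number of input premises) traces how consistency degrades with reasoning depth, which is the trend the paper reports for LMs on perturbed problems.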