Know Your Limits : On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning
Published in AI4Law Workshop at ICML 2026; AI4Math Workshop at ICML 2026, 2026
Large language models are increasingly tested as legal reasoning tools, formalizing contract clauses into strict logic and running them through automated solvers to verify whether one statement follows from another. Our new study re-annotating real non-disclosure agreements finds this added rigor is partly an illusion. Comparing five leading models across plain LLM judgment, LLM reasoning over formal logic, and an actual SMT solver (Z3), we found that formal structure boosted accuracy, but often because models reported solver-like conclusions without actually running the solver — a failure we call scope laundering, found in every model tested and in some cases over half the time (compounded by models missing logical constraints in their own formalizations and frequently generating faulty solver code), underscoring that looking logically grounded and being logically grounded aren’t the same thing as these tools edge closer to legal practice.
Recommended citation: Wang, O. P., Wong-Toropainen, S., Amrollahi, D., Bai, R., Bansal, T., Garg, A., & Gilpin, L. H. (2026). Know your limits: On the faithfulness of LLMs as solvers and autoformalizers in legal reasoning. arXiv. https://arxiv.org/abs/2606.16118
Download Paper
