Reading the Limitations Section: What the First RCT on AI and Legal Reasoning Actually Shows

May 28, 2026

Bednar, Cleveland, Erbsen, and Schwarcz have published the first randomized controlled trial testing whether AI use on legal tasks erodes the human reasoning that competent practice depends on. Nick Bednar et al., Artificial Intelligence and Human Legal Reasoning (Univ. of Minn. L. Sch., Apr. 2026). The study ran 91 second- and third-year law students through four sequential tasks: synthesizing doctrine from a source packet, answering comprehension questions, applying the doctrine to new facts, and revising their application memo with AI. The AI-exposed group used Gemini 2.5 Pro on the first and fourth tasks. The control group worked without AI until the revision stage. Contrary to the researchers’ own pre-registered hypothesis, the AI-exposed group did not show degraded reasoning. It outperformed the control group on the application task by 24 percent, even though neither group had access to AI for that task.

The result will be cited as evidence that AI use does not erode independent legal reasoning, and within the study’s conditions, it supports that claim. But those conditions are doing a great deal of the work, and commentary that treats the headline as a general finding about AI and legal reasoning reads the study too quickly. The limitations the authors identify (law students rather than practitioners, a closed source set, a single doctrinal area, structured prompting instructions, a specific model) describe, almost exactly, the institutional infrastructure that separates the firms where AI use produces good outcomes from the ones where it produces sanctions.

What the headline depends on

The study’s regression analysis identifies the mechanism behind the positive result, and it is not the one the headline suggests. Table 7 controls for each participant’s synthesis-task performance. Once that control is in place, the AI-exposed treatment effect on the application task drops to statistical insignificance. The only significant predictor of application-task quality in that model is the quality of the synthesis memo.

Rather than making the students better reasoners, AI helped them produce a better intermediate document, and that better document improved their downstream work. The benefit ran through the foundation, not through any enhancement of the students’ analytical capacity. When the two groups had equally strong synthesis memos, the AI-exposed group held no advantage.
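
For readers who want to see what that pattern looks like in a regression, here is a minimal sketch on synthetic data. It is not the study's data and not Table 7's specification. The sample size, effect sizes, noise levels, and variable names are all assumptions chosen so that the treatment affects the application score only through synthesis quality.

```python
import numpy as np


def ols(y, *columns):
    """Ordinary least squares; returns [intercept, coef_1, coef_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
n = 2000  # deliberately far larger than the study's 91, to keep noise small

# Hypothetical data-generating process (an assumption, not the study's):
# AI exposure raises synthesis quality, and synthesis quality alone
# drives the application score.
ai_exposed = rng.integers(0, 2, size=n).astype(float)
synthesis = 50 + 10 * ai_exposed + rng.normal(0, 5, size=n)
application = 20 + 0.8 * synthesis + rng.normal(0, 5, size=n)

# Model 1: application score on treatment only (the "headline" comparison).
total_effect = ols(application, ai_exposed)[1]

# Model 2: the same regression with synthesis quality added as a control.
coef = ols(application, ai_exposed, synthesis)
direct_effect, synthesis_effect = coef[1], coef[2]

print(f"treatment effect, no control:        {total_effect:.2f}")
print(f"treatment effect, synthesis control: {direct_effect:.2f}")
print(f"synthesis coefficient:               {synthesis_effect:.2f}")
```

By construction, the uncontrolled treatment coefficient should land near 10 × 0.8 = 8 points and shrink toward zero once synthesis quality enters the model. That is the signature of an effect that runs through the intermediate document. The sketch cannot say whether the real data fit this story as cleanly; it only shows what the pattern described above would look like if they did.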

Any litigator will recognize the mechanism. A junior associate who starts from a well-organized research memo will produce a stronger brief than one working from scattered notes. AI can serve as the organizing tool at the synthesis stage: structuring doctrine, mapping relationships among authorities, producing a coherent framework the lawyer can then apply to new facts. The 24-percent advantage holds up under the study’s controls for GPA, year in law school, and prior AI experience, but it is an advantage in document quality at an intermediate step, not an enhancement of the cognitive capacity that produced the downstream analysis.

The conditions that produced the result

The study’s limitations section functions less as a set of qualifications than as a description of the institutional choices that made the positive result possible.

Participants worked from a curated, twelve-page source packet: five edited sections of the Restatement (Third) of Property and four judicial opinions the authors wrote for a hypothetical fifty-first state. Participants received instructions for confining the AI to the same closed universe. No participant faced the open-ended research environment in which hallucination and source-reliability failures occur. The experiment eliminated, by design, the conditions that most commonly produce fabricated citations.

The experiment decomposed the assignment into four discrete tasks with separate time allotments and separate instructions. Participants were not told to “research this issue and write a memo.” They were told to synthesize first, then apply, then revise—each step bounded, each step with its own deliverable. That structure prevents the undifferentiated delegation that produces the worst AI outcomes: a lawyer pastes a problem into a chatbot and treats whatever comes back as finished work product.

Participants used a specific model selected by the researchers, under structured prompting instructions that guided their interaction with the tool. They were not handed a consumer chatbot and left to improvise.

Each design choice served a valid experimental purpose, but each also created conditions that most legal workplaces do not replicate. Curated sources sharply limit the opportunity for hallucination. Task decomposition prevents context pollution and forces the user to engage with intermediate products rather than accepting a single undifferentiated output. Structured prompts channel the interaction toward the kind of bounded, specifiable work where AI performs best. A selected, institutionally provided tool avoids the data-handling and quality-variance problems that arise when lawyers choose their own platforms. The experiment tested whether AI helps law students reason within an environment that eliminates the most common failure modes. The question for firms is whether they have built that environment, and the answer, for most, is that they have not.

The revision compression

The study’s revision-task results cut against the optimistic reading in a way that deserves more attention than it has received. When all participants used AI to revise their application memos, participants who started with weak memos improved, but participants who started with strong memos got worse. AI-assisted revision compressed performance toward the mean.

That compression has direct practice implications. AI revision pulls weak work up and strong work down, converging on a fluent, well-organized middle that lacks the sharpest insights of the best human analysis. The model produces prose that reads well, and the study’s rubric-scored results captured the gap between fluency and analytical quality—participants whose unrevised memos demonstrated sophisticated doctrinal application saw their scores decline after AI revision.
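
To make "compression toward the mean" concrete, here is a small arithmetic sketch. The four scores and the compression factor are hypothetical, invented for illustration and not drawn from the study's results. The linear form is only one simple way to model the pattern.

```python
from statistics import mean, pstdev


def compress(scores, k):
    """Pull each score toward the group mean; 0 < k < 1 means compression."""
    m = mean(scores)
    return [m + k * (s - m) for s in scores]


# Hypothetical rubric scores before AI revision, weakest to strongest.
initial = [40.0, 55.0, 70.0, 85.0]

# Hypothetical compression factor: each memo keeps half its distance
# from the group mean after revision.
revised = compress(initial, k=0.5)

for before, after in zip(initial, revised):
    print(f"{before:5.1f} -> {after:5.2f} ({after - before:+.2f})")

print(f"spread before: {pstdev(initial):.2f}, after: {pstdev(revised):.2f}")
```

In this toy model the group average does not move at all, which is why an average-only report of revision quality could hide the effect entirely: the weakest memo gains exactly what the strongest one loses.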

The authors suggest several explanations, including cognitive fatigue after three hours of work and the possibility that participants deferred to AI-generated edits without evaluating whether those edits strengthened the analysis. I find the second explanation more illuminating, because it describes a cognitive failure mode that practicing lawyers will recognize: the draft looks polished after the AI pass, the lawyer is tired, and the path of least resistance is to accept the revision without asking whether the AI’s organizational choices preserved the strongest parts of the analysis. The model reorganizes, smooths transitions, and makes everything read like competent legal prose—but in the process it can flatten the distinction between a routine point and the insight that carries the argument. A lawyer whose initial draft built carefully toward a counterintuitive conclusion may find, after AI revision, that the conclusion now reads as one item in a balanced list of considerations.

The revision task ran under a twenty-minute time limit at the end of a three-hour experiment. Those conditions—fatigue, time pressure, material the participant had already worked through at length—are the conditions under which practicing lawyers most often reach for AI revision tools. The study’s structured safeguards (decomposed tasks, a bounded source set, a selected model) were still in place during the revision stage. In practice, lawyers revising work product at 11 p.m. before a morning filing deadline have no such safeguards. They have a chatbot and a desire to be done. If AI-assisted revision degraded strong work even within the study’s controlled environment, the degradation in uncontrolled practice settings is likely worse—and less visible, because the degraded output still reads fluently.

For supervisory practice, this means a partner reviewing an associate’s AI-revised brief should be alert to the possibility that the brief’s strongest analysis appeared in an earlier draft and was softened during revision. Firms that encourage AI revision as a quality-improvement step should consider whether that step, without adequate oversight, functions instead as a quality-compression step. The study does not prove that AI revision always degrades strong work, but it provides the first controlled evidence that AI revision can do so, and under conditions far more favorable than those prevailing in practice.

What the study prescribes

The positive results on synthesis and application survive controls for GPA, year in law school, and prior AI experience, and the absence of comprehension deficits pushes back against the strongest version of the skill-erosion hypothesis. The study is pre-registered. The authors are candid about what it does and does not show. Its value for the practicing profession, though, turns on reading it as a conditional result—and then building the conditions.

The study’s experimental design maps onto a set of institutional practices that some firms already maintain and most do not. Constrained source sets are the experimental analog of firm-curated document repositories and jurisdictionally bounded research databases—environments that limit the model’s opportunity to fabricate authorities. Task decomposition has its counterpart in structured assignment workflows that require associates to produce intermediate deliverables (a research log, a case chart, a synthesis memo) before drafting the final product. Structured prompting instructions correspond to firm-developed prompt libraries and AI use guides that channel interactions toward bounded, specifiable tasks. And an institutionally selected model maps onto a firm-provided AI platform operating under commercial terms with appropriate confidentiality protections.

None of this requires novel technology. It simply requires decisions by firm leadership about how AI-assisted work is organized, what intermediate products are expected, and how the transition from AI-generated material to lawyer-authored analysis is structured and supervised.

The findings also bear on CLE programming. Most current AI offerings focus on what the tools can do and what ethical obligations attend their use—capability and compliance. Bednar et al.’s results suggest that the more productive focus is workflow design: how to decompose a legal task into stages where AI contributes at each stage without supplanting the lawyer’s evaluative role at any of them. The synthesis-then-application structure the study used maps onto how competent lawyers have always organized complex research assignments; the contribution of AI, under the right conditions, is to make each stage more productive without collapsing the distinction between them.

The profession’s instinct will be to cite the headline—24-percent improvement, no comprehension deficit, AI does not erode reasoning—and leave the limitations section to the academics. That would miss the study’s most useful contribution, which sits in the section the headline readers will skip. Firms that want to replicate the result should study what the experiment did, not just what it found.

This post draws on Nick Bednar, David Cleveland, Allan Erbsen & Daniel Schwarcz, Artificial Intelligence and Human Legal Reasoning (Univ. of Minn. L. Sch., Apr. 2026). The discussion of task decomposition and delegation draws on an earlier post on the task/judgment distinction, and the discussion of institutional infrastructure connects to the floor-versus-ceiling analysis in a recent post on open-source legal AI.