Answer Quality Is Not Learning Impact: The Stanford AI-Tutoring Study and Hybrid Legal Education

June 5, 2026

Salinas, Nyarko, and colleagues from fourteen U.S. law schools have published the first large-scale blind evaluation of AI tutoring in a domain where quality depends on judgment rather than factual accuracy. Alejandro Salinas et al., Law Professors Prefer AI Over Peer Answers (Stanford Law Sch., May 2026). Sixteen contracts professors wrote answers to forty questions representative of what students ask during office hours—questions about case holdings, doctrinal principles, hypotheticals requiring application of rules to new facts, and policy. The professors then judged anonymized pairs of answers without knowing which came from a colleague and which from one of two AI systems (Gemini 2.5 Pro and a NotebookLM instance grounded in the shared casebook). Across 2,918 forced-choice comparisons, professors preferred the AI-generated answer 75 percent of the time.

The margin is notable but less remarkable than the consistency. Every participating judge preferred AI answers over human ones; the least AI-favorable judge still chose the AI 56 percent of the time. The preference held across all four question categories, including hypotheticals and policy questions where quality hinges on reasoning and weighing competing arguments rather than recall. Professors flagged AI answers as pedagogically harmful—likely to mislead a student or hinder learning—3.5 percent of the time, versus 12 percent for answers written by their peers. A follow-up analysis using a separate LLM as judge extended the comparison to nine additional models, including Claude Opus 4.7, ChatGPT 5.4, and Gemini 3.1 Pro. Every model tested outperformed the human instructors. The advantage belonged to the technology class, not to any particular system.

The study is well designed. The blinding protocol, forced-choice format, and inter-coder agreement analysis are rigorous. The authors tested whether the AI advantage could be explained by surface-level features (length, clarity, structural organization) and found that it could not: the preference for AI answers persisted after controlling for lexico-syntactic differences, suggesting that the advantage was driven at least partly by the substance of the answers rather than their polish. The research team drew from both top-ranked and regional law schools, used a common casebook to ensure comparability, and calibrated AI responses to match the length and structure of human ones. A finding that experienced professors prefer AI-generated explanations over their colleagues’ in a field defined by judgment rather than recall is not easily dismissed.

It is easy to overread. The study’s own authors are careful about what it shows: “Our design evaluates answer quality—which response an expert would prefer to deliver under blinding—rather than learning impact. We therefore treat our results as an encouraging first indication... not as proof of improved student outcomes.” The measure was not whether students learn better from the AI answer. It was whether an experienced instructor, reading two answers blind, would choose the AI’s explanation over a colleague’s. The study tested the quality of the answer, not the quality of the teaching.

What the finding means for hybrid programs

That distinction carries the most force in the setting where the practical stakes are immediate: the growing number of ABA-accredited hybrid JD programs that deliver a substantial share of their curriculum asynchronously. Programs like Syracuse’s JDinteractive and Mitchell Hamline’s Hybrid J.D. combine recorded lectures, online assignments, and limited synchronous sessions—typically intensive weekend residencies or scheduled video meetings—with the structured self-pacing that makes a law degree accessible to working professionals. The asynchronous portion handles the explanatory work: delivering doctrine, walking through analysis, illustrating the application of rules to facts.

The structural weakness of these programs has always been student support. A residential student who does not understand the parol evidence rule can walk down the hall to office hours. A hybrid student submits a question to a discussion board or waits for the next scheduled session. The gap between confusion and clarification can be days, and the support available during that interval—TA responses, peer discussion, asynchronous email exchanges—varies widely across programs and across the semester.

The Salinas study tested exactly the function where hybrid programs are weakest: on-demand, short-answer explanatory support of the kind students seek during office hours. The study’s question pool was designed to represent what students ask after class. The answers were short, focused, and explanatory—not multi-page research memos, but the kind of doctrinal clarification that helps a student past a conceptual bottleneck. If professors prefer AI-generated versions of those answers 75 percent of the time, the case for adopting AI tutors as the primary support mechanism for hybrid programs’ asynchronous components looks difficult to resist. The tool gives better answers, remains available around the clock, reaches every enrolled student without the scheduling constraints that limit faculty availability, and—on the study’s own data—produces pedagogically harmful responses at roughly a third of the rate that human instructors do.

The ABA’s revised accreditation standards add pressure from a different direction. Revised Standard 314, which I discussed in April, requires formative assessments with feedback tied to stated learning outcomes in every first-year course. Hybrid programs implementing that requirement need scalable mechanisms for providing individualized feedback. An AI tutor calibrated to a program’s course-level learning outcomes and grounded in its assigned materials could, in principle, serve the feedback loop that Standard 314 contemplates—one that most hybrid programs currently struggle to sustain with available faculty time.

The argument looks strong because the study tested exactly the variable that favors it.

What the study did not test

Teaching involves more than answering questions well. The observation is plain enough to state and easy enough to forget when a 75-percent preference rate is on the table.

The Socratic method, still the dominant pedagogical approach in first-year law courses, including the synchronous components of most hybrid programs, is built on the opposite of answering. The professor asks rather than tells. The question is calibrated to expose the student’s assumption, force the student to reason through the implications, and reveal the instability of what the student thought she understood. The pedagogical value lies not in the answer the professor eventually provides but in the cognitive work the student does before receiving it. A student who asks “does the mailbox rule apply to email acceptances?” and receives a clear, well-reasoned answer has acquired information. A student who instead receives the question “what would happen to the mailbox rule’s rationale if the offeror can confirm receipt instantaneously?” has been pushed to reason about the rule’s purpose in a way the answer alone does not produce.

The study’s design cannot capture this function because the design presupposes that the student has asked a question and the task is to answer it. The metric—which answer would you prefer to deliver—assumes delivery. The most productive tutoring interaction may be the one in which the tutor does not answer the question at all, or answers it with a question that forces the student to find the answer herself. No blind forced-choice comparison between two answers can measure the pedagogical value of withholding one.

There is a related function that experienced instructors perform and AI systems currently do not: reading the question behind the question. A student who asks about the parol evidence rule may be confused about the rule itself, or may be confused about something upstream—the distinction between interpretation and supplementation, or the difference between a fully integrated and a partially integrated agreement—and the surface question masks a deeper misunderstanding. An experienced professor recognizes the pattern because she has seen it before, across hundreds of students, and knows that this particular question usually signals a specific conceptual gap. The AI tutor answers the question as asked. The human tutor may answer a different question—the one the student did not know to ask—because a precise response to the wrong question is pedagogically inert.

I have written about the sycophancy problem in AI systems: the documented tendency of large language models to affirm the user’s framing rather than challenge it. The tendency is directly relevant in a tutoring context. A student who asks “I think the consideration requirement is met here because the promisee incurred a detriment—is that right?” is presenting an analysis and asking for validation. A professor who sees a flaw in the reasoning will say so, and will press the student to locate the error before providing the correction. The AI tutor, even when it identifies the mistake, tends toward a softer response that preserves the student’s framing: “that’s a reasonable starting point, but consider whether...” The pedagogical difference between being told your analysis is wrong and being told it is a reasonable starting point is the difference between restructuring your understanding and adding a qualification to an analysis you still believe is basically correct. The study tested the AI’s response to student questions—which answer do you prefer?—not the AI’s response to student reasoning, which is the interaction where the tendency toward affirmation does its damage.

What the hybrid format does to the gap

The gap between answering and teaching applies everywhere legal education happens, but the hybrid format changes the arithmetic.

In a residential program, the synchronous component—classroom instruction, office hours, hallway conversations, supervised study groups—provides the human functions. The professor challenges student reasoning in class. She reads confusion in a student’s face during a cold call. She models professional judgment through the questions she asks and the reasoning she demonstrates in real time. A residential student who uses an AI tutor for after-hours doctrinal clarification still has those human interactions five days a week. The AI handles explanation; the human handles everything else. The division works because the human contact hours are plentiful enough to absorb the functions the AI cannot perform.

Hybrid programs do not have that margin. The synchronous component is already compressed—an intensive residency every few weeks, a scheduled video session each week, a limited number of real-time interactions across the semester. If the asynchronous component replaces its existing human support with AI tutors that give better answers, the total human-instructor contact declines further. The functions that only a human performs—challenging reasoning, reading confusion, modeling the professional judgment the degree is supposed to develop—must be concentrated in whatever synchronous time remains. If that time was already insufficient for those functions, reducing the human presence in the asynchronous portion makes the insufficiency worse.

The compression analogy

The Salinas study invites comparison to the Bednar study I discussed last month. Bednar et al. found that law students who used AI on a synthesis task outperformed their peers on a later reasoning task—but when all participants used AI to revise their work, the revision compressed performance toward the mean, pulling weak work up and strong work down. The compression occurred because students deferred to the AI’s organizational choices without evaluating whether those choices preserved the strongest parts of their analysis.

An AI tutor that consistently delivers better explanations than the available human instructors could produce an analogous effect—not on written work product, but on the student’s analytical development. If every student’s understanding of consideration or the parol evidence rule is shaped by the same AI-generated explanation—clear, well-organized, preferred by professors 75 percent of the time—the result may be a cohort that understands the doctrine through the same framework, with the same emphasis, organized around the same examples. The analytical diversity that arises when different professors stress different aspects of the material, challenge students’ reasoning in different directions, and bring different practice experiences to the same doctrine is a feature of legal instruction, not an inefficiency. Whether AI-mediated uniformity in doctrinal explanation produces better or worse legal reasoning downstream is a question the Salinas study does not answer. It is a question a hybrid program should ask before treating the 75-percent preference rate as a mandate for adoption.

What the study prescribes

Read carefully, the study supports more than its headline suggests and less than the commentary will claim.

It demonstrates that current AI systems produce explanatory content in contracts law that experienced professors, evaluating blind, prefer to what their colleagues produce. That finding should retire the argument about whether AI-generated explanations are good enough for law students. On the study’s data, they are not merely good enough; they are better, as judged by the people whose professional standards define what “better” means. Programs that have resisted AI tutoring on quality grounds will need a different argument.

The study does not demonstrate that AI tutors produce better learning outcomes. The authors say so and propose course-embedded randomized controlled trials as the next step. Until those trials are run, the preference data tells us what professors would rather deliver, not what students retain, apply, or build on. The history of educational technology is full of interventions that produced better content and worse learning (interactive textbooks that replaced active reading with passive consumption, recorded lectures that offered polish at the cost of engagement, adaptive testing platforms that optimized item difficulty without improving comprehension) because the quality of the content was not the binding constraint on whether students learned.

For hybrid JD programs, the practical implication is to give AI the function it performs best—on-demand, high-quality doctrinal explanation—and use the resulting savings to protect and enrich the human interactions where the program’s educational value resides. That means redirecting the time faculty no longer spend answering the same contracts questions every semester toward more intensive synchronous work: smaller sessions where faculty challenge student reasoning in real time, structured exercises where students receive individualized feedback on their own analysis rather than a model answer, supervised practice where students learn to exercise judgment under conditions an AI tutor cannot replicate because the AI, by design, provides the answer rather than withholding it.

The study shifts the burden of justification. Before Salinas, a law professor could justify her role in part by the quality of her explanations. That justification is no longer available, at least not in contracts, at least not for the kind of office-hours question the study tested. What remains is everything the study did not measure: the challenge that precedes the explanation, the ability to read what a student does not yet know she does not understand, the modeling of professional judgment that cannot be demonstrated by answering a question no matter how well the answer is constructed. Those are the functions hybrid programs should be designing their synchronous components around. They are also the functions most at risk when the institutional response to a 75-percent preference rate is to let the AI handle the teaching.

This post draws on Alejandro Salinas et al., Law Professors Prefer AI Over Peer Answers (Stanford Law Sch., May 2026), and the Stanford Report’s coverage of the study. The discussion of performance compression draws on Nick Bednar et al., Artificial Intelligence and Human Legal Reasoning (Univ. of Minn. L. Sch., Apr. 2026), discussed in an earlier post. The discussion of AI sycophancy in professional contexts builds on “Sycophancy as a Failure Mode in AI-Assisted Legal Reasoning,” and the ABA accreditation requirements are discussed in “Revised Standard 314: Learning-Outcomes Requirements and the August Deadline.”