# Generative Interpretation

Canonical page: https://works.battleoftheforms.com/papers/ssrn-4526219/

[p. 1]
Public Law and Legal Theory Research Paper Series
Research Paper No. 23-27
Generative Interpretation
Yonathan A. Arbel
UNIVERSITY OF ALABAMA-SCHOOL OF LAW
David A. Hoffman
UNIVERSITY OF PENNSYLVANIA CAREY LAW SCHOOL
This paper can be downloaded without charge from the
Social Science Research Network
Electronic Paper collection: https://ssrn.com/abstract=4526219.

[p. 2]
ARBEL & HOFFMAN
Generative Interpretation
Yonathan A. Arbel & David A. Hoffman*
99 N.Y.U. L. REV. __ (forthcoming 2024)
[DRAFT October 27, 2023]
We introduce generative interpretation, a new approach to estimating contractual meaning
using large language models. As AI triumphalism is the order of the day, we proceed by way
of grounded case studies, each illustrating the capabilities of these novel tools in distinct
ways. Taking well-known contracts opinions, and sourcing the actual agreements that they
adjudicated, we show that AI models can help factfinders ascertain ordinary meaning in context, quantify ambiguity, and fill gaps in parties’ agreements. We also illustrate how models
can calculate the probative value of individual pieces of extrinsic evidence.
After offering best practices for the use of these models given their limitations, we consider
their implications for judicial practice and contract theory. Using large language models permits courts to estimate what the parties intended cheaply and accurately, and as such generative interpretation unsettles the current interpretative stalemate. Their use responds to efficiency-minded textualists and justice-oriented contextualists, who argue about whether
parties will prefer cost and certainty or accuracy and fairness. Parties—and courts—would
prefer a middle path, in which adjudicators strive to predict what the contract really meant,
admitting just enough context to approximate reality while avoiding unguided and biased
assimilation of evidence. As generative interpretation offers this possibility, we argue it can
become the new workhorse of contractual interpretation.
* Irving Silver Associate Professor, University of Alabama School of Law & William A. Schnader Professor,
University of Pennsylvania Carey School of Law. We thank participants at faculty workshops at Minnesota,
Penn, Texas A&M, and Yale, and Vince Buccola, Jon Choi, James Grimmelmann, Erik Knutsen, Jeff Lipshaw,
Omri Ben-Shahar, David Stein, Kevin Tobia, and Polk Wagner, and Michael Hurley, Elizabeth Meeker, and JD Uglum
for helpful research assistance.

[p. 3]
TABLE OF CONTENTS
I. CONTRACT INTERPRETATION AS PREDICTION ........................ 8
II. GENERATIVE INTERPRETATION ........................................................ 21
A. A Gentle Introduction to Large Language Models ........................... 24
B. LLMs as a Source of Contractual Meaning ........................................ 29
C. The Ambiguity Problem ........................................................................ 31
D. Filling Gaps .............................................................................................. 37
E. From Text to Context ............................................................................ 41
III. THE FUTURE OF CONTRACT INTERPRETATION ..................... 43
A. Interpretation for the 99%? .................................................................. 45
B. Beyond the Textualist/Contextualist Divide .................................... 54
CONCLUSION .......................................................................................................... 57
INTRODUCTION
When New Orleans’ levees broke during Hurricane Katrina, devastation, both human and economic, swept the city. And then came the lawyers. In mass contract litigation
by policyholders against their insurance companies, advocates fighting over tens of billions
of dollars of potential liability ultimately contested the meaning of a single word, representing a concept the companies had excluded from coverage: Flood.1 Plaintiffs labored first to
convince judges that flood might not mean water damage caused by humans, so they could
then prove to a factfinder that their insurance policies didn’t contemplate damage resulting
from negligence by the Army’s Corps of Engineers.2 Lawyers for the defense argued that the
1 In re Katrina Canal Breaches Litig., 495 F.3d 191, 199 (5th Cir. 2007) (“We will not pay for loss or damage
caused directly or indirectly by any of the following. Such loss is excluded regardless of any other cause or event
contributing concurrently or in any sequence to the loss. . . . Water . . . Flood, surface water, waves, tides, tidal
waves, overflow of any body of water, or their spray, all whether driven by wind or not . . . .”).
2 In re Katrina Canal Breaches Litig., 495 F.3d 191, 197, 199, 200–01, 203–04 (5th Cir. 2007); Brief for Appellee-Cross Appellant Humphreys at 16–18, In re Katrina Canal Breaches Litig., 495 F.3d 191 (5th Cir.
2007) (No. 07-30119), 2007 WL 4266576; Brief for Plaintiff-Appellee Xavier Univ. of La. at 17–44, In re
Katrina Canal Breaches Litig., 495 F.3d 191 (5th Cir. 2007) (No. 07-30119), 2007 WL 4266583; Brief of the
Chehardy Representative Policyholders in Response at 14–41, In re Katrina Canal Breaches Litig., 495 F.3d
191 (5th Cir. 2007) (No. 07-30119), 2007 WL 4266578. On the scope, source, and allocation of negligence
see ANDY HOROWITZ, KATRINA: A HISTORY, 1915-2015, 1–12, 128–33 (2020); see also Campbell Robertson & John Schwartz, Decade After Katrina, Pointing Finger More Firmly at Army Corps, N.Y. TIMES (May
23, 2015), https://www.nytimes.com/2015/05/24/us/decade-after-katrina-pointing-finger-more-firmly-at-army-corps.html.

[p. 4]
word was unambiguous in context, covering rising waters no matter their cause, and therefore no further factfinding was necessary.3 Here, as so often in real court proceedings, though
rarely in law school classrooms, expensive, cumbersome and unsatisfactory processes of contract interpretation took center stage.4
After years of litigation, the Fifth Circuit—in the best-known and most consequential contracts case of the last generation5—held that flood was unambiguous: It meant any
inundation, regardless of cause.6 To get to that outcome, it engaged in the most artisanal and
articulated form of textualism available in late-stage Capitalism. The court consulted four
dictionaries, one encyclopedia, two treatises, a medley of for-and-against, in-and-out-of-jurisdiction cases, and two linguistic, latinized interpretative canons.7 That’s on top of the four
dictionaries and twenty reporter pages of caselaw analyzing the same problem in the district
court.8
Notwithstanding such expensive and extensive efforts, the court’s interpretation has
come under attack: its dictionary analysis was misleading,9 its canons badly deployed,10 and
3 In re Katrina Canal Breaches Litig., 495 F.3d at 208. Brief of Appellee State Farm Fire & Casualty Co. at 14–
26, In re Katrina Canal Breaches Litig., 495 F.3d 191 (5th Cir. 2007) (No. 07-30119), 2007 WL 2466572;
Brief of Appellee Allstate Ins. Co. & Allstate Indem. Co. at 16–37, In re Katrina Canal Breaches Litig., 495
F.3d 191 (5th Cir. 2007) (No. 07-30119), 2007 WL 4266556.
4 Benjamin E. Hermalin, Avery W. Katz & Richard Craswell, Contract Law, in 1 HANDBOOK OF LAW AND
ECONOMICS 3, 68 (A. Mitchell Polinsky & Steven Shavell eds., 2007) (noting that interpretation is the most
litigated type of contract dispute).
5 The opinion has been cited nearly 7,000 times over fifteen years, discussed in almost 2,000 secondary sources,
and is taught to 1Ls. See, e.g., IAN S. AYRES AND GREGORY M. KLASS, STUDIES IN CONTRACT LAW 701 (9TH
ED. 2017).
6 In re Katrina Canal Breaches Litig., 495 F.3d at 214–19 (“The distinction between natural and non-natural
causes in this context would . . . lead to absurd results and would essentially eviscerate flood exclusions whenever a levee is involved.”).
7 Id. at 210–19.
8 In re Katrina Canal Breaches Consolidated Litig., 466 F. Supp. 2d 729, 747–763 (E.D. La. 2006).
9 Natasha Fossett, What Does Flood Mean to You: The Louisiana Courts’ Struggle to Define in Sher v. Lafayette Insurance Company, 37 S.U. L. REV. 289, 303–306 (2010) (arguing that flood as defined in Louisiana
Law had a narrower meaning than either the Fifth Circuit or the later Louisiana Supreme Court decision implied).
10 Rachel Lisotta, In Over Our Heads: The Inefficiencies of the National Flood Insurance Program and the
Institution of Federal Tax Incentives, 10 LOY. MAR. L. J. 511, 523 (2012) (criticizing the court for not focusing
on the intent of the parties); Fossett, supra note 9, at 309–10 (arguing for use of the absurdity canon); Mark
R. Patterson, Standardization of Standard-Form Contracts: Competition and Contract Implications, 52 WM.
& MARY L. REV. 327, 356 (2010) (critiquing the Fifth Circuit for failing to address the significance of the
relevant policy being drafted by the Insurance Service Office); Eyal Zamir, Contract Law and Theory: Three
Views of the Cathedral, 81 U. CHI. L. REV. 2077, 2096 (2014) (critiquing the limited tools used by American
courts to regulate standard form contracts, as evidenced by the court’s narrow approach in the Katrina case).

[p. 5]
some of the relevant legal authorities were in fact pro-plaintiff.11 Rather than reach a decision that followed from a constraining method, the Fifth Circuit (say its critics) merely
affirmed its pro-business priors.12 If textualism looks like another infinitely malleable and
justificatory practice in high stakes cases, what good is it? But textualism’s competitor,
kitchen-sink contextualism, has been in bad odor for two generations, at least for the sorts
of contracts that generally get litigated.13 Thus, contract jurists muddle along, looking for a
better, more convenient path.14
In this article we offer a new approach to determining contracting parties’ meaning,
which we’ll call generative interpretation.15 The idea is simple: applying large language
11 See, e.g., Sher v. Lafayette Ins. Co., 2007-CA-0757, 2007 WL 4247708 (La. App. 4th Cir. Nov. 19, 2007)
(finding flood ambiguous), reversed by Sher v. Lafayette Ins. Co., 07-2441, 988 So. 2d 186 (La. 4/8/08); Ebbing v. State Farm Fire & Cas. Co., 1 S.W.3d 459, 462 (Ark. Ct. App. 1999) (holding flood excluded manmade
causes); cf. M & M Corp. of S.C. v. Auto-Owners Ins. Co., 701 S.E.2d 33 (S.C. 2010) (finding that rainwater
deliberately channeled on insured’s land was not flood water).
12 Willy E. Rice, The Court of Appeals for the Fifth Circuit: A Review of 2007–2008 Insurance Decisions, 41
TEX. TECH L. REV. 1013, 1039 (2009) (“[T]he Fifth Circuit has received some highly negative coverage in
newspapers for its pro-insurer, Katrina-related decisions . . . Without doubt, for those who believe the Fifth
Circuit is a ‘pro-insurer court,’ the discussions of the outcomes and opinions in those cases will do very little
to dispel that perception.”); Kenneth S. Abraham & Tom Baker, What History Can Tell Us About the Future
of Insurance and Litigation After Covid-19, 71 DEPAUL L. REV. 169, 189 (2022) (arguing that homeowners’ unwillingness to buy federal flood insurance helped motivate strict construction of their private contracts);
Thomas A. McCann, 5th Circuit Ruling: A Tough Pill to Swallow for Katrina Policyholders, 20 LOY. CONSUMER L. REV. 100 (2007); Becky Yerak, Insurers Win Key Katrina Ruling, CHICAGO TRIBUNE (Aug. 3,
2007), https://www.chicagotribune.com/news/ct-xpm-2007-08-03-0708020805-story.html (noting the effect on homeowners). To be clear, the earlier ruling came under even more scrutiny. See, e.g., Walter J. Andrews, Michael S. Levine, Rhett E. Petcher & Steven W. McNutt, Essay, A “Flood of Uncertainty”: Contractual Erosion in the Wake of Hurricane Katrina and the Eastern District of Louisiana’s Ruling in In Re Katrina
Canal Breaches Consolidated Litigation, 81 TUL. L. REV. 1277 (2006) (arguing that the District Court’s finding that flood was ambiguous was wrong); Michelle E. Boardman, The Unpredictability of Insurance Interpretation, 82 L. & CONTEMP. PROBS. 27, 41 n.45 (2019) (calling the District Court infamous and arguing
that the Fifth Circuit ruling was correct); Edward P. Richards, The Hurricane Katrina Levee Breach Litigation: Getting the First Geoengineering Liability Case Right, 160 U. PA. L. REV. 267 (2012) (arguing in support
of the Fifth Circuit ruling).
13 Lawrence A. Cunningham, Contract Interpretation 2.0: Not Winner-Take-All but Best-Tool-For-The-Job,
85 GEO. WASH. U. L. REV. 1625, 1628–31 (offering the history of contextualism versus textualism and noting
a rise in the latter starting in the early 1990s). But cf. 5 CORBIN ON CONTRACTS § 24.7 (2023) (noting a
“trend” toward abandoning plain meaning in some states).
14 Cunningham, supra note 13, at 1633–43 (noting proposals to compromise between the two approaches).
15 For previous discussions of the use of large language models in contracts, see Ryan Catterwell, Automation
in Contract Interpretation, 12 L. INNOVATION & TECH. 81, 100 (2020) (early paper showing how information can be extracted from contractual texts); Yonathan A. Arbel & Shmuel I. Becher, Contracts in the Age
of Smart Readers, 90 GEO. WASH. L. REV. 83 (2022) (arguing that language models could serve as “smart

[p. 6]
models (LLMs) to contractual texts and extrinsic evidence to predict what the parties
would have said at contracting about what they meant.16 Our goal is to convince you that
generative interpretation avoids some of the problems that bedeviled the Fifth Circuit in its
Katrina litigation, while being materially more accessible and transparent. Giving courts a
convenient way to commit to a cheap and predictable contract interpretation methodology
would be a major advance in contract law, and parties may start to include it in their
choice-of-law repertoire. We argue that even today’s freshly-minted LLMs can be of service.
Convincing judges to forgo dictionaries and canons and adopt a chat tool best
known today for encouraging lawyers to submit fake authorities will be a tall order.17 We’ll
largely proceed by way of demonstrative case studies. Let’s start with the word flood. In the
Katrina case, the question was really whether the widely shared meaning of flood reasonably
excluded manmade disasters. To answer that question you could, as the court did, turn to
the traditional tools of High Textualism.18 Or you could survey insured citizens (if you could
identify them and avoid motivated answers).19 And you might even, if you were technically
sophisticated and patient enough, query a few relatively small databases and ask which words
in English generally tend to occur, or collocate, with flood in newspapers, books, and the
like.20
But we instead turned to a convenient, free, open-source LLM tool resting on a database of trillions of words and asked it to transform words into complex vectors in a process
called embedding.21 As a first cut, this process can be thought of as trying to quantify how
much a word or phrase belongs to a given category, or dimension. Thus, if there is a
readers” of consumer contracts); Noam Kolt, Predicting Consumer Contracts, 37 BERK. TECH. L.J. 71 (2022)
(arguing that ChatGPT might be useful in helping consumers to understand their contracts and providing
examples).
16 Cf. Jonathan H. Choi, Measuring Clarity in Legal Texts, 91 U. CHI. L. REV. (forthcoming, 2024). Choi’s
excellent paper, though not focused on contract interpretation particularly, significantly advanced understanding of how automated interpretative methods can aid factfinders. We build on his work technically by developing new ways of interacting with large language models and incorporating context and attention mechanisms.
17 See infra at text accompanying notes 197–200 (discussing Mata v. Avianca, Inc., __ F.Supp.3d. __, 2023 WL
4114965 (June 22, 2023).); see also Ex Parte Allen Michael Lee, __. S.W.3d. __ , 2023 WL 4624777, at *1 n.2
(Ct. App. Tex. July 19, 2023) (explaining the court’s suspicion that counsel had filed briefs using ChatGPT
and had made up cases and citations).
18 For a related phrase, see Ryan Doerfler, Late-Stage Textualism, 2022 Sup. Ct. Rev. 267.
19 See Omri Ben-Shahar & Lior J. Strahilevitz, Interpreting Contracts via Surveys and Experiments, 92 N.Y.U.
L. REV. 1753 (2017) (proposing using surveys to interpret certain mass contracts).
20 See Stephen C. Mouritsen, Contract Interpretation with Corpus Linguistics, 94 WASH. L. REV. 1337, 1378
(2019) (proposing using corpus linguistics to interpret contracts).
21 For a survey of embedding methods, see MOHAMMAD TAHER PILEHVAR & JOSE CAMACHO-COLLADOS,
EMBEDDINGS IN NATURAL LANGUAGE PROCESSING 27–110 (2021).

[p. 7]
dimension for the word water, fish will score higher than dogs. Using an interface we developed, we queried several models about the relation between the policy exclusion term and
words and phrases describing other potential sources of damage.22
Figure 1: Analysis of the cosine distance—a measure of distance for the numerical representation of terms (embeddings) by language models—between the exclusion clause ("We will
22 All of the code necessary to replicate these results, and the remaining ones in the paper, can be found at:
GITHUB, https://github.com/yonathanarbel/generativeinterpretation/tree/main (last visited Sep. 6,
2023). The exclusion term is the language contained at footnote 1, supra. Because embeddings are vectors in
high-dimensional space, we can measure the distance between them. This method has been used extensively in
the literature. See Choi, supra note 16, at 24–26 (using the method and reporting its usage and limitations). For a
non-legal example, see, e.g., Nitika Mathur, Timothy Baldwin & Trevor Cohn, Putting Evaluation in Context:
Contextual Embeddings Improve Machine Translation Evaluation, PROCEEDINGS OF THE 57TH ANNUAL
MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS 2799 (2019). We found that while results using this method seem sensible, they are also fragile. To create a more robust measure, we relied on the
embeddings of the ten top performing models today (found at https://huggingface.co/spaces/mteb/leaderboard on pair classification tasks) and used similar sentence structures. This approach is partly inspired by
Maria Antoniak & David Mimno, Evaluating the Stability of Embedding-based Word Similarities, 6 TRANS.
ASS'N FOR COMPUTATIONAL LINGUISTICS 107 (2018). We then calculated the cosine distance, normalized
it, and reported the results in the figure below. For an elaboration on the limitations of these techniques, see
infra notes 210-211 and accompanying text.

[p. 8]
not pay for loss or damage caused directly or indirectly by . . . . Water . . . Flood . . . all whether
driven by wind or not . . .”) and various terms and phrases.
To read Figure 1, focus on the location of the red markers. The further they are from
the origin, the more distant the model considers the semantic relationship between the
phrases.23 In our view, the Figure offers immediately available, objective, cheap support for
the court’s judgment that floods can be unnaturally caused. Common sentences regarding
floods do not distinguish among types of cause, but seem more focused on how typical the cause is.
Our quality checks, flood caused by tears of joy or police, are indeed farther out than flood
caused by heavy rainfall or a severe storm.24 And while it supports this decision of the court,
it challenges another. Louisiana courts refused to exclude water main floods, even though
linguistically they appear to be as much of a flooding event as any other.25
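
For readers who want to see the arithmetic, the calculation behind a figure of this kind can be sketched in a few lines of Python. This is an illustration only and not the replication code cited in note 22. The three-dimensional vectors are invented toy values rather than real model output, the model names and comparison phrases are hypothetical, and min-max scaling followed by a simple average across models is only one assumed reading of "normalized."

```python
import math

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def min_max_normalize(distances):
    """Rescale a {phrase: distance} mapping to the range [0, 1] (assumed scheme)."""
    low, high = min(distances.values()), max(distances.values())
    if high == low:
        return {phrase: 0.0 for phrase in distances}
    return {p: (d - low) / (high - low) for p, d in distances.items()}

def mean_normalized_distance(clause_vectors, phrase_vectors):
    """Average each phrase's normalized distance to the clause across models."""
    per_model = []
    for model, clause_vec in clause_vectors.items():
        raw = {p: cosine_distance(clause_vec, vec)
               for p, vec in phrase_vectors[model].items()}
        per_model.append(min_max_normalize(raw))
    return {p: sum(m[p] for m in per_model) / len(per_model)
            for p in per_model[0]}

# Toy embeddings, invented for illustration (NOT output from any real model).
clause_vectors = {
    "model_a": [1.0, 0.2, 0.0],
    "model_b": [0.5, 0.5, 0.1],
}
phrase_vectors = {
    "model_a": {
        "flood caused by heavy rainfall": [0.9, 0.3, 0.1],
        "flood caused by a levee breach": [0.8, 0.4, 0.2],
        "flood caused by tears of joy": [0.1, 0.2, 0.9],
    },
    "model_b": {
        "flood caused by heavy rainfall": [0.6, 0.4, 0.1],
        "flood caused by a levee breach": [0.5, 0.6, 0.3],
        "flood caused by tears of joy": [0.0, 0.3, 1.0],
    },
}

scores = mean_normalized_distance(clause_vectors, phrase_vectors)
for phrase, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{score:.3f}  {phrase}")
```

Averaging over several models is meant to mirror the concern that results from any single embedding model can be fragile; with real embeddings, the vectors would have hundreds or thousands of dimensions, and the ordering of phrases would be an empirical question rather than something built into the toy data.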
Now, the model doesn’t provide (nor could it) a scientific answer to the question of
whether words are sufficiently close to make the plain meaning of flood unambiguous. But
there is a bit of difference between an informed conclusion based on a statistical analysis of
billions of texts and a judgment by a few dictionary editors. And there is an ocean of difference between the baroque and expensive textualism the court used and code that is cheap,
replicable, quick, and most importantly, extremely straightforward to use. Simply put, generative interpretation is good enough for many cases that currently employ more expensive,
and arguably less certain, methodologies. It's a workable, workmanlike method for a resource-constrained contract litigation world.
In Part I, we introduce the methodologies of contract interpretation and argue that
they badly fail at their core purposes of unbiased, accessible ascertainment of what the parties
would have wanted. In practice, interpretation operates as a kludgy prediction engine. Both
textualism and contextualism strive to estimate what the parties would have said on a matter,
accounting for realistic constraints of evidence and cost. But those constraints impose real
tradeoffs and can’t avoid legitimacy problems generated by courts’ motivated reasoning.26
23 The models we use here specialize in creating embeddings that can measure the semantic textual similarity
of sentences and words. For technical background, see Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, and Yinfei Yang, Sentence-T5: Scalable Sentence Encoders from Pretrained Text-to-Text Models, arXiv:2108.08877 (2021).
24 It is telling that ‘fire,’ while having a wide distribution, is nearer to the origin than ‘tears of joy.’ A possible
reason is that the exclusion term references a number of harm-causing events, and given that fire is
25 Sher v. Lafayette Ins. Co., 2007-2441 (La. 4/8/08), 988 So. 2d 186, 195, on reh’g in part (July 7, 2008)
(“[I]nundation of property due to broken water mains . . . would not be excluded as a ‘flood.’”). In re Katrina
Canal Breaches Litig., 495 F.3d 191, 216 (5th Cir. 2007) (“Unlike a canal, a water main is not a body of water
or watercourse.”).
26 See supra note 12 (charging the Fifth Circuit with being pro-business).

[p. 9]
We describe some modern proposed improvements on interpretation’s normal science and
suggest that however promising they are, concerns about usability and cost impair their real-world utility.27
Part II is the heart of the Article. Here, we look at several types of interpretative
problems generated by real contracts that produced contracts opinions. These range from
the easy (what is the predicted meaning of a particular word?), to the hard (is there an ambiguity?), to the metaphysical (what did the parties mean when they clearly hadn’t considered the issue?). In each example, we showcase new ways to use large language models to
sharpen intuitions about the parties’ presumed intent, to illuminate how transparent and
objective interpretative methodologies have advantages over intuitive ones, and to suggest
that generative interpretation has real promise as a judicial adjunct. The cases we run
through include casebook staples, like Trident Ctr. v. Connecticut Gen. Life Ins. Co.28 and
C & J Fertilizer, Inc. v. Allied Mut. Ins. Co.,29 as well as some that should be, like Famiglio
v. Famiglio,30 Haines v. City of New York,31 and Stewart v. Newbury.32 For many of these
cases, our work is based on archival research identifying original contract materials, until
now obscured by the judicial opinions that purportedly interpret them.
These case studies show how generative interpretation might be deployed in practice. As we will explore, the technology underlying large language models can do more than
merely help us see if flood is closer to levee than it is to joy. Dictionaries, encyclopedias, or
corpus linguistics can do that. What makes large language models powerful is the vastness of
the data they incorporate; what makes them unique is that they wield an internal mechanism
known as “attention” which allows them to account for context. And by becoming context sensitive, these models can parse the effects of contract text from the marginal value of
relevant extrinsic evidence.
27 See infra notes 32–107 and accompanying text.
28 847 F.2d 564 (9th Cir. 1988). See, e.g., RANDY E. BARNETT & NATHAN B. OMAN, CONTRACTS: CASES
AND DOCTRINE 483 (7th ed. 2021); E. ALLEN FARNSWORTH, CAROL SANGER, NEIL B. COHEN, RICHARD
R.W. BROOKS & LARRY T. GARVIN, CASES AND MATERIALS ON CONTRACTS 560 (10th ed. 2023).
29 227 N.W.2d 169 (Iowa 1975). See Brian Bix, The Role of Contract: Stewart Macaulay’s Lessons from Practice, in REVISITING THE CONTRACTS SCHOLARSHIP OF STEWART MACAULAY: ON THE EMPIRICAL AND
THE LYRICAL 252 (Jean Braucher, John Kidwell & William Whitford eds., Hart Publishing, 2013) (describing
C&J and noting that it is often assigned in casebooks, including Stewart Macaulay’s and Charles Knapp’s).
30 279 So.3d 736 (Fla. Dist. Ct. App. 2019).
31 41 N.Y.2d 769 (1977). See also ROBERT S. SUMMERS, ROBERT A. HILLMAN AND DAVID A. HOFFMAN,
CONTRACT AND RELATED OBLIGATION: THEORY, DOCTRINE, AND PRACTICE 834 (8th ed. 2021).
32 220 N.Y. 379 (1917). See also SUMMERS ET AL., supra note 31, at 948.

[p. 10]
But current practices about LLMs and their future uses are contingent: Lawyers
tend to use tools before they are theoretically sharp.33 In Part III, we develop a theory to
justify and constrain generative interpretation going forward, as the technology that enables
it continues to rapidly develop and its use by lawyers and judges grows explosively. We make
two claims.
First, the method fills a glaring need for a simple, transparent, and convenient way
to commit to an interpretative method that helps predict the parties’ intent. If courts follow
the set of best practices we describe, they will avoid certain access-to-justice and legitimacy
problems that have beset the modern contract litigation machine. Second, rather than
simply a marginal improvement over dictionary-and-canon textualism, or its negation as a
form of 1960s-California contextualism,34 use of artificial intelligence (AI) should prompt
a top-to-bottom reexamination of the assumptions justifying these approaches to interpretation. As more courts commit to generative interpretation, parties may come to prefer contextual evaluation of meaning when their deals are evaluated, thus flipping a longstanding
default rule in contract law.35
We do consider some of the developing objections to the use of large language models, including their hallucinatory errors, biases, black-box methods, and the tension between
the rapidity of their deployment and stately needs of precedential decision-making. As we
show, generative interpretation’s dangers illustrate its limits: Judges will have to use these
engines as tools to excavate the normative judgments on which all interpretative and adjudicatory exercises rest. Large language models aren’t robot judges. What they will do (and
maybe are already doing) is help judges illuminate the degree to which we want to give the
parties what they really bargained for, as best as we can.
I. CONTRACT INTERPRETATION AS PREDICTION
Jurists interpreting contracts start with a simple question: “what would the parties
33 Consider originalism.
34 For defenses of contextualism, see Jeffrey W. Stempel & Erik S. Knutsen, Rejecting Word Worship: An
Integrative Approach to Judicial Construction of Insurance Policies, 90 U. CIN. L. REV. 561, 600–01 (2021);
Jeffrey W. Stempel, Unmet Expectations: Undue Restriction of the Reasonable Expectations Approach and
the Misleading Mythology of Judicial Role, 5 CONN. INS. L.J. 181, 183–84 (1998).
35 In some industries, the evidence that parties would prefer that later decisionmakers incorporate context is
robust. William Hoffman, On the Use and Abuse of Custom and Usage in Reinsurance Contracts, 33 TORT
& INS. L.J. 1, 3 (1997) (origin of nonintegrated contracts); William Hoffman, Facultative Reinsurance Contract Formation, Documentation, and Integration, 38 TORT TRIAL & INS. PRAC. L.J. 763, 836–37 (2003)
(explaining why parties prefer custom).

[p. 11]
have said about the meaning of a disputed phrase at the time they entered the contract?”36
That is, to “ascertain the parties’ intention at the time [the parties] made their contract.”37
As Alan Schwartz and Bob Scott noted in their canonical article, Contract Theory and the
Limits of Contract Law, this question in theory has a “correct answer.”38 In practice,
however, it is not always easy or possible to know what it is. Lacking a time machine,
adjudicators traditionally have stitched together an answer using imperfect evidence—a mix
of the contract’s text, the parties’ statements about the deal (whether from before, during,
or after its formation),39 market data,40 and some hunches about fairness and efficiency
under the circumstances.41
To put it another way, almost all jurists agree that the goal of contract
interpretation—its real ambition—is to be a prediction machine.42 That is, to look
36 Bruce v. Blalock, 241 S.C. 155, 161, 127 S.E.2d 439, 442 (1962) (“In construing the contract the Court will
ascertain the intention of the parties . . . as well as the purposes had in view at the time the contract was made.”).
37 STEVEN J. BURTON, ELEMENTS OF CONTRACT INTERPRETATION § 1.1, at 1.
38 Alan Schwartz & Robert E. Scott, Contract Theory and the Limits of Contract Law, 113 YALE L.J. 541, 568
(2003) (“There is a consensus among courts and commentators that the appropriate goal of contract interpretation is to have the enforcing court find the ‘correct answer.’”); Alan Schwartz & Robert E. Scott, Contract
Interpretation Redux, 119 YALE L.J. 926 (2010). For criticisms, see Adam B. Badawi, Interpretive Preferences
and the Limits of the New Formalism, 6 BERKELEY BUS. L.J. 1 (2009); Shawn J. Bayern, Rational Ignorance,
Rational Closed-Mindedness, and Modern Economic Formalism in Contract Law, 97 CALIF. L. REV. 943
(2009); Robin Bradley Kar & Margaret Jane Radin, Pseudo-Contract and Shared Meaning Analysis, 132
HARV. L. REV. 1135, 1182–92 (2020) (arguing that sophisticated parties would not and do not prefer acontextual readings).
39 Stephen F. Ross & Daniel Trannen, The Modern Parol Evidence Rule and its Implications for New Textualist Statutory Interpretation, 87 GEO. L.J. 195, 196–97 (1995) (noting disagreement between Williston and
Corbin on parol evidence).
40 JOHN BOURDEAU, PAUL M. COLTOFF, JILL GUSTAFSON, GLENDA K. HARNAD, JANICE HOLBEN, SONJA
LARSEN, LUCAS MARTIN, ANNE E. MELLEY, KARL OAKES, KAREN L. SCHULTZ & ERIC C. SURETTE, AMERICAN JURISPRUDENCE § 219 (2nd ed. 2023) (“Under the Uniform Commercial Code, a course of dealing between the parties . . . may give particular meaning to, and supplement or qualify, terms of an agreement.”).
41 Omri Ben-Shahar, David A. Hoffman & Cathy Hwang, Nonparty Interests in Contract Law, 171 U. PA. L.
REV. 1095, 1017–1129 (2023) (describing courts’ use of public interests in interpreting contracts).
42 Schwartz & Scott, supra note 38, at 568 (noting “consensus” about the “appropriate goal”). There are exceptions. Eyal Zamir, for example, argues that interpretation should adhere to moral and social norms, partly because they are more likely to reflect the parties’ true intent, and partly because only those contracts are worth
enforcing. Cf. Eyal Zamir, The Inverted Hierarchy of Contract Interpretation and Supplementation, 97
COLUM. L. REV. 1710, 1777–88 (1997). Other common reasons to deviate from the parties’ intentions include attempts to incent clearer drafting, to share valuable information, and to facilitate standardization. See,
e.g., Ian Ayres, Default Rules for Incomplete Contracts, in 1 THE NEW PALGRAVE DICTIONARY OF ECONOMICS AND THE LAW 585 (Peter Newman ed., 1998) (reviewing the economic theories for the design of default
rules). It is inevitable that the parties at times will choose not to think about a relevant possibility to minimize
transaction costs or permit a deal. Therefore, when we say that the goal is prediction, consider it the beginning,
rather than the end, of interpretation.

[p. 12]
backward and predict what the parties would have said they meant.43 This seems
straightforward, akin to the retrospective intent-based inquiries we see in criminal law and
tort. Nonetheless, interpretation is “the least settled, most contentious area of
contemporary contract doctrine and scholarship.”44 That’s because of the many problems it
seeks to solve. As Greg Klass puts it, jurists ask (1) whose meaning counts, (2) what type of
meaning matters (local/majoritarian, semantic/pragmatic), and (3) what facts determine
the legally relevant meaning.45 These questions map, imperfectly, onto distinctions between
textualists and contextualists. And, at the bottom of the well, contractual interpretation
resolves questions of claims to judicial power, and thus legitimates violence.46 The result is
that parties contesting how to interpret contracts are sometimes arguing about what
outcomes are just, not merely which are more likely to lead to parties getting what they want.
But putting aside normative questions, even basic operational empirics about
interpretation—the prediction questions everyone agrees are at the core—are hard.
Prediction is difficult, and mistakes are inevitable. Accuracy—in the sense of thinking that
we really got as close as we could to knowing what the parties would have said—trades off
against cost and certainty. Efficiency-minded scholars have repeatedly argued that as the
amount of evidence offered to prove the parties’ contemporaneous-to-contracting meaning
increases, so does expense across several domains.47
As a first cut at that cost, consider that when parties are permitted to adduce additional sources of interpretative evidence, they also increase the range of defensible answers
from the tribunal. This means that it becomes harder to know what the factfinder will do—
their ability to choose unexpected meanings waxes with the evidentiary inputs.48 But worse,
43 In recent work one of us elaborates on the idea developed here of interpretation-as-prediction. See
Yonathan A. Arbel, Time and Contract Interpretation: Lessons from Machine Learning, in Research
Handbook on Law and Time (forthcoming 2024, Frank Fagan & Saul Levmore Eds.).
44 Ronald J. Gilson, Charles F. Sabel & Robert E. Scott, Text and Context: Contract Interpretation as Contract
Design, 100 CORNELL L. REV. 23, 25 (2014); Schwartz & Scott, Redux, supra note 38, at 928.
45 See Gregory Klass, Contracts, Constitutions and Getting the Interpretation-Construction Distinction
Right, 18 GEO. J. L. & PUB. POL’Y 13, 24–28 (2020).
46 See Robert M. Cover, Violence and the Word, 95 YALE L.J. 1601, 1601 (1986).
47 See generally Gregory Klass, Contract Exposition and Formalism, GEORGETOWN LAW FACULTY
PUBLICATIONS & OTHER WORKS 63 (2017), https://scholarship.law.georgetown.edu/facpub/1948/ (“The
more evidence one allows into interpretation, the less certain the outcome. The costs of such uncertainty in
the contractual setting can be especially high.”); Schwartz & Scott, supra note 38, at 580 (2003) (“Expanding
the evidentiary base is not costless, however. The parties, therefore, face a tradeoff between the efficiency of
increased accuracy and the inefficiency of increased contract-enforcement costs.”).
48 Klass, supra note 47, at 63 (“A party that wants to organize its behavior . . . needs to be able to predict how
an adjudicator will later interpret that agreement. To the extent thicker interpretive rules reduce predictability,

[p. 13]
both parties and factfinders are motivated in how they offer and process evidence.49 In a
regime that permits more evidence, parties will offer evidence that favors their view, sometimes unconsciously motivated to avoid presenting data that favors the other side;50 factfinders, equally subject to motivated cognition, will process new evidence in biased ways.51
At the same time, as the types of evidence relevant to contract interpretation become
more capacious, parties will seek to introduce more evidence at trial, raising the costs of
litigation.52 These costs may be significant, even in dispute resolution forums like arbitration
that are built to resolve cases quickly and cheaply.53 The interpretation arms race has led
scholars to model when parties would prefer to spend money ex ante on more specified text,
rather than spend ex post on litigation.54 That is, to pre-commit to methodologies which are
less accurate but more efficient.
This is all familiar territory. Now, consider what interpretative methodologies have
been on offer to calibrate between predictive accuracy and virtues that center around
certainty and efficiency. Like other legal extrapolative enterprises, interpretation has
they impose an additional cost . . . .”).
49 Christoph Engel, Judicial Decision-Making. A Survey of the Experimental Evidence, MPI COLLECTIVE
GOODS DISCUSSION PAPER, No. 6. 5, (2022) (noting that even when decision makers are motivated to be
impartial, bias has been shown to sneak in inadvertently via race, gender, ideology, and the stereotype that
tattoos are typical for criminals.); Lawrence M. Solan, Terri Rosenblatt & Daniel Osherson, False Consensus
Bias in Contract Interpretation, 108 COLUM. L. REV. 1268, 1269 (2008) (explaining that “false consensus
bias” may cause contracting parties not to recognize different interpretations of their agreement until litigation, at which point judges fall victim to the same bias).
50 Schwartz & Scott, supra note 38, at 607 (2003) (claiming that under standards allowing for recovery of
“commercially reasonable” costs and investments, parties would always claim their costs were higher and their
investments reasonable).
51 Solan, Rosenblatt & Osherson, supra note 49, at 108 (“Susceptibility to false consensus bias places judges
engaged in the interpretation of contractual language at risk of erroneous decisionmaking.”).
52 For some evidence on this process in the courts, see generally Lisa Bernstein, Custom in the Courts, 110 NW.
U. L. REV. 63 (2015) (showing that courts accept evidence of custom that isn’t systematic even in commercial
disputes).
53 Richard A. Posner, The Law and Economics of Contract Interpretation, 83 TEXAS L. REV. 1581, 1605–06
(2004) (arguing that commercial arbitration, where the arbitrator uses commercial common sense to predict
intent rather than asking the parties to present evidence, may be preferable when the written contract does not
make the parties’ intentions immediately clear because it allows the parties to avoid extra expenses).
54 Ronald J. Gilson, Charles F. Sabel & Robert E. Scott, Braiding: The Interaction of Formal and Informal
Contracting in Theory, Practice, and Doctrine, 110 COLUM. L. REV. 1377, 1391 n.35 (2010) (“If conditions
are unlikely to change much in the future (the level of uncertainty is low), and thus the ex-ante cost of writing
contract rules is low relative to the anticipated gains, the parties’ most cost-effective strategy is to write a complex, rule-based contingent contract.”).

[p. 14]
developed two basic methods to solve for the predictive question in the absence of the ability
to travel to the time of contracting.55 These methods, textualism and contextualism, are
represented in the real world by the courts in New York and California, respectively.56
New York’s textualist judges focus on the contract: They take its words as the
canonical source of the parties’ meaning and abjure other sources of evidence as predictive
grist. Textualists try to use the common sense meaning of words, using dictionaries to obtain
the public meaning of the words the parties chose, and grammatical and lexical tools to
understand how the words, when collated, create obligation.57 Textualism has known
advantages, including forcing the parties to think carefully about what they mean, and to use
contract words in ordinary ways.58 This ideological approach to contract interpretation
resembles its counterpart in statutory and constitutional interpretation;59 though it is
less politically valenced, it is equally ascendant.60
The linguistic textualist project has long been controversial. To begin with, the
method of brute sense plain meaning primes judges to overconfidently believe that their
beliefs and conclusions are more common than they in fact are.61 As Arthur Corbin put it
long ago, “when a judge reads the words of a contract he may jump to the instant and
confident opinion that they have but one reasonable meaning and that he knows what it
55 John F. Manning, What Divides Textualists from Purposivists?, 106 COLUM. L. REV. 70, 75 (2006) (arguing
that textualism and purposivism remain meaningfully distinct modes of statutory interpretation); see generally
Eric A. Posner, The Parol Evidence Rule, the Plain Meaning Rule, and the Principles of Contractual Interpretation, 146 U. PA. L. REV. 533 (1998) (defending textualist approaches in contract law).
56 Klass, supra note 45, at 29 (distinguishing New York and California archetypes).
57 Joshua M. Silverstein, Contract Interpretation Enforcement Costs: An Empirical Study of Textualism Versus Contextualism Conducted Via the West Key Number System, 47 HOFSTRA L. REV. 1011, 1014 (2019)
(“‘Textualist’ judges and commentators argue that the interpretation of contracts should focus primarily on
the language contained within the four corners of written agreements.”); Gilson, Sabel & Scott, supra note 44,
at 40 (“Textualist arguments accordingly focus on the insight that, for legally sophisticated parties who write
bespoke contracts, context is endogenous; the parties can embed as much or as little context into a customized
agreement as they wish, and they can do so in many different ways.”); Uri Benoliel, The Interpretation of
Commercial Contracts: An Empirical Study, 69 ALA. L. REV. 469, 472–73 (2017) (noting importance of ambiguity).
58 Schwartz & Scott, supra note 38, at 572.
59 For a discussion of the differences between statutory and contract textualism, see William Baude & Ryan D.
Doerfler, The (Not So) Plain Meaning Rule, 84 U. CHI. L. REV. 539, 563–65 (2017). For an insightful argument that interest in contract interpretation has waned relative to statutory interpretation, see Karen Petroski,
Does it Matter What We Say About Legal Interpretation?, 43 MCGEORGE L. REV. 359, 382 (2019).
60 Ethan J. Leib, The Textual Canons in Contract Cases: A Preliminary Study, 2022 WIS. L. REV. 1109 (2022)
(studying the use of textualist canons in contract interpretation); Stempel & Knutsen, supra note 34, at 565–
66 (“In short, textualism has been resilient and ascendant in the 40 years of the post-Restatement era.”).
61 See infra at text accompanying notes 117–10.

[p. 15]
is.”62 Empirical work—experimental63 and sociological64—has since found that judges doing
plain meaning analysis disagree with each other and with lawyers about things they thought
obvious.
Critics also charge textualists with incoherence about ambiguity.65 To reach the safe
shoals of plain meaning, textualists ask first if the language is unambiguous.66 But while textualism provides tools to discover ambiguities, in practice, critics charge, it fails to prioritize
one plausible interpretation over the other. It appears to simplify interpretative disputes, but
in reality sometimes facilitates expensive, biased battles over extrinsic evidence.67
But even outside of ambiguity, textualism’s basic methodological tools are
remarkably underdeveloped. Scholars often blame the humble dictionary.68 Courts doing
62 ARTHUR LINTON CORBIN, CORBIN ON CONTRACTS § 535 (rev. ed. 1960).
63 Solan, Rosenblatt & Osherson, supra note 49, at 1285–94 (finding that we overestimate our sense of whether
others will agree about contract interpretation).
64 John F. Coyle, The Canons of Construction for Choice-of-Law Clauses, 92 WASH. L. REV. 631, 682–87
(2017) (showing that in the absence of a systematic survey, judges can interpret contract language in ways that
conflict with the parties’ intentions).
65 See Lawrence M. Solan, Pernicious Ambiguity in Contracts and Statutes, 79 CHI.-KENT L. REV. 859, 859
(2004) (describing problems with the concept of ambiguity).
66 11 WILLISTON ON CONTRACTS § 33:43 (4th ed.) (“When patent ambiguities are found by a court that adheres to
the traditional distinctions, they will be resolved by the rules of interpretation or not at all.”). Those supposed
rules of interpretation reference § 30:4, where they turn out to combine extrinsic evidence, contract purpose,
and rules of construction.
67 Ward Farnsworth, Dustin F. Guzior & Anup Malani, Ambiguity About Ambiguity: An Empirical Inquiry
into Legal Interpretation, 2 J. LEGAL ANALYSIS 257, 271 (2010) (arguing that policy preferences drive ambiguity) (statutory); Schwartz & Scott, supra note 38, at 570 n.55 (“Courts seldom distinguish between ‘vague’
and ‘ambiguous’ terms . . . . More narrowly, however, a word is vague to the extent that it can apply to a wide
spectrum of referents, or to referents that cluster around a modal ‘best instance,’ or to somewhat different
referents in different people.”).
68 Thomas R. Lee & Stephen C. Mouritsen, Judging Ordinary Meaning, 127 YALE L.J. 788, 801, 810–11
(2018) (identifying several problems with dictionaries, including their failure to define words in terms of “prototypes” and the inconsistency of definitions across dictionaries); Stephen C. Mouritsen, The Dictionary Is
Not a Fortress: Definitional Fallacies and a Corpus-Based Approach to Plain Meaning, 5 B.Y.U. L. REV. 1915,
1919 (2010) (describing “widely shared” false views about dictionaries); Lawrence Solan, When Judges Use
Dictionaries, 68 AM. SPEECH 50, 50 (1993) (“[W]e commonly ignore the fact that someone sat there and
wrote the dictionary, and we speak as though there were only one dictionary, whose lexicographer got all the
definitions ‘right’ in some sense that defies analysis.”); Samuel A. Thumma & Jeffrey L. Kirchmeier, The Lexicon Has Become A Fortress: The United States Supreme Court's Use of Dictionaries, 47 BUFF. L. REV. 227,
276 (1999) (“[A]s with the other steps in the Court's general process of using dictionaries, selecting a specific
definition for a term can be problematic, at times appears to lack principled guidance and can determine the
outcome of a case.”).

[p. 16]
textualism are sometimes reversed for failing to use one.69 But it’s an imprecise tool for
discerning the parties’ intent at the drafting stage. Selecting between dictionaries is a value-laden act,70 and even within a single volume, dictionaries do not provide a single plain, or
majoritarian, meaning of words.71 Critically, dictionary definitions are blind even to internal
context, those other parts of the document or statute that textualists do embrace.72 As Kevin
Tobia demonstrated, definitions can be poor trackers of actual usage, a point well
understood by anyone not adding tomatoes to a fruit salad.73
Dictionary-thumping jurists face two opposing critiques: They bind themselves too
much,74 but also too little.75 The first strips the judicial process of its nuanced nature, the
latter breeds gamesmanship and bias.76 This critique is (to be fair) a little overheated. Sure,
judges take dictionaries seriously,77 but they also freely admit that dictionaries are not “infallible.”78 Even Learned Hand cautioned, “it is one of the surest indexes of a mature and
developed jurisprudence not to make a fortress out of the dictionary.”79 Dictionaries are normally under-determinative of outcomes, and this is a virtue rather than a vice. As we shall
claim, this virtue is equally shared by generative interpretation.
Similarly, the canons of interpretation themselves are difficult to defend
69 Lorillard Tobacco Co. v. Am. Legacy Found., 903 A.2d 728, 738 (Del. 2006) (reversing for failure to follow
dictionary).
70 Lee & Mouritsen, supra note 68, at 807 (“A common use of a dictionary involves simple cherry-picking.”).
71 Id. at 810–11 (“We cannot tell from the opinion whether the written translator sense of interpreter is less
often listed in a real ‘survey’ of dictionaries because we are not presented with an actual survey of dictionaries.”).
72 11 WILLISTON ON CONTRACTS § 32:5 (4th ed.) (“A contract will be read as a whole and every part will be
read with reference to the whole”); Bradley C. Karkkainen, "Plain Meaning:" Justice Scalia's Jurisprudence of
Strict Statutory Construction, 17 HARV. J.L. & PUB. POL’Y 401, 407 (1994).
73 Kevin P. Tobia, Testing Ordinary Meaning, 134 HARV. L. REV. 726, 797–99 (2020).
74 Nicholas S. Zeppos, Judicial Review of Agency Action: The Problems of Commitment, Non-Contractability
and the Proper Incentives, 44 DUKE L.J. 1133, 1143 (1995) (“fanatical” devolution to dictionaries).
75 See Mouritsen, supra note 35, at 1930 (critiquing dictionaries as a weak source of plain meaning and for the
absence of context); Jordan v. De George, 341 U.S. 223, 234 (1951) (Jackson, J., dissenting) (calling dictionaries “the last refuge of the baffled judge”).
76 Lee & Mouritsen, supra note 68, at 798 (“The concern here is that even if we could settle on a theory of
ordinary or plain meaning, we are unsure how to assess it.”).
77 See, e.g., Matter of the Liquidation of Am. Mut. Liab. Ins. Co., 802 N.E.2d 555 (2004) (“Normally, a dictionary definition of a term is strong evidence of its common meaning.”); see also Brigade Leveraged Cap.
Structures Fund Ltd. v. PIMCO Income Strategy Fund, 995 N.E.2d 64, 69 (2013).
78 Cyprus Plateau Min. Corp. v. Commonwealth Ins. Co., 972 F. Supp. 1379, 1384 (D. Utah 1997) (“Dictionaries, while not infallible (or even consistent), are general guides to common usage.”).
79 Cabell v. Markham, 148 F.2d 737 (2d Cir. 1945).

[p. 17]
empirically.80 These canons are traditionally known by their evocative Latin names—in pari
materia, expressio unius est exclusio alterius, ejusdem generis, contra proferentem, generalia
specialibus non derogant—and they are used to fill dictionaries’ gaps.81 They try to address
the problem of context by giving heuristics to parse the parties’ proffered meanings.82
The canons are popular with judges but absent from the Restatement,83 and scholars criticize them as essentially
ad hoc.84 There is no obvious way to know what to do when different canons lead to
different outcomes, meaning that they offer the same degrees of freedom as
dictionaries do.
Nor is it clear that the contractual linguistic canons are rooted in how parties think
or write.85 The extant empirical work on linguistic canons in statutory interpretation
suggests that the answer is: they might be, but only some of the time.86 Now, to be sure, some
of the canons, like contra proferentem, aren’t intended to replicate how the parties would
80 Farshad Ghodoosi & Tal Kastner, Big Data on Contract Interpretation, U.C. DAVIS L. REV. 1, 58 (forthcoming 2024) (highlighting the issue of precedent around the use of canons being deployed without regard to
the context in which the precedent arose); Leib, supra note 60, at 1110 (“Few scholars or lawyers believe they
are applied consistently enough to be reliable in predicting case outcomes . . . .”).
81 See generally Edwin Patterson, The Interpretation and Construction of Contracts, 64 COLUM. L. REV. 833,
852–55 (1964) (identifying canons of contract interpretation).
82 The canons of contract interpretation are to be distinguished from the canons of construction in statutory
interpretation. As Ryan Doerfler has explored, those canons have been subject to a rehabilitative project over
the last generation. Ryan D. Doerfler, Late-Stage Textualism, 2022 SUP. CT. REV. 267, 269 (2022). Of course,
the contra proferentem doctrine in particular is not necessarily effecting intent, but may instead be motivating
clear drafting. See generally Daniel Schwarcz, The Role of Courts in the Evolution of Form Contracts: An
Insurance Case Study, 46 BYU L. REV. 471 (2021) (making this argument in the context of insurance contracts).
83 Ethan J. Leib, The Textual Canons in Contract Cases: A Preliminary Study, 2022 WIS. L. REV. 1109, 1112
(2022) (“Yet the Restatement does not treat the textual canons like expressio unius, ejusdem generis, or noscitur a sociis at all”); Ghodoosi & Kastner, supra note 80, at 48 (“While substantive canons have remained
roughly in equilibrium over time, the chart below demonstrates a trend in which the invocation of textual
canons by courts across contract cases is increasing.”).
84 Karl N. Llewellyn, Remarks on the Theory of Appellate Decision and the Rules or Canons about How Statutes Are to Be Construed, 3 VAND. L. REV. 395, 401 (1950) (“there are two opposing canons on almost every
point”).
85 Gregory Klass, Interpretation and Construction in Contract Law 48 (2018), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2913228 (“Rules of construction are only sometimes pragmatically prior to contract interpretation, but not always and not pervasively.”).
86 Kevin Tobia, Brian Slocum, & Victoria Nourse, Statutory Interpretation from the Outside, 122 COLUM. L.
REV. 213, 241–43, 262 (2022) (finding that some linguistic canons are stated overbroadly or inaccurately but
many canons do reflect the intuitive judgment of ordinary people); Kevin Tobia & Brian G. Slocum, The Linguistic and Substantive Canons 23 (2022), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4186956
(“providing evidence that some interpretive canons that are traditionally motivated by normative values also
have a basis in language”); Janet Randall and Lawrence Solan, Legal Ambiguities: What Can Psycholinguistics
Tell Us?, available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4475356 (comparing canons).

[p. 18]
have understood the contract at drafting (if that has a stable meaning in contracts deployed
to millions of adherents). These normative canons may, or may not, relate to the parties’
contemporaneous intentions.87 But other canons are intended to reflect ordinary uses of
language, and yet have been subject to remarkably little controlled scrutiny.88
Notwithstanding its methodological shortcomings, contract textualism is ever more
popular.89 That’s so for a whole host of reasons, but none more so than the weakness of its
main conceptual rival: contextualism. This familiar alternative starts with the same premise
as textualism: What would the parties have said they meant had we asked them at the time
of contracting? But contextualism invites parties to offer extrinsic evidence to build depth
into the predictive analysis. By doing so, contextualism seeks to privilege accuracy—the
parties’ real intended meaning.
This approach to interpretation, capacious in the types of evidence considered
relevant, found its heyday in the 1960s in California and has never been as popular since.90
The problem with the approach, according to its critics, is that it does not permit the parties
to know what meaning a court will assign to the words they write, since the other side can
always offer self-serving meanings ex post and, if believable enough, write a new bargain in
court to replace the one drafted in the past.91 Even contextualism’s origin story is one of a
party suddenly remembering that they actually meant to make the purchase option available
only to family members, creditors be damned.92 Contextualism makes it difficult to lock
down meaning ex ante, through merger clauses and the like, which are always subject to later
testimonial refutation. Contextualism’s consumer protection allure is understandable.93 But
87 Christopher J. Walker, Legislating in the Shadows, 165 U. PA. L. REV. 1377, 1404 (2017) (arguing that
“contra proferentem” is not a method by which the true intent of the parties is determined, but rather, is a
decision to impose the burden of ambiguity on the drafter).
88 Ross & Tranen, supra note 39, at 226 (“Descriptive canons are based on the way ordinary people express
themselves in English.”).
89 Ghodoosi & Kastner, supra note 80, at 49 (“our study provides evidence that textualism is on the rise in
contract interpretation.”); Aaron D. Goldstein, The Public Meaning Rule: Reconciling Meaning, Intent, and
Contract Interpretation, 53 SANTA CLARA L. REV. 73, 77 (2013) (arguing that courts have increasingly moved
away from the use of extrinsic evidence to help them understand the parties’ intent, leaning instead on “objective” manifestations of intent); Mark L. Movsesian, Formalism in American Contract Law: Classical and Contemporary, 12 IUS GENTIUM 115 (2006) (“It is a truth universally acknowledged, that we live in a formalist
era. At least when it comes to American contract law.”).
90 See Pac. Gas & Elec. Co. v. G. W. Thomas Drayage & Rigging Co., 442 P.2d 641 (Cal. 1968); see also Masterson v. Sine, 436 P.2d 561 (Cal. 1968).
91 Masterson, 68 Cal. 2d at 231 (Burke, J., dissenting).
92 Id.
93 Olah v. Ganley Chevrolet, Inc., 2010-Ohio-5485, ¶ 15, 191 Ohio App. 3d 456, 460, 946 N.E.2d 771, 774
(holding that buyers of a vehicle are barred from presenting evidence that the car was represented by the dealer
as new because the contract says the vehicle is used).

[p. 19]
even if contextualism could offer more accuracy, critics charge it does so at a high cost.94
Indeed, scholars often defend textualism on efficiency grounds.95 Though it may be
unclear what parties want interpretative rules to be, it’s almost certainly the case that lawyer-drafters prefer textualist to contextualist modes of decision. Eric Posner captures the idea
well: Parties will often include an explicit merger clause, but few ever bother with an “anti-merger clause.”96 Thus, from the perspective of the litigated cases—those between rich and
lawyered parties—contextualism is simply harder to defend.
And yet, from a certain perspective, contextualism seems well-positioned for a
revival. Recall that even contextualism’s critics agree about the first-order goal: to figure out
what the parties would have meant at contracting. The problems with contextualism are
largely centered on cost and on motivated testimony, which persuades the factfinder to
ignore the text. But consider: We increasingly live in a world where our thoughts are
recorded contemporaneously, whether sent by text, posted on social media, or recorded on
TikTok. Such recorded, immutable utterances are cheap to reproduce and appear to courts
to be excellent sources of contractual meaning.97 Defenders of textualism may argue that
permitting their use creates uncertainty, but some of the best arguments against
94 An admittedly limited survey of enforcement costs did not find meaningful differences between textualist
approaches and contextualist ones. See Silverstein, supra note 57. For an argument that textualism produces
higher enforcement costs because the judge-by-judge variation in outcomes produces more litigation, see 6
PETER LINZER, CORBIN ON CONTRACTS § 25.14[B] at 163 (Joseph M. Perillo ed., rev. ed. 2010).
95 Schwartz & Scott, Redux, supra note 38, at 928 & n.3 (2010) (“A strong majority of U.S. courts continue to
follow the traditional, ‘formalist’ approach to contract interpretation”). But see Joshua M. Silverstein, Contract Interpretation and the Parol Evidence Rule: Toward Conceptual Clarification, 24 CHAP. L. REV. 89, 92
(2020) (arguing that the matter is indeterminate); Silverstein, supra note 57, at 1020 (“contracts scholars can
also generally be split into textualist and contextualist camps, with a clear majority falling into the latter
group”). There is recent evidence that contract scholars prefer contextualism. Eric Martinez & Kevin Tobia,
What Do Law Professors Believe About Law and the Legal Academy, 112 GEO. L. REV. 42 (forthcoming
2023).
96 Eric A. Posner, The Parol Evidence Rule, the Plain Meaning Rule, and the Principles of Contractual Interpretation, 146 U. PA. L. REV. 533, 571 (1998). As Larry Solan later pointed out, merger clause analogs in statutory interpretation “are not easy to find.” LAWRENCE M. SOLAN, THE LANGUAGE OF STATUTES: LAWS AND
THEIR INTERPRETATION 187 (Chicago 2010).
97 See BrewFab, LLC v. 3 Delta, Inc., No. 22-11003, 2022 WL 7214223, at *1 (11th Cir. Oct. 13, 2022) (affirming that a party’s text message was a personal guaranty that satisfied Florida’s statute of frauds); see also
Cloud Corp. v. Hasbro, Inc., 314 F.3d 289, 295 (7th Cir. 2002) (finding that a party’s e-mails satisfied the
UCC’s statute of frauds and using these as evidence in support of the claim that the contract had been modified); see also Cosby v. Am. Media, Inc., 197 F. Supp. 3d 735, 744 (E.D. Pa. 2016) (holding that tweets may
form the basis of a breach of contract claim).

[p. 20]
contextualism—that it can be abused ex post—are weaker than they used to be.98 And yet,
we lack a method to know which excited utterances to privilege, and we should worry that
courts’ motivated reading will cause them to come to inaccurate or biased understandings.
The debate between textualism and contextualism is old, and scholars have offered
various theoretical lenses by which one or the other approach ought to prevail.99 Most
arguments for or against extrinsic evidence turn on hypotheses about what parties would
have wanted (had we asked them) and which methods promote social welfare. These
arguments are often theoretically rich but empirically poor.100
More recently, scholars have offered two new methods, both advancing the certainty
values of textualism with a dash of the accuracy interests of contextualism. One school
focuses on the use of corpora of words to predict the meaning of phrases in contractual
texts—so-called corpus linguistics.101 To take the prototypical example, consider the
following phrase taken from an insurance contract:
[T]his insurance does not apply to ‘bodily injury’ [including death] to any person
while practicing for or participating in any sports or athletic contest or exhibition
that you sponsor.102
An insured dies while snorkeling: Is that a “sports or athletic contest”? As Stephen
Mouritsen observes, the question is not easily answerable using the classic dictionary-and-canon-based tools of textualism. And, considering that insurance contracts are drafted by
powerful firms, who subject them to regulatory scrutiny, the idea of using extrinsic expressions by either firms or the insured seems hopeless.103 Instead, Mouritsen suggests that courts
could (helped by adversarial presentation by parties) query language databases to establish
whether sports and snorkeling appear relatively close to each other in some number of previous examples. That is, to derive the meaning of the word from its common use in previous
98 Cf. Shawn Bayern, Contract Meta-Interpretation, 49 U.C. DAVIS L. REV. 1097, 1136 (2016) (pointing out
that because text messages are informal, they don’t satisfy some of the deliberation-inducing virtues that textualists would otherwise place in written products).
99 See Ross & Tranen, supra note 39, at 196–97; see also Joshua M. Silverstein, The Contract Interpretation
Policy Debate: A Primer, 26 STAN. J.L. BUS. & FIN. 222 (2021); see also Mark L. Movsesian, Severability in
Statutes and Contracts, 30 GA. L. REV. 41, 70 n.184 (1995) (noting that the popularity of the major interpretive approaches “ebbs and flows”).
100 Silverstein, supra note 57, at 1014 (“The textualist/contextualist controversy cannot be resolved in the abstract. . . . Unfortunately, empirical evidence bearing on this debate is sorely lacking.”).
101 See generally Mouritsen, supra note 20, at 1360–1407 (making case).
102 Id. at 1340.
103 Christopher C. French, Insurance Policies: The Grandparents of Contractual Black Holes, 67 DUKE L.J.
ONLINE 40 (2017) (discussing the difficulty of interpreting insurance contracts for evidence of real meaning).

[p. 21]
texts. (The answer is, more or less, that sports are rule-based competitions, while snorkeling
is swimming wearing a goofy mask.)104
Corpus linguistics is an advance over traditional textualism or contextualism. It provides a methodology that theoretically allows courts to adhere to an objective set of responses
when determining the ordinary meaning of words based on their actual usage. Essentially,
it’s a form of textualism that doesn’t rely on dictionary definitions or a battery of canons. It
mirrors not the static decisions of lexicographers in their secluded, book-filled offices, but
rather the public use of words—democratized textualism.105
But corpus linguistics is inattentive to context.106 It can only really compare brief
snippets of text, rather than whole documents. Thus, although the method has been repeatedly used in statutory interpretation cases—where the stakes are high, parties are commonly
engaged in interpretative battles over short phrases—only one contracts opinion to date has
applied the method.107
A different constraining approach, advanced by Omri Ben-Shahar and Lior Strahilevitz, encourages courts to use survey evidence to decide on the public meaning of certain
contractual texts.108 As they point out, this survey evidence is a second best to the predictive
ideal we described above:
Contracts should have the meaning that the parties to the transaction assign
to the text. [But] it is pointless to ask the actual parties in the litigation what
the text meant to them when they formed the contract, because they will
104 Mouritsen, supra note 20, at 1371–74 (CL approach to snorkeling).
105 For an extended defense, see Jeffrey W. Stempel & Erik S. Knutsen, Technologically Improving Textualism, 6 NEV. L.J. FORUM 10 (2022).
106 See Choi, supra note 16, at 8, 16–17 (arguing that the context “undermines the core claim of corpus linguistics”).
107 See Fulkerson v. Unum Life Insurance Co. of America, 36 F.4th 678 (6th Cir. 2022); see also Richards v.
Cox, 450 P.3d 1074, 1085–86 (Utah 2019) (Lee, J., concurring) (concurring in majority opinion “to the extent it relies on corpus linguistic analysis” to support constitutional and statutory interpretation). Cf. Wilson
v. Safelite Group, Inc., 930 F.3d 429, 439 (6th Cir. 2019) (arguing for use of CL in statutory analysis); Caesars
Entm't Corp. v. Int'l Union of Operating Eng’rs Local 68 Pension Fund, 932 F.3d 91, 95 n.1 (3d Cir. 2019)
(using corpus linguistics to interpret “previously”).
108 Ben-Shahar & Strahilevitz, supra note 19; Ian Ayres & Alan Schwartz, The No-Reading Problem in Consumer Contract Law, 66 STAN. L. REV. 545 (2014) (advocating empirical testing to identify surprising and
problematic provisions in standard form contracts, against which consumers ought to be warned); Ariel Porat
& Lior Jacob Strahilevitz, Personalizing Default Rules and Disclosure with Big Data, 112 MICH. L. REV. 1417,
1419–20 (2014) (advocating the use of surveys to identify the majoritarian preferences for the design of granular default rules).

[p. 22]
bend their answers to fit their litigation goals. So the law should instead ask
disinterested people just like them.109
The authors defend this interesting proposal against various charges.110 Their core
survey case is consumer contracts designed for mass audiences.111 There, the survey audience
and the original adherents are the same people (although separated by time), and we should
have fewer worries about the parties intending idiosyncratic meanings.112 But outside of that
frame, a problem with the survey approach is that for most litigated contract cases—i.e.,
commercial cases—the relevant survey audience will be difficult to find, as sophisticated adherents don’t take surveys, or will game them, producing the same problems encumbering
contextualism.113
Survey evidence is also an expensive adjudicatory technology. Surveys themselves are
difficult to conduct: Judges would need to rely on their adversarial presentation in the ordinary case. And they are increasingly unreliable: Recent work has found that almost a third
of online survey respondents use LLMs to complete answers.114 Surveys based on more collated samples face the same sorts of problems that have bedeviled modern polling: Nonresponse bias among parts of the population, difficulties of generalization, and inaccuracy.
And even here, attention is scarce. It is hard to survey consumers on a twenty-page policy or
to expect anyone filling out a survey for a $5 gift card to attentively consider interdependencies within the contract.
Consequently, though survey methodology is an established technique in trademark
cases and could very well be of enormous help in making sense of the meaning of certain
consumer contracts, it is unlikely to be a transformative technology in the ordinary contract
interpretation case. We are unaware of any cases to date that permit the use of survey evidence to determine contractual meaning.
* * *
109 Ben-Shahar & Strahilevitz, supra note 19, at 1802.
110 Id. at 1802–13 (making the case).
111 Id. at 1758 (noting focus on consumer contracts).
112 Id. at 1776–77 (noting the utility of surveys for consumer contracts on these grounds).
113 Cf. Roberts v. Farmers Ins. Co., 201 F.3d 448 (10th Cir. 1999) (“[W]hat the public expects from an insurance policy is simply not relevant to the legal question of whether the contract is ambiguous.”).
114 Veniamin Veselovsky, Manoel Horta Ribeiro & Robert West, Artificial Artificial Artificial Intelligence:
Crowd Workers Widely Use Large Language Models for Text Production Tasks, ARXIV:2306.07899 (2023),
https://arxiv.org/abs/2306.07899 (noting that 33–46% of mTurk survey workers use LLMs to complete
tasks).

[p. 23]
In summary, notwithstanding broad agreement about the predictive goal of interpretation, there’s also a shared sense that there’s something amiss in how jurists balance accuracy and efficiency. Textualism promises the latter, but in practice it often merely supercharges the judge’s own overconfident priors. Contextualism promises the former, but probably doesn’t deliver it, while eroding parties’ ability to plan for court outcomes and making
litigation prohibitively expensive for all but the wealthiest parties. The two most sophisticated modern improvements on these old technologies—statistical plain meaning and survey evidence—promise to rescue textualism from some of its sins, but haven’t been taken up
in live cases.
Enter large language models.
II. GENERATIVE INTERPRETATION
The doctrine of reasonable expectations plays a contested role in the regulation
of insurance contracts.115 For some courts, the insured’s reasonable expectations trump
the insurance contract’s terms, while for many others, the policy’s plain language should
control.116 Notoriously, these sorts of cases motivate armchair speculation by judges—
whose life experience, education, sophistication, and hard-earned cynicism systematically diverge from most lay people. Worse, the interpretations we give words appear very certain in
our own minds. Contract interpretation is a prime subject for a phenomenon psychologists
call “false consensus bias.”117 To illustrate the effect, Lawrence Solan, Terri Rosenblatt, and
Daniel Osherson presented contract interpretation questions to both laypeople and judges.
After subjects gave their opinions, the authors asked them to estimate how many other participants would agree with them. This design allows us to compare the actual distribution of
answers with how people expected the distribution to look. The results were striking: Both
115 See generally Jeffrey W. Stempel, Unmet Expectations: Undue Restriction of the Reasonable Expectations
Approach and the Misleading Mythology of Judicial Role, 5 CONN. INS. L.J. 181 (1998).
116 Restatement of Liability Insurance, Section 3 (noting that the plain meaning approach is typically followed
instead of cases like C&J). As Daniel Schwarcz has explored, the doctrine is unpredictable when applied in real
cases. See Daniel Schwarcz, A Products Liability Theory for the Judicial Regulation of Insurance Policies, 48
Wm. & Mary L. Rev. 1389 (2007).
117 Joachim Krueger & Russell W. Clement, The Truly False Consensus Effect: An Ineradicable and Egocentric Bias in Social Perception, 67 J. PERSONALITY & SOC. PSYCHOL. 596, 596–97 (1994); Brian Mullen, Jennifer Atkins, Debbie S. Champion, Cecelia Edwards, Dana Hardy, John E. Story & Mary Vanderklok, The
False Consensus Effect: A Meta-Analysis of 115 Hypothesis Tests, 21 J. EXP. SOC. PSYCH. 262 (1985) (providing a meta-analysis of the false consensus effect).

[p. 24]
laypeople and judges overestimated how common their chosen interpretations were. Judges
even overestimated how much other judges would agree with them.118
Thus, one of the risks of introspective interpretation is that its products are very
sticky and hard to dislodge. This leads to dissent and reversal, and of course, interpretation
that defies parties’ reasonable expectations. Uncertainty about common interpretation is an
appealing case for the use of surveys.119 And surveys would be of great interpretative use,
were it not for the practical difficulties which we’ve just discussed.
Consider C & J Fertilizer v. Allied Mutual.120 The president of C&J, a fertilizer firm,
purchased a burglary insurance policy from Allied Mutual. The discussions preceding the
purchase made it clear that the policy would not cover an inside job. During
the negotiations, the insurance firm tried to insist that, to bring a claim, C&J would have to present hard evidence that a theft was committed by a stranger.121 That idea was embodied in the following promise in the insurance contract:
[Allied will pay for] the felonious abstraction of insured property (1) from
within the premises by a person making felonious entry therein by actual
force and violence, of which force and violence there are visible marks made
by tools, explosives, electricity or chemicals . . . .122
As it turns out, a burglar robbed the fertilizer plant with style. While leaving some
tread marks in the mud, he avoided leaving any other visible signs before absconding with
$50,000 worth of fertilizer. The insurance company, denying the claim, argued that by its
plain language, the absence of visible marks made by tools (as opposed to tires) meant that
it didn’t have to pay.
The Iowa Supreme Court, in a contracts casebook staple, held that the exclusion
applied in this way violated the insured’s reasonable expectations. No one could have reasonably expected that burglary would be limited only to those leaving visible entry marks.123
118 Lawrence Solan, Terri Rosenblatt & Daniel Osherson, False Consensus Bias in Contract Interpretation,
108 COLUM. L. REV. 1268, 1291 (2008).
119 See generally Stempel, supra note 34 (describing the worry).
120 C & J Fertilizer, Inc. v. Allied Mut. Ins. Co., 227 N.W.2d 169, 176 (Iowa 1975).
121 227 N.W.2d at 172.
122 227 N.W.2d at 171.
123 227 N.W.2d at 177 (“But there was nothing relating to the negotiations with defendant’s agent which would
have led plaintiff to reasonably anticipate defendant would bury within the definition of ‘burglary’ another
exclusion denying coverage when, no matter how extensive the proof of a third-party burglary, no marks were

[p. 25]
In reaching that view, the court relied on its own common sense with no empirical grounding. Was it right?
That question triggers the simplest use cases of LLMs as part of the interpretative
process. The judge can simply ask the model for its assessment. Fantastical only three years
ago, today you might be merely whelmed by the model’s ability to respond coherently and
plausibly to this query. Here’s the model’s response, edited for readability:124
An insurance policy reads: "[The insurance company will pay for] the felonious abstraction of insured property (1) from
within the premises by a person making felonious entry therein by actual force and violence, of which force and violence
there are visible marks made by tools, explosives, electricity or chemicals." With this in mind, please state your prediction—
with the associated numerical level of confidence in parentheses—on the likely expectations of most policyholders under
these terms for the following propositions:
Table 1: GPT-4's estimates of propositions regarding the likely content of the gap in the
policy.
In other words, the model disagreed here with the court’s majority opinion. It (like
a dissenting opinion) predicts that policyholders would have expected to be required to provide some evidence of forceful entry to prove that the burglary was not an inside job.
To us these findings are facially plausible: they validate that this cheap and convenient tool could be potentially of use in real cases. But just because the probabilities are reasonable doesn't mean they are accurate. Your intuition should be: prove it! You would want
to know more, both about what the model is doing when it produces percentages, and how
left on the exterior of the premises. This escape clause, here triggered by the burglar's talent . . . was never read
to or by plaintiff's personnel, nor was the substance explained by defendant's agent.”).
124 Chat repository here https://chat.openai.com/share/4379b796-cece-4616-b8eb-b6772f13ad37

[p. 26]
that methodology fits courts’ purposes in interpreting insurance contracts.125 Let’s start
there, in Part II.A. We’ll then try some more complicated examples in the remainder of this
Section.
A. A Gentle Introduction to Large Language Models
When Chat GPT-4 told us that it was 90% likely that the policy would pay in response to a “substantiated third-party burglary,” what happened behind the curtain? We’re
going to give an explanation a shot here, knowing that doing so is difficult in part because
LLM technology is complex and rapidly changing. Essentially, LLMs create a statistical
model of how words connect by training on torrents of existing texts, some historic and
some artificially derived.126
In the common case, LLMs take user input in the form of text and produce an output, also in the form of text. Behind the scenes, the model takes the text and transforms it
into numbers. This is essential, because (superficially) computers do not read text. Numbers
can encode more information than letters, and they are more valuable in that they allow
computers to perform mathematical operations. This is easy to see in the case of ambiguities:
Duck is both a verb and a noun. But in a number system, we can use prefixes like 20 for verbs
and 10 for nouns, so we can encode the word duck twice. One is, say, 201 and the other 101,
to designate the disparate meanings and disambiguate them.127
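The duck example can be written out in a few lines of Python. This is only a sketch of the toy scheme described above: the prefixes 20 and 10 come from the text, while the per-word identifier and the function name `encode` are illustrative assumptions and do not describe how any real model stores words.

```python
# Toy numeric encoding of an ambiguous word, following the scheme in the text.
# Prefix 20 marks verbs and prefix 10 marks nouns (from the text).
PREFIX = {"verb": 20, "noun": 10}

# Hypothetical per-word identifier; the text implies "1" for duck.
WORD_ID = {"duck": 1}


def encode(word, part_of_speech):
    """Return an integer whose leading digits mark the part of speech."""
    return int(f"{PREFIX[part_of_speech]}{WORD_ID[word]}")


duck_verb = encode("duck", "verb")  # 201
duck_noun = encode("duck", "noun")  # 101
```

The same spelling now maps to two different numbers, so a program can tell the two senses apart even though a reader sees one word.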
This simple illustration understates the utility of this process, known as embedding.128 Rather than assigning a single number to each word, machine learning models transform them into strings of number-pairs—each pair capturing some aspects of meaning.129
Such vectors are very long; one of the latest models in common use employs a
125 As we emphasize throughout, model outputs involve a certain degree of randomness. Repeated experimentation, ideally with different prompts, is advisable. See infra section III.A. for discussion of best practices.
126 Synthetic data is growing in importance, and sometimes may improve model quality. John Jumper et al.,
Highly accurate Protein Structure Prediction With AlphaFold, NATURE 583, 587–89 (2021) (noting how
training on synthetic data improved the model’s accuracy significantly).
127 This, in a sense, is what standard English dictionaries do, at least if one were to number the words by order
of appearance.
128 For a description of embeddings (although without the attention mechanism) see Choi, supra note 16, at
20–22.
129 What embeddings capture is related to but different from meaning. For a discussion that emphasizes the
non-semantic-understanding view, see Lisa Miracchi Titus, Does ChatGPT have Semantic Understanding? A
Problem with the Statistics-of-Occurrence Strategy, 83 COGNITIVE SYSTEMS RESEARCH 1 (2024). For sake of
exposition, we imprecisely use the word meaning.

[p. 27]
vector with 12,288 number-pairs.130 For simplicity of exposition, suppose you had a list of
common animals and had a two-dimensional vector to describe them. One dimension could
be number of feet; another could be if they lived on land or sea. This would produce vectors
that we can visualize below:131
Figure 2: An illustration of the value of encoding meaning via simple embeddings
What makes vectors so powerful is that they allow us to capture not only semantics,
but also a syntactic relationship to other words. Horses and cows, in our very simplistic
schema, are closer to each other than they are to whales or sea turtles. The snake, always
awkward, occupies its own category. If we were to add salamanders, we would spot the emergence of a distinct category of amphibians, alongside the land mammals. Now, suppose you
did the same with over 10,000 dimensions.132 You can imagine the insights that might result
when words are described along such complex dimensions.
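A minimal Python sketch of the two-dimensional animal example may help. The coordinates below are assumptions chosen to match the description in the text (number of feet, and land versus sea); they are not taken from Figure 2 or from any actual embedding model, and the helper names are hypothetical.

```python
import math

# Assumed coordinates: (number of feet, habitat), with habitat 0 = land, 1 = sea.
ANIMALS = {
    "horse": (4, 0),
    "cow": (4, 0),
    "whale": (0, 1),
    "sea turtle": (4, 1),  # treats flippers as feet, a simplification
    "snake": (0, 0),
}


def distance(a, b):
    """Euclidean distance between two animals in the toy space."""
    return math.dist(ANIMALS[a], ANIMALS[b])


def nearest(name):
    """Return the other animal closest to `name`."""
    others = [n for n in ANIMALS if n != name]
    return min(others, key=lambda n: distance(name, n))
```

In this toy space `distance("horse", "cow")` is smaller than `distance("horse", "whale")`, which is the sense in which vectors let us measure how related two words are. Real embeddings work the same way in principle, but with thousands of dimensions that have no tidy labels.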
130 Nils Reimers, OpenAI GPT-3 Text Embeddings – Really a New State-of-the-Art in Dense Text Embeddings?, MEDIUM (Jan. 28, 2022), https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-reallya-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9.
131 Sea turtles have flippers, not legs. In a more sophisticated representation, we might have adopted a more
continuous representation of feet, where flippers are closer to feet than they are to, say, tails.
132 A technical clarification: the dimensions in the embedding model do not correspond to clearly defined semantic categories such as ‘feet’ or ‘habitat’. Rather, they condense information about words in ways that are
useful to the attainment of the model’s training objectives. For the best work to date on deciphering the inner
working of these complex systems see Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam
Jermyn, Tom Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby,

[p. 28]
Making words dimensional has proved powerful in many machine learning tasks,
but was insufficient to power the new LLM revolution. What was needed was the idea of
attention.133 Read the following sentences:
“Shohei Ohtani felt the stress. In a desperate attempt, he swung the bat.”
You intuitively grasp that they mean that Ohtani lifted a wooden bat and used it to
swing at the baseball. But how do you know that this was right, and not that Ohtani had
swung a mammal? As Amelia Bedelia taught us, it’s possible to turn many normal phrases
into misadventures if you ignore context. We know that swung typically is associated with
objects, not animals. And we connect bat with Ohtani, a baseball player, which further solidifies our interpretation of the sentence as referring to the object. In other words, our
minds naturally pay attention to the context of the word to infer the meaning of any specific
word.
An LLM’s attention mechanism seeks to achieve the same thing with respect to vectors.134 The model assigns an initial vector to each word in a sentence, which is then enriched
by information about its position in the sentence (via positional encoding). Then the attention mechanism assesses which words—say bat or swung—shed light on its meaning.135 In
the sentence above, words like “stress” and “felt” are not particularly relevant to the meaning
of the word “bat”; but both “swung” and “Shohei Ohtani” matter. This allows the model to
assign an attention score to each word in the input (relative to the word under analysis) and
then reweigh the encoding of the word under analysis relative to the words that are relevant
to its interpretation. This means that words do not have a stable embedding (as in the older
models), but rather, the embedding changes based on the specific context in which they are
presented.
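The reweighting step can be sketched with a stripped-down version of scaled dot-product attention. Everything numeric here is an assumption made for illustration: the two-dimensional vectors are invented (first coordinate for a sports-equipment sense, second for an animal sense), and the sketch omits the learned query, key, and value projections, the positional encoding, and the multiple attention heads that real models use.

```python
import math

# Hypothetical 2-D vectors: (sports-equipment sense, animal sense).
VECTORS = {
    "Ohtani": (2.0, 0.0),
    "felt": (0.1, 0.1),
    "stress": (0.0, 0.2),
    "swung": (1.5, 0.0),
    "bat": (1.0, 1.0),  # ambiguous before context is considered
}


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query_word, sentence):
    """Return attention weights for `query_word` and its context-adjusted vector."""
    query = VECTORS[query_word]
    scale = math.sqrt(len(query))
    # Attention score: scaled dot product between the query word and each word.
    scores = [
        sum(q * k for q, k in zip(query, VECTORS[word])) / scale
        for word in sentence
    ]
    weights = softmax(scores)
    # New vector: weighted average of every word's vector.
    contextual = [
        sum(w * VECTORS[word][i] for w, word in zip(weights, sentence))
        for i in range(len(query))
    ]
    return dict(zip(sentence, weights)), contextual


SENTENCE = ["Ohtani", "felt", "stress", "swung", "bat"]
weights, contextual_bat = attend("bat", SENTENCE)
```

With these made-up numbers, "swung" and "Ohtani" receive more weight than "felt" or "stress", and the adjusted vector for "bat" moves toward the sports-equipment coordinate and away from the animal one.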
Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen,
Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan & Chris Olah, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. https://transformer-circuits.pub/2023/monosemantic-features/ (2023)
133 This idea was most powerfully described in a 2017 paper. Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser & Illia Polosukhin, Attention is All You Need,
ARXIV: 1706.03762 (June 12, 2017), https://arxiv.org/abs/1706.03762.
134 For a helpful introduction, see SEBASTIAN RASCHKA, YUXI (HAYDEN) LIU & VAHID MIRJALILI, MACHINE
LEARNING WITH PYTORCH AND SCIKIT-LEARN 544-61 (2022) (describing the self attention mechanism)
135 This is a simplification in several ways. While we discuss words in the text, current models work at the level
of a token, which is a part of a word. The model is not directed towards meaning per se, but rather towards
information about other tokens that would help it achieve its training objective. Depending on the architecture, attention may be directed only at preceding tokens. There is more than a single attention mechanism and
each one attends to different relationships. There are other subtle simplifications that help the general reader.

[p. 29]
These ideas are combined to train a model. A model refers to a collection of parameters (mostly ones called “weights” and “biases”) organized in a specific way whose values are
used to transform the input into the model’s output. Modern language models contain tens
to hundreds of billions of such parameters, hence their common designation as “Large Language Models.”
Language models are trained with some objective function, a task which they try to
achieve and on which they are evaluated. In the context of most LLMs, the goal is prediction.
The model is presented with the sentence “Shohei Ohtani felt the stress. In a desperate attempt he swung the [?]” and then the model predicts which word would come next. If the
model were not calibrated, it might have guessed lamp or materiality. As these are (probably)
incorrect, the model is then led to calibrate toward accuracy through a process called gradient descent.136 This process repeats itself until the model learns that bat follows with 70.14%
probability, base with 25.13%, axe with 0.53%, club with 0.51%, and so on.137
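To give a feel for gradient descent, here is a toy Python sketch that starts from a model with no preference among candidate words and nudges it, step by step, toward the probabilities reported above. It is a deliberate simplification and not how a production model is trained: it adjusts one score per candidate word directly rather than billions of weights, it treats the reported percentages as the training target, and the "<other>" bucket (the remaining 3.69%), the learning rate, and the step count are all assumptions.

```python
import math

# Target next-word probabilities from the text; "<other>" absorbs the remainder.
TARGET = {
    "bat": 0.7014,
    "base": 0.2513,
    "axe": 0.0053,
    "club": 0.0051,
    "<other>": 0.0369,
}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=5000, learning_rate=1.0):
    """Fit one score per word by gradient descent on the cross-entropy loss."""
    words = list(TARGET)
    logits = [0.0] * len(words)  # uncalibrated start: every word equally likely
    for _ in range(steps):
        probs = softmax(logits)
        # For softmax with cross-entropy, the gradient is (prediction - target).
        grads = [p - TARGET[w] for p, w in zip(probs, words)]
        logits = [x - learning_rate * g for x, g in zip(logits, grads)]
    return dict(zip(words, softmax(logits)))


learned = train()
```

Each pass measures how far the current predictions are from the target and moves the scores a small distance in the direction that shrinks the gap, which is the "feel with your foot, then step downhill" routine in miniature.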
We say the model “learns.” But what does that mean? The simple answer is that
during training, the model adjusts the numerical values of billions of parameters such that
they would produce predictions that are more likely to achieve its training objectives. It conducts various (fairly simple) algebraic operations to create from a sentence like “Hello, how
are you __” a prediction that the highest-probability next word would be “doing.” And yet
this simplicity doesn’t capture the process: these parameters are effectively encoded in large,
inscrutable matrices whose meaning is wickedly hard to decipher, and whose organization is
alien. LLMs do not explain the why of their predictions.
You may have read, but would be wrong to conclude, that because the goal is to assign probability to the next word, these models simply imitate text they have seen elsewhere
or only develop a superficial model of the world. To effectively predict the next token in a
136 An analogy may capture the intuition behind gradient descent. Suppose you found yourself on a mountain
ridge on a pitch-black night and you are trying to find your way down to the valley below. Feeling with your
foot, you sense that going West would lead you upwards, East is level, South has a mild declining slope,
and North has a steep declining slope. You head North, and then after a few steps, you test again, to see which direction to take now. This process of finding your way is similar to gradient descent. The model identifies the
steepest slope (or gradient) for reducing its deviation from its targets, and adjusts its parameters accordingly. It
then checks again with more data, iteratively improving until it finds the best or "lowest" configuration. For
a helpful and more formal introduction, see Sebastian Ruder, An overview of gradient descent optimization
algorithms, arXiv:1609.04747v2 (2017)
137 Based on actual predictions of the Text-Davinci-003 model with temperature 0.7, max length = 256, top p
= 1, 0 frequency or presence penalty and best of 1.

[p. 30]
sequence, the models cannot simply memorize what they have seen elsewhere.138 To predict
the continuation of a new sentence like “When they moved to the USA, they set their first
home in the state of ______” would require the model to develop a mathematical sense of
what are states and immigrants, and which ones are popular destinations for those who recently arrived.139 As large as they are, the models are much smaller than the data they are
trained on. And so, models necessarily seek deeper representation of the information they
train on. This is not unlike how humans read books, learn from them, but cannot recite
them. You can see that model outputs are original because they produce entirely new but
responsive text. Of course, this sometimes results in making up facts.
Finally, consumer-facing chatbots simply invite the user to chat with the model directly. Behind the scenes, however, the model’s behavior is calibrated by settings called “hyperparameters.”140 The details are quite technical, but one of those hyperparameters is of
specific interest. LLMs have “temperature” settings that can be adjusted from low to high.
The lower the model’s temperature, the more predictable its output.141 A very low temperature ensures that the model always outputs the same answer to the same query. A higher one
introduces more randomness and outputs that you might think of as “creative.”
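The effect of temperature can be illustrated with a short Python sketch. It assumes, for illustration only, that the next-word probabilities quoted earlier for the Ohtani sentence are the model's baseline distribution, with an "<other>" bucket for the remainder. The function names are hypothetical, and treating a temperature of exactly zero as "always pick the top word" is a common convention rather than a claim about any particular vendor's implementation.

```python
import math
import random

# Assumed baseline next-word distribution (figures from the Ohtani example).
NEXT_WORD = {
    "bat": 0.7014,
    "base": 0.2513,
    "axe": 0.0053,
    "club": 0.0051,
    "<other>": 0.0369,
}


def rescale(probs, temperature):
    """Divide log-probabilities by the temperature, then renormalize."""
    if temperature == 0:
        # Convention: zero temperature means always choosing the top word.
        best = max(probs, key=probs.get)
        return {w: 1.0 if w == best else 0.0 for w in probs}
    scaled = {w: math.log(p) / temperature for w, p in probs.items()}
    peak = max(scaled.values())
    exps = {w: math.exp(s - peak) for w, s in scaled.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def sample(probs, temperature, rng):
    """Draw one next word at the given temperature."""
    dist = rescale(probs, temperature)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
```

For instance, `rescale(NEXT_WORD, 0.5)` concentrates more probability on "bat" than the baseline does, while `rescale(NEXT_WORD, 2.0)` spreads probability toward the less likely words, which is why higher temperatures read as more "creative."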
So far, so good. Now let’s return to our question: What is the model doing when it
assigns a 90% probability to the likelihood that a reasonable person would expect an insurance payment under certain circumstances? The first step for the model is to convert the
query we entered to numbers (really, tensors). The next step is crucial: Now the model attends to the context of words and uses it to adjust their meaning. If the model sees the word
“premium” in the current context, it will know to adjust its meaning away from dictionary
meanings such as “high quality” and towards “consideration for an insurance policy.”142
Armed with a contextual understanding of the query, the model can now run
through its vast internal network of parameters and calculate what is the most likely word
138 See Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin & Vedant Misra, Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, ARXIV: 2201.02177 (Jan. 6, 2022),
https://arxiv.org/abs/2201.02177.
139 For example, the Text-DaVinci-003 model predicts California (30.8%), New York (21%), and Florida
(17%) as the most likely continuations. https://i.imgur.com/uSBNKfh.png
140 The term “hyperparameter” is necessary to distinguish the model’s own parameters from the parameters
that define its training and operation. We reference here the post-training hyperparameters, noting that there
are also hyperparameters that dictate the training of the model.
141 For a friendly technical review, see FRANÇOIS CHOLLET, DEEP LEARNING WITH PYTHON, CHAPTER 12.1
(2nd ed., 2021).
142 Premium, MERRIAM-WEBSTER DICTIONARY (July 27, 2023),
https://www.merriam-webster.com/dictionary/premium.

[p. 31]
(really, token) that would follow next. It will assign infinitesimally low probabilities to words
that relate to gardening or makeup, but will assign increasingly higher probabilities to words
that relate to the insurance context. Once the model determines the most likely continuing
words, it ranks them by probability. At a zero temperature setting, the model will always
select the word with the highest probability to follow, but as we increase the temperature it
will occasionally pick other words as well. When the model chooses 90%, it is predicting that
this number is the most likely continuation of the text preceding it.
This explanation skips over the hardest question, which is why the model assigns the
highest probability to 90%. The honest answer is quite unsatisfactory: It picked this number
because based on its vast training data and internal statistical model, it found that 90% is a
more likely continuation than 10%. This is nothing like an explanation a human would give,
where reasons and factual considerations would be provided. The model’s outputs are a
brute statistical fact. It is possible to ask the model to justify itself. And the model will diligently reply with an answer. But it is critical to understand that whatever the model tells
you, it is really no explanation at all. It is a prediction of what explanation is likely to follow
the query. So, working with LLMs admittedly requires a leap of faith, a realization that no
better explanation is forthcoming than long inscrutable matrices that produce predictions.
B. LLMs as a Source of Contractual Meaning
With a grasp of the technology in hand, let’s work through some more quotidian
examples of LLMs’ potential use outside of the insurance context. Textualists—as we’ve described—think that texts have an inherent plain meaning, at least within the context of the
written document. The problem is deciding what it is, and whether our intuitions are representative. LLMs may serve as powerful tools to uncover those answers.
We’ll start with the divorce of Jennie and Mark Famiglio. Jennie and Mark entered
into a prenup before getting married, which committed to a sliding scale of payments from
Mark to Jennie if they divorced, tied to the length of their union. Section 5.3a read:
5.3. JENNIE's Benefits and Obligations. If the marriage ends by dissolution
of marriage or an action for dissolution of marriage is pending at the time of
MARK's death, then JENNIE shall receive the additional benefits and obligations described in 5.3.a. through d.
a. MARK shall pay to JENNIE, within ninety (90) days of the date
either party files a Petition for Dissolution of Marriage the amount

[p. 32]
listed below next to the number of full years they have been married at the time a Petition for Dissolution of Marriage is filed.143
Although Jennie filed a petition for divorce after seven years, she never served the
petition and later voluntarily dismissed the action. After ten years, she filed again, and meant
it. Under the prenup, seven years of marriage entitled her to $2.7 million; ten years a whopping $4.2 million. The parties were left with a consequential but basic interpretative question: When the prenup mentions the number of years at the time “a” petition is filed—did
the parties mean the first petition or the ultimate one?
Neither party thought witnesses were necessary, as both understood “a Petition” to be
unambiguous (and favoring their side). Unfortunately for Jennie, a Florida appellate court
ruled against her.144 Relying in part on dictionaries, it emphasized that “a” is an indefinite
article. Ordinarily, the court stated, when people predicate a condition on an indefinite
event, they mean its first occurrence. Thus, imagine if a golf course posts a rule: “when a
thunderstorm approaches, you must end your golf game.”145 That would be “universally understood . . . to mean the first time a thunderstorm approaches.” And so, “a” petition filing
simply must mean the first one filed. The court’s method of proof seems sensible. But was it
right to be so sure of itself?
We presented GPT with the prenuptial agreement and asked it: If one of the parties
files a divorce petition, withdraws it, and then a few years later a new petition is filed, what
date determines the number of full years of marriage: the first filing or the second one? It
produced a sentence that essentially supported Jennie’s view. But to illustrate how the model
can help courts be more precise, we can freeze the output in time and take a peek under the
hood, as Figure 3 illustrates.
143 279 So. 3d at 737.
144 Id. at 742.
145 Id. at 743.

[p. 33]
Figure 3: Davinci-003, temp=1, top-p=1, frequency and repetition penalty =0, best
of 1, full spectrum, presented with Famiglio facts and asked “If one of the parties files a divorce petition, withdraws it, and then a few years later a new petition is filed, what date determines the number of full years of marriage: the first filing or the second one?”
This illustration captures the probabilistic way the model thinks of language and its
own process. When it started to produce its answer, it predicted that it ought to start with
“The.” Now, neither we nor the model know how it would continue the sentence. It read
our question and its partial answer and then made a prediction. Given the context and the
vast corpus on which it sits, what should have come next—second or first? It concluded that
“second” makes more sense. And once second is produced, the rest of the answer follows.146
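The step from raw model scores to a probability for each candidate word can be sketched in a few lines of Python. This is an illustration only: the candidate words and their scores are hypothetical placeholders, not the values underlying Figure 3, and the function name is invented for the example. It shows how a temperature-scaled softmax turns scores into probabilities, and how a lower temperature concentrates probability on the leading candidate.

```python
import math

def next_token_probs(logits, temperature=1.0):
    """Turn raw next-token scores into probabilities (temperature-scaled softmax)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# HYPOTHETICAL scores for the word that follows "The"; not data from the paper.
hypothetical_logits = {"second": 2.0, "first": 1.0, "date": 0.0}

probs = next_token_probs(hypothetical_logits, temperature=1.0)
colder = next_token_probs(hypothetical_logits, temperature=0.5)
for tok in hypothetical_logits:
    print(f"{tok:>6}: T=1.0 -> {probs[tok]:.2f} | T=0.5 -> {colder[tok]:.2f}")
```

With a real model, the scores would come from the model's reported token log-probabilities rather than from a hand-written dictionary.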
Generative interpretation in this simple case thus offers courts a better sense of the
relevant probabilities if the parties were intending to use English in its most public and common sense. And it does so without reference to singular, perhaps idiosyncratic, illustrations
pulled from the golf course. Of course, it’s possible that in the context of their deal, extrinsic
evidence pointed to a private meaning—or perhaps trade practice could have pushed the
court away from the meaning that the model suggests is normal. And, as we’ll discuss, knowing that the court would use the model might have motivated both parties to not so quickly
assume that their meaning was unambiguously correct.
C. The Ambiguity Problem
As Famiglio illustrated, the question of whether a term is ambiguous, permitting
extrinsic evidence or not, can be outcome determinative. That’s true for interpretative methods of all stripes. Even the most free-spirited contextualists are not that free. They will not
146 The usual LLM caveats apply, and the probabilities shouldn’t be interpreted literally. The model could, for
example, continue the sentence with “The first filing would not control.”

[p. 34]
waste the parties’ time on a lengthy trial when they think that the language in the contract
is simply not “reasonably susceptible” to the interpretation proffered by one of the parties.
As a result, a key question in contextualist jurisdictions is which interpretation, exactly, the
language is reasonably susceptible to.
Take the well-known case of Trident v. Connecticut, often listed as a primary argument against California-style contextualism.147 A group of lawyers, assisted by other real estate investors, sought to buy commercial real estate to build their law offices. They borrowed
$56 million from Connecticut Insurance, with an agreement to pay it back over 15 years at
12.25% APR. At one point, the agreement stated that the principal could not be prepaid, at
least not within the first 12 years of the agreement. However, interest rates fell, and the borrowers sought to prepay the loan with money they would borrow elsewhere.148 When they
were rebuffed, they turned to litigation.
The promissory note clearly stated that the borrowers “shall not have the right to
prepay the principal amount hereof in whole or in part.” But they pointed to a different
clause, creating a 10% prepayment penalty for defaulted loans if the lender accelerated.149
The borrowers’ lawyers relied on the famous statement of California’s contextualism rule,
Pacific Gas,150 to argue that they ought to be permitted to offer extrinsic evidence—negotiations, trade usage—in support of their contractual reading.151
In the Ninth Circuit, Judge Kozinski used the case to offer what others have described as a “shrill attack” on the looseness of the California parol evidence rule.152 He discounted the borrower’s prepayment argument, since it was at the lender’s option. And he
concluded that the contract’s “shall not have the right” clause was crystal clear that prepayment was forbidden—standing alone, it was not reasonably susceptible to the borrower’s
meaning. Nonetheless, Judge Kozinski remanded the case. He wrote:
Under Pacific Gas, it matters not how clearly a contract is written, nor how
completely it is integrated, nor how carefully it is negotiated, nor how
squarely it addresses the issue before the court: the contract cannot be rendered impervious to attack by parol evidence. If one side is willing to claim
147 847 F.2d 564.
148 Historic rates had fallen by around 3 percent, meaning an early pre-payment would have meant a saving of
~$1.1 million over the life of the loan.
149 847 F.2d 564.
150 Pacific Gas & Electric Co. v. G.W. Thomas Drayage & Rigging Co., 442 P.2d 641 (1968).
151 847 F.2d at 568 (noting reliance on Pacific Gas).
152 Peter Linzer, The Comfort of Certainty: Plain Meaning and the Parol Evidence Rule, 71 FORDHAM L. REV.
799, 805 (2002).

[p. 35]
that the parties intended one thing but the agreement provides for another,
the court must consider extrinsic evidence of possible ambiguity. If that evidence raises the specter of ambiguity where there was none before, the contract language is displaced and the intention of the parties must be divined
from self-serving testimony offered by partisan witnesses whose recollection
is hazy from passage of time and colored by their conflicting interests . . .
The opinion, written with flair, is in many contracts casebooks, but it is a puzzle in
its own right. California’s existing rule provided that extrinsic evidence was to be admitted
only if the language in the contract was “reasonably susceptible” to the interpretation proffered by the parties. Thus, if Kozinski really had been confident that the language was clear,
he should not have remanded.153 We wondered whether his factual premise was correct and
asked LLMs to help.
After obtaining the original promissory note,154 we introduced the relevant parts to
three leading LLMs: GPT-4, Claude 2, and a version of the open source model Llama-2,
and then asked for their evaluation.155 We asked them to read the entire contract and then
estimate, as a judge, the likelihood that the parties intended early repayment to be permitted
under the agreement. To capture a range of model responses, we repeated the same question
many times, while setting the “temperature” at a sufficiently high level to ensure that different responses might be picked.
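A repeated-query procedure of this kind might be sketched as follows. This is a minimal Python illustration, not the code used for the experiments reported here: `fake_model` is a hypothetical stand-in that returns made-up scores, where a real run would send the contract and the question to a model at a nonzero temperature and parse the 0 to 100 estimate from each reply.

```python
import random
import statistics

def summarize_repeated_queries(ask_model, n_runs=50):
    """Call ask_model() n_runs times and summarize the 0-100 scores it returns."""
    scores = [ask_model() for _ in range(n_runs)]
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        # Share of runs that lean toward "prepayment was permitted".
        "share_above_50": sum(s > 50 for s in scores) / len(scores),
    }

# HYPOTHETICAL stand-in for a model call; the numbers are invented, not results.
rng = random.Random(0)

def fake_model():
    return min(100, max(0, round(rng.gauss(40, 15))))

summary = summarize_repeated_queries(fake_model, n_runs=50)
print(summary)
```

Reporting the spread and the share of runs above the midpoint, rather than the mean alone, keeps the minority readings visible.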
153 Susan J. Martin-Davidson, Yes, Judge Kozinski, There Is A Parol Evidence Rule in California—The Lessons
of a Pyrrhic Victory, 25 S.W.U. L. REV. 1, 18–20 (1995). As Prof. Martin-Davidson points out, after remand
the defendants won a summary judgment motion and their attorneys’ fees. There never was a trial. Id. at 4,
n.22.
154 We thank Prof. Todd Rakoff for providing it from his collection.
155 The 70 billion parameter version of the Llama-2 model is considered the highest performing open source
model at this time, and we used the currently highest-performing fine-tuned version of this model, as measured by the HuggingFace Open LLM Leaderboard, https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. (https://imgur.com/a/brjE8Tb)

[p. 36]
Figure 4: Turbo GPT-4, Claude 2, and Llama-2 70b, with temperature set at 1, and
fed the Trident promissory note in full. The models were asked whether the language of the
agreement is reasonably susceptible of being read as providing the borrower the right to early
repayment. On the x-axis, 0 indicates that this interpretation is wrong and 100 that it is correct.
Figure 4 is suggestive of how generative interpretation can deepen and enrich judicial analysis. Overall, the models roughly agree on average that prepayment is not allowed,
with a mean score of ~41. The least powerful model here, Llama 2, was more open to the
possibility than the more powerful, proprietary models. But the two most powerful models,
Claude 2 and GPT-4, both shared a similar evaluation: the contract was not “reasonably
susceptible” to the interpretation advanced by the Trident group.
One read of this result is that it suggests that Kozinski’s intuitive factual premise
was right, but that he reached the wrong conclusion in remanding. That is, even taking the borrower’s
argument seriously, the dominant reading rejects a finding of ambiguity. No further extrinsic evidence ought to have been admitted. This would align with common criticisms of the
opinion.156 On the other hand, the models were not uniform in their assessment; the probability distribution suggests that at least some probabilistic readings of the contract permit
156 See Martin-Davidson, supra note 153.

[p. 37]
early repayment. To determine the case, we would want to know more about those minoritarian readings: Are they reflective of discrete linguistic communities, private meanings, or
other legally relevant factors? Generative interpretation does not answer the question of
whether language is reasonably susceptible of a meaning; it instead helps us visualize a broad
spectrum of meaning and quantify how likely a particular result is.157
Now consider another case turning on ambiguity: Ellington v. EMI.158 The issue in
this case arose from a 1961 net receipts agreement between the musician Edward Kennedy
“Duke” Ellington and his record company, EMI. As was common at the time, the parties
agreed on a 50/50 royalty split, after deducting fees charged by third parties that intermediate in foreign markets. This net receipts agreement bound EMI and its “other affiliates.” In
the intervening decades, the music industry underwent significant consolidation, and EMI
began to use its own affiliates rather than rely on third parties for foreign operations. It
sought to deduct those affiliate fees before paying Ellington’s estate.
Feeling blue, Ellington’s grandson sued, arguing that two key phrases in the contract
were ambiguous: “(1) the phrase ‘net revenue actually received’ in the royalty provision and
(2) the term ‘any other affiliate’ in the definition of Second Party.”159 The New York Court
of Appeals, the country’s preeminent textualist tribunal, rejected the claim. The majority
held that the terms were unambiguous: They only reference affiliates that existed at the time
of contracting. There is simply no way that they could be read in any other way, given the
tense that the parties used and the court’s aversion to forward-looking language.160
Again we had access to the original contract. We presented it to the various models
for plain language analysis, asking: “Does ‘other affiliates’ naturally include only the existing
affiliates at the time of contract, or does it potentially encompass affiliates that might be
created over time?”
157 Whether a conclusion that is 20% likely is legally reasonable might turn on several factors we do not explore
in the text. Imagine a particular linguistic subcommunity whose understanding of terms correlates with the
parties’. (You could think of this as akin to trade usage, but for culture.) In that case, deferring to majoritarian
readings would tend to suppress important perspectives. See generally Dan Kahan, David A. Hoffman and
Don Braman, Whose Eyes are You Going to Believe: Scott v. Harris and the Perils of Cognitive Illiberalism,
122 HARV. L. REV. 837 (2009) (discussing how simulations can uncover discrete minority perspectives on legally-operative facts that the law should attend); David A. Hoffman, From Promise to Form: How Contracting Online Changes Consumers, 91 N.Y.U. L. Rev. 1595 (2016) (arguing that younger parties have distinct
views of contracting from older ones).
158 Ellington v. EMI Music, Inc., 21 N.E.3d 1000, 1001 (2014).
159 Id. at 245.
160 Id. at 246–47.

[p. 38]
Before we describe the model’s answer, we should highlight two robustness concerns
with model interpretation. Models are quite sensitive to the prompt used.161 This opens
them to a problem of “leading prompts,” queries that lead the model towards a desired answer. And, as we described earlier, models can be set to be hotter (more random) or colder
(more deterministic). This allows the user (judge, researcher, policymaker) many degrees of
freedom.
To deal with these issues we tried something new. Rather than a single prompt, we
used 20 variations of the same question, each queried 10 times at a relatively high temperature setting.162 We presented yes/no questions where yes indicates agreement with the
judge’s interpretation. The Figure below summarizes the results of the experiment among
four of the leading models.
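The design described above, many prompt variants each sampled several times, can be sketched as a small grid. This is a minimal Python illustration under stated assumptions, not the code referenced in the footnotes: `fake_model` and the placeholder prompt strings are hypothetical, and the "yes" rate they produce is invented rather than a result.

```python
import random
from collections import Counter

def run_prompt_grid(prompts, ask_model, samples_per_prompt=10):
    """Ask each prompt variant several times and tally the share of 'yes' answers."""
    per_prompt = {}
    for prompt in prompts:
        answers = Counter(ask_model(prompt) for _ in range(samples_per_prompt))
        per_prompt[prompt] = answers["yes"] / samples_per_prompt
    overall = sum(per_prompt.values()) / len(per_prompt)
    return per_prompt, overall

# HYPOTHETICAL stand-in for a model call: ignores the prompt and answers
# "yes" (agreement with the court's reading) about 30% of the time.
rng = random.Random(0)

def fake_model(prompt):
    return "yes" if rng.random() < 0.3 else "no"

prompts = [f"prompt variant {i}" for i in range(1, 21)]  # 20 placeholder variants
per_prompt, overall = run_prompt_grid(prompts, fake_model)
print(f"overall yes share: {overall:.2f}")
print(f"range across prompts: {min(per_prompt.values()):.1f} to {max(per_prompt.values()):.1f}")
```

Looking at the range across prompt variants, and not only the overall share, is what exposes sensitivity to a leading prompt.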
161 See generally Laria Reynolds and Kyle McDonell, Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm, arXiv:2102.07350 (2022) (discussing the effect of prompting techniques on model outputs).
162 Specifically, we set temperature at 1 and top p=1 to encourage a broad range of responses. The 20 prompts
were generated by GPT-4, after seeding it with the background of the case and a seed question. AMBIGUOUS
CONTRACT INTERPRETATION, https://chat.openai.com/share/e9003c92-5e32-436c-816d-c2add7ac485b
(last visited July 29, 2023). For code, see supra note 22.

[p. 39]
Figure 5: Ellington v. EMI, analyzing the interpretation of “other affiliates” using
temperature 1, and responding ten times to twenty prompt variations generated by GPT-4, after seeding it with the background of the case.
As the Figure illustrates, the four models don’t share the New York court’s confidence: The most common interpretation of “other affiliates” includes those that post-date
the contract. Llama-2, the open source model, is somewhat open to EMI’s argument, reflecting that it has some facial plausibility. Of course, even uniformity between powerful models
cannot decide cases. The point, rather, is to illustrate the value of LLMs as a convenient
check against overconfidence, and a spur to greater reflection. (Though the fact that the dissent thought that the contract was ambiguous might have produced that same introspection.)
D. Filling Gaps
One of the most difficult issues in contract interpretation is distinguishing silence
from unexpected gaps. Contracts are incomplete: The parties leave many topics to necessary

[p. 40]
implication. Such omissions are not always deliberate: Sometimes parties simply have not
contemplated a problem—a global pandemic, a supply chain disruption, another peerless
ship sailing ex Bombay163—and the court must engage in filling the gaps, rather than merely
interpreting words on the page. Whether a court is “filling a gap” by extrapolating from the
parties’ words and actions, or “constructing” a contract term by its own lights, is a highly
contested problem.164 But antecedent to that dispute is a simpler one: can we do a better job
of predicting what the parties would’ve said had they been asked at contracting about an area
they left blank?
Consider the 1977 New York Court of Appeals case, Haines v. City of New York.
It resolves a dispute about a 1924 contract between the City of New York and an upstate
village, in which the City promised to pay the town to process its own sewage so that the
city’s water supply could be cleaned. (That is, the city paid the village not to pollute.) As the
decades passed, the townships grew and the Federal Government passed environmental regulations. By the early 1970s, facing strong budgetary pressures, New York City refused to
continue to pay for the township’s expansion of the sewage facilities. A local developer sued,
arguing that the contract’s absence of a duration term or cabin on the scope of the city’s
obligation meant that the city was in breach.
The court considered those arguments in a decision that looked only to the written
contract. It determined that the parties did not mean for the contract to run forever, in a
provision notable for its brevity.
[W]here the parties have not clearly expressed the duration of a contract,
the courts will imply that they intended performance to continue for a reasonable time but also did not mean it to be terminable at will . . . . Thus, we
hold that it is reasonable to infer from the circumstances of the 1924 agreement that the parties intended the city to maintain the sewage disposal facility until such time as the city no longer needed or desired the water, the
purity of which the plant was designed to insure.165
The logic here isn’t compelling but rests on an empirical prior: By default, parties do
not intend contracts to be terminable at will when they write unlimited obligations, and
nothing about the language or circumstances of the contract compels a contrary conclusion.
163 Raffles v. Wichelhaus, 159 Eng. Rep. 376 (1864).
164 On this generally—and expressing useful skepticism about the borders—see Klass, supra note 45; see also
Larry Solum, Legal Theory Lexicon: Interpretation and Construction, LEGAL THEORY BLOG (May 31, 2020),
at https://lsolum.typepad.com/legaltheory/2020/05/legal-theory-lexicon-interpretation-and-construction.html (explaining the difference).
165 Haines, 41 N.Y.2d 769, 772 (breaks added, cleaned up).

[p. 41]
On the related question of whether the city promised (implicitly) to continue to
expand the system’s capacity, the court was less generous.
By the agreement, the city obligated itself to build a specifically described disposal
facility and to extend the lines of that facility to meet future increased demand. At
the present time, the extension of those lines would result in the overloading of the
system. Plaintiff claims that the city is required to build a new plant or expand the
existing facility to overcome the problem. We disagree. The city should not be required to extend the lines to plaintiffs' property if to do so would overload the system and result in its inability to properly treat sewage. In providing for the extension
of sewer lines, the contract does not obligate the city to provide sewage disposal services for properties in areas of the municipalities not presently served or even to new
properties in areas which are presently served where to do so could reasonably be
expected to significantly increase the demand on present plant facilities.166
Once more the court alludes to the agreement, but its decision is inattentive to the
details. It found an implicit condition to obligation: Extension is required only so long as
the system is not overloaded.167 But this was a gap-filling exercise, informed by the court’s
judgment about what the parties should have said.168 Such determinations were part of a
trend in New York courts in favor of a looser, Cardozian approach to missing terms.169
With the cooperation of the New York court system, we obtained the 1924 contract.170 This contract and the various exhibits are long, especially considering when they
were created: about eight pages of Word documents. We entered the text into the two models that can support such long inputs—GPT-4's experimental version and Claude 2—and
166 Id. at 773.
167 The City of New York at the time was under severe financial stress and courts rushed to protect it from
bankruptcy. Robert M. Jarvis, Phyllis G. Coleman & Gail Levin Richmond, Contextual Thinking: Why Law
Students (and Lawyers) Need to Know History, 42 WAYNE L. REV. 1603, 1613 (1996).
168 For an argument suggesting that there is no fact-of-the-matter about parties’ intent when filling gaps in
contracts, see Robert A. Hillman, More Contract Lore, 94 TUL. L. REV. 903, 910 (2020); Robert A. Hillman,
The Supreme Court’s Application of “Ordinary Contract Principles” to the Issue of the Duration of Retiree
Healthcare Benefits: Perpetuating the Interpretation/Gap-Filling Quagmire, 32 ABA J. LAB. & EMP. L. 299,
320 (2017).
169 Perhaps this part of the opinion responded to the City’s financial exigency. William E. Nelson, A Man’s
Word and Making Money: Contract Law in New York, 1920-1960, 19 MISS. COLL. L. REV. 1, 13 (1998).
170 E-mail from Marisa Gitto, Reference Services, New York State Library, to Michael Hurley, Research Assistant, University of Pennsylvania Carey Law School (May 22, 2023, 03:01 EST) (on file with authors).

[p. 42]
asked them to assess the validity of several legal arguments given the agreements.171 Figure 6
illustrates what we found.
Figure 6: Haines v. City of New York gap-filling analysis using ChatGPT-4 (32k
context length) and Claude 2 (100k context length).
The first set of questions concerned duration. Both models reject the city’s claim
that the contract was terminable at will. And both (with different degrees of confidence)
were open to durational gap fillers of an indefinite time, a reasonable time, by joint agreement, or until a time when a legal excuse is present—which is indeed the common law rule
171 GPT-4: https://poe.com/s/Vp9tkyhGnMmHqFvdKp4n. You should take a model’s self-reported degree of
confidence with a grain of salt; it is more meaningful to simply compare its expressed confidence with respect
to different questions, hence our experiment design here.

[p. 43]
for most contracts.172 GPT-4 (like the court) explained, “while a reasonable duration might
be inferred under common law principles, this argument does not strongly accord with the
contract’s language.”173 Overall, the models appear to generally support the court’s reading.
The second set of questions involved the scope of the city’s obligations. GPT-4 disagreed strongly with the court; it thought that the city’s obligation was unbounded. Importantly, it anchored its reasoning in a section of the contract neglected by the court: Section 6. That part obligates the city to extend sewage plans “[w]henever extensions of any of
the sewer lines are necessitated by future growth . . . of the respective communities.” For
ChatGPT-4, this provision implied the obligation to build additional treatment plants. But
Claude 2 was more amenable to the court’s interpretation and provided a plausible constraining argument: “The agreement provides for extensions when required by growth, implying a reasonable obligation.”
E. From Text to Context
So far, we have provided examples that showcase how large language models might
power a stronger, cheaper, more robust form of textualism. We now consider how such models can account for contextual evidence such as prior conversations, shared expectations, and
industry standards. Stewart v. Newbury provides a simple illustration.174 In Stewart, a contractor and a business corresponded about the construction of a new foundry. The contractor’s offer letter was brief; he offered to do the job and charge either by offering an itemized
list or by charging on a cost + 10% basis. This letter was followed by a telephone call where
they may have agreed that payment would be made “in the usual manner.” Finally, the
foundry responded in writing that, following the phone conversation, they accepted the bid.
As far as we know, that amounts to the entirety of the contracting case file.175
Once the contractor finished the first part of the project, he submitted a bill. The
foundry refused to pay. The contractor insisted that it was customary to pay 85% of payments due at the end of every month, but the foundry argued that its payments were only
due on (substantial) completion of the project. Seeing no payments made, the contractor
stopped work. The parties countersued for breach.
172 See Glacial Plains Coop. v. Chippewa Valley Ethanol Co., LLLP, 912 N.W.2d 233, 234 (Minn. 2018)
(holding that unless otherwise provided, a “contract is of indefinite duration and is terminable at will by either party after a reasonable time and with reasonable notice.”)
173 Likewise Claude-2 explained: “A reasonable duration could be implied, though not explicitly stated.”
174 220 N.Y. 379 (1917).
175 Id. at 380–84.

[p. 44]
Today, the default rule is that payments in construction contracts are not due until
the contract is substantially performed.176 It is unclear that this rule was in place when the
parties agreed in 1919. The foundry argued that no payment was due under the contract,
and hence, the contractor’s refusal to work was wrongful. So now we have an interpretive
question: Did the parties agree to a particular payment regime?
The written agreement is too sparse to help, but the phone conversation offers an
in. If we believe that the parties indeed agreed to make payments in the usual manner, then
it is possible to interpret usual as referring to an alleged common practice of monthly installment payments. It is also possible, however, that ‘usual’ refers to other standard payment
conventions—say, the payment on a cost +10% basis.
The court remanded because of faulty jury instructions, so the interpretative question was left undecided. We, however, are not so constricted. We asked today’s leading
LLMs, GPT-4 and Claude-2, to predict what the parties meant. To do so, we first told the
models to assume that the default legal rule would be that payment is conditioned on substantial performance.177 Then, we asked the models to estimate how the parties would have
interpreted their deal absent consideration of either extrinsic evidence of the phone conversation or evidence of industry norms. We then added the evidence of the phone conversation, to see how the model’s confidence changed, and finally, we added evidence of the custom in the industry. Table 2 summarizes the results:178
176 See 22 N.Y. JUR. 2D CONTRACTS § 352; Hillman, supra note 168, at 313 (“courts in construction cases find
a duty to pay only after substantial performance”).
177 This is not obviously the correct legal rule, then or now, but we had to start somewhere, and we took the
court at its word.
178 CLAUDE 2.0 POE CONVERSATION, https://poe.com/s/wLkeCDrPdFpKye3uApSa (last visited July 30,
2023). Again, you should be skeptical of a model’s expressed confidence; the direction of change with every new
piece of evidence, not its quantification, is what is reliable.

[p. 45]
Table 2: Expressed confidence in “the duty to pay is monthly” based on legal and
transactional context. Presented to GPT-4 (32k context window) and Claude-2 (100k
context window).
Table 2 demonstrates how each additional piece of evidence alters the analysis. And
for purposes of this case, it shows that, for the models at least, extrinsic evidence was materially important to the outcome.
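The incremental logic of this exercise, where the direction of each shift matters more than its size, can be sketched in Python. The stage labels and confidence figures below are hypothetical placeholders chosen for illustration; they are not the entries of Table 2, and the function name is invented for the example.

```python
def confidence_shifts(stages):
    """Given ordered (label, confidence) pairs, return the change each new stage adds."""
    shifts = []
    for (_, previous), (label, current) in zip(stages, stages[1:]):
        shifts.append((label, current - previous))
    return shifts

# HYPOTHETICAL expressed confidence (0-100) that "the duty to pay is monthly".
hypothetical_stages = [
    ("written exchange only", 20),
    ("+ phone call", 45),
    ("+ industry custom", 70),
]

for label, change in confidence_shifts(hypothetical_stages):
    direction = "raises" if change > 0 else "lowers" if change < 0 else "leaves unchanged"
    print(f"{label}: {direction} expressed confidence ({change:+d} points)")
```

Because self-reported confidence is unreliable as a measurement, a sketch like this is best read for the sign of each change, not its magnitude.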
Illustrating the additional value of each piece of evidence can provide unexpected
value. Judges may fairly worry, when considering potentially unreliable evidence, that mere
exposure to the evidence would irreversibly prejudice their decisions. By estimating the probative value of some forms of evidence before closely examining them, the judge can develop
a heuristic assessment of probative value with relatively little exposure. The model can thus
give structure to the evaluation of extrinsic evidence, making it more attractive to factfinders. And within the limits of its prompts, its conclusions are coherent, cheap, and seemingly
plausible.
III. THE FUTURE OF CONTRACT INTERPRETATION
So convenient are today’s LLMs, and so seductive are their outputs, that it would be
genuinely surprising if judges were not using them to resolve questions of contract interpretation as we write this article, only a few months after the tools went mainstream. Looking
at practical guidance offered to lawyers in the summer of 2023, we see lawyers are encouraged to use LLMs to perform legal research, draft deposition questions and contracts, and
predict settlement values.179 And there are hints that judges are already using ChatGPT to
answer other kinds of interpretative questions, just as they would use Google.180 In one recent survey, one-quarter of judges confessed to using the tool, though many expressed concern about its reliability.181
These models are useful because they offer new tools—fast, cheap, sometimes incorrect ones—in service of old interpretative goals. Courts will soon take a phrase like “dozen”
179 Catherine Casey, Reveal Brainspace, Ronald J. Hedges, Ronald J. Hedges LLC, Marissa J. Moran, N.Y.C.
Coll. of Tech., Stephanie Wilson, Reed Smith LLP, Generative Artificial Intelligence in Practice: What It Is
and How Lawyers Can Use It (June 28, 2023) (on file with authors).
180 Luke Taylor, Colombian Judge Says He Used ChatGPT in Ruling, THE GUARDIAN (Feb. 2, 2023, 9:53
PM), https://www.theguardian.com/technology/2023/feb/03/colombia-judge-chatgpt-ruling (discussing
use by judges of ChatGPT in rulings).
181 Ed Cohen, Most Judges Haven’t Tried ChatGPT, and They Aren’t Impressed, THE NAT’L JUD. COLL.
(July 21, 2023), https://www.judges.org/news-and-info/most-judges-havent-tried-chatgpt-and-they-arent-impressed.

[p. 46]
and ask ChatGPT to interpret it, rather than turning to the dictionary or Google; or will
ask the model what’s the likely assumption a contract makes when it leaves a gap; or will
check if the model thinks an insurance policy contemplated deft burglars. They’ll do so both
covertly and overtly, both sua sponte and in response to briefing. Almost certainly the first
briefs to affirmatively argue for the use of the tool will come from resource-constrained
firms. As we illustrated in Part II of this Article, LLMs are already applicable to live problems
that courts face every day, and it would be naïve to think they aren’t using them.
Indeed, we’ve seen this story play out many times before. As some readers will recall,
when courts first realized that Wikipedia could be used as a source of information,182 they
were chastised for its use by higher courts,183 and then it was eventually folded into the normal set of legal research tools.184 But at least in the short run, judges won’t have the tool draft
opinions. And why would they? That courts are irreducibly part of the interpretative enterprise—no matter how sophisticated prediction machines get—follows from the obvious
point that there are two stages to every contract interpretation problem: figuring out what
the parties meant (at contracting), and deciding the “legal significance that should attach to
the semantic content.”185 The LLM method is simply better for many reference purposes
than those currently on offer.
The problem then is not whether courts will use LLMs as an aid to interpretation,
but how. Generative interpretation is a tool and as such, it has strengths, limits, and flaws.
To be sure, AI’s most enthusiastic wielders will be its least careful adopters. Thus, our goal
in Section III.A is to delimit some principles and limitations for LLM usage by lawyers and
judges. With the proper usage of the tool in mind, in Section III.B we suggest that generative
interpretation has implications for the continuing vitality of longstanding debates between
textualism and contextualism. Or to put it differently, while the uses that we suggest in Section III.A could be thought of as Textualism 2.0—better dictionaries and canons—we don’t
think that’s the practical limit of what this method of interpretation can do.
182 Lee F. Peoples, The Citation of Wikipedia in Judicial Opinions, 12 YALE J. L. & TECH. 1, 28 (2010) (“Citations to Wikipedia entries in judicial opinions have been steadily increasing since the first citation appeared
in 2004.”).
183 Campbell ex rel. Campbell v. Sec'y of Health & Hum. Servs., 69 Fed. Cl. 775, 781 (2006) (rejecting special
master's reliance on Wikipedia, among other online sources, citing several “disturbing” disclaimers on the website and that it could be edited by “virtually anyone”); see also Kenneth H. Ryesky, Downside of Citing to
Wikipedia, N.Y. L.J., Jan. 18, 2007, at 2.
184 Jodi L. Wilson, Proceed with Extreme Caution: Citation to Wikipedia in Light of Contributor Demographics and Content Policies, 16 VAND. J. ENT. & TECH. L. 857, 907 (2014) (“The advent of Wikipedia
and other technological advances has changed legal research. It is unrealistic to believe that the legal community can ignore that reality. . . .”).
185 Schwartz & Scott, supra note 38, at 568 n.50; Edwin W. Patterson, The Interpretation and Construction
of Contracts, 64 COLUM. L. REV. 833, 833–35 (1964); Klass, supra note 45.

[p. 47]
A. Interpretation for the 99%?
As we’ve said, in the coming months and years, we’re sure you will read examples of
lawyers and judges using ChatGPT and related tools in perverse, sometimes outright silly
ways, and reaching absurd results you think would have been avoided had they just buckled
down and done their jobs like careful jurists ought to. Or, worse, they’ll have these tools
generate pedestrian prose that looks like soulless briefing or opinion-writing, but in fact is
built on a throne of lies. There’s no question that AI will sometimes be a crutch for lazy or
harried lawyers who simply didn’t focus on the details: It might not be ideally pitched at the
kinds of people who are reading sentences with care 20,000 words into a law review article.
And yet it’s precisely because LLMs are cheap and workmanlike that they will be of
real use in contract interpretation. The biggest single problem with all currently available
approaches to contract interpretation isn’t that they are incapable of getting correct results
some of the time. It’s that they are inaccessible to ordinary parties.186 Non-wealthy individuals who suffer breach have to lump it,187 tilt against corporations in internal dispute resolution systems,188 or face financially ruinous fees and prevail in pyrrhic victories.189 Simply
put: There is an access-to-justice problem at the center of contract law as pernicious as the
better recognized ones in criminal and constitutional adjudication. The costs and uncertainties of interpreting deals, which form the core of contract litigation, materially contribute
to this problem.190
186 See LEGAL SERVS. CORP., THE JUSTICE GAP: MEASURING THE UNMET CIVIL LEGAL NEEDS OF LOW-INCOME AMERICANS 6 (2017) (“86% of the civil legal problems reported by low-income Americans in the past
year received inadequate or no legal help.”); E.H. Geiger, The Price of Progress: Estimating the Funding
Needed to Close the Justice Gap, 28 CARDOZO J. EQUAL RTS. & SOC. JUST. 33, 34–39 (2021) (documenting
an array of causes behind the “justice gap”).
187 Geiger, supra note 170, at 38 (“[T]he average household faces 9.3 legal issues per year. 65% of those problems
are never resolved; potentially because the claimants cannot afford counsel and do not have the legal literacy
to pursue their claims pro se.”).
188 See generally Rory Van Loo, The Corporation as Courthouse, 33 YALE J. REG. 547 (2016) (describing internal dispute resolution systems run by firms).
189 Matthew R. Hamielec, Class Dismissed: Compelling a Look at Jurisprudence Surrounding Class Arbitration and Proposing Solutions to Asymmetric Bargaining Power Between Parties, 92 CHI.-KENT L. REV. 1227,
1231 (2017) (arguing that class action waivers and arbitration provisions can result in “negative value suits”
where low-resource claimants are pitted against wealthier opponents); Gideon Parchomovsky & Alex Stein,
The Relational Contingency of Rights, 98 VA. L. REV. 1313, 1340 (2012) (noting that class actions can transform individual negative value suits into a single positive value action).
190 Ben-Shahar & Strahilevitz, supra note 19, at 1757–58 (discussing interpretation costs); CATHERINE
MITCHELL, INTERPRETATION OF CONTRACTS: CURRENT CONTROVERSIES IN LAW 110 (2007) (noting expenses associated with contextual approaches to interpretation).

[p. 48]
Costly interpretation burdens judges too. Chambers are not endowed with reference experts on call for every query. Courts have fewer resources and competencies than the
layperson would imagine. This stylized fact alone can explain why dictionaries are popular,
and why corpus linguistics is at best experimental; why law office history exists but not law
office econometrics; and perhaps even why federal precedent on state issues is more cited
than the relevant state law, given that the former is thoroughly indexed in common commercial databases and the latter is not.191 To substitute for dictionaries and familiar Latin
canons, new interpretative tools must be free (or nearly so) and widely available. LLMs satisfy those conditions. Already today, interactions through a chat interface do not require
more skill than using a search engine. The deft burglar example offers a proof of concept,
and the remaining examples (though not immediately available in your chatbot window) are
likely months, not years, away.
Generative interpretation is a tool which responds to this access-to-justice concern,
at several levels.
First, if courts commit to the method, the costs of achieving accuracy in contract
interpretation disputes will fall.192 That’s so because the less precise, even if relatively cheap,
forms of textualist evidence—dictionaries and canons—will be replaced by better ones. As
dispute costs fall and outcomes become more predictable, the returns to opportunistic
breach, which generally benefits sophisticated players, will fall.193 It’s true that models may
arise to compete in the market, but as we’ve shown above, more sophisticated models tend
to converge on meaning: unlike dictionaries, they are not offering idiosyncratic and curated
definitions which differ across people, place and time.
Second, as outcomes become more certain, and the cost of predicting them falls,
there will be fewer cases to adjudicate, because parties will likely have a much better sense of
what they’ll get at verdict, and settle accordingly.194 LLMs, unlike legal dictionaries, require
no specialized legal knowledge to access, and their ease of use will likely improve with time.
191 Samuel Issacharoff & Florencia Marotta-Wurgler, The Hollowed Out Common Law, 67 UCLA L. REV.
600 (2020) (documenting the “dominance of the federal forum”).
192 Cf. Schwartz & Scott, Redux, supra note 38, at 930 (noting the primacy of cost in evaluating the correct
interpretative rules).
193 Cf. Eric A. Posner, A Theory of Contract Law Under Conditions of Radical Judicial Error, 94 NW. U. L.
REV. 749, 766–69 (2000) (noting that deterministic legal rules discourage opportunistic breach).
194 Cf. Schwartz & Scott, supra note 38, at 603 (“When a standard governs, the party who wants to behave
strategically must ask what a court will later do if the party is sued. The vaguer the legal standard and the more
that is at stake, the more likely the party is to resolve doubts in its own favor.”). This is a partial equilibrium
analysis—better adjudication processes invite more commercial activity, which in turn increases contracting.

[p. 49]
This implies that there will be a levelling of access to information about law, and a redistribution from more to less repeat players. Further, better calibrated results ex post means that
parties can spend less time (and money) contracting ex ante.195 A promise of generative interpretation—which it may yet fulfill—is that it will open a form of textualism up to the
99%.196
The pages of law reviews are littered with proposed technological solutions to supposed problems of excessive legal costs, and unequal access to information about legal outcomes, which turn out to be either more intractable than the authors thought or ignore virtues that the authors discounted. We should proceed with care, especially when recommending the widespread adoption of a chatbot that sits on matrices whose outputs even its
creators do not fully understand. The question is not (in our view) whether generative interpretation offers predictions that are superior in all cases to artisanal, careful, linguistic
analysis. It’s whether the method is good enough, right now or soon, for resource-deprived
courts to adopt in ordinary cases. In evaluating that question of basic competency, it’s meaningful that even today’s unspecialized models can replicate the results of well-considered
cases (as Part II explored) and prompt courts to consider their own priors.
But Part II offered a curated tour of generative interpretation’s greatest hits. It didn’t
show you where things can go wrong. To make this tool perform as well as it can, users
should be cognizant of these issues and use it according to evolving best practices. Let’s
start with hallucinatory outputs.197 In a now-famous case from May 2023, lawyers in a
New York federal court turned to ChatGPT for help researching a motion. The tool
obliged with helpful cites, but unfortunately had completely made up the opinions in question.198 A sanctions order and plenty of bad press followed.199 In response to the case, other
195 See Spencer Williams, Predictive Contracting, 2019 COLUM. BUS. L. REV. 621 (arguing that parties could
use information about contract outcomes, harnessed through machine learning of large datasets, to change how
they contract ex ante). But for an insightful discussion of how selection effects make machine predictions
about litigation outcomes difficult, see David Freeman Engstrom and Jonah Gelbach, Legal Tech, Civil
Procedure, and the Future of Adversarialism, 169 U. PA. L. REV. 1001, 1065–67 (2021) (discussing obstacles
to prediction).
196 Schwartz & Scott, supra note 38, at 941 (“[T]he more time the court spends on a particular interpretive
issue, the less time it can spend on other issues or other cases”).
197 Sharon D. Nelson, John W. Simek & Michael C. Maschke, Beware of Ethical Perils When Using Generative
AI!, MD. STATE BAR ASS’N (Apr. 19, 2023), https://www.msba.org/beware-of-ethical-perils-when-using-generative-ai/ (“In fact, it can come up with very plausible language that is flatly wrong. It doesn't ‘mean to’ but it
makes things up--and that is what AI researchers call a ‘hallucination’. . . .”).
198 Benjamin Weiser, Here’s What Happens When Your Lawyer Uses ChatGPT, N.Y. TIMES (May 27, 2023),
https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html.
199 See Mata v. Avianca, Inc., __ F. Supp. 3d __, 2023 WL 4114965 (S.D.N.Y. June 22, 2023).

[p. 50]
judges have required lawyers to certify that they had not used any form of Artificial Intelligence in their filings.200
False outputs arise from the predictive nature of generative models.201 Hallucinations are generated texts asserting facts that are not quite true.202 Large language models,
remember, are statistical tools optimized to make predictions. But LLMs are not like a helpful librarian that simply pulls out the most relevant book on a topic. Facts are stored in the
LLM in much the same way as other reasoning and statistical facts are stored, as floating-point numbers in
a labyrinthine array of vectors. When asked to provide a source on a legal matter, the model
employs the same method to elicit both facts and inferences. The output doesn’t distinguish
facts from inferred facts, and sometimes will predict the world incorrectly.
Recent work has made significant advances in understanding and mitigating hallucination errors, and more powerful models are less susceptible.203 One solution that is already used in some contexts is connecting the model to a database of facts, so that it can act
200 Devin Coldewey, No ChatGPT in my court: Judge orders all AI-generated content must be declared and
checked, TECHCRUNCH (May 30, 2023, 7:32 PM), https://techcrunch.com/2023/05/30/no-chatgpt-in-my-court-judge-orders-all-ai-generated-content-must-be-declared-and-checked/ (explaining the order, which
states that “no portion of the filing was drafted by generative artificial intelligence (such as ChatGPT, Harvey.AI, or Google Bard) or that any language drafted by generative artificial intelligence was checked for accuracy, using print reporters or traditional legal databases, by a human being”).
201 Benj Edwards, Why ChatGPT and Bing Chat are so good at making things up, ARS TECHNICA (Apr. 6,
2023, 11:58 AM), https://arstechnica.com/information-technology/2023/04/why-ai-chatbots-are-the-ultimate-bs-machines-and-how-people-hope-to-fix-them (“[T]he model is fed a large body of text . . . and repeatedly tries to predict the next word in every sequence of words. If the model’s prediction is close to the actual
next word, the neural network updates its parameters to reinforce the patterns that led to that prediction.”);
waka55 (u/wakka55), REDDIT (Apr. 16, 2023, 2:48 PM), https://www.reddit.com/r/OpenAI/comments/12okltx/openais_whisper_api_sometimes_returns_what_looks/ (showing that this problem is not
limited to textual generation).
202 Beren Millidge, LLMs confabulate not hallucinate, BEREN’S BLOG (Mar. 19, 2023),
https://www.beren.io/2023-03-19-LLMs-confabulate-not-hallucinate/ (describing problem).
203 See, e.g., Matt L. Sampson & Peter Melchior, Spotting Hallucinations in Inverse Problems with Data Driven
Priors, ARXIV: 2306.13272 (June 23, 2023), https://arxiv.org/pdf/2306.13272.pdf (arguing that hallucinations can be qualitatively differentiated from fact-based inferences by focusing on activation regions); see also
Philip Feldman, James R. Foulds, & Shimei Pan, Trapping LLM Hallucinations Using Tagged Context
Prompts, ARXIV: 2306.06085 (June 9, 2023), https://arxiv.org/abs/2306.06085; see also Ayush Agrawal,
Lester Mackey, & Adam Tauman Kalai, Do Language Models Know When They’re Hallucinating References?, ARXIV: 2305.18248 (May 29, 2023), https://arxiv.org/abs/2305.18248; see also Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, & Noah D. Goodman, Certified Reasoning with Language Models, ARXIV:
2306.04031 (June 6, 2023), https://arxiv.org/pdf/2306.04031.pdf.

[p. 51]
more like the helpful librarian.204 Another involves reflective self-evaluation.205 So while it is
appropriate to pay attention to the hallucination problem, we tend to think that this problem will be less salient in the future than it is today. That said, as a best practice, judges would
do well to cross-verify the answers that they get from one platform against another, just as
in the early days of legal research it would pay to check both Lexis and Westlaw to make sure
that your research was complete.206
Second, models are subject to manipulation. Large language models are malleable;
“leading prompts” can lead them to different conclusions. This is roughly analogous to leading questions for witnesses or jury instructions that frame disputes for or against a particular
outcome. As anyone who has experience with an LLM chatbot will attest, it is relatively easy
to drive conversations toward desired outcomes. In litigation practice, we should expect that
the parties themselves will submit competing prompts, just as they vie to control the framing
of the legal questions in litigation today. In response, factfinders can (as we illustrated above)
ask the model to itself produce competing prompts, and then, rather than relying on a single
query, the factfinder can look at the general trend of responses and share those varying outcomes in their decisions. Factfinders will also have to decide whether to defer to the parties’
choice of model, should they make that explicit in their contract.
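The idea of looking at the general trend of responses, rather than a single query, can be pictured with the following minimal sketch. It assumes a hypothetical `ask_model` callable standing in for whatever model interface a factfinder actually uses (it is not a real library API), and the prompts and canned answers are invented purely for illustration.

```python
from collections import Counter
from typing import Callable, Iterable


def response_trend(ask_model: Callable[[str], str], prompts: Iterable[str]) -> dict[str, float]:
    """Ask every prompt once and return the share of each normalized answer."""
    answers = [ask_model(p).strip().lower() for p in prompts]
    if not answers:
        raise ValueError("at least one prompt is required")
    counts = Counter(answers)
    return {answer: n / len(answers) for answer, n in counts.items()}


# Hypothetical stand-in for a real model: canned answers keyed by prompt.
# A real study would query the disclosed model with each framing.
CANNED = {
    "Neutral: does 'flood' include water from a failed levee?": "Yes",
    "Insurer framing: is levee failure a 'flood'?": "Yes",
    "Policyholder framing: is levee failure a 'flood'?": "No",
    "Model-drafted variant: would ordinary readers call this a flood?": "yes ",
}


def fake_model(prompt: str) -> str:
    return CANNED[prompt]


trend = response_trend(fake_model, CANNED.keys())
print(trend)
```

Reporting the full distribution, and not only the most common answer, is what lets a decision share the varying outcomes that different framings produce.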
The Katrina analysis raises the related problem of model interpretability.207 The way
models encode language is not based in semantics. Unlike human-based reasoning, models
have a precise sense in which “chocolate” is closer to “bread” than to “nutrition”. This precision can be misleading if interpreted naively. The Katrina example illustrates how distances correspond with a sensible account of meaning. It also shows that the policy exceptions were closer to ‘fire’ than to the arbitrarily chosen word ‘police.’ It is difficult to understand why, precisely, this result followed. Possibly, fire is a category of disaster and in this
sense it is closer to the insurance policy. Still, it would be misleading to say that the policy
excludes fire damage rather than damage caused by the police. Other terms may lead to more
counterintuitive results. This interpretability gap should caution care in the direct translation of model outputs to legal judgments. Yet, it is also the case that, on average, these models
204 See generally James Briggs & Francisco Ingham, Fixing Hallucination with Knowledge Bases, PINECONE,
https://archive.pinecone.io/learn/langchain-retrieval-augmentation/.
205 Charlie George and Andreas Stuhlmüller, Factored Verification: Detecting and Reducing Hallucination
in Summaries of Academic Papers, arXiv:2310.10627 (2023).
206 See generally Robert J. Munro, J. A. Bolanos & Jon May, LEXIS vs. WESTLAW: An Analysis of Automated
Education, 71 LAW LIBR. J. 471 (1978) (evaluating platforms against each other).
207 See supra notes 1-25.

[p. 52]
predict with great accuracy linguistic distinctions that humans make.208 This presents a general tension in language models. They are generally extremely good at capturing meaning,
but they still make errors and it is not always possible to rationalize or foresee these errors.
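To make concrete what it means for one word to be "closer" to another, here is a minimal sketch of a distance computation. Two things are assumptions on our part: cosine similarity is used because it is a common choice of measure, and the three-dimensional vectors are made up for illustration (real embeddings come from a model and have hundreds or thousands of dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT output from any real embedding model.
TOY_EMBEDDINGS = {
    "chocolate": [0.9, 0.8, 0.1],
    "bread": [0.8, 0.9, 0.2],
    "nutrition": [0.2, 0.3, 0.9],
}

sim_bread = cosine_similarity(TOY_EMBEDDINGS["chocolate"], TOY_EMBEDDINGS["bread"])
sim_nutrition = cosine_similarity(TOY_EMBEDDINGS["chocolate"], TOY_EMBEDDINGS["nutrition"])

print(f"chocolate vs. bread:     {sim_bread:.3f}")
print(f"chocolate vs. nutrition: {sim_nutrition:.3f}")
```

The output is a precise number, but nothing in the arithmetic says why two terms ended up near each other, which is the interpretability gap described above.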
A third consideration focuses on the models’ strength: They are naturally inclined
to make predictions that maximize probability—in other words, they are biased towards
majoritarian interpretations. Models offer an approximation of general understanding that
may simply not be available in any other way, and thus advance long-held goals of contract
theory.209 But majoritarian interpretations are just that: they embed and advance the values
of the majority. This is doubly problematic. First, courts really ought to be attentive to local,
more private, meanings: Public meaning is second best, prioritized because it is efficient and
not because it is correct.210 But more generally, because the linguistic conventions of underrepresented communities are submerged by majoritarian public meanings, they will find
it more difficult to have their voices surfaced (and thus subsidized) in contract adjudication.
Majoritarian interpretative approaches risk silencing entire communities.211
Surely, this is not a problem unique to generative interpretation: dictionaries, canons, and corpora are equally, if not more, vulnerable to the charge.212 And unlike dictionary-and-canon textualism, it is at least theoretically possible to counter the majoritarian bent of
models in several ways. Models trained on curated datasets that reflect the linguistic conventions of distinct communities would bend towards the majoritarian patterns within those
communities. Adjustments to the model’s hyperparameters elicit more-or-less majoritarian
behaviors from the model. And careful prompt engineering can attune the model to specific
208 For a discussion of the evaluation metrics, see Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and
Nils Reimers, MTEB: Massive Text Embedding Benchmark
209 Schwartz & Scott, supra note 38, at 583–84.
210 For the foundational work distinguishing local from popular interpretative modes, see 2 SAMUEL WILLISTON, THE LAW OF CONTRACTS, § 604, 1162 (1920). Even textualists understand that strict adherence to the
public meaning of words, bereft of any commercial understanding of what the parties could have meant, will
sometimes lead courts astray. See generally Stephen J. Choi, Mitu Gulati & Robert E. Scott, The Black Hole
Problem in Commercial Boilerplate, 67 DUKE L.J. 1, 2 (2017) (describing pari passu clauses as “a standard
provision in sovereign debt contracts that almost no one seems to understand”).
211 See, e.g., Marjorie Florestal, Is a Burrito a Sandwich, 14 MICH. J. RACE & L. 1, 36–39 (2008) (discussing role
of race and class in an interpretation dispute); Alexandra Buckingham, Note, Considering Cultural Communities in Contract Interpretation, 9 DREXEL L. REV. 129 (2016) (arguing for the use of cultural meaning in
interpretation); see also supra note 157.
212 Steven J. Burton, A Lesson on Some Limits of Economic Analysis: Schwartz and Scott on Contract Interpretation, 88 IND. L.J. 339, 350 (2013) (arguing that majoritarian readings can privilege certain views).

[p. 53]
contexts.213 This is an active area of research and regulatory scrutiny and should check factfinders.214
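As one illustration of how a hyperparameter can make a model more or less majoritarian, consider sampling temperature. The text above does not name a specific hyperparameter, so temperature is our assumption, and the scores below are invented. The sketch shows only the arithmetic: a lower temperature concentrates probability on the most likely reading, while a higher one spreads it toward minority readings.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate readings of a contract term.
TOY_LOGITS = [2.0, 1.0, 0.5]

sharp = softmax_with_temperature(TOY_LOGITS, 0.5)  # more majoritarian
flat = softmax_with_temperature(TOY_LOGITS, 2.0)   # more weight on minority readings

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
```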
Fourth, models may be subject to parties’ adversarial attacks and prompt injections, or prove otherwise fragile in unexpected ways.215 By way of illustration, modern AI
systems can reliably differentiate between pictures of panda bears and horses, or stop signs
and yield signs. But if a sophisticated party can imperceptibly change the color of a pixel here
and there, that will be enough to make the model erroneously see a horse or a yield sign.216
The same manipulations can be used to “attack” LLM models.217 Slight changes in the wording of a contract—e.g., subtle changes in the presentation of the words—might hack the
model’s logic and alter its interpretation.218 There is no known general solution to such
issues. But if judges and parties become aware of the possibility of such subtle manipulations,
they might develop defenses, like using sanitized versions of the contract in their analyses.219
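A sanitized version of a contract could be produced along the following lines. This is a minimal sketch under stated assumptions, not a complete defense: the function names are hypothetical, and the approach (Unicode NFKC normalization plus removal of invisible format and control characters) addresses only some presentation-layer tricks. Look-alike letters from other alphabets survive normalization, so the sketch flags them for human review instead of silently changing them.

```python
import unicodedata


def sanitize_contract_text(text: str) -> str:
    """Normalize compatibility forms and drop invisible format/control characters."""
    normalized = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in normalized:
        category = unicodedata.category(ch)
        if category == "Cf":  # invisible format characters, e.g. zero-width space
            continue
        if category == "Cc" and ch not in "\n\t":  # other control characters
            continue
        kept.append(ch)
    return "".join(kept)


def flag_non_ascii(text: str) -> list[tuple[int, str, str]]:
    """List remaining non-ASCII characters (position, character, Unicode name)."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Invented example: a zero-width space inside "please" and a fullwidth "a" in "days".
tampered = "ple\u200base pay within 30 d\uff41ys"
clean = sanitize_contract_text(tampered)
print(clean)

# A Cyrillic letter that looks like Latin "e" is reported, not rewritten.
print(flag_non_ascii(sanitize_contract_text("pl\u0435ase")))
```

Flagging keeps a human in the loop for legitimate non-English text, which an automatic rewrite would corrupt.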
Fifth, models are sensitive to time. As your neighborhood originalist will tell you,
the meaning of words is embedded in the time they were used. If we want to interpret the
meaning of a contract signed in 1924, we should account for the linguistic conventions of
213 For an illustration of this use case, see Arbel & Becher, supra note 15, at 99–104.
214 Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonized Rules
on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislation Acts, at 4,
COM (2021) 206 final (Apr. 21, 2021) (stating that a goal of the proposal is to “minimise the risk of algorithmic discrimination, in particular in relation to the design and the quality of data sets used for the development
of AI systems.”).
215 For an expanded discussion, see Arbel & Becher, supra note 15.
216 Agnieszka M. Zbrzezny & Andrzej E. Grzybowski, Deceptive Tricks in Artificial Intelligence: Adversarial
Attacks in Ophthalmology, 12(9) J. CLIN. MED. 3266 (2023) (“Suppose we consider even minor perturbations
to the image, such as the change in colour of just one pixel. Then, such models are uncertain for small perturbations.”).
217 For a formal exploration, see Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong
Wang, Linyi Yang, Wei Ye, Haojun Huang, Xiubo Geng, Binxing Jiao, Yue Zhang & Xing Xie, On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective, ARXIV: 2302.12095 (Mar. 29,
2023), https://arxiv.org/pdf/2302.12095.pdf.
218 From the model’s perspective, “please” and “please” are not the same word. For an accessible exploration,
see Computerphile, Glitch Tokens-Computerphile, YOUTUBE (Mar. 7, 2023),
https://www.youtube.com/watch?v=WO2X3oZEJOA. Various other examples are esoteric: certain models
act unexpectedly when presented with specific nonsensical words like “SolidGoldMagikarp.” See FORBIDDEN
TOKENS PROMPTING RESULTS, https://docs.google.com/spreadsheets/d/1PAZNCks11qoUpiojTJpj0odCYQL2_HGQgam8HSwAopQ/edit#gid=0 (last visited July 20, 2023). But in high stakes settings, such vulnerabilities can be exploited.
219 Courts could require, for example, that texts will be presented in plain text format. This would limit some
forms of attacks—especially those that are embedded in the graphical layer of the document. But the bitter
lesson from cybersecurity is that security is a process, not a product. For illustration, see Riley Goodside,
https://twitter.com/goodside/status/1713000581587976372.

[p. 54]
the time. Models today are trained on data indiscriminately: It is unlikely that they will be
able to interpret a term as it was read in a specific period in time. The problem is compounded since the training data may include information that was not available to the contracting parties at the time of contracting. This may well include the decision of a trial court
when the appellate court seeks to interpret the contract. We can think about this as pollution of the database: For example, perhaps Hurricane Katrina associated “levee” with “flood”
more closely than it was at the time the relevant insurance contracts were signed.220 Or perhaps the Stewart example was confounded by the subsequent decades of linguistic evidence
of payment defaults.
This problem is longstanding. Judges’ innate sense of language is also grounded in
the linguistic conventions in which they are personally embedded. Dictionaries and corpus
linguistics have an advantage here, because one could seek a dictionary or a corpus from the
relevant time period. But even this advantage is limited, because dictionaries are updated in
intervals of decades,221 and corpora cover considerably fewer texts when they are sliced to
relevant time periods.222 Thus, courts will have to consider whether the use of language has
shifted over time, and perhaps restrain the use of generative interpretation in cases where its
training data suffers from linguistic drift. Another way to put this is that generative interpretation is likely to be least useful for old contracts, where worries about subsequent judicial
opinions interpreting like terms are most severe, unless and until specialized models with
time-delineated training data come online.
Sixth, generative interpretation will need a language of its own. Although scholars
often hype objective, scientific methods of proof and judgment, this way of explaining and
justifying the exercise of power is uncompelling, and perhaps repulsive, to the population at
large.223 (Which is one reason we’ve tried to tamp down the statistics and claims to singular
answers in this paper.) Juries, after all, aren’t presented with simple probabilistic proofs, and
judges don’t typically justify their decisions by saying they have a 51% chance of being
220 A more far-fetched problem is parties trying to inject meaning into the record, just as they would in a normal
interpretation dispute by way of after-action lawyer letters and the like. But because parties expect performance, not breach, and the relevant corpora for LLMs are so vast, jurists should worry less about this problem
than the internal-to-the-text adversarial attacks we describe above.
221 See HISTORY OF THE OED, https://www.oed.com/information/about-the-oed/history-of-the-oed/?tl=true (last visited July 20, 2023); see MERRIAM-WEBSTER ABOUT US ONGOING COMMITMENT,
https://www.merriam-webster.com/about-us/ongoing-commitment (last visited July 20, 2023).
222 Mouritsen, supra note 20, at 1378 (“One of the challenges for examining usage in context in a corpus is that
the greater the specificity of the search, the fewer examples appear in the corpus.”).
223 David A. Hoffman & Michael P. O’Shea, Can Law and Economics Be Both Practical and Principled?, 53
ALA. L. REV. 335, 339 (2002) (“Most intriguingly, the studies suggest that in certain cases people prefer that
legal decisions not be made on an economic basis.”).

[p. 55]
right.224 Thus, a real problem for the method—which it shares with corpus linguistics and
the survey methodologies discussed in Part I—is how to explain itself to lay audiences in
ways that reinforce, rather than diminish, judicial legitimacy.225 It’s sociologically normal to
say that the word chicken takes meaning from the dictionary and trade usage.226 This sociological framework does not yet exist for black box language models.227 Courts will have to
find ways to wrap the results from automated interpretation in packages that help laypeople
to see law as engaging in a values-driven, communal, constrained exercise, and not merely
the highest probability next-token predictions.228
The solution likely lies in a specific type of transparency. Just as judges are
sociologically committed to certain types of dictionaries, so too will certain
models emerge as robust and trustworthy. The current practice of interpretation is
largely indefensible on this score; because we have no window into the court’s processes, we
cannot see the dictionaries it did not select or the words it chose not to focus on. But we can
know what model a court picks, and from that selection, what probabilities it assessed. We
cannot know exactly how the model produced those outcomes, as this knowledge lies in its
vast inscrutable matrices. But so long as a judge not only discloses the version of the model
that she employed, but also the particular prompts that she used, generative interpretation
224 As Nesson famously argued, the fact-finding system (and juries) exists to achieve legitimacy, not just accuracy. Charles Nesson, The Evidence or the Event?: On Judicial Proof and the Acceptability of Verdicts, 98
HARV. L. REV. 1357, 1358 (1985).
225 Cf. Benjamin Minhao Chen, Alexander Stremitzer & Kevin Tobia, Having Your Day in Robot Court, 36
HARV. J. L. & TECH. 1 (2022) (presenting experimental evidence that subjects are not biased against algorithmic decisionmakers).
226 See Frigaliment Importing Co. v. B.N.S. Int’l Sales Corp., 190 F. Supp. 116 (S.D.N.Y. 1960) (adopting the
broader meaning of the word after contextual inquiry).
227 Hasala Ariyaratne, The Impact of Chatgpt on Cybercrime and Why Existing Criminal Laws Are Adequate,
60 AM. CRIM. L. REV. ONLINE 1, 7 (2023) (“Since ChatGPT uses complex deep learning algorithms, it is often
a black box with no clear reason why it provided a certain output.”); David S. Rubenstein, Acquiring Ethical
AI, 73 FLA. L. REV. 747, 766 (2021) (“[D]eep learning neural networks drive some of the most powerful, sophisticated, and functional AI systems, but their complexity renders them inscrutable to humans.”); Nelson,
Simek & Maschke, supra note 197, at 30 (“AI is largely a ‘black box’––you cannot see inside the box to see how
it works.”).
228 Related to this rhetorical concern is one about attribution and basic fairness that citizens may have about
use of LLMs. See, e.g., Sheera Frenkel & Stuart A. Thompson, ‘Not for Machines to Harvest’: Data Revolts
Break Out Against A.I., N.Y. TIMES (July 15, 2023), https://www.nytimes.com/2023/07/15/technology/artificial-intelligence-models-chat-data.html; Mark A. Lemley & Bryan Casey, Fair Learning, 99 TEX. L. REV.
743, 748 (2021) (“In this Article, we argue that ML systems should generally be able to use databases for training, whether or not the contents of that database are copyrighted.”); see also Peter Henderson, Xuechen Li,
Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley & Percy Liang, Foundation Models and Fair Use,
ARXIV: 2303.15715 (Mar. 28, 2023), https://arxiv.org/pdf/2303.15715.pdf.

[p. 56]
is more replicable than any other method on offer.229 (We have tried to show how that would
work in the notes of this article.) Indeed, courts might go further: They can capsule the results of their inquiries and incorporate them as permanent links to their opinions.
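What such a disclosure might contain can be sketched as a small, machine-readable record. Everything here is an assumption for illustration: the class and field names are hypothetical, the sample values are invented, and the SHA-256 digest is simply one way to let a reader confirm that a linked record has not changed since the opinion issued.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InterpretationDisclosure:
    """Hypothetical record of one generative-interpretation inquiry."""
    model_name: str
    model_version: str
    hyperparameters: dict
    prompts: list
    responses: list

    def to_json(self) -> str:
        # Sorted keys give a stable serialization, so equal records hash equally.
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)

    def digest(self) -> str:
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


# Invented sample values; no real model or opinion is being described.
record = InterpretationDisclosure(
    model_name="example-model",
    model_version="2023-10-01",
    hyperparameters={"temperature": 0.0},
    prompts=["Does 'flood' include water from a failed levee?"],
    responses=["Yes"],
)
same_record = InterpretationDisclosure(
    model_name="example-model",
    model_version="2023-10-01",
    hyperparameters={"temperature": 0.0},
    prompts=["Does 'flood' include water from a failed levee?"],
    responses=["Yes"],
)

print(record.to_json())
print(record.digest())
```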
In summary, generative interpretation promises an accessible, relatively predictable
tool that will help lawyers and judges interpret contracts. If it’s to achieve that promise,
courts will need to be careful to use this tool while being mindful of its uses and limitations.
To guide what would inevitably be a process of exploration, we offered a series of best practices based on the technical foundations and legal constraints that define the limits of this
tool. As a default, judges should disclose the models and prompts they use and try to validate
their analyses on different models and with multiple inputs. Ideally, they’d capsule their
findings online. They’ll want to be careful about parties’ manipulative behavior, and to consider how (and whether) to excavate private, non-majority meanings. By doing so—and by
saying what they are doing clearly and with appropriate recognition of LLMs’ foibles—
courts can fairly experiment with this new technology and achieve a better grasp on the contract’s meaning, without abusing the tool or subjecting themselves to reversals.
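The disclosure and cross-validation practices summarized above can be sketched in code. What follows is a minimal illustration, not the method used in this article: the record layout, the field names, and the idea of fingerprinting a disclosed run with SHA-256 are all assumptions made for the example. It shows one way a court could record the model, hyperparameters, and prompt behind each query, produce a checkable fingerprint for a capsule, and report how often different models and prompts agree.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelRun:
    """One disclosed query. Every field name here is a hypothetical choice."""
    model: str             # model identifier as reported by the provider
    hyperparameters: dict  # e.g. {"temperature": 0.0}
    prompt: str            # exact prompt text given to the model
    answer: str            # the model's reading, reduced to a short label


def fingerprint(run: ModelRun) -> str:
    """SHA-256 of a canonical JSON encoding, so a capsule can be checked later."""
    canonical = json.dumps(asdict(run), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def agreement(runs: list[ModelRun]) -> tuple[str, float]:
    """Return the most common answer and the share of runs that gave it."""
    counts = Counter(run.answer for run in runs)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(runs)


# Placeholder runs, invented for illustration only.
runs = [
    ModelRun("model-a", {"temperature": 0.0}, "Does the term cover X?", "yes"),
    ModelRun("model-b", {"temperature": 0.0}, "Does the term cover X?", "yes"),
    ModelRun("model-a", {"temperature": 0.0}, "Would the term reach X?", "no"),
]
top_answer, share = agreement(runs)
```

Sorting the JSON keys keeps the fingerprint stable across machines, so anyone holding the disclosed record can recompute it. A low agreement share would itself be worth reporting, since it suggests the answer depends on the choice of model or phrasing.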
B. Beyond the Textualist/Contextualist Divide
As we described in Part I, the modern debate about interpretation takes as a given
that prediction is the goal. But in dividing over how best to accomplish prediction, scholars
and courts disagree about an empirical meta-question: How would most parties prefer that
courts interpret their deals?230 Many have argued that sophisticated parties prefer textualism.231 Others assert that contextualism is preferred, especially within longer-term relational
contexts.232 Some argue such preferences are, well, contextual.233 Litigated cases appear to be
all over the map.234 The views of poorer parties are more rarely studied. True, contextualism
promises to protect parties from bait-and-switch maneuvers and opportunistic drafting. But
who can afford it?
229 The model disclosure should include the model’s hyperparameters, much like judges share the version of
the dictionary they consulted.
230 Bayern, supra note 98, at 1101.
231 See, e.g., Schwartz & Scott, supra note 38, at 941 (“[P]arties prefer textualist interpretive defaults.”).
232 See Lisa Bernstein, Merchant Law in a Merchant Court: Rethinking the Code's Search for Immanent Business Norms, 144 U. PA. L. REV. 1765, 1769–70 (1996) (business arbitrators avoid business norms); Benoliel,
supra note 56 (sophisticated parties prefer textualism). For a survey of the scholarly literature, see Silverstein,
supra note 99, at 278–81; see also U.C.C. § 2-202(a) (AM. L. INST. & UNIF. L. COMM’N 1951) (usage of trade).
233 See Adam B. Badawi, Interpretive Preferences and the Limits of the New Formalism, 6 BERKELEY BUS.
L.J. 1, 1 (2009).
234 Silverstein, supra note 57, at 259 (noting courts’ mixed approaches in litigated cases).

[p. 57]
Generative interpretation challenges the utility of this old binary. Starting with textualism, its proponents have said that it builds a common commercial vocabulary and motivates clear contract drafting.235 But if applied correctly, generative interpretation (as a form
of textualism) can predict parties’ intent well even without invocation of specialized language or expensive drafting. And if courts follow our proposed best practices, this method is
also predictable ex ante. When parties can anticipate in advance the choice of model—and
we argue that they should be able to contract for it explicitly—then they can clarify disputes
well ahead of litigation. Even if the judge consults a broader evidentiary base than the contract itself, models can incorporate it and produce consistent outputs.
By contrast, contextualism promises accuracy by integrating all relevant evidence.
Its champions think it protects the weak from the powerful and reflects the real premises of
relational contracting relationships.236 But as a judicial practice, it encourages gamesmanship,237 exposes decisionmakers to bias-inducing testimonies, increases uncertainty,238 and
more than anything, is simply very expensive. Generative interpretation can also serve as a
form of contextualism. It is cheaper to incorporate context into the process when the model
can feed on dozens of pages of evidence. Unlike human decisionmakers, models are not
prejudiced by parcels of evidence. And armed with LLMs, judges can assess, at the summary judgment stage, the incremental probative value of proposed elements of evidence. As we demonstrated with respect to Stewart, the judge can weigh in advance whether litigation over, say,
the records of a phone conversation would be materially important to the outcome. This
kind of prioritization is generally the approach of the Uniform Commercial Code.239 Courts
might be more comfortable adopting the UCC’s generally contextual approach outside the
law of sales were they to believe that each type of evidence could be (in fact) separately evaluated and weighed.
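To make the idea of incremental probative value concrete, here is a minimal sketch under stated assumptions. It is not the article's implementation: it assumes the judge already has, from some model, a probability for a given reading without a piece of evidence and a probability with it, and it measures the difference on a log-odds scale. The function names and the input numbers are hypothetical placeholders.

```python
import math


def log_odds(p: float) -> float:
    """Convert a probability strictly between 0 and 1 into log-odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def incremental_value(p_without: float, p_with: float) -> float:
    """Shift in log-odds for a reading once one item of evidence is added."""
    return log_odds(p_with) - log_odds(p_without)


def rank_evidence(
    p_baseline: float, p_with_item: dict[str, float]
) -> list[tuple[str, float]]:
    """Order evidence items by the absolute size of the shift they produce."""
    shifts = {
        name: incremental_value(p_baseline, p) for name, p in p_with_item.items()
    }
    return sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Placeholder inputs, invented for illustration only (not drawn from any case).
ranking = rank_evidence(0.50, {"phone_record": 0.55, "affidavit": 0.90})
```

An item whose shift is near zero would change the estimate little however the dispute over it came out, which is the kind of advance screening described above. Log-odds are used here only because they treat a move from 0.50 to 0.55 differently from a move from 0.90 to 0.95. A plain difference in probabilities would be a simpler alternative.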
235 Gilson, Sabel & Scott, supra note 54, at 40–41.
236 See supra notes 89–98 and accompanying text (discussing contextualism).
237 Gilson, Sabel & Scott, supra note 54, at 41 (“Under a contextualist theory, a party for whom a deal has
turned out badly has an incentive to claim that the parties meant their contract to have a different meaning
than the obvious or standard one. Such a party can often find in the parties’ negotiations, in their past practices,
and in trade customs, enough evidence . . . force a settlement . . .”).
238 Schwartz & Scott, supra note 38, at 587; Schwartz & Scott, Redux, supra note 38, at 944–47 (arguing that
certain parties prefer textualist defaults in part because of the risk of error).
239 U.C.C. § 1-303(e) (AM. L. INST. & UNIF. L. COMM'N 1977) (order of hierarchy). The hierarchy doesn’t
always control. See, e.g., Air Prod. & Chemicals, Inc. v. Roberts Oxygen Co., No. CIV.A. 10C12243 FSS, 2011
WL 7063681, at *3 (Del. Super. Ct. Nov. 30, 2011) (“Much like Delaware law, Pennsylvania law prefers the
contract's express terms. But, Air Products’s course of dealing and course of performance allegations might
illuminate the contract and bring its terms into sharper relief.”).

[p. 58]
All of this suggests a disruption of the traditional impasse. Generative interpretation
allows both predictability and restraint, while also offering better linguistic accuracy. And it
corrals litigation costs.240 Or to put it differently, the choice between four corners or no corners at all is a product of its time and of a specific adjudicatory technology. As this technology improves, judges can relax old safeguards towards a more inclusive approach.
To be sure, generative interpretation would be a simple flip in the default: Parties
could indicate that their meaning was not to be determined by large language models, just as
they can now commit to avoiding certain dictionaries or choosing others.241 Just as using a
dictionary to interpret a secret cipher is a foolish way to interpret a deal,242 following parties’
expressed interpretative preferences is wise. Generally speaking, giving parties the ability to
control how contracts are interpreted respects their autonomy and carries efficiency benefits.243 So too here: Generative interpretation expands the kinds of evidence that most parties would like courts to consider, but it won’t be for everyone.
Even if all generative interpretation does is flip the default on extrinsic evidence surrounding contracting, it still has important distributive effects. Textualism’s many virtues
can be recast as its elitist faults. Poorer parties, or uncounseled ones, often misunderstand
the relationship between contractual disclaimers of reliance and oral sales talk.244 Though
the Restatement of Consumer Contracts suggests that courts should be more open to the
idea that contracts that disclaim obligation in the face of contrary promises should not be
enforced,245 it does little to help with interpretative disputes which are less obviously unjust.
And yet there are many examples of parties’ proffered meaning being excluded as violative
of the parol evidence rule,246 or simply not considered because the meaning is purportedly
plain.247 As the Ellington example above demonstrates, even on their own terms such decisions may be questioned.248 But if the parties have not otherwise indicated, generative
240 Schwartz & Scott, Redux, supra note 38, at 946 (suggesting that controlling litigation costs is one reason
that sophisticated parties prefer to avoid extrinsic evidence).
241 5 MARGARET N. KNIFFIN, CORBIN ON CONTRACTS: INTERPRETATION OF CONTRACTS § 24.9 (Joseph
M. Perillo ed., rev. ed. 1998) (courts should enforce private meanings, “however we may marvel at the caprice”);
see, e.g., Smith v. Wilson (1832) 3 B. & Ad. 728, 728 (holding that “parol evidence was admissible to sh[ow]
that . . . the word thousand, as applied to [the contract], denoted twelve hundred”).
242 KNIFFIN, supra note 241, § 24.13 (courts should and do enforce the parties’ vernacular).
243 Schwartz & Scott, supra note 38, at 569.
244 Lawrence M. Solan, The Written Contract as Safe Harbor for Dishonest Conduct, 77 CHI.-KENT L. REV.
87, 92 (2001) (identifying ways in which integrated agreements promote injustice).
245 Restatement of Consumer Contracts § 6 (2023).
246 Gold Kist, Inc. v. Carr, 886 S.W.2d 425, 430 (Tex. App. 1994), writ denied (Mar. 23, 1995).
247 Greenfield v. Phillies Recs., Inc., 98 N.Y.2d 562, 570 (2002).
248 See supra notes 158–165 and accompanying text.

[p. 59]
interpretation will provide more evidence to courts that extrinsic meaning ought to matter
in discerning what the parties contemporaneously would have said they meant.
A case in which generative interpretation could have helped is Smith v. Citicorp.249 The Smiths needed to borrow money to repay an old loan and pay for some home
improvements. They turned to Citicorp, which purported to create a revolving loan agreement, secured by their home. The key to the dispute was that the interest rate on this loan
was 13.99% APR, a rate only permissible for revolving loans, not closed ones. The Smiths
argued that closed is exactly what the loan agreement was. Miraculously, the Smiths had
signed affidavits from two Citicorp employees, attesting that Citicorp never intended to
make advances on this loan (which would have defined an open-ended, revolving loan). But
the Supreme Court of Alabama ignored that highly probative and rare evidence because it
lay outside the four corners of the contract.
We think this result gives too much weight to generalized worries about courts’
competency to evaluate extrinsic evidence. It would be a trivial task to incorporate the affidavits into the generative analysis, and, as we’ve shown, they can be weighted according to
the judge’s priors. This would not resolve questions of credibility and relevance, but the flexibility of incorporating it at the margins might radically improve the accuracy of the court’s
analysis.
Because generative interpretation blurs the line between textualism and methods of
interpretation that are more capacious in their evidentiary sources, and because it enables a
new set of evaluative metrics and socio-legal advantages, we think that it ultimately won’t be
(just) Textualism 2.0. Rather, it will become a distinctive method of evaluating contractual
meaning, marked by its own jargon, normative commitments, and practitioner community.
That new methodology will take time to develop. As we said, in the early days, judges will
dip in and out of the application, using it as one would a dictionary, or a refresher CLE on
the canons of contract interpretation. Only when lawyers start to argue that the tool can
provide better answers to interpretative questions will courts ask if that is true, and whether
answers from ChatGPT should supplant those from Merriam’s or Black’s dictionaries.
CONCLUSION
In this Article we introduced generative interpretation, a method of interpreting
legal texts using large language models. Our work follows a rapidly evolving practice: Lawyers
249 Smith v. Citicorp Pers.-to-Pers. Fin. Centers, Inc., 477 So.2d 308, 311 (Ala. 1985).

[p. 60]
and judges are already experimenting with these models in law offices and chambers across
the country, some covertly, others less so. We offered a deep dive into the way the technology
works (and fails) and explored techniques of using it to better perform interpretative tasks.
We demonstrated that the technique can be applied to famous contracts cases, often arriving
at the same answers at lower cost and with greater certainty, and while sometimes exposing
ambiguities, dislodging sticky priors about meaning, and parceling out the marginal effect of
new evidence on interpretation.
In our view, generative interpretation is a tool with important implications for legal
practice and contract theory. Because language models are attentive to context, and because
they can voraciously digest long texts, they offer a much more robust form of textualism.
The models’ complex encoding of language far outstrips that of any dictionary, and extensive
training data give them a superior sensitivity to actual usage. All of that promises a considerably better way to predict meaning, but it won’t replace judges. Attempting to do so would
ignore the models’ real limitations, which include their opacity, hallucinatory nature, latent
biases, and susceptibility to adversarial attacks by sophisticated parties.
Keeping these limitations in mind, we argued that generative interpretation nevertheless paves an important middle ground between too-cold textualism and too-hot contextualism. The traditional tradeoffs between textualism and contextualism take as a given that
our textualist inquiry must depend on dictionaries and that extrinsic evidence is necessarily
costly and prone to manipulation. Because generative interpretation is easy to deploy, cheap,
and accurate, and because it is not prone to those specific biases, it suggests a workable third
way. We argue that, given this technology, parties would prefer courts to ascertain meaning
using some extrinsic evidence. As such, generative interpretation will become a majoritarian
default.
With time, these discussions will spill over to even broader debates about statutory
and Constitutional interpretation, originalism and public meaning, and the relative competencies of courts and agencies to reach unbiased, predictable outcomes. We deferred direct
discussion of these issues, not least because we’re not competent to resolve them. This work,
nonetheless, fits into this broader interpretative project of assigning meaning to legal instruments.
We close by offering a different sort of prediction. If, in fact, these models can ascertain party intent to a close-enough approximation, it seems obvious that courts will (and
should) use them to make interpretation better. But if that’s really true, we wonder why
parties would continue to commit to contracts at all. Formal contracting is expensive. Why
not, instead, simply write out jointly held goals at the beginning of the relationship and let

[p. 61]
models spit out codes of conduct and legal responsibility as problems arise down the line?250
Or to put it differently, right now, generative AI looks like a promising judicial adjunct. But
the future of this technology is more disruptive by far: Formal contracts themselves may be
made obsolete. Or, at the very least, jurists should consider the marginal value of contracting
if the terms themselves are fairly determinable from the parties’ goals.
250 Cf. Cathy Hwang, Deal Momentum, 65 UCLA L. REV. 376, 380 (2018) (describing the use of term
sheets as deal motivators).