AI Against Racism
- Juan Manuel Ortiz de Zarate
- Jul 4
- 10 min read
When California’s Assembly Bill 1466 (2021) ordered every county to “affirmatively identify and redact” racially restrictive covenants, it opened an archival Pandora’s box. Santa Clara County alone holds 24 million deeds—some 84 million pages—stretching back to the 1850s. A purely manual review was projected to consume 160 years of staff time and cost more than $22 million, an estimate that effectively paralyzed compliance efforts.
The paper AI for Scaling Legal Reform [1] documents an academic–government partnership that broke that stalemate. Leveraging a fine-tuned open-source language model, the team processed 5.2 million pages from the county’s pre-1980 archive, surfacing thousands of illegal clauses for legal action while preserving a searchable historical record. Their approach reduced human labor by an estimated 86,500 hours and cut direct costs to just $258. The following article unpacks the paper’s data pipeline, technical architecture, empirical findings, and policy implications, showing how artificial intelligence can illuminate—and help dismantle—structural racism embedded in property law.

Background: From Shelley v. Kraemer to AB 1466
The story of racially restrictive covenants (RRCs) is one of persistence—of legal language crafted to exclude, surviving long after the social and legal systems that enabled it have been formally dismantled. These clauses, embedded in property deeds and subdivision agreements across the United States during the early to mid-20th century, explicitly barred people of certain races—most commonly African Americans, Asians, and Latinos—from owning, leasing, or occupying real estate. In many neighborhoods, it was common to find language such as “No person of any race other than the Caucasian race shall use or occupy any building or any lot.” These provisions codified segregation into the very foundation of residential geography, helping produce the racially stratified housing patterns that persist to this day.
Legal Landscape
While private property owners had long used exclusionary practices, RRCs gained momentum in the 1910s and 1920s after the U.S. Supreme Court’s decision in Buchanan v. Warley (1917)[3], which struck down municipal racial zoning laws. In response, developers and homeowners associations increasingly turned to private covenants to maintain racial homogeneity in white neighborhoods. These clauses carried legal weight, and courts frequently upheld them. In 1926, the Supreme Court ruled in Corrigan v. Buckley that such covenants were private contracts beyond the reach of the Constitution, effectively greenlighting their spread nationwide.
That legal status held until Shelley v. Kraemer (1948) [4], when the Court finally ruled that courts could no longer enforce racial covenants without violating the Equal Protection Clause of the Fourteenth Amendment. Even so, the ruling did not prohibit the writing or recording of such covenants—only their enforcement. Many developers and title companies continued to include the language by default, and the discriminatory text was rarely removed from the official record. As late as the 1960s, major banks and insurance companies treated the presence of such clauses as a sign of “stability,” reinforcing redlining and credit discrimination [2].
Further legal reform came with the Fair Housing Act of 1968, which made it illegal to discriminate in housing based on race, religion, or national origin. Yet once again, the law did not mandate removal of the covenants themselves from property records. As a result, millions of deeds across the U.S. still contain racist language—unenforceable in court, but symbolically powerful and emotionally damaging to homeowners who encounter it.
Policy Response in California
California has taken incremental steps to address this legal residue. In 2000, the state passed legislation allowing homeowners to record a modification that declares racially restrictive language in their deed to be void. However, this was a voluntary, homeowner-initiated process, which left the burden on individuals—many of whom were unaware of the covenants or lacked the time and legal guidance to pursue removal. The result was a patchwork of isolated redactions, with little effect on the broader historical record or the administrative systems that perpetuated the presence of these clauses.
In response to mounting pressure from civil rights groups and housing justice advocates, California passed Assembly Bill 1466 [6] in 2021. This landmark legislation shifted the responsibility from individual homeowners to county recorder offices. It required each county in the state to proactively “identify, redact, and preserve” racially restrictive covenants found in official property records. The law aimed to create a systematic, statewide effort to remove these remnants of legalized racism while retaining them in an archival context for historical transparency.
Yet the scope of the challenge quickly became apparent. In populous counties like Santa Clara, with tens of millions of pages of real estate documents, the scale of the task was daunting. A purely manual review—where county staff read every deed from the early 1900s onward—was projected to cost millions of dollars and take over a century of labor. Without scalable tools, compliance was effectively impossible.
A Civic Challenge with Technical Stakes
This impasse between legal obligation and logistical infeasibility set the stage for innovation. Could natural language processing (NLP) and artificial intelligence (AI) be used not only to detect illegal clauses buried in vast document corpora, but also to do so with enough precision and transparency to satisfy legal review? Could open-source models match—or even outperform—expensive proprietary systems in this public-sector setting?
The project documented in AI for Scaling Legal Reform was born of these questions. It represents a collaboration between Stanford researchers, county legal officials, and data scientists. More than a technical proof-of-concept, the project offers a model for how governments might use AI to tackle long-neglected remnants of structural discrimination embedded in public records, starting with racial covenants, but potentially extending to broader areas of law and governance.

Data: Digitizing a Century of Deeds
Digitization and OCR
Santa Clara’s deeds before 1980 were locked in microfiche scans hosted on a proprietary platform. The researchers extracted 5.2 million pages covering 1902–1980—the period when most covenants were written—and ran optical character recognition (OCR) with docTR, pairing a DBNet (ResNet-50) text detector with a CRNN (VGG-16) recognizer. Handwritten pre-1902 documents were excluded because available OCR models proved unreliable.
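To make the setup concrete, here is a minimal sketch of invoking docTR with that detection/recognition pairing; the file path is a placeholder, and the production pipeline batched millions of pages rather than a single scan:

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Pretrained detection (DBNet/ResNet-50) and recognition (CRNN/VGG-16)
# models, matching the stack named above.
model = ocr_predictor(det_arch="db_resnet50",
                      reco_arch="crnn_vgg16_bn",
                      pretrained=True)

pages = DocumentFile.from_images(["deed_scan_0001.png"])  # placeholder path
result = model(pages)

# The export contains word-level text plus relative bounding boxes,
# which later stages use to map flagged clauses back onto the page image.
for block in result.export()["pages"][0]["blocks"]:
    for line in block["lines"]:
        print(" ".join(word["value"] for word in line["words"]))
```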
Augmentation
To improve generalization, the team added 10,000 freely downloadable deeds from seven out-of-state counties, selected via keyword searches for high RRC likelihood. This cross-jurisdictional mix exposed the model to varied page layouts, writing styles, and racial terminology.
Annotation
Ultimately, 3,801 pages were hand-labeled—2,987 containing covenants—to create training and evaluation splits. A custom web app let annotators draw bounding boxes around offending text, enabling span-level supervision. The annotation guidelines warned of potentially offensive language and offered opt-out counseling, reflecting ethical sensitivity.
Method: Building an End-to-End Covenant Detector
To process millions of historic property records and identify racially restrictive covenants (RRCs), the team developed a modular, end-to-end pipeline that combined document digitization, natural language processing, and geospatial matching. The design emphasized legal reliability, transparency, and scalability—key requirements for a system operating within the constraints of public administration.

From Manual Rules to Semantic Models
The starting point was a basic keyword-matching heuristic, using a list of known racial descriptors and exclusionary terms. While useful as a baseline, this approach produced many false positives (e.g., personal names or street names) and failed to capture variations in phrasing or euphemistic language common in mid-20th-century legal documents.
To improve on this, the team experimented with fuzzy string matching techniques, which allowed for some tolerance to OCR noise and spelling variations. However, these methods remained limited to surface-level patterns and could not grasp the legal context or intent behind the clauses.
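A compact illustration of both baselines (the term list is abbreviated and the threshold is an assumption) shows the progression and its limits:

```python
import re
from rapidfuzz import fuzz

# Abbreviated list of historical exclusionary terms found in RRCs.
KEYWORDS = ["caucasian race", "white race", "persons of african descent"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def keyword_flag(page_text: str) -> bool:
    """Exact surface match: fast, but brittle to OCR noise and rephrasing."""
    return PATTERN.search(page_text) is not None

def fuzzy_flag(page_text: str, threshold: int = 85) -> bool:
    """Approximate match: partial_ratio scores the best-matching substring
    (0-100), so OCR-corrupted spellings still clear the threshold, yet the
    method remains blind to context and intent."""
    text = page_text.lower()
    return any(fuzz.partial_ratio(k, text) >= threshold for k in KEYWORDS)
```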
Leveraging Large Language Models
Recognizing the limitations of pattern matching, the authors incorporated large language models (LLMs)[5] capable of semantic understanding. They tested both proprietary and open-source models. While off-the-shelf LLMs like GPT-3.5 showed promise, their outputs were not always legally reliable, especially when prompted in zero- or few-shot settings. This motivated the decision to fine-tune an open-weight model (Mistral-7B) on a labeled corpus of real covenant examples, enabling domain-specific performance and full control over deployment.
The model was trained to predict not just whether a deed contained a covenant, but also to extract the exact span of discriminatory language, a critical feature for redaction and legal auditability.
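The article does not reproduce the full training recipe, but a fine-tune of this kind can be sketched with Hugging Face transformers and PEFT; the LoRA hyperparameters, prompt format, and "[NONE]" convention below are illustrative assumptions, not the authors’ exact setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# Low-rank adapters keep the 7B base weights frozen; only a small
# fraction of parameters are trained, so a single A100 suffices.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def format_example(page_text: str, covenant_span: str | None) -> str:
    """Assumed prompt format: the model must quote the covenant verbatim,
    so its output doubles as a span prediction ('[NONE]' when absent)."""
    target = covenant_span if covenant_span else "[NONE]"
    return (f"### Deed page:\n{page_text}\n"
            f"### Racial covenant (verbatim, or [NONE]):\n{target}")
```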
Span Alignment and OCR Integration
Since the model operated on OCR text, aligning its output to the original document image required additional logic. Predicted spans had to be matched back to character-level coordinates provided by the OCR engine. The authors implemented a fuzzy alignment algorithm that tolerated slight wording discrepancies while recovering accurate bounding boxes for visual overlays—ensuring the flagged language could be precisely located and redacted.
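A simplified version of such an alignment, using Python’s standard difflib rather than the authors’ exact algorithm, might look like this:

```python
from difflib import SequenceMatcher

def align_span(predicted: str, ocr_text: str) -> tuple[int, int] | None:
    """Locate the model's quoted span inside the raw OCR text, tolerating
    small wording discrepancies; returns character offsets that downstream
    code can map to the OCR engine's word bounding boxes."""
    matcher = SequenceMatcher(None, ocr_text, predicted, autojunk=False)
    blocks = matcher.get_matching_blocks()        # final block is a sentinel
    matched = sum(b.size for b in blocks)
    if matched / max(len(predicted), 1) < 0.8:    # assumed similarity floor
        return None
    start = blocks[0].a                           # first matching region
    end = blocks[-2].a + blocks[-2].size          # last real matching block
    return start, end
```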
Geolocation via Metadata Parsing
To enable public mapping and policy analysis, the system also extracted tract-level metadata (e.g., subdivision names, book and page numbers) from each deed. A lightweight LLM was used to parse these references from noisy OCR text. Then, using fuzzy matching, the extracted tokens were linked to the county’s GIS parcel data. This allowed the system to associate identified covenants with specific geographic areas and display them interactively on a public map.
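As a rough sketch (the parcel table and its entries are hypothetical), the linkage step could look like this with rapidfuzz:

```python
from rapidfuzz import process, fuzz

# Hypothetical slice of the county's GIS parcel table: name -> centroid.
gis_subdivisions = {
    "PALO ALTO TRACT NO. 2": (37.44, -122.14),
    "SAN JOSE HEIGHTS": (37.33, -121.89),
}

def locate(parsed_name: str, min_score: int = 85):
    """Return coordinates of the best fuzzy match, or None if too weak."""
    match = process.extractOne(parsed_name, gis_subdivisions.keys(),
                               scorer=fuzz.token_sort_ratio,
                               score_cutoff=min_score)
    return gis_subdivisions[match[0]] if match else None

# OCR often reads "O" as "0"; fuzzy matching still recovers the tract.
print(locate("PALO ALT0 TRACT N0. 2"))   # -> (37.44, -122.14)
```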
Modular and Auditable Architecture
The pipeline was designed with modularity in mind. Each step—from OCR to clause detection to mapping—can be independently audited, replaced, or adapted to new jurisdictions. This ensures the system’s transparency and flexibility while respecting legal requirements, such as human-in-the-loop validation and archival preservation of the original documents.
The result is not just a technical tool, but a legally actionable system for accelerating compliance with anti-discrimination law while preserving historical traceability.
Evaluation: Accuracy, Speed, and Cost
The practical success of this project hinged not only on technical performance but also on meeting the economic and legal demands of a public-sector deployment. The authors conducted a comprehensive evaluation of four different approaches to detecting racially restrictive covenants (RRCs) in the Santa Clara County deed archive, assessing each one across three critical dimensions: accuracy, processing time, and cost. This comparison underscores the feasibility of deploying open-source models in a high-stakes legal setting and highlights how artificial intelligence can unlock massive productivity gains in government workflows.
Comparative Benchmarking
To quantify the trade-offs, the team benchmarked four methods:
- Manual Review by legal staff
- GPT-3.5 API (zero-shot classification)
- GPT-4 Turbo API (zero-shot classification)
- Fine-tuned Mistral-7B with span prediction on local infrastructure
Each method was tasked with processing 5.2 million pages of scanned deeds. Results were summarized in a comparative table (Table 1 in the paper), showing dramatic differences in both resource requirements and effectiveness.
| Approach | Time Required | Cost | Accuracy (P/R) |
| --- | --- | --- | --- |
| Manual Review | 9.9 years (single staffer) | $1.4M+ | ~100% (human) |
| GPT-3.5 API | 3.6 days | $13,634 | 99% / 77–96% |
| GPT-4 Turbo API | 3.9 days | $47,944 | 99% / 83–98% |
| Fine-tuned Mistral-7B | 6 days (1×A100 GPU) | $258 | 100% / 99.4% |
The results are striking: the open-source Mistral model matched or exceeded the accuracy of the proprietary models while cutting direct costs by roughly 185× relative to GPT-4 Turbo. The model completed the task in six days, running on a single rented A100 GPU (a widely available cloud instance), with total compute costs estimated at just $258.

Accuracy: Precision and Recall Trade-offs
In legal contexts, the stakes of both false positives (flagging innocent deeds) and false negatives (missing discriminatory clauses) are high. The evaluation used two key metrics, illustrated in the short sketch after the definitions:
- Precision: the percentage of flagged spans that were actual RRCs
- Recall: the percentage of total RRCs that were successfully identified
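As a quick reference, with hypothetical counts:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)   # share of flagged spans that are real RRCs

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)   # share of real RRCs that were found

# e.g., 994 covenants found, 6 missed, 0 false alarms:
print(precision(994, 0), recall(994, 6))   # 1.0, 0.994
```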
The fine-tuned Mistral model achieved 100% precision and 99.4% recall on the held-out evaluation set, meaning it correctly identified almost every covenant without hallucinating any non-existent ones. This high precision is crucial: it means that downstream legal review is not overwhelmed by spurious alerts, making the system viable at scale.
In contrast, GPT-3.5 and GPT-4 Turbo achieved strong precision (99%), but their recall was inconsistent: 77% and 83% respectively in the zero-shot setting, rising to 96% and 98% with few-shot prompting. This improvement, however, came at the cost of occasional hallucinated spans, introducing reliability concerns that limit their use in legal workflows without human correction.
Span-Level Evaluation
Unlike traditional document classification tasks, this project required the model to identify the exact location of problematic language within a page. This span-level prediction was evaluated using BLEU scores (a metric borrowed from machine translation) to measure overlap between the predicted and ground-truth text spans. The Mistral model achieved a BLEU score of 0.93, indicating high fidelity in locating discriminatory clauses—another critical factor for successful downstream redaction.
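For intuition, span-level BLEU can be computed with NLTK; the example strings below are hypothetical, and smoothing avoids zero scores on short clauses:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ("no person of any race other than the "
             "caucasian race shall occupy").split()
predicted = ("no person of any race other than the "
             "caucasian race shall use").split()

score = sentence_bleu([reference], predicted,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")   # ~0.9 when spans nearly coincide
```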
The team also developed a fuzzy alignment algorithm to map the model’s output strings back to precise character coordinates on the OCR’d page. This enabled accurate redaction overlays and improved the user experience for both legal staff and the public map viewer.
Legal Review Efficiency
California law (AB 1466) requires legal counsel to review each flagged deed before redaction. To minimize this burden, the model assigned confidence scores to every prediction. The team calibrated a 75% confidence threshold, below which predictions were discarded. Random sampling of high-confidence results revealed that the model’s real-world precision ranged between 96.4% and 99.7%, depending on the page type and OCR quality.
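In code, this triage reduces to a simple filter (field names assumed):

```python
CONFIDENCE_THRESHOLD = 0.75   # calibrated cutoff described above

def triage(predictions: list[dict]) -> list[dict]:
    """Queue only high-confidence flags for legal review; drop the rest."""
    return [p for p in predictions if p["confidence"] >= CONFIDENCE_THRESHOLD]

queue = triage([{"page": 12, "confidence": 0.98},
                {"page": 47, "confidence": 0.41}])
print(queue)   # only page 12 reaches counsel's review queue
```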
This high precision allowed the legal team to process flagged pages quickly: most decisions were binary (“yes, redact” or “no, discard”), with minimal ambiguity. The AI pipeline didn’t eliminate the human-in-the-loop requirement—it simply streamlined it, saving an estimated 86,500 hours of review labor.
Time and Cost Implications for Public Agencies
From a policy perspective, perhaps the most important result is cost reduction. Manually reviewing the 5.2 million pre-1980 pages was estimated to cost over $1.4 million and take nearly a decade of full-time labor (for the county’s complete 84-million-page archive, the projection was 160 staff-years and more than $22 million), a scale that would have delayed compliance by decades. By contrast, the open model’s total cost—including GPU rental, storage, and minimal human annotation—was under $300, making it accessible even to resource-constrained counties.
This cost-performance ratio is not merely a technical achievement; it’s a political enabler. Without this intervention, many jurisdictions would have failed to meet AB 1466’s mandate—or worse, quietly declined to attempt it at all. By demonstrating that full-county processing is possible with minimal budget, the project empowers other recorder’s offices to act and provides a replicable blueprint.
Limitations and Future Work
Imperfect Recall. A 0.6% miss rate still leaves some covenants undiscovered; boundary-pushing euphemisms (e.g., “injurious to the locality”) remain a challenge.
Beyond Race. California law bars discrimination on gender, religion, disability, etc. The current model targets racial terms; extending it requires additional annotation and few-shot prompts.
Domain Transfer. Counties with different deed formats or OCR quality should perform validation before wholesale deployment.
Community Engagement Trade-offs. Automating detection may reduce the volunteer labor that fuels grassroots projects like Mapping Prejudice. The authors argue, however, that freed human effort can shift from labeling to interpretation and advocacy.
Conclusion
AI for Scaling Legal Reform does more than shave clerical hours; it shows that modern NLP can surface buried evidence of discrimination and accelerate statutory compliance without sacrificing due process or historical memory. In one county’s archive, a fine-tuned 7-billion-parameter model transformed a century of microfilm into actionable insights: thousands of racist clauses slated for removal, a public map that makes segregation’s legacy visible, and a replicable technical blueprint for jurisdictions nationwide.
The project exemplifies a civic-minded AI ethos: identify a concrete public-interest bottleneck, co-design with domain experts, measure rigorously, and release tools openly. As other states weigh covenant-redaction bills—and as lawmakers confront equally sprawling legal corpora—the Santa Clara experiment stands as proof that the “boring” work of document cleanup can be both technically elegant and socially transformative. Artificial intelligence, properly harnessed, does not erase history; it helps us read it, learn from it, and write the next chapter more justly.
References
[1] Delaney, K., Golub, D., Hu, J., Lo, J., et al. (2024). AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County. arXiv preprint arXiv:2503.03888.
[2] Freund, D. M. (2010). Colored property: State policy and white racial politics in suburban America. In Colored Property. University of Chicago Press.
[3] Buchanan v. Warley, 245 U.S. 60 (1917), Justia U.S. Supreme Court
[4] Shelley v. Kraemer, 334 U.S. 1 (1948), Justia U.S. Supreme Court
[5] The Architecture That Redefined AI, Transcendent AI
[6] Assembly Bill No. 1466, California Legislative Information