Foundation Models
- Juan Manuel Ortiz de Zarate
- May 7
- 11 min read
Artificial Intelligence is going through a major transformation. Just a few years ago, most AI systems were built for specific tasks: detect spam, translate a sentence, tag objects in images. But now, we’re entering a new era—one where giant, general-purpose models can do all of those things and more, without being built from scratch each time. These are called foundation models. If you’ve heard of BERT [1], GPT-3 [2], or DALL·E [4], you’ve already met one.
A group of over 100 researchers from Stanford’s Center for Research on Foundation Models (CRFM), led by Rishi Bommasani and Percy Liang, took on the monumental task of unpacking the opportunities and risks behind these powerful tools. Their report, “On the Opportunities and Risks of Foundation Models” [3], is a sweeping, 200-page deep dive into how foundation models are built, where they’re used, and what they mean for the future of AI—and society at large.
Let’s break it down.
Two Key Concepts: Emergence and Homogenization
To really grasp what makes foundation models special—and potentially risky—it helps to understand two central ideas the Stanford report emphasizes: emergence and homogenization. These aren't just buzzwords. They describe how foundation models behave in ways we didn’t fully predict, and how they’re changing the way we build all kinds of AI systems.

🔮 Emergence: When Bigger Means Unexpectedly Smarter
Let’s start with emergence. In science, emergence refers to complex behaviors that arise from simpler rules, often unexpectedly. Think about how individual ants follow simple rules, but collectively build entire colonies with tunnels and farming systems. Foundation models work a bit like that.
When researchers trained early language models like GPT-2, they were good at autocomplete. But when they scaled up to GPT-3, with 175 billion parameters, something strange happened. The model suddenly started doing things it hadn’t been explicitly trained to do, like solving math problems, answering trivia questions, or translating languages. It even showed signs of reasoning, just from reading enough examples online.
This wasn’t because someone gave it extra labels or supervised learning. It just... happened.
One of the most striking emergent behaviors is in-context learning. That’s where the model learns to complete a task just from a few examples in the prompt—without any parameter updates or fine-tuning. You paste in:
Q: Translate “Hello” to Spanish. A: Hola
Q: Translate “Goodbye” to Spanish. A: Adiós
Q: Translate “Thank you” to Spanish. A:
...and GPT-3 figures it out on the fly. That’s wild. No retraining. No explicit instructions. Just inference-time reasoning.
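To make the mechanics concrete, here is a minimal sketch of few-shot prompting in Python. The Hugging Face transformers library and the small GPT-2 checkpoint are assumptions of this example, chosen only because they run locally; a model that small generally won’t show the reliable in-context learning that emerged at GPT-3 scale, but the prompt format is exactly the pattern shown above.

```python
# Minimal few-shot prompting sketch (illustrative only; assumes the Hugging Face
# `transformers` library is installed and uses GPT-2 as a small local stand-in).
from transformers import pipeline

prompt = (
    'Q: Translate "Hello" to Spanish.\nA: Hola\n'
    'Q: Translate "Goodbye" to Spanish.\nA: Adiós\n'
    'Q: Translate "Thank you" to Spanish.\nA:'
)

generator = pipeline("text-generation", model="gpt2")
output = generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]

# The model's answer is whatever it appended after the prompt.
# A GPT-3-scale model would reliably answer "Gracias"; GPT-2 usually will not.
print(output[len(prompt):].strip())
```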
Other emergent behaviors have included writing functioning code from natural language (as in Codex), generating photorealistic images from text (as in DALL·E), and even early signs of planning or deception in multi-agent simulations. These capabilities weren’t part of the original training goals, and they weren’t always predictable based on smaller models.
The takeaway? Scale changes the game. Once models get big enough and are trained on enough data, entirely new abilities can surface. That’s exciting—but it also means we don’t fully understand what we’re creating, and we can’t easily anticipate how these models will behave in new contexts.
🧪 Homogenization: One Model to Rule Them All?
The second big idea is homogenization, which refers to the growing trend of using similar methods, architectures, and even the same models across a wide range of tasks and domains.
Back in the early days of AI, every task had its own custom pipeline. You’d build a model for image classification, another for speech recognition, another for machine translation. Each one was fine-tuned, domain-specific, and required tons of engineering.
Now? Foundation models are flipping that script.
With models like GPT-3, BERT, or CLIP, we’re seeing a “one-size-fits-most” approach. The same general-purpose model—often trained on the same web-scale data and using the same Transformer architecture—can be adapted to dozens of tasks across natural language, vision, and even scientific domains like protein folding or genomics.
This has big upsides:
Efficiency: Instead of reinventing the wheel, you fine-tune or prompt a foundation model.
Transferability: Knowledge gained from one domain (say, internet text) can help in another (like law or medicine).
Speed: Teams can build prototypes faster, using pretrained APIs rather than starting from scratch.
But there are major risks too.
The biggest one? Shared failure modes. If a foundation model has a flaw—like racial or gender bias, factual hallucinations, or security vulnerabilities—every system built on top of it inherits those issues. When one model powers hundreds of applications, it becomes a single point of failure.
Imagine a world where almost every product, chatbot, translation tool, or virtual assistant relies on a small number of massive models. That’s the world we’re heading toward. It centralizes power (usually in a few tech giants), and it reduces diversity in AI approaches. If one foundation model has a blind spot or gets it wrong, the error could ripple out across millions of users.
The report also points out that homogenization is happening across research communities. The Transformer architecture is now everywhere—from NLP to vision to speech to biology. What started in language modeling is becoming the dominant paradigm for all of AI.
And that raises some philosophical and practical questions:
Are we limiting ourselves by putting all our eggs in the Transformer basket?
What happens to innovation when everyone is building on the same foundation?
Should we diversify architectures and training data to reduce systemic risk?
The Big Picture
Emergence and homogenization are two sides of the same coin. As we build bigger, more generalized models, they develop surprising abilities we didn’t plan for—and they also start to dominate every corner of the AI landscape.
That’s a powerful combination. But as the report reminds us, power demands caution. Emergence means we don’t fully understand the models we’re deploying. Homogenization means the consequences of a mistake can spread far and wide.
The challenge for researchers, developers, and policymakers isn’t just how to make these models better—it’s how to make them safer, fairer, more transparent, and more inclusive.
What Can Foundation Models Do?

The Stanford report explores five major areas of capability:
1. Language
This is where foundation models have made the biggest splash. From autocomplete to writing entire articles, language models like BERT and GPT have revolutionized NLP (natural language processing). They outperform older systems on nearly every benchmark. But there’s a catch: they often struggle with nuance, context, or underrepresented dialects.
2. Vision
Computer vision is catching up fast. Models like CLIP [5] and DALL·E learn from text-image pairs, letting them understand images in richer ways—or even generate new ones. The promise? Better medical diagnostics, smarter surveillance (hopefully with safeguards), and AI that “sees” the world more like humans do.
3. Robotics
Robotics is trickier, since it involves the messy, physical world. But foundation models trained on video, language, and sensor data could give robots a head start—learning general skills that can be fine-tuned to specific environments (like your kitchen).
4. Reasoning
Tasks like math, coding, and puzzle-solving require logical thinking. Here, foundation models show surprising promise—OpenAI’s Codex can already generate functional code from plain English. Still, deep reasoning is a work in progress.
5. Interaction
Foundation models are powering new types of interfaces—from smart chatbots to voice assistants that understand context and respond naturally. This opens up new opportunities for accessibility, education, and creative expression.
What Exactly Is a Foundation Model?
Let’s get one thing straight: foundation models are not just “bigger AI models.” They represent a new way of thinking about how AI is built, used, and deployed. Instead of training a new model for every single task—like writing emails, classifying images, answering questions—we now train one giant model that can do many tasks, often with little or no task-specific training. That’s the foundation model approach.
So, what makes a model a foundation model?
According to the Stanford report, a foundation model is:
“A model trained on broad data (usually with self-supervision and at massive scale), which can then be adapted to a wide range of downstream tasks.”
Let’s unpack that.
🧠 It’s Trained on Broad Data
Unlike traditional models trained on carefully curated, labeled datasets, foundation models are trained on raw, messy, broad data—often scraped from the open internet. Think Wikipedia, books, news articles, code repositories, and social media posts.
The goal isn’t to teach the model one specific thing. It’s to expose it to a massive chunk of the world’s knowledge in its natural, unstructured form. That’s why GPT-3 reads web pages and DALL·E looks at image-caption pairs: they’re learning general patterns from everything.
This massive data diversity is part of what makes them so flexible. Once they’ve learned “how language works” or “what objects look like,” they can generalize to tasks no one explicitly trained them for.
🤖 It Uses Self-Supervised Learning
Foundation models aren’t spoon-fed with clean labels like “this is a cat” or “this sentiment is positive.” Instead, they learn by predicting parts of their input. For instance:
In BERT, the model learns to guess masked-out words in a sentence. Example: “The capital of France is [MASK].” → “Paris”
In GPT, the model predicts the next word in a sequence. Example: “Once upon a time, there was a brave…” → “knight”
This process is called self-supervised learning. It’s powerful because it lets us train on huge unlabeled datasets—no human annotation required. It’s also surprisingly effective at teaching models how language, vision, or code behaves, because the “fill-in-the-blank” task forces the model to develop rich internal representations.
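As a concrete illustration of both objectives, here is a minimal sketch using the Hugging Face transformers pipeline API. The library and the specific checkpoints are assumptions of this example, not something the report prescribes.

```python
# Self-supervised objectives in miniature (illustrative; assumes the Hugging Face
# `transformers` library is installed).
from transformers import pipeline

# Masked language modeling (BERT-style): guess the hidden word.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
best_guess = fill_mask("The capital of France is [MASK].")[0]
print(best_guess["token_str"])  # most likely completion, typically "paris"

# Next-token prediction (GPT-style): continue the sequence.
generate = pipeline("text-generation", model="gpt2")
continuation = generate("Once upon a time, there was a brave", max_new_tokens=5)
print(continuation[0]["generated_text"])
```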
🧮 It’s Massive—And Getting Bigger
Let’s talk size. Foundation models are huge. GPT-3 has 175 billion parameters. Google’s PaLM model goes up to 540 billion. These aren’t just large—they’re unprecedented.
Why so big? Because size + data = power. As the Stanford team explains, scale unlocks emergent capabilities (see the previous section). With enough training, these models start showing general skills like reasoning, summarization, translation—even coding or math—without being explicitly trained for those tasks.
But it’s not just about size for its own sake. It’s about capacity: the ability to store patterns, represent complex relationships, and generalize to new problems. Think of it like increasing the brainpower of the model.
🔧 It’s Adapted, Not Rebuilt
Once you’ve got a trained foundation model, you don’t need to start from scratch for every task. You just adapt it.
There are a few popular ways to do this:
Fine-tuning: Train the model a bit more on your specific task (e.g., legal contracts, medical summaries).
Prompting: Give the model an example in plain language and let it infer the task. (e.g., “Translate this to French: ‘Good morning’ → ”)
Adapters / LoRA: Plug in lightweight modules or tweak just a few parameters to customize behavior with minimal compute.
This adaptability is what makes foundation models so efficient—and so disruptive. A single model can power dozens of applications, from chatbots and recommendation systems to diagnostic tools and creative writing assistants.
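As one example of the adapter/LoRA route mentioned above, here is a minimal sketch using the Hugging Face peft library. The library choice, the GPT-2 stand-in model, and the hyperparameters are illustrative assumptions, not recommendations from the report.

```python
# Parameter-efficient adaptation with LoRA (illustrative; assumes the Hugging Face
# `transformers` and `peft` libraries are installed).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in for a foundation model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused attention projection layer
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
# `model` can now be fine-tuned on task-specific data while the base weights stay frozen.
```

Because only the low-rank matrices are updated, the original weights stay frozen and the adaptation fits on far more modest hardware than full fine-tuning.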
🧩 It’s Incomplete by Design
One of the most important points the report makes is that a foundation model is not a final product. It’s not a chatbot, a lawyer, or a medical assistant out of the box.
It’s more like a general-purpose engine—raw, powerful, and flexible, but also potentially dangerous if used carelessly. To make it useful (and safe), you have to build on top of it: add safeguards, interfaces, evaluation pipelines, and domain knowledge.
That’s why the Stanford team chose the term foundation: it’s the base layer, not the whole house.
🧠 A Mental Model
If you’re still wrapping your head around it, here’s an analogy.
Imagine you trained a really smart kid on every book in the library, every article online, and every YouTube transcript. This kid didn’t memorize everything, but they learned general patterns of how people talk, think, and solve problems.
Now, whenever you ask them to do something—write a summary, explain a concept, make up a story—they give it a shot using everything they’ve seen before.
That’s what a foundation model is: a generalist, trained at scale, capable of adapting to new tasks with minimal extra guidance.
Real-World Applications: Healthcare, Law, and Education
The report highlights three fields where foundation models could be game-changers—and where the stakes are high.
Healthcare
Imagine an AI that helps doctors synthesize patient records, spot anomalies in X-rays, or even suggest treatments based on clinical guidelines. That’s the promise of foundation models in healthcare. But there are huge challenges: patient privacy, bias in medical data, and the need for models that explain their decisions.
Law
Legal documents are dense and complicated. Foundation models could help summarize cases, find relevant precedents, or draft contracts. But again, transparency and accuracy are key—especially when people’s rights are on the line.
Education
Personalized tutoring, automated feedback, AI-generated practice problems—foundation models could make education more accessible and engaging. On the flip side, they also make plagiarism and cheating easier. The question is how to use them responsibly to empower students and teachers.
What Makes These Models Work?
The secret sauce behind foundation models boils down to three ingredients:
Architecture: Most foundation models use Transformers—a neural network structure designed for handling sequences. Their attention mechanism lets models weigh the importance of each part of their input (see the short sketch after this list).
Scale: These models are huge. GPT-3 has 175 billion parameters. More data, more compute, more capacity—it turns out scale often leads to better performance and surprising abilities.
Self-Supervised Learning: Rather than relying on human-labeled data, these models learn by predicting missing parts of their input (like the next word in a sentence). This makes it possible to train on web-scale datasets without needing expensive annotations.
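To show what that attention mechanism boils down to, here is a short, NumPy-only sketch of scaled dot-product attention. The function name and toy shapes are illustrative; real Transformers add multiple heads, masking, and learned projections on top of this.

```python
# Scaled dot-product attention in miniature (illustrative sketch, NumPy only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key; the weights say how much each input position matters."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity between queries and keys
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over input positions
    return weights @ V                                # weighted mix of the values

# Toy self-attention: 3 token positions, 4-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x).shape)    # (3, 4)
```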
But Here Come the Risks…

Foundation models are powerful—but also potentially dangerous. The Stanford report dives deep into the societal implications:
Bias and Fairness
Because they’re trained on real-world data, foundation models absorb real-world biases. That means stereotypes, discrimination, and misinformation can be baked into the model. And since these models are reused everywhere, the damage can multiply.
Misuse
Bad actors can use foundation models to generate deepfakes, spam, or targeted disinformation. The better the model gets at mimicking human language or behavior, the harder it is to detect manipulation.
Environmental Impact
Training a large foundation model consumes huge amounts of electricity. The carbon footprint of a single training run can be enormous—raising ethical questions about sustainability.
Legal and Economic Uncertainty
Who’s liable if a foundation model makes a mistake that harms someone? What happens to jobs when AI can draft contracts, write code, or tutor students? The legal and economic systems are still catching up.
Who Controls the Future?
A major concern raised in the report is centralization. Right now, only a handful of companies (OpenAI, Google, Meta, etc.) have the compute and data to build state-of-the-art foundation models. This concentration of power could stifle innovation, transparency, and public accountability.
One solution? Open research ecosystems. Projects like Hugging Face’s BigScience [6] or EleutherAI [7] are building large, community-driven models to keep AI research open and inclusive. Governments and universities also have a role to play—investing in infrastructure, regulation, and oversight.
Where Do We Go From Here?

The Stanford team ends on a hopeful but urgent note: foundation models are not going away. They’re becoming the infrastructure for modern AI—like operating systems were for computing.
But unlike software, these models evolve in unpredictable ways. They require us to rethink how we evaluate, regulate, and co-develop AI systems. We need interdisciplinary collaboration—computer scientists, ethicists, lawyers, social scientists—to build models that reflect not just technical excellence, but human values.
We’re still at the beginning of this journey. The questions we ask now—about who builds foundation models, how they’re trained, and what they’re used for—will shape the AI future for decades.
“On the Opportunities and Risks of Foundation Models” isn’t just a technical report. It’s a call to action. Foundation models are laying the groundwork for the next generation of AI—but what we build on that foundation is up to us.
References
[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[2] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[3] Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[4] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., ... & Sutskever, I. (2021, July). Zero-shot text-to-image generation. In International Conference on Machine Learning (pp. 8821-8831). PMLR.
[5] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
[6] The BigScience project, Hugging Face.
[7] EleutherAI.