Orca: The New LLM Teacher

The field of artificial intelligence has made significant strides over the last decade, particularly in the development and application of large language models (LLMs). These models, such as GPT-4, have demonstrated remarkable proficiency in generating human-like text, answering complex questions, and even reasoning through problems. However, these cutting-edge models require immense computational resources, making them impractical for widespread use in many scenarios. Enter the concept of the "student model": a smaller, more computationally efficient language model that learns to mimic the performance of a much larger teacher model. Recent research, particularly the work of Arindam Mitra and his colleagues at Microsoft, has introduced "Orca 2" [1], a groundbreaking approach that refines this teacher-student dynamic, producing student models that can perform nearly as well as their larger counterparts. This article explores how this technique works and why it holds such promise for the future of artificial intelligence.



The generic teacher-student framework for knowledge distillation. Source: [5]

A New Approach to Efficient AI: Orca 2 and Its Potential


In a typical teacher-student relationship in machine learning, a large, high-performing model—known as the teacher—guides a smaller, less computationally expensive model, the student. The student, while limited in size and resources, is trained to imitate the teacher's outputs as closely as possible. However, Mitra's team took this idea a step further by developing Orca 2, a method that enables the student model to not only replicate the teacher's answers but also learn and adopt the most effective reasoning strategies that the teacher uses to reach those answers.


The key insight behind Orca 2 is that reasoning strategies significantly influence the performance of language models. In other words, the way a model approaches and breaks down a problem can be as important—if not more so—than its size or the amount of data it has been trained on. Different reasoning strategies, such as "think step by step," "recall then generate," or "explain then generate," can yield varying results depending on the nature of the task at hand. What’s more, models of different sizes and capabilities may excel using different strategies. This insight led the researchers to propose a novel idea: Instead of always teaching the student model to use the teacher’s best-performing strategy, why not focus on teaching the student the strategy that works best for it?
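To make the idea of a reasoning strategy concrete, here is a minimal sketch (in Python) of how the same question can be wrapped in different system instructions. The strategy wording and the STRATEGIES dictionary are illustrative assumptions, not the exact prompts used in the paper.

```python
# Illustrative reasoning-strategy templates; the wording is assumed, not taken from the paper.
STRATEGIES = {
    "direct": "Answer the question as concisely as possible.",
    "step_by_step": "Think through the problem step by step before giving the final answer.",
    "recall_then_generate": "First recall the relevant facts, then use them to answer.",
    "explain_then_answer": "Explain your reasoning in full, then state the final answer.",
}

def build_prompt(strategy: str, question: str) -> list[dict]:
    """Wrap a question in the system instruction for a given reasoning strategy."""
    return [
        {"role": "system", "content": STRATEGIES[strategy]},
        {"role": "user", "content": question},
    ]

# The same task can be posed under every strategy to see which one a given model handles best.
messages = build_prompt("step_by_step", "A train travels 120 km in 2 hours. What is its average speed?")
```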


The Core of Orca 2: Reasoning Matters


Orca 2’s design is based on the premise that the reasoning process a model uses can determine its success on a given task. While larger models like GPT-4 may perform better using a specific reasoning strategy for certain types of problems, smaller models like Llama 2 [2] (the student model used in this study) might benefit from entirely different approaches to the same task.


The goal of Orca 2 is to identify the best reasoning strategy for the student model rather than forcing it to mirror the teacher's exact behavior. In practice, this means that while GPT-4 might be the teacher model in this setup, Llama 2 learns to solve tasks using its own optimal strategy, which may differ from GPT-4's.


The Fine-Tuning Process: Creating a Smarter Student


The process of training Llama 2 in Orca 2 is both rigorous and methodical. First, the researchers assembled a dataset of examples that covered roughly 1,500 tasks, sourced from various datasets. These datasets included tasks such as text classification, math questions, logic puzzles, and multiple-choice questions. They used FLAN [4]—a collection of datasets that cover a broad range of tasks—as a foundation, but they also incorporated additional math problems from ten different datasets not included in FLAN. This ensured that the dataset was diverse and representative of a wide array of problem types.
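As a rough illustration of how such a pool might be assembled, the snippet below loads GSM8K (one publicly available math dataset) with the Hugging Face datasets library and converts it to a shared schema; the FLAN portion, whose exact composition the article does not detail, would be merged the same way.

```python
# Sketch of building a mixed task pool; GSM8K stands in for the extra math sources,
# and the FLAN tasks would be converted to the same schema and appended.
from datasets import load_dataset

math_tasks = load_dataset("gsm8k", "main", split="train")

pool = [
    {"question": row["question"], "answer": row["answer"], "source": "gsm8k"}
    for row in math_tasks
]
# pool += flan_examples  # FLAN-derived examples, mapped to the same schema
```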


Once the dataset was prepared, the researchers began testing Llama 2 on each task using several different reasoning strategies. These strategies were not limited to straightforward question-answering but included more complex approaches like "think step by step" or "explain then answer." While the authors did not disclose the complete list of strategies used, they emphasized the importance of using varied reasoning processes to maximize the model's potential.
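The paper does not publish the selection code, but the procedure can be sketched as a simple loop: run the student under each candidate strategy and keep the one that scores best on that task. The names evaluate_student and tasks below are hypothetical placeholders.

```python
# Hypothetical sketch of per-task strategy selection for the student model.
def evaluate_student(task, strategy: str) -> float:
    """Score the student (e.g. Llama 2) on a task when prompted with the given strategy.
    Placeholder: a real implementation would run the model and grade its answers."""
    raise NotImplementedError

def pick_best_strategy(task, strategies: list[str]) -> str:
    """Return the reasoning strategy that gives the student its highest score on this task."""
    scores = {s: evaluate_student(task, s) for s in strategies}
    return max(scores, key=scores.get)

# best_strategy = {name: pick_best_strategy(task, list(STRATEGIES)) for name, task in tasks.items()}
```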


Next, GPT-4—the teacher model—was tasked with augmenting the dataset. For each task, GPT-4 was prompted to respond using the reasoning strategy that had led Llama 2 to its best performance. In essence, GPT-4 provided not just the answers to the tasks but also a detailed explanation of the reasoning behind those answers. This enriched the dataset by including not only the final responses but also the step-by-step logic that led to them.
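In code, collecting such a demonstration might look like the sketch below. It assumes access to GPT-4 through the OpenAI Python client (version 1.x); the helper name and prompt layout are illustrative, not the authors' actual pipeline.

```python
# Sketch of collecting a teacher demonstration with the strategy chosen for the student.
# Assumes the OpenAI Python client (>=1.0) and GPT-4 access; not the authors' actual code.
from openai import OpenAI

client = OpenAI()

def teacher_demonstration(question: str, strategy_instruction: str) -> str:
    """Ask the teacher to answer under the student's best strategy,
    returning its full reasoning plus the final answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": strategy_instruction},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```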


The fine-tuning of Llama 2 was then based on this augmented dataset. When given a new prompt, Llama 2 was not explicitly told which reasoning strategy to use. Instead, it was trained to produce both the reasoning process and the final response that GPT-4 had generated for the same prompt using Llama 2's best-performing strategy. This allowed Llama 2 to learn from GPT-4’s reasoning without being limited to GPT-4's preferred strategies, making it a much more versatile and efficient model.
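A minimal sketch of how such a fine-tuning example could be assembled is shown below: the detailed strategy instruction is dropped and replaced with a generic system message, so the student has to reproduce the teacher's strategy-specific reasoning without being told which strategy to use (the Orca 2 paper refers to this as prompt erasure). The generic message wording here is an assumption.

```python
# Sketch of one fine-tuning example with the detailed strategy instruction removed.
# The generic system message wording is an assumption for illustration.
GENERIC_SYSTEM = "You are a helpful assistant. Think carefully and provide your reasoning and the final answer."

def make_training_example(question: str, teacher_output: str) -> dict:
    """The student sees only a generic prompt; the training target is the teacher's
    full strategy-specific reasoning and answer, so the strategy is learned implicitly."""
    return {
        "messages": [
            {"role": "system", "content": GENERIC_SYSTEM},
            {"role": "user", "content": question},
            {"role": "assistant", "content": teacher_output},
        ]
    }
```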


Outperforming Comparable Models


Orca 2’s performance stands out particularly in comparison to models of similar or larger size, demonstrating that it is not only efficient but also highly competitive in reasoning and comprehension tasks. The extensive evaluation of Orca 2, particularly the 13B-parameter version, reveals significant advances over comparable models like WizardLM and LLaMA-2-Chat at both the 13B and 70B parameter scales. Orca 2 has proven capable of matching or exceeding models 5 to 10 times larger in parameter count on several key benchmarks that measure reasoning capabilities in a zero-shot setting.


Macro-average performance of different models on reasoning benchmarks

Key Benchmarks and Comparative Performance


The evaluation of Orca 2 involved a wide array of benchmarks designed to test various reasoning and comprehension skills, including AGIEval [3], BigBench Hard (BBH), MMLU, and other complex reasoning tasks such as Discrete Reasoning Over Paragraphs (DROP) and GSM8K, which focuses on multi-step mathematical problem solving. Across these tasks, the 13B version of Orca 2 consistently surpassed both WizardLM-13B and LLaMA-2-Chat-13B in zero-shot evaluations.


  • Reasoning Performance: Orca-2-13B achieved a relative improvement of 47.54% over LLaMA-2-Chat-13B and 28.15% over WizardLM-13B on average across reasoning benchmarks like AGIEval, which includes multiple-choice and fill-in-the-blank questions (the arithmetic behind these relative-improvement figures is sketched after this list). These results are particularly impressive considering that all three models are built on the same LLaMA-2 foundation, highlighting the efficiency of the reasoning strategies learned by Orca 2.


  • Zero-shot Setting Superiority: The fact that Orca-2-13B matched or outperformed models five to ten times its size, such as LLaMA-2-Chat-70B and WizardLM-70B, further emphasizes the advantages of its training approach. On benchmarks like AGIEval and BigBench Hard, for instance, Orca-2-13B delivered nearly equivalent or better performance than LLaMA-2-Chat-70B, a model with roughly five times as many parameters, showing that Orca 2’s focused reasoning strategies allow it to compete with significantly larger models.

  • Math Problem Solving: On GSM8K, which measures the ability to solve multi-step mathematical problems, Orca-2-13B achieved a 59.14% exact match rate. This is particularly notable because GSM8K involves complex reasoning steps that smaller models typically struggle with. Orca 2’s result was comparable to larger models like WizardLM-70B and close to ChatGPT’s performance in similar evaluations, reflecting Orca 2’s sophisticated reasoning abilities.
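For reference, the relative improvements quoted in this list follow the standard formula; the sketch below computes it with placeholder scores rather than the paper's raw numbers.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Relative improvement of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0

# Placeholder values: 48.2 vs. a baseline of 32.7 gives roughly 47.4%,
# the same form of comparison as the figures quoted above.
print(f"{relative_improvement(48.2, 32.7):.2f}%")  # ≈ 47.40%
```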

Text Completion Abilities


In addition to benchmarks measuring advanced reasoning capabilities, Orca 2’s performance in text completion tasks was equally noteworthy. Two prominent benchmarks used to evaluate this ability were HellaSwag and LAMBADA, both of which test models on their ability to complete text in a way that makes logical and contextual sense.


Performance of different models on text completion test sets in the zero-shot setting

  • HellaSwag: On this dataset, which focuses on testing a model's ability to complete narratives or descriptions of physical and social situations, Orca-2-13B performed exceptionally well. It achieved a relative improvement of 33.13% over LLaMA-2-Chat-13B and a 61.94% improvement over WizardLM-13B. Orca 2’s performance on HellaSwag was competitive even when compared to much larger models like LLaMA-2-Chat-70B. This indicates that despite its smaller size, Orca-2-13B has a strong grasp of context and sequence in text generation tasks.

  • LAMBADA: This dataset measures a model’s ability to predict the final word of a passage based on a long-range context. While LAMBADA is particularly challenging for smaller models, Orca-2-13B delivered solid performance, scoring 63.69%, which placed it ahead of both WizardLM-13B and LLaMA-2-Chat-13B. The performance gap between Orca-2-13B and models like LLaMA-2-Chat-70B in this task was also narrower than expected, with Orca-2-13B coming in just a few percentage points lower than the much larger 70B model. This highlights Orca 2’s capability to understand and predict language over extended contexts, a critical skill in many natural language processing tasks.

Grounding Performance


Grounding is a crucial aspect of natural language models, as it ensures that generated content is based on a given context and does not introduce factual inaccuracies. Orca 2’s performance on grounding tasks was assessed through several abstractive summarization benchmarks, which evaluated how well the model could generate summaries or answers based on provided context, and how truthful its output was.


The hallucination rate, evaluated with GPT-4 as discriminator and averaged over three abstractive summarization benchmarks described in section 5 of the paper (lower is better).

  • Grounding in Summarization: In tasks like query-based multi-domain meeting summarization and doctor-patient conversation summarization, Orca-2-13B demonstrated the lowest hallucination rate among models of its size, with a 76.92% reduction in hallucination compared to LLaMA-2-Chat-13B (a sketch of the judge-style evaluation behind these figures follows this list). Hallucinations are instances where the model generates information that is not supported by the given context, and Orca 2’s ability to avoid this common pitfall demonstrates its proficiency in generating contextually accurate content. When the "cautious system message" was applied, however, the hallucination rate increased slightly, as Orca 2 tended to extrapolate beyond the provided context. While these extrapolations were often factually accurate, they did not strictly adhere to the information given, which counts as ungrounded content. This suggests that while the cautious message boosts certain reasoning capabilities, it can also lead to more creative but less contextually constrained outputs.

  • Truthfulness: On the TruthfulQA benchmark, which tests a model’s ability to provide factually accurate answers to tricky questions, Orca-2-13B performed significantly better than WizardLM and LLaMA-2-Chat-13B. It achieved 54.39% accuracy, closely rivaling larger models such as WizardLM-70B. This benchmark is particularly challenging because it focuses on questions that are commonly answered incorrectly due to biases or misconceptions, making Orca 2’s high accuracy a strong indicator of its grounding capabilities.
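As a rough illustration of the judge-style evaluation referenced above, the sketch below asks GPT-4 whether a summary is fully supported by its source. The prompt wording and the binary YES/NO scoring are assumptions; the paper's actual discriminator prompts may differ.

```python
# Hedged sketch of a GPT-4-as-judge groundedness check; prompt wording and scoring are assumed.
from openai import OpenAI

client = OpenAI()

def is_grounded(source_text: str, summary: str) -> bool:
    """Ask a judge model whether every claim in the summary is supported by the source."""
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Source:\n" + source_text + "\n\nSummary:\n" + summary +
                "\n\nDoes the summary contain any claim that is not supported by the source? "
                "Answer YES or NO."
            ),
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("NO")

# hallucination_rate = 1 - mean(is_grounded(doc, model_summary) for doc, model_summary in eval_pairs)
```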


Performance of different models on the TruthfulQA benchmark. Accuracy is reported as the percentage of times the model generated the correct answer to the given multiple-choice questions.

The Importance of Reasoning in AI


The success of Orca 2 highlights a crucial aspect of AI development: the importance of teaching models not just what to think, but how to think. Most language models, especially smaller ones, are typically trained to regurgitate facts or patterns they’ve learned during their training. However, reasoning is a more complex skill that requires a model to actively engage with the problem at hand, break it down into smaller steps, and work through those steps to arrive at a solution.


Incorporating reasoning into a model’s training process can dramatically improve its performance, as evidenced by the results of Orca 2. By allowing Llama 2 to learn the reasoning strategies that worked best for it, the researchers enabled the student model to reach levels of accuracy that would have been impossible otherwise. This approach also reduces the need for users to specify reasoning strategies when interacting with the model. Instead of having to instruct the model to "think step by step" or "explain then answer," users can simply input a prompt, and the model will determine the best way to reason through the problem on its own.


The Broader Implications for AI Development


Orca 2’s success points to a future in which smaller, more efficient models can perform complex reasoning tasks with near-human accuracy. This has enormous implications for the deployment of AI in real-world applications. Currently, the cost of training and running massive models like GPT-4 limits their use to well-funded organizations or cloud-based solutions. However, if smaller models can achieve similar results at a fraction of the computational cost, AI can become more accessible to a wider range of industries and individuals.


Moreover, the approach used in Orca 2 could lead to new innovations in AI training techniques. Instead of focusing on making models bigger and more data-intensive, researchers could explore ways to improve reasoning processes, leading to smarter and more adaptable models. This could also have implications for areas like education, healthcare, and decision-making, where reasoning and problem-solving are essential.


Conclusion


The teacher-student paradigm has long been a staple of machine learning, but Orca 2’s innovative approach adds a new layer of sophistication to this concept. By focusing not just on the output of the teacher model but also on the reasoning strategies that lead to the best performance, Orca 2 has demonstrated that smaller models can punch above their weight class. As AI continues to evolve, techniques like Orca 2 will be crucial in ensuring that these systems are not just large and powerful, but also efficient, adaptable, and capable of reasoning like humans.


References


[1] Mitra, A., Del Corro, L., Mahajan, S., Codas, A., Simoes, C., Agarwal, S., ... & Awadallah, A. (2023). Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045.


[2] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., ... & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.


[3] Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., ... & Duan, N. (2023). AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364.


[4] Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., ... & Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.


[5] Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge distillation: A survey. International Journal of Computer Vision, 129(6), 1789-1819.
