Microsoft released Phi-2, a 2.7 billion parameter language model that demonstrates outstanding reasoning and language understanding capabilities, achieving state-of-the-art performance among base language models with fewer than 13 billion parameters. On complex benchmarks, Phi-2 matches or outperforms models roughly 25 times its size, thanks to innovations in model scaling and training data curation.

Given its compact size, Phi-2 serves as an ideal platform for researchers to explore mechanistic interpretability, safety improvements, or fine-tuning experiments across various tasks. Microsoft has included Phi-2 in the Azure AI Studio model catalog to facilitate language model research and development.

Key Technical Innovations

Scaling language models to hundreds of billions of parameters has unlocked a range of emergent capabilities, reshaping the natural language processing landscape. A fundamental question remains: can these emergent capabilities be achieved at smaller scale through strategic training choices like data selection?

The Phi model series aims to answer this question by training small language models (SLMs) that rival much larger models in performance (though still far from frontier models). Two key insights emerged through Phi-2:

First, training data quality plays a crucial role in model performance. This idea has existed for decades, but the Phi team took it to the extreme by focusing on “textbook-quality” data, continuing previous work from “Textbooks Are All You Need.” The full training data mix includes synthetic datasets specifically designed to teach the model common-sense reasoning and general knowledge, covering science, everyday activities, and theory of mind. The team also augmented the training corpus with carefully selected web data filtered by educational value and content quality.
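The exact filtering pipeline has not been published, but the idea of selecting web data by educational value can be illustrated with a toy sketch. Here `educational_value` is a hypothetical stand-in for a learned quality classifier; only the thresholding pattern reflects the approach described above.

```python
# Toy sketch of quality-based web data filtering. The scoring function and
# threshold are illustrative assumptions, not the actual Phi-2 pipeline.

def educational_value(doc: str) -> float:
    """Hypothetical stand-in for a trained quality classifier.

    This toy version rewards longer, vocabulary-rich text; a real
    pipeline would use a model's score instead.
    """
    words = doc.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words) * min(len(words) / 50, 1.0)


def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if educational_value(d) >= threshold]
```

In a real pipeline the classifier would be trained on labeled examples of high-value text, and the threshold tuned against downstream benchmark performance.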

Second, innovative scaling techniques were used. Starting from the 1.3B parameter Phi-1.5, the team embedded its knowledge into the 2.7B parameter Phi-2. This scaled knowledge transfer not only accelerated training convergence but also produced clear improvements in Phi-2’s benchmark scores.
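Microsoft has not published the exact transfer method, but one common way to seed a larger model from a smaller one is to copy the smaller model's weight matrices into a block of the larger matrices and freshly initialize the rest. The sketch below shows that pattern on a single weight matrix; it is a guess at the general idea, not the actual Phi-2 procedure.

```python
import random

def transfer_weights(small: list[list[float]], big_rows: int, big_cols: int,
                     init_scale: float = 0.02) -> list[list[float]]:
    """Embed a smaller weight matrix into a larger one.

    Illustrative assumption only: the small matrix occupies the top-left
    block of the larger matrix, and the remaining entries get small
    random values, so the larger model starts from the smaller model's
    learned parameters rather than from scratch.
    """
    rows, cols = len(small), len(small[0])
    big = [[random.gauss(0.0, init_scale) for _ in range(big_cols)]
           for _ in range(big_rows)]
    for i in range(rows):
        for j in range(cols):
            big[i][j] = small[i][j]
    return big
```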

Training Details

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens through multiple passes over a mix of synthetic and web datasets for NLP and coding. Training took 14 days on 96 A100 GPUs. Phi-2 is a base model: it was not aligned through reinforcement learning from human feedback (RLHF) or instruction fine-tuned. Despite this, Phi-2 exhibits better behavior with respect to toxicity and bias than aligned open-source models. This is consistent with findings from Phi-1.5 and is attributed to the tailored data curation techniques.
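The next-word prediction objective is standard cross-entropy over each next token. A minimal sketch of that per-sequence loss (pure Python, with toy logits standing in for the model's output):

```python
import math

def next_token_loss(logits: list[list[float]], tokens: list[int]) -> float:
    """Average cross-entropy of predicting token t+1 from position t.

    `logits[t]` holds unnormalized vocabulary scores produced after
    reading tokens[0..t]; the target at position t is tokens[t+1].
    Uses the max-subtraction trick for numerically stable log-sum-exp.
    """
    total = 0.0
    steps = len(tokens) - 1
    for t in range(steps):
        row = logits[t]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tokens[t + 1]]
    return total / steps
```

With uniform logits over a vocabulary of size V, the loss is log(V), the entropy of a uniform guess; training drives it below that baseline.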

Evaluation

Phi-2’s performance was evaluated on academic benchmarks compared to popular language models. Benchmarks span multiple categories including Big Bench Hard (BBH, 3-shot with CoT), common-sense reasoning (PIQA, WinoGrande, ARC Easy and Challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU 5-shot, SQuADv2 2-shot, BoolQ), math (GSM8k 8-shot), and coding (HumanEval, MBPP 3-shot).
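Few-shot evaluation works by prepending worked examples to the test question. The helper below sketches how such a prompt might be assembled (e.g., 8-shot for GSM8k); the exact prompt format of the evaluation harness is an assumption.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]],
                          question: str, k: int = 8) -> str:
    """Assemble a k-shot prompt: k worked (question, answer) pairs
    followed by the new question, left for the model to complete.
    The "Question:/Answer:" template is illustrative, not the exact
    format used in the reported evaluations.
    """
    parts = [f"Question: {q}\nAnswer: {a}\n" for q, a in examples[:k]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)
```

For chain-of-thought variants (such as BBH 3-shot with CoT), each example answer would include the intermediate reasoning steps rather than only the final result.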

With only 2.7B parameters, Phi-2 surpasses the 7B Mistral model and the 7B and 13B Llama-2 models across various comprehensive benchmarks. Notably, on multi-step reasoning tasks like coding and math, Phi-2 outperforms the 25x larger Llama-2-70B model. Phi-2 also matches or exceeds the recently announced Google Gemini Nano 2, despite its smaller size.

Of course, model evaluation faces challenges — many public benchmarks may leak into training data. An exhaustive decontamination study was conducted for Phi-1 to rule this out, detailed in “Textbooks Are All You Need.” In the same spirit, Phi-2 was also evaluated on several Microsoft internal proprietary datasets and tasks, again compared with Mistral and Llama-2. Similar trends were observed: Phi-2 outperforms Mistral-7B on average, which in turn outperforms Llama-2 models (7B, 13B, and 70B).
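A common form of contamination check is to flag training documents that share long word-level n-grams with benchmark items. The sketch below shows that idea; the 13-gram window is a typical choice, not necessarily the one used in the Phi decontamination study.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams of a text (13 words is a common window
    size for contamination checks, assumed here for illustration)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(train_doc: str, benchmark_item: str, n: int = 13) -> bool:
    """Flag a training document sharing any n-gram with a benchmark item."""
    return bool(ngrams(train_doc, n) & ngrams(benchmark_item, n))
```

Exact n-gram matching catches verbatim leaks only; a thorough study would also look for paraphrased overlap, for example via embedding similarity.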

Beyond these benchmarks, Microsoft conducted extensive testing with prompts commonly used by the research community.