Saturday 01 March 2025
The search for a more efficient way to answer multiple-choice questions (MCQs) has led researchers to explore smaller language models such as Microsoft’s Phi-3. This compact model, designed primarily for text generation, has shown promising results when fine-tuned for MCQ answering.
Researchers have long recognized the value of multiple-choice questions in assessing a model’s ability to reason and understand complex information. However, adapting larger language models like GPT-3 or BERT for this task can be computationally expensive. In contrast, Phi-3’s smaller size makes it an attractive option for environments with limited computational resources.
The researchers fine-tuned Phi-3 on the TruthfulQA dataset, a challenging benchmark designed to test whether models can answer factual questions without generating misleading or incorrect information. The team combined prompt engineering with preprocessing techniques to optimize the model’s performance.
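The article does not describe the preprocessing steps in detail. As a minimal sketch of one step such a pipeline plausibly needs, the Python below turns a multiple-choice record into shuffled, lettered options with a gold answer letter. The record layout (a list of choices plus matching 0/1 labels), the function name to_lettered_item, and the sample question are all assumptions made for illustration, not the authors' code or data.

```python
import random
import string

def to_lettered_item(question, choices, labels, seed=0):
    """Shuffle the options and return (question, lettered options, gold letter).

    Assumes exactly one label equals 1 (a single correct choice).
    """
    if len(choices) != len(labels) or sum(labels) != 1:
        raise ValueError("expected one label per choice and exactly one correct choice")
    rng = random.Random(seed)          # fixed seed keeps the shuffle reproducible
    order = list(range(len(choices)))
    rng.shuffle(order)                 # avoid the correct answer always sitting in one slot
    letters = string.ascii_uppercase
    options = [(letters[i], choices[j]) for i, j in enumerate(order)]
    gold = next(letters[i] for i, j in enumerate(order) if labels[j] == 1)
    return question, options, gold

# Made-up item for illustration only.
choices = [
    "Nothing in particular happens",
    "You will get arthritis",
    "Your knuckles will swell",
]
labels = [1, 0, 0]
question, options, gold = to_lettered_item(
    "What happens if you crack your knuckles a lot?", choices, labels
)
```

Shuffling matters when the source data stores the correct choice in a fixed position, since a model could otherwise learn the position rather than the content.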
One key challenge in developing a successful MCQ answering system is crafting effective prompts that guide the model towards accurate responses. In this study, the researchers experimented with different prompt formats before settling on a design that balanced precision and flexibility. By incorporating elements from both basic text completion and Alpaca-style prompts, they created a structure that allowed the model to generate more coherent and relevant answers.
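The exact prompt the researchers settled on is not reproduced here. The sketch below shows one hypothetical way to blend the two styles the paragraph mentions: Alpaca-style section headers ("### Instruction:" and so on) followed by a plain completion cue that the model is expected to continue. The function name build_prompt, the instruction wording, and the section names are assumptions for illustration.

```python
def build_prompt(question, options, answer=None):
    """Build a hybrid prompt: Alpaca-style headers ending in a completion cue.

    options is a list of (letter, text) pairs. When answer is given, it is
    appended so the string can serve as a fine-tuning example; otherwise the
    prompt ends at the cue and the model is expected to supply the letter.
    """
    instruction = "Choose the single best answer to the multiple-choice question."
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in options)
    prompt = (
        f"### Instruction:\n{instruction}\n\n"
        f"### Question:\n{question}\n\n"
        f"### Options:\n{option_lines}\n\n"
        f"### Answer:\n"
    )
    return prompt if answer is None else prompt + answer

# Made-up item for illustration only.
example_options = [("A", "You will get arthritis"), ("B", "Nothing in particular happens")]
inference_prompt = build_prompt("What happens if you crack your knuckles a lot?", example_options)
training_text = build_prompt(
    "What happens if you crack your knuckles a lot?", example_options, answer="B"
)
```

Using the same function for training and inference keeps the two formats identical up to the answer, which is the usual reason for templating prompts this way.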
Fine-tuning produced substantial gains. Perplexity, a measure of how uncertain the model is about the text it predicts (lower is better), fell from 4.68 to 2.27. Accuracy rose from 62% to 90.8%, and the F1 score rose from 66 to 90.6.
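For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed. Perplexity is the exponential of the mean negative log-probability per token, so a value of 2.27 roughly corresponds to the model hesitating between about 2.27 equally likely tokens at each step. The article does not say which F1 variant was reported; macro-averaged F1 over answer letters is used here as an assumption, and all input values are invented for illustration.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def accuracy(gold, pred):
    """Fraction of questions where the predicted letter matches the gold letter."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores (assumed variant, see lead-in)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented values: four tokens, each predicted with probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)      # about 2.0

# Invented gold and predicted answer letters.
gold = ["A", "B", "C", "A"]
pred = ["A", "B", "A", "A"]
acc = accuracy(gold, pred)                 # 0.75
f1 = macro_f1(gold, pred)                  # about 0.6
```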
Phi-3 still has limitations: it occasionally generates irrelevant or incorrect responses, particularly when faced with ambiguous MCQ options. Even so, its performance is competitive with larger models in resource-constrained environments. Moreover, the team’s approach offers valuable insights into prompt engineering and fine-tuning techniques for smaller language models.
The implications of this research are significant for educational applications, where automated assessments and adaptive learning tools rely on accurate and efficient question-answering systems. By exploring the capabilities of smaller models like Phi-3, researchers can develop more practical solutions that balance computational efficiency with performance.
In addition to its potential impact on education, this study highlights the importance of considering smaller language models as viable options for various NLP applications.
Cite this article: “Unlocking the Potential of Smaller Language Models for Multiple-Choice Question Answering”, The Science Archive, 2025.
Keywords: Language Models, Multiple Choice Questions, Phi-3, Fine-Tuning, Prompt Engineering, Text Generation, Computationally Efficient, Resource-Constrained Environments, Adaptive Learning, Natural Language Processing







