The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
Large Language Models (LLMs) have been praised for their ability to solve novel tasks by reasoning step-by-step, a process known as Chain-of-Thought (CoT) reasoning. However, these impressive capabilities have primarily been demonstrated in models exceeding 100 billion parameters. This raises an important question: how can we imbue language models with fewer than 100 billion parameters with the ability to reason step-by-step on unseen tasks?
In an attempt to answer this question, a recent paper introduces the CoT Collection, a groundbreaking instruction-tuning dataset comprising 1.88 million CoT rationales across 1,060 tasks. Using the CoT Collection, the authors show that continual fine-tuning of Flan-T5 models at two scales, 3 billion and 11 billion parameters, significantly improves their CoT reasoning on previously unseen tasks.
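To make this recipe concrete, here is a minimal sketch of what such continual fine-tuning could look like with Hugging Face Transformers. The dataset identifier, the column names ("source", "rationale", "target"), and the "[ANSWER]" separator are illustrative assumptions, not the authors' exact setup; consult the official release for the real schema.

```python
# A minimal sketch of CoT fine-tuning, assuming a hypothetical dataset schema.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/flan-t5-xl"  # the 3B variant; flan-t5-xxl is the 11B one
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical dataset identifier and column names for the CoT Collection.
dataset = load_dataset("kaist-ai/CoT-Collection", split="train")

def preprocess(example):
    # Train the model to emit the rationale before the final answer, so that
    # step-by-step reasoning becomes part of the supervised target.
    inputs = tokenizer(example["source"], truncation=True, max_length=1024)
    target = example["rationale"] + " [ANSWER] " + example["target"]
    labels = tokenizer(text_target=target, truncation=True, max_length=512)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="cot-t5",
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```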
Improving Zero-shot Accuracy
The researchers evaluated the impact of fine-tuning Flan-T5 models with the CoT Collection on the 27 datasets of the BIG-Bench-Hard benchmark. Fine-tuning produced a marked improvement in average zero-shot accuracy: the 3B Flan-T5 model gained +4.34%, while the 11B Flan-T5 model gained +2.44%.
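As a rough picture of how such an evaluation works, here is a minimal zero-shot scoring loop. The two in-line boolean-expression items merely stand in for a BIG-Bench-Hard subset, and the answer-matching heuristic is deliberately simple; a real harness parses the generated rationale and extracts the answer span more robustly.

```python
# A minimal sketch of a zero-shot accuracy loop over an illustrative task.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# Tiny stand-in for a BBH boolean-expressions subset: (question, answer) pairs.
examples = [
    ("not ( True ) and ( True ) is", "False"),
    ("True and not not ( not False ) is", "True"),
]

correct = 0
for question, answer in examples:
    # Zero-shot: no demonstrations, just the question plus a CoT trigger.
    prompt = f"{question}\nLet's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    prediction = tokenizer.decode(output[0], skip_special_tokens=True)
    # Naive scoring: check the end of the output for the gold answer.
    correct += int(prediction.strip().endswith(answer))

print(f"zero-shot accuracy: {correct / len(examples):.2%}")
```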
To illustrate the effectiveness of the CoT Collection, consider a task that requires a language model to generate step-by-step instructions for assembling a complex piece of furniture. Before fine-tuning with the CoT Collection, the Flan-T5 models may struggle to produce coherent instructions for this unseen task. After fine-tuning, their enhanced reasoning capabilities let them lay out detailed, correctly ordered steps without any task-specific training, which is precisely what the zero-shot gains reflect.
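In code, prompting the fine-tuned model for this kind of task looks something like the sketch below. The "cot-t5" path is the hypothetical checkpoint saved by the fine-tuning sketch above, and the prompt wording is purely illustrative.

```python
# A sketch of zero-shot CoT generation with a fine-tuned checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("cot-t5")  # hypothetical checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("cot-t5")

prompt = (
    "Explain, step by step, how to assemble a flat-pack bookshelf "
    "given its parts list and hardware bag.\nLet's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```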
Strengthening Few-shot Learning
In addition to enhancing zero-shot accuracy, instruction tuning with the CoT Collection also empowers LLMs with stronger few-shot learning capabilities. The researchers conducted experiments to assess the performance of the fine-tuned models on domain-specific tasks. These tasks are specifically designed to test the models’ ability to generalize and adapt to new tasks with minimal training examples.
The results were impressive. The fine-tuned 3B Flan-T5 model achieved a +2.97% improvement, while the fine-tuned 11B Flan-T5 model exhibited a gain of +2.37% on these domain-specific tasks when compared to their respective base models. This significant boost in few-shot learning capabilities highlights the potential of the CoT Collection in equipping LLMs with adaptability and versatility.
To illustrate the impact of instruction tuning with the CoT Collection on few-shot learning, let’s consider another example. Suppose we have a domain-specific task involving sentiment analysis of customer reviews for a particular product. Before fine-tuning, the Flan-T5 models may struggle to accurately predict the sentiment of reviews in this domain. However, after undergoing instruction tuning using the CoT Collection, the models showcase improved few-shot learning, enabling them to make more accurate sentiment predictions even with limited training examples.
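Concretely, few-shot usage means packing a handful of labeled demonstrations into the prompt itself before the query. The sketch below shows this pattern for the sentiment example; the reviews, labels, and "cot-t5" checkpoint path are all hypothetical.

```python
# A sketch of few-shot sentiment classification via in-context demonstrations.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("cot-t5")  # hypothetical checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("cot-t5")

demonstrations = [
    ("The battery died within a week.", "negative"),
    ("Setup took two minutes and it works flawlessly.", "positive"),
]
query = "The screen is sharp, but the speakers crackle at high volume."

# Build the prompt: labeled examples first, then the unlabeled query.
prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for review, label in demonstrations:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```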
Conclusion
The CoT Collection presents a groundbreaking approach to enhancing the zero-shot and few-shot learning capabilities of LLMs with fewer than 100 billion parameters. Continual fine-tuning of Flan-T5 models on this instruction-tuning dataset yields significant improvements in CoT reasoning on unseen tasks: average zero-shot accuracy rose considerably across a broad set of datasets, and the models demonstrated enhanced few-shot learning capabilities on domain-specific tasks.