The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

Anote
3 min read · May 27, 2023


Large Language Models (LLMs) have been praised for their ability to solve novel tasks by reasoning step-by-step, a process known as Chain-of-Thought (CoT) reasoning. However, these impressive capabilities have primarily been demonstrated in LLMs with more than 100 billion parameters. This raises an important question: how can we imbue LLMs that have fewer than 100 billion parameters with the ability to reason step-by-step on unseen tasks?

In an attempt to answer this question, a recent paper introduces the CoT Collection, a groundbreaking instruction-tuning dataset. This dataset comprises 1.88 million CoT rationales spread across 1,060 tasks. By utilizing the CoT Collection, the authors demonstrate how continual fine-tuning of Flan-T5 models, with 3 billion and 11 billion parameters respectively, enables these LLMs to significantly improve their CoT reasoning capabilities on previously unseen tasks.
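To make the idea of instruction tuning on CoT rationales concrete, here is a minimal sketch of how one annotated record might be turned into a (prompt, target) pair for fine-tuning. The field names (`instruction`, `rationale`, `answer`) and the prompt format are illustrative assumptions, not the CoT Collection's actual schema:

```python
# Sketch: turning a CoT-annotated record into a (prompt, target) pair
# for instruction tuning. Field names and format are illustrative,
# not the dataset's actual schema.

def to_training_pair(record):
    """Build an input prompt and a CoT-style target string."""
    prompt = f"{record['instruction']}\n\nLet's think step by step."
    target = (
        f"{record['rationale']}\n"
        f"Therefore, the answer is {record['answer']}."
    )
    return prompt, target

example = {
    "instruction": "If a train travels 60 miles in 1.5 hours, what is its average speed?",
    "rationale": "Speed is distance divided by time: 60 / 1.5 = 40 miles per hour.",
    "answer": "40 mph",
}

prompt, target = to_training_pair(example)
```

Training the model to produce the rationale *before* the final answer is what distinguishes CoT fine-tuning from standard instruction tuning on (input, answer) pairs.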

Improving Zero-shot Accuracy

The researchers conducted experiments to evaluate the impact of fine-tuning Flan-T5 models using the CoT Collection. The results showed a remarkable improvement in the average zero-shot accuracy on 27 datasets from the BIG-Bench-Hard benchmark. Specifically, the 3B Flan-T5 model achieved an increase of +4.34% in zero-shot accuracy, while the 11B Flan-T5 model exhibited a gain of +2.44%.
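Reported gains like these are typically macro-averages: the mean of per-dataset accuracies, with each dataset weighted equally regardless of size. A small sketch of that computation (the task names and numbers below are made up for illustration, not results from the paper):

```python
# Sketch: macro-averaged zero-shot accuracy over a benchmark suite,
# i.e. the mean of per-dataset accuracies with each dataset weighted
# equally. Task names and accuracies here are illustrative only.

def macro_average(per_dataset_accuracy):
    """Mean of per-dataset accuracies (equal weight per dataset)."""
    return sum(per_dataset_accuracy.values()) / len(per_dataset_accuracy)

base = {"task_a": 0.40, "task_b": 0.55, "task_c": 0.31}
tuned = {"task_a": 0.45, "task_b": 0.58, "task_c": 0.36}

# Improvement in accuracy points after fine-tuning
gain = macro_average(tuned) - macro_average(base)
```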

To illustrate the effectiveness of the CoT Collection, let's consider an example. Suppose a task requires a language model to generate step-by-step instructions for assembling a complex piece of furniture. Before fine-tuning with the CoT Collection, the Flan-T5 models may struggle to produce coherent, accurate instructions for this unseen task. After fine-tuning, the models reason through the task more reliably, yielding detailed and accurate assembly instructions without any task-specific examples.

Strengthening Few-shot Learning

In addition to enhancing zero-shot accuracy, instruction tuning with the CoT Collection also empowers LLMs with stronger few-shot learning capabilities. The researchers conducted experiments to assess the performance of the fine-tuned models on domain-specific tasks. These tasks are specifically designed to test the models’ ability to generalize and adapt to new tasks with minimal training examples.

The results were impressive. The fine-tuned 3B Flan-T5 model achieved a +2.97% improvement, while the fine-tuned 11B Flan-T5 model exhibited a gain of +2.37% on these domain-specific tasks when compared to their respective base models. This significant boost in few-shot learning capabilities highlights the potential of the CoT Collection in equipping LLMs with adaptability and versatility.

To illustrate the impact of instruction tuning with the CoT Collection on few-shot learning, let’s consider another example. Suppose we have a domain-specific task involving sentiment analysis of customer reviews for a particular product. Before fine-tuning, the Flan-T5 models may struggle to accurately predict the sentiment of reviews in this domain. However, after undergoing instruction tuning using the CoT Collection, the models showcase improved few-shot learning, enabling them to make more accurate sentiment predictions even with limited training examples.
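Mechanically, a few-shot setup like the sentiment example above is simple prompt assembly: k labeled in-domain demonstrations are prepended to the unlabeled query. The review texts and template below are illustrative, not drawn from the paper's experiments:

```python
# Sketch: assembling a k-shot prompt for sentiment classification.
# Demonstration texts and the template are illustrative only.

def build_few_shot_prompt(examples, query):
    """Concatenate labeled demonstrations followed by the unlabeled query."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

demos = [
    ("The battery lasts all day, fantastic purchase.", "positive"),
    ("Stopped working after two weeks.", "negative"),
]
prompt = build_few_shot_prompt(
    demos, "Arrived quickly but the screen scratches easily."
)
```

The model then completes the final `Sentiment:` slot; with stronger few-shot learning, even two demonstrations like these can be enough to adapt it to the domain.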

Conclusion

The CoT Collection presents a groundbreaking approach to enhancing the zero-shot and few-shot learning capabilities of LLMs with fewer than 100 billion parameters. Continually fine-tuning Flan-T5 models on this instruction-tuning dataset yields significant improvements in CoT reasoning on unseen tasks: average zero-shot accuracy saw considerable gains across a wide range of datasets, and the models demonstrated enhanced few-shot learning on domain-specific tasks.
