Introducing Falcon: A Powerful Language Model Family for Advanced Natural Language Processing
In the realm of Natural Language Processing (NLP), the Falcon family of language models has emerged as a force to be reckoned with. Comprising two base models, Falcon-40B and Falcon-7B, the family has made a significant impact on a wide range of NLP tasks. In this blog post, we’ll explore the key features, training data, technical considerations, and evaluation results of the Falcon models.
The Falcon Models
The Falcon family consists of Falcon-40B and its smaller counterpart, Falcon-7B. Falcon-40B, with 40 billion parameters, currently dominates the Open LLM Leaderboard, while Falcon-7B is hailed as the best model in its weight class. Falcon-40B requires approximately 90GB of GPU memory, which, although substantial, is still less than other models like LLaMA-65B that Falcon outperforms. On the other hand, Falcon-7B only requires around 15GB of GPU memory, making it accessible even on consumer hardware.
Instruct Versions and Training Data
TII (the Technology Innovation Institute) has also released instruct versions of the Falcon models, Falcon-7B-Instruct and Falcon-40B-Instruct. These experimental variants have been fine-tuned on instructions and conversational data, making them well-suited for popular assistant-style tasks. Additionally, TII has publicly released a 600-billion-token extract of RefinedWeb, the high-quality web dataset predominantly used to train the Falcon models. This release empowers the community to leverage RefinedWeb for their own large language models (LLMs).
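To get a feel for the data, you can stream a few RefinedWeb samples without downloading the whole extract. The sketch below assumes the release is hosted on the Hugging Face Hub under the tiiuae/falcon-refinedweb repository and that the text lives in a "content" column; adjust the names if the actual dataset differs.

```python
# Sketch: stream a few RefinedWeb samples via the datasets library.
# Assumes the extract is published on the Hub as "tiiuae/falcon-refinedweb".
from datasets import load_dataset

refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, sample in enumerate(refinedweb):
    # "content" is assumed to be the column holding the raw web text
    print(sample["content"][:200])
    if i >= 2:
        break
```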
Multiquery Attention and Improved Scalability
One intriguing feature of the Falcon models is their use of multiquery attention, where one key and value are shared across all attention heads. While this doesn’t significantly impact pretraining, it greatly enhances inference scalability. The shared key and value approach reduces memory costs, enables optimizations such as statefulness, and maintains impressive performance.
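To make the shared key/value idea concrete, here is a minimal PyTorch sketch of multiquery attention. It is illustrative only, not Falcon's actual implementation: the projection layout, dimensions, and variable names are assumptions. The point to notice is that the key/value tensors carry a single head that is broadcast across all query heads, so the KV cache shrinks by a factor of the head count during inference.

```python
# Minimal multiquery attention sketch (illustrative, not Falcon's real code):
# per-head queries, but a single shared key/value broadcast over all heads.
import torch
import torch.nn.functional as F

batch, seq, n_heads, head_dim = 2, 16, 8, 64
hidden = n_heads * head_dim

x = torch.randn(batch, seq, hidden)
w_q = torch.nn.Linear(hidden, n_heads * head_dim, bias=False)   # one query per head
w_kv = torch.nn.Linear(hidden, 2 * head_dim, bias=False)        # one key + one value, total

q = w_q(x).view(batch, seq, n_heads, head_dim).transpose(1, 2)  # [B, H, S, D]
k, v = w_kv(x).split(head_dim, dim=-1)                          # each [B, S, D]
k = k.unsqueeze(1)                                              # [B, 1, S, D], broadcast over heads
v = v.unsqueeze(1)

scores = q @ k.transpose(-2, -1) / head_dim**0.5                # [B, H, S, S]
attn = F.softmax(scores, dim=-1) @ v                            # [B, H, S, D]
out = attn.transpose(1, 2).reshape(batch, seq, hidden)
print(out.shape)  # torch.Size([2, 16, 512])
```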
Technical Considerations for Running Falcon Models
To run the Falcon models on your own hardware, it’s recommended to use the bfloat16 datatype, as the models were trained with it. Additionally, you need to allow remote code execution, since the Falcon models use a custom architecture not yet integrated into transformers. Detailed instructions and the necessary files are provided by the model authors in the repository. The transformers pipeline API simplifies model loading and enables straightforward text generation.
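The snippet below is a minimal sketch of that setup for Falcon-7B-Instruct: it loads the model through the pipeline API in bfloat16 and opts in to the repository's custom code with trust_remote_code. The model id is the official tiiuae/falcon-7b-instruct checkpoint; the prompt and generation settings are just examples.

```python
# Sketch: text generation with Falcon-7B-Instruct via the transformers pipeline API,
# using bfloat16 and allowing the repo's custom model code to run.
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,   # the models were trained in bfloat16
    trust_remote_code=True,       # Falcon ships its architecture as custom code
    device_map="auto",
)

output = generator(
    "Write a short poem about falcons.",
    max_new_tokens=100,
    do_sample=True,
    top_k=10,
)
print(output[0]["generated_text"])
```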
Inference and Deployment
Inference with Falcon-40B can be challenging due to its size. However, loading it in 8-bit mode makes it feasible to run on GPUs with sufficient memory. Alternatively, 4-bit loading can be used with the latest versions of bitsandbytes, transformers, and accelerate, further reducing memory requirements. Hugging Face's Text Generation Inference offers a production-ready inference container for deploying Falcon models. It supports Falcon-7B and Falcon-40B natively, providing optimized transformers code and efficient tensor parallelism.
Evaluation and Performance
While an in-depth evaluation of the Falcon models by their authors is forthcoming, an initial assessment was conducted using the Open LLM Leaderboard benchmarks. The Falcon-40B base and instruct models showed exceptional performance, ranking first and second on the leaderboard across reasoning and truthfulness tasks such as the AI2 Reasoning Challenge (ARC), HellaSwag, MMLU, and TruthfulQA.
Conclusion
The Falcon family of language models, including Falcon-40B and Falcon-7B, has made significant strides in the field of Natural Language Processing. With their high-quality training data, efficient multiquery attention, and accessibility considerations, these models offer exciting opportunities for advanced NLP tasks. Researchers and practitioners can leverage the power of Falcon models to tackle complex language understanding and generation challenges.