A Novel AI Approach to Improving Language Models: Multi-Token Prediction

Language models are powerful tools that learn patterns from massive datasets to understand and generate human-like text. The traditional method of training them, called next-token prediction, has its limits: it teaches the model to predict only the single next word in a sequence, which can lead to suboptimal performance, especially on more complex tasks.

The researchers behind this study propose a technique called multi-token prediction. Instead of predicting one token (roughly, one word) at a time, this method trains the model to predict several future tokens at once. Think of it this way: when learning a language, you don't just guess one word at a time; you anticipate entire phrases or even sentences. Sounds fascinating, right?

How does multi-token prediction work? The researchers designed a model architecture with a shared trunk that produces a latent representation of the input context. This trunk is connected to multiple independent output heads, each responsible for predicting one of the future tokens. For example, if the model is set to predict four future tokens, it has four output heads working in parallel.
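To make the shape of this architecture concrete, here is a minimal numpy sketch. The sizes, the single linear layer standing in for the transformer trunk, and all variable names are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the paper's).
d_model, vocab, n_heads = 8, 16, 4   # predict 4 future tokens

# Shared trunk: a single linear layer standing in for the
# transformer trunk that encodes the input context.
W_trunk = rng.normal(size=(d_model, d_model))

# One independent output head per future-token offset.
W_heads = [rng.normal(size=(d_model, vocab)) for _ in range(n_heads)]

def forward(x):
    """x: (seq_len, d_model) context embeddings.
    Returns one logit array per head; head k predicts offset k+1."""
    z = np.tanh(x @ W_trunk)          # shared latent representation
    return [z @ W for W in W_heads]   # heads run in parallel off one trunk

x = rng.normal(size=(5, d_model))
logits = forward(x)
print(len(logits), logits[0].shape)   # 4 heads, each (5, 16)
```

The key design point is that the expensive context encoding happens once in the trunk; each head is a comparatively cheap projection on top of the same latent representation.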

During training, the model is fed a corpus of text, and at each position it is tasked with predicting the next N tokens at once. This encourages the model to learn longer-range patterns and dependencies in the data, potentially leading to better performance on tasks that require understanding broader context.
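The training objective can be sketched as a sum of per-head cross-entropies: at position t, head k is scored against the ground-truth token at t+k+1. This is a simplified, assumed form of the loss for a single position (the function names and toy data are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, n_future = 16, 4

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_loss(logits_per_head, tokens, pos):
    """Sum of cross-entropies: head k is scored against the
    ground-truth token at position pos + k + 1."""
    loss = 0.0
    for k, logits in enumerate(logits_per_head):
        target = tokens[pos + k + 1]
        loss += -np.log(softmax(logits)[target])
    return loss

tokens = rng.integers(0, vocab, size=12)       # toy token ids
logits = [rng.normal(size=vocab) for _ in range(n_future)]
print(multi_token_loss(logits, tokens, pos=3))
```

With N = 1 this reduces to the ordinary next-token objective, which is why multi-token prediction can be viewed as a strict generalization of standard training.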

The researchers also addressed a crucial challenge: reducing the GPU memory usage of these multi-token predictors. They sequentially compute the forward and backward pass for each output head, accumulating gradients at the shared trunk. Because only one head's activations need to be held in memory at a time, this reduces peak GPU memory usage and makes it possible to train larger models efficiently.
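The following toy sketch, with gradients written out by hand in numpy, shows the idea: each head's forward and backward pass runs in turn, its contribution is accumulated into the gradient at the trunk output, and its activations are then free to be discarded. The sizes and the cross-entropy-with-target-0 toy gradient are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab, n_heads = 8, 16, 4
z = np.tanh(rng.normal(size=(5, d)))   # trunk output, kept throughout
W_heads = [rng.normal(size=(d, vocab)) for _ in range(n_heads)]

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

grad_z = np.zeros_like(z)              # gradient accumulated at the trunk
for k, W in enumerate(W_heads):
    logits = z @ W                     # forward pass for head k only
    p = softmax(logits)
    p[:, 0] -= 1.0                     # toy cross-entropy grad (target id 0)
    grad_z += p @ W.T                  # backward pass for head k
    # `logits` and `p` go out of scope after this iteration, so only
    # ONE head's activations are live at a time -- the memory saving.

print(grad_z.shape)
```

Running all heads' backward passes together would keep every head's activations alive simultaneously; the sequential schedule trades that for a single accumulated gradient buffer at the trunk.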

The researchers conducted extensive experiments, and the results are promising. They found that multi-token prediction becomes more useful as model size increases. On coding benchmarks such as MBPP and HumanEval, models trained with multi-token prediction outperformed their next-token counterparts, sometimes significantly: the 13B-parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models.

Additionally, the extra output heads can be used to speed up inference via techniques such as speculative decoding. The researchers observed up to a 3x speedup in decoding for their best 4-token prediction model on code and natural language tasks.
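The accept/reject loop at the heart of speculative decoding can be sketched in pure Python. Here the `draft` and `verify_step` functions are deliberately trivial stand-ins (counters modulo 10, with the third draft made wrong on purpose to exercise the rejection path); in the real setting they would be the extra heads and the next-token head of the trained model:

```python
def draft(context):
    """Stand-in for the extra heads: cheaply guess the next 3 tokens.
    The third guess is deliberately wrong to exercise rejection."""
    return [(context[-1] + k + 1 + (1 if k == 2 else 0)) % 10
            for k in range(3)]

def verify_step(context):
    """Stand-in for the exact next-token head (greedy)."""
    return (context[-1] + 1) % 10

def speculative_decode(context, steps):
    out, accepted = list(context), 0
    while accepted < steps:
        for g in draft(out):
            if accepted >= steps:
                break
            if verify_step(out) == g:   # main head agrees: accept draft
                out.append(g)
                accepted += 1
            else:                       # disagreement: take the main
                out.append(verify_step(out))   # head's token and re-draft
                accepted += 1
                break
    return out

print(speculative_decode([0], 5))   # -> [0, 1, 2, 3, 4, 5]
```

The output is always what greedy next-token decoding would have produced; the speedup comes from the fact that accepted draft tokens cost far less than full model steps.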

But it’s not just about coding; multi-token prediction also showed promising results on natural language tasks. On summarization benchmarks, models trained with multi-token prediction achieved higher ROUGE scores than the next-token baseline, indicating better text generation capabilities.

The next interesting question to answer is: “Why does it work?”

The researchers offer some insightful explanations as to why multi-token prediction works so well. A key idea is that it mitigates the distributional mismatch between teacher forcing at training time (where the model receives the ground truth for each future token) and autoregressive generation at inference time (where the model generates tokens without guidance).

Multi-token prediction also implicitly assigns higher weight to tokens that represent “choice points” – decisions that significantly influence the rest of the text. By reinforcing these critical decision points during training, the model learns to make better choices, resulting in more coherent and useful text generation. An information-theoretic analysis further suggests that multi-token prediction encourages the model to focus on tokens that are highly relevant to subsequent text, potentially capturing longer-range dependencies more effectively.

Although the results are promising, researchers acknowledge that there is still room for improvement. An area for future investigation is the automatic determination of the optimal value of N (the number of future tokens to predict) based on the task and data distribution. Furthermore, they suggest that adjusting vocabulary size and exploring alternative auxiliary prediction losses could lead to even better trade-offs between compressed sequence length and computational efficiency. Overall, this research opens up exciting possibilities for improving the capabilities of language models and paves the way for more powerful and efficient natural language processing systems.




Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his bachelor’s degree at the Indian Institute of Technology (IIT) Kanpur. A machine learning enthusiast, he is passionate about research and the latest advances in deep learning, computer vision, and related areas.
