Dilated Attention Is All You Need: Microsoft’s LONGNET Scales to 1 Billion Tokens
Introducing LONGNET, a Transformer variant capable of processing over 1 billion tokens, revolutionizing large language models.
In the era of large language models, scaling sequence length has become a critical demand. However, existing methods struggle with either computational complexity or model expressivity, which limits the maximum sequence length. Addressing this issue, researchers from Microsoft have introduced LONGNET, a Transformer variant that scales sequence length to more than 1 billion tokens without compromising performance on shorter sequences. The key to LONGNET’s success lies in its novel component, dilated attention, which expands the attentive field exponentially as the distance between tokens grows. Let’s explore the details of this remarkable research.
The Challenge of Scaling Sequence Length
Scaling sequence length offers several advantages to language models. Firstly, it provides a larger memory and a wider receptive field, enabling models to interact more effectively with humans and the world. Secondly, longer sequences contain more complex causality and reasoning paths that models can exploit during training, whereas short dependencies tend to introduce spurious correlations that hinder generalization. Lastly, longer sequences push the boundaries of in-context learning, potentially enabling many-shot learning and helping models alleviate catastrophic forgetting.
The major challenge in scaling sequence length is striking the right balance between computational complexity and model expressivity. RNN-style models can handle longer sequences, but their sequential nature limits parallelization during training. State space models offer an alternative but are held back by limited model expressivity. Various approaches have been explored to reduce the complexity of Transformers, such as sliding windows or convolution modules over attention; however, these sacrifice the ability to recall early tokens, which hurts performance. Sparse attention, which sparsifies the attention matrix, retains the ability to recall distant information. Still, none of these methods have successfully scaled to 1 billion tokens, creating the need for a novel solution.
Introducing LONGNET and Dilated Attention
LONGNET, the proposed solution, replaces the attention mechanism of vanilla Transformers with dilated attention. Its design principle is that attention allocation should decrease exponentially as the distance between tokens grows, which resolves the tension between limited attention resources and the need to keep every token accessible. Dilated attention splits the input into segments and then sparsifies each segment along the sequence dimension at a specified interval. Attention is computed over the sparsified segments in parallel, and the results are scattered back and concatenated to produce the final output. Importantly, dilated attention integrates seamlessly with existing Transformer-based optimization techniques, making it a drop-in replacement for standard attention.
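To make the segment-and-sparsify idea concrete, here is a minimal single-head PyTorch sketch. The function name, segment length w, and dilation rate r are illustrative assumptions, not the authors’ official implementation; see the code link at the end of this article for the real thing.

```python
# A minimal, single-head sketch of dilated attention (illustrative only).
# Segment length `w`, dilation `r`, and the function name are assumptions,
# not the official LONGNET implementation.
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, w=8, r=2):
    """q, k, v: (seq_len, d); seq_len divisible by w, and w divisible by r."""
    n, d = q.shape
    out = torch.zeros_like(v)
    for start in range(0, n, w):                     # split the input into segments of length w
        idx = torch.arange(start, start + w, r)      # sparsify the segment at interval r
        qs, ks, vs = q[idx], k[idx], v[idx]
        scores = (qs @ ks.T) / d ** 0.5              # attention within the sparsified segment
        out[idx] = F.softmax(scores, dim=-1) @ vs    # scatter results back into place
    return out

# Toy usage: 32 tokens with 16-dimensional embeddings.
q = k = v = torch.randn(32, 16)
print(dilated_attention(q, k, v).shape)  # torch.Size([32, 16])
```

In the full method, multiple segment/dilation pairs with geometrically increasing values are combined (weighted by their softmax denominators), and different heads use shifted offsets, so no position goes unattended the way it does in this simplified sketch.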
Advantages and Applications of LONGNET
LONGNET offers several significant advantages. Firstly, it has linear computational complexity in sequence length and a logarithmic dependency path between any two tokens, ensuring efficient processing of long sequences. Secondly, LONGNET can serve as a distributed trainer, parallelizing training on extremely long sequences across multiple devices. This breakthrough opens up new possibilities for modeling very long sequences, such as treating an entire corpus or even the entire Internet as a single sequence. The in-context learning capability of LONGNET makes it suitable for applications such as medical research, where it could help review vast amounts of literature and highlight the most relevant papers.
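As a rough illustration of why the complexity is linear rather than quadratic, the sketch below compares back-of-the-envelope FLOP estimates for vanilla attention and a single dilated-attention pass. The formulas, segment length, and dilation rate are simplifying assumptions for intuition, not figures from the paper.

```python
# Back-of-the-envelope FLOP estimates (single head; projections and softmax ignored).
# The segment length w and dilation r below are illustrative assumptions.

def vanilla_flops(n, d):
    # QK^T and AV each cost roughly n^2 * d multiply-adds: quadratic in n.
    return 2 * n ** 2 * d

def dilated_flops(n, d, w, r):
    # n / w segments, each attending over w / r tokens: ~ 2 * n * w * d / r^2,
    # which is linear in n for fixed w and r.
    return (n // w) * 2 * (w // r) ** 2 * d

d = 64
for n in (4_096, 65_536, 1_048_576):
    print(f"n={n:>9,}  vanilla={vanilla_flops(n, d):.2e}  dilated={dilated_flops(n, d, w=2048, r=16):.2e}")
```

In LONGNET itself, several geometrically growing segment/dilation pairs are summed, which reportedly keeps the total cost linear in sequence length while still covering both local and global context.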
Incredible Future Prospects
LONGNET’s ability to scale sequence length to over 1 billion tokens while maintaining strong performance on shorter sequences marks a significant milestone in language modeling. By introducing dilated attention and leveraging its advantages, LONGNET surpasses the limitations of existing methods. The research showcases the potential for future advancements in large language models. Going forward, the focus will be on further extending LONGNET’s capabilities to support additional tasks, such as multimodal language modeling, BEiT pretraining, and genomic data modeling.
The research paper and code for LONGNET can be found at: https://thegenerality.com/agi/
Paper – https://doi.org/10.48550/arXiv.2307.02486