Training Flow and Knowledge Retrieval for LLMs

Siddhant Sancheti
5 min read · May 15, 2023


When Alan Turing came up with the Turing Test in 1950, it was a test of a machine’s ability to exhibit intelligent behavior indistinguishable from that of a human. Turing proposed that a computer can be said to possess artificial intelligence (AI) if it can produce human-like responses to questions.

Thanks to large language models, we’re now at the point where computers can write text on just about any subject we give them — and for the most part, it’s very convincing and human-like.

In recent years, language models have revolutionized natural language processing (NLP), enabling advancements in machine translation, question answering, and text generation. However, training such large foundation models is a non-trivial exercise that requires significant computing power and expertise from machine learning and systems experts. A recent research paper, “Training Large Language Models Efficiently with Sparsity and Dataflow,” delves into the intricacies of language model pretraining and proposes novel approaches to enhance training efficiency. In this article, we explore the technical aspects, background, and related research of this topic, shedding light on the exciting possibilities it presents for advancing NLP.

Pun of the day: Training language models is like teaching a parrot to speak fluently, except the parrot can generate novel and grammatically correct sentences on any topic you give it!

Sparsity is a promising technique for reducing the compute requirements of training large language models. In this context, sparsity refers to the intentional introduction of zeros into the weight matrices of neural networks, which reduces the number of floating-point operations required during training.
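To make this concrete, here is a small PyTorch sketch (my own illustration, not code from the paper) that prunes a weight matrix by magnitude and counts the theoretical reduction in multiply-accumulate work for a linear layer:

```python
import torch

def sparsify(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries until `sparsity` fraction is zero.
    (Magnitude pruning is just one common heuristic; the paper's exact scheme may differ.)"""
    k = int(sparsity * weight.numel())                      # number of weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    mask = weight.abs() > threshold
    return weight * mask

torch.manual_seed(0)
d_in, d_out, batch = 4096, 4096, 8
W = torch.randn(d_out, d_in)
W_sparse = sparsify(W, sparsity=0.75)

dense_flops  = 2 * batch * d_in * d_out                     # multiply-adds for x @ W.T
sparse_flops = 2 * batch * int((W_sparse != 0).sum())       # only non-zero weights do work
print(f"non-zeros kept: {(W_sparse != 0).float().mean().item():.2%}")
print(f"theoretical FLOP reduction: {dense_flops / sparse_flops:.1f}x")
```

Note that zeros alone don’t make anything faster on standard dense hardware; the savings only materialize when the kernels or the hardware can actually skip the zeroed weights, which is exactly where the next caveat comes in.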

However, there’s always a BUT!

Sparsity introduces new challenges in training the sparse model to the same quality as its dense counterpart. Furthermore, sparsity lowers the operational intensity and introduces irregular memory access patterns that make it challenging to utilize compute resources efficiently.
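The irregularity is easy to see once the zeros are stored in a compressed format. In the sketch below (again my own illustration, using PyTorch’s COO sparse tensors as a stand-in for whatever format a real system would use), the sparse matrix multiply has to chase stored indices instead of streaming contiguous rows:

```python
import torch

torch.manual_seed(0)
W = torch.randn(1024, 1024)
W[torch.rand_like(W) < 0.9] = 0.0          # ~90% unstructured sparsity

W_coo = W.to_sparse()                      # compressed format: values + index lists
x = torch.randn(1024, 64)

# Dense matmul streams W row by row: contiguous, predictable memory access.
y_dense = W @ x

# Sparse matmul walks the stored (row, col) indices: every access to `x`
# is an index-driven gather, so the access pattern is data dependent.
y_sparse = torch.sparse.mm(W_coo, x)

print("fraction of values stored:", W_coo.values().numel() / W.numel())
print("results match:", torch.allclose(y_dense, y_sparse, atol=1e-4))
```

With only a tenth of the values left, there is far less arithmetic per byte fetched, which is what “dropping the operational intensity” means in practice.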

Harnessing the Power of Dataflow Execution:

The computations in LLMs involve numerous matrix multiplications, activation functions, and other operations, which are highly parallelizable. By organizing these computations in a dataflow fashion, where intermediate results flow through the network as soon as they become available, dataflow execution enables efficient utilization of compute resources and accelerates the training process. One advantage of dataflow execution is its ability to automatically fuse and pipeline operations: operators can be connected to form a pipeline, allowing data to flow seamlessly from one operation to the next. This fusion and pipelining eliminates the need for manual kernel fusion and enables efficient use of the available compute and memory resources.
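As a rough software analogue (not the SambaNova toolchain, just an illustration using PyTorch 2.x’s torch.compile), a graph compiler can fuse the pointwise pieces of a block into neighboring matmuls, much as a dataflow architecture pipelines operators in hardware:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A tiny transformer-style MLP: matmul -> GELU -> matmul."""
    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Run eagerly, each line is (at least) one kernel launch, with the
        # intermediate activations round-tripping through memory.
        h = self.up(x)
        h = torch.nn.functional.gelu(h)
        return self.down(h)

block = Block()
x = torch.randn(8, 128, 1024)

# torch.compile captures the graph and can fuse the pointwise GELU into the
# surrounding matmul epilogues -- a software analogue of connecting operators
# into a pipeline so intermediates never have to hit off-chip memory.
fused_block = torch.compile(block)
y = fused_block(x)
print(y.shape)   # torch.Size([8, 128, 1024])
```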

Moreover, dataflow execution can exploit data locality by keeping intermediate results on-chip, reducing the need for off-chip memory accesses. By minimizing data movement and leveraging the high-bandwidth on-chip communication channels, dataflow execution can further enhance computational efficiency and reduce memory bottlenecks.

KBK (Kernel-by-Kernel): In this execution model, the available compute and memory resources are fully devoted to each individual operation. Operators are executed one at a time, and their results are stored in memory before being read by subsequent operations. This sequential execution lets each operation exploit the parallelism available within it. However, KBK execution suffers from more memory accesses and higher memory bandwidth requirements: since intermediate results are stored in off-chip memory, the frequent exchange of data between operations generates significant memory traffic.

[Figure: white arrows represent traffic to off-chip memory, gray arrows represent on-chip traffic; blue boxes represent on-chip memory, white boxes represent on-chip compute resources.]
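A back-of-the-envelope model makes the difference in off-chip traffic obvious. The numbers below are purely illustrative (the tensor shape and the three-operator chain are my own assumptions, not figures from the paper):

```python
# Toy traffic model: a chain of three element-wise operators applied to one
# activation tensor, in bf16.
BYTES = 2                                     # bytes per bf16 element
batch, seq, d_model = 8, 2048, 4096
tensor_bytes = batch * seq * d_model * BYTES
n_ops = 3

# Kernel-by-kernel: every operator reads its input from off-chip memory
# and writes its output back before the next operator starts.
kbk_traffic = n_ops * 2 * tensor_bytes        # one read + one write per operator

# Dataflow / fused pipeline: read the input once, keep intermediates on-chip,
# write only the final result.
dataflow_traffic = 2 * tensor_bytes           # one read + one write total

print(f"KBK off-chip traffic:      {kbk_traffic / 1e9:.2f} GB")
print(f"Dataflow off-chip traffic: {dataflow_traffic / 1e9:.2f} GB")
print(f"reduction: {kbk_traffic / dataflow_traffic:.0f}x")
```

The longer the chain of fusible operators, the larger the gap grows.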
The Sparsity-Dataflow (S2D) Approach:

The sparsity-dataflow approach combines the benefits of sparsity and dataflow execution to train large language models efficiently. It leverages sparsity techniques to reduce the compute workload, while dataflow execution provides parallelism, memory bandwidth efficiency, and adaptability. This combined approach offers better compute efficiency, better memory bandwidth utilization, and more flexibility than traditional KBK execution.
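Putting the two ideas together in software terms might look something like the sketch below: a linear layer with a fixed pruning mask (random here, purely for illustration), run under a graph compiler so the pointwise work gets fused. Actually skipping the zeroed weights still requires sparsity-aware kernels or hardware like the dataflow accelerator the paper targets.

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Linear):
    """nn.Linear with a fixed pruning mask on its weight (illustrative only)."""
    def __init__(self, d_in, d_out, sparsity=0.75):
        super().__init__(d_in, d_out)
        # Random mask as a stand-in for a real pruning criterion.
        self.register_buffer("mask", (torch.rand_like(self.weight) > sparsity).float())

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

mlp = nn.Sequential(
    SparseLinear(1024, 4096),
    nn.GELU(),
    SparseLinear(4096, 1024),
)

# Compile the sparse model so the pointwise work (masking, GELU) can be fused
# with the matmuls; the zeros themselves only pay off on sparsity-aware backends.
fast_mlp = torch.compile(mlp)
out = fast_mlp(torch.randn(8, 1024))
print(out.shape)    # torch.Size([8, 1024])
```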

Training Methodology: In the paper, the researchers trained a 13-billion-parameter GPT model from scratch on the C4 dataset using the SambaNova RDU (Reconfigurable Dataflow Unit) and evaluated it on a variety of downstream tasks. They showed that the S2D version reaches the same accuracy as the dense model while achieving an end-to-end training speedup of 4.5x over the dense model running on an A100.

Leveraging Knowledge Retrieval in Dataflow Execution:

While dataflow execution enhances compute efficiency, incorporating external knowledge sources through knowledge retrieval techniques can further enrich the understanding and generation capabilities of LLMs. In the remainder of this article, we explore how knowledge retrieval can be integrated with this efficient training stack. Sparsity reduces the compute workload by operating only on non-zero elements, dataflow execution optimizes compute resource utilization, and knowledge retrieval adds a further dimension to S2D execution by providing access to external knowledge sources.
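To make the retrieval side concrete, here is a deliberately tiny, self-contained sketch of retrieval-augmented prompting. It is my own toy illustration with a hashing-trick embedding, not the learned dense retrievers that RAG, REALM, or RETRO actually use:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing-trick embedding -- a stand-in for a learned dense encoder."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# A miniature "external knowledge source" (think a few Wikipedia-style snippets).
corpus = [
    "The Turing Test was proposed by Alan Turing in 1950.",
    "C4 is a large cleaned web-crawl corpus used to pretrain language models.",
    "Dataflow architectures stream intermediate results between operators on-chip.",
]
corpus_vecs = np.stack([embed(p) for p in corpus])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages with the highest cosine similarity to the query."""
    scores = corpus_vecs @ embed(query)
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

query = "What dataset are large language models pretrained on?"
context = retrieve(query)
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)   # this augmented prompt would then be fed to the LLM
```

Real systems swap the toy pieces for a trained encoder, an approximate nearest-neighbor index over millions of passages, and a model trained to attend over the retrieved text, but the flow of retrieve-then-condition is the same.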

Conclusion: Incorporating knowledge retrieval techniques, such as RAG, REALM, and RETRO, into dataflow execution enhances the contextual understanding and generation capabilities of LLMs. Knowledge retrieval mechanisms enable LLMs to access external knowledge sources, such as Wikipedia or text corpora, during training. By retrieving relevant information, LLMs can incorporate this knowledge into their computations, improving the accuracy and contextuality of generated responses or predictions. The integration of knowledge retrieval with dataflow execution facilitates the efficient processing and utilization of retrieved knowledge within the dataflow pipeline.

References:

  1. Training Large Language Models Efficiently with Sparsity and Dataflow
  2. Knowledge Retrieval Architectures for LLMs
