PyTorch researchers introduce TK-GEMM, an optimized Triton FP8 GEMM (General Matrix-Matrix Multiply) kernel that leverages SplitK parallelization

PyTorch introduced TK-GEMM, an optimized Triton FP8 GEMM kernel, to address the challenge of accelerating FP8 inference for large language models (LLMs) like Llama3 using Triton kernels. Default PyTorch execution often struggles with the overhead of launching multiple kernels on the GPU for each operation in LLMs, leading to inefficient inference. The researchers aim to overcome this limitation by leveraging SplitK parallelization to improve performance for Llama3-70B inference problem sizes on Nvidia H100 GPUs.

Current methods for running LLMs, particularly with FP8 precision, often suffer from inefficiencies in PyTorch execution due to the overhead of launching multiple kernels on the GPU for each operation. The proposed approach uses Triton kernels, which provide custom optimizations for specific hardware such as Nvidia GPUs. By integrating Triton kernels into PyTorch models via torch.compile(), developers can merge multiple operations into a single kernel launch, reducing overhead and significantly improving performance. Additionally, Triton kernels leverage the specialized FP8 tensor cores available on Nvidia GPUs, improving computational efficiency compared to the FP16 cores used by default through PyTorch's cuBLAS library.
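The benefit of fusion can be seen with a minimal pure-Python analogy (not GPU code, and not the actual TK-GEMM implementation): each separate pass over the data stands in for a separate kernel launch, and fusing the operations into one pass removes the intermediate traffic, which is conceptually what torch.compile does when it merges operations into a single Triton kernel.

```python
# Illustrative analogy only: each list traversal plays the role of one
# GPU kernel launch. The function names and operations are made up for
# illustration; they are not part of the PyTorch or Triton APIs.

def unfused(xs):
    tmp = [x * 2 for x in xs]        # "kernel" 1: scale
    tmp = [t + 1 for t in tmp]       # "kernel" 2: shift
    return [max(t, 0) for t in tmp]  # "kernel" 3: ReLU

def fused(xs):
    # One pass: scale, shift, and ReLU in a single traversal,
    # avoiding two intermediate round trips over the data.
    return [max(x * 2 + 1, 0) for x in xs]

print(unfused([-3, 0, 2]))  # [0, 1, 5]
print(fused([-3, 0, 2]))    # [0, 1, 5]
```

On a GPU the savings come from fewer launches and less global-memory traffic rather than fewer Python loops, but the structure of the optimization is the same.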

TK-GEMM leverages SplitK parallelization to improve performance for Llama3-70B by decomposing work along the k dimension and launching additional thread blocks to compute partial output sums. This finer-grained work decomposition yields significant speedups over the base Triton GEMM implementation. Experimental results show up to 1.94x speedup over the base Triton matmul implementation, 1.87x over cuBLAS FP8, and 1.71x over cuBLAS FP16 for Llama3-70B inference problem sizes. Additionally, the introduction of CUDA graphs further improves end-to-end acceleration by reducing kernel launch latency. By capturing and instantiating a graph instead of launching kernels individually, developers can minimize CPU launch overhead and achieve significant performance gains in production environments.
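The SplitK idea above can be sketched in plain Python (a conceptual model, not the Triton kernel itself): the k dimension of C = A @ B is divided into slices, each slice, standing in for an extra thread block, computes a partial sum, and a reduction step accumulates the partials into the final output.

```python
# Conceptual sketch of SplitK work decomposition. The function name and
# structure are illustrative assumptions; on the GPU the per-slice work
# runs in parallel thread blocks and the reduction uses atomic adds.

def splitk_matmul(A, B, split_k=4):
    """A: M x K, B: K x N, as nested lists. Returns M x N product."""
    M, K, N = len(A), len(B), len(B[0])
    chunk = (K + split_k - 1) // split_k  # ceil-divide K across slices

    # Each "thread block" handles one k-slice and produces a partial C.
    partials = []
    for s in range(split_k):
        k0, k1 = s * chunk, min((s + 1) * chunk, K)
        part = [[sum(A[i][k] * B[k][j] for k in range(k0, k1))
                 for j in range(N)] for i in range(M)]
        partials.append(part)

    # Reduction: accumulate the partial sums into the final output.
    return [[sum(p[i][j] for p in partials) for j in range(N)]
            for i in range(M)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(splitk_matmul(A, B, split_k=2))  # [[19, 22], [43, 50]]
```

Splitting along k helps most when M and N are small relative to K, as in the skinny matmuls of LLM decoding, because a standard tiling would otherwise launch too few blocks to keep the GPU's SMs busy.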

In summary, PyTorch presents a novel approach to speeding up FP8 inference for large language models using Triton kernels. The proposed method overcomes the inefficiencies of standard PyTorch execution and cuBLAS FP8 computation by introducing an optimized TK-GEMM kernel with SplitK parallelization, combined with CUDA graphs for end-to-end speedup. The solution delivers significant performance improvements for Llama3-70B inference problem sizes on Nvidia H100 GPUs, representing a promising advance in deep learning inference optimization. Overall, the method successfully accelerates FP8 inference for large language models such as Llama3 through kernel-level optimization.

Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a technology enthusiast and has a keen interest in the areas of software and data science applications. She always reads about developments in various areas of AI and ML.
