Why is Triton gaining popularity?
In a typical machine learning (ML) workflow, we program feature production, training, and inference. We mostly use frameworks to write high-level programs so we don't have to manage the low-level details required for ML or deep learning (DL). Frameworks such as PyTorch or TensorFlow call CUDA when a GPU is available, and the operations are then performed on the GPU. DL models have achieved state-of-the-art (SOTA) performance across many domains thanks to their hierarchical structure of parametric and non-parametric layers, so CUDA has to decide how to perform the operations these layers require. Libraries like cuBLAS, cuDNN, or PyTorch's built-in kernels are highly optimized for common operations (matrix multiplies, convolutions, etc.). But if our application involves specialized algorithms, unusual data layouts, or non-standard precisions or formats, those stock kernels may not perform well. In that case, you write your own CUDA kernel for faster execution.
However, CUDA programming is very manual and tedious. It follows a "scalar program, blocked threads" model: the kernel describes what a single thread does, and we must explicitly manage how those threads are grouped and scheduled. It is a low-level way to program the GPU.
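To make that concrete, here is a minimal sketch of the thread-per-element style. It uses Numba's CUDA bindings purely as a convenient way to show the idea in Python (Numba is my choice for illustration, not something from the post; classic CUDA C++ works the same way). The kernel body describes one scalar thread, and we must pick the block and grid sizes ourselves:

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    # Scalar program: this body describes the work of ONE thread.
    i = cuda.grid(1)      # this thread's global index
    if i < x.size:        # guard threads past the end of the array
        out[i] = x[i] + y[i]

n = 4096
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.empty_like(x)

# Blocked threads: we manage the grouping of threads explicitly.
threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks_per_grid, threads_per_block](x, y, out)
```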
Therefore, Triton was developed to make these specialized kernels fast while making GPU programming less tedious and manual. Triton is a higher-level approach built on the opposite principle: "blocked program, scalar threads". Instead of managing each thread, we write a program that operates on a whole block of data at once, and Triton handles the thread-level details, choosing an efficient mapping based on our memory-access and data-flow patterns. This makes it fast for specialized use cases.
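By contrast, here is the same vector addition as a Triton kernel, a sketch closely following the vector-addition example from Triton's own tutorials. Each program instance handles one block of elements; nothing in the kernel refers to individual threads:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Blocked program: each program instance handles one BLOCK of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the last, partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 4096
x = torch.rand(n, device="cuda")
y = torch.rand(n, device="cuda")
out = torch.empty_like(x)

# We only choose the block size; Triton maps the block onto
# warps and threads for us.
grid = (triton.cdiv(n, 1024),)
add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
```

Notice that the mapping from a block of offsets down to individual threads never appears in the code; that is exactly the part Triton's compiler takes over.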
This is why Triton has gained popularity: it helps researchers and developers write fast GPU kernels without the full burden of CUDA programming.