An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models
Published in: 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 970-982
Main Authors:
Format: Conference Proceeding
Language: English
Published: IEEE, 02-03-2024
Summary: Transformer-based large language models (LLMs) such as the Generative Pre-trained Transformer (GPT) have become popular due to their remarkable performance across diverse applications, including text generation and translation. For LLM training and inference, the GPU has been the predominant accelerator, with its pervasive software development ecosystem and powerful computing capability. However, as the size of LLMs keeps increasing for higher performance and/or more complex applications, a single GPU cannot efficiently accelerate LLM training and inference due to its limited memory capacity, which demands frequent transfers, from host CPU memory/storage, of the model parameters the GPU needs to compute the current layer(s). A GPU appliance may provide enough aggregate memory capacity with multiple GPUs, but it suffers from frequent transfers of intermediate values among the GPU devices, each accelerating specific layers of a given LLM. As the frequent transfers of these model parameters and intermediate values are performed over relatively slow device-to-device interconnects such as PCIe or NVLink, they become the key bottleneck for efficient acceleration of LLMs. Focusing on accelerating LLM inference, which is essential for many commercial services, we develop CXL-PNM, a processing-near-memory (PNM) platform based on the emerging interconnect technology Compute eXpress Link (CXL). Specifically, we first devise an LPDDR5X-based CXL memory architecture with 512GB of capacity and 1.1TB/s of bandwidth, which boasts 16× larger capacity and 10× higher bandwidth than GDDR6- and DDR5-based CXL memory architectures, respectively, under a module form-factor constraint. Second, we design a CXL-PNM controller architecture integrated with an LLM inference accelerator, exploiting the unique capabilities of such CXL memory to overcome the disadvantages of competing technologies such as HBM-PIM and AxDIMM. Lastly, we implement a CXL-PNM software stack that supports seamless and transparent use of CXL-PNM for Python-based LLM programs. Our evaluation shows that a CXL-PNM appliance with 8 CXL-PNM devices offers 23% lower latency, 31% higher throughput, and 2.8× higher energy efficiency at 30% lower hardware cost than a GPU appliance with 8 GPU devices for an LLM inference service.
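The abstract states that the CXL-PNM software stack lets Python-based LLM programs use the device seamlessly and transparently, with model parameters kept resident in the large LPDDR5X-backed memory so that only activations move per token. The sketch below is a minimal, hypothetical illustration of that idea only: the names (PNMDevice, upload, matmul, linear) are invented for this example and are not the paper's actual API, and numpy stands in for the near-memory compute.

```python
# Hypothetical sketch of transparent offload of bandwidth-bound GEMV work in
# LLM decoding to a near-memory device. Illustrative only; not the CXL-PNM API.
import numpy as np

class PNMDevice:
    """Stand-in for a CXL-PNM device holding weights resident in its memory pool."""
    def __init__(self):
        self._weights = {}  # layer name -> weight matrix kept near memory

    def upload(self, name, weight):
        # Parameters are uploaded once and stay resident; only activations travel.
        self._weights[name] = np.asarray(weight)

    def matmul(self, name, activation):
        # In hardware this would execute next to LPDDR5X; numpy emulates the result.
        return self._weights[name] @ activation

def linear(x, name, weight, device=None):
    """Drop-in dense layer: offload when the weights live on a PNM device."""
    if device is not None and name in device._weights:
        return device.matmul(name, x)
    return np.asarray(weight) @ x  # CPU fallback path

# Usage: upload decoder weights once, then each token's GEMV avoids host transfers.
dev = PNMDevice()
w = np.random.randn(4096, 4096).astype(np.float32)
dev.upload("decoder.0.fc", w)
x = np.random.randn(4096).astype(np.float32)
y = linear(x, "decoder.0.fc", w, device=dev)
```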
ISSN: 2378-203X
DOI: 10.1109/HPCA57654.2024.00078