Date: 2025-09-09

The 2025 vLLM meetup was held with the participation of the PyTorch Korea User Group. / Source: IT dongA

The landscape of generative AI is shifting at breakneck speed. Since the debut of GPT-3, the spotlight has moved from text-only systems to multimodal AI capable of handling images and audio. That quickly evolved into a race to build larger models that could process longer and more complex inputs. More recently, the conversation turned toward lightweight models and on-device AI, enabling broader adoption for consumers and industry use cases alike. Now the focus has shifted again, to AI agents that can set goals and make decisions autonomously. At every stage, AI technology has adapted quickly to align with market demands and expectations.

As AI adoption accelerates, the efficiency of operating AI services has become a critical priority. It will determine not only how quickly the market expands, but also who secures long-term leadership and competitiveness. In AI chips, performance and performance per watt (PPW) define the battleground. On the software side, the ability to get the most out of those chips and deliver results as quickly and efficiently as possible is increasingly viewed as the gold standard.

One of the biggest priorities for developers in today’s AI market is figuring out how to use large language models (LLMs) more efficiently. With most AI services built on LLMs, success depends on delivering faster and cheaper responses to a wider audience, both to sharpen a service’s competitive edge and to keep costs under control. A number of software libraries aim to solve this challenge, including vLLM, NVIDIA’s TensorRT-LLM, Hugging Face’s TGI, and Microsoft’s DeepSpeed-MII.

Each library has its strengths and weaknesses depending on the deployment environment, but vLLM has quickly emerged as the most widely adopted.
Because it is open source, free to use, and free to customize or redistribute, vLLM has become the backbone of the ecosystem powering modern AI applications.

As vLLM increasingly defines the efficiency of LLM-based services, related events have sprung up around the globe, hosted by organizations such as Andreessen Horowitz (a16z), Meta, Google, IBM, and Snowflake. Korea held its first-ever vLLM meetup on August 19, co-hosted by AI chipmaker Rebellions and enterprise open-source leader Red Hat. The event was jointly organized with the PyTorch Korea User Group and AI quantization startup SqueezeBits.

“This meetup was not about promoting any single company but about strengthening the open-source community,” said Dr. Hong-seok Kim, Chief Software Architect at Rebellions. “Both Rebellions and Red Hat are here to contribute to open source, with a focus on supporting the long-term growth of vLLM. By doing so, we hope to create better conditions for developers and AI semiconductor companies like Rebellions to build and operate AI more effectively.”

Next, Nicolò Lucchesi, Senior Machine Learning Engineer at Red Hat, took the stage with a talk titled “Fast and Efficient AI Inference for vLLM.” “vLLM is for serving models at production scale. So think beyond your personal vLLM running on your laptop,” Lucchesi said. “Here we mainly tune, configure and optimize for models running on big GPUs, cyber scale GPUs, H100 or more. The most important resource of what we have there, which is the GPU, is basically what this talk is about.”

Technically, vLLM’s strength lies in two innovations: PagedAttention for memory management and Continuous Batching for throughput optimization. LLMs rely on a KV cache, which keeps the attention keys and values computed for earlier tokens in GPU memory so that the context does not have to be recomputed at every generation step. But as conversations get longer, the cache grows, consuming a significant amount of memory.
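To give a rough sense of how quickly the KV cache grows with conversation length, the sketch below estimates its size for a single request. It uses the usual back-of-envelope formula: one key and one value vector per token, per layer, per KV head. The model dimensions (32 layers, 32 KV heads, head size 128, 16-bit values) are illustrative assumptions, not figures from the talk, and the function name is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Estimate the KV cache size in bytes for one sequence.

    All default dimensions are assumed, illustrative values.
    """
    # One key vector and one value vector are stored per token,
    # per layer, per KV head (hence the factor of 2).
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


if __name__ == "__main__":
    for tokens in (1, 512, 4096):
        mib = kv_cache_bytes(tokens) / 2**20
        print(f"{tokens:>5} tokens -> {mib:,.1f} MiB")
```

Under these assumptions each token costs 0.5 MiB, so a single 4,096-token conversation holds 2 GiB of GPU memory, and that cost is multiplied by every concurrent request.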
PagedAttention addresses this by splitting the KV cache into small fixed-size blocks that can be placed anywhere in available memory and are allocated only as needed. The result is less wasted memory and the ability to handle far more concurrent requests with greater efficiency.

Requests are then scheduled through Continuous Batching. Traditionally, LLM systems grouped incoming requests into a batch and processed them together before moving on to the next set. However, if the requests varied in response length, GPU capacity would sit idle while the batch waited for its longest response, dragging down efficiency. Continuous Batching solves that problem by dynamically filling gaps. When a new request comes in, it is placed into the queue, and as soon as a running request finishes, the freed GPU capacity is allocated to the next one. This keeps utilization high and throughput steady. Together, these two techniques form the foundation of the vLLM library.

More recently, further optimization techniques have been introduced: custom kernels that deliver hardware-optimized computation beyond memory management, attention backends that apply context-appropriate mechanisms, custom fusions that boost performance by merging operations, torch.compile for dynamic optimization, and multi-level parallelization. Hardware support has also expanded beyond NVIDIA CUDA to include AMD ROCm, Google TPU, Intel Gaudi, and AWS Neuron.

Daniele Trifirò, AI Inference Server Engineer at Red Hat, gave a talk on “Building and Testing Infrastructure for vLLM.” “vLLM makes releases roughly every 3 weeks,” Trifirò noted. “There’s almost 40,000 lines changed, 55,000 lines deleted, so that’s a 30,000 lines net change in three weeks. That’s a massive amount of code,” he said. “And it’s not always as simple as it may seem.
There are several targets, NVIDIA CUDA, AMD ROCm, Google TPU, Intel CPU and more, and for each target you might need a different tool chain.”

He pointed out that running AI workloads on CUDA can demand more than 4 GB of memory per process, and that excessive parallelization risks crashing the system, making it essential to set limits on the number of concurrent tasks. He also recommended specifying only the necessary GPU architectures with ‘TORCH_CUDA_ARCH_LIST’ to shorten build times. Additional tips included methods for refreshing environment variables when editing Python code, and evaluating model performance with tools such as LM Evaluation Harness and CataLM.

Jin-Yeun Park, Director and Head of Red Hat Korea, emphasized the company’s broader commitment: “Red Hat has been supporting the vLLM community worldwide and driving open-source AI innovation. Last November, we acquired Neural Magic, a core contributor to the vLLM community, strengthening our technical leadership in AI inference optimization. We’ve also built strategic partnerships with NVIDIA, AMD, and others,” Park said. “This meetup reflects both the rising interest and the technical needs of Korean developers, and it was especially meaningful to host it with Rebellions, a next-generation AI semiconductor company. Moving forward, Red Hat will continue to accelerate open-source AI innovation by deepening collaboration with Korea’s AI community and enterprises.”

Hong-seok Kim spoke next on the role of vLLM in accelerator environments and how it operates in plugin-based systems. “Founded in 2020, Rebellions has been designing AI semiconductors and has released a new accelerator each year since 2022,” Kim said. “We launched our first chip, ION, followed by ATOM, and are now preparing to introduce the Rebel-Quad.”

Kim highlighted vLLM’s versatility, noting that it not only enables various inference optimizations but is also provided in an API format that developers can easily adopt.
He added that the hardware abstraction layer, the software that bridges operating systems and hardware, has expanded beyond NVIDIA GPUs to include Google TPUs, NPUs, and other AI accelerators. This makes it easier for AI chip companies to integrate into the broader ecosystem, potentially reshaping the dynamics of the market.

Rebellions sees PyTorch as the fulcrum of the AI hardware ecosystem. By leveraging PyTorch, enterprises can integrate Rebellions’ NPUs into their services more seamlessly. Moreover, functions supported by vLLM can run directly on Rebellions’ NPUs without additional code changes. Through vLLM, Rebellions’ NPUs already support a variety of Hugging Face MoE (Mixture of Experts) models. MoE architectures, which divide workloads across specialized sub-models, are emerging as a key technique for improving the efficiency and performance of large-scale LLMs. By aligning with vLLM, Rebellions aims to quickly bring these cutting-edge capabilities into its own products.

Hyung-jun Kim, CEO of SqueezeBits, followed with a presentation on vLLM’s quantization methods. “vLLM supports either directly quantized models or its own LLM compressor,” he explained. “The compressor works layer by layer, searching quantization scales during inference and storing data in compressed tensor formats. vLLM then automatically recognizes these formats to run inference appropriately.”

He added that the approach is not tied to any specific hardware and can run across NPUs and other accelerators. For algorithms like GPTQ that still require FP32 precision, the necessary operations are handled on CPUs. With SqueezeBits’ Fits-on-Chips tool, developers can visually manage the entire pipeline, from selecting models to adjusting serving options and device frameworks, ensuring compression and testing are optimized for the deployment environment.

Reflecting on the event, Kim recalled, “When I attended a vLLM meetup in the U.S.
earlier this year, I was impressed to see developers sharing their experiences and debating the future of the ecosystem. Hosting Korea’s first vLLM meetup, I felt our open-source community was no less vibrant than Silicon Valley’s. As AI advances, the open-source ecosystem will drive that progress. To strengthen everyone’s capabilities, we need more dialogue, fewer repeated mistakes, and regular forums for meetups, studies, and sharing real-world use cases.”

Jung-hwan Park, lead maintainer of the PyTorch Korea User Group, echoed that sentiment: “With AI adoption accelerating, interest in vLLM is growing across the board, from startups to large enterprises. Unfortunately, Korean companies have historically been passive in supporting or contributing to open-source communities. That’s why Rebellions’ proactive commitment to the vLLM ecosystem is such a welcome change. It signals a real initiative to contribute to sustainable growth. I hope more domestic companies will join these exchanges and help establish a virtuous cycle of community-driven development.”

“Unlike closed-source software, open source allows anyone to contribute, which accelerates innovation,” said Dr. Kim of Rebellions. “That openness and agility are exactly why today’s AI development ecosystem revolves around open source.” Indeed, while companies like NVIDIA and Microsoft have released their own libraries to optimize LLM efficiency, vLLM has captured the largest share, thanks to its free distribution and rapid improvements driven by collective intelligence.

Still, as Jung-hwan Park noted, support for open source in Korea remains limited. With Silicon Valley leading the ecosystem, few domestic companies have stepped up, assuming that “somebody else will do it.” That makes Rebellions’ proactive role a notable example for the local AI industry. For Rebellions, the initiative also offers a first-mover advantage by engaging developers early, creating a mutually reinforcing dynamic for the entire ecosystem.
Friendly competition, with more AI chipmakers joining to support open source, would only strengthen that ecosystem.

Rebellions has pledged to continue supporting not only vLLM but also other open-source projects on a consistent basis. It is hoped that other leading AI companies in Korea will also join in, allowing everyone to benefit and ultimately bolstering the country’s global standing in AI.

By Si-hyun Nam (sh@itdonga.com)