<feed xmlns="http://www.w3.org/2005/Atom"> <id>https://rockyrunnr.github.io/</id><title>Magnanimous Rocky's Tech Blog</title><subtitle>A technical blog on system software, storage, and AI tool development</subtitle> <updated>2026-04-11T12:26:58+09:00</updated> <author> <name></name> <uri>https://rockyrunnr.github.io/</uri> </author><link rel="self" type="application/atom+xml" href="https://rockyrunnr.github.io/feed.xml"/><link rel="alternate" type="text/html" hreflang="ko-KR" href="https://rockyrunnr.github.io/"/> <generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator> <rights> © 2026 </rights> <icon>/assets/img/favicons/favicon.ico</icon> <logo>/assets/img/favicons/favicon-96x96.png</logo> <entry><title>Dynamic KV Cache Resize in llama.cpp — 8 GB Savings on a 27B Model</title><link href="https://rockyrunnr.github.io/posts/dynamic-kv-cache-resize-llama-cpp/" rel="alternate" type="text/html" title="Dynamic KV Cache Resize in llama.cpp — 8 GB Savings on a 27B Model" /><published>2026-03-02T13:00:00+09:00</published> <updated>2026-04-11T12:26:14+09:00</updated> <id>https://rockyrunnr.github.io/posts/dynamic-kv-cache-resize-llama-cpp/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/dynamic-kv-cache-resize-llama-cpp/" /> <author> <name></name> </author> <category term="Research" /> <category term="LLM Internals" /> <summary>Start with 16 MB and grow on demand. 
An experimental llama.cpp patch that reduced upfront KV allocation and avoided GPU OOM on Apple Silicon.</summary> </entry> <entry><title>Why OS-Level Demand Paging Fails on Apple Silicon GPU</title><link href="https://rockyrunnr.github.io/posts/demand-paging-fails-apple-silicon-gpu/" rel="alternate" type="text/html" title="Why OS-Level Demand Paging Fails on Apple Silicon GPU" /><published>2026-03-02T12:00:00+09:00</published> <updated>2026-03-02T12:00:00+09:00</updated> <id>https://rockyrunnr.github.io/posts/demand-paging-fails-apple-silicon-gpu/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/demand-paging-fails-apple-silicon-gpu/" /> <author> <name></name> </author> <category term="Research" /> <category term="LLM Internals" /> <summary>I tried to use macOS demand paging to save GPU memory in llama.cpp. It didn't work — and understanding why led to a better approach.</summary> </entry> <entry><title>How llama.cpp Manages KV Cache — and How PagedAttention Fits In</title><link href="https://rockyrunnr.github.io/posts/paged-attention-llama-cpp-deep-dive/" rel="alternate" type="text/html" title="How llama.cpp Manages KV Cache — and How PagedAttention Fits In" /><published>2026-03-01T00:00:00+09:00</published> <updated>2026-03-01T14:14:31+09:00</updated> <id>https://rockyrunnr.github.io/posts/paged-attention-llama-cpp-deep-dive/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/paged-attention-llama-cpp-deep-dive/" /> <author> <name></name> </author> <category term="Research" /> <category term="LLM Internals" /> <summary>A technical analysis of llama.cpp's KV cache architecture, the llama_memory_i interface, and how PagedAttention can be implemented as a composable new memory strategy.</summary> </entry> <entry><title>Isolating Memory Swap Degradation in Ollama: A Pure Memory Pressure Experiment</title><link href="https://rockyrunnr.github.io/posts/ollama-memory-pressure-experiment/" rel="alternate" type="text/html" 
title="Isolating Memory Swap Degradation in Ollama: A Pure Memory Pressure Experiment" /><published>2026-02-25T23:00:00+09:00</published> <updated>2026-03-01T14:14:31+09:00</updated> <id>https://rockyrunnr.github.io/posts/ollama-memory-pressure-experiment/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/ollama-memory-pressure-experiment/" /> <author> <name></name> </author> <category term="Projects" /> <category term="ollama-bench" /> <summary>What happens when you hold GPU load constant and only increase memory pressure? I designed an experiment to isolate swap-induced slowdown from GPU contention on a 32 GB Mac Mini.</summary> </entry> <entry><title>Finding the Performance Cliff: Parallel Request Benchmarking with Ollama</title><link href="https://rockyrunnr.github.io/posts/ollama-parallel-benchmark-cliff/" rel="alternate" type="text/html" title="Finding the Performance Cliff: Parallel Request Benchmarking with Ollama" /><published>2026-02-25T02:50:00+09:00</published> <updated>2026-03-01T14:14:31+09:00</updated> <id>https://rockyrunnr.github.io/posts/ollama-parallel-benchmark-cliff/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/ollama-parallel-benchmark-cliff/" /> <author> <name></name> </author> <category term="Projects" /> <category term="ollama-bench" /> <summary>Sweeping parallelism from 1 to 10 concurrent requests on a 30B model to find exactly where performance falls off a cliff — and why KV cache pre-allocation is the culprit.</summary> </entry> </feed>
