<feed xmlns="http://www.w3.org/2005/Atom"> <id>https://rockyrunnr.github.io/</id><title>Magnanimous Rocky's Tech Blog</title><subtitle>A technical blog on system software, storage, and AI tool development</subtitle> <updated>2026-04-11T12:26:58+09:00</updated> <author> <name></name> <uri>https://rockyrunnr.github.io/</uri> </author><link rel="self" type="application/atom+xml" href="https://rockyrunnr.github.io/feed.xml"/><link rel="alternate" type="text/html" hreflang="ko-KR" href="https://rockyrunnr.github.io/"/> <generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator> <rights> © 2026 </rights> <icon>/assets/img/favicons/favicon.ico</icon> <logo>/assets/img/favicons/favicon-96x96.png</logo> <entry><title>Dynamic KV Cache Resize in llama.cpp — 8 GB Savings on a 27B Model</title><link href="https://rockyrunnr.github.io/posts/dynamic-kv-cache-resize-llama-cpp/" rel="alternate" type="text/html" title="Dynamic KV Cache Resize in llama.cpp — 8 GB Savings on a 27B Model" /><published>2026-03-02T13:00:00+09:00</published> <updated>2026-04-11T12:26:14+09:00</updated> <id>https://rockyrunnr.github.io/posts/dynamic-kv-cache-resize-llama-cpp/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/dynamic-kv-cache-resize-llama-cpp/" /> <author> <name></name> </author> <category term="Research" /> <category term="LLM Internals" /> <summary>Start with 16 MB and grow on demand. 
An experimental llama.cpp patch that reduced upfront KV allocation and avoided GPU OOM on Apple Silicon.</summary> </entry> <entry><title>Why OS-Level Demand Paging Fails on Apple Silicon GPU</title><link href="https://rockyrunnr.github.io/posts/demand-paging-fails-apple-silicon-gpu/" rel="alternate" type="text/html" title="Why OS-Level Demand Paging Fails on Apple Silicon GPU" /><published>2026-03-02T12:00:00+09:00</published> <updated>2026-03-02T12:00:00+09:00</updated> <id>https://rockyrunnr.github.io/posts/demand-paging-fails-apple-silicon-gpu/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/demand-paging-fails-apple-silicon-gpu/" /> <author> <name></name> </author> <category term="Research" /> <category term="LLM Internals" /> <summary>I tried to use macOS demand paging to save GPU memory in llama.cpp. It didn't work — and understanding why led to a better approach.</summary> </entry> <entry><title>How llama.cpp Manages KV Cache — and How PagedAttention Fits In</title><link href="https://rockyrunnr.github.io/posts/paged-attention-llama-cpp-deep-dive/" rel="alternate" type="text/html" title="How llama.cpp Manages KV Cache — and How PagedAttention Fits In" /><published>2026-03-01T00:00:00+09:00</published> <updated>2026-03-01T14:14:31+09:00</updated> <id>https://rockyrunnr.github.io/posts/paged-attention-llama-cpp-deep-dive/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/paged-attention-llama-cpp-deep-dive/" /> <author> <name></name> </author> <category term="Research" /> <category term="LLM Internals" /> <summary>A technical analysis of llama.cpp's KV cache architecture, the llama_memory_i interface, and how PagedAttention can be implemented as a composable new memory strategy.</summary> </entry> <entry><title>Isolating Memory Swap Degradation in Ollama: A Pure Memory Pressure Experiment</title><link href="https://rockyrunnr.github.io/posts/ollama-memory-pressure-experiment/" rel="alternate" type="text/html" 
title="Isolating Memory Swap Degradation in Ollama: A Pure Memory Pressure Experiment" /><published>2026-02-25T23:00:00+09:00</published> <updated>2026-03-01T14:14:31+09:00</updated> <id>https://rockyrunnr.github.io/posts/ollama-memory-pressure-experiment/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/ollama-memory-pressure-experiment/" /> <author> <name></name> </author> <category term="Projects" /> <category term="ollama-bench" /> <summary>What happens when you hold GPU load constant and only increase memory pressure? I designed an experiment to isolate swap-induced slowdown from GPU contention on a 32 GB Mac Mini.</summary> </entry> <entry><title>Finding the Performance Cliff: Parallel Request Benchmarking with Ollama</title><link href="https://rockyrunnr.github.io/posts/ollama-parallel-benchmark-cliff/" rel="alternate" type="text/html" title="Finding the Performance Cliff: Parallel Request Benchmarking with Ollama" /><published>2026-02-25T02:50:00+09:00</published> <updated>2026-03-01T14:14:31+09:00</updated> <id>https://rockyrunnr.github.io/posts/ollama-parallel-benchmark-cliff/</id> <content type="text/html" src="https://rockyrunnr.github.io/posts/ollama-parallel-benchmark-cliff/" /> <author> <name></name> </author> <category term="Projects" /> <category term="ollama-bench" /> <summary>Sweeping parallelism from 1 to 10 concurrent requests on a 30B model to find exactly where performance falls off a cliff — and why KV cache pre-allocation is the culprit.</summary> </entry> </feed>
