Dynamic KV Cache Resize in llama.cpp — 8 GB Savings on a 27B Model

Start with 16 MB and grow on demand. An experimental llama.cpp patch that reduced upfront KV allocation and avoided GPU OOM on Apple Silicon.

2026/03/02 Research, LLM Internals

Why OS-Level Demand Paging Fails on Apple Silicon GPU

I tried to use macOS demand paging to save GPU memory in llama.cpp. It didn't work — and understanding why led to a better approach.

2026/03/02 Research, LLM Internals

How llama.cpp Manages KV Cache — and How PagedAttention Fits In

A technical analysis of llama.cpp's KV cache architecture, the llama_memory_i interface, and how PagedAttention can be implemented as a composable new memory strategy.

2026/03/01 Research, LLM Internals

Isolating Memory Swap Degradation in Ollama: A Pure Memory Pressure Experiment

What happens when you hold GPU load constant and only increase memory pressure? I designed an experiment to isolate swap-induced slowdown from GPU contention on a 32GB Mac Mini.

2026/02/25 Projects, ollama-bench

Finding the Performance Cliff: Parallel Request Benchmarking with Ollama

Sweeping parallelism from 1 to 10 concurrent requests on a 30B model to find exactly where performance falls off a cliff — and why KV cache pre-allocation is the culprit.

2026/02/25 Projects, ollama-bench

ollacode System Prompt Optimization: Korean → English Benchmark Results

Switching ollacode's system prompt from Korean to English and measuring real performance gains with ollama-bench. 60% faster TTFT, 55% fewer input tokens.

2026/02/25 Projects, ollacode

ollama-bench: Building a Performance Benchmark Tool for Ollama

I built ollama-bench to measure local LLM performance degradation — token generation speed, prefill speed, TTFT, and memory usage over progressive requests.

2026/02/25 Projects, ollama-bench

ollacode Day 2: Memory Optimization — Making Local LLMs Practical

Day 2 of ollacode development — optimizing token usage for local LLMs with English prompts, context compaction, and smart history management.

2026/02/25 Projects, ollacode

Building ollacode: A Local AI Coding Assistant with Telegram Integration (Day 1)

Day 1 of building ollacode — a lightweight CLI coding assistant powered by Ollama's qwen3-coder:30b model with Telegram bot integration.

2026/02/24 Projects, ollacode

Solidity Deep Dive (4): EIP-4337 Account Abstraction — Audit Report Review

Key findings from OpenZeppelin's audit of the EIP-4337 Account Abstraction implementation — deposit manipulation, silent token transfer failures, and the dangers of Solidity's transfer().

2022/04/29 Development, Blockchain

