ollacode System Prompt Optimization: Korean → English Benchmark Results
Switching ollacode's system prompt from Korean to English and measuring real performance gains with ollama-bench. 60% faster TTFT, 55% fewer input tokens.
Background
In the previous post, I optimized ollacode’s memory usage by switching the system prompt from Korean to English. The theory is simple: most LLM tokenizers learn their vocabulary from English-heavy text, so the same meaning usually takes fewer tokens in English.
But theory alone isn’t enough. I built ollama-bench and measured the difference with real data.
Experiment Setup
| Parameter | Value |
|---|---|
| Model | qwen3-coder:30b |
| Hardware | Mac Mini (Apple Silicon) |
| Benchmark mode | context-growth (5 rounds) |
| Seed | 42 |
| Temperature | 0.0 |
| Workload | Same 20-prompt coding sequence |
What we’re comparing:
- Before (Korean): Korean system prompt (~2000 chars, 732 input tokens)
- After (English): English system prompt (~1200 chars, 331 input tokens)
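The two bullets give both character and token counts, which makes the tokenizer effect easy to quantify. The sketch below derives characters per token from those approximate figures. It also shows one contributing factor: a Hangul syllable takes 3 bytes in UTF-8 where an ASCII letter takes 1, which matters for byte-level BPE tokenizers. It does not run a real tokenizer, and the two words are only illustrations, so actual token counts depend on the model's vocabulary.

```python
# Approximate prompt sizes and token counts from the experiment setup.
prompts = {"korean": (2000, 732), "english": (1200, 331)}  # (chars, tokens)

chars_per_token = {lang: chars / tokens for lang, (chars, tokens) in prompts.items()}
for lang, ratio in chars_per_token.items():
    print(f"{lang}: {ratio:.2f} chars per token")

# Characters vs UTF-8 bytes for an English word and its Korean equivalent.
byte_counts = {}
for word in ("function", "함수"):
    byte_counts[word] = len(word.encode("utf-8"))
    print(f"{word!r}: {len(word)} chars, {byte_counts[word]} UTF-8 bytes")
```

By these figures the English prompt is about 40% shorter in characters but about 55% shorter in tokens, so the tokenizer contributes savings on top of the shorter text.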
Results
Per-Round Data
Korean System Prompt
| Round | In Tok | Out Tok | Gen t/s | Prefill t/s | TTFT(ms) | Total(ms) | Mem(MB) |
|---|---|---|---|---|---|---|---|
| 1 | 732 | 296 | 36.4 | 317.4 | 2,307 | 25,103 | 19,335 |
| 2 | 1,057 | 591 | 34.6 | 2,181.7 | 484 | 17,806 | 19,333 |
| 3 | 1,683 | 865 | 32.8 | 2,218.2 | 759 | 27,439 | 19,338 |
| 4 | 2,578 | 1,150 | 31.2 | 3,012.6 | 856 | 38,083 | 19,345 |
| 5 | 3,754 | 1,365 | 28.9 | 2,831.7 | 1,326 | 48,870 | 19,343 |
English System Prompt
| Round | In Tok | Out Tok | Gen t/s | Prefill t/s | TTFT(ms) | Total(ms) | Mem(MB) |
|---|---|---|---|---|---|---|---|
| 1 | 331 | 232 | 38.2 | 339.6 | 975 | 7,211 | 19,356 |
| 2 | 591 | 527 | 36.9 | 1,934.2 | 306 | 14,748 | 19,357 |
| 3 | 1,152 | 923 | 35.1 | 3,266.7 | 353 | 26,937 | 19,358 |
| 4 | 2,104 | 624 | 32.6 | 6,098.8 | 345 | 19,672 | 19,356 |
| 5 | 2,753 | 1,071 | 27.3 | 8,401.2 | 328 | 40,304 | 19,289 |
Summary Comparison
| Metric | Korean | English | Change |
|---|---|---|---|
| Avg Gen Speed | 32.8 t/s | 34.0 t/s | +3.8% ✅ |
| Avg TTFT | 1,146 ms | 461 ms | -59.8% ✅ |
| Avg Prefill Speed | 2,112 t/s | 4,008 t/s | +89.8% ✅ |
| Round 1 Input Tokens | 732 | 331 | -54.8% ✅ |
| Peak Memory | 19,345 MB | 19,358 MB | +0.1% (negligible) |
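The summary figures are plain arithmetic means over the five rounds, so they can be re-derived from the per-round tables. A minimal Python sketch follows; the `rounds` dictionary and the helper names are illustrative and not part of ollama-bench.

```python
from statistics import mean

# Per-round values copied from the tables above: (gen t/s, prefill t/s, TTFT ms)
rounds = {
    "korean": [(36.4, 317.4, 2307), (34.6, 2181.7, 484), (32.8, 2218.2, 759),
               (31.2, 3012.6, 856), (28.9, 2831.7, 1326)],
    "english": [(38.2, 339.6, 975), (36.9, 1934.2, 306), (35.1, 3266.7, 353),
                (32.6, 6098.8, 345), (27.3, 8401.2, 328)],
}

def averages(rows):
    """Return the mean of each metric column as (gen, prefill, ttft)."""
    gen, prefill, ttft = zip(*rows)
    return mean(gen), mean(prefill), mean(ttft)

def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100

ko = averages(rounds["korean"])
en = averages(rounds["english"])
for name, b, a in zip(("gen t/s", "prefill t/s", "TTFT ms"), ko, en):
    print(f"{name:12s} {b:8.1f} -> {a:8.1f} ({pct_change(b, a):+.1f}%)")
```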
Analysis
1. 55% Fewer Input Tokens — The Root Cause
This is the most fundamental difference. The same system prompt content uses 732 tokens in Korean but only 331 tokens in English.
graph LR
A["🇰🇷 Korean Prompt<br/>732 tokens"] -->|"-55%"| B["🇺🇸 English Prompt<br/>331 tokens"]
style A fill:#e74c3c,stroke:#c0392b,color:#fff
style B fill:#27ae60,stroke:#1e8449,color:#fff
Why such a big difference? BPE-style tokenizers learn their vocabulary from training text that is mostly English, so common English words get their own tokens. An English word typically maps to 1–2 tokens, while Korean text is often split into one token per syllable or into even smaller pieces:
- "function" → 1 token
- "함수" (same meaning) → 2–3 tokens
2. TTFT Improved by 60% — Perceived Responsiveness
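Average TTFT dropped from 1,146 ms to 461 ms, a 60% improvement.
This is the wait time before the first character appears. Going from 1.1 seconds to 0.5 seconds is a substantial UX improvement. TTFT is dominated by prefill, so it depends on both the number of input tokens and the prefill throughput rather than on token count alone: Korean Round 5 has five times the input of Round 1 but a lower TTFT. Fewer input tokens still mean less prefill work and a faster first response.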
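In the per-round tables, TTFT lines up closely with input tokens divided by prefill throughput. A quick check in Python, using values copied from the tables; whether ollama-bench actually derives TTFT this way is an assumption, not something stated in this post.

```python
# (input tokens, prefill t/s, reported TTFT ms) per round, from the tables above
korean = [(732, 317.4, 2307), (1057, 2181.7, 484), (1683, 2218.2, 759),
          (2578, 3012.6, 856), (3754, 2831.7, 1326)]
english = [(331, 339.6, 975), (591, 1934.2, 306), (1152, 3266.7, 353),
           (2104, 6098.8, 345), (2753, 8401.2, 328)]

def estimated_ttft_ms(input_tokens, prefill_tps):
    """Time to process the whole prompt at the measured prefill rate."""
    return input_tokens / prefill_tps * 1000

errors = []
for label, rows in (("korean", korean), ("english", english)):
    for i, (tokens, tps, reported) in enumerate(rows, start=1):
        est = estimated_ttft_ms(tokens, tps)
        errors.append(abs(est - reported))
        print(f"{label} round {i}: estimated {est:7.1f} ms, reported {reported} ms")

print(f"largest gap: {max(errors):.1f} ms")
```

Round 1 has by far the lowest prefill throughput in both runs (about 320–340 t/s), which is why it contributes the largest single share of the gap between the two averages.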
3. Prefill Speed Doubled (+90%)
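Prefill speed jumped from 2,112 to 4,008 t/s, nearly 2× faster.
Key insight: In Rounds 4–5, the English prompt achieves 6,000–8,400 t/s prefill, while the Korean prompt stays around 2,800–3,000 t/s. Larger inputs generally let the GPU parallelize matrix operations more efficiently, but that alone cannot explain the gap, since the Korean runs process more tokens per round and still report lower throughput. Prompt caching between rounds or run-to-run variance may also play a part; this benchmark does not separate those effects.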
4. Generation Speed — Modest Improvement
Gen speed improved from 32.8 to 34.0 t/s (+3.8%).
Generation speed depends more on model architecture and hardware than input size, so the difference is small. With fewer tokens in context, each decoding step attends over a smaller KV cache, which helps slightly. The gain is not uniform, though: in Round 5 the English run was slower (27.3 vs 28.9 t/s).
5. Memory — No Change
Memory usage stayed at ~19.3GB. The prompt size difference (~400 tokens) is negligible compared to the 30B model’s parameter memory footprint.
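A back-of-the-envelope estimate shows why ~400 tokens barely register. The KV cache grows by a fixed number of bytes per token: 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The architecture numbers below are illustrative assumptions, not values taken from this post or checked against the model card, so substitute the real ones for your model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added per token (keys + values across all layers)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# ASSUMED architecture values, for illustration only.
per_token = kv_bytes_per_token(layers=48, kv_heads=4, head_dim=128)

token_diff = 732 - 331      # Round 1 input tokens, Korean minus English
extra_mib = per_token * token_diff / 2**20
measured_mb = 19_345        # peak memory reported for the Korean run

print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print(f"{extra_mib:.1f} MiB extra for {token_diff} tokens")
print(f"roughly {extra_mib / measured_mb:.2%} of the measured footprint")
```

Under these assumptions the extra cache is in the tens of megabytes, well within the round-to-round variation in the memory column.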
Key Takeaway
flowchart TD
A["System Prompt<br/>Korean → English"] --> B["Input Tokens 55% ↓"]
B --> C["TTFT 60% ↓<br/>(Responsiveness ↑)"]
B --> D["Prefill 90% ↑<br/>(Processing speed ↑)"]
B --> E["Gen Speed 4% ↑<br/>(Modest gain)"]
B --> F["Memory: No change"]
style A fill:#3498db,stroke:#2980b9,color:#fff
style C fill:#27ae60,stroke:#1e8449,color:#fff
style D fill:#27ae60,stroke:#1e8449,color:#fff
Always write internal prompts in English for local LLMs. Regardless of the user’s language, keep system prompts, tool descriptions, and internal messages in English for better token efficiency. The model will generally still respond in whatever language the user writes in, so the user experience should be unaffected.
Tools Used
This benchmark was run with ollama-bench:
# Korean prompt benchmark
ollama-bench --model qwen3-coder:30b --rounds 5 \
--system-prompt korean_prompt.txt \
--system-prompt-label korean \
--output bench_korean.json
# English prompt benchmark
ollama-bench --model qwen3-coder:30b --rounds 5 \
--system-prompt english_prompt.txt \
--system-prompt-label english \
--output bench_english.json
# Compare
ollama-bench --compare bench_korean.json bench_english.json
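To cross-check numbers like these without ollama-bench, note that the final response from Ollama's `/api/generate` and `/api/chat` endpoints includes `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration`, with durations in nanoseconds. The sketch below derives throughput figures from such a response. The `throughput` helper and the sample values are hypothetical, and this is not necessarily how ollama-bench computes its columns.

```python
def throughput(resp):
    """Derive prefill and generation speed (tokens/s) from an Ollama response dict."""
    ns_per_s = 1e9  # Ollama reports durations in nanoseconds
    prefill_s = resp["prompt_eval_duration"] / ns_per_s
    gen_s = resp["eval_duration"] / ns_per_s
    return {
        "input_tokens": resp["prompt_eval_count"],
        "output_tokens": resp["eval_count"],
        "prefill_tps": resp["prompt_eval_count"] / prefill_s,
        "gen_tps": resp["eval_count"] / gen_s,
        "prefill_ms": prefill_s * 1000,
    }

# Made-up sample shaped like an Ollama response (not measured data).
sample = {
    "prompt_eval_count": 400,
    "prompt_eval_duration": 1_000_000_000,   # 1.0 s
    "eval_count": 300,
    "eval_duration": 10_000_000_000,         # 10.0 s
}
metrics = throughput(sample)
print(metrics)
```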
ollacode optimization results. Data-driven decisions are the foundation of meaningful optimization.