
ollacode System Prompt Optimization: Korean → English Benchmark Results

Switching ollacode's system prompt from Korean to English and measuring real performance gains with ollama-bench. 60% faster TTFT, 55% fewer input tokens.


Background

In the previous post, I optimized ollacode’s memory usage by switching the system prompt from Korean to English. The theory is simple: LLM tokenizers are optimized for English, so the same meaning requires fewer tokens.

But theory alone isn’t enough. I built ollama-bench and measured the difference with real data.

Experiment Setup

| Parameter | Value |
|---|---|
| Model | qwen3-coder:30b |
| Hardware | Mac Mini (Apple Silicon) |
| Benchmark mode | context-growth (5 rounds) |
| Seed | 42 |
| Temperature | 0.0 |
| Workload | Same 20-prompt coding sequence |

What we’re comparing:

  • Before (Korean): Korean system prompt (~2000 chars, 732 input tokens)
  • After (English): English system prompt (~1200 chars, 331 input tokens)

Results

Per-Round Data

Korean System Prompt

| Round | In Tok | Out Tok | Gen t/s | Prefill t/s | TTFT (ms) | Total (ms) | Mem (MB) |
|---|---|---|---|---|---|---|---|
| 1 | 732 | 296 | 36.4 | 317.4 | 2,307 | 25,103 | 19,335 |
| 2 | 1,057 | 591 | 34.6 | 2,181.7 | 484 | 17,806 | 19,333 |
| 3 | 1,683 | 865 | 32.8 | 2,218.2 | 759 | 27,439 | 19,338 |
| 4 | 2,578 | 1,150 | 31.2 | 3,012.6 | 856 | 38,083 | 19,345 |
| 5 | 3,754 | 1,365 | 28.9 | 2,831.7 | 1,326 | 48,870 | 19,343 |

English System Prompt

| Round | In Tok | Out Tok | Gen t/s | Prefill t/s | TTFT (ms) | Total (ms) | Mem (MB) |
|---|---|---|---|---|---|---|---|
| 1 | 331 | 232 | 38.2 | 339.6 | 975 | 7,211 | 19,356 |
| 2 | 591 | 527 | 36.9 | 1,934.2 | 306 | 14,748 | 19,357 |
| 3 | 1,152 | 923 | 35.1 | 3,266.7 | 353 | 26,937 | 19,358 |
| 4 | 2,104 | 624 | 32.6 | 6,098.8 | 345 | 19,672 | 19,356 |
| 5 | 2,753 | 1,071 | 27.3 | 8,401.2 | 328 | 40,304 | 19,289 |

Summary Comparison

| Metric | Korean | English | Change |
|---|---|---|---|
| Avg Gen Speed | 32.8 t/s | 34.0 t/s | +3.8% |
| Avg TTFT | 1,146 ms | 461 ms | -59.8% |
| Avg Prefill Speed | 2,112 t/s | 4,008 t/s | +89.8% |
| Round 1 Input Tokens | 732 | 331 | -54.8% |
| Peak Memory | 19,345 MB | 19,358 MB | +0.1% (negligible) |
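
The summary averages can be reproduced directly from the per-round tables; here is a quick stdlib-only check (values copied from the tables above):

```python
from statistics import mean

# Per-round measurements copied from the Korean and English tables.
korean = {"gen": [36.4, 34.6, 32.8, 31.2, 28.9],
          "ttft": [2307, 484, 759, 856, 1326],
          "prefill": [317.4, 2181.7, 2218.2, 3012.6, 2831.7]}
english = {"gen": [38.2, 36.9, 35.1, 32.6, 27.3],
          "ttft": [975, 306, 353, 345, 328],
          "prefill": [339.6, 1934.2, 3266.7, 6098.8, 8401.2]}

for metric in ("gen", "ttft", "prefill"):
    k, e = mean(korean[metric]), mean(english[metric])
    print(f"{metric}: {k:.1f} -> {e:.1f} ({(e - k) / k:+.1%})")
```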

Analysis

1. 55% Fewer Input Tokens — The Root Cause

This is the most fundamental difference. The same system prompt content uses 732 tokens in Korean but only 331 tokens in English.

```mermaid
graph LR
    A["🇰🇷 Korean Prompt<br/>732 tokens"] -->|"-55%"| B["🇺🇸 English Prompt<br/>331 tokens"]

    style A fill:#e74c3c,stroke:#c0392b,color:#fff
    style B fill:#27ae60,stroke:#1e8449,color:#fff
```

Why such a big difference? LLM tokenizers (e.g., BPE) are optimized for English text. An English word typically maps to 1–2 tokens, while Korean syllables often require separate tokens each:

  • "function" → 1 token
  • "함수" (same meaning) → 2–3 tokens
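
The underlying reason is visible even without a tokenizer: byte-level BPE vocabularies are trained mostly on English text, and each Hangul syllable occupies 3 UTF-8 bytes versus 1 byte per ASCII letter, so Korean starts with more raw material that merges into fewer whole-word tokens. A stdlib-only sketch of that byte gap:

```python
# Each Hangul syllable takes 3 bytes in UTF-8; ASCII letters take 1.
# Byte-level BPE merges common English words into single tokens, while
# Korean falls back to multi-byte pieces that rarely merge as far.
for text in ("function", "함수"):
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} chars -> {len(raw)} UTF-8 bytes")
```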

2. TTFT Improved by 60% — Perceived Responsiveness

Average TTFT dropped from 1,146ms → 461ms — a 60% improvement.

This is the wait time before the first character appears. Going from 1.1 seconds to 0.5 seconds is a substantial UX improvement. TTFT is dominated by prefill and scales almost linearly with input token count, so fewer input tokens mean a faster first response.
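
This relationship can be sanity-checked against the Round 1 numbers above with a simple model, TTFT ≈ input tokens ÷ prefill speed (the helper below is mine, not part of ollama-bench):

```python
# Approximate TTFT as pure prefill time: input_tokens / prefill_speed, in ms.
def estimate_ttft_ms(in_tokens: int, prefill_tps: float) -> float:
    return in_tokens / prefill_tps * 1000

# Round 1 figures from the tables above.
korean_est = estimate_ttft_ms(732, 317.4)   # measured TTFT: 2,307 ms
english_est = estimate_ttft_ms(331, 339.6)  # measured TTFT: 975 ms
print(f"{korean_est:.0f} ms vs {english_est:.0f} ms")  # → 2306 ms vs 975 ms
```

The estimates land almost exactly on the measured values, confirming that TTFT here is essentially prefill time.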

3. Prefill Speed Doubled (+90%)

Prefill speed jumped from 2,112 to 4,008 t/s, nearly doubling.

Key insight: In Rounds 4–5, the English prompt achieves 6,000–8,400 t/s prefill. As input grows larger, GPU matrix operation parallelization becomes more efficient. Korean’s higher token count reduces this efficiency since there are more tokens to process for the same semantic content.

4. Generation Speed — Modest Improvement

Gen speed improved from 32.8 → 34.0 t/s (+3.8%).

Generation speed depends more on model architecture and hardware than on input size, so the difference is small. That said, fewer input tokens mean a smaller KV cache, which slightly reduces the per-token attention cost during generation.

5. Memory — No Change

Memory usage stayed at ~19.3GB. The prompt size difference (~400 tokens) is negligible compared to the 30B model’s parameter memory footprint.
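
A back-of-the-envelope KV-cache calculation shows why. The hyperparameters below are assumed for a generic 30B-class model with grouped-query attention, not read from the actual qwen3-coder:30b config:

```python
# KV-cache cost of the ~400 extra prompt tokens the Korean prompt carries.
# NOTE: hyperparameters are illustrative assumptions, not the real config.
layers, kv_heads, head_dim, bytes_per_elem = 48, 8, 128, 2  # fp16

per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
extra_mib = 400 * per_token_bytes / 2**20
print(f"{extra_mib:.0f} MiB")  # → 75 MiB, tiny next to ~19 GB of weights
```

Even a generous estimate puts the extra cache in the tens of megabytes, well under 1% of the model's resident footprint.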

Key Takeaway

```mermaid
flowchart TD
    A["System Prompt<br/>Korean → English"] --> B["Input Tokens 55% ↓"]
    B --> C["TTFT 60% ↓<br/>(Responsiveness ↑)"]
    B --> D["Prefill 90% ↑<br/>(Processing speed ↑)"]
    B --> E["Gen Speed 4% ↑<br/>(Modest gain)"]
    B --> F["Memory: No change"]

    style A fill:#3498db,stroke:#2980b9,color:#fff
    style C fill:#27ae60,stroke:#1e8449,color:#fff
    style D fill:#27ae60,stroke:#1e8449,color:#fff
```

Always write internal prompts in English for local LLMs. Regardless of the user’s language, keep system prompts, tool descriptions, and internal messages in English for better token efficiency. The AI will still respond in whatever language the user writes in, so the user experience is unaffected.
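
In practice this just means keeping the `system` message in English when talking to Ollama's `/api/chat` endpoint, regardless of the user's language. A minimal sketch (the prompt text is illustrative, not ollacode's actual prompt):

```python
import json

# Chat request body for Ollama's /api/chat endpoint: the system prompt stays
# in English, the user writes in Korean, and the model replies in Korean.
payload = {
    "model": "qwen3-coder:30b",
    "messages": [
        {"role": "system",
         "content": "You are a concise coding assistant. Answer in the user's language."},
        {"role": "user", "content": "이 함수를 리팩토링해줘"},  # user writes Korean
    ],
    "stream": False,
}
print(json.dumps(payload, ensure_ascii=False, indent=2))
```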

Tools Used

This benchmark was run with ollama-bench:

```shell
# Korean prompt benchmark
ollama-bench --model qwen3-coder:30b --rounds 5 \
  --system-prompt korean_prompt.txt \
  --system-prompt-label korean \
  --output bench_korean.json

# English prompt benchmark
ollama-bench --model qwen3-coder:30b --rounds 5 \
  --system-prompt english_prompt.txt \
  --system-prompt-label english \
  --output bench_english.json

# Compare
ollama-bench --compare bench_korean.json bench_english.json
```

That wraps up the ollacode optimization results. Data-driven decisions are the foundation of meaningful optimization.

This article is licensed under the copyright holder's CC BY 4.0 license.