A short story about how Ukrainian Cyrillic is a tax on non-English developers, and what I did about it in four days of benchmarks on a 2009 Phenom II.
Disclaimer for Those Who Came for the Numbers
If you don’t care about the backstory, skip straight to the results: three benchmarks, one of them across 6 models. I won’t be offended. But you’ll miss half the point; this thing wasn’t born from a research paper, it was born from not having enough money for API calls.
I’m Daria, a programmer from Ukraine, living in Poland. I work with data daily, and working without LLMs today is like sawing wood without a chainsaw. Technically possible, but you’ll be the fool among people with chainsaws.
My work machine is an old AMD Phenom II that my husband literally found in a dumpster. A real 2009 processor that I wiped down, installed Linux on (because Windows would die of embarrassment on this), and work on.
The situation:
- Local models — forget it. Llama on a Phenom II is like running a marathon in rubber flip-flops. Technically possible, morally unacceptable.
- Cloud (GPT, Claude, Gemini) — expensive. Especially in Ukrainian.
- Train my own model — I’m not that cool.
What pissed me off most was that for quality semantic search I had to use the cloud.
Why Is Ukrainian More Expensive?
This is the moment that genuinely angered me when I hit this wall.
In GPT/Claude tokenizers, a Ukrainian Cyrillic word splits into 3-4 tokens; an English word takes about 1-1.5. So for the exact same content you pay three to four times more, just because you write in Cyrillic.
Prompt: "блін продакшн впав після деплою, що робити першим"
Tokens UA: 45
Same meaning in English:
"damn production crashed after deploy, what do first"
Tokens EN: 12
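You can check the gap yourself with tiktoken, OpenAI’s tokenizer library. A minimal sketch (exact counts depend on which encoding you pick, so your numbers may differ a little from mine):

```python
# A minimal sketch: count tokens for the same meaning in Ukrainian and English.
# o200k_base is the encoding used by the GPT-4o family; other encodings give
# slightly different numbers, but the Cyrillic/Latin gap stays.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

ua = "блін продакшн впав після деплою, що робити першим"
en = "damn production crashed after deploy, what do first"

print(len(enc.encode(ua)))  # Cyrillic: several tokens per word
print(len(enc.encode(en)))  # English: roughly one token per word
```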
This isn’t a technical problem, it’s a tax on non-English developers. We all pay it silently; Arabic, Thai, and Hindi speakers do too. Nobody makes comparison tables because nobody gets paid to write about it.
For me, with a Phenom II and a limited budget, this tax added up every month. And at some point I thought: okay, what if I compress the input before paying?
The Third Way
The insight is stupidly simple:
The problem isn’t the model, the problem is the input tokens.
If I can’t make the model cheaper — I can make the input shorter. And not just truncate, but transform it into something GPT understands equally well, but costs less.
That’s how the idea of a three-layer library was born:
- crack_open — normalization. 360 rules + pymorphy3 for lemmatization. “шо” → “що” (colloquial → standard “what”), “канєшно” → “звичайно” (“of course”), “пофіксь” → “виправ” (slang → standard “fix it”).
- compress — remove noise. Fillers, intensifiers, repetitions. Rule-based, no ML. Fast and predictable.
- map_to_en — translate to English. 47K lexicon + seq2seq on 28K expression pairs (GRU encoder-decoder, 7.3M parameters).
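To make the three layers concrete, here is a toy stand-in for the flow. This is not the dormouse implementation, just an illustration of what each layer does; the real library uses 360 normalization rules plus pymorphy3, a rule-based noise filter, and a 47K lexicon with a GRU seq2seq on top.

```python
# Toy stand-in for the three-layer idea, not the dormouse code.
# Each "layer" here is a tiny hand-written dict; the shapes match the article,
# the contents are illustrative only.

NORMALIZE = {"шо": "що", "канєшно": "звичайно", "пофіксь": "виправ"}  # layer 1: surzhyk -> standard UA
FILLERS = {"блін", "ваще", "типу"}                                     # layer 2: noise words to drop
UA_TO_EN = {"що": "what", "звичайно": "sure", "виправ": "fix", "зробимо": "will do"}  # layer 3: UA -> EN

def crack_open(text: str) -> str:
    return " ".join(NORMALIZE.get(w, w) for w in text.split())

def compress(text: str) -> str:
    return " ".join(w for w in text.split() if w not in FILLERS)

def map_to_en(text: str) -> str:
    return " ".join(UA_TO_EN.get(w, w) for w in text.split())

def squeeze_toy(text: str) -> str:
    return map_to_en(compress(crack_open(text)))

print(squeeze_toy("блін шо зробимо"))  # -> "what will do"
```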
I named it dormouse. Because that’s my nickname everywhere — the Dormouse from Alice in Wonderland. Simple logic: my lib = my nick.
How I Tested It
I don’t like papers where ML models are tested on synthetic datasets. I want real texts. So I ran two main benchmarks on two different corpora, to look at the thing from two angles, plus a small head-to-head against LLMLingua (benchmark #3).
Benchmark #1 — coverage on chaotic real texts. Corpus: 53,351 texts. My Telegram conversations (years of chatting with clients, friends, colleagues — surzhyk, profanity, memes), my prompts to Claude, three Lewis Carroll books (because if your lib is called dormouse, you must run it through the source material).
Benchmark #2 — quality preservation on work prompts. 100 carefully selected prompts × 6 models: GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o-mini, GPT-5.5, Gemini 2.0 Flash. Each prompt run twice: original in Ukrainian and squeezed in English. Each response scored 1-5 (details on the judge in the results).
The Phenom II ran the first benchmark for 94 hours straight (~4 days). Sometimes it felt unwell and the monitor turned pink :)
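For benchmark #2 the whole procedure boils down to one loop: send every prompt twice, once as the Ukrainian original and once squeezed into English, then score both responses with the same judge. A simplified sketch (the judge itself is described with the results below):

```python
# Simplified sketch of the benchmark #2 loop. `judge` is any callable that
# maps a response string to a 1-5 score; the heuristic version is shown later.
from openai import OpenAI
from dormouse import squeeze

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    r = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return r.choices[0].message.content

def preservation(model: str, prompts: list[str], judge) -> float:
    ua_scores, en_scores = [], []
    for p in prompts:
        ua_scores.append(judge(ask(model, p)))                           # original UA prompt
        en_scores.append(judge(ask(model, squeeze(p, target="cloud"))))  # squeezed EN prompt
    return sum(en_scores) / sum(ua_scores)  # 1.0 means form fully preserved
```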
Results
Benchmark #1 — Coverage (53,351 texts)
| Metric | Value |
|---|---|
| Lexicon coverage | 88.2% |
| Token savings (without seq2seq) | 48.5% |
| Token savings (with seq2seq) | 73.3% |
| Seq2seq accuracy (exact match) | 98.2% |
| Seq2seq accuracy (overlap) | 99.3% |
73% savings — this is on chaotic Telegram texts where there’s a lot of noise to throw away. On cleaner work prompts the savings are lower but still significant (see benchmark #2).
Benchmark #2 — Response Stability (100 prompts × 6 models)
First, let me be honest about methodology. I didn’t use an LLM judge for response evaluation, because running 600 requests through a judging GPT would cost me more than the lib saves in six months. Instead I used a heuristic judge: scoring by length, structure, and formal completeness of the response (1-5 points).
This means something important: I’m not measuring depth of content, I’m measuring whether the model gives a similarly formed response to the original prompt and to the squeezed one.
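To make “length, structure, and formal completeness” concrete, here is a simplified sketch of the kind of heuristic I mean. The thresholds below are illustrative, not the exact ones from the benchmark:

```python
# Sketch of a heuristic judge: it scores response FORM (length, structure,
# completeness), not content accuracy. Thresholds are illustrative.

def heuristic_score(response: str) -> int:
    score = 1
    words = response.split()
    lines = response.splitlines()
    if len(words) >= 30:                                   # not a one-liner
        score += 1
    if any(l.lstrip().startswith(("-", "*", "1.")) for l in lines):
        score += 1                                         # has list structure
    if response.rstrip().endswith((".", "!", "?", ")")):
        score += 1                                         # ends like a finished thought
    if len(words) >= 120:                                  # detailed answer
        score += 1
    return min(score, 5)

print(heuristic_score("Roll back first.\n- check logs\n- restart the service."))  # -> 3 here
```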
| Model | UA score | EN (squeezed) | Preservation |
|---|---|---|---|
| GPT-4.1 | 4.79 | 4.86 | 102% |
| GPT-4.1-mini | 4.71 | 4.68 | 99% |
| GPT-4o-mini | 4.61 | 4.60 | 100% |
| GPT-4.1-nano | 4.58 | 4.56 | 100% |
| GPT-5.5 | 4.00 | 4.00 | 100% |
| Gemini 2.0 Flash | 4.11 | 4.10 | 100% |
Token savings on this corpus: 50% (without seq2seq).
What this means in human language:
Squeeze doesn’t force the model to give worse-formed responses. Across all 6 models, response form is stable (99-102%). This is not “content quality preserved” — it’s a weaker claim. But it’s an honest claim.
GPT-5.5 artifact: scored 4.0/4.0 — lower than all other models. This isn’t real degradation: GPT-5.5 gives shorter, more concise answers that the heuristic penalizes for short length. It’s a penalty for conciseness, not for poor answers. This is a limitation of my judge, not the model.
What would be ideal but I haven’t done:
- LLM-judge on the same 600 requests (Claude Opus or GPT-4 as judge).
- Human evaluation on at least 50 samples.
All in the roadmap. If someone has GPT credits and wants to do this — write me, I’ll share the corpus and scripts.
Benchmark #3 — Head-to-Head vs LLMLingua (20 prompts)
LLMLingua is the most popular prompt compression tool from Microsoft Research. It uses GPT-2 perplexity to identify “low-information” tokens and discard them. Works well on English texts. How does it work on Ukrainian?
| Method | Tokens | Savings | Quality |
|---|---|---|---|
| Original UA | 1,312 | — | 4.65 |
| dormouse | 620 | 53% | 4.50 |
| LLMLingua (on UA) | 1,182 | 10% | 4.60 |
| dormouse + LLMLingua | 595 | 55% | 4.60 |
What I see in these numbers:
LLMLingua on Ukrainian gives only 10% savings vs 53% from dormouse, a five-fold difference. The explanation is mechanical: GPT-2 is trained predominantly on English data, so its perplexity estimates on Cyrillic are noisy. It can’t reliably decide which tokens are “unimportant,” so it discards only a few of them, and cautiously.
Quality is almost the same: 4.50 for dormouse vs 4.60 for LLMLingua. That’s 0.1 points on a heuristic judge — within noise.
The combo dormouse + LLMLingua adds only +2% to savings. Meaning dormouse already captures the bulk of the “compressible” potential, and LLMLingua has nothing to add on top.
Conclusion: for Ukrainian, LLMLingua isn’t broken, just architecturally unsuitable. It would do the same thing trying to compress Arabic or Thai — any language with a dominant non-Latin script that lacks a cheap perplexity model.
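If you want to reproduce the head-to-head on your own prompts, the rough shape of the comparison is below. A sketch, assuming llmlingua’s PromptCompressor API (parameter names vary a bit between llmlingua versions) and dormouse’s squeeze() as shown in the next section:

```python
# Sketch of the dormouse vs LLMLingua comparison on a single prompt.
# llmlingua parameters vary across versions; check its docs for your install.
import tiktoken
from llmlingua import PromptCompressor
from dormouse import squeeze

enc = tiktoken.get_encoding("o200k_base")
count = lambda s: len(enc.encode(s))

prompt_ua = "блін продакшн впав після деплою, що робити першим"

squeezed = squeeze(prompt_ua, target="cloud")

compressor = PromptCompressor(model_name="gpt2", device_map="cpu")  # perplexity model
result = compressor.compress_prompt(prompt_ua, target_token=20)     # returns a dict

print("original :", count(prompt_ua))
print("dormouse :", count(squeezed))
print("llmlingua:", count(result["compressed_prompt"]))
```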
Try It
```bash
pip install dormouse-ua
```
Package is 29MB — everything included: 47K lexicon, seq2seq model (7.3M params), 360 normalization rules. Deliberately bundled instead of lazy-loading from HuggingFace — because I live in reality where internet can be slow, HuggingFace can have its own plans, and pip install should just work on the first try.
```python
from dormouse import squeeze

squeeze("ваще нормально, канєшно зробимо", target="cloud")
# → "generally ok, sure do"
```
Or as a drop-in middleware for OpenAI/Anthropic SDK:
```python
from openai import OpenAI
from dormouse import DormouseClient

client = DormouseClient(OpenAI())
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "шо там по деплою, він ваще не робе"}],
)
# Under the hood: squeeze → EN → GPT → unsqueeze → Ukrainian response
```
Bonus: stir, mumble, and sip for semantic search, classification and RAG, fully local on CPU. No API, no keys, no cost (requires the [ml] extra).
Who Needs This
- Ukrainian SaaS / chatbots — users write in surzhyk, GPT responds off-topic. Normalize before sending — get concrete answers.
- RAG systems — user searches in slang, documents in literary language. Normalize both sides — search works.
- Batch processing — 10K comments through GPT for sentiment. Squeeze first: cheaper and faster (see the sketch after this list).
- AI agents — long action chains eat the context window. 50-73% fewer tokens means 50-73% more “memory” for the agent.
- Local search without API — stir/mumble/sip work offline. Phenom II gives 600ms search across 8K chunks. Works.
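For the batch case, the whole trick is one extra line before the API call. A sketch with squeeze() and the plain OpenAI SDK; the model name and comments are placeholders:

```python
# Sketch of batch sentiment scoring with pre-squeezed comments.
# squeeze() is the same entry point shown in "Try It"; the rest is plain OpenAI SDK.
from openai import OpenAI
from dormouse import squeeze

client = OpenAI()

def sentiment(comment: str) -> str:
    squeezed = squeeze(comment, target="cloud")  # UA comment -> compact English
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with one word: positive, negative or neutral."},
            {"role": "user", "content": squeezed},
        ],
    )
    return response.choices[0].message.content

comments = ["ваще топ, канєшно беру ще", "шо за жах, нічого не працює"]
print([sentiment(c) for c in comments])
```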
Honest About Limitations
I don’t like it when library authors sell their code like it solves all world problems. So here are a couple of things I’m not sure about, and where I genuinely need your help.
1. I don’t know yet if this works outside my context. 53K texts is a lot by volume, but it’s my bubble: my Telegram conversations, my prompts, my Alice in Wonderland. How it behaves on a doctor’s diagnostic texts, a programmer from Dnipro with Russian surzhyk, a teenager with their own slang — I don’t know. Might be fine, might break. Only real users will show.
2. Heuristic judge is not a full eval. I measured 99-102% preservation by structure and length, not by content. This means my lib doesn’t force the model to give shorter or less structured answers. It doesn’t mean the content stays equally accurate. For that you need LLM-judge or human evaluation, which I don’t have yet. In the roadmap.
What I actually need:
- Logs — if it breaks on some text type, send an issue with the example.
- Tests on your data — if you have a Ukrainian corpus (medical, legal, support tickets) and can run it through squeeze() — share what happened. Edge cases where it breaks are especially interesting.
- Runs on your API use cases — if you pay for GPT/Claude in a Ukrainian product, try it and tell me honestly whether the savings are real for you. My 50-73% is on my corpora, not a law of nature.
Links
- GitHub: ChuprinaDaria/dormouse
- PyPI: dormouse-ua
- HuggingFace: Dariachup/dormouse
MIT license. Forks, issues, criticism — all open. Especially criticism.
Instead of a Conclusion
I didn’t write the next PyTorch. I didn’t make Hugging Face. I made a small, specific thing that saves me (and hopefully a few others) money on API.
Is this a revolution? No. Is it useful? Yes. Does it work on a 2009 Phenom II found in a dumpster? It does.
And that, honestly, is the best metric I have.