The Machine That Grades the Machine: On Automated Quality Scores for Turkmen

CAT platforms now ship LLM-based quality scores that rate a translation from 0 to 100. For a low-resource language like Turkmen, that number measures something, but it isn't what the dashboard implies.

There is a new box in the localization dashboard, and it contains a number. Smartcat calls its version the Translation Quality Score: an automatic, "objective" measurement that issues a value between 0 and 100 for any language pair and helpfully highlights the segments that need attention. It is a clean, satisfying thing to look at, and it is spreading. Phrase, memoQ, and Trados have all pushed LLM capability into their editors over the past two years — Trados Copilot inside Studio 2024, Smartcat's AI Agents that learn from feedback, memoQ's steadily reworked QA settings across the 11.4–11.6 releases. Automated quality estimation is quietly becoming its own product category.

I want to make one specific argument about that number, because I think project managers are about to start trusting it, and for languages like mine they shouldn't — at least not the way the interface invites them to.

The scorer and the translated text come from the same well

Here is the mechanism that the tidy 0-to-100 display obscures. When you run AI translation into Turkmen and then ask an LLM-based checker to score it, you are frequently asking the same family of models to grade its own homework. In that case, the system that produced the Turkmen and the system that evaluates the Turkmen were trained on the same thin slice of the internet. Turkmen is a genuinely low-resource language; the model did not learn it deeply, it learned it incidentally. So the judge is exactly as ignorant as the defendant.

This matters more than it would for Spanish or German. For a high-resource pair, the evaluator has seen enough good and bad output to have a real sense of the distribution — it can plausibly tell fluent from broken, idiomatic from stilted. For Turkmen, both the generator and the evaluator are working from sparse, often noisy data. A confident 87 out of 100 does not mean the translation is 87% good. It means the model is 87% comfortable, and its comfort and its correctness are only loosely related. I have opened Turkmen segments scored in the high eighties that quietly used a Russian loanword where a native term exists, or mangled the possessive suffix chain in a way no native reader would forgive. The score didn't flinch, because the thing that produced the error and the thing grading it share the same blind spot.
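One way to see how loosely comfort and correctness are related on your own projects is to have a native reviser grade a sample blind, then tabulate the findings by score band. What follows is a minimal sketch in Python, not any platform's export format or API: the `ReviewedSegment` fields, the ten-point band width, and the six sample rows are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ReviewedSegment:
    segment_id: str
    auto_score: int            # 0-100 score from the automated checker
    reviser_found_error: bool  # blind verdict from a native reviser


def error_rate_by_band(segments, band_width=10):
    """Return {band_start: (errors, total, error_rate)} per score band."""
    counts = defaultdict(lambda: [0, 0])  # band_start -> [errors, total]
    for seg in segments:
        # A score of 100 is folded into the top band rather than its own.
        band = min(seg.auto_score // band_width * band_width, 100 - band_width)
        counts[band][1] += 1
        if seg.reviser_found_error:
            counts[band][0] += 1
    return {
        band: (errors, total, errors / total)
        for band, (errors, total) in sorted(counts.items())
    }


# Invented sample data, for illustration only.
sample = [
    ReviewedSegment("s1", 88, True),
    ReviewedSegment("s2", 91, False),
    ReviewedSegment("s3", 62, True),
    ReviewedSegment("s4", 85, True),
    ReviewedSegment("s5", 67, False),
    ReviewedSegment("s6", 93, True),
]

bands = error_rate_by_band(sample)
for band, (errors, total, rate) in bands.items():
    print(f"{band}-{band + 9}: {errors}/{total} segments with errors ({rate:.0%})")
```

If the error rate in a real sample does not fall as the score band rises, the score is not tracking correctness for that language pair, whatever the dashboard suggests.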

A score measures fluency; accuracy is a different question

Even setting aside the low-resource problem, it's worth being precise about what these estimators actually detect. LLM-based quality scoring is very good at surface plausibility — grammaticality, register, whether a sentence reads smoothly. The industry itself keeps repeating the caution, almost as a mantra: LLM suggestions can be fluent but factually or terminologically wrong. That's the whole trap. Fluency is exactly the dimension a language model excels at generating and, therefore, exactly the dimension it excels at rewarding.

So the score rewards precisely the failure mode that costs clients the most. A confidently fluent mistranslation of a safety instruction in an oil-and-gas manual, or a plausible-sounding but legally wrong rendering of a contract clause, is the sentence most likely to score well and least likely to get flagged. The highlighting feature ("these segments need attention") inverts the real risk map. It points you at the awkward sentences, which a competent reviser would have caught anyway, and waves you past the smooth, wrong ones.

This is a different problem from the one I've written about before with automated QA checkers crying wolf over Turkmen morphology. That was about false positives — the checker flagging correct agglutinative forms as errors. Quality scoring introduces the opposite and more dangerous error: the false negative dressed up as a metric.

Use it as a triage signal, not a quality gate

None of this means the tools are useless. I use LLM suggestions in the editor daily; having a third proposal alongside the TM match and the MT output is genuinely handy, and prompting for register — casual, formal — occasionally saves a keystroke. The efficiency gains the vendors cite are real for the mechanical parts of the job. My objection is narrow and specific: do not let a computed score stand in for a quality decision on a low-resource language.

For project managers, the practical translation is this. Treat the quality score the way you'd treat a fuzzy-match percentage — a rough triage signal about where to spend attention, never a sign-off. Don't build acceptance thresholds around it for Turkmen; a policy that auto-approves everything above, say, 90 will approve the confident errors first. And when you're pricing, remember that the score cannot see the work that actually protects the client, which is a native reviser validating meaning against source.
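The difference between a gate and a triage signal can be made concrete. The sketch below is a rough illustration under stated assumptions, not a feature of any CAT platform: the `Segment` fields, the threshold of 90, the `high_risk` flag, and the four sample segments are all hypothetical. It contrasts a policy that auto-approves above a threshold with one that sends every segment to a native reviser and uses the score only to order the queue.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: str
    auto_score: int  # 0-100 score from the automated checker
    high_risk: bool  # hypothetical flag: safety, legal, or similar content


def gate(segments, threshold=90):
    """Quality-gate policy: anything above the threshold skips human review."""
    approved = [s for s in segments if s.auto_score > threshold]
    reviewed = [s for s in segments if s.auto_score <= threshold]
    return approved, reviewed


def triage(segments):
    """Triage policy: every segment is reviewed; the score only sets the order.

    High-risk content goes first (an assumption of this sketch), then
    lowest score first within each group.
    """
    return sorted(segments, key=lambda s: (not s.high_risk, s.auto_score))


# Invented segments, for illustration only.
segments = [
    Segment("safety-warning", 94, True),
    Segment("ui-label", 71, False),
    Segment("contract-clause", 92, True),
    Segment("marketing-blurb", 83, False),
]

approved, reviewed = gate(segments)
print("Skipped review under the gate:", [s.segment_id for s in approved])
print("Triage review order:", [s.segment_id for s in triage(segments)])
```

In this toy data the gate waves through the two fluent, high-stakes segments and spends the reviser's time on the UI label and the marketing blurb. Under triage nothing is approved by the number alone; the score only decides what the reviser reads first.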

The honest reframing the industry is landing on, that AI shifts human value from raw production toward validation and risk management, is correct. But validation is a human act performed by someone who speaks the language. A number generated by a model that does not speak it is not validation. It's a second opinion from the same doctor.