I’m building my own AI-powered Swedish teacher. It gives me translation challenges — “translate ‘I am late’ into Swedish” — and I type my answer. But I also want to ask follow-up questions mid-challenge: “say this in english!”, “can you repeat that?”, “what does ‘sen’ mean?”

The app has to decide, for every message I send, whether I’m answering the challenge or just having a conversation. Get it wrong and the app grades my question as a wrong translation. Get it wrong the other way and my actual answer gets ignored.

The Problem: Heuristics That Can’t Keep Up

My first approach was a method called isQuestionLike(). It checked if input ended with ? or started with prefixes like "how ", "what ", "can you ", or "translate ". If it matched, treat it as conversation. Otherwise, grade it as a translation attempt.
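In Java, that heuristic would have looked something like the sketch below. The method name comes from the post; the exact prefix list and normalization are my illustration of the approach, not the original code:

```java
import java.util.List;
import java.util.Locale;

class QuestionHeuristic {
    // Illustrative prefix list; the real set kept growing and never converged.
    private static final List<String> PREFIXES =
            List.of("how ", "what ", "can you ", "translate ");

    // Returns true when the input "looks like" conversation -- the brittle approach.
    static boolean isQuestionLike(String input) {
        String normalized = input.strip().toLowerCase(Locale.ROOT);
        if (normalized.endsWith("?")) {
            return true;
        }
        return PREFIXES.stream().anyMatch(normalized::startsWith);
    }
}
```

The failure mode is visible immediately: “say this in english!” has no question mark and no matching prefix, so it falls through to grading.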

Four prefixes worked for the obvious cases. Then I typed “say this in english!” during a challenge. No question mark. No matching prefix. The app graded it as a translation attempt — wrong answer, red mark, frustrating.

So I added more prefixes. Then more. The heuristic grew but never converged:

  • “say this in english!” (a request, no matching prefix)
  • “I don’t understand this” (a complaint, no matching prefix)
  • “please help” (a plea, no matching prefix)

Every time I used my own app naturally, I found another edge case. The heuristic was chasing natural language with pattern matching, and natural language was winning.

The Solution: One LLM Call

I replaced the entire method with a single API call to an LLM. The new isTranslationAttempt() method sends the active challenge context and the user’s message to the model with a system prompt:

“You classify user intent. The user has been asked to translate a sentence into [target language]. Determine if the user’s message is an actual translation attempt or if it is a question, comment, request, complaint, or other conversational message. Respond with exactly one word: TRANSLATION or CONVERSATION.”

The LLM sees both the challenge (“Translate ‘I am late’ into Swedish”) and the user’s input. It responds with one word. The entire decision happens in context, not in pattern-matching.
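A sketch of how the two prompts can be assembled before the API call. The system prompt is quoted from above; the helper class, method names, and the exact layout of the user message are my assumptions, not the post’s actual code:

```java
class IntentPromptBuilder {
    // System prompt quoted from the post, parameterized on the target language.
    static String systemPrompt(String targetLanguage) {
        return "You classify user intent. The user has been asked to translate "
                + "a sentence into " + targetLanguage + ". Determine if the "
                + "user's message is an actual translation attempt or if it is "
                + "a question, comment, request, complaint, or other "
                + "conversational message. Respond with exactly one word: "
                + "TRANSLATION or CONVERSATION.";
    }

    // The user message carries both the active challenge and the learner's
    // input, so the model decides in context rather than on the text alone.
    static String userPrompt(String challengeText, String userText) {
        return "Challenge: " + challengeText + "\nUser message: " + userText;
    }
}
```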

Why This Matters With AI

There’s a specific class of problems where traditional code struggles: decisions where the input space is open-ended and the criteria are semantic, not syntactic.

Checking if a string ends with ? is syntactic. Understanding whether “say this in english!” is a translation attempt or a conversational request is semantic. No amount of prefix-checking converges on semantic understanding.

LLMs are genuinely good at this. They understand intent, tone, and context. For a decision like this — binary output, bounded context, tolerance for a brief API round trip — an LLM call is cheaper to maintain and more reliable than an ever-growing heuristic.

The key insight: you’re not replacing “real logic” with AI. You’re replacing a bad approximation of language understanding with a tool that’s actually built for language understanding.

What This Looked Like in Practice

The learnswedish app is a Java project with clean architecture — the application layer defines an AiLanguageService interface, and the infrastructure layer implements it with OpenAI. The old isQuestionLike() was a private method inside LearningCoach.java. The new isTranslationAttempt(Challenge challenge, String userText) lives on the interface, making it testable and swappable.
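The interface method might look like this. The signature is taken from the post; the minimal `Challenge` record is a stand-in I added so the sketch compiles on its own (the real domain type presumably carries more state):

```java
// Minimal stand-in for the domain type described in the post.
record Challenge(String prompt) {}

// Application-layer port; the infrastructure layer supplies the OpenAI-backed
// implementation, so application code never depends on a concrete API client.
interface AiLanguageService {
    /**
     * Decides whether userText is an attempt at the challenge's translation,
     * or a conversational message (question, comment, request, complaint).
     */
    boolean isTranslationAttempt(Challenge challenge, String userText);
}
```

Because it is a single-method interface, tests can supply a lambda or a stub without touching the OpenAI implementation.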

The implementation extracts the target language from the challenge — if the challenge prompt is in Swedish, the target is English, and vice versa. It constructs a prompt that includes both the challenge text and my message, sends it to the LLM, and parses the one-word response.

After the change, I used my own app for a week without a single wrong decision. “Say this in english!”, “what does this mean?”, “I give up” — all correctly identified as conversation. “Jag är sen”, “det är kallt” — correctly graded as translation attempts. The LLM handles nuance that no heuristic could.

Testing follows the same pattern as any other service method: mock the HTTP response, verify the decision result, inspect the prompt for correctness. For integration tests, a StubAiLanguageService provides configurable behavior — tests can explicitly set the next decision result without hitting the real API. The interface stays clean; the implementation is swappable.
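A stub along these lines keeps integration tests deterministic. The class name comes from the post; its fields, setter, and the inline stand-ins for `Challenge` and `AiLanguageService` are my assumptions about its shape:

```java
// Minimal stand-ins so this sketch compiles on its own.
record Challenge(String prompt) {}
interface AiLanguageService {
    boolean isTranslationAttempt(Challenge challenge, String userText);
}

// Test double: returns whatever decision the test configured, and records
// the last inputs so assertions can inspect what was asked.
class StubAiLanguageService implements AiLanguageService {
    private boolean nextDecision = true;
    private String lastUserText;

    void setNextDecision(boolean decision) {
        this.nextDecision = decision;
    }

    @Override
    public boolean isTranslationAttempt(Challenge challenge, String userText) {
        this.lastUserText = userText;
        return nextDecision;
    }

    String lastUserText() {
        return lastUserText;
    }
}
```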

How to Build This In

Not every if-else chain should become an LLM call. The pattern fits when:

  1. The input is natural language — not structured data, not numeric ranges
  2. The criteria are semantic — the difference between outcomes depends on meaning, not syntax
  3. The input space is open-ended — you can’t enumerate all possible inputs
  4. The decision is binary or from a small set — the LLM responds with one word, not a paragraph
  5. Latency is acceptable — a 200ms API call fits your use case
  6. The cost of a wrong decision is low — this matters more than anything else on this list

That last point deserves emphasis. LLMs hallucinate. They’re probabilistic, not deterministic. My Swedish teacher misclassifying a message means I get a wrong grade — mildly annoying, easily recoverable. A medical system misclassifying a symptom, a financial system making a wrong routing decision, an access control system granting entry to the wrong person — those are different stakes entirely.

Use this pattern where a wrong decision is a minor inconvenience, not a safety risk. For high-stakes decisions, LLMs can inform but shouldn’t be the sole decision maker. A human review step, a deterministic fallback, or a confirmation loop should sit between the LLM’s output and the consequential action.

When the stakes are right, constrain the output aggressively. “Respond with exactly one word: X or Y” eliminates parsing complexity and keeps the integration clean. Parse defensively — uppercase the response, check with startsWith rather than exact match.
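Defensive parsing of the one-word response can be as small as the sketch below (the class and method names are illustrative):

```java
import java.util.Locale;

class DecisionParser {
    // Uppercase and use startsWith rather than exact equality, so trailing
    // punctuation, whitespace, or casing quirks from the model don't
    // break the decision.
    static boolean isTranslation(String rawResponse) {
        String normalized = rawResponse.strip().toUpperCase(Locale.ROOT);
        return normalized.startsWith("TRANSLATION");
    }
}
```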

Keep the interface clean. The application layer defines a method signature (isTranslationAttempt), the infrastructure layer implements it with OpenAI. If you later switch to a local model or a different API, the application code doesn’t change.

The Takeaway

When your heuristic is approximating language understanding, replace it with a tool that actually understands language. A single constrained LLM call can replace dozens of brittle string-matching rules — and it handles edge cases you haven’t thought of yet.