AI customer service: WhatsApp + chat in EN/ES.
The architecture behind a Reply playbook that switches languages mid-thread without losing the thread — and the four mistakes we made before we got it right.
The first version of our bilingual Reply playbook had a translation layer. The customer wrote in Spanish, the system translated to English, the model answered in English, the system translated back. It worked, in the sense that the right words came out the other end. It also produced a customer-service voice that nobody on the team had ever used.
The replies were grammatically perfect Spanish that no actual Miami salesperson would write. The cadence was off. The contractions were missing. The slang was airbrushed flat. Customers replied less: average thread length dropped 30% in the second week. We pulled the playbook back.
Bilingual by default, not by translation.
The fix was conceptual, not technical. We stopped thinking of Spanish as a translation of English and started thinking of each language as a peer voice with its own training set. The same dealer, same product, same edge cases — but two distinct sets of phrasing rules, contraction patterns, and tone defaults.
Practically, that means we collect twice as much training data per client — three months of EN threads and three months of ES threads — and we tune two voices in parallel. They share an underlying knowledge base (inventory, prices, hours) but they are voiced separately. The system never translates. It selects.
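A minimal sketch of the select-don't-translate idea: two peer voices keyed by language, sharing one knowledge base. Every name here (Voice, KNOWLEDGE_BASE, the prompt strings) is illustrative, not our production code.

```python
from dataclasses import dataclass

@dataclass
class Voice:
    """One tuned voice: its own phrasing rules, contractions, tone defaults."""
    language: str
    system_prompt: str  # hypothetical: tuned on three months of threads in that language

# Shared knowledge base (inventory, prices, hours) -- one source of truth for both voices.
KNOWLEDGE_BASE = {"hours": "9am-9pm", "inventory_feed": "dealer_stock.json"}

VOICES = {
    "en": Voice("en", "Reply the way the dealership's best EN salesperson writes."),
    "es": Voice("es", "Responde como escribe el mejor vendedor del concesionario."),
}

def select_voice(detected_language: str) -> Voice:
    # Select, never translate: pick the peer voice for the detected language.
    return VOICES[detected_language]
```

The point of the structure: translation is nowhere in the call path. The only language-dependent step is a dictionary lookup.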
Detecting language without asking.
We do not put a language toggle in front of the customer. The first message decides. If the inbound is in Spanish, the entire thread is in Spanish until the customer code-switches. If they switch — and Miami customers code-switch constantly — we follow them, sentence by sentence, but we never volunteer English in a Spanish thread or vice versa.
The detector is a small classifier on the first sentence: confidence threshold of 0.85, fallback to a second-sentence read if uncertain. In a year of running this in production, it has misclassified less than 0.4% of inbounds. The misclassifications are almost always Spanglish openings ("Hi, está el carro disponible?"), which we now treat as Spanish-primary.
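The detection flow can be sketched like this. The scoring function below is a deliberately toy stand-in for the real classifier (a marker-word count, nothing more); the threshold, the second-sentence fallback, and the Spanish-primary rule for Spanglish openings are the parts taken from the text.

```python
import re

# Toy marker sets standing in for a trained classifier -- illustrative only.
ES_MARKERS = {"hola", "está", "el", "la", "carro", "disponible", "gracias"}
EN_MARKERS = {"hi", "hello", "the", "is", "car", "available", "thanks"}

def tokenize(sentence: str) -> list[str]:
    return re.findall(r"\w+", sentence.lower())

def classify(sentence: str) -> tuple[str, float]:
    """Stand-in for the production classifier: returns (language, confidence)."""
    words = tokenize(sentence)
    es = sum(w in ES_MARKERS for w in words)
    en = sum(w in EN_MARKERS for w in words)
    if es + en == 0:
        return "en", 0.5  # no signal: low confidence
    return ("es", es / (es + en)) if es >= en else ("en", en / (es + en))

CONFIDENCE_THRESHOLD = 0.85  # below this, read the second sentence

def detect_thread_language(sentences: list[str]) -> str:
    lang, conf = classify(sentences[0])
    if conf < CONFIDENCE_THRESHOLD and len(sentences) > 1:
        lang, conf = classify(sentences[1])  # second-sentence fallback
    # Spanglish openings carry Spanish markers even when English "wins":
    # treat any opening with Spanish content as Spanish-primary.
    if lang == "en" and any(w in ES_MARKERS for w in tokenize(sentences[0])):
        return "es"
    return lang
```

In production the classifier would be a real language-ID model; the control flow around it is the part that matters.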
The four mistakes we made first.
- i. Translating instead of selecting. Covered above. The voice gets washed out.
- ii. Replying too fast. Two-second turnarounds feel robotic. Even when the answer is ready instantly, we hold it for forty to ninety seconds. Customers trust the slower reply more.
- iii. One escalation rule. "Escalate angry messages" is too coarse. We now escalate on three signals separately: sentiment, novelty (a question we have not seen), and stakes (price, refund, complaint).
- iv. Skipping native review. The first ES launch was reviewed by a fluent — but not native — speaker. Native review caught five idioms that read as off in the second pass. We now make it a launch gate.
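The three-signal escalation in mistake iii can be sketched as independent checks rather than one coarse rule. The thresholds and keyword list here are hypothetical placeholders; the structure — each signal evaluated on its own — is the fix.

```python
from dataclasses import dataclass

@dataclass
class Inbound:
    text: str
    sentiment: float   # -1 (angry) .. 1 (happy), from a sentiment model
    seen_before: bool  # does this question match anything in the playbook?

# Illustrative stakes vocabulary, bilingual like everything else.
STAKES_KEYWORDS = {"price", "refund", "complaint", "precio", "reembolso", "queja"}

def escalation_reasons(msg: Inbound) -> list[str]:
    """Evaluate the three signals separately, not as one 'angry' rule."""
    reasons = []
    if msg.sentiment < -0.5:  # sentiment (threshold is a placeholder)
        reasons.append("sentiment")
    if not msg.seen_before:   # novelty: a question we have not seen
        reasons.append("novelty")
    if STAKES_KEYWORDS & set(msg.text.lower().split()):  # stakes
        reasons.append("stakes")
    return reasons
```

Returning the list of reasons, rather than a boolean, also tells the human who picks up the thread why it was escalated.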
What we are still working on.
Voice notes. WhatsApp users in our verticals send voice messages constantly — a thirty-second ramble with the make, model, year, and budget all in one breath. Speech-to-text is fine in EN, less fine in regional Spanish, and we are not yet comfortable shipping an automated reply to a voice message. For now, voice notes route to a human with a transcript and a suggested reply. It works, but it is the rough edge in every Reply playbook we run.
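The voice-note path described above reduces to a small routing function: transcribe, draft, hand off — never auto-send. The function names and the HumanTask shape are illustrative; transcription and drafting are assumed to be external services passed in.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HumanTask:
    """What the human agent receives: the transcript plus a suggested reply."""
    transcript: str
    suggested_reply: str

def handle_voice_note(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # speech-to-text, assumed external
    draft_reply: Callable[[str], str],    # the playbook drafts; it does not send
) -> HumanTask:
    transcript = transcribe(audio)
    suggestion = draft_reply(transcript)
    # Voice notes never get an automated send: a human reviews every one.
    return HumanTask(transcript=transcript, suggested_reply=suggestion)
```

The deliberate gap is that nothing in this path can reach the customer directly; the human is in the return type.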
The pattern under all of it: bilingual is not a feature, it is the assumption. Once you internalize that, the architecture stops fighting you. The customer-service voice you actually wanted — the one your best salesperson uses at 11 p.m. on a Tuesday — becomes recoverable.