ReplAI: What I Learned Building a Real-Time Conversation Coach

Here's the thing about hackathon demos: the moment when the judges look confused is either the beginning of the end or the beginning of winning. At Hack The Bay 2026, we got that look about forty seconds into our ReplAI demo. One of the judges — who turned out to be the founder of Zenardy, the company behind the challenge we'd entered — was watching the app analyze his body language in real time while he practiced a pitch to our AI. He looked skeptical, maybe a little thrown off by seeing his own eye-contact score updating live on screen.

I kept going. Explained what it was doing, why the gaze tracker mattered, what the feedback loop was actually catching. Fifteen minutes later we were still talking. We won the Zenardy Challenge, Best Use of Gemini, and placed 3rd overall out of 34 projects.

I'm not writing this to recap a win. I'm writing it because ReplAI involved a bunch of non-obvious decisions, and most of them aren't apparent from a demo or a repo.

The problem

ReplAI is a conversation coach. You pick a scenario — salary negotiation, a tough HR conversation, a legal client intake — and an AI plays the other person while the app watches you. Not just what you say, but how you say it: are you making eye contact, how's your posture, are you speaking at a reasonable pace. After each exchange you get coaching feedback.

The core insight is that most conversation coaching tools give feedback after the fact. You record yourself, you watch it back, you cringe, you close the tab. We wanted feedback during the conversation, at a pace where it's actually usable. That constraint — real-time, low-latency, multi-modal — drove almost every technical decision we made.

Why each tool exists

MediaPipe for body language

I needed something that could track facial landmarks in a browser, in real time, without asking the user to install anything. MediaPipe Face Mesh gives you 468 facial landmarks at 30fps in JavaScript, including the ones you need to estimate gaze direction. It's not perfect — it struggles in low light and with certain camera angles — but it's fast, it's free, and it runs in-browser without a server round-trip. That last part mattered a lot for a 24-hour hackathon where we couldn't count on a backend staying up.

LiveKit for WebRTC

I needed low-latency audio and video in the browser. Building that from scratch means navigating ICE negotiation, codec selection, STUN/TURN servers — all of which would have eaten half my hackathon time. LiveKit wraps WebRTC in an SDK that handles the plumbing and exposes audio analysis (speech pace, filler words, energy levels) as a first-class feature. That prosody data feeds directly into the feedback loop without any additional processing on my end.

Gemini 2.5 Flash for the AI side

I needed something fast enough to close the conversational loop. GPT-4 is capable, but its latency at our tier was high enough to break the illusion of a live conversation. Gemini 2.5 Flash turned out to be fast enough that we could keep speech-to-feedback latency under 300 ms on a decent connection, which we treated as the threshold below which feedback feels responsive rather than delayed.

These three systems tie together through a shared event emitter. MediaPipe fires gaze and posture events; LiveKit fires prosody events; both attach to a context object that goes into each Gemini prompt. The LLM channel has backpressure on it — if a feedback request is in flight, new events queue rather than firing another request. Without that guard you get a feedback storm where the AI is commenting on every half-second of movement, which is overwhelming and useless.
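Here is a minimal sketch of that backpressure guard, written in Python for brevity rather than the browser code we actually ran. The class name, the method names, and the `send_request` callback are all hypothetical. Folding everything that queued up into a single follow-up request is one reasonable reading of "queue", not necessarily what the demo did.

```python
from collections import deque
from typing import Any, Callable


class FeedbackChannel:
    """Allows at most one LLM feedback request in flight at a time."""

    def __init__(self, send_request: Callable[[dict], None]) -> None:
        # Hypothetical callback that starts an asynchronous LLM call.
        self._send_request = send_request
        self._context: dict = {}      # latest payload per signal source
        self._pending: deque = deque()  # events that arrived mid-request
        self._in_flight = False

    def on_event(self, source: str, payload: Any) -> None:
        """Called by the gaze/posture and prosody trackers."""
        if self._in_flight:
            # Backpressure: remember the event, but do not fire again.
            self._pending.append((source, payload))
            return
        self._context[source] = payload
        self._fire()

    def on_response(self) -> None:
        """Called when the in-flight request has completed."""
        self._in_flight = False
        if not self._pending:
            return
        # Assumption: coalesce the backlog into one context and send once.
        while self._pending:
            source, payload = self._pending.popleft()
            self._context[source] = payload
        self._fire()

    def _fire(self) -> None:
        self._in_flight = True
        self._send_request(dict(self._context))  # copy, so later events don't mutate it


# Usage: three events arrive, but only two requests are ever sent.
sent: list = []
channel = FeedbackChannel(sent.append)
channel.on_event("gaze", {"eye_contact": 0.8})  # fires the first request
channel.on_event("prosody", {"wpm": 170})       # queued
channel.on_event("gaze", {"eye_contact": 0.4})  # queued
channel.on_response()                           # fires one merged follow-up
```

The point of the design is that the number of LLM calls is bounded by how fast responses come back, not by how fast the trackers emit events.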

The conflict detection piece

The most technically interesting part of ReplAI is something we almost didn't build: a real-time conflict-of-interest detector for the legal scenario.

The setup: a lawyer is practicing a client intake conversation. Midway through, the client mentions a company. ReplAI should flag if that company is an adverse party in any of the firm's active cases — before the lawyer accidentally says something they shouldn't.

The naive approach — feed the transcript to an LLM and ask — was too slow and too expensive to run on every new sentence. So I built a lightweight pipeline that does two things:

1. Runs named entity recognition on the streaming transcript to extract company and person names as they come in.
2. Matches those entities against a pre-loaded firm database using a hybrid Levenshtein + Jaccard scorer.

Levenshtein catches typos and minor name variations ("Acme Corp" vs "ACME Corporation"). Jaccard handles partial overlaps ("Google" matching "Google DeepMind"). Together they surface likely conflicts faster and more cheaply than asking a language model to figure it out from scratch each time. The LLM only gets involved to confirm and explain a flagged match, not to do the matching itself.
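To make the matching step concrete, here is a rough sketch of a hybrid scorer in Python. Several things in it are assumptions for illustration rather than the demo code: the normalization step (lowercasing and dropping a small, made-up list of corporate suffixes), taking the maximum of the two similarities as the way to combine them, and the 0.5 threshold. The function names are hypothetical too.

```python
import re

# Assumed, deliberately short list of suffixes to ignore when comparing names.
SUFFIXES = {"corp", "corporation", "inc", "llc", "ltd", "co"}


def normalize(name: str) -> list[str]:
    """Lowercase, split into alphanumeric tokens, drop corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return [t for t in tokens if t not in SUFFIXES]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap: shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_score(mention: str, candidate: str) -> float:
    """Assumption: a match on either signal is enough, so take the max."""
    ta, tb = normalize(mention), normalize(candidate)
    return max(edit_similarity(" ".join(ta), " ".join(tb)),
               jaccard(set(ta), set(tb)))


def find_conflicts(mention: str, adverse_parties: list[str],
                   threshold: float = 0.5) -> list[tuple[float, str]]:
    """Return (score, party) pairs at or above the threshold, best first."""
    scored = [(hybrid_score(mention, party), party) for party in adverse_parties]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

With this normalization, "Acme Corp" and "ACME Corporation" reduce to the same token and score 1.0, while "Google" against "Google DeepMind" is carried by the Jaccard half. Only the pairs that `find_conflicts` returns would be handed to the LLM for confirmation.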

This was the piece that stopped the Zenardy founder mid-demo. He asked me to walk him through it. Then he asked how it would handle international company names with transliterations. I had actually thought about that, and the conversation that followed ran a good fifteen minutes.

What I'd change

The MediaPipe gaze estimation is shakier than I'd like. It works well under good lighting with a decent camera and falls apart on older laptops in a dim room. A real product would need a calibration step upfront — establish a baseline for this specific user's face before scoring their eye contact.
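A calibration step could look something like the sketch below. This is not something we built. It assumes the tracker already reduces each frame to a single hypothetical `gaze_offset` number (for example, where the iris sits between the eye corners), and the tolerance and minimum-spread values are placeholders I have not tuned.

```python
from statistics import mean, pstdev


class GazeBaseline:
    """Per-user baseline recorded while the user looks straight at the camera."""

    def __init__(self, samples: list[float], tolerance: float = 2.0,
                 min_spread: float = 0.02) -> None:
        if not samples:
            raise ValueError("calibration needs at least one sample")
        self.center = mean(samples)
        # Floor the spread so a very steady calibration doesn't make
        # the eye-contact window unreasonably narrow.
        self.spread = max(pstdev(samples), min_spread)
        self.tolerance = tolerance

    def is_eye_contact(self, gaze_offset: float) -> bool:
        """True if the offset is within `tolerance` spreads of the baseline."""
        return abs(gaze_offset - self.center) <= self.tolerance * self.spread


def eye_contact_score(baseline: GazeBaseline, offsets: list[float]) -> float:
    """Fraction of frames counted as eye contact, from 0.0 to 1.0."""
    if not offsets:
        return 0.0
    return sum(baseline.is_eye_contact(o) for o in offsets) / len(offsets)
```

Scoring against the user's own baseline, instead of a fixed idea of "centered", is what would let the same threshold work across different cameras and seating positions.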

The multi-modal weights (gaze, prosody, dialogue) are currently static. There's a version of this where the feedback model learns which signals matter most for which scenario type — negotiation might weight vocal confidence more than eye contact; legal intake might be the opposite. We didn't have time for that in 24 hours, and I'm not sure I'd trust a self-trained weighting model on a prototype anyway.
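Short of a learned model, a hand-set weight table per scenario would be a cheap middle ground. The sketch below assumes each signal has already been reduced to a 0..1 score. The scenario keys and the weight values are invented for illustration and are not the numbers from the demo.

```python
# Hypothetical per-scenario weights; the values are illustrative only.
SCENARIO_WEIGHTS: dict[str, dict[str, float]] = {
    "negotiation": {"gaze": 0.2, "prosody": 0.5, "dialogue": 0.3},
    "legal_intake": {"gaze": 0.4, "prosody": 0.2, "dialogue": 0.4},
}


def combined_score(signals: dict[str, float], scenario: str) -> float:
    """Weighted average of the per-signal scores for the given scenario."""
    weights = SCENARIO_WEIGHTS[scenario]
    total = sum(weights.values())  # normalize in case weights don't sum to 1
    return sum(w * signals[name] for name, w in weights.items()) / total
```

The same raw signals then produce different overall scores depending on the scenario, which is the behavior a learned weighting would eventually replace.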

The conflict-detection corpus is also mocked for the demo. In production you'd want a real case management API integration, which is a completely different engineering problem.