
Modernizing chatbots with LLM preprocessing


#AI and Machine Learning · #Engineering · #Healthcare

Healthcare chatbots traditionally rely on rigid intent-matching systems, limiting their accuracy and negatively impacting user experiences. Large Language Models (LLMs) offer more natural, intuitive conversations, yet their generative nature introduces risks like hallucinations or inaccurate medical responses — unacceptable in settings where accuracy, safety, and trust are critical, as in healthcare.

A promising middle ground is LLM preprocessing, where LLMs clarify user questions and improve intent matching while ensuring controlled, predefined responses. This approach proved effective in our recent partnership with Planned Parenthood Federation of America (PPFA), a leading nonprofit providing sexual and reproductive healthcare nationwide, significantly improving their chatbot accuracy without sacrificing control.

The Problem with Traditional Chatbots

The fundamental issue with traditional chatbots lies in their rigid design. They excel when users ask questions exactly as anticipated, but healthcare conversations rarely follow scripts. Consider the many ways someone might ask about contraception, from clinical language to slang or hesitant euphemisms. Traditional intent-matching systems fail at this variability, often defaulting to generic responses or misunderstanding the question entirely, especially when faced with typos or misspellings. Additionally, the lack of conversational memory makes follow-up questions meaningless. For example, “How long does it last?” is ambiguous if the chatbot can’t connect it to the previous message.

The Risks of Fully Generative Chatbots

While LLMs offer conversational fluency, fully generative chatbots pose serious risks—especially in regulated or high-stakes domains like healthcare. By design, these models generate responses on the fly, which means you can’t fully predict or constrain what they will say. This opens the door to hallucinations: confident-sounding answers that are factually wrong. Even more concerning are responses that may be misleading, insensitive, or harmful, particularly when handling sensitive topics like pregnancy, sexual assault, or mental health. In these cases, lack of output control isn’t just a technical flaw—it’s a safety issue.

What Is LLM-Based Preprocessing?

LLM preprocessing offers a strategic middle ground: it harnesses the advanced language understanding of LLMs while maintaining control over chatbot outputs. Here, the LLM serves as an advanced input processor and not a response generator.

The model takes raw user questions and refines them into clear, structured queries by correcting typos, extracting medical concepts, and translating informal language into standardized terminology. For example, a hesitant question like “my monthly thing has been weird lately” becomes precise, searchable terms such as “irregular menstrual cycle symptoms”.
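
To make this concrete, the preprocessing step can be framed as a single, tightly scoped prompt whose only job is to rewrite the user’s message. Below is a minimal sketch in Python; the call_llm helper and the prompt wording are illustrative assumptions, not the actual prompt used in the PPFA work.

```python
# Minimal sketch of an LLM preprocessing step.
# `call_llm` is a hypothetical helper that sends a prompt to whichever
# LLM provider you use and returns the model's text response.

PREPROCESS_PROMPT = """You are an input normalizer for a healthcare chatbot.
Rewrite the user's message as a short, clinical search query:
- fix typos and spelling
- replace slang or euphemisms with standard medical terminology
- keep only the information needed to find an answer
Return only the rewritten query, nothing else.

User message: {message}
Rewritten query:"""


def preprocess(message: str, call_llm) -> str:
    """Turn a raw, informal user message into a standardized query."""
    prompt = PREPROCESS_PROMPT.format(message=message)
    return call_llm(prompt).strip()


# Example (actual output depends on the model you plug in):
# preprocess("my monthly thing has been weird lately", call_llm)
# -> "irregular menstrual cycle symptoms"
```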

This preprocessed input then flows into one of two controlled pathways, depending on the requirement:

  • RAG Path: The refined input is used to perform a search across a controlled knowledge base of predefined answers. The system retrieves relevant, pre-approved responses and uses a reranking technique to select the most appropriate one (a minimal sketch of this path follows below).

  • Hybrid Path: Clarified intent or keywords get passed to the underlying chatbot or NLU engine, which now receives clean, standardized input. This path is particularly valuable when the underlying system can handle human-in-the-loop escalation or provides detailed analytics for continuous improvement.

The LLM never generates the final response. It only prepares the input for the underlying system to handle. The result is dramatically improved intent matching and answer accuracy while maintaining control over responses.
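
To make the RAG path concrete, here is a minimal retrieval sketch over a fixed set of pre-approved answers. It assumes an embed function standing in for whatever embedding model you use, and it reduces reranking to picking the highest cosine similarity; a production system would typically add a dedicated reranker on top of the retrieved candidates.

```python
import numpy as np

# Sketch of the RAG path: the preprocessed query is matched against a
# closed set of pre-approved answers; nothing is generated on the fly.
# `embed` is a placeholder for the embedding model of your choice.

APPROVED_ANSWERS = [
    "Irregular menstrual cycles can be caused by ...",
    "Emergency contraception is most effective when ...",
    # ... the full, SME-reviewed knowledge base goes here
]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve(query: str, embed, top_k: int = 3) -> list[str]:
    """Return the top_k pre-approved answers most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(answer)), answer) for answer in APPROVED_ANSWERS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [answer for _, answer in scored[:top_k]]
```

The candidates returned by retrieve() would then go through the reranking step, and the single best pre-approved answer is returned to the user verbatim.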

Use Cases Where This Is Ideal

This approach is valuable in domains where accuracy, safety, and compliance are non-negotiable. This includes healthcare, legal, and financial chatbots—where hallucinated or off-script answers can have serious consequences. It’s also well-suited for systems with strict regulatory or brand guidelines, where responses must remain traceable, controlled, and auditable.

Implementation Considerations

Modernizing a chatbot with LLM preprocessing requires thoughtful preparation to ensure measurable outcomes, clear success criteria, and continuous improvement. It starts with defining what a “good answer” looks like and putting the right systems in place to measure impact.

Define What a “Good Answer” Looks Like

What qualifies as a correct response will vary by domain, but clarity around this definition is essential for both evaluation and improvement. A strong answer is relevant to the user’s question, specific when precision is needed, and always safe, avoiding misinformation or liability. In sensitive domains like healthcare, it's also critical to define clear escalation logic: when should the bot hand off to a human agent? Aligning on these dimensions (relevancy, specificity, safety, and escalation criteria) with domain experts early ensures that downstream metrics and evaluation reflect real-world expectations.
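
One lightweight way to make these dimensions actionable is to encode them as an explicit rubric that reviewers and automated checks can share. The sketch below is purely illustrative: the four fields mirror the dimensions above, but the escalation triggers are placeholder examples, not an actual clinical policy.

```python
from dataclasses import dataclass

# Illustrative rubric for judging a "good answer".
# The escalation triggers are placeholders, not a real clinical policy.

ESCALATION_TRIGGERS = {"emergency", "severe pain", "suicidal"}


@dataclass
class AnswerJudgment:
    relevant: bool         # addresses the user's actual question
    specific: bool         # precise where precision matters
    safe: bool             # no misinformation or harmful guidance
    should_escalate: bool  # should have been handed to a human agent

    @property
    def is_good(self) -> bool:
        return self.relevant and self.specific and self.safe and not self.should_escalate


def needs_human(query: str) -> bool:
    """Very rough escalation check based on trigger phrases."""
    lowered = query.lower()
    return any(trigger in lowered for trigger in ESCALATION_TRIGGERS)
```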

Establish an Evaluation Pipeline

Once “good answer” criteria are defined, the next step is building a pipeline to evaluate the real-world impact of LLM preprocessing. Measuring this impact isn’t optional; it is necessary to validate that the system is working as intended and to identify where to improve.

One practical approach is to create an evaluation dataset using historical user questions. LLMs can assist in associating expected responses with these questions, which subject matter experts (SMEs) can then review and validate. This dataset becomes the foundation for comparing chatbot behavior before and after preprocessing is introduced.

To ensure the pipeline reflects real usage, include diverse inputs: typos, ambiguous phrasing, and edge cases. Then establish key metrics such as Exact Match, Semantic Correctness, False Redirection, and Missed Redirection. These metrics reveal not just what’s working, but where the system needs refinement. A structured, repeatable evaluation process enables confident iteration and helps build stakeholder trust by demonstrating measurable, sustained improvements.
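
As a sketch of how these metrics can be computed, each record in the evaluation dataset pairs a historical question with its SME-validated expected answer and an expected-escalation flag. The record fields, the chatbot interface, and the semantically_equivalent check (in practice an embedding-similarity or LLM-as-judge comparison) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Sketch of an evaluation pass over the SME-validated dataset.
# `semantically_equivalent` is a placeholder for an embedding-similarity
# or LLM-as-judge comparison of two answers.


@dataclass
class EvalRecord:
    question: str
    expected_answer: str
    expect_escalation: bool  # SMEs flagged this question as needing a human


def evaluate(records: list[EvalRecord], chatbot, semantically_equivalent) -> dict:
    exact = semantic = false_redirect = missed_redirect = 0
    for r in records:
        answer, escalated = chatbot(r.question)  # (response text, was it redirected?)
        exact += answer == r.expected_answer
        semantic += semantically_equivalent(answer, r.expected_answer)
        false_redirect += escalated and not r.expect_escalation
        missed_redirect += r.expect_escalation and not escalated
    n = len(records)
    return {
        "exact_match": exact / n,
        "semantic_correctness": semantic / n,
        "false_redirection": false_redirect / n,
        "missed_redirection": missed_redirect / n,
    }
```

Running evaluate() on the same dataset before and after the preprocessing layer is introduced gives a direct, like-for-like measure of its impact.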

Some Important Considerations

While LLM preprocessing offers clear benefits, it comes with trade-offs. Running an additional LLM layer introduces latency, which may affect real-time performance if not properly optimized. This can often be mitigated by caching frequent queries or by using smaller, faster models for preprocessing. There’s also the cost of inference, particularly with large models in high-traffic environments. Still, in regulated or high-risk domains, the added overhead is often justified by reduced manual escalations, improved user experience, and stronger alignment with compliance standards.
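
One of the simpler mitigations mentioned above is caching the preprocessing step so repeated questions skip the extra LLM call. Here is a minimal in-process sketch; run_llm_preprocessing is a placeholder for the actual preprocessing call, and a real deployment would more likely use a shared cache (such as Redis) with an expiry policy.

```python
from functools import lru_cache

# In-process cache for the preprocessing step: identical (normalized)
# user messages reuse a previous result instead of calling the LLM again.
# `run_llm_preprocessing` is a placeholder for your preprocessing call.


def run_llm_preprocessing(message: str) -> str:
    raise NotImplementedError("plug in your LLM preprocessing call here")


@lru_cache(maxsize=10_000)
def cached_preprocess(normalized_message: str) -> str:
    return run_llm_preprocessing(normalized_message)


def handle_message(message: str) -> str:
    # Light normalization so trivially different strings share a cache entry.
    return cached_preprocess(" ".join(message.lower().split()))
```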

Summary

LLM preprocessing offers a low-risk, high-impact way to modernize traditional chatbots—improving intent recognition and response reliability while maintaining full control over outputs.

In our work with Planned Parenthood Federation of America (PPFA), this approach led to a 2.5× improvement in response accuracy and a 2.3× reduction in false redirections to human agents over the evaluation dataset, directly improving user experience and reducing support burden.

To fully realize the benefits of this architecture, organizations should leverage domain expertise and prioritize careful data preparation to define meaningful performance benchmarks. This foundation results in a more trustworthy evaluation and accelerates meaningful impact at scale.

Interested in learning how Toboggan Labs can help your organization navigate similar AI implementation challenges? Contact us to discuss your specific use case.
