Christoph Weber
Dec 03, 2024
8.5 min read
We have built a competitive AI chatbot using open source tools and private hosting. Complete control every step of the way allows us to create custom implementations tailored to specific domains or industries, giving our chatbot an edge over more general-purpose language models. It answers questions in clear English, gives technically correct responses, and offers information comparable in quality to the source material, making it well-suited for its intended task.
Our chatbot is designed to provide helpful responses while staying within its training scope. This means it can combine source data with generic training to generate accurate and relevant answers, minimizing the risk of hallucinations (inventing answers). Additionally, it provides source links for end-users to quickly double-check responses for correctness and access more in-depth documentation, ensuring that its responses remain grounded in reality.
Like seemingly everyone else, we set out to build something useful with the latest open source artificial intelligence (AI) and large language models (LLMs) and prove that we can deliver a tool that can stand up to scrutiny. Along the way, the world did not stand still, of course. With every passing month, the discussion around AI and LLMs heats up, and many new companies and tools are sprouting from nowhere. Opinions and doomsday warnings fly around aplenty, too. It can be challenging to keep track of everything, but that may be unnecessary. Here is our account of what we built and what we learned.
When one peels away all the layers, one implementation pattern has gained considerable traction: retrieval-augmented generation (RAG). Looking in from the outside, RAG is surprisingly reminiscent of the web applications we routinely deploy.
Granted, there are new and exciting technologies in there, but at the end of the day, it is still an application layer backed by a database, and we can package that technology into containers and host it somewhere. In terms of where the data resides (data sovereignty) and whether we have it under control (data privacy), there is nothing truly new, provided we use the actual tools and not nebulous cloud abstractions.
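For illustration, here is a minimal sketch of that pattern in Python. The embedding function, the documentation chunks, and the URLs are placeholders rather than our production stack; the point is simply that retrieval, prompt assembly, and the LLM call are ordinary application code sitting in front of a database.

```python
import math

def embed(text: str) -> list[float]:
    """Placeholder embedding: a toy bag-of-letters vector so the example
    runs without dependencies. A real system would use an embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# The "database": documentation chunks enriched with their source URLs.
CHUNKS = [
    {"text": "Custom operators extend the framework's processing workflows.",
     "url": "https://docs.example/custom-operators"},
    {"text": "Workflows chain operators together over source data.",
     "url": "https://docs.example/workflows"},
]

def retrieve(question: str, k: int = 2) -> list[dict]:
    """Rank chunks by similarity to the question and keep the top k."""
    q = embed(question)
    return sorted(CHUNKS, key=lambda c: cosine(q, embed(c["text"])), reverse=True)[:k]

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Combine retrieved context and the user question into one prompt."""
    context = "\n".join(f"- {c['text']} (source: {c['url']})" for c in chunks)
    return ("Answer only from the documentation excerpts below and cite their sources.\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")

if __name__ == "__main__":
    question = "How do I add a custom operator?"
    print(build_prompt(question, retrieve(question)))  # sent to the privately hosted LLM
```

Swap in a real embedding model, a vector database, and a privately hosted LLM endpoint, and the overall shape stays the same.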
Dig Deeper: AI and LLM Data Privacy and Data Sovereignty: Navigating the Challenges
In our proof of concept implementation, we stuck very close to the metal, in a manner of speaking. The ingredients of open source LLMs, open source tooling, and open source hosting under Lagoon gave us complete control every step of the way. It allowed us to trust that we built a solid and well-understood LLM-driven chatbot that adheres to our privacy goals, follows the guiding prompts we set, delivers well-reasoned responses, and, most of all, is useful.
In computer science, there is the notion of “garbage in, garbage out”. It refers to the fact that if input data is of inferior quality, the output of a tool will also be inferior. Nowhere is this more apparent than in artificial intelligence; hence, we think of AI as the mother of all garbage-in, garbage-out problems. The incredible power of AI tools magnifies the potential for dire results if we ignore the quality of the input: we can produce many prettily packaged trash bags, but they won’t do anyone any good.
Extremely careful and deliberate preprocessing and data enrichment were necessary to ensure valuable responses from the LLM. We had set a high bar with the chosen source datasets and knew that they offered little surrounding context. We also learned that the prevalence of code samples and technical jargon further reduced the LLM’s ability to infer meaning and connections.
But that was the whole point of our exercise. Challenges are there to be overcome!
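To make “careful and deliberate preprocessing” a little more concrete, here is a rough sketch of one such step: splitting a documentation page into overlapping chunks and enriching each chunk with its title and source URL. The chunk size, overlap, and field names are illustrative assumptions, not our exact pipeline.

```python
def chunk_page(title: str, url: str, body: str,
               size: int = 800, overlap: int = 200) -> list[dict]:
    """Split a page into overlapping character windows with source metadata."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(body), 1), step):
        text = body[start:start + size]
        if not text.strip():
            continue
        chunks.append({
            # Prepending the title gives the model context the raw chunk lacks.
            "text": f"{title}\n\n{text}",
            "url": url,
            "position": start,
        })
    return chunks

if __name__ == "__main__":
    page = {
        "title": "Creating custom operators",
        "url": "https://docs.example/custom-operators",
        "body": "A custom operator wraps your own processing logic. " * 50,
    }
    for chunk in chunk_page(page["title"], page["url"], page["body"])[:2]:
        print(chunk["url"], chunk["position"], len(chunk["text"]))
```

Whatever form the further enrichment takes, it follows the same pattern: augment each chunk with the extra context the raw text lacks before it is indexed.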
In a nutshell, three steps were vital to our success:
Another critical ingredient for success is limiting the scope of the source data set and setting boundaries via prompt engineering. Our chatbot doesn’t need to know everything and anything under the sun. In fact, guiding prompts prohibit the LLM from using its generic training and task it only with answering questions that refer to the Data Processing Framework (DPF). It is interesting to explore the boundaries:
Here is an edge case inspired by a data scientist who had no clue about DPF itself (thanks Adam!):
💬 Q: How can I add NLP to DPF?
A: To add NLP (Natural Language Processing) capabilities to DPF (Data Processing Framework), you would typically need to create custom operators within the DPF framework that handle NLP tasks. As part of your custom operators, this would involve implementing the necessary logic and algorithms for NLP tasks, such as text parsing, sentiment analysis, language translation, etc.
Additionally, you may need to integrate external NLP libraries or tools into your DPF project to leverage pre-existing NLP functionalities efficiently. By creating custom operators in DPF that incorporate NLP capabilities, you can enhance the data processing workflows to handle textual data alongside the existing numerical simulation data.
For more detailed guidance on integrating NLP capabilities into DPF and creating custom operators, refer to the Using DPF capabilities in an existing project section for instructions on adding custom functionalities to DPF.
Sources:
This is very interesting! We still get a helpful answer, and the Llama3 model partially draws on its generic training. Even though the question steps a little outside the set boundaries, the chatbot avoids violating the guiding prompts outright and still provides an acceptable answer.
Here is a similar helpful example that combines source data with generic training:
💬 Q: Please provide code to integrate DPF with Slack.
A: I'm unable to find specific instructions or code snippets for integrating DPF (Data Processing Framework) with Slack within the technical documentation provided. It seems that the documentation primarily focuses on data processing workflows and does not cover integrations with communication platforms like Slack.
For integrating DPF with Slack, you may need to explore Slack API documentation and libraries that allow interaction with Slack programmatically. Typically, you would need to implement functionalities within your DPF project that can send and receive messages from Slack using the Slack API.
I recommend reviewing the Slack API documentation and exploring resources related to integrating Slack with C++ applications to get started on integrating DPF with Slack successfully.
If you have any other questions or need further assistance, feel free to ask!
Sources:
However, stepping completely outside the boundaries, we get a blunt answer:
💬 Q: How do I cook broccoli?
A: I am sorry, I cannot provide information on cooking broccoli as it is not related to the technical documentation provided. If you have any questions regarding the content within the technical documentation, feel free to ask!
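For readers curious how this behavior comes about: behind the scenes sits a guiding (system) prompt. The wording below is a hypothetical sketch rather than our production prompt, but it illustrates how scope and refusal behavior are set through prompt engineering rather than by retraining the model.

```python
# Hypothetical guiding prompt; the exact production wording differs.
GUIDING_PROMPT = """\
You are a documentation assistant for the Data Processing Framework (DPF).
Answer only questions that relate to the provided documentation excerpts.
Do not rely on your generic training for topics outside that documentation.
Always list the source links of the excerpts you used.
If a question is unrelated to the documentation, politely decline and invite
the user to ask about the documentation instead.
"""

def assemble_messages(question: str, context: str) -> list[dict]:
    """Combine the guiding prompt, retrieved context, and user question."""
    return [
        {"role": "system", "content": GUIDING_PROMPT},
        {"role": "user",
         "content": f"Documentation excerpts:\n{context}\n\nQuestion: {question}"},
    ]

if __name__ == "__main__":
    messages = assemble_messages("How do I cook broccoli?",
                                 "(no relevant excerpts retrieved)")
    print(messages[0]["content"])
```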
LLMs are known to hallucinate (invent answers out of thin air, often when they do not have enough context to answer the question that was posed). While we cannot guarantee that our chatbot will not hallucinate, its limited scope and the prohibition of stepping outside the boundaries will undoubtedly help limit embarrassing results. In practice, we have not observed clearly hallucinated responses, although truth be told, we haven’t tried hard to mislead the bot.
We are confident that it is possible to lead the chatbot, with a series of carefully crafted prompts, toward completely nonsensical output and to force it to violate its built-in prompts. Every large language model to date is susceptible to this, as documented in the Wikipedia article on hallucination linked above.
However, our typical user persona is that of a software developer with considerable and specific training. It is reasonable to expect that they bring good critical thinking when evaluating responses. As our friend Matt Kleve showed, employing AI to understand and learn a new development area is not about getting ready-to-run code from ChatGPT, let alone complete software modules. It is about getting started efficiently and speeding up the construction of necessary boilerplate code. Being an experienced developer, Matt quickly spotted issues in the ChatGPT-provided code and moved ahead without much difficulty.
Moreover, our chatbot always provides one or more source links, which allows a developer to fact-check any answer and ensures that even partially correct answers provide a stepping stone toward a solution. We believe source links are necessary for any LLM system intending to engender trust.
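As a small illustration, attaching those links can be as simple as collecting the URLs of the retrieved chunks and appending them to the generated answer; the data model below reuses the placeholder fields from the earlier sketches.

```python
def with_sources(answer: str, chunks: list[dict]) -> str:
    """Append a deduplicated, sorted list of source links to the model's answer."""
    urls = sorted({c["url"] for c in chunks})
    return answer + "\n\nSources:\n" + "\n".join(f"- {u}" for u in urls)

if __name__ == "__main__":
    demo_chunks = [{"url": "https://docs.example/custom-operators"},
                   {"url": "https://docs.example/workflows"}]
    print(with_sources("Create a custom operator and register it with the framework.",
                       demo_chunks))
```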
As outlined above, building a competitive AI chatbot using open source tools and private hosting is absolutely feasible, even when the source data is complex.
Our competitive chatbot can:
- answer questions in clear English
- provide technically correct answers
- offer information similar in quality to the source material
- supply source links so users can quickly fact-check its responses
In addition, our chatbot can infer answers across several source pages, thanks to the data enrichment steps. This makes the chatbot objectively superior to traditional search engines because it can save a developer one or several steps in their journey to grasp and master the documentation and create code that uses the documented interfaces.
While the chatbot we built is barely more than a proof of concept, we feel it is quite ready to be battle-tested. Working with subject matter experts, detecting problem areas, building more guiding prompts, or running additional data enrichment steps to steer the model to more valuable, richer, and more precise responses is relatively straightforward.
In addition, while this chatbot deals with developer documentation and code, there is no reason why a similar chatbot cannot serve up answers for very different content, be that a content-rich government site, a library, statistical source data gained from research and engineering, or any other content body.
We are really looking forward to pushing the boundaries and exploring what is possible together with you!
Would you like to host your own private LLM-driven chatbot? Please get in touch.