The key things to keep in mind when building GenAI applications

You've clicked the link, so I presume you want to explore the dos and don'ts of making GenAI apps safe and useful. After all, this is a new technology, and if a single thing goes wrong, it can undermine customers' trust in a product.

I first used a Transformer generative model in 2019, when Google released T5. It was pretty impressive compared to LSTMs, but in the last five years this class of models has moved far ahead in capabilities and ease of use. In 2019, our primary concerns were mostly centered on fine-tuning and making sure we didn't destroy the foundation model while tailoring the system to the use case. Now, people often use generalist models, with RAG becoming the de facto standard for contextualizing them. Coupled with emergent behaviors (i.e., when a model does something you didn't know it could), this creates a very different landscape of risks to consider when building modern GenAI applications.

1) Only trust the model to produce customer-facing output in sensitive domains if supervised.

Please do supervise models for any application more advanced than a campus navigation chatbot. I've seen a model congratulate a message recipient on the loss of their pet. This happened because the model was prompted to retrieve the person's latest message to personalize the response AND to keep the writing always positive.

Supervision doesn't necessarily have to be manual. You can use a text similarity guardrail to intercept output that touches on politics or other sensitive topics.

Guide: Sensitive topics guardrailing
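
Here is a minimal sketch of such a guardrail, assuming the sentence-transformers library; the model name, topic list, and threshold are placeholders you would tune for your own domain:

```python
# Minimal sketch of a text-similarity guardrail (illustrative only).
# Assumes the sentence-transformers library; the topics and threshold
# below are placeholders, not recommendations.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

SENSITIVE_TOPICS = [
    "politics and elections",
    "religion",
    "medical or legal advice",
    "death of a loved one or a pet",
]
topic_embeddings = encoder.encode(SENSITIVE_TOPICS, convert_to_tensor=True)

def needs_human_review(model_output: str, threshold: float = 0.45) -> bool:
    """Flag output for supervision if it is semantically close to a sensitive topic."""
    output_embedding = encoder.encode(model_output, convert_to_tensor=True)
    similarity = util.cos_sim(output_embedding, topic_embeddings)
    return bool(similarity.max() >= threshold)

reply = "I'm so glad to hear about the loss of your pet!"
if needs_human_review(reply):
    pass  # route to a human reviewer or a fallback template instead of sending
```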

2) Don't rely on LLM knowledge.

Models are a lot more likely to hallucinate if you ask them for data stored in their memory. Ideally, a model should be prompted to use only the data in the prompt. RAG or agents, in one way or another, are almost always strictly preferable to pure fine-tuning or to using a generalist model without a little external help. Fine-tuned models are also more likely to hallucinate.

Guide: Building a RAG app
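
A minimal sketch of what "only use the data in the prompt" looks like in practice; the `retriever` and `llm` objects below are hypothetical stand-ins for your own vector store and LLM client:

```python
# A minimal sketch of grounding a model in retrieved context rather than
# its parametric memory. The retriever and llm calls are placeholders;
# swap in your own vector store and LLM SDK.
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

question = "What is our refund window?"
passages = retriever.search(question, top_k=3)   # hypothetical retriever
prompt = build_grounded_prompt(question, passages)
answer = llm.complete(prompt)                    # hypothetical LLM client
```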

3) Don't trust the LLM to produce structured output without validation.

Oh, if only I were paid a cent every time a model somewhere produced a good-looking link that leads nowhere and was confidently served to a user.

The model will likely mess up any machine-readable format (JSON, URI, CSV). Validate links with a guardrail, and validate JSON or CSV against a schema if the downstream consumer is an application (unless you're serving raw JSON to your users, but if so, there are bigger problems to look into than consistency).

Guide: Agents with a structured output
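
A minimal sketch of that validation step, assuming Pydantic v2 and the requests library; the schema and the raw output are illustrative:

```python
# A minimal sketch of validating LLM output before it reaches a downstream
# consumer. Assumes Pydantic v2; the schema and raw output are illustrative.
import requests
from pydantic import BaseModel, HttpUrl, ValidationError

class ProductCard(BaseModel):
    title: str
    price_usd: float
    docs_link: HttpUrl

def link_resolves(url: str, timeout: float = 3.0) -> bool:
    """Guardrail: reject confident-looking links that lead nowhere."""
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout).ok
    except requests.RequestException:
        return False

raw_output = '{"title": "Acme API", "price_usd": 19.0, "docs_link": "https://example.com/docs"}'
try:
    card = ProductCard.model_validate_json(raw_output)
    if not link_resolves(str(card.docs_link)):
        raise ValueError("Model produced a link that does not resolve")
except (ValidationError, ValueError):
    pass  # retry generation, fall back to a template, or drop the field
```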

4) Keep costs in mind.

GPT-4o is all fun until you see a $1,000 daily bill as your product scales. We once deployed a product that made only light use of Claude. It had to be killed almost immediately after it racked up a $98 charge in one hour with just a handful of active customers. With uncontrolled output sizes, the cost of using the latest-gen models is unpredictable. A simple mitigation is prompting like "Keep the output to 200 tokens or less", which makes it possible to model costs with reasonable confidence across growth scenarios.
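
Once output is capped (most chat APIs also expose a hard max-output-token parameter, which makes the bound enforceable rather than advisory), a back-of-the-envelope cost estimate becomes simple arithmetic. The prices and token counts below are placeholder assumptions, not current rates:

```python
# A minimal sketch of bounding output size and estimating per-request cost.
# Prices, token counts, and request volumes are illustrative placeholders;
# check your provider's pricing page.
MAX_OUTPUT_TOKENS = 200
PROMPT_SUFFIX = "\n\nKeep the output to 200 tokens or less."

def estimate_request_cost(input_tokens: int,
                          output_tokens: int = MAX_OUTPUT_TOKENS,
                          price_in_per_1k: float = 0.0025,    # assumed input price, USD / 1K tokens
                          price_out_per_1k: float = 0.01) -> float:  # assumed output price
    """Upper-bound cost of a single call when output is capped."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# e.g. 1,500 prompt tokens per request, 50,000 requests per day
daily_cost = estimate_request_cost(1500) * 50_000
print(f"Estimated daily spend: ${daily_cost:,.2f}")
```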

Nevertheless, planning your options in advance is always a good strategy. Does the product really need a 340B-parameter model? Surprisingly often, much smaller models with some clever engineering do the trick. If Llama-70B is not an option, how much cheaper would Claude or GPT need to be for your product's unit economics to work?

Guide: Do scenario modeling and reach out to consultants who build and tune custom models to estimate the threshold at which migrating to your own model would make sense (that threshold may be far off, but having a metric in mind is very useful).
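
A rough sketch of that scenario modeling: compare the monthly bill of a hosted API against a self-hosted model at different request volumes. Every number below is a hypothetical placeholder to replace with your own quotes:

```python
# Hypothetical scenario model: hosted API vs. self-hosted model.
# All rates and volumes are placeholders, not real prices.
def api_monthly_cost(requests_per_day: int, cost_per_request: float) -> float:
    return requests_per_day * 30 * cost_per_request

def self_hosted_monthly_cost(gpu_hourly_rate: float = 4.0,   # assumed GPU cost, USD/hour
                             gpus: int = 2,
                             fixed_ops_overhead: float = 3000.0) -> float:
    """Roughly flat with volume until you need more replicas."""
    return gpu_hourly_rate * gpus * 24 * 30 + fixed_ops_overhead

for requests_per_day in (1_000, 10_000, 100_000):
    api = api_monthly_cost(requests_per_day, cost_per_request=0.02)
    hosted = self_hosted_monthly_cost()
    cheaper = "self-hosted" if hosted < api else "hosted API"
    print(f"{requests_per_day:>7} req/day: API ${api:,.0f} vs self-hosted ${hosted:,.0f} -> {cheaper}")
```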

Have fun and build safe systems!