This article is part of VentureBeat’s special issue, “The Real Cost of AI: Performance, Efficiency and ROI at Scale.”
Model providers continue to roll out increasingly sophisticated large language models (LLMs) with longer context windows and enhanced reasoning capabilities.
This allows models to process and “think” more, but it also increases compute: The more a model takes in and puts out, the more energy it expends and the higher the costs.
Couple this with all the tinkering involved with prompting — it can take a few tries to get to the intended result, and sometimes the question at hand simply doesn’t need a model that can think like a PhD — and compute spend can get out of control.
This is giving rise to prompt ops, a whole new discipline in the dawning age of AI.
“Prompt engineering is kind of like writing, the actual creating, whereas prompt ops is like publishing, where you’re evolving the content,” Crawford Del Prete, IDC president, told VentureBeat. “The content is alive, the content is changing, and you want to make sure you’re refining that over time.”
The challenge of compute use and cost
Compute use and cost are two “related but separate concepts” in the context of LLMs, explained David Emerson, applied scientist at the Vector Institute. Generally, the price users pay scales based on both the number of input tokens (what the user prompts) and the number of output tokens (what the model delivers). However, they are not charged for behind-the-scenes actions like meta-prompts, steering instructions or retrieval-augmented generation (RAG).
While longer context allows models to process much more text at once, it directly translates to significantly more FLOPS (a measurement of compute power), he explained. Some aspects of transformer models even scale quadratically with input length if not well managed. Unnecessarily long responses can also slow down processing time and require additional compute and cost to build and maintain algorithms to post-process responses into the answer users were hoping for.
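To make that arithmetic concrete, here is a minimal sketch of how a team might estimate per-call spend from token counts; the per-token rates below are hypothetical placeholders, not any provider’s actual pricing.

```python
# Rough cost estimate for a single LLM call, priced per token.
# These rates are hypothetical placeholders, not any provider's real pricing.
PRICE_PER_INPUT_TOKEN = 2.00 / 1_000_000    # dollars per input token
PRICE_PER_OUTPUT_TOKEN = 8.00 / 1_000_000   # output tokens typically cost more

def estimate_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate dollar cost of one request from its token counts."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# A terse prompt with a direct answer vs. a kitchen-sink context and long reply.
print(f"${estimate_call_cost(60, 20):.6f}")
print(f"${estimate_call_cost(6_000, 2_000):.4f}")
```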
Typically, longer context environments incentivize providers to deliberately deliver verbose responses, said Emerson. For example, many heavier reasoning models (o3 or o1 from OpenAI, for example) will often provide long responses to even simple questions, incurring heavy computing costs.
Here’s an example:
Input: Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have?
Output: If I eat 1, I only have 1 left. I would have 5 apples if I buy 4 more.
The model not only generated more tokens than it needed to, it buried its answer. An engineer may then have to design a programmatic way to extract the final answer or ask follow-up questions like ‘What is your final answer?’ that incur even more API costs.
Alternatively, the prompt could be redesigned to guide the model to produce an immediate answer. For instance:
Input: Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have? Start your response with “The answer is”…
Or:
Input: Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have? Wrap your final answer in bold tags (<b></b>).
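Once the response format is pinned down like this, extracting the answer becomes a cheap post-processing step instead of a follow-up API call. Here is a minimal sketch in Python; the regular expressions assume the two conventions above (a leading “The answer is” or an answer wrapped in <b></b> tags) and would need adjusting for other formats.

```python
import re

def extract_answer(response: str) -> str | None:
    """Pull the final answer from a response that follows either convention:
    a sentence starting with "The answer is", or an answer in <b></b> tags."""
    prefixed = re.search(r"The answer is\s*(.+?)(?:\.|$)", response)
    if prefixed:
        return prefixed.group(1).strip()
    tagged = re.search(r"<b>(.*?)</b>", response, re.DOTALL)
    if tagged:
        return tagged.group(1).strip()
    return None  # fall back to a follow-up question or manual review

print(extract_answer("The answer is 5."))                       # -> 5
print(extract_answer("After eating 1 and buying 4: <b>5</b>"))  # -> 5
```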
“The way the question is asked can reduce the effort or cost in getting to the desired answer,” said Emerson. He also pointed out that techniques like few-shot prompting (providing a few examples of what the user is looking for) can help produce quicker outputs.
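A few-shot prompt can be as simple as prepending a couple of worked examples so the model copies the terse format rather than explaining itself. A minimal sketch, with made-up examples:

```python
# Two worked examples steer the model toward short, final-answer-only replies.
FEW_SHOT_PROMPT = """Answer the following math problems.

Q: I have 3 oranges and give away 2. How many oranges do I have?
A: 1

Q: I have 10 pens and buy 5 more. How many pens do I have?
A: 15

Q: If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have?
A:"""
```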
One danger is not knowing when to use sophisticated techniques like chain-of-thought (CoT) prompting (generating answers in steps) or self-refinement, which directly encourage models to produce many tokens or go through several iterations when generating responses, Emerson pointed out.
Not every query requires a model to analyze and re-analyze before providing an answer, he emphasized; models are often perfectly capable of answering correctly when instructed to respond directly. Additionally, misconfigured API settings, such as requesting high reasoning effort from OpenAI’s o3 when a lower-effort, cheaper call would suffice, will incur unnecessary costs.
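As a rough illustration, here is what dialing down that setting might look like with the OpenAI Python SDK; the reasoning_effort parameter applies to o-series reasoning models, and exact model names and parameter support vary by model and SDK version.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# For a simple arithmetic question, explicitly request low reasoning effort
# rather than paying for a long, high-effort deliberation. The parameter
# applies to o-series reasoning models; support varies by model and SDK version.
response = client.chat.completions.create(
    model="o3-mini",          # illustrative choice of reasoning model
    reasoning_effort="low",   # "low" | "medium" | "high"
    messages=[{
        "role": "user",
        "content": "If I have 2 apples and I buy 4 more at the store after "
                   "eating 1, how many apples do I have? Reply with just the number.",
    }],
)
print(response.choices[0].message.content)
```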
“With longer contexts, users can also be tempted to use an ‘everything but the kitchen sink’ approach, where you dump as much text as possible into a model context in the hope that doing so will help the model perform a task more accurately,” said Emerson. “While more context can help models perform tasks, it isn’t always the best or most efficient approach.”
Evolution to prompt ops
It’s no big secret that AI-optimized infrastructure can be hard to come by these days; IDC’s Del Prete pointed out that enterprises must be able to minimize the amount of GPU idle time and fill more queries into idle cycles between GPU requests.
“How do I squeeze more out of these very, very precious commodities?” he noted. “Because I’ve got to get my system utilization up, because I just don’t have the benefit of simply throwing more capacity at the problem.”
Prompt ops can go a long way toward addressing this challenge, as it ultimately manages the lifecycle of the prompt. While prompt engineering is about the quality of the prompt, prompt ops is about repeating and refining it over time, Del Prete explained.
“It’s more orchestration,” he said. “I think of it as the curation of questions and the curation of how you interact with AI to make sure you’re getting the most out of it.”
Models tend to get “fatigued,” cycling in loops where the quality of outputs degrades, he said. Prompt ops helps manage, measure, monitor and tune prompts. “I think when we look back three or four years from now, it’s going to be a whole discipline. It’ll be a skill.”
While it’s still very much an emerging field, early providers include QueryPal, Promptable, Rebuff and TrueLens. As prompt ops evolves, these platforms will continue to iterate, improve and provide real-time feedback to give users more capacity to tune prompts over time, Del Prete noted.
Eventually, he predicted, agents will be able to tune, write and structure prompts on their own. “The level of automation will increase, the level of human interaction will decrease, you’ll be able to have agents operating more autonomously in the prompts that they’re creating.”
Common prompting mistakes
Until prompt ops is fully realized, there is ultimately no perfect prompt. Some of the biggest mistakes people make, according to Emerson:
- Not being specific enough about the problem to be solved. This includes how the user wants the model to provide its answer, what should be considered when responding, constraints to take into account and other factors. “In many settings, models need a good amount of context to provide a response that meets users’ expectations,” said Emerson.
- Not taking into account the ways a problem can be simplified to narrow the scope of the response. Should the answer be within a certain range (0 to 100)? Should the answer be phrased as a multiple choice problem rather than something open-ended? Can the user provide good examples to contextualize the query? Can the problem be broken into steps for separate and simpler queries?
- Not taking advantage of structure. LLMs are very good at pattern recognition, and many can understand code. While using bullet points, itemized lists or bold markers (**) may seem “a bit cluttered” to human eyes, Emerson noted, these callouts can be beneficial for an LLM. Asking for structured outputs (such as JSON or Markdown) can also help when users are looking to process responses automatically, as in the sketch after this list.
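As a minimal sketch of that last point, a prompt can pin down a JSON shape so the reply is machine-parseable without any follow-up calls; the field names here are illustrative, not a standard.

```python
import json

# Ask for a fixed JSON shape so the reply can be parsed without follow-up calls.
PROMPT = (
    "Answer the following math problem. If I have 2 apples and I buy 4 more "
    "at the store after eating 1, how many apples do I have? "
    'Respond only with JSON of the form {"answer": <number>, "unit": "apples"}.'
)

def parse_reply(reply: str) -> int:
    """Parse the model's JSON reply; raises if the model ignored the format."""
    return int(json.loads(reply)["answer"])

print(parse_reply('{"answer": 5, "unit": "apples"}'))  # -> 5
```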
There are many other factors to consider in maintaining a production pipeline, based on engineering best practices, Emerson noted. These include:
- Making sure that the throughput of the pipeline remains consistent;
- Monitoring the performance of the prompts over time (potentially against a validation set);
- Setting up tests and early warning detection to identify pipeline issues (see the sketch after this list).
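What that monitoring might look like in practice is sketched below; call_model is a stand-in for whatever client the pipeline actually uses, and the validation set and threshold are illustrative.

```python
# Hypothetical regression check: run the production prompt over a small
# validation set and flag the pipeline if accuracy drops below a threshold.
VALIDATION_SET = [
    {"question": "I have 2 apples, eat 1, then buy 4 more. How many apples?", "expected": "5"},
    {"question": "I have 10 pens and give away 3. How many pens?", "expected": "7"},
]
ACCURACY_THRESHOLD = 0.9  # illustrative alerting threshold

def call_model(prompt: str) -> str:
    """Stand-in for the pipeline's actual LLM client."""
    raise NotImplementedError

def check_prompt_health(prompt_template: str) -> float:
    correct = 0
    for case in VALIDATION_SET:
        reply = call_model(prompt_template.format(question=case["question"]))
        correct += case["expected"] in reply
    accuracy = correct / len(VALIDATION_SET)
    if accuracy < ACCURACY_THRESHOLD:
        # Hook this into whatever alerting the pipeline already uses.
        print(f"WARNING: prompt accuracy dropped to {accuracy:.0%}")
    return accuracy
```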
Users can also take advantage of tools designed to support the prompting process. For instance, the open-source DSPy can automatically configure and optimize prompts for downstream tasks based on a few labeled examples. While this may be a fairly sophisticated example, there are many other offerings (including some built into tools like ChatGPT, Google and others) that can assist in prompt design.
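As a rough sketch of what that looks like with DSPy, the snippet below declares a task signature and lets an optimizer pick few-shot demonstrations against a small labeled set; exact class and method names vary across DSPy versions, so treat this as illustrative rather than canonical.

```python
import dspy

# Point DSPy at a language model (assumes an OpenAI-compatible API key is set;
# the model identifier is an illustrative choice).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Declare the task as a signature; DSPy generates and manages the prompt.
qa = dspy.Predict("question -> answer")

# A handful of labeled examples to optimize against (illustrative).
trainset = [
    dspy.Example(
        question="If I have 2 apples and buy 4 more after eating 1, how many do I have?",
        answer="5",
    ).with_inputs("question"),
]

def exact_match(example, prediction, trace=None):
    return example.answer.strip() == prediction.answer.strip()

# The optimizer searches for few-shot demonstrations that improve the metric.
optimizer = dspy.BootstrapFewShot(metric=exact_match)
optimized_qa = optimizer.compile(qa, trainset=trainset)

print(optimized_qa(question="I have 7 books and lend out 2. How many remain?").answer)
```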
And ultimately, Emerson said, “I think one of the simplest things users can do is to try to stay up-to-date on effective prompting approaches, model developments and new ways to configure and interact with models.”