If you've used ChatGPT in your browser, you've likely noticed how the response streams back after your prompt, with lines appearing in the conversation one after the other. This is possible thanks to OpenAI's stream API, which returns responses to the client as soon as they are available. This functionality is terrific for building modern, responsive applications. However, it comes with a big downside: OpenAI's data-only stream API doesn't return token usage metadata, so you have to tokenize messages on your own to track your customers' token consumption.
At OpenMeter, we are devoted to helping engineers with usage metering, and we often hear from users asking how to track token usage with OpenAI's streaming API. We wrote this article to guide you through the process.
The Case for Streaming
Large Language Models (LLMs) are incredibly powerful, yet they can be slow at generating long outputs compared to the latency you are used to with modern web applications. If you build a traditional blocking UI, your users might find themselves staring at loading spinners for tens of seconds, waiting for the entire LLM response to be generated. This results in a poor user experience, especially in conversational applications like chatbots. Streaming UIs alleviate this issue by displaying parts of the response as they become available.
To enable modern user experiences, OpenAI has built a streaming API that returns responses as data-only server-sent events as soon as they are available. You can leverage this streaming API as follows:
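Here is a minimal sketch of what that looks like, assuming the official openai Node.js SDK, an OPENAI_API_KEY environment variable, and the gpt-3.5-turbo model as a placeholder:

```typescript
import OpenAI from "openai";

// The SDK reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

const stream = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{ role: "user", content: "Say hello to OpenMeter!" }],
  stream: true,
});

for await (const chunk of stream) {
  // Each chunk is a data-only server-sent event carrying a partial delta
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```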
As this is a data-only response, it doesn’t contain token usage metadata, which is, by default, included in OpenAI’s blocking API call response. To track and attribute token usage to your customers, you must tokenize messages and count usage yourself. The next section will discuss implementing tokenization and usage tracking in your application.
Tokenization and metering usage
Most models, like GPT, process text using tokens, which are common sequences of characters found in text. The models understand the statistical relationships between these tokens and excel at producing the next token in a sequence of tokens. You can test out tokenization on your own with OpenAI’s interactive tokenizer. For example, the following text comprises 11 tokens and 50 characters:
OpenMeter simplifies usage metering for engineers.
Given that models think in tokens, it's no surprise that most AI vendors, like OpenAI or Anthropic, charge based on the number of tokens consumed. This is calculated by tokenizing the model’s response and summing all the tokens. To assist users who need to tokenize text themselves, OpenAI released a Python package named tiktoken, a fast BPE tokenizer for OpenAI's models. Since it's implemented in Python, the JavaScript community created unofficial ports for use in Node.js, such as js-tiktoken and WASM bindings.
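As a quick illustration, here is a sketch of tokenizing the example sentence above with js-tiktoken, assuming its encodingForModel helper and the gpt-3.5-turbo encoding:

```typescript
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-3.5-turbo");
const tokens = enc.encode("OpenMeter simplifies usage metering for engineers.");

// Prints the token count; the interactive tokenizer example above counts 11
console.log(tokens.length);
```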
Completion output token usage
We discussed how streaming APIs enable the responsive, modern UIs essential for conversational AI, while OpenAI’s data-only stream API doesn’t return token usage. Let's explore how we can tokenize response messages using js-tiktoken, combine it with the streaming API, and count token usage. This is useful, for example, if you want to track your users' token usage for billing and analytics use cases.
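A rough sketch of this, assuming the official openai SDK and js-tiktoken (the model name and prompt are placeholders), is to accumulate the streamed deltas and tokenize the final completion:

```typescript
import OpenAI from "openai";
import { encodingForModel } from "js-tiktoken";

const openai = new OpenAI();
const enc = encodingForModel("gpt-3.5-turbo");

const stream = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{ role: "user", content: "Tell me about usage metering." }],
  stream: true,
});

// Accumulate the streamed deltas so we can tokenize the full completion
let completion = "";
for await (const chunk of stream) {
  completion += chunk.choices[0]?.delta?.content ?? "";
}

const completionTokens = enc.encode(completion).length;
console.log(`completion tokens: ${completionTokens}`);
```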
Input prompt token usage
With OpenAI, chat completion requests are billed based on the number of tokens in the prompt input and the number of tokens in the completion output returned by the API. Input and output tokens are priced differently, so you may want to track both separately, depending on your use case.
To calculate the number of tokens in the input, you can use the following code snippet:
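The following is a minimal sketch, again assuming js-tiktoken; note that the chat format adds a few tokens of per-message overhead, so the billed count can be slightly higher than a plain sum over message contents:

```typescript
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-3.5-turbo");

// Sums the tokens of each message's content; the chat format's
// per-message formatting overhead is not included here.
function countInputTokens(messages: { role: string; content: string }[]): number {
  return messages.reduce((sum, m) => sum + enc.encode(m.content).length, 0);
}

const promptTokens = countInputTokens([
  { role: "user", content: "OpenMeter simplifies usage metering for engineers." },
]);
console.log(`prompt tokens: ${promptTokens}`);
```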
To measure token usage across multiple prompts over time, you can report customer token usage to a centralized metering service. Check out our GitHub for the full example of metering usage with the OpenAI streaming API.
Next.js and Vercel’s AI Package
Vercel, the company behind the popular Next.js React framework, has created a library that simplifies the development of AI-powered streaming user interfaces and integrates effortlessly with OpenAI and other providers. One of the great features of Next.js applications is their ability to run on the edge, which enables engineers to build low-latency web applications. These applications run on the Edge Runtime, which requires a tokenization library compatible with this environment; luckily, the js-tiktoken npm package is one of them.
In the following example, we create a Next.js function that takes prompts from the requests, delegates them to OpenAI, and seamlessly streams back the responses to the user interface, ensuring a responsive user experience. To accurately read customers' token consumption, we tokenize messages as they arrive and track their usage.
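A sketch of such a route handler, assuming Vercel's ai package (OpenAIStream and StreamingTextResponse), the official openai SDK, and js-tiktoken; reporting the counts to a metering service is left as a comment:

```typescript
// app/api/chat/route.ts
import OpenAI from "openai";
import { OpenAIStream, StreamingTextResponse } from "ai";
import { encodingForModel } from "js-tiktoken";

// Run this route on the Edge Runtime for low latency
export const runtime = "edge";

const openai = new OpenAI();
const enc = encodingForModel("gpt-3.5-turbo");

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Count the prompt tokens from the incoming messages
  const promptTokens = messages.reduce(
    (sum: number, m: { content: string }) => sum + enc.encode(m.content).length,
    0,
  );

  const response = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages,
    stream: true,
  });

  const stream = OpenAIStream(response, {
    onCompletion: (completion) => {
      const completionTokens = enc.encode(completion).length;
      // Report promptTokens and completionTokens to your metering
      // service (e.g. OpenMeter) here
      console.log({ promptTokens, completionTokens });
    },
  });

  // Stream the response back to the UI as it arrives
  return new StreamingTextResponse(stream);
}
```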
Overview
Modern applications thrive on being responsive and fluid. With Large Language Models (LLMs), generating extensive outputs can take longer, necessitating streaming UIs for the best user experience. OpenAI's streaming API enables engineers to process and display messages from the AI model as soon as they become available. However, the streaming response is data-only and doesn’t include the token usage metadata required for tracking customer consumption for billing and analytics. To fill this gap, we need to tokenize messages and track token usage ourselves, enabling accurate usage metering with OpenAI’s stream APIs.
Looking to meter customer token usage? Get started with OpenMeter Cloud today!