MIT's StreamingLLM Makes Your Chatbots Talk Better, Longer

MIT scientists improve chatbot performance in extended conversations by optimizing memory usage

Ben Wodecki, Jr. Editor

February 15, 2024

2 Min Read

At a Glance

  • MIT researchers have devised a way to make chatbots talk for longer without degraded performance.
  • The first few tokens of a query are the most important. If these tokens stay in memory, performance holds up.

The longer you converse with a chatbot, the worse its responses typically become. Now, a team of researchers from MIT has developed a solution that enables the likes of ChatGPT or Gemini to chat nonstop without their performance deteriorating.

Dubbed StreamingLLM, the framework makes a change to the underlying model’s key-value (KV) cache, which acts as a conversation memory.

Chatbots generate responses based on user inputs, storing those inputs in the KV cache. The system builds an attention map that plots each token and how it relates to the others. A KV cache can only hold a finite amount of information and discards older entries as it nears capacity.
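To make that eviction behavior concrete, here is a minimal Python sketch, purely illustrative and not taken from the paper, of a fixed-capacity cache that silently drops its oldest entries once full (the class name and capacity are placeholders):

```python
from collections import deque

class NaiveKVCache:
    """Toy key-value cache with a fixed capacity.

    Once full, the oldest (key, value) pairs are evicted first,
    mirroring the "discard older entries" behavior described above.
    Names and sizes here are illustrative, not from the paper.
    """

    def __init__(self, capacity: int = 1024):
        self.entries = deque(maxlen=capacity)  # evicts oldest when full

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


cache = NaiveKVCache(capacity=4)
for token_id in range(6):
    cache.append(f"k{token_id}", f"v{token_id}")

# The first two entries (k0, k1) have been evicted -- exactly the
# early information StreamingLLM argues the model cannot afford to lose.
print(list(cache.entries))
```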

MIT’s researchers propose a sliding cache that removes less essential information while ensuring the cache retains key data points.

The resulting process allows a chatbot to keep conversing with a user without its performance dropping. The StreamingLLM paper states that the approach enabled models such as Llama 2 and Falcon to perform stably even when a conversation stretched well beyond four million tokens.

The method even enabled models to return responses more than 22 times faster.

“By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications,” Guangxuan Xiao, the lead author of the StreamingLLM paper, told MIT News.


Let the attention sink in

The researchers found that the first few tokens of a query are the most important. If these get shunted out when the cache reaches capacity, models fail in longer conversations; if they are kept in, performance stays up. The researchers call this phenomenon an ‘attention sink.’

Keeping the first four tokens was enough to prevent a chatbot using a sliding cache from seeing its performance deteriorate as conversations continue. In fact, it led to optimal performance.
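As a rough illustration of what such an eviction policy could look like, the Python sketch below keeps the first four ‘sink’ positions plus a recent window and drops everything in between; the function name and window size are placeholders, not the paper’s actual implementation:

```python
def streaming_evict(cached_positions, num_sink_tokens=4, recent_window=1020):
    """Illustrative StreamingLLM-style eviction policy.

    Keep the first `num_sink_tokens` entries (the attention sinks)
    plus the most recent `recent_window` entries, and drop everything
    in between. The window size is a placeholder, not the paper's setting.
    """
    if len(cached_positions) <= num_sink_tokens + recent_window:
        return cached_positions
    return cached_positions[:num_sink_tokens] + cached_positions[-recent_window:]


# Example: a conversation with 5,000 cached positions shrinks to
# 4 sink positions plus the 1,020 most recent ones.
kept = streaming_evict(list(range(5000)))
print(len(kept), kept[:6], kept[-2:])
```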

The researchers also discovered that adding a placeholder token as a dedicated attention sink during pre-training can further improve deployment.
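A hypothetical sketch of that idea, using an assumed reserved token ID rather than anything specified in the paper, might look like this:

```python
SINK_TOKEN_ID = 0  # hypothetical ID reserved for a dedicated attention sink

def prepend_sink_token(token_ids, sink_id=SINK_TOKEN_ID):
    """Prepend a dedicated 'sink' placeholder to a training sequence.

    A rough sketch of the idea described above: during pre-training,
    every sequence starts with the same placeholder token, so at
    inference the model only needs that single token kept in the
    cache to serve as its attention sink.
    """
    return [sink_id] + list(token_ids)


print(prepend_sink_token([101, 2009, 2003]))  # -> [0, 101, 2009, 2003]
```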

Song Han, a member of the MIT-IBM Watson AI Lab and a distinguished scientist at Nvidia, told MIT News: “We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible — every other token can see it.”

“We found that we must always keep the attention sink in the cache to maintain the model dynamics.”

You can access StreamingLLM via Nvidia's large language model optimization library, TensorRT-LLM.



About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

