MIT's StreamingLLM Makes Your Chatbots Talk Better, Longer

MIT scientists improve chatbot performance in extended conversations by optimizing memory usage

Ben Wodecki, Jr. Editor

February 15, 2024

2 Min Read

At a Glance

  • MIT researchers have devised a way to make chatbots talk for longer without degraded performance.
  • The first few tokens of a query are the most important. If these tokens stay in memory, performance holds up.

The longer you converse with a chatbot, the worse its responses typically become. Now, a team of researchers from MIT has developed a solution that enables the likes of ChatGPT or Gemini to chat nonstop without their performance deteriorating.

Dubbed StreamingLLM, the framework makes a change to the underlying model’s key-value (KV) cache, which acts as a conversation memory.

Chatbots generate responses based on user inputs, storing those inputs in the KV cache. The system builds an attention map that plots each token and how it relates to the others. A KV cache can only hold a finite amount of information and discards older entries as it nears capacity.
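To make that eviction behavior concrete, here is a minimal Python sketch, purely illustrative and not taken from the paper, of a fixed-capacity cache that silently drops its oldest entries once full (the class name and capacity are placeholders):

```python
from collections import deque

class NaiveKVCache:
    """Toy key-value cache with a fixed capacity.

    Once full, the oldest (key, value) pairs are evicted first,
    mirroring the "discard older entries" behavior described above.
    Names and sizes here are illustrative, not from the paper.
    """

    def __init__(self, capacity: int = 1024):
        self.entries = deque(maxlen=capacity)  # evicts oldest when full

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


cache = NaiveKVCache(capacity=4)
for token_id in range(6):
    cache.append(f"k{token_id}", f"v{token_id}")

# The first two entries (k0, k1) have been evicted -- exactly the
# early information StreamingLLM argues the model cannot afford to lose.
print(list(cache.entries))
```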

MIT’s researchers propose a sliding cache that removes less essential information while ensuring the cache retains key data points.

The resulting process allows a chatbot to keep conversing with a user without its performance dropping. The StreamingLLM paper states that the approach enabled models such as Llama 2 and Falcon to perform stably even when a conversation stretched well beyond four million tokens.

The method even enabled models to return responses more than 22 times faster.

“By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications,” Guangxuan Xiao, the lead author of the StreamingLLM paper, told MIT News.


Let the attention sink in

The researchers found that the first few tokens of a query are the most important. If these get shunted out when the cache reaches capacity, models fail in longer conversations; if they are kept in, performance stays up. The researchers call this phenomenon an ‘attention sink.’

Keeping the first four tokens was enough to prevent a chatbot using a sliding cache from seeing its performance deteriorate as conversations continue. In fact, it led to optimal performance.
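As a rough illustration of what such an eviction policy could look like, the Python sketch below keeps the first four ‘sink’ positions plus a recent window and drops everything in between; the function name and window size are placeholders, not the paper’s actual implementation:

```python
def streaming_evict(cached_positions, num_sink_tokens=4, recent_window=1020):
    """Illustrative StreamingLLM-style eviction policy.

    Keep the first `num_sink_tokens` entries (the attention sinks)
    plus the most recent `recent_window` entries, and drop everything
    in between. The window size is a placeholder, not the paper's setting.
    """
    if len(cached_positions) <= num_sink_tokens + recent_window:
        return cached_positions
    return cached_positions[:num_sink_tokens] + cached_positions[-recent_window:]


# Example: a conversation with 5,000 cached positions shrinks to
# 4 sink positions plus the 1,020 most recent ones.
kept = streaming_evict(list(range(5000)))
print(len(kept), kept[:6], kept[-2:])
```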

The researchers also discovered that adding a placeholder token as a dedicated attention sink during pre-training can further improve deployment.
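A hypothetical sketch of that idea, using an assumed reserved token ID rather than anything specified in the paper, might look like this:

```python
SINK_TOKEN_ID = 0  # hypothetical ID reserved for a dedicated attention sink

def prepend_sink_token(token_ids, sink_id=SINK_TOKEN_ID):
    """Prepend a dedicated 'sink' placeholder to a training sequence.

    A rough sketch of the idea described above: during pre-training,
    every sequence starts with the same placeholder token, so at
    inference the model only needs that single token kept in the
    cache to serve as its attention sink.
    """
    return [sink_id] + list(token_ids)


print(prepend_sink_token([101, 2009, 2003]))  # -> [0, 101, 2009, 2003]
```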

Song Han, a member of the MIT-IBM Watson AI Lab and a distinguished scientist at Nvidia, told MIT News: “We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible — every other token can see it.”

“We found that we must always keep the attention sink in the cache to maintain the model dynamics.”

You can access StreamingLLM via Nvidia's large language model optimization library, TensorRT-LLM.



About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

