Is there a way to mitigate the performance drop as context grows?

Apr 26, 2026 · 5:08 PM UTC

via LocalLlama

In my local LLM setup, generation starts at 30 to 80 t/s, but it drops quite a lot as the context grows. I use llama.cpp with the Vulkan backend on an MI50 and a V100. Are there any command-line flags that can help with this? Or is there a good practice other than restarting the chat after some time?
