Various experiments with the recent Window Attention with Attention Sinks (StreamingLLM) approach indicate that it clearly improves the inference fluency of pretrained LLMs on long sequences, while reducing KV-cache VRAM usage from growing linearly with sequence length to constant.
It can be applied to pretrained LLMs with little to no additional effort, and Hugging Face transformers is working on first-party support. Until then, the third-party module from the blog post already works well.
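The mechanism behind the constant VRAM usage is a simple cache-eviction policy: keep the first few tokens (the "attention sinks") plus a sliding window of the most recent tokens, and drop everything in between. A minimal sketch in plain Python — the function name and parameter values here are illustrative, not the blog post's actual API:

```python
def evict_kv_cache(cache, sink_size=4, window_size=1020):
    """Attention-sink eviction policy (illustrative sketch):
    keep the first `sink_size` entries (the attention sinks) plus
    the most recent `window_size` entries, drop the middle."""
    if len(cache) <= sink_size + window_size:
        return cache  # nothing to evict yet
    return cache[:sink_size] + cache[-window_size:]

# Toy cache of token positions standing in for key/value pairs.
cache = list(range(2000))
cache = evict_kv_cache(cache, sink_size=4, window_size=1020)
assert len(cache) == 1024          # cache size stays constant
assert cache[:4] == [0, 1, 2, 3]   # sink tokens are preserved
assert cache[4] == 980             # window covers the last 1020 tokens
```

Because the cache never grows beyond `sink_size + window_size` entries regardless of how many tokens are generated, memory stays constant; keeping the initial sink tokens is what preserves fluency, since pretrained models dump excess attention mass onto them.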