My understanding is that Mistral uses regular 4K RoPE and "extends" the effective window size with SWA. This is based on the results of Nous Research's Yarn-Mistral extension (https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k) and Self-Extend, both of which only apply to RoPE models.
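For reference, here's a minimal sketch (PyTorch, not Mistral's actual code) of what a sliding-window attention mask looks like; the 4096 window matches the value discussed above, and because the receptive field still grows with depth, a small window can serve a much longer context:

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    """Causal mask where token i may only attend to tokens in (i - window, i].

    Rough sketch of sliding-window attention (SWA). The effective receptive
    field still grows with depth (roughly window * n_layers), which is how a
    4K window can "extend" to longer contexts.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape [L, 1]
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape [1, L]
    allowed = (j <= i) & (j > i - window)
    return allowed  # bool [L, L], True = may attend
```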
Quite a few attention/context extension techniques have been published recently:
* Activation Beacon - up to 100X context length extension in as little as 72 A800 GPU-hours https://huggingface.co/papers/2401.03462
* Self-Extend - a no-training RoPE modification that can give "free" context extension with 100% passkey retrieval (works w/ SWA as well; see the sketch after this list) https://huggingface.co/papers/2401.01325
* DistAttention/DistKV-LLM - KV cache segmentation for 2-19X context length at runtime https://huggingface.co/papers/2401.02669
* YaRN - aforementioned efficient RoPE extension https://huggingface.co/papers/2309.00071
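To make the Self-Extend bullet more concrete, here's a minimal sketch of its core idea: relative positions inside a small neighbor window stay exact, and anything farther away is floor-divided into groups so the model never sees a relative position beyond its trained range. The window/group values below are illustrative rather than the paper's defaults, and the real implementation floors absolute positions before RoPE and merges two attention passes instead of remapping a scalar distance:

```python
def self_extend_relative_position(rel: int, neighbor_window: int = 512, group_size: int = 8) -> int:
    """Map a true relative distance to the distance Self-Extend feeds into RoPE.

    Nearby tokens keep exact positions; distant tokens share grouped (floored)
    positions, shifted so they line up at the neighbor-window boundary.
    """
    if rel <= neighbor_window:
        return rel  # neighbor attention: unmodified positions
    # grouped attention: floor-divide, then shift to meet the neighbor window
    return rel // group_size + neighbor_window - neighbor_window // group_size

# e.g. a distance of 5000 maps to 5000 // 8 + 512 - 64 = 1073,
# well within what a 4K-trained model has seen.
```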
You could imagine combining a few of these to essentially "solve" the context issue while still doing most of the training at shorter context lengths.
There are of course some exciting new alternative architectures, notably Mamba https://huggingface.co/papers/2312.00752 and Megabyte https://huggingface.co/papers/2305.07185 that can efficiently process up to 1M tokens...