I've noticed this brings deployment challenges that not many people talk about: a streaming response can last several minutes (especially with reasoning models), which is quite different from traditional API requests that complete in a few seconds. At the same time, we don't want in-flight requests to be interrupted when deploying a new version.
How did you guys do it?
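For reference, one common answer is connection draining: the old process stops accepting new connections but keeps serving in-flight streams (with a generous deadline) before exiting, while the new version takes over new traffic. Here is a minimal asyncio sketch of that idea — the port, chunk timings, and 300-second drain deadline are all hypothetical, not anyone's actual setup:

```python
import asyncio

async def handle(reader, writer):
    # Simulate a long-lived streaming response (e.g. token streaming).
    for i in range(3):
        writer.write(f"chunk {i}\n".encode())
        await writer.drain()
        await asyncio.sleep(0.1)
    writer.close()
    await writer.wait_closed()

async def main():
    inflight = set()

    async def tracked(reader, writer):
        # Track each connection's task so shutdown can wait for it.
        task = asyncio.current_task()
        inflight.add(task)
        try:
            await handle(reader, writer)
        finally:
            inflight.discard(task)

    server = await asyncio.start_server(tracked, "127.0.0.1", 8099)

    async def client():
        r, w = await asyncio.open_connection("127.0.0.1", 8099)
        data = await r.read()  # read until EOF
        w.close()
        return data

    client_task = asyncio.create_task(client())
    await asyncio.sleep(0.05)  # a request is now in flight

    # Drain: stop accepting NEW connections first...
    server.close()
    await server.wait_closed()
    # ...then wait (up to a generous deadline) for in-flight
    # streams to finish instead of cutting them off.
    if inflight:
        await asyncio.wait(inflight, timeout=300)
    return await client_task

data = asyncio.run(main())
print(data.decode())
```

The in-flight stream completes all its chunks even though the listener was closed mid-response; in a real rollout the load balancer would route new requests to the new version during this drain window.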