At a high level there are a few differences
- Control over embeddings. What gets embedded? What are the output vectors? What models do you use? How do you handle multimodal input?
- Performance. When you make a call to Assistants, you have to wait for the Assistant to understand that it needs to do RAG. This performance hit is actually quite large (look at the two videos on the blog for reference)
- Cost. OpenAI has an incentive to load the context window to consume more tokens. A few dozen calls to Assistants was costing me around $10.