I get that it allows you to ensure you're testing the model's capabilities rather than the prompt, but most models are post-trained on very different prompting formats.
I use Seedream in production, so I was a little suspicious of the gap: I passed ByteDance's official prompting guide, OP's prompt, and your feedback to Claude Opus 4.5 and got this prompt to create a new image:
> A partially eaten chicken burrito with a bite taken out, revealing the fillings inside: shredded cheese, sour cream, guacamole, shredded lettuce, salsa, and pinto beans all visible in the cross-section of the burrito. Flour tortilla with grill marks. Taken with a cheap Android phone camera under harsh cafeteria lighting. Compostable paper plate, plastic fork, messy table. Casual unedited snapshot, slightly overexposed, flat colors.
Then I generated with n=4 and the 'standard' prompt expansion setting for Seedream 4.0 Text To Image:
They're still not perfect (they're not adhering to the fillings being inside, for example), but they're massively better than OP's result.
This shows that a) random chance plays a big part, so you want more than one sample, and b) you don't have to "cheat" by spending massive amounts of time hand-iterating on a single prompt to get a better result.
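For what it's worth, the "more than one sample" part is cheap to automate. Here's a minimal sketch of a batched request payload; the model id and field names (`n`, `prompt_expansion`) are illustrative assumptions, not Seedream's actual API schema:

```python
import json

def build_request(prompt: str, n: int = 4, prompt_expansion: str = "standard") -> dict:
    """Assemble a hypothetical text-to-image request.

    NOTE: the field names below are illustrative stand-ins,
    not the real Seedream API schema.
    """
    return {
        "model": "seedream-4.0",           # placeholder model id
        "prompt": prompt,
        "n": n,                            # several samples per prompt to average out randomness
        "prompt_expansion": prompt_expansion,
    }

# Serialize the payload you'd POST to whatever endpoint your provider exposes
payload = json.dumps(build_request("A partially eaten chicken burrito with a bite taken out"))
```

The point isn't the plumbing, it's that asking for n=4 in one call makes single-roll cherry-picking (in either direction) obvious.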
Including a "total rolls" count would be a very valuable metric, since it helps indicate how steerable the model is.
I think it is the fine-tuning, because you can find AI photos that look more like real ones. I guess people prefer obviously fake-looking 'picturesque' photos to more realistic ones? Maybe it's just because the money is in selling to people generating marketing materials? NB is clearly the only model here that permits a half-eaten burrito to actually appear to have been bitten.
It looks like they took the page down now though...
It's just not as plasticky and oversaturated as the others.
The table grain is the only thing that gives it away; if it weren't for that, no one without advance warning would notice that it's not real.
The “partially eaten” part of the prompt is interesting: everyone knows what a half-eaten burrito looks like, but clearly the computers struggle.
For some reason ever since DALL-E 2, all food models seem to generate obviously fake food and/or misinterpret the fun constraints...until Nano Banana. Now I can generate fractal Sierpiński triangle peanut butter and jelly sandwiches.
I can kind of see what you mean in that it went for realism in the aesthetics, but not in the object... that last one would probably fool me if I were scrolling, though.
"peanut butter, jelly and bread rubik's cube. each smaller cube in the rubik's cube is one ingredient, randomly selected. professional food photography style. ensure it looks like a working rubik's cube"
Even ignoring the Heinz bean outliers, these are all decidedly Scottsdale. With one exception. All hail Nano Banana.
Do people get burritos with beans in them that look more or less as pictured? Aesthetically, it seems like it'd look pretty appealing if you were someone who loved beans, compared to what I had in mind, but again I'm really in no position to judge these images on bean appearance.
1. The text encoders are primitive (e.g. CLIP) and have difficulty with nuance, such as "partially eaten", and model training can only partially overcome it. It's the same issue with the now-obsolete "half-filled" wine glass test.
2. Most models are diffusion-based, which means they denoise the entire image simultaneously. If the model fails to account for the nuance in the first few passes, it can't go back and fix it.
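Point 2 can be seen in the shape of the sampling loop itself. This is a toy sketch, not any real model's sampler: the schedule is made up and `predict_noise` is a trivial stand-in for a trained network, but the structure shows why every pixel updates in lockstep and nothing revisits a region once its content has locked in:

```python
import numpy as np

def denoise_step(x, t, predict_noise, steps):
    """One reverse-diffusion step, applied to the WHOLE image at once.
    The alpha schedule here is an illustrative stand-in, not a real one."""
    alpha = 1.0 - 0.02 * (t + 1) / steps
    eps = predict_noise(x, t)          # model's noise estimate for the full image
    return (x - (1.0 - alpha) * eps) / np.sqrt(alpha)

def sample(shape, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)     # start from pure noise
    for t in reversed(range(steps)):
        # every pixel is denoised simultaneously; there is no step that
        # goes back and re-decides what an earlier pass committed to
        x = denoise_step(x, t, lambda img, _t: 0.1 * img, steps)
    return x
```

The global composition (where the bite goes, whether fillings are inside) is effectively decided in the earliest, noisiest steps, which is exactly where a weak text encoding hurts most.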
I believe some image-generation AIs were RLHF'd like chatbot LLMs, but more to improve aesthetics than prompt adherence.