https://static.simonwillison.net/static/2024/pelicans-on-bic...
Of the four, two were a pelican riding a bicycle. One was a pelican just running along the road, one was a pelican perched on a stationary bicycle, and one had the pelican wearing a weird sort of pelican bicycle helmet.
All four were better than what I got from Sora: https://simonwillison.net/2024/Dec/9/sora/
My company (Nim) is hosting the Hunyuan model, so here's a quick test (first attempt) at "pelican riding a bicycle" via Hunyuan on Nim: https://nim.video/explore/OGs4EM3MIpW8
I think it's as good as, if not better than, Sora / Veo
What does it produce for “A pelican riding a bicycle along a coastal path overlooking a harbor”?
Or, what do Sora and Veo produce for your verbose prompt?
The pelican is doing some weird flying motion, motion blur hides a lack of detail, the bicycle is moving fast so the background is blurred, etc. I would even say Sora is better, because I like the slow motion and detail, but it did do something very non-physical.
Veo is clearly the best in this example. It has high detail but also feels the most physically grounded among the examples.
If you'd like to replicate this, the sign-up process was very easy and I was able to run a single generation attempt. Maybe later, when I want to generate video for real, I'll use prompt enhancement. Without it, the video appears to lose any notion of direction. Most image-generation models I'm aware of do prompt enhancement; I've seen it on Grok+Flow/Aurora and ChatGPT+DALL-E.
Prompt: A pelican riding a bicycle along a coastal path overlooking a harbor
Seed: 15185546
Resolution: 720×480

Turning content blockers off does not make a difference.
oh, shit!
> Prompt: The sun rises slowly behind a perfectly plated breakfast scene. Thick, golden maple syrup pours in slow motion over a stack of fluffy pancakes, each one releasing a soft, warm steam cloud. A close-up of crispy bacon sizzles, sending tiny embers of golden grease into the air. [...]
In the video, the bacon is unceremoniously slapped onto the pancakes, while the prompt sounds like it was intended to be a separate shot, with the bacon still in the pan? Or, alternatively, everything described in the prompt should have been on the table at the same time?
So, yet again: AI produces impressive results, but it rarely does exactly what you wanted it to do...
But I'm also seeing some genuinely creative uses of generative video - stuff I could argue has some genuine creative validity. I am loath to dismiss an entire technique because it is mostly used to create garbage.
We'll have to figure out how to solve the slop problem - it was already an issue before AI, so maybe this is just hastening the inevitable.
>at resolutions up to 4K, and extended to minutes in length.
https://blog.google/technology/google-labs/video-image-gener...
Anyways, I strongly suspect that the funny meme content that seems to be the practical use case of these video generators won't be possible on either Veo or Sora, because of copyright, PC concerns, famous people, or other 'safety'-related reasons.
I was so excited to see Sora out - only to see it has most of the same problems. And Kling seems to do better in a lot of benchmarks.
I can't quite make sense of it: what OpenAI were showing when they first launched Sora was so amazing. Was it cherry-picked? Or was it using loads more compute than what they've released?
It does pop up. Look at where his hand is relative to the jar when he grabs it vs. when he stops lifting it. The hand and the jar are both moving, but the jar never physically attaches to the grab.
This feels like a bit of a comeback as Veo 2 (subjectively) appears to be a step up from what Sora is currently able to achieve.
Some of the videos look incredibly believable though.
These videos will be, and maybe already are, too realistic.
Our society is not prepared for this kind of reality-"bending" media. These hyperrealistic videos will be the reason for hate and murder. Evil actors will use them to influence elections on a global scale, create cults around virtual characters, and deny the rules of physics and human reason. And yet there is no way for a person to instantly detect that they are watching a generated video. Maybe there is now, but in a year it will be indistinguishable from a real recording.
I'm thinking of simple cryptographic signing of a file, rather than embedding watermarks into the content, but that's another option.
I don't think it will solve the fake video onslaught, but it could help.
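For what it's worth, the signing itself is the easy part. Here's a minimal sketch using Python's cryptography package, with a hypothetical filename; the genuinely hard part (proving the public key really belongs to a trusted camera or publisher) isn't solved here:

    # Sign a video file with Ed25519 and verify it later.
    # Assumes `pip install cryptography`; "clip.mp4" is hypothetical.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    with open("clip.mp4", "rb") as f:
        data = f.read()

    signature = private_key.sign(data)  # 64-byte detached signature

    try:
        public_key.verify(signature, data)  # raises if the file was altered
        print("signature valid")
    except InvalidSignature:
        print("file modified or signed by a different key")

Distributing the signature alongside the file, and building a PKI that ties keys to real manufacturers, is where provenance schemes like C2PA spend most of their effort.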
Cute hack showing that it's kinda useless unless the user-facing UX does a better job of actually verifying whether the certificate represents the manufacturer of the sensor (the dude just uses a self-signed cert with "Leica Camera AG" as the name). Clearly cryptography literacy is lagging behind... https://hackaday.com/2023/11/30/falsified-photos-fooling-ado...
The only thing this changes is not needing to pay human beings for work.
I feel this kind of hypervigilance will be mentally exhausting, and not being able to trust your primary senses will have untold psychological effects
And no big tech company would run the ads you're suggesting, because they only make money when people use the systems that deliver the untrustworthy content.
I think we will need the same healthy media diet.
In theory that should matter to something like Open(Closed)Ai. But who knows.
This quote suggests not: "maintaining complete consistency throughout complex scenes or those with complex motion, remains a challenge."
> a thief threatens a man with a gun, demanding his money, then fires the gun (etc add details)
> the thief runs away, while his victim slowly collapses on the sidewalk (etc same details)
Would you get the same characters, wearing identical clothing, with the same lighting and identical background details? You need all these elements to be the same; that's what filmmakers call "continuity". I doubt that Veo or any of the generators would actually produce continuity.
Not much. Low quality over-saturated advertising? Short films made by untalented lazy filmmakers?
When text prompts are the only source, creativity is absent. No craft, no art. Audiences won't gravitate towards fake crap that oozes out of AI vending machines, unrefined, artistically uncontrolled.
Imagine visiting a restaurant because you heard the chef is good. You enjoy your meal but later discover the chef has a "food generator" where he prompts the food into existence. Would you go back to that restaurant?
There's one exception. Video-to-video and image-to-video, where your own original artwork, photos, drawings and videos are the source of the generated output. Even then, it's like outsourcing production to an unpredictable third party. Good luck getting lighting and details exactly right.
I see the role of this AI gen stuff as background filler, such as populating set details or distant environments via green screen.
That's an obvious yes from me. I liked it, and not only that, but I can reasonably assume it will be consistently good in the future, something lots of places can't do.
> Veo sample duration is 8s, VideoGen’s sample duration is 10s, and other models' durations are 5s. We show the full video duration to raters.
Could the positive result for Veo 2 mean the raters like longer videos? Why not trim Veo 2's output to 5s for a better controlled test?
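A controlled trim would be trivial; a sketch calling ffmpeg from Python (filenames are hypothetical):

    # Trim a Veo 2 clip to its first 5 seconds so all models are
    # compared at the same duration. Re-encodes rather than using
    # "-c copy", which can only cut cleanly on keyframes.
    import subprocess

    subprocess.run(
        ["ffmpeg", "-ss", "0", "-i", "veo2_sample.mp4",
         "-t", "5", "veo2_sample_5s.mp4"],
        check=True,
    )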
I'm not surprised Google hasn't opened this to the public yet; there's still a huge amount of volunteer red-teaming to be done by the public on other services like hailuoai.video.
P.S. The skate tricks in the final video are delightfully insane.
Closed models aren't going to matter in the long run. Hunyuan and LTX both run on consumer hardware and produce videos similar in quality to Sora Turbo, yet you can train them and prompt them on anything. They fit into the open source ecosystem which makes building plugins and controls super easy.
Video is going to play out in a way that resembles images. Stable Diffusion- and Flux-like players will win. There might be room for one or two Midjourney-type players, but by and large the most activity happens in the open ecosystem.
Are there other versions than the official?
> An NVIDIA GPU with CUDA support is required.
> Recommended: We recommend using a GPU with 80GB of memory for better generation quality.
https://github.com/Tencent/HunyuanVideo
> I am getting CUDA out of memory on an Nvidia L4 with 24 GB of VRAM, even after using the bfloat16 optimization.
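If it helps anyone, the community diffusers port exposes the usual memory levers (CPU offload, VAE tiling). A rough sketch; the repo id, resolution, and whether this actually fits in 24 GB are assumptions on my part, not tested claims:

    # Rough sketch: shrink HunyuanVideo's footprint via the diffusers
    # port. Assumes `pip install diffusers transformers accelerate`;
    # the repo id and settings below are assumptions, not tested claims.
    import torch
    from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
    from diffusers.utils import export_to_video

    model_id = "hunyuanvideo-community/HunyuanVideo"  # assumed community port
    transformer = HunyuanVideoTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=torch.bfloat16
    )
    pipe = HunyuanVideoPipeline.from_pretrained(
        model_id, transformer=transformer, torch_dtype=torch.float16
    )
    pipe.vae.enable_tiling()         # decode latents in tiles, not all at once
    pipe.enable_model_cpu_offload()  # keep only the active module on the GPU

    frames = pipe(
        prompt="a pelican riding a bicycle",
        height=320, width=512,       # small output to stay within VRAM
        num_frames=61,
        num_inference_steps=30,
    ).frames[0]
    export_to_video(frames, "pelican.mp4", fps=15)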
With the YouTube corpus at their disposal, I don't see how anyone can beat Google for AI video generation.
Google is like a tanker that's still steering to align with the course people expect of it; what they don't recognize is that it will soon be on that course, capable of rolling over everything that comes in its way.
If OpenAI claims they're close to having AGI, Google most likely already has it and is doing its shenanigans with the US government under the radar, while Microsoft plays the cool guy and Amazon is still trying to get its act together.
That, or they have a secret super human intelligence under wraps at the pentagon.
OpenAI might be well-capitalized, but they're (1) bleeding money, (2) lacking a clear path to profitability, and (3) competing head-to-head with a behemoth that can profitably provide a similar offering at literally 10-20x cheaper.
Google might be slow out of the blocks, but it's not like they've been sitting on their hands for the past decade.
https://arstechnica.com/information-technology/2023/03/yes-v...
We're not even done with 2024.
Just imagine what's waiting for us in 2025.
SD Cards?
Because there are literally thousands of avenues to explore and we've only just begun with the lowest of low hanging fruit.
To really do well on this task, the model basically has to understand physics, and human anatomy, and all sorts of cultural things. So you're forcing the model to learn all these things about the world, but it's relatively easy to train because you can just collect a lot of videos and show the model parts of them -- you know what the next frame is, but the model doesn't.
Along the way, this also creates a video generation model - but you can think of this as more of a nice side effect rather than the ultimate goal.
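The training objective really is that simple in outline. A toy sketch in PyTorch; everything here is illustrative (real systems predict in a compressed latent space with diffusion objectives, not raw-pixel MSE):

    # Toy next-frame-prediction loop: show the model frame t, grade it
    # against frame t+1. The tiny conv net and random "clips" are
    # stand-ins for a real architecture and a real video corpus.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(100):
        clip = torch.rand(8, 2, 3, 64, 64)        # (batch, time, C, H, W)
        current, target = clip[:, 0], clip[:, 1]  # we know the next frame...
        pred = model(current)                     # ...the model has to guess it
        loss = nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()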
All these models have just “seen” enough videos of all those things to build a probability distribution to predict the next step.
This is not bad, nor does it make the model inherently dumb; a major component of human intelligence is built on similar strategies. I couldn't tell you which grammatical rules are broken in a text, or which physical rules in a photograph, but I can tell something is wrong using the same methods.
Inference can take it far with large enough data sets, but sooner or later, without reasoning, you will hit a ceiling.
This is true for humans as well: plenty of people go far in life with just memorization and replication, and do a lot of jobs fairly competently, but not in everything.
Reasoning is essential for higher-order functions, and transformers are not the path to that.
Think 5-10 years into the future; this is a stepping stone
but more than anything it's useful as a stepping stone to more full-featured video generation that can maintain characters and story across multiple scenes. it seems clear that at some point tools like this will be able to generate full videos, not just shots.
Now, it may not be the best fit for those yet due to its limitations, but you've gotta walk before you can run: compare Stable Diffusion 1.x to FLUX.1 with ControlNet to see where quality and controllability could head in the future.
https://www.reddit.com/r/aivideo/comments/1hbnyi2/comment/m1...
Another more serious music video also made entirely by one person. https://www.youtube.com/watch?v=pdqcnRGzH5c Don't know how long it took though.
my templates are all waiting for looping stock videos to be added in the background
you have no idea how cool I am with the lack of copyright protections afforded to these videos I will generate, I'm making my money other ways
- gold everywhere is excessive: more Rococo (1730s-1760s) than Renaissance (roughly 1300-1600), which was a lot more restrained
- mirror way too big and clear. Renaissance mirrors were small polished metal or darker imperfect glass
- candelabras too ornate and numerous for the Renaissance. Multi-tier candleholders are more Baroque (1600-1750), and the candles look suspiciously perfect, as opposed to period-appropriate uneven tallow or beeswax
- white paper too pristine (parchment or vellum would be expected), pen holders hilariously modern, gold-plated(??) desk surface is absurd
- woman's clothing looks too recent (Victorian?); sleeves and hair are theatrical
- hard to tell, but background dudes are lurking in what look like theatrical costumes rather than anything historically accurate
The prompt for the figure running through glowing threads seems to contain a lot of detail that doesn't show up in the video.
In the first example (the close-up of the DJ), the last line about her captivating presence and the power of music is, I guess, meant to give the video a "vibe" (as opposed to prescriptively describing it). I wonder how the result changes if you leave it out?
Cynically I think that it's a leading statement there for the reader rather than the model. Like now that you mention it, her presence _is_ captivating! Wow!
From now on, examples of image or video generation models showing off how great they are should be stickman drawings or stickman videos. As far as I know, no model has been able to do that properly yet. If a model can do it well, it will be a huge breakthrough.
Another point to consider is that if my generative video system isn't good at maintaining world consistency, then doing a slow-motion video gives the illusion of a long video while being able to maintain a smaller "world context".
I tried that, though, and it has the same issue they all seem to have for me: if you are familiar with the face but the person is not really famous, the features in the video are never close enough to recognize them as the same person.
50 cents per video. Far more when accounting for the cherry-pick rate.
...and that's when I realized how much cherry-picking there is in these "demos". These demos are about deceiving you into thinking the model is much better than it actually is.
This rewards not making the models available: people end up comparing their own extrapolation from demo clips against other models' actual outputs, which can trick them into thinking Google is winning the game.
Google won.
Of course, it's orders of magnitude cheaper than making a video or an animation yourself.
Namely, that it takes so few neurons to get a picture into our heads.
I guess end-of-the-world scenarios may lead us to create that superintelligence with a gigantic, ultra-performant artificial "brain".
Humanity has its ways of objecting to accelerationism.
Actually, typically human objection only slows it down and often it becomes a fringe movement, while the masses continue to consume the lowest common denominator. Take the revival of the flip phone, typewriter, etc. Sadly, technology marches on and life gets worse.
TikTok is one of the easiest platforms to create for, and look at how much human attention it has sucked up.
The attention/dopamine magnet is accelerating its transformation into a gravitational singularity for human minds.
"The Human Security System is structured by delusion. What's being protected there is not some real thing that is mankind, it's the structure of illusory identity. Just as at the more micro level it's not that humans as an organism are being threatened by robots, it's rather that your self-comprehension as an organism becomes something that can't be maintained beyond a certain threshold of ambient networked intelligence." [0]
See also my research project on the core thesis of Accelerationism: that capitalism is AI. [1]
[0] https://syntheticzero.net/2017/06/19/the-only-thing-i-would-...
Thanks for sharing that video and post!
One way to think about this stuff is to imagine that you are 14 and starting to create videos, art, music, etc in order to build a platform online. Maybe you dream of having 7 channels at the same time for your sundry hobbies and building audiences.
For that 14 year old, these tools are available everywhere by default and are a step function above what the prior generation had. If you imagine these tools improving even faster in usability and capability than prior generations' tools did …
If you are of a certain age you'll remember how we were harangued endlessly about "remix culture" and how mp3s were enabling us to steal creativity without making an effort at being creative ourselves, about how photobashing in Photoshop (pirated cracked version anyway) was not real art, etc.
And yet, halfway through the linked video, the speaker, who has misgivings, was laughing out loud at the inventiveness of the generated replies and I was reminded that someone once said that one true IQ test is the ability to make other humans laugh.
Inventive is one way of putting it, but I think he was laughing at how bizarre or out-of-character the responses would be if he used them. Like the AI suggesting that he post "it is indeed a beverage that would make you have a hard time finding a toilet bowl that can hold all of that liquid" as if those were his own words.
If this is "just another tool" then my question is: does the output of someone who has used this tool for one thousand hours display a meaningful difference in quality to someone who just picked it up?
I have not seen any evidence that it does.
Another idea: what the pro-generative-AI crowd doesn't seem to understand is that good art is not about _execution_, it's about _making deliberate choices_. While a master painter or guitarist may indeed pull off incredible technical feats, their execution is not the art in and of itself; it widens the range of choices they can make. The more generative AI steps into the role of making these choices, the more useless, ironically, it becomes.
And lastly: I've never met anyone who has spent significant time creating art react to generative AI as anything more than a toy.
Maybe it's just me who couldn't find it (the website barely works at all on FF iOS)...
> VideoFX isn't available in your country yet.
Think about it: almost everyone I know rarely clicks on ads or buys from ads anymore. On the other hand, a lot of people, including myself, look into buying something advertised implicitly or explicitly by content creators we follow. Say, a router recommended by LinusTechTips. A lot of brands have started moving their ad spending to influencers too.
Google doesn't have a lot of control over these influencers. But if they can get good video generation models, they can control this ad space too, without a human in the loop.
1) AI is a massive wave right now and everyone's afraid that they're going to miss it, and that it will change the world. They're not obviously wrong!
2) AI is showing real results in some places. Maybe a lot of us are numb to what gen AI can do by now, but the fact that it can generate the videos in this post is actually astounding! 10 years ago it would have been borderline unbelievable. Of course they want to keep investing in that.
This is a typical tech echo chamber. There is a significant number of people who make direct purchases through ads.
> But if they can get good video generation models, they can control this ad space too, without a human in the loop.
This looks like it's based on a misguided assumption. Format might have a significant impact on reach, but the deciding factor is trust in the reviewer. The video format itself does not guarantee a decent CTR/CVR. It's true that ad companies find this space lucrative, but they're smart enough to acknowledge this complexity.
Even if it's not, TV ads, newspaper ads, magazine ads, billboards, etc. get exactly 0 clickthroughs, and yet people still bought (and continue to buy) them. Why do we act like impressions are hunky-dory for every other medium, but worthless for web ads?
I remember saying this to a google VP fifteen years ago. Somehow people are still clicking on ads today.
Most people have claimed not to be influenced by ads since long before networked computers were a major medium for delivering them.