"The study said 86.66% of the generated software systems were 'executed flawlessly.'"
What's that supposed to mean?
Were unit tests written to get this percentage? By humans or by the AI itself?
86.66% of how many LOC?
How long would it take a human (because AI does it badly) to debug the code?
What was the purpose of the generated code?
There's a lack of useful context here. Maybe I missed some information while skimming the article.
IMHO, it looks like just another rant on how "good" LLMs can "write code".
EDIT: Sorry OP, I didn't see the arXiv link.
I'm on mobile, and reading a 10 MB PDF isn't worth it. I'll try to read it on a computer though, it looks interesting.