So, they will rely on voice commands for recognition, not natural language: often one or two words that set a chain of tasks in motion. Think of having to control your entire computer, including navigation, by voice. Doing that through natural language alone would be exhausting and inefficient. There needs to be a hybrid solution that can leverage low-domain natural language alongside high-domain, command-based recognition. I cannot overstate how important low latency is between the start of a command and the action it produces. High latency means a heavy cognitive load, not to mention plain inefficiency.
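A minimal sketch of that hybrid idea, assuming a hypothetical command table and a caller-supplied natural-language fallback (the command phrases and action names here are illustrative, not from any real system): exact short commands take a fast path, and anything else falls through to the slower NL parser.

```python
# Hypothetical table of one- or two-word voice commands mapped to actions.
COMMANDS = {
    "scroll down": lambda: "scrolled",
    "close": lambda: "closed window",
    "next tab": lambda: "switched tab",
}

def handle_utterance(utterance, nl_fallback):
    """Dispatch short commands directly; hand longer speech to an NL parser."""
    key = utterance.strip().lower()
    if key in COMMANDS:
        # Fast path: exact command match, near-zero parsing cost.
        return COMMANDS[key]()
    # Slow path: full natural-language handling for open-ended requests.
    return nl_fallback(utterance)
```

The design point is that the fast path never pays the cost of the NL stack, which is what keeps the command-to-action latency low for the chains of short commands described above.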
There's a lot of overlap between UI automation and accessibility control tools. However, UI Automation (UIA) has always been a slow process, simply because the stack has never had demand from developers for low latency.
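One way to make that latency concern concrete is to measure it per action against a budget. This is a generic sketch, not tied to any real UIA API: the 150 ms budget is an assumed illustrative threshold, and `action` stands in for whatever automation call actually performs the work.

```python
import time

# Hypothetical latency budget from command start to visible action.
LATENCY_BUDGET_S = 0.15

def timed_action(action):
    """Run an action and report whether it met the latency budget."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    # A missed budget is where the cognitive-load cost shows up.
    return result, elapsed, elapsed <= LATENCY_BUDGET_S
```

Instrumenting each command this way makes it obvious which parts of an automation stack are the ones a user actually ends up waiting on.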
It's the difference between having an independent agent do something on your behalf, not caring how long it takes, and you waiting for an asynchronous task to complete.