The article doesn't say they store voice (audio) data. I'd imagine you transcribe the audio to text and search that. Storing text is incredibly easy. Besides, you can throw away 99.9% of the data almost immediately.
I'm actually curious how much text data this would be per day: number of call minutes * average number of words per minute. I'd be surprised if that didn't fit in a reasonable cluster.
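For what it's worth, here is a back-of-the-envelope sketch of that estimate in Python. The 16 million population is the figure quoted elsewhere in this thread; the call minutes per person, the words per minute, and the bytes per word are placeholder assumptions I picked for illustration, not measured numbers.

```python
# Rough estimate of daily transcript volume:
#   call minutes * words per minute * bytes per word.
# Every input except the population is an illustrative assumption.

def daily_text_bytes(population, call_minutes_per_person,
                     words_per_minute, bytes_per_word):
    """Return the estimated bytes of transcribed text produced per day."""
    total_minutes = population * call_minutes_per_person
    total_words = total_minutes * words_per_minute
    return total_words * bytes_per_word


if __name__ == "__main__":
    estimate = daily_text_bytes(
        population=16_000_000,       # figure quoted in this thread
        call_minutes_per_person=10,  # assumption
        words_per_minute=150,        # assumption: conversational speech
        bytes_per_word=6,            # assumption: ~5 letters plus a space
    )
    print(f"{estimate / 1e9:.1f} GB of text per day")
```

With these placeholder inputs it comes out to about 144 GB of uncompressed text per day. Change the assumptions and the answer moves linearly, so even a 10x error in call minutes keeps it in the low terabytes.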
You underestimate the CPU power needed to do this. The Netherlands has a population of 16 million; by comparison, Google Voice has about 1.4 million users. That is an order of magnitude difference. On top of this, Google Voice only transcribes voicemail, not all calls. What is the ratio of calls to voicemail?
Computationally, transcribing all voice calls in the Netherlands to text could easily be two orders of magnitude more difficult than what Google Voice does.
I'm sorry, but do we really think that machine transcription of millions of cell phone conversations is worth anything? How can anyone believe that after using Google Voice?
So you use a hybrid approach. The text transcription can be fed into programs that look for specific phrases, build up social networks, etc. Then, for anyone you decide you actually want to monitor, you keep the audio as well as the machine transcription.
The machine transcription remains incredibly valuable for broad surveillance even though it is highly imperfect.
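To make the hybrid idea concrete, here is a minimal sketch in Python of the phrase-matching half. The function names (`find_phrases`, `triage`), the `(call_id, transcript)` input shape, and the whole-word matching rule are all my own assumptions for illustration; nothing here reflects how any real system works.

```python
import re


def find_phrases(transcript, phrases):
    """Return the watch phrases found in the transcript.

    Matching is case-insensitive and anchored on word boundaries,
    which is an assumption; a noisy machine transcript would likely
    need fuzzier matching than this.
    """
    hits = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, transcript, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


def triage(calls, phrases):
    """Split calls into 'keep audio' and 'text only' lists of call ids.

    `calls` is an iterable of (call_id, transcript) pairs (hypothetical shape).
    """
    keep_audio, text_only = [], []
    for call_id, transcript in calls:
        if find_phrases(transcript, phrases):
            keep_audio.append(call_id)
        else:
            text_only.append(call_id)
    return keep_audio, text_only
```

The point of the split is that only the first list costs you audio storage; everything else is reduced to cheap text, however imperfect.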