Siri-ously Speaking: How Voice Activation and Text-to-Speech Really Works | Chatsworth Products
We use cookies to personalize content and to analyze our traffic. You consent to cookies if you continue to use this website. To find out more or to disable cookies view our Privacy Policy. X

Siri-ously Speaking: How Voice Activation and Text-to-Speech Really Works

 Permanent link

Siri Lives in a Data CenterWhen Apple recently unveiled the iPhone 4S, many were surprised by the lack of any external design change or hardware configuration. Yet for all its surface-level sameness, it's since become apparent that the phone's true beauty is much more than skin deep.

We're talking of course about "Siri," a mild-mannered, informative and surprisingly witty personal assistant baked straight into the operating system of each new iPhone. Press a button, ask a question (out loud) and get an answer (yup, out loud), almost instantly. Same goes for reminders, text messages and more - simply ask (or tell, nicely!) Siri to do something, sit back and listen to the sage advice of one wise lady (did we mention the computer-generated voice is female?)

But just what makes Siri tick? Software, and lots of it. Truth be told, voice activation and voice recognition software has been in the works (and in use) for some time now. Siri's just the latest in a line of technology raising the bar on how intuitive voice recognition between us (humans) and them (computers) can be.

Here's a quick, nuts and bolts breakdown of the technology at play:

  • We verbalize a command or question to our handheld device. That information is captured as audio and routed over the internet to a far-off data center.
  • Once there, the audio information is funneled (or in our case, routed through superior cable management) to stacks and stacks of servers (resting comfortably in racks, cabinets or wall-mount systems).
  • Inside these servers, complex computations begin to take effect. First, automatic speech recognition (ASR) software transcribes our speech into text. This text is then processed, tagged and parsed.
  • The information within that text is then analyzed for content and a (hopefully) acceptable answer or resolution is formulated.
  • That answer is then transformed back into natural language text, and using text-to-speech software, synthesized back into device-generated speech.

All in a matter of seconds, the entire process runs its course (sound waves > phone > data center > back to phone = you, amazed!). The glue that holds it all together? The data center. The glue that holds the data center together? Who else? Chatsworth Products, Inc. Jeff Cihocki, eContent Specialist 

Posted by Kim Ream at 08/01/2013 11:06:26 AM

Leave a comment
Name (required)
Email (optional, not published)
Your URL (optional)

Note: Conversation is encouraged and expected. However, moderation of comments is necessary to prevent spam, personal attacks, profanity, or off-topic commentary. Comments related to specific product support or customer service issues will be addressed separately rather than posted here. Please email [email protected] for assistance with these matters.

2/24/2024 12:31:39 AM