SIP-SIR

From QXIP


Lots of noise about voice recognition and AI lately, with crApple's SIRI (a multiplatform product they jumped on, did not invent, and then banned from other platforms) and now DEXETRA's excellent IRIS beta for Android making the buzz. People have rediscovered voice and want to use it to interact with their stuff. Like in the 80s! Surely the technology they use could also serve a huge variety of VoIP applications and services, but relying on local resources for voice recognition and text-to-speech is expensive and borderline impossible for most.

So let's go chop some wood....

Try our demo now:

  • WEB: http://qxip.net/bot
  • SIP: 411@qxip.net
  • PHONO: app:9991443508
  • INUM: +883510001809132
  • SKYPE: +990009369991443508
  • PSTN/US: +1-4149390585


SIP-SIR is reachable from just about any source thanks to Tropo's fantastic routing. Speech recognition and synthesis by Google; results by Google APIs, Wolfram-Alpha, True Knowledge Engine, MIT's START and Dexetra's IRIS EC2 mash engine.

Use a handset/headset/mic for best results!


Build Your Own:

Let's see how it works and play smart

Capturing a random IRIS search session and its results, the following ingredients are found (outdated):



No surprises there! Guessing the basics is not too hard: Google's Speech-to-text API converts speech to text (doh!), START & Cleverbot provide the AI for searches and personality, and most likely some smart internal code takes care of recognizing local functions such as calling and messaging - but we don't need that, or at least not yet. Since most of the work is outsourced to the above resources, we just need to orchestrate some strings.


Let's get some basic scripts to get up and running:

GOOGLE's SPEECH-TO-TEXT API

Using FLAC audio:

sox audio.wav audio-wide.wav rate 16000
flac audio-wide.wav

or convert/normalize using sox:

sox test.wav test.flac gain -n -5 silence 1 5 2%
wget -q -U "Mozilla/5.0" --post-file="/tmp/audio-wide.flac" --header="Content-Type: audio/x-flac; rate=16000" -O /tmp/test.txt "http://www.google.com/speech-api/v1/recognize?lang=en-us&client=chromium"

or 

curl --data-binary @audio-wide.flac --header 'Content-type: audio/x-flac; rate=16000' \
 'https://www.google.com/speech-api/v1/recognize?lang=en-us&client=chromium'


JSON Reply:
{"status":0,"id":"d91e49b8223fa5e80ca4e408cce61d7c-1","hypotheses":[{"utterance":"this is pure nonsense","confidence":0.82811373}]}
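The reply is plain JSON, so pulling out the recognized text takes only a few lines. The following is a minimal sketch, assuming the reply always has the shape shown above; `best_utterance` is a hypothetical helper name, and treating a missing or empty `hypotheses` list as "no match" is an assumption rather than documented API behaviour.

```python
import json

# Sample reply copied from the example above.
SAMPLE_REPLY = (
    '{"status":0,"id":"d91e49b8223fa5e80ca4e408cce61d7c-1",'
    '"hypotheses":[{"utterance":"this is pure nonsense","confidence":0.82811373}]}'
)


def best_utterance(reply_text):
    """Return the highest-confidence utterance, or None if nothing usable came back."""
    try:
        reply = json.loads(reply_text)
    except ValueError:
        # Empty body or non-JSON error page.
        return None
    if not isinstance(reply, dict):
        return None
    hypotheses = reply.get("hypotheses") or []
    if not hypotheses:
        return None
    # Assumption: a hypothesis without a confidence value ranks lowest.
    best = max(hypotheses, key=lambda h: h.get("confidence", 0.0))
    return best.get("utterance")


if __name__ == "__main__":
    print(best_utterance(SAMPLE_REPLY))
```

Parsing the JSON properly avoids the surprises a grep/sed chain can hit when the utterance contains quotes or escaped characters.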




Using SPEEX audio:

Plain Speex will not return great results, if any at all, as you might have noticed. That's because the Google API requires a different packet format, the MIME type "x-speex-with-header-byte", for some reason unclear to us common mortals. Luckily, Chromium brings a hint:

// Encode the frame and place the size of the frame as the first byte. This
// is the packet format for MIME type x-speex-with-header-byte.
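Read literally, that comment describes a simple length-prefixed stream: one byte holding the frame size, then the frame itself. The sketch below illustrates only that framing, using placeholder bytes instead of real Speex frames; the function names are hypothetical, and whether the API expects anything beyond these length-prefixed frames is not confirmed here.

```python
def frame_with_header_byte(frames):
    """Prefix each already-encoded frame with one byte holding its length."""
    out = bytearray()
    for frame in frames:
        if len(frame) > 255:
            raise ValueError("frame too large for a one-byte length prefix")
        out.append(len(frame))
        out.extend(frame)
    return bytes(out)


def split_header_byte_stream(data):
    """Inverse operation, handy for sanity-checking an encoder's output."""
    frames = []
    pos = 0
    while pos < len(data):
        size = data[pos]
        frame = data[pos + 1 : pos + 1 + size]
        if len(frame) != size:
            raise ValueError("truncated frame")
        frames.append(frame)
        pos += 1 + size
    return frames


if __name__ == "__main__":
    # Placeholder payloads, not real Speex data.
    stream = frame_with_header_byte([b"ab", b"cde"])
    print(stream, split_header_byte_stream(stream))
```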

Now, is it worth it against FLAC? It depends on your application, but Speex is FAST and SMALL.


....Solution?

Use our patch for speexenc with "x-speex-with-header-byte" support!

A pre-patched version is available on GitHub and we'd like to hear your feedback.


To Encode use the new --headerbyte option:

# speexenc --headerbyte --quality 3 --w audio-wide.wav example.spx
Encoding 16000 Hz audio using wideband (sub-band CELP) mode (mono)
Warning: with-header-byte output will not be compatible with most decoders.
 
wget -q -U "Mozilla/5.0" --post-file="/tmp/example.spx" --header="Content-Type: audio/x-speex-with-header-byte; rate=16000" -O /tmp/test.txt "http://www.google.com/speech-api/v1/recognize?lang=en-us&client=chromium"

or 

curl --data-binary @example.spx --header 'Content-type: audio/x-speex-with-header-byte; rate=16000' \
 'https://www.google.com/speech-api/v1/recognize?lang=en-us&client=chromium'


JSON Reply:

{"status":0,"id":"d91e49b8223fa5e80ca4e408cce61d7c-1","hypotheses":[{"utterance":"this is pure nonsense","confidence":0.82811373}]}



CSAIL/START PHP Hook

Now we have a string containing the recognized speech - likely, a question? Let's ask MIT's START a question and get the reply (unsanitized, example only)

 <?php
 
 if (isset($_GET["q"])) { 
 // urlencode() turns spaces into "+" and also escapes &, ?, # and friends
 $query = urlencode($_GET["q"]); 
 $homepage = get_data('http://start.csail.mit.edu/startfarm.cgi?query='.$query); 
 $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
 $content = str_replace($newlines, "", html_entity_decode($homepage)); 
 // group 1 holds the reply body without the surrounding markers
 if (preg_match_all("|<!-- REPLY-QUALITY: T --><P>(.*)</span><span|U", $content, $rows) > 0) {
  echo $rows[1][0];
 }
 } 
 
 function get_data($url)
 {
  $ch = curl_init();
  $timeout = 5;
  curl_setopt($ch,CURLOPT_URL,$url);
  curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
  curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
  curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
 }
 
 ?>
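The two interesting steps of the hook, building the query URL and cutting the reply out of the returned page, can be tried offline. This is a minimal sketch in Python, assuming the same endpoint and the same reply markers as the PHP above; the function names and the sample HTML in the usage lines are made up for illustration, and START's real markup may differ or change.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the PHP script above.
START_URL = "http://start.csail.mit.edu/startfarm.cgi"

# Same markers as the PHP regex; non-greedy, and "." also matches newlines.
REPLY_RE = re.compile(r"<!-- REPLY-QUALITY: T --><P>(.*?)</span><span", re.S)


def build_start_url(question):
    """Return the query URL with the question safely URL-encoded."""
    return START_URL + "?" + urlencode({"query": question})


def extract_reply(html):
    """Return the text between the reply markers, or None if they are absent."""
    match = REPLY_RE.search(html)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(build_start_url("What time is it in Tokyo?"))
    # Invented page fragment, only to exercise the regex.
    print(extract_reply('<!-- REPLY-QUALITY: T --><P>It is noon.</span><span class="x">'))
```

Returning `None` when the markers are missing lets the caller fall back to another engine instead of speaking an empty answer.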



Back to the Fu.. Audio, back to Audio

Since we're squeezing juice out of our friends at Google, let's do it all the way. Shoot the resulting string their way and get speech back (NOTE: you will need to set proper User-Agent and Referer headers for this to work):

http://translate.google.com/translate_tts?tl=en_en&q=never+heard+of+such+thing+sorry

NOTE: Apparently, the above is sometimes blocked for random access. Use alternatives such as AT&T's TTS, Loquendo, etc.
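Since the request needs both an encoded query and custom headers, here is a minimal sketch of building it in Python. The User-Agent matches the one used elsewhere on this page; the Referer value, the `tl=en` default and the function name are assumptions, and nothing here guarantees the endpoint will accept the request.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the URL above.
TTS_URL = "http://translate.google.com/translate_tts"


def build_tts_request(text, lang="en"):
    """Build (but do not send) a TTS request with encoded text and headers."""
    url = TTS_URL + "?" + urlencode({"tl": lang, "q": text})
    headers = {
        "User-Agent": "Mozilla/5.0",
        # Assumption: any plausible Referer; the exact value needed is not documented here.
        "Referer": "http://translate.google.com/",
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    req = build_tts_request("never heard of such thing sorry")
    print(req.full_url)
    # To actually fetch the audio (may be blocked, see the note above):
    #   from urllib.request import urlopen
    #   open("/tmp/answer.mp3", "wb").write(urlopen(req).read())
```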

Chrome/Chromium Working Example

Here's a little working demo using Chrome's built-in support for the Speech API and the above START script (getstart.php):


 <html> 
 <head> 
 <title>SIP-SIR Demo</title> 
 <script src="http://code.jquery.com/jquery-latest.js"></script> 
 <script type="text/javascript"> 
 
 function getSpeech() {
 	var speech = document.getElementById("speechfield").value;
 	//alert('You said: '+speech);
 	$("#response").load("getstart.php?q="+encodeURIComponent(speech), function(response, status, xhr) {
  		if (status == "error") {
  		  var msg = "Sorry but there was an error: ";
  	 	   $("#response").html(msg + xhr.status + " " + xhr.statusText);
  		}
 	});	  
 
  } 
 
 </script> 
 </head> 
 <body> 
 <img src="logo.png"><br>
 <input type="text" id="speechfield" x-webkit-speech />
 <input type="button" value="Ask" onclick="getSpeech();" />
 <div id="response" name="response"></div>
 </body></html>

Done. Seen. What now?

So, we started with an audio file, transformed it into text, used the text to gather more info, and turned the new info back into audio, with a nice Chrome example along the way. Now it would be nice to integrate the above within a SIP call, wouldn't it? That's exactly what's next! Let's pick a random candidate: FreeSWITCH!

BASH time!

Now let's leave the form behind and put all of the above together as a prototype: a local script that takes the audio file as a parameter, converts it to either FLAC or SPX, sends it through the chain, and generates a final mp3 with the answer from any engine (Wolfram, START, True Knowledge, you name it) as speech ready to play back. Help yourself using bash, perl, js, python, you name it; a silly example follows. Something like the dialplan further down should get further experimenting started (its cleanup line is out of scope for testing).


FreeSWITCH can't do it all for us so for the sake of having a working example, here's a little rude basic bash demo script getting it all together for your ENTERTAINMENT only... and to be used to make things easier to test directly from the dialplan. NOTE: This is not the script running our demo, please brew your own better implementation and kick a$$ ;)

  • Edit & chmod +x /opt/sipsir.sh (or whatever)
#!/bin/bash
# SIPSIR speech recognition for FS
# Version 0.01 by webdelic@gmail.com
file="$1"
echo ".$(date) NEW SESSION" >> /var/log/sipsir.log
# "$file" is quoted throughout: unquoted, an empty argument would pass the
# -e test and turn the cleanup below into "rm -rf .*"
if [ -e "$file" ]; then
 rm -rf "$file".*
 echo "SIPSIR: Converting to speex... $file" >> /var/log/sipsir.log
 # force frequency & wideband, convert to speex headerbyte format
 sox "$file" -p rate 16000 | speexenc --headerbyte --vbr --quality 2 --w - "$file.spx" >> /var/log/sipsir.log
 if [ -s "$file.spx" ]; then
  echo "SIPSIR: Querying Google Speech Recognition API" >> /var/log/sipsir.log
  # keep only the recognized utterance and join its words with "+"
  curl --data-binary @"$file.spx" --header 'Content-type: audio/x-speex-with-header-byte; rate=16000' \
   'https://www.google.com/speech-api/v1/recognize?client=qxip&lang=en-US&maxresults=1' 2>&1 \
   | grep -Poi '"utterance":.*?[^\\]"' | sed 's/\"utterance\"://' | sed 's/"//g' | sed 's/ /+/g' > "$file.text"
  speech=`cat "$file.text"`
  if [ -z "$speech" ] ; then
   speech='sorry, try again'
   echo "No Match, exiting..." >> /var/log/sipsir.log
   exit
  fi
  echo "Matched: $speech" >> /var/log/sipsir.log
#  wget -q -U "Mozilla/5.0" -O "$file.mp3" "http://translate.google.com/translate_tts?tl=en_en&q=$speech"
  echo "Asking MIT's START..." >> /var/log/sipsir.log
  wget -q -U "Mozilla/5.0" -O "$file.answer" "http://localhost/getstart.php?q=$speech"
  answer=`cat "$file.answer"`
  echo "Answer: $answer" >> /var/log/sipsir.log
  wget -q -U "Mozilla/5.0" -O "$file.mp3" "http://translate.google.com/translate_tts?tl=en_en&q=$answer"

  # echo $speech;
  if [ -s "$file.mp3" ]; then
   echo "SIPSIR: Translation Ready! $file.mp3" >> /var/log/sipsir.log
  fi
 else
  echo "SIPSIR: Errors Transcoding $file" >> /var/log/sipsir.log
 fi
else
 echo "SIPSIR: File $file does not exist" >> /var/log/sipsir.log
fi
echo "---------------------------------------------------------------" >> /var/log/sipsir.log
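For anyone brewing their own implementation, the chain in the script boils down to three stages: recognize, ask, synthesize. Below is a minimal sketch of that flow with each stage passed in as a function, so engines can be swapped or stubbed out for testing. The function and parameter names are hypothetical, and unlike the bash script (which simply exits on a failed match) this version speaks a fallback phrase.

```python
def answer_question(audio_path, recognize, ask, synthesize, fallback="sorry, try again"):
    """Run recognize -> ask -> synthesize and return (text, answer, audio).

    recognize(audio_path) -> recognized text, or "" / None on no match
    ask(text)             -> answer text, or "" / None if the engine has nothing
    synthesize(answer)    -> whatever the TTS stage produces (e.g. an mp3 path)
    """
    text = recognize(audio_path)
    if not text:
        # Nothing recognized: speak the fallback instead of going silent.
        return None, fallback, synthesize(fallback)
    answer = ask(text) or fallback
    return text, answer, synthesize(answer)


if __name__ == "__main__":
    # Stub stages standing in for the speech API, START and TTS calls.
    result = answer_question(
        "/tmp/question.wav",
        recognize=lambda path: "what time is it in tokyo",
        ask=lambda text: "It is noon.",
        synthesize=lambda answer: "/tmp/question.wav.mp3",
    )
    print(result)
```

Keeping the stages injectable makes it easy to try Wolfram, START or True Knowledge in turn without touching the call flow.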

FS Dialplan + Script Example

  • Record the audio and process it with the script from the dialplan:
 <extension name="SIPSIR-411">
  <condition field="destination_number" expression="^411$">
  <action application="answer"/>
     <action application="sleep" data="1000"/>
  <action application="playback" data="shout://translate.google.com/translate_tts?tl=en&q=Welcome+to+sip+sir.+Ask+your+question"/>
  <!--Record 10 seconds of audio, use # as recording session terminator /-->
  <action application="set" data="playback_terminators=#"/>
  <action application="set" data="file=/tmp/${strftime(%Y-%m-%d-%H-%M-%S)}_${caller_id_number}"/>
  <action application="record" data="${file}.wav 10 200"/>
  <action application="playback" data="shout://translate.google.com/translate_tts?tl=en&q=Please+wait+while+I+think"/>
  <!--Process the audio externally via SIPSIR and playback the resulting audio /-->
  <action application="system" data="/opt/sipsir.sh ${file}.wav"/>
  <action application="sleep" data="1000"/>
  <action application="playback" data="${file}.wav.mp3"/>
  <action application="hangup"/>
  <action application="system" data="rm -rf ${file}.*"/>
  </condition>
 </extension>
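Note that the dialplan records to ${file}.wav but plays back ${file}.wav.mp3: the script is handed the full recording name and appends its own suffixes to it. A tiny sketch of that naming convention, with a hypothetical helper name and an example path, in case you drive the script from your own code:

```python
def derived_paths(recording):
    """Map each suffix the script appends to the resulting file path."""
    return {ext: recording + "." + ext for ext in ("spx", "text", "answer", "mp3")}


if __name__ == "__main__":
    # Example recording path; the real name is built by the dialplan.
    for ext, path in derived_paths("/tmp/2012-01-01-12-00-00_1000.wav").items():
        print(ext, path)
```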

FS ESL/PHP + Script Example

An even simpler (and faster) FreeSWITCH ESL example, using one of their PHP examples but in sync mode to accommodate the audio playback. The two scripts that process the audio could easily be merged (and they are in our demo) but for the sake of separation....

  • Dialplan example
<extension name="outbound-socket">
<condition field="destination_number" expression="^55(522)$">
<action application="set" data="ivr_path=/usr/local/freeswitch/scripts/sipsir.php"/>
<action application="socket" data="127.0.0.1:8084 sync full"/>
</condition>
</extension>
  • PHP example
#!/usr/bin/php -q

<?php

// set a couple of things so we dont kill the system
ob_implicit_flush(true);
set_time_limit(60);

// Open stdin so we can read the data in
$in = fopen("php://stdin", "r");

$date=date("Y-m-d-H-i-s");
$rand=rand(1000, 9999);
$file="/tmp/sipsir/ESL_".$date."_".$rand;

// Connect
echo "connect\n\n";

// Answer
echo "sendmsg\n";
echo "call-command: execute\n";
echo "execute-app-name: answer\n\n";

// Play a prompt
echo "sendmsg\n";
echo "call-command: execute\n";
echo "execute-app-name: playback\n";
echo "execute-app-arg: shout://translate.google.com/translate_tts?tl=en&q=Ask+your+question\n\n";

sleep(1);

echo "sendmsg\n";
echo "call-command: execute\n";
echo "execute-app-name: set\n";
echo "execute-app-arg: playback_terminators=#\n\n";

echo "sendmsg\n";
echo "call-command: execute\n";
echo "execute-app-name: record\n";
echo "execute-app-arg: ".$file.".wav 10 200\n\n";

$target=$file.".wav";
while (!file_exists($target)) { sleep(1); }

// Play a prompt
echo "sendmsg\n";
echo "call-command: execute\n";
echo "execute-app-name: playback\n";
echo "execute-app-arg: shout://translate.google.com/translate_tts?tl=en&q=Please+Wait\n\n";

echo "sendmsg\n";
echo "call-command: execute\n";
echo "execute-app-name: system\n";
echo "execute-app-arg: /opt/sipsir.sh ".$file.".wav\n\n";

$target1=$file.".wav.mp3"; while ( !file_exists($target1)) { sleep(1); }

echo "sendmsg\n";
echo "call-command: execute\n";
echo "execute-app-name: playback\n";
echo "execute-app-arg: ".$file.".wav.mp3\n\n";

// Wait
sleep(1);

// Play a prompt
echo "sendmsg\n";
echo "call-command: execute\n";
echo "execute-app-name: playback\n";
echo "execute-app-arg: shout://translate.google.com/translate_tts?tl=en&q=Thank+you\n\n";
//echo "event-lock: true";

sleep(3);
// Hangup
echo "sendmsg\n";
echo "call-command: hangup\n\n";

fclose($in);

?>
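The PHP above repeats the same three or four header lines for every command and polls for files without any time limit. The sketch below shows both ideas in compact form: a builder for the `sendmsg` blocks, copied from the header layout used in the script, and a bounded wait. The function names, timeout and poll interval are illustrative assumptions, and the block only builds strings; it does not talk to FreeSWITCH.

```python
import os
import time


def sendmsg_execute(app, arg=None):
    """Build one 'sendmsg' block that executes a dialplan application."""
    lines = ["sendmsg", "call-command: execute", "execute-app-name: " + app]
    if arg is not None:
        lines.append("execute-app-arg: " + arg)
    # A blank line terminates the message, as in the PHP echo statements.
    return "\n".join(lines) + "\n\n"


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Wait until path exists; give up after timeout seconds instead of looping forever."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll)
    return os.path.exists(path)


if __name__ == "__main__":
    print(sendmsg_execute("answer"), end="")
    print(sendmsg_execute("playback", "/tmp/example.wav.mp3"), end="")
```

A bounded wait means a failed transcoding or an unreachable engine ends the call gracefully instead of leaving the script stuck.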

Test it!

Now all that's left is to dial your newly created extension and ask away!

So... "What time is it in Tokyo?"

  • WEB: http://qxip.net/bot
  • SIP: 411@qxip.net
  • PHONO: app:9991443508
  • INUM: +883510001809132
  • SKYPE: +990009369991443508
  • POTS/UK: +44-5603474156
  • PSTN/US: +1-4149390585



Follow us @qxip
