Finding and Connecting to an Engine
An application selects an engine based on its properties. The EngineProperties class describes one operating mode of an engine. An engine that operates in several different modes (for example, it might support several different audio formats) has an EngineProperties object describing each of its modes. For each engine type, SVAPI provides a function that enumerates the available engines and their modes. The following code sample shows how to select a text-independent verification engine that uses threshold-based scoring and operates on 8-bit linear audio at either 8 or 11.025 kHz.
public EngineProperties selectEngine() {
    EngineProperties Prop = new EngineProperties();

    // Add the criteria we are checking for
    Prop.addProperty("engine.textmode", "independent");
    Prop.addProperty("engine.score.threshold", "true");
    Prop.addProperty("audio.format", "8bitLinear");
    EnginePropertiesList List = SVAPI.availableVerificationEngines(Prop);

    // Search the list for an engine supporting either 8000 or 11025 Hz
    for (int i = 0; i < List.size(); i++) {
        try {
            Prop = List.elementAt(i);
            int SampleRate = Integer.parseInt(Prop.getProperty("audio.freq"));
            if (SampleRate == 8000 || SampleRate == 11025)
                return Prop;
        } catch (NumberFormatException e) {}
    }
    return null;
}
An application can be very discriminating in selecting an engine and work with a single mode of operation of a particular engine, but a more robust application will support several of the modes that engines might offer. For example, a robust application might support both threshold-based and cohort-based verification engines. The engine properties let the application know what calls it can make to the engine.
The following code sample shows how an application can connect to a verification engine.
public VerificationEngine connectToEngine(EngineProperties Prop) {
    // This function creates a new verification engine instance.
    // The engine selected is determined by Prop.
    try {
        return SVAPI.createVerificationEngine(Prop, new MyEngineNotify());
    } catch (Exception e) {
        return null;
    }
}
In the above sample, the application constructed an instance of MyEngineNotify, which it passed to SVAPI when creating the engine. This is the application's callback object: the engine calls methods on it when events occur that the application needs to be notified of. In this case, MyEngineNotify is an application-defined class that implements the EngineNotify interface. The following sample shows a minimal EngineNotify implementation.
class MyEngineNotify implements EngineNotify {
    public void asynchronousException(Engine Engine, Throwable e) {}

    public DataStore getDataStore() {
        return DataStore;
    }

    public boolean requestDisposal(Engine Engine, Model[] Model) {
        return false;
    }

    public boolean requestDisposal(Engine Engine, Utterance[] Utterance) {
        return false;
    }

    // Use the simple DataStore provided by SVAPI
    private DataStore DataStore = new MemoryDataStore();
}
SVAPI-compliant engines can support live audio, batch audio, or both. If an engine supports live audio, it processes a stream of audio data. Live audio sources include the desktop microphone, an audio device installed in the local machine (such as a telephony board), a Java telephony call, or an application-supplied stream. If the engine supports batch audio, it processes audio data that has been collected by the application and is fed to the engine one chunk (Utterance) at a time. Utterances might come from a database, or the application might collect them itself from a live audio stream and feed them to the engine. The application can determine whether an engine supports live or batch audio from its properties object.
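The batch model can be illustrated without any engine at all: the application reads a stream and carves it into fixed-size chunks, each of which it would hand to the engine as one Utterance. The AudioChunker class below is a hypothetical helper, not part of SVAPI; real chunk boundaries would normally come from silence detection or from the engine's own classification results.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of SVAPI): carve a live audio stream into
// fixed-size chunks, as a batch-oriented application might do before
// handing each chunk to an engine as one Utterance.
class AudioChunker {
    static List<byte[]> chunkStream(InputStream in, int chunkSize) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        byte[] buffer = new byte[chunkSize];
        int filled = 0;
        int read;
        while ((read = in.read(buffer, filled, chunkSize - filled)) != -1) {
            filled += read;
            if (filled == chunkSize) {
                // Buffer is full: emit a complete chunk and start a new one
                chunks.add(buffer.clone());
                filled = 0;
            }
        }
        if (filled > 0) {
            // Emit the final, partial chunk
            byte[] last = new byte[filled];
            System.arraycopy(buffer, 0, last, 0, filled);
            chunks.add(last);
        }
        return chunks;
    }
}
```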
Note that an engine mode can only support one audio format at one sampling rate. If an engine supports multiple formats or rates, it must have an EngineProperties object for each mode. If an application needs to process audio that is in a variety of different formats or rates, it must establish a connection to an engine for each different format or rate.
The following code sample shows how an application could classify audio data from a URL (which could point to a file on the local file system) using the live audio functions. In this example, the engine will read data from the URL and classify it until it reaches the end of the file. As the engine is processing the audio, it will be calling back to the application's ClassificationNotify object to give it the classified audio segments. When the end-of-file is reached, the engine will call back to the application (this time to its EngineNotify object) with an end-of-file exception.
public void classifyFileLive(URL URL, ClassificationEngine Engine, ClassificationNotify Notify)
        throws RemoteException, SVAPIException, IOException {
    // Open the file for input
    InputStream InputStream = URL.openStream();

    // Tell the engine to use the input stream for audio.
    // Note that no OutputStream is required.
    Engine.useStreamAudio(InputStream, null);

    // Start classifying
    Engine.enableAsynchronousClassification(Notify);
}
The above sample makes use of a ClassificationNotify object. Following is an implementation of a ClassificationNotify object which collects the classified utterances and allows them to be obtained when they are all collected. This sample is a little complicated because of its asynchronous nature.
class MyClassificationNotify extends MyEngineNotify implements ClassificationNotify {
    public synchronized void asynchronousException(Engine Engine, Throwable e) {
        // Check if the exception is an end-of-file.
        // If not, save it so it can be thrown to the application.
        if (!(e instanceof EOFException)) {
            if (e instanceof SVAPIException)
                Error = (SVAPIException) e;
            else
                Error = new SVAPIException(null, e);
        }

        // Signal to other threads that we are done classifying.
        Done = true;
        notifyAll();
    }

    public synchronized void classified(ClassificationEngine Engine,
            ClassificationResults[] NewResults) {
        // Add the new classified audio to the Results array.
        if (!Done) {
            int OldSize = (Results == null ? 0 : Results.length);

            // Allocate a bigger array and copy in the old and new results
            ClassificationResults[] NewResultsArray =
                new ClassificationResults [OldSize + NewResults.length];
            if (Results != null)
                System.arraycopy(Results, 0, NewResultsArray, 0, OldSize);
            System.arraycopy(NewResults, 0, NewResultsArray, OldSize, NewResults.length);

            // Set the new array
            Results = NewResultsArray;
        }
    }

    public synchronized ClassificationResults[] waitForResults() throws SVAPIException {
        // This function waits for the signal that the processing is complete.
        while (!Done) {
            try {
                wait();
            } catch (InterruptedException e) {}
        }

        // Check if we are done processing because of an error.
        if (Error != null)
            throw Error;
        return Results;
    }

    private ClassificationResults[] Results = null;
    private boolean Done = false;
    private SVAPIException Error = null;
}
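The MyClassificationNotify class relies on Java's wait()/notifyAll() idiom. The same idiom can be exercised in isolation; ResultLatch below is a stripped-down, hypothetical illustration of the pattern and is not part of SVAPI.

```java
// Hypothetical illustration (not part of SVAPI) of the wait()/notifyAll()
// idiom used by MyClassificationNotify: one thread publishes a result,
// another blocks until it arrives.
class ResultLatch {
    private Object result = null;
    private boolean done = false;

    synchronized void publish(Object value) {
        result = value;
        done = true;
        notifyAll();              // wake any thread blocked in awaitResult()
    }

    synchronized Object awaitResult() throws InterruptedException {
        while (!done)             // loop guards against spurious wakeups
            wait();
        return result;
    }
}
```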
The following code sample also shows how an application could classify audio data from a URL, but using the batch audio functions.
public ClassificationResults[] classifyFileBatch(URL URL, ClassificationEngine Engine)
        throws RemoteException, SVAPIException, IOException {
    // Make an utterance from the file's data
    Utterance Utterance = Engine.createUtterance(URL);
    return Engine.classify(Utterance);
}
Finally, the following code sample shows how an application might connect to a classification engine and have it classify audio from a URL, regardless of whether the engine supports live or batch audio.
public ClassificationResults[] classifyFile(URL URL)
        throws RemoteException, SVAPIException, IOException {
    EngineProperties Prop = new EngineProperties();

    // Add the common criteria we are checking for.
    // Note that we do not constrain engine.audio here,
    // because we can handle either live or batch engines.
    Prop.addProperty("audio.format", "16bitLinear");
    Prop.addProperty("audio.freq", "11025");
    EnginePropertiesList List = SVAPI.availableClassificationEngines(Prop);
    if (List.size() == 0)
        throw new NoSuchEngineException();
    Prop = List.elementAt(0);

    // Create a ClassificationEngine
    MyClassificationNotify Notify = new MyClassificationNotify();
    ClassificationEngine Engine = SVAPI.createClassificationEngine(Prop, Notify);

    // Check if the engine supports live or batch audio.
    // Note that the engine might support both, in which case we default to batch.
    if (Prop.containsProperty("engine.audio", "utterance"))
        return classifyFileBatch(URL, Engine);
    else if (Prop.containsProperty("engine.audio", "stream")) {
        classifyFileLive(URL, Engine, Notify);
        return Notify.waitForResults();
    } else
        throw new NoSuchEngineException("engine.audio");
}
In designing an application, the choice to support batch or live audio, or both, will probably be the most fundamental design decision.
A model is an abstraction of the data that an engine maintains to be able to identify a speaker. Models can be created by asking the engine to create one. A model can be created using data from one or more existing models, or new, empty models can be created and then enrolled. The enrollment process involves prompting the speaker to speak one or more utterances which are processed to extract important features.
The following sample shows how to create a new model and enroll it. The code shows how to enroll a model using an engine that allows a user to select a password, as well as an engine that is more interactive and generates different prompts each time. The code also demonstrates how to work with multiple prompt types, and it works for either a verification or an identification engine. The calls askUserToChoosePassword, promptUserToSpeak, and collectAnUtterance are pseudo-code placeholders for application-defined functions.
public Model createNewUser(ModelBasedEngine Engine, String Name)
        throws SVAPIException, IOException, RemoteException {
    // Ask the user for a password, if the engine can allow that.
    EngineProperties Prop = Engine.getProperties();
    String Password = null;
    if (Prop.containsProperty("engine.enroll", "application")) {
        Password = askUserToChoosePassword(Name);
        if (Password == null)
            return null;
    }

    // Set some flags
    boolean LiveAudio = Prop.containsProperty("engine.audio", "stream");
    boolean Done = false;
    Cancelled = false;

    // Create a new model
    Model Model = Engine.createModel(Name);
    try {
        // Setup the prompt types we can have
        Class[] PromptTypes = null;
        try {
            if (LiveAudio) {
                PromptTypes = new Class [2];
                PromptTypes[0] = Class.forName("java.lang.String");
                PromptTypes[1] = Class.forName("COM.novell.SVAPI.Utterance");
            } else {
                PromptTypes = new Class [1];
                PromptTypes[0] = Class.forName("java.lang.String");
            }
        } catch (ClassNotFoundException e) {}
        // These classes are guaranteed to be found, so don't worry about exceptions

        // Prepare the model to be adapted
        Model.enroll(Password);
        while (!Done && !Cancelled) {
            // Always get our prompt, no matter what.
            // The engine may change the prompt each time
            // or put extra information in it.
            Object Prompt = Model.getPrompt(PromptTypes);

            // Do the prompting based on the prompt object type
            if (Prompt instanceof String)
                promptUserToSpeak(Name, (String) Prompt);
            else if (Prompt instanceof Utterance)
                ((Utterance) Prompt).play();

            // Adapt the model until it says it is fully trained
            if (LiveAudio)
                Done = Model.adapt();
            else {
                Utterance Utterance = collectAnUtterance();
                Done = Model.adapt(Utterance);
                Utterance.dispose();
            }
        }
    } finally {
        if (Cancelled) {
            // Remove the model from memory and from persistent storage.
            // Don't throw any exceptions because we want to preserve
            // the exception thrown by enroll.
            // We also want to do as much cleanup as possible,
            // even if part of the cleanup fails.
            try { Model.abortEnrollment(); } catch (Exception e) {}
            try { Model.dispose(); } catch (Exception e) {}
            try { Engine.destroyModel(Name); } catch (Exception e) {}
        }
    }

    // If we got here, the call was successful or the user canceled.
    // Return null if the user canceled.
    if (Cancelled)
        return null;
    return Model;
}

// This function should be called if the user cancels during the enrollment process.
public void cancelEnrollment(Model Model) throws SVAPIException, RemoteException {
    Cancelled = true;
    Model.abortEnrollment();
}
Persistent models have names and the engine will automatically save them in persistent storage so they can be reinstantiated later. Temporary models do not have names and are not saved in persistent storage.
Many SVAPI applications are likely to have high requirements for persistent storage and are going to want to use an industrial-strength database, probably the same database they keep their other client information in. Because of this, SVAPI provides a data store mechanism whereby the engine can use the application's database to store its persistent models. The interface for this is called DataStore and every application must provide the engine with an implementation. The DataStore interface has methods for managing named blobs (arbitrary binary data). DataStore has locking mechanisms which make it suitable for shared databases.
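As a rough illustration of the bookkeeping a DataStore implementation performs, the following self-contained sketch stores named blobs and grants exclusive per-name locks. The class and method names here are hypothetical; this is not the actual DataStore interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the actual SVAPI DataStore interface) of a
// named-blob store with simple exclusive per-name locking.
class BlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private final Map<String, Thread> locks = new HashMap<>();

    synchronized void write(String name, byte[] data) {
        checkLock(name);
        blobs.put(name, data.clone());   // defensive copy
    }

    synchronized byte[] read(String name) {
        byte[] data = blobs.get(name);
        return data == null ? null : data.clone();
    }

    synchronized boolean lock(String name) {
        // Grant the lock only if no other thread already holds it
        Thread owner = locks.get(name);
        if (owner != null && owner != Thread.currentThread())
            return false;
        locks.put(name, Thread.currentThread());
        return true;
    }

    synchronized void unlock(String name) {
        if (locks.get(name) == Thread.currentThread())
            locks.remove(name);
    }

    private void checkLock(String name) {
        Thread owner = locks.get(name);
        if (owner != null && owner != Thread.currentThread())
            throw new IllegalStateException(name + " is locked by another thread");
    }
}
```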
For small-scale and test applications, SVAPI provides a simple implementation for the DataStore interface, called MemoryDataStore. The MemoryDataStore object is a simple memory-based database that can be loaded and saved to a single URL. The MemoryDataStore object cannot be shared across processes.
Verification involves comparing audio data to a voice model, or in the case of cohort-based verification, a model and the models of its cohorts. Verification can be performed on either live audio or batch audio. The VerificationEngine.verify() call performs verification on batch audio and can be called as follows:
Score Score = Engine.verify(Model, Utterance);
The interpretation of the score is described in the section on Scores.
The following sample shows how an application can perform verification on live audio using the asynchronous verification functions.

class Notify extends MyEngineNotify implements VerificationNotify {
    public static Score verify(VerificationEngine Engine, Model Model)
            throws SVAPIException, RemoteException {
        Notify Notify = new Notify();
        Engine.enableAsynchronousVerification(Model, Notify);
        return Notify.getScore();
    }

    public synchronized void asynchronousException(Engine Engine, Throwable e) {
        // Save the exception so it can be thrown to the application
        if (e instanceof SVAPIException)
            Error = (SVAPIException) e;
        else
            Error = new SVAPIException(null, e);

        // Signal the other threads
        notifyAll();
    }

    public synchronized void verificationScoreChanged(VerificationEngine Engine,
            Utterance Utterance, Score Score) {
        // Save the score, signal any waiting threads, and turn off verification
        this.Score = Score;
        notifyAll();
        try {
            Engine.disableAsynchronousVerification();
        } catch (Exception e) {}
    }

    public synchronized Score getScore() throws SVAPIException {
        // Wait until there is a score, then return it
        while (Score == null && Error == null) {
            try {
                wait();
            } catch (InterruptedException e) {}
        }
        if (Error != null)
            throw Error;
        return Score;
    }

    private Score Score = null;
    private SVAPIException Error = null;
}
Identification works almost exactly like verification. The principal difference is that the identification functions take multiple models instead of a single model and return multiple scores. Because identification functionality is so similar to verification, no sample code is provided.
Classification also works very much like verification, except that utterances are returned instead of scores. Note that classification engines do not require any support for models. The Working With Audio section contains some good classification sample code.
Both verification and identification engines return scores. A SVAPI score can have up to four components: a boolean accept/reject, a scalar raw score, an array of scalar raw scores for the speaker's cohorts (speakers that have similar voice patterns) and a binary score. The boolean accept/reject component is required. The other components are optional. The binary score is completely engine-dependent and should not be used if interoperability is desired. It is provided because some applications may need to archive this information for legal purposes, or have high-security needs that require information not supported by SVAPI. The score components that an engine outputs can be determined from the engine's properties.
Some engines perform verification by generating scalar raw scores for the assumed speaker and his or her cohorts. The actual score is not as important as the distribution of scores (for example, the assumed speaker might be required to have a higher score than all the cohorts). These engines are called cohort-based engines. Other engines perform verification by generating a scalar raw score for the assumed speaker. The actual score is generally compared against some threshold to perform an accept/reject decision. These engines are called threshold-based engines. The engine's properties can be examined to determine which type an engine is. The following samples show how an application might interpret the different types of scores.
// This function returns true if the scalar score is larger than the threshold.
// Note that for some engines smaller scores are better,
// whereas in others larger scores are better.
// In a real application, this would need to be determined.
public boolean checkScalarScore(Score Score, double Threshold) {
    if (Score.hasScalarScore())
        return Score.getScalarScore() > Threshold;
    return Score.getScore();
}

// This function returns true if all of the cohorts' scores are less than
// MaxRatio times the assumed speaker's score.
// Note that this assumes that the scalar scores are all positive
// and that larger scores are better.
// Also note that this is not necessarily a good algorithm.
public boolean checkCohortScore(Score Score, double MaxRatio) {
    if (Score.hasCohortScore()) {
        CohortScore[] CohortScores = Score.getCohortScore();
        double AssumedSpeakerScore = CohortScores[0].getScore();
        for (int i = 1; i < CohortScores.length; i++)
            if (CohortScores[i].getScore() > AssumedSpeakerScore * MaxRatio)
                return false;
        return true;
    }
    return Score.getScore();
}

// This function gives a skeleton for interpreting a binary score.
public boolean checkBinaryScore(Score Score) throws IOException {
    if (Score.hasBinaryScore()) {
        ByteArrayInputStream InputStream = new ByteArrayInputStream(Score.getBinaryScore());
        DataInputStream DataInputStream = new DataInputStream(InputStream);
        try {
            // Read data from DataInputStream to calculate the score
        } finally {
            DataInputStream.close();
        }
    }
    return Score.getScore();
}
Regardless of the types of scores an engine generates, an application can always count on the simple boolean accept/reject decision. However, this leaves the application at the engine's whim. For threshold-based engines, the boolean decision will generally be generated by comparing the raw score to some threshold. For such engines, the application can set the desired threshold. The application can then simply look at the boolean decision and not worry about looking at raw scores.
There are two ways to set the engine's threshold. The first is to call the engine's ModelBasedEngine.setRawThreshold method, which sets a scalar value that scores are compared against to obtain the boolean decision. The second is to call the engine's ModelBasedEngine.setConfidenceThreshold method, which derives the threshold value from a confidence level, a measure of how "confident" the engine is that the assumed speaker is the actual speaker. The mapping of raw scores to confidence levels is based on statistical data and is sensitive to the environment in which the samples were taken. The environment includes such things as noise levels, sampling rate and format, and microphone.
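Once a raw threshold has been set, the accept/reject decision reduces to a simple comparison whose direction depends on whether the engine considers larger or smaller raw scores a better match. The following engine-independent sketch (with a hypothetical accept helper) makes that direction explicit:

```java
// Engine-independent sketch of a threshold decision. Whether larger or
// smaller raw scores indicate a better match is engine-specific, so the
// direction is passed in explicitly. The accept helper is hypothetical,
// not part of SVAPI.
class ThresholdDecision {
    static boolean accept(double rawScore, double threshold, boolean largerIsBetter) {
        return largerIsBetter ? rawScore >= threshold : rawScore <= threshold;
    }
}
```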
The following sample shows how an application might compute a raw threshold corresponding to a desired confidence level, using test data drawn from the engine's model database.

public double calcThreshold(VerificationEngine Engine, int ConfidenceLevel)
        throws SVAPIException, IOException, RemoteException {
    // Create a vector to hold the models and utterances to be used
    Vector TestPairs = new Vector();

    // Go through all the models in the database
    Enumeration Iterator = Engine.getAllModels();
    while (Iterator.hasMoreElements()) {
        String Name = (String) Iterator.nextElement();

        // If something fails, just skip that user and go on to the next.
        // Don't allow the exception to go up the stack.
        try {
            // Try to load every model in the database
            Model Model = Engine.getModel(Name);

            // Get utterances for each speaker (possibly from a database).
            // Assume this function returns null if no utterance is available.
            Utterance[] Utterances = getUtterancesForUser(Engine, Model);

            // Add the model and utterances to the test vector
            if (Utterances != null) {
                for (int i = 0; i < Utterances.length; i++)
                    TestPairs.addElement(new ModelUtterancePair(Model, Utterances[i]));
            }
        } catch (Exception e) {}
    }

    // Require at least three data points
    if (TestPairs.size() < 3)
        throw new SVAPIException("Not enough data");

    // Copy the vector into an array
    ModelUtterancePair[] Pairs = new ModelUtterancePair [TestPairs.size()];
    for (int i = 0; i < Pairs.length; i++)
        Pairs[i] = (ModelUtterancePair) TestPairs.elementAt(i);

    // Get the confidence mapping array from the data
    ConfidenceMapping[] Mapping = SVAPI.getConfidenceData(Engine, Pairs, false);

    // getConfidenceData returns the results sorted by raw score.
    // If this curve is descending, flip it so it is ascending.
    if (Mapping[Mapping.length - 1].getConfidence() < Mapping[0].getConfidence()) {
        for (int i = 0; i < Mapping.length / 2; i++) {
            ConfidenceMapping t = Mapping[i];
            Mapping[i] = Mapping[Mapping.length - i - 1];
            Mapping[Mapping.length - i - 1] = t;
        }
    }

    // Do a binary search to find the desired confidence level
    int a = 0;
    int b = 0;
    int c = Mapping.length - 1;
    while (c > a + 1) {
        b = (a + c) / 2;
        if (Mapping[b].getConfidence() > ConfidenceLevel)
            c = b;
        else
            a = b;
    }

    // A real program would probably interpolate,
    // but that is left as an exercise to the interested reader.
    return Mapping[b].getRawScore();
}
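The interpolation left as an exercise above can be sketched on plain arrays: given an ascending confidence curve, linearly interpolate the raw score for a confidence level that falls between two measured points. ConfidenceInterpolator is a hypothetical helper, independent of SVAPI's ConfidenceMapping objects.

```java
// Linear interpolation over an ascending confidence curve, sketched on
// plain arrays rather than SVAPI's ConfidenceMapping objects.
// confidences[] must be sorted ascending; rawScores[i] pairs with confidences[i].
class ConfidenceInterpolator {
    static double rawScoreFor(double[] confidences, double[] rawScores, double level) {
        if (level <= confidences[0])
            return rawScores[0];
        for (int i = 1; i < confidences.length; i++) {
            if (level <= confidences[i]) {
                // Interpolate between measured points i-1 and i
                double t = (level - confidences[i - 1])
                         / (confidences[i] - confidences[i - 1]);
                return rawScores[i - 1] + t * (rawScores[i] - rawScores[i - 1]);
            }
        }
        // Level is above the highest measured confidence; clamp to the end
        return rawScores[rawScores.length - 1];
    }
}
```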
Finally, the ModelBasedEngine.verifyText method returns a score. This score is always a confidence level. No raw score is available for text verification.
In some applications it can be useful to combine the functionality of two engines. For example, an identification engine could be used to determine a person's identity. A verification engine could then be used to authenticate that person. In general, each engine will have its own implementation of each of the SVAPI interfaces. This means that an utterance or model created by one engine cannot necessarily be used in another.
The following example is fairly lengthy, but shows how a conference call could be broken up according to speaker by a classification engine, and the classified audio then identified by an identification engine. This example not only shows how to use multiple engines together, but also shows how to do open-set and closed-set identification.
// This is a helper class that holds the audio spoken by a given speaker
class Participant {
    Participant(Model Model, int Identifier) {
        this.Model = Model;
        this.Identifier = Identifier;
    }

    Model Model;
    int Identifier;
    Vector Utterances = new Vector();
}

class ConferenceNotify implements ClassificationNotify {
    // This function starts the logging process.
    // In order for this to work, both engines have to be using the same audio format.
    // The classification engine must support live audio,
    // and should be connected to an audio source previous to this.
    // The identification engine must support batch audio.
    public static void logConference(ClassificationEngine ClassEngine,
            IdentificationEngine IdentEngine, String[] Participants)
            throws SVAPIException, RemoteException {
        // Check to see if the engine supports open-set identification
        boolean OpenSet = IdentEngine.getProperties().containsProperty(
            "engine.identification.set", "open");

        // If open-set is allowed, add an extra null element
        // at the end of the array of models.
        // The null model represents the unknown speaker in open-set identification.
        Model[] Models = new Model [OpenSet ? Participants.length + 1 : Participants.length];
        for (int i = 0; i < Participants.length; i++)
            Models[i] = IdentEngine.getModel(Participants[i]);

        // Create our notification object and start listening to the stream.
        ConferenceNotify Notify = new ConferenceNotify(IdentEngine, Models,
            ClassEngine.isEngineCompatible(IdentEngine));
        ClassEngine.enableAsynchronousClassification(Notify);
    }

    // Create a notification object
    ConferenceNotify(IdentificationEngine Engine, Model[] Models, boolean Compatible) {
        IdentEngine = Engine;
        this.Models = Models;
        this.CompatibleEngines = Compatible;
    }

    public synchronized void classified(ClassificationEngine Engine,
            ClassificationResults[] Results) {
        try {
            // Iterate through the results and save the utterances.
            // Each unique Identifier from a ClassificationResult
            // will have a Participant object.
            // The Participant object holds all the utterances
            // for the speaker that spoke them.
            for (int i = 0; i < Results.length; i++) {
                // Get the identifier and the utterance
                int Identifier = Results[i].getIdentifier();
                Utterance Utterance = Results[i].getUtterance();

                // See if we have already gotten an utterance from this person
                Participant Participant =
                    (Participant) Participants.get(new Integer(Identifier));
                if (Participant == null) {
                    // If not, check for an identifier of 0.
                    // If an engine classifies non-spoken audio,
                    // it generates it with an identifier of 0.
                    if (Identifier == 0)
                        Participant = new Participant(null, 0);
                    else {
                        // If the CompatibleEngines flag is set,
                        // utterances can be shared between the engines.
                        // If not, we need to pull the audio data out of one engine
                        // and create an utterance in the other.
                        Utterance IdentUtterance = Utterance;
                        if (!CompatibleEngines)
                            IdentUtterance = IdentEngine.createUtterance(Utterance.toByteArray());

                        // Perform identification; we are only interested in one result
                        IdentificationResults[] IdentResults =
                            IdentEngine.identify(Models, IdentUtterance, 1);

                        // Dispose the utterance if we created one
                        if (!CompatibleEngines)
                            IdentUtterance.dispose();

                        // Create a new participant
                        Participant = new Participant(IdentResults[0].getModel(), Identifier);
                    }

                    // Add the new participant to the hash table
                    Participants.put(new Integer(Identifier), Participant);
                }

                // Add the utterance to the participant's vector of utterances
                Participant.Utterances.addElement(Utterance);
            }
        } catch (Exception e) {
            try {
                Engine.disableAsynchronousClassification();
            } catch (Exception exc) {}
            reportException(e);
        }
    }

    Hashtable Participants = new Hashtable();  // Holds all known participants
    IdentificationEngine IdentEngine;          // Engine for identifying participants
    Model[] Models;                            // Possible participants
    boolean CompatibleEngines;                 // Can utterances be used directly?

    private void reportException(Exception e) {}
}
SVAPI was designed to facilitate client-server architectures using the Java RMI mechanism. In order to be remote-enabled, simply derive all classes which implement any of the SVAPI interfaces from java.rmi.server.UnicastRemoteObject (or any of its subclasses). This applies to applications and engines alike. An application can look at an engine's properties object to determine what host it is running on. If the engine is running on another host, there will be some audio restrictions.