Speech recognition produces text output for a given voice input; in short, it
is speech-to-text (STT) conversion. It is helpful for people with hearing,
speech, or other disabilities. This project aims to improve speech recognition
accuracy. We developed a speech recognition system with its own dictionary in
order to improve that accuracy. Errors usually vary not only in number but also
in their degree of impact on optimizing a set of acoustic models. It is
important to correct the errors in the results of speech recognition to
increase the performance of a speech recognition system. Errors are detected
and corrected according to a database learned from erroneous-correct utterance
pairs. When run, the speech recognition system displays the reference and
hypothesis values and the errors. By balancing the errors we can improve the
speech recognition accuracy. Removing the silence from the speech signal can
also improve the accuracy.
Speech recognition is the process of converting spoken words into text. It
analyzes an acoustic speech signal to identify the linguistic message. Speech
recognition systems compare the spoken words with the recognized text and
report the accuracy. These systems play a vital role in facilitating daily
activities. Speech recognition applications include voice dialing, call
routing, content-based spoken audio search, data entry, preparation of
structured documents, speech-to-text processing, and use in aircraft cockpits.
In addition, speech recognition systems can be used by people with
vision-related disabilities or impaired use of their hands. In underdeveloped
countries where the literacy rate is low, this can provide a mechanism of
information access for people who are unable to read and write, as well as
people who may be literate but lack computing skills.
Speech Recognition (SR), the ability of a computer to understand spoken
commands or responses, is an important factor in human-computer interaction.
SR has been available for many years, but it has not been practical due to the
high cost of applications and computing resources. SR has seen significant
growth in telephony and voice-to-text applications. Its advantages include
increasing the efficiency of workers who perform extensive typing, assisting
people with disabilities, and reducing staffing costs in call centres. Speech
recognition is the process by which a computer identifies spoken words.
Basically, it means talking to your computer and having it correctly recognize
what you are saying. Put simply, it is a signal-to-symbol transformation: it
takes speech as input and gives text as output.
2. Recognition Models:
Speaker dependent: A speech recognition system that can only recognize the
speech of the users it was trained on is called a speaker-dependent speech
recognizer. It is limited to understanding selected speakers.
Speaker independent: Speech recognition software that recognizes a variety of
speakers without any training is called a speaker-independent speech
recognizer.
3. Hidden Markov Model:
The speech recognition systems considered here are built on the Hidden Markov
Model (HMM):
A Hidden Markov Model is a
probabilistic state machine that can be used to model and recognize speech.
Consider the speech signal as a sequence of observable events generated by the
mechanical speech production system which transitions from one state to another
when producing speech. The term hidden refers to the fact that the state of the
system (i.e. the configuration of the speech articulators) is not known to the
observer of the speech signal. Speech recognition systems use HMMs to model each
sound unit in the language. In an HMM, each state is associated with a
probability distribution that measures the likelihood of events generated by
the state. These distributions are known as output or observation probability
distributions. Each state is also associated with a set of transition
probabilities. Given the current state, transition probabilities model the
likelihood that the system will be in a certain state when the next observation
is produced. Typically, Gaussian distributions are used to model the output
distribution of each HMM state. The transition probabilities determine the rate
at which the model transitions from one state to the next, giving the model
some flexibility with respect to sound units which may vary in duration.
$\mathrm{HMM} = (\pi, A, B)$
$\pi$ = the vector of initial state probabilities
$A$ = the state transition matrix
$B$ = the confusion matrix (observation probabilities)
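The three components of the model (initial state vector, state transition
matrix, and confusion matrix) can be written down directly as arrays. The
following is a minimal sketch, assuming a discrete-output HMM with two states
and three observation symbols; the numbers and the helper names
`is_stochastic` and `is_valid_hmm` are hypothetical and chosen only for
illustration. A real recognizer would typically use Gaussian output
distributions instead of a discrete confusion matrix.

```python
# Hypothetical two-state, three-symbol HMM; all numbers are illustrative only.
pi = [0.6, 0.4]            # initial state probabilities
A = [[0.7, 0.3],           # A[i][j] = P(next state j | current state i)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],      # B[i][k] = P(observing symbol k | state i)
     [0.1, 0.3, 0.6]]


def is_stochastic(rows, tol=1e-9):
    """Return True if every row is a valid probability distribution."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in rows
    )


def is_valid_hmm(pi, A, B):
    """Check the shapes and the row sums of the three HMM components."""
    n = len(pi)
    return (
        is_stochastic([pi])
        and len(A) == n
        and all(len(row) == n for row in A)
        and len(B) == n
        and is_stochastic(A)
        and is_stochastic(B)
    )


print(is_valid_hmm(pi, A, B))
```

Each row of the transition and confusion matrices must sum to one, because a
row lists the probabilities of all possible outcomes from one state.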
Given the definition of HMMs, there are three problems of interest:
Evaluation Problem: Given a model and a sequence of observations, the forward
algorithm (the forward half of the forward-backward procedure) is used to find
the probability that the model generated the observations.
Decoding Problem: Given a model and a sequence of observations, the Viterbi
algorithm finds the most likely state sequence in the model that produced the
observations.
Learning Problem: Given a model and a sequence of observations, the Baum-Welch
algorithm adjusts the model's parameters to maximize the probability of
generating the observations.
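Conceptually, the probability of an observation sequence is obtained by
enumerating all possible state sequences of length T (the number of
observations) that generate the observation sequence and summing their
probabilities; the forward algorithm computes this sum efficiently. The
probability of each path is the product of the state sequence probability and
the joint output probability along the path.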
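The summation over paths described above can be sketched in a few lines. This
is an illustrative sketch, assuming a discrete-output HMM with hypothetical
parameters `pi`, `A`, and `B` (not taken from the paper) and a non-empty
observation sequence given as symbol indices. The function names are also
hypothetical. `brute_force_probability` enumerates every path explicitly, and
`forward_probability` computes the same value with the forward recursion.

```python
from itertools import product

# Hypothetical two-state, three-symbol HMM; all numbers are illustrative only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]


def forward_probability(pi, A, B, obs):
    """P(obs | model) via the forward recursion; obs must be non-empty."""
    n = len(pi)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def brute_force_probability(pi, A, B, obs):
    """Same quantity by enumerating every state sequence of length len(obs)."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


obs = [0, 1, 2]
print(forward_probability(pi, A, B, obs))
print(brute_force_probability(pi, A, B, obs))
```

The enumeration grows exponentially with the sequence length, while the
recursion needs only one pass over the observations, which is why the forward
recursion is the form used in practice.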
The forward algorithm computes the probability that an HMM generates an
observation sequence by summing up the probabilities of all possible paths, so
it does not provide the best path or state sequence. In many applications, it
is desirable to find such a path. Finding the best path is the cornerstone of
searching in continuous speech recognition. Since the state sequence is hidden
in the HMM framework, the most widely used criterion is to find the state
sequence that has the highest probability of being taken while generating the
observation sequence. The Viterbi algorithm can be regarded as dynamic
programming applied to the HMM, or as a modified forward algorithm. Instead of
summing up probabilities from different paths coming to the same destination
state, the Viterbi algorithm picks and remembers the best path.
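The "pick and remember the best path" step can be sketched as follows. This is
a minimal illustration, assuming a discrete-output HMM with hypothetical
parameters `pi`, `A`, and `B` and a non-empty observation sequence of symbol
indices; it is not the decoder used by any particular recognizer. A production
decoder would normally work with log probabilities to avoid underflow on long
utterances.

```python
# Hypothetical two-state, three-symbol HMM; all numbers are illustrative only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]


def viterbi(pi, A, B, obs):
    """Return (best state path, its probability) for a non-empty obs."""
    n = len(pi)
    # delta[i] = probability of the best path so far that ends in state i
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        best_prev = []
        new_delta = []
        for j in range(n):
            # Keep only the best predecessor instead of summing over all.
            i_best = max(range(n), key=lambda i: delta[i] * A[i][j])
            best_prev.append(i_best)
            new_delta.append(delta[i_best] * A[i_best][j] * B[j][o])
        backpointers.append(best_prev)
        delta = new_delta
    # Trace the remembered predecessors back from the best final state.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    path.reverse()
    return path, delta[last]


path, probability = viterbi(pi, A, B, [0, 1, 2])
print(path, probability)
```

The only difference from the forward recursion is that the sum over
predecessor states is replaced by a maximum, plus the stored back-pointers
that allow the best path to be recovered at the end.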
The Baum-Welch algorithm, also known as the forward-backward algorithm, is used
to fit the HMM parameters to the observations in the training data. This
algorithm is a kind of EM (expectation-maximization) algorithm that iterates
through the data first in a forward pass and then in a backward pass. During
each pass, we adjust a set of probabilities to maximize the probability of a
given observation in the training data corresponding to a given HMM state.
Because this estimation problem has no analytical solution, incremental
iterations are necessary until convergence is achieved. In each iteration the
algorithm tries to find better probabilities that maximize the likelihood of
the observations in the training data. During this phase, we re-estimate the
mixing weights, transition probabilities, and mean and variance parameters.
After each Baum-Welch re-estimation iteration, we insert a normalization step:
we compute the re-estimated model parameters from the re-estimation counts
obtained through Baum-Welch. The combined Baum-Welch and normalization
iteration repeats until we achieve an acceptable parameter convergence.
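The normalization step can be illustrated for the simplest case, in which the
re-estimation counts are expected transition and observation counts of a
discrete-output HMM. This is only a sketch under that assumption: the function
names are hypothetical, the uniform fallback for states with no counts is a
choice made for this example, and the updates for Gaussian means, variances,
and mixing weights are not shown.

```python
def normalize_counts(counts, floor=1e-12):
    """Turn one row of accumulated expected counts into probabilities."""
    total = sum(counts)
    if total <= floor:
        # Assumption for this sketch: with no evidence, fall back to uniform.
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]


def normalize_model(transition_counts, observation_counts):
    """Compute re-estimated A and B from the re-estimation counts."""
    A = [normalize_counts(row) for row in transition_counts]
    B = [normalize_counts(row) for row in observation_counts]
    return A, B


# Hypothetical counts, as they might be accumulated over the training data.
A_new, B_new = normalize_model(
    [[3.0, 1.0], [2.0, 2.0]],
    [[4.0, 0.0, 0.0], [1.0, 1.0, 2.0]],
)
print(A_new)
print(B_new)
```

Dividing each row by its total makes the re-estimated parameters valid
probability distributions again before the next iteration starts.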
We have to write the batch-mode file. Each entry is written as the path name of
a raw audio file together with its text transcription; the path name is the
location where the raw file was saved. After setting up the configuration file,
we build the XML file, reference all the files stored within the sphinx4
folder, and run the XML file. When run, Sphinx-4 displays the reference and
hypothesis values with the accuracy and error rate, and it reports the
insertion, substitution, and deletion errors. The aim is to improve speech
recognition accuracy with the Sphinx-4 speech recognition system. The system is
developed with its own dictionary for that purpose. Recognition errors vary not
only in number but also in their degree of impact on optimizing a set of
acoustic models. It is important to correct the errors in the results of speech
recognition to increase the performance of the system.
Here we can get three types of errors:
Insertion error: an extra word was added in the recognized sentence.
Substitution error: an incorrect word was substituted for the correct word.
Deletion error: a correct word was omitted in the recognized sentence.
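The three error types can be counted by aligning the reference and the
hypothesis word by word. The sketch below uses a standard minimum-edit-distance
alignment; the function names are hypothetical, the reference is assumed to be
non-empty, and the alignment reported by a given recognizer may break ties
differently. When several minimal alignments exist, the split between error
types depends on the tie-breaking, but the total number of errors does not.

```python
def align_counts(reference, hypothesis):
    """Count substitutions, deletions and insertions in a minimal alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back through the table, preferring match/substitution on ties.
    i, j = len(ref), len(hyp)
    subs = dels = ins = 0
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1  # a reference word is missing from the hypothesis
            i -= 1
        else:
            ins += 1   # the hypothesis contains an extra word
            j -= 1
    return {"substitutions": subs, "deletions": dels, "insertions": ins}


def word_error_rate(reference, hypothesis):
    """(S + D + I) / number of reference words; reference must be non-empty."""
    counts = align_counts(reference, hypothesis)
    return sum(counts.values()) / len(reference.split())


print(align_counts("a b c", "a c"))
print(word_error_rate("a b c d", "a x c"))
```

Word accuracy is then one minus the word error rate, so reducing any of the
three counts raises the reported accuracy.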
By correcting the speech recognition errors we can improve the speech
recognition accuracy. Correction works on pairs of strings. The first string is
an erroneous string of the utterance predicted by the speech recognition
system. The second string is the corresponding section of the actual utterance.
Errors are detected and corrected according to the database. When examining
errors in speech recognition, we have to check the whole database in which the
errors are found. An error pattern is made up of two strings. One is the string
including errors, and the other is the corresponding correct string. These
parts are extracted from the speech recognition results and the corresponding
actual utterances. The correction is made by substituting the correct part for
an error part when the error part is detected in a recognition result. We
compare the reference and hypothesis values from the database, correct the
dictionary, reduce the insertion, substitution, and deletion errors, and
improve the speech recognition accuracy with the corrected string. Speech
recognizers usually produce three different types of errors: insertion,
substitution, and deletion. These errors vary not only in number but also in
their degree of impact on optimizing a set of acoustic models.
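The substitution of a correct part for a detected error part can be sketched
as a table lookup. This is an illustrative sketch only: the function names and
the example pair are hypothetical, the table is a plain dictionary rather than
the database used in the experiments, and matching is done on whole words with
the longest error parts tried first. Corrections are applied one after
another, so a correction that itself contains another error part would be
rewritten again; a fuller implementation would need to guard against that.

```python
import re


def learn_error_patterns(pairs):
    """Build an error-part -> correct-part table from (erroneous, correct) pairs."""
    patterns = {}
    for erroneous, correct in pairs:
        if erroneous != correct:
            patterns[erroneous] = correct
    return patterns


def correct_hypothesis(hypothesis, patterns):
    """Replace every known error part in the hypothesis by its correct part."""
    # Longest error parts first, so a long pattern is not pre-empted by a
    # shorter one that happens to be contained in it.
    for error_part in sorted(patterns, key=len, reverse=True):
        correct_part = patterns[error_part]
        regex = r"\b" + re.escape(error_part) + r"\b"
        hypothesis = re.sub(regex, lambda m, c=correct_part: c, hypothesis)
    return hypothesis


# Hypothetical training pairs: (recognizer output, actual utterance).
patterns = learn_error_patterns([
    ("wreck a nice beach", "recognize speech"),
    ("hello", "hello"),
])
print(correct_hypothesis("please wreck a nice beach now", patterns))
```

Pairs in which the recognizer output already equals the actual utterance carry
no correction and are therefore left out of the table.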
Fig. 1: Block diagram of Error-Pattern Correction.
Fig. 2: Sphinx-4 architecture.
To improve the accuracy, we applied Error-Pattern Correction.
For the first 250 words:
For the last 250 words:
By using error-pattern correction we can reduce the error rate and improve the
speech recognition accuracy. To do so, we have to correct the dictionary and
the batch file. Correcting all three error types improves the accuracy and
reduces the error rate. Insertion and substitution errors are corrected
together in one pass and deletion errors in a separate pass. The pronunciation
dictionary is one of the core components of a speech recognition system. The
performance of a speech recognition system depends mainly on the choice of
subunits and the accuracy of the speech. Accuracy may vary further when audio
fingerprint methods are applied to the speech recognition system, and
classification techniques may also improve it.