I.      Introduction

Recognizing characters and digits in natural images, such as photographs
captured at street level, is an important step in building modern maps. For
example, the address of a building can be detected automatically and accurately
from street view images of that building. With this information, more precise
maps can be built and navigation services can be improved. Although ordinary
character classification is widely regarded as a solved problem in computer
vision, recognizing digits or characters in natural scenes such as photographs
remains a harder problem. The difficulties arise from non-contrasting
backgrounds, low resolution, blurred images, font variation, lighting, and
similar factors.

Traditional approaches for classifying characters and digits from natural
images were separated into two stages: first segmenting the image to extract
isolated characters, and then performing recognition on the extracted images.
This can be done using multiple hand-crafted features and template matching [1].

The main purpose of this project is to recognize street view house numbers
using a deep convolutional neural network. For this work, I considered the
digit classification dataset of house numbers, which were extracted from
street-level images [5]. This dataset is similar in flavor to the MNIST dataset
but has more labeled data. It contains more than 600,000 digit images, collected
from Google Street View, which include color information and various natural
backgrounds [5]. To achieve the goal, I built an application that detects the
number directly from image pixels. A convolutional neural network with multiple
layers is trained on the dataset to detect house numbers with high accuracy. I
used a traditional convolutional architecture with different pooling methods
and multistage features, and obtained 91.9% accuracy.


Figure 1: 32×32 image samples from the SVHN dataset [5]

II.    Related Work

Street view number detection is a natural scene text recognition problem, which
is quite different from printed character or handwriting recognition. Research
in this field started in the 1990s, but it is still considered an unsolved
problem. As mentioned earlier, the difficulties arise from font variation,
scale, rotation, low light, and similar factors.

In earlier years, natural scene text identification was handled sequentially:
character classification by sliding window or connected components was mainly
used first [4]. Word prediction was then done by applying the character
classifier in a left-to-right manner. More recently, segmentation methods guided
by a supervised classifier have been used, where words are recognized through a
sequential beam search [4]. However, none of these approaches solves the street
view recognition problem.

In recent work, convolutional neural networks (CNNs) have proved capable of
solving object recognition tasks more accurately [4]. Some research has applied
CNNs to scene text recognition tasks [4]. These studies show that CNNs have a
large capacity to represent the many types of character variation found in
natural scenes and to cope with this high variability. Work on convolutional
neural networks started in the early 1980s, and they were successfully applied
to handwritten digit recognition in the 1990s [4]. With recent developments in
computing resources, training sets, advanced algorithms, and dropout training,
deep convolutional neural networks have become more effective at recognizing
natural scene digits and characters [3].

Previously, CNNs were used mainly to detect a single object in an input image.
It was quite difficult to isolate each character in a single image and identify
it. Goodfellow et al. solved this problem by using a large, deep CNN to model
the whole image directly, with a simple graphical model as the top inference
layer [4].

The rest of the paper is organized as follows: Section III describes the
convolutional neural network architecture, Section IV presents the experiment,
results, and discussion, and Section V covers future work and the conclusion.

III.   Convolutional Neural Network Architecture

Convolutional Neural Networks (CNNs) can handle complex, high-dimensional data
and are nearly identical to ordinary neural networks. They are made up of
neurons that carry weights and biases. Because the inputs are images, the
forward function can be implemented in a way that greatly reduces the number of
parameters in the network [7]. Generally, a CNN consists of several layers [8].
In the first layer, the input is convolved with a set of filters to obtain the
feature maps. To reduce the spatial resolution of the feature maps, each
convolution layer is followed by a sub-sampling or pooling layer. The
alternating convolutional and sub-sampling layers form the feature extractor,
which retrieves selective features from the raw images. These layers are
followed by fully connected layers (FCL) and the output layer. The output of
each layer is the input to the following layer. CNN architectures may differ
for different problems.
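As an illustration of the convolution, non-linearity, and pooling sequence
described in this section, the following is a minimal sketch in plain Python.
It is not the TensorFlow code used in the experiment. The single-channel 6×6
input, the 3×3 filter values, and the function names are all hypothetical, and
the "convolution" is the unflipped sliding-window product that deep learning
libraries typically compute.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]), 2)]
        for i in range(0, len(fmap), 2)
    ]


# Hypothetical toy input: a 6x6 single-channel image and a 3x3 edge-like filter.
image = [[float((r * 6 + c) % 5) for c in range(6)] for r in range(6)]
kernel = [[1.0, 0.0, -1.0] for _ in range(3)]

# One feature-extractor stage: convolution -> ReLU -> pooling.
features = max_pool_2x2(relu(conv2d_valid(image, kernel)))
print(len(features), len(features[0]))  # prints "2 2": 6x6 -> 4x4 -> 2x2
```

A real network applies many such filters per layer and stacks several stages
before the fully connected layers.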


Figure 2: Convolutional Neural Network Architecture [6]

IV.   Experiment, Result and Discussion

A.    Data

The main objective of this project is to detect and identify house-number signs
in street view images. The dataset I use is the Street View House Numbers (SVHN)
dataset taken from [5], which has similarities with the MNIST dataset. The SVHN
dataset has more than 600,000 labeled characters, and the images are in .png
format. After extracting the dataset, I resize all images to 32×32 pixels with
three color channels. There are 10 classes, one for each digit. Digit ‘1’ is
labeled 1, ‘9’ is labeled 9, and ‘0’ is labeled 10 [5]. The dataset is divided
into three subsets: a train set, a test set, and an extra set. The extra set is
the largest subset and contains 531,131 images. The train set has 73,257 images
and the test set has 26,032 images.
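Because the digit ‘0’ is stored with label 10, a common preprocessing step is
to remap it to class index 0 before building one-hot targets for a 10-way
classifier. The sketch below shows that convention. It is an assumption for
illustration, not a step the text above confirms, and the function names are
hypothetical.

```python
NUM_CLASSES = 10


def remap_label(svhn_label):
    """Map SVHN label 10 (digit '0') to class index 0; labels 1-9 are unchanged."""
    return 0 if svhn_label == 10 else svhn_label


def one_hot(class_index, num_classes=NUM_CLASSES):
    """Return a list of zeros with a single 1.0 at `class_index`."""
    vec = [0.0] * num_classes
    vec[class_index] = 1.0
    return vec


raw_labels = [1, 9, 10]                      # labels as stored in the dataset
class_indices = [remap_label(y) for y in raw_labels]
targets = [one_hot(c) for c in class_indices]
print(class_indices)                         # [1, 9, 0]
```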


Figure 3: Example of the original, variable-resolution, colored images with character level bounding boxes [5]


The digits in the images are labeled with bounding boxes, and the bounding box
information is stored in digitStruct.mat instead of being drawn directly on the
images in the dataset. The digitStruct.mat file contains a struct called
digitStruct with the same length as the number of original images. Each element
in digitStruct has the following fields: “name”, a string containing the
filename of the corresponding image, and “bbox”, a struct array that contains
the position, size and label of each digit bounding box in the image. For
example, digitStruct(300).bbox(2).height gives the height of the 2nd digit
bounding box in the 300th image [5].
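To make the layout concrete, the structure can be mirrored with Python
dataclasses. This is only a sketch of the data shape, not a reader for
digitStruct.mat. The class and function names are hypothetical, the field names
other than name, bbox, and height are assumptions, and the sample values are
made up. Note that MATLAB indices start at 1 while Python indices start at 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BBox:
    left: float     # assumed field: x position of the box
    top: float      # assumed field: y position of the box
    width: float    # assumed field
    height: float
    label: int      # 1-9 for digits '1'-'9', 10 for digit '0'


@dataclass
class DigitStructEntry:
    name: str           # filename of the image
    bbox: List[BBox]    # one bounding box per digit


def digit_height(entries, image_number, digit_number):
    """Look up a height using 1-based indices, as in the MATLAB example."""
    return entries[image_number - 1].bbox[digit_number - 1].height


# Hypothetical single-image example with two digits.
entries = [
    DigitStructEntry(
        name="1.png",
        bbox=[
            BBox(left=10, top=5, width=12, height=30, label=1),
            BBox(left=24, top=6, width=12, height=28, label=9),
        ],
    )
]
print(digit_height(entries, 1, 2))  # 28, the height of the 2nd digit in the 1st image
```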

It is clear from Figure 3 that most house number signs in the SVHN dataset are
printed signs and are easy to read [2]. However, the large variation in font,
size, and color makes detection very difficult. The variation in resolution is
also large (median: 28 pixels, max: 403 pixels, min: 9 pixels) [2]. Figure 4
shows the large variation in character heights, measured by the height of the
bounding box in the original street view dataset. The size, placement, and
resolution of the characters are therefore not evenly distributed across the
dataset, and this non-uniformity makes correct house number detection difficult.



Figure 4: Histogram of SVHN character heights in the image [2]


B.    Experiment

In my experiment, I train a multilayer CNN for street view house number
recognition and check its accuracy on the test data. The code is written in
Python using TensorFlow, a powerful library for implementing and training deep
neural networks. The central unit of data in TensorFlow is the tensor. A tensor
consists of a set of primitive values shaped into an array of any number of
dimensions. A tensor’s rank is its number of dimensions [9]. Along with
TensorFlow, I used other libraries such as NumPy, Matplotlib, and SciPy.
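The notion of rank can be illustrated without any library by treating nested
Python lists as stand-ins for tensors. This is a sketch only: the helper names
are hypothetical, and it assumes regularly nested, non-empty lists. NumPy and
TensorFlow expose shape and rank directly.

```python
def shape(tensor):
    """Shape of a regularly nested, non-empty list; a bare number has shape ()."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


def rank(tensor):
    """Rank is the number of dimensions, i.e. the length of the shape."""
    return len(shape(tensor))


scalar = 3.0                                                  # rank 0
vector = [1.0, 2.0, 3.0]                                      # rank 1
image = [[[0.0] * 3 for _ in range(32)] for _ in range(32)]   # 32x32 RGB, rank 3
batch = [image, image]                                        # two images, rank 4

for t in (scalar, vector, image, batch):
    print(rank(t), shape(t))
```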


I perform my analysis using only the train and test sets because of limited
technical resources, and omit the extra set, which is almost 2.7 GB. To
simplify the analysis, I delete all data points that have more than 5 digits.
Preprocessing the original SVHN dataset produces a pickle file, which is used
in my experiment. For the implementation, I randomly shuffle the validation
dataset, then use the pickle file to train a 7-layer convolutional neural
network. Finally, I use the test data to check the accuracy of the trained
model in detecting numbers from street house number images.
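The filtering, shuffling, and pickling steps might look roughly like the sketch
below. The sample records, the dictionary keys, the function name, and the seed
are hypothetical. The real preprocessing also handles image pixels, which are
omitted here.

```python
import pickle
import random

MAX_DIGITS = 5


def preprocess(samples, seed=0):
    """Drop house numbers longer than MAX_DIGITS, then shuffle reproducibly."""
    kept = [s for s in samples if len(s["digits"]) <= MAX_DIGITS]
    random.Random(seed).shuffle(kept)
    return kept


# Hypothetical records: one entry per image, with its digit labels.
samples = [
    {"name": "1.png", "digits": [1, 9]},
    {"name": "2.png", "digits": [2, 3, 10, 4, 5, 6]},  # six digits: dropped
    {"name": "3.png", "digits": [7]},
]

# Serialize in memory; pickle.dump(obj, file) would write to disk instead.
blob = pickle.dumps(preprocess(samples))
restored = pickle.loads(blob)
print(len(restored))  # 2
```

Fixing the seed keeps the shuffle reproducible between runs, which makes
training results easier to compare.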

At the beginning of the network, the first convolution layer has 16 feature
maps with 5×5 filters and produces a 28x28x16 output. A ReLU layer is added
after each convolution layer to add more non-linearity to the decision-making
process. After the first sub-sampling, the output size decreases to 14x14x16.
The second convolution has 512 feature maps with 5×5 filters and produces a
10x10x32 output. Applying sub-sampling a second time gives an output size of
5x5x32. Finally, the third convolution has 2048 feature maps with the same
filter size. The stride is 1 in my experiment, along with zero padding. During
my experiment, I use the dropout technique to reduce overfitting. Finally, a
softmax regression layer produces the final output.
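The spatial sizes quoted above can be checked with the standard output-size
formula, out = (in + 2·padding − kernel) / stride + 1. The sketch below assumes
unpadded convolutions and 2×2 pooling with stride 2, which is what the reported
sizes (32 to 28 to 14 to 10 to 5) are consistent with. These settings are
inferred from the numbers, not stated by the experiment, and only height and
width are tracked, not the number of feature maps.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output height/width of a pooling layer along one spatial dimension."""
    return (size - window) // stride + 1


size = 32                      # input images are 32x32
trace = [size]
for layer in ("conv", "pool", "conv", "pool", "conv"):
    size = conv_out(size, kernel=5) if layer == "conv" else pool_out(size)
    trace.append(size)

print(trace)  # [32, 28, 14, 10, 5, 1]
```

Under these assumptions the third 5×5 convolution reduces the 5×5 input to a
single spatial position, so its output can feed the fully connected layers
directly.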

Weights are initialized randomly using Xavier initialization, which keeps the
weights in a suitable range by automatically scaling the initialization based
on the number of input and output neurons. After the model is built, I train
the network and log the loss, accuracy, and validation accuracy every 500
steps. The Adagrad optimizer is used to minimize the loss. Once a suitable
accuracy level is reached, I stop training and save the learned model
parameters in a checkpoint file. When detection is needed, the program loads
the checkpoint file without training the model again.
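For reference, the two ideas named above can be sketched in plain Python. The
initializer shown is the uniform variant of Xavier (Glorot) initialization, and
the optimizer step is the basic Adagrad rule. The layer sizes, learning rates,
and function names are hypothetical, and this is not the TensorFlow code used
in the experiment.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit),
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def adagrad_step(weights, grads, accum, lr=0.05, eps=1e-8):
    """One Adagrad update on flat lists (modified in place).
    `accum` keeps the running sum of squared gradients per weight."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        weights[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return weights, accum


rng = random.Random(0)
W = xavier_uniform(fan_in=400, fan_out=64, rng=rng)   # hypothetical layer size
limit = math.sqrt(6.0 / (400 + 64))

# Toy problem: minimise f(w) = w^2 starting from w = 1.0 (gradient is 2w).
w, acc = [1.0], [0.0]
for _ in range(100):
    w, acc = adagrad_step(w, [2.0 * w[0]], acc, lr=0.5)
print(w[0])  # close to 0 after 100 steps
```

Because Adagrad divides by the accumulated gradient magnitude, each weight gets
its own effective learning rate, and that rate shrinks as training proceeds.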

The model is trained for a few thousand steps, with the mini-batch loss,
accuracy, and validation accuracy checked every 500 steps, and the test set
accuracy is checked once training is complete. Initially, the model produced an
accuracy of 89% with just 3000 steps. This is a good starting point, and with
further training the accuracy would likely reach my benchmark of 90%. However,
I added some simple improvements to increase the accuracy within a small number
of learning steps. First, I added a dropout layer after the third convolution
layer, just before the fully connected layer. This makes the network more
robust and helps prevent overfitting. Second, I introduced exponential decay of
the learning rate instead of keeping it constant. This lets the network take
bigger steps at first so that it learns quickly, and then smaller steps over
time as it moves closer to the minimum. With these changes, the model reaches
an accuracy of 91.9% on the test set. Since the training and test sets are
large, further improvement is likely if the model is trained for longer.
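Both improvements can be sketched in a few lines of plain Python. The decay
schedule follows the usual form lr = lr0 · rate^(step / decay_steps), and the
dropout shown is the common "inverted" variant that rescales the kept units.
The hyperparameter values and function names are hypothetical, since the
values used in the experiment are not stated.

```python
import random


def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    """Learning rate after `step` steps: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)


def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    divide the survivors by keep_prob so the expected value is unchanged."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Hypothetical schedule: start at 0.05 and multiply by 0.95 every 1000 steps.
for step in (0, 1000, 2000, 3000):
    print(step, exponential_decay(0.05, step, decay_steps=1000, decay_rate=0.95))

rng = random.Random(0)
dropped = dropout([1.0] * 10, keep_prob=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 or 2.0
```

Dropout is applied only during training. At test time all units are kept,
which corresponds to keep_prob = 1.0 in this sketch.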

C.   Result and Discussion

Finally, my model reached an accuracy of 91.9% on the test set, which I
consider satisfactory. As mentioned in the earlier section, the model
parameters saved in a checkpoint file can be restored later to continue
training or to detect numbers in new images. Using dropout helps make the model
robust, so that it performs well on most data. The model was tested over a wide
range of inputs from the test dataset and generalizes well.


Figure 5: Some labeled output from the model


It can be seen from Figure 5 that the model predicts the number correctly for
most of the images. However, it still gives incorrect output when the images
are blurry or contain other noise. Because of time limitations, I trained the
model for a relatively short duration, and I believe the accuracy could be
increased with longer training. Also, with better hardware and a GPU, the model
could be trained and run faster.

V.    Future Work and Conclusion

In my experiment, I proposed a multi-layer deep convolutional neural network
for street view house number recognition. I ran the experiment on the train and
test subsets of the SVHN dataset, which contains more than 600,000 images in
total, and achieved almost 92% accuracy. The accuracy shows that the model
produces correct output for most images. However, detection fails if the image
is blurry or noisy. One interesting aspect of the project was finding out how
well optimization tricks such as dropout and exponential learning rate decay
perform on real data. One difficult aspect was choosing an appropriate
architecture for the problem. Since an architecture can be implemented in many
ways, it is very difficult to understand why a given architecture works best
for a specific type of data. The model implemented here is relatively simple
but does the job well and is quite robust. However, a lot of work is still
required to make the model perform as well as or better than a human operator.
As future work, I will extend my experiment using different techniques and
algorithms, and try to find out which one gives better accuracy at minimum cost
and with lower loss.