In this post, I want to talk a little about testing different fully connected neural networks to see how each performs on this dataset:
https://www.kaggle.com/c/digitrecognizer
This is one of the most famous classical datasets in the computer vision field. We have images like this:
and we want to be able to identify each image with as few errors as possible.
we have 42,000 labeled images for training
we have 28,000 unlabeled images for Kaggle evaluation.
In this post I'll use Keras to build a few different deep neural network models and evaluate them. Each model differs in its number of layers or nodes per layer, to get a sense of what increases a neural network's accuracy on this kind of problem, and whether more layers, more units, and larger models mean better performance.
I'll assume you are familiar with Python and the basics of neural networks.
As for Keras basics, if you are not familiar with it, I recommend googling things along the way, as it's not very complicated.
The link to the notebook is here:
https://github.com/blabadi/ml_mnist_numbers/blob/blogbranchpart1/Digit%20Recognizer.ipynb
1 Imports and libs
Regular libraries to build the Keras model, simple layers for a fully connected NN, and some utils to show images.
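The exact import cell isn't shown here, but a typical set of imports for this kind of model looks something like this (a sketch; the notebook may import from `keras` directly rather than `tensorflow.keras`):

```python
import numpy as np                # array handling for the image data
import matplotlib.pyplot as plt  # utils to show images

# Keras building blocks for a fully connected network
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
```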
2 Define some parameters
we have 10 digits to classify an image as (0-9)
our images are of size 28x28 pixels
the directory is where the trained models are saved
the batch size tells the model to train on 32 images in each iteration.
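As a sketch, the parameter cell might look like this (the variable names and the epoch count are my assumptions; the class count, image size, and batch size come from the post):

```python
NUM_CLASSES = 10                 # digits 0-9
IMG_ROWS, IMG_COLS = 28, 28      # images are 28x28 pixels
INPUT_DIM = IMG_ROWS * IMG_COLS  # 784 inputs once an image is flattened
MODELS_DIR = "./models"          # directory to save trained models in
BATCH_SIZE = 32                  # images per training iteration
EPOCHS = 10                      # passes over the whole training set
```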
A training iteration has two steps, a forward and a backward pass over an image batch:
1 forward: starting from the current weights (initially random), the model predicts labels for the batch of images
2 backward: we then calculate the error, and the optimization algorithm updates and adjusts the parameters to minimize that error (using partial derivatives), with a learning rate parameter that decides how big each change is.
The epoch count is how many times to iterate over the whole training set, keeping the parameters learned in previous passes, so we can get the error as low as possible with the current hyperparameter setup.
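The forward/backward loop above can be sketched in plain NumPy for a single linear node (an illustration only, not the notebook's code; Keras handles all of this internally):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)          # 1. start with random weights
x = np.array([0.5, -1.0, 2.0])  # one "image" with 3 features
target = 1.0
lr = 0.1                        # learning rate: how big each update is

for step in range(20):
    pred = w @ x       # forward pass: predict with the current weights
    error = pred - target
    grad = error * x   # partial derivatives of the squared error wrt w
    w -= lr * grad     # backward pass: adjust the weights

print(round(float(w @ x), 3))  # the prediction converges toward the target
```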
3 Loading the dataset
we load our labeled data (X) and split it into two sets:
1 training: used by the model's optimizer
2 test (validation): used by us to check the accuracy of the model on data it hasn't seen before, but which we have labels for, so we can compare against a ground truth.
my split was 32,000 train, 10,000 validation (almost 76% train to 24% validation)
we also split out the labels (Y) and converted them to one-hot encoded vectors of 10 classes
so if you look at the last cell's output, you'll find that the number 3 is represented by a vector with a 1 in position 3; this is so the model can output a probability for each class
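A sketch of the split and the one-hot encoding, written in plain NumPy so the encoding step is explicit (the notebook presumably uses Keras' `to_categorical`; the data here is a random placeholder):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels into one-hot vectors, e.g. 3 -> [0,0,0,1,0,0,0,0,0,0]."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

# Placeholder data standing in for the 42,000 labeled MNIST images.
X = np.random.rand(42000, 784)
y = np.random.randint(0, 10, 42000)

# 32,000 images for training, 10,000 for validation, as in the post.
X_train, X_val = X[:32000], X[32000:]
Y_train, Y_val = one_hot(y[:32000]), one_hot(y[32000:])
```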
4 Define the models
I created several different models, but here is the biggest one:
it has an input layer of size 28 x 28 = 784; each input represents a single pixel in the image, and since it's a grayscale image (a pixel can have a value from 0-255) there is only one channel, no RGB
it has the following layers:

Dropout is a regularization layer used to avoid overfitting (memorizing the dataset instead of learning the pattern); it randomly drops some nodes during training to prevent the neural network from relying on them.

Dense is a layer of nodes fully connected to the previous and next layers. Each node has an activation function (a nonlinear function applied to the previous layer's output), and each node learns something about the data (a feature, for example the curves in a digit, straight lines, etc.)
The deeper the node, the more complex the features it learns. That's why neural networks work: they build connections that, through composition, can learn and map complex functions while still being able to generalize to new data.
example:
image taken from this paper:
http://www.cs.cmu.edu/~aharley/vis/harley_vis_isvc15.pdf
the last layer has 10 nodes (each node outputs the probability of one digit)
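A sketch of how such a model could be defined in Keras (the layer widths and dropout rate here are illustrative assumptions, not the notebook's exact values; only the 784 inputs and the 10-node softmax output come from the post):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation="relu", input_shape=(784,)),  # 28x28 gray pixels, flattened
    Dropout(0.2),                                       # regularization against overfitting
    Dense(256, activation="relu"),
    Dropout(0.2),
    Dense(10, activation="softmax"),  # one probability per digit 0-9
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```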
5 Train the model
the commented code above will train the model to fit the training data; this is the most time-consuming step
each epoch, the optimizer iterates over all the images and prints the accuracy it achieved
with this model we achieved fairly quick results, but as you can see the gains slow down very quickly
in my case it took a few minutes to finish these epochs, since I'm using a GPU.
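The training call itself is a single `fit()`. Below is a runnable sketch with a small stand-in model and random data; the notebook trains the real models on the 32,000 MNIST training images instead:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Small stand-in model, just to show the fit() call.
model = Sequential([Dense(32, activation="relu", input_shape=(784,)),
                    Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

X_train = np.random.rand(256, 784)
Y_train = np.eye(10)[np.random.randint(0, 10, 256)]

# batch_size=32 matches the post; with verbose=1 Keras prints the
# loss and accuracy after every epoch, as described above.
history = model.fit(X_train, Y_train, batch_size=32, epochs=2, verbose=1)
```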
6 Loading and evaluating the models
here you can see all the models I tried and trained. they are all saved to files so they can be reused later if needed.
here is the evaluation result on the validation set:
we let each model predict the digits and compare the predictions to the ground truth; Keras does that for us.
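The save/load/evaluate cycle can be sketched like this (the filename is hypothetical, and a tiny stand-in model is built and saved here so that `load_model()` has a file to read; in the notebook each previously trained model is loaded from disk instead):

```python
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

# Stand-in model, saved so it can be restored below.
model = Sequential([Dense(16, activation="relu", input_shape=(784,)),
                    Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.save("model_fc.keras")  # hypothetical filename

# Random placeholder data standing in for the 10,000 validation images.
X_val = np.random.rand(100, 784)
Y_val = np.eye(10)[np.random.randint(0, 10, 100)]

restored = load_model("model_fc.keras")
loss, acc = restored.evaluate(X_val, Y_val, verbose=0)
print(f"validation accuracy: {acc:.4f}")
```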
our model achieved 99.15% training accuracy, while here it got 97.74% validation accuracy. It's expected to score lower here, since these images are completely new to it. What we want to be careful about is whether our model is failing to generalize: if your model gets 99% in training but, say, 80% in test accuracy, that gap is big, and it can be an indication that your model is overfitting the training set and not generalizing well to new data (that's why I added regularization to my biggest model: the more parameters it has, the more easily it can overfit)
notes based on the results:
7 Sample predictions from the unlabeled set
each image has its index:prediction over it; you can see some not-so-clear images, like indexes 399, 366, and 445, but our model got them right.
8 Testing with my own image
I created a digit image myself to see how the models would handle it.
only 3 models were able to predict the correct digit.
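A sketch of how a hand-made image could be preprocessed and fed to a model (an illustration under assumptions: the model here is untrained and the image is a random array, so the prediction itself is meaningless; in the notebook the trained models and a real image are used):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Untrained stand-in model with the right input/output shapes.
model = Sequential([Dense(10, activation="softmax", input_shape=(784,))])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# A hand-drawn image would first be loaded and resized to 28x28 grayscale;
# a random array stands in for it here. MNIST digits are white on black,
# so a scanned black-on-white image usually needs inverting as well.
img = np.random.randint(0, 256, (28, 28)).astype("float32")
img = (255.0 - img) / 255.0                  # invert and scale to 0-1
probs = model.predict(img.reshape(1, 784), verbose=0)
digit = int(np.argmax(probs))                # most probable class
print("predicted digit:", digit)
```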
the model achieved 97.7% on the Kaggle test set after submission. For a simple fully connected neural network, that seems good to me!
In the next parts I'll try a more complex network using convolutions and a residual network architecture to see how much further we can minimize the error, knowing that people have already achieved 99.7% on this, if not higher.