Udacity Capstone Project — Dog Breed Detector

Mar 15, 2021

If you are a dog lover like me, you have probably asked yourself, "What a nice dog! I wonder what breed that is." Well, today might be your lucky day. In this project, we train a machine learning model that outputs a dog breed given an image. All the code for this project can be found in my GitHub repository.

Project Objective

This project was the capstone project of Udacity’s Data Science Nanodegree. The goal is to create a pipeline that takes an image and detects whether a human or a dog is present. If it is a dog, the pipeline predicts its breed; if it is a human, it returns the dog breed that the person most resembles. The models were implemented in Python with the help of the Keras library.

Over the last decade, it has become much easier to classify images with deep learning using just a few lines of Python code. In this project, I will walk you through the steps to create a Convolutional Neural Network (CNN) from scratch and to leverage state-of-the-art image classification networks through transfer learning.

We are going to evaluate our model on both accuracy and F1-score on the test set. We train it on the training set and use the validation set to select the best set of weights. In the end, we will test the model on some dog and human pictures.

Approach Overview

We can divide this project into the following steps:

Step 0: Import Datasets
Step 1: Detect Humans
Step 2: Detect Dogs
Step 3: Create a CNN to Classify Dog Breeds (from Scratch)
Step 4: Use a CNN to Classify Dog Breeds (using Transfer Learning)
Step 5: Create a CNN to Classify Dog Breeds (using Transfer Learning)
Step 6: Create the final model
Step 7: Test this model

But before using classifiers to detect images, we need the data!

1 — Importing Data

Udacity provided most of the data used, most importantly:

8351 dog pictures, divided into 133 breeds
13233 human pictures

from sklearn.datasets import load_files       
from keras.utils import np_utils
import numpy as np
from glob import glob

# define function to load train, test, and validation datasets
def load_dataset(path):
    data = load_files(path)
    dog_files = np.array(data['filenames'])
    dog_targets = np_utils.to_categorical(np.array(data['target']), 133)
    return dog_files, dog_targets

# load train, test, and validation datasets
train_files, train_targets = load_dataset('../../../data/dog_images/train')
valid_files, valid_targets = load_dataset('../../../data/dog_images/valid')
test_files, test_targets = load_dataset('../../../data/dog_images/test')

# load list of dog names
dog_names = [item[20:-1] for item in sorted(glob("../../../data/dog_images/train/*/"))]

The labels are one-hot encoded. Since we have 133 different breeds, the target array has shape (nb_samples, 133), i.e. one row per image and one column per breed.
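As a quick illustration of that shape, the sketch below one-hot encodes a few made-up breed indices with plain NumPy. The helper name `one_hot` and the sample labels are hypothetical; the project itself relies on Keras' `to_categorical`.

```python
import numpy as np

NUM_CLASSES = 133  # number of dog breeds in the dataset

def one_hot(labels, num_classes=NUM_CLASSES):
    """Turn integer class indices into one-hot rows (hypothetical helper)."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

sample_labels = [0, 5, 132]          # made-up breed indices
targets = one_hot(sample_labels)
print(targets.shape)                 # (3, 133): nb_samples x number_classes
print(targets.argmax(axis=1))        # recovers the original indices
```

Taking the `argmax` of each row is also how a predicted probability vector is turned back into a breed index later on.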

Here are some samples of the dog pictures:

As you can see, there are dogs in different positions and angles. This should make it harder for the model to classify the images.

Let’s take a look at the dog breed distribution:

The top 3 breeds are the same for both testing and training, so we should do reasonably well in classifying those.

The data is not badly imbalanced: the breed with the fewest images has 26 and the one with the most has 77.
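One way to check this balance, sketched here on a tiny made-up target array rather than the real `train_targets`, is to sum the one-hot labels per column. The function name `images_per_class` is an assumption for illustration.

```python
import numpy as np

def images_per_class(one_hot_targets):
    """Count how many samples each class has in a one-hot target array."""
    return np.asarray(one_hot_targets).sum(axis=0).astype(int)

# toy example: 5 images spread over 3 classes
toy_targets = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])
counts = images_per_class(toy_targets)
print("per-class counts:", counts)
print("fewest:", counts.min(), "most:", counts.max())
print("imbalance ratio:", counts.max() / counts.min())
```

Applied to the real targets, the same call would give one count per breed, from which the minimum and maximum can be read off.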

The images are stored in RGB format, which means that each pixel in the image has three values: one for red, one for green, and one for blue.

Below is a plot for these values as well as the original images.

The top row is what the model sees; the bottom row is what we see.
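To make the idea of three values per pixel concrete, here is a small sketch that splits an image array into its channels. The 2×2 image is made up for illustration; an image loaded through Keras has the same (rows, columns, channels) layout.

```python
import numpy as np

# made-up 2x2 RGB image: each pixel holds (red, green, blue) in 0..255
img = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype="uint8")

# the last axis indexes the colour channel
red, green, blue = img[..., 0], img[..., 1], img[..., 2]
print(img.shape)   # (2, 2, 3)
print(red)
```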

2 — Detecting Humans

We used OpenCV’s implementation of Haar feature-based cascade classifiers to detect human faces in images.

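import cv2

# face_cascade is assumed to be a cv2.CascadeClassifier loaded beforehand
# with one of OpenCV's pre-trained Haar cascade files
def face_detector(img_path):
    img = cv2.imread(img_path)                     # OpenCV loads images as BGR
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # the cascade works on grayscale
    faces = face_cascade.detectMultiScale(gray)    # one bounding box per detected face
    return len(faces) > 0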

Running the detector on the first 100 human images and the first 100 dog images, it finds a face in 100% of the human pictures and correctly finds no face in 89% of the dog pictures.

The model got this one wrong, but honestly, it is pretty hard to decide whether this is a dog picture with a human or a human picture with a dog.

The result is good enough for our application. Let’s identify some dogs now!
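These percentages can be computed with a small helper like the one below. This is only a sketch: `detection_rate` is a hypothetical name, and the stub detector and file names stand in for `face_detector` and the real image paths.

```python
def detection_rate(detector, img_paths):
    """Percentage of images for which the detector returns True."""
    hits = sum(1 for path in img_paths if detector(path))
    return 100.0 * hits / len(img_paths)

# stand-in detector for demonstration: "finds a face" when the name says so
def fake_face_detector(img_path):
    return "human" in img_path

human_paths = [f"human_{i}.jpg" for i in range(100)]
dog_paths = [f"dog_{i}.jpg" for i in range(100)]

print(detection_rate(fake_face_detector, human_paths))  # faces found in human pictures
print(detection_rate(fake_face_detector, dog_paths))    # false positives in dog pictures
```

With the real detector, the same helper would be called with `face_detector` and the first 100 entries of each file list.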

3 — Detecting Dogs

We used a pre-trained ResNet-50 model to detect dogs in images. This model has been trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks. We can download it using the code below:

from keras.applications.resnet50 import ResNet50

# define ResNet50 model
ResNet50_model = ResNet50(weights='imagenet')

In order to use this model with our images, we need to process the images into the correct tensor size for the model. The expected shape of input is (nb_samples, rows, columns, channels), where nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels for each image, respectively.

from keras.preprocessing import image                  
from tqdm import tqdm

def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type
    img = image.load_img(img_path, target_size=(224, 224))
    # convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return 4D tensor
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)

The path_to_tensor function takes a string-valued file path to a color image as input, resizes it to a square image that is 224×224 pixels, converts it to an array, and then expands it to a 4D tensor. Since we are working with color images, each image has three channels, so a single image becomes a tensor of shape (1, 224, 224, 3). The paths_to_tensor function stacks these tensors, so the final dimensions are (nb_samples, 224, 224, 3). We then rescale the images by dividing every pixel in every image by 255.

train_tensors = paths_to_tensor(train_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255
test_tensors = paths_to_tensor(test_files).astype('float32')/255
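The shape bookkeeping can be checked without loading any files. The sketch below uses random arrays as stand-ins for decoded images and mirrors the `expand_dims`, `vstack`, and rescaling steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for three decoded 224x224 RGB images with values in 0..255
fake_images = [rng.integers(0, 256, size=(224, 224, 3)) for _ in range(3)]

# each image becomes a 4D tensor of shape (1, 224, 224, 3)
tensors_4d = [np.expand_dims(img, axis=0) for img in fake_images]

# stacking along the first axis gives (nb_samples, 224, 224, 3)
batch = np.vstack(tensors_4d).astype("float32") / 255

print(tensors_4d[0].shape)
print(batch.shape)
print(batch.min() >= 0.0, batch.max() <= 1.0)
```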

We can now run the predictions using the pre-trained weights:

from keras.applications.resnet50 import preprocess_input, decode_predictions

def ResNet50_predict_labels(img_path):
    # returns the index of the most likely ImageNet class for the image at img_path
    img = preprocess_input(path_to_tensor(img_path))
    return np.argmax(ResNet50_model.predict(img))

def dog_detector(img_path):
    """
    Return whether the image contains a dog.
    Input:
    - image path : string
    Output:
    - True if the image is a dog, False otherwise
    """
    prediction = ResNet50_predict_labels(img_path)
    return 151 <= prediction <= 268

The dog detector returns True for a prediction between 151 and 268 because the ImageNet categories for dogs form an uninterrupted sequence: dictionary keys 151–268, inclusive, cover all categories from 'Chihuahua' to 'Mexican hairless'.
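To see the range check in isolation, the sketch below applies it to made-up 1000-class score vectors instead of real ResNet-50 output. The helper `is_dog_index` is hypothetical.

```python
import numpy as np

DOG_FIRST, DOG_LAST = 151, 268  # ImageNet dog categories, inclusive

def is_dog_index(class_index):
    """True when an ImageNet class index falls in the dog range."""
    return DOG_FIRST <= class_index <= DOG_LAST

# made-up score vectors over the 1000 ImageNet classes
scores_dog = np.zeros(1000)
scores_dog[200] = 0.9      # highest score inside the dog range
scores_other = np.zeros(1000)
scores_other[500] = 0.9    # highest score outside the dog range

print(is_dog_index(np.argmax(scores_dog)))
print(is_dog_index(np.argmax(scores_other)))
print(DOG_LAST - DOG_FIRST + 1, "dog categories")
```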

4 — Training a CNN from scratch

Now that we can differentiate humans from dogs, we need a way to predict the breed from images. In this step, we train a CNN from scratch that classifies dog breeds. We need at least 1% accuracy on the test set.

I had the following points in mind when experimenting with the CNN:

Since it is not easy to tell different breeds apart, we need several convolutional layers to extract the features that distinguish each breed.
I doubled the number of filters in each layer as the network gets deeper.
To finish, I used global pooling and a dense layer with softmax.
I kept the network fairly shallow in order to avoid overfitting and to train it faster.
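To get a feel for how small such a network is, the arithmetic below counts the trainable parameters of a hypothetical stack that follows these points. The filter counts (16, 32, 64) and the 3×3 kernel size are assumptions for illustration, not the exact architecture used in the project.

```python
def conv2d_params(in_channels, filters, kernel_size):
    """Weights plus biases of a 2D convolution layer."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights plus biases of a fully connected layer."""
    return in_features * units + units

filters_per_layer = [16, 32, 64]   # assumed: doubled at every layer
kernel_size = 3                    # assumed kernel size
num_breeds = 133

in_channels = 3                    # RGB input
total = 0
for filters in filters_per_layer:
    n = conv2d_params(in_channels, filters, kernel_size)
    print(f"conv {in_channels:>2} -> {filters:>2}: {n} parameters")
    total += n
    in_channels = filters

# global pooling has no parameters and leaves one value per filter
n = dense_params(in_channels, num_breeds)
print(f"dense {in_channels} -> {num_breeds}: {n} parameters")
total += n
print("total:", total)
```

Because global pooling collapses each feature map to a single value, the dense layer only sees one input per filter, which keeps the parameter count low.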

The final architecture is as follows:

More details about the code can be found in the GitHub repository.

We can now train the model for 10 epochs and evaluate its performance.

from keras.optimizers import Adam
model.compile(optimizer=Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
from keras.callbacks import ModelCheckpoint  
epochs = 10

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.from_scratch.hdf5', 
                               verbose=1, save_best_only=True)

model.fit(train_tensors, train_targets, 
          validation_data=(valid_tensors, valid_targets),
          epochs=epochs, batch_size=32, callbacks=[checkpointer], verbose=1)

We got an accuracy of 18.78% on the test set. Not bad, considering we started from scratch!

Since we have several labels, it is also a good idea to look at other metrics, such as f1-score.

The F1-score is 0.11, so we can still improve.

It seems that the model starts to overfit at epoch 8.
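For reference, a macro-averaged F1-score can be computed from predicted and true class indices as sketched below. Macro averaging is an assumption here, since the averaging used for the reported score is not stated, and the toy labels are made up.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1-scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # a class that never appears and is never predicted scores 0 here
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, num_classes=3), 3))
```

With one-hot targets and predicted probabilities, both inputs would first be converted to class indices with `np.argmax(..., axis=1)`.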
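Overfitting of this kind shows up as a validation loss that stops improving while training continues. The sketch below picks the best epoch from a list of per-epoch validation losses, which is the same criterion `ModelCheckpoint(save_best_only=True)` applies by default when it monitors `val_loss`. The loss values are made up for illustration.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# made-up validation losses for 10 epochs: improving, then getting worse
val_losses = [4.8, 4.5, 4.3, 4.1, 4.0, 3.9, 3.85, 3.8, 3.9, 4.0]

epoch = best_epoch(val_losses)
print("best epoch:", epoch)
print("epochs trained past the best one:", len(val_losses) - epoch)
```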

5 — Using Transfer Learning

Now comes the fun part: how can we leverage visual models that have already been trained to help us classify different dogs?

It turns out that it is not that hard. We are provided with 4 pre-trained networks, all currently available in Keras, to choose from:

VGG-19 bottleneck features
ResNet-50 bottleneck features
Inception bottleneck features
Xception bottleneck features

I chose ResNet-50 because it was one of the more lightweight options and I wanted to keep my model as efficient as possible.

We can load the bottleneck features from this model and train on them in a similar fashion to how we trained our CNN. The pre-trained weights stay fixed and we fit only the final classification layers.

from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, BatchNormalization, Dense

# load the pre-computed ResNet-50 bottleneck features
bottleneck_features = np.load('bottleneck_features/DogResnet50Data.npz')
print(bottleneck_features)

train_dognet = bottleneck_features['train']
valid_dognet = bottleneck_features['valid']
test_dognet = bottleneck_features['test']

# small classification head on top of the bottleneck features
dognet_model = Sequential()
dognet_model.add(GlobalAveragePooling2D(input_shape=train_dognet.shape[1:]))
dognet_model.add(BatchNormalization())
dognet_model.add(Dense(133, activation='softmax'))
dognet_model.summary()

dognet_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.dognet.hdf5', 
                               verbose=0, save_best_only=True)

dognet_model.fit(train_dognet, train_targets, 
          validation_data=(valid_dognet, valid_targets),
          epochs=15, batch_size=32, callbacks=[checkpointer], verbose=1)
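To show what this small head computes, here is a NumPy sketch of global average pooling followed by a softmax layer, with batch normalization left out for brevity. The feature shape (7, 7, 2048) and the random weights are assumptions for illustration; the real shape comes from `train_dognet.shape[1:]` and the real weights are learned during `fit`.

```python
import numpy as np

rng = np.random.default_rng(42)

def global_average_pooling(features):
    """Average over the two spatial axes: (n, h, w, c) -> (n, c)."""
    return features.mean(axis=(1, 2))

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# assumed shapes: 4 images, 7x7 spatial grid, 2048 channels, 133 breeds
features = rng.normal(size=(4, 7, 7, 2048))
weights = rng.normal(scale=0.01, size=(2048, 133))
bias = np.zeros(133)

pooled = global_average_pooling(features)      # (4, 2048)
probs = softmax(pooled @ weights + bias)       # (4, 133)

print(pooled.shape, probs.shape)
print(probs.sum(axis=1))                       # each row sums to 1
```

Only `weights` and `bias` (plus the batch normalization parameters in the real model) are trained, which is why fitting this head is so much faster than training a full CNN.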

Let’s see the results:

We got an accuracy of 80%! Considering the number of classes, this result is quite impressive.

Let’s now look at the F1-score

Impressive! Especially considering 133 different breeds

Also a great result!

Notice how the training accuracy rapidly goes to 100% while the validation accuracy remains stable rather than degrading. We get great validation performance without it being hurt by overfitting.

6 — Putting Everything Together

Now, let’s put everything together. We first need a function to return the breed of the dog.

def predict_breed(img_path):
    """
    Predict dog breed given image path
    
    Input:
    image path : string
    
    Output:
    dog breed : string
    """
    
    # path to tensor from image path
    tensor_path = path_to_tensor(img_path)
    
    # getting bottleneck features
    bottleneck_feature = extract_Resnet50(tensor_path)  
    # predicting the model
    pred_vector = dognet_model.predict(bottleneck_feature)
    # since the dog name is a path, we select the name after the last dot
    doggo_breed = dog_names[np.argmax(pred_vector)].split('.')[-1]
    # return predicted dog breed
    return doggo_breed

Now, we can use the face detector and the dog detector together with the breed classifier! We define our algorithm below

# assumed: Image and display come from IPython.display (notebook environment)
from IPython.display import Image, display

def breed_algorithm(img_path):
    """
    Identify whether the image shows a human or a dog and
    return the most similar breed.
    If the image is neither a dog nor a human, print a message.
    
    Input:
    image path : string
    
    Output:
    predicted breed : string (None if neither is detected)
    """
    # showing image
    print('Predicting the image below:')
    display(Image(img_path, width=200, height=200))
    if dog_detector(img_path):
        predicted_doggo = predict_breed(img_path)
        print(f"HI DOG, it seems that you are a {predicted_doggo}")
        return predicted_doggo
    elif face_detector(img_path):  # call the detector; the bare function name is always truthy
        predicted_doggo = predict_breed(img_path)
        print(f'This is a human, not a dog... but it looks like a {predicted_doggo}')
        return predicted_doggo
    else:
        print("It was not possible to identify if the image is a human or dog.")
        return None

The algorithm can be defined as:

if a dog is detected in the image, return the predicted breed.
if a human is detected in the image, return the resembling dog breed.
if neither is detected in the image, provide an output that indicates an error.
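The same decision logic can be sketched with the detectors passed in as arguments, which makes it easy to try out without any trained model. All names here (`classify_image` and the stub functions) are hypothetical stand-ins for `dog_detector`, `face_detector`, and `predict_breed`.

```python
def classify_image(img_path, is_dog, is_human, predict):
    """Return (kind, breed); breed is None when nothing is detected."""
    if is_dog(img_path):
        return "dog", predict(img_path)
    if is_human(img_path):
        return "human", predict(img_path)
    return "unknown", None

# stub detectors and predictor for demonstration only
def stub_is_dog(path):
    return path.startswith("dog")

def stub_is_human(path):
    return path.startswith("human")

def stub_predict(path):
    return "Beagle"

for path in ["dog_01.jpg", "human_01.jpg", "car_01.jpg"]:
    print(path, "->", classify_image(path, stub_is_dog, stub_is_human, stub_predict))
```

Checking for a dog first matters: a picture that contains both a dog and a face is then treated as a dog picture.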

7 — Testing the Model

Finally, the moment we’ve all been waiting for. Out of the 6 dog pictures we tested, we got 5 of them right. Pretty cool, eh?

Transfer learning really helps with the performance

Let’s see which one the model got wrong:

The model’s one mistake was very close to the right answer: instead of predicting American Water Spaniel, it predicted Boykin Spaniel. Even a human is prone to make this type of mistake.

Let’s see a human picture!

Here’s what a Cavalier King Charles Spaniel looks like:

Pretty amazing results!

8 — Conclusion

The goal was to have a model with accuracy greater than 60%, and we managed to achieve 80%! This project was pretty fun, and I believe learning about transfer learning is particularly important nowadays: it is the state-of-the-art approach for most computer vision and language models!

Of course, we can still improve our model. Here are some suggestions:

Train the model with extra data, either by collecting more images or by using data augmentation (rotation, zooming, etc.).
We could also train the model with some added noise in the images, so that the features extracted by the model become more robust and less affected by the lighting of the image or similar factors.
We could also use more than one model for the final prediction, i.e., an ensemble of models. The final output would then be the average of the individual models’ outputs, which should reduce the biases that each model develops during training.
Instead of using a CNN, we could try to use attention layers.
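As an example of the ensemble idea, the sketch below averages the class probabilities of several models and takes the most likely class. The three probability arrays are made up; in practice each would come from a different model’s `predict` call on the same images.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average per-model class probabilities and return (mean_probs, labels)."""
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    return mean_probs, mean_probs.argmax(axis=1)

# made-up outputs of three models for 2 images and 3 classes
model_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
model_b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
model_c = np.array([[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]])

mean_probs, labels = ensemble_predict([model_a, model_b, model_c])
print(mean_probs)
print(labels)
```

In this toy case two of the three models favour class 0 for the first image, yet the averaged probabilities pick class 1, which shows that averaging weighs each model’s confidence and not just its vote.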

Stay safe, cheers!