Voice Biometrics Authentication using GMM and Face Recognition Using Facenet and Tensorflow

How to Run :

Install dependencies by running pip3 install -r requirement.txt

1. Run in a Jupyter notebook: main.ipynb

2. Run in a terminal as follows:

To add new user :

  python3 add_user.py

To Recognize user :

  python3 recognize.py

To keep recognizing until the user raises a KeyboardInterrupt (Ctrl+C):

  python3 recognize_until_keyboard_Interrupt.py

To delete an existing user :

  python3 delete_user.py

Voice Authentication

For voice recognition, a GMM (Gaussian Mixture Model) is trained on MFCC features extracted from the user's WAV audio files.

Face Recognition

The face recognition system uses a Siamese neural network. The model is based on FaceNet, implemented with TensorFlow, and OpenCV is used for real-time face detection and recognition.
The model uses face encodings to identify users.

The program uses a Python dictionary to map users to their corresponding face encodings.
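
The name-to-encoding dictionary can be illustrated with a small sketch that uses only the standard library. The helper names save_db and load_db, the user name "alice", and the three-value stand-in vector are hypothetical; real encodings have 128 values and the project stores them in face_database/embeddings.pickle.

```python
import os
import pickle
import tempfile


def save_db(db, path):
    """Write the name -> encoding dictionary to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(db, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_db(path):
    """Load the dictionary, or start with an empty one if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo in a temporary directory so the real database is never touched.
demo_path = os.path.join(tempfile.mkdtemp(), "embeddings.pickle")
db = load_db(demo_path)          # empty on first use
db["alice"] = [0.1, 0.2, 0.3]    # stand-in for a 128-dimensional encoding
save_db(db, demo_path)
restored = load_db(demo_path)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.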

How it works: step-by-step guide

Import the dependencies

 import tensorflow as tf
 import numpy as np
 import os
 import glob
 import pickle
 import cv2
 import time
 from numpy import genfromtxt

 from keras import backend as K
 from keras.models import load_model

 import pyaudio
 from IPython.display import Audio, display, clear_output
 import wave
 from scipy.io.wavfile import read
 # note: GMM was removed in scikit-learn 0.20; newer versions provide GaussianMixture instead
 from sklearn.mixture import GMM
 import warnings

 from sklearn import preprocessing
 # for converting audio to mfcc
 import python_speech_features as mfcc

Facial Encodings

#provides 128 dim embeddings for face
def img_to_encoding(img):
   img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
   #converting img format to channel first
   img = np.around(np.transpose(img, (2,0,1))/255.0, decimals=12)

   x_train = np.array([img])

   #facial embedding from trained model
   embedding = model.predict_on_batch(x_train)
   return embedding 

The function converts the input image from BGR to RGB, rearranges it to channels-first format as required by the pre-trained FaceNet model, and scales the pixel values to the range [0, 1]. The model outputs a 128-dimensional encoding vector for the input image.

Triplet Loss

def triplet_loss(y_true, y_pred, alpha = 0.2):
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    # squared L2 distances anchor-positive and anchor-negative
    pos_dist = tf.reduce_sum( tf.square(tf.subtract(anchor, positive)) )
    neg_dist = tf.reduce_sum( tf.square(tf.subtract(anchor, negative)) )
    # triplet loss formula: max(pos_dist - neg_dist + alpha, 0)
    basic_loss = pos_dist - neg_dist + alpha
    loss = tf.maximum(basic_loss, 0.0)
    return loss

# load the model
model = load_model('facenet_model/model.h5', custom_objects={'triplet_loss': triplet_loss})

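Two encodings are compared, and if they are close enough the two images are taken to be of the same person; otherwise they are taken to be of different people. The triplet loss is the training objective that makes this comparison meaningful: it pulls the anchor encoding towards the positive example and pushes it away from the negative example by at least the margin alpha. The loss is passed as a custom object so that the pre-trained FaceNet model, which is built on an Inception network, can be loaded.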
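
To see how the triplet loss behaves, here is a minimal sketch in plain Python that mirrors the formula above on two-value vectors. The function names squared_distance and triplet_loss_value and the example vectors are made up for illustration; the real model works on 128-dimensional TensorFlow tensors.

```python
def squared_distance(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss_value(anchor, positive, negative, alpha=0.2):
    """max(||a - p||^2 - ||a - n||^2 + alpha, 0), as in triplet_loss above."""
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(pos_dist - neg_dist + alpha, 0.0)


# Positive is close and negative is far: the margin is satisfied, loss is zero.
easy = triplet_loss_value([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])

# Positive is far and negative is close: the loss is large.
hard = triplet_loss_value([0.0, 0.0], [1.0, 0.0], [0.1, 0.0])
```

A zero loss means the triplet already satisfies the margin and contributes nothing to training; only triplets like the second one drive the encodings apart.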

MFCC features and delta of the feature vector

def calculate_delta(array):
    rows,cols = array.shape
    deltas = np.zeros((rows,20))
    N = 2
    for i in range(rows):
        index = []
        j = 1
        while j <= N:
            # clamp the neighbouring frame indices to the valid range
            if i-j < 0:
                first = 0
            else:
                first = i-j
            if i+j > rows -1:
                second = rows -1
            else:
                second = i+j
            index.append((second,first))
            j += 1
        deltas[i] = ( array[index[0][0]]-array[index[0][1]] + (2 * (array[index[1][0]]-array[index[1][1]])) ) / 10
    return deltas

#convert audio to mfcc features
def extract_features(audio,rate):    
    mfcc_feat = mfcc.mfcc(audio,rate, 0.025, 0.01,20,appendEnergy = True, nfft=1103)
    mfcc_feat = preprocessing.scale(mfcc_feat)
    delta = calculate_delta(mfcc_feat)

    #combining both mfcc features and delta
    combined = np.hstack((mfcc_feat,delta)) 
    return combined

The audio is converted into MFCC features, which are then standardized (zero mean and unit variance per coefficient) with preprocessing.scale. Then the delta of the feature matrix is computed, and the MFCC features and their deltas are stacked side by side to form the 40-dimensional input for the GMM.
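
The delta computation can be checked by hand on a tiny input. The sketch below is a plain-Python restatement of the same formula (neighbour differences at offsets 1 and 2, weighted by the offset, divided by 10, with indices clamped at the edges). The name delta_features and the one-coefficient "ramp" input are illustrative only; the project code works on NumPy arrays with 20 coefficients per frame.

```python
def delta_features(frames, n=2):
    """Delta of each frame from its neighbours at offsets 1..n (edges clamped)."""
    rows = len(frames)
    denom = 2 * sum(j * j for j in range(1, n + 1))  # equals 10 when n == 2
    deltas = []
    for i in range(rows):
        row = [0.0] * len(frames[i])
        for j in range(1, n + 1):
            first = max(i - j, 0)           # clamp at the first frame
            second = min(i + j, rows - 1)   # clamp at the last frame
            for k in range(len(row)):
                row[k] += j * (frames[second][k] - frames[first][k])
        deltas.append([v / denom for v in row])
    return deltas


# Six frames with one coefficient each, increasing by 1 per frame.
ramp = [[float(i)] for i in range(6)]
deltas = delta_features(ramp)

# Stack features and deltas side by side, like np.hstack in extract_features.
combined = [f + d for f, d in zip(ramp, deltas)]
```

For the interior frames the delta equals the slope of the ramp (1.0); near the edges the clamped indices shrink it.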

Adding a New User's Face

def add_user():
    name = input("Enter Name:")
     # check for existing database
    if os.path.exists('./face_database/embeddings.pickle'):
        with open('./face_database/embeddings.pickle', 'rb') as database:
            db = pickle.load(database)   
            if name in db:
                print("Name Already Exists! Try Another Name...")
                return
    else:
        #if the database does not exist, create a new one
        db = {}
    cap = cv2.VideoCapture(0)
    cap.set(3, 640)
    cap.set(4, 480)
    #detecting only frontal face using haarcascade
    face_cascade = cv2.CascadeClassifier('./haarcascades/haarcascade_frontalface_default.xml')
    i = 3
    face_found = False
    while True:            
        _, frame = cap.read()
        frame = cv2.flip(frame, 1, 0)
        cv2.putText(frame, 'Keep Your Face infront of Camera', (100, 200), cv2.FONT_HERSHEY_SIMPLEX, 
                    0.8, (255, 255, 255), 2)
        cv2.putText(frame, 'Starting', (260, 270), cv2.FONT_HERSHEY_SIMPLEX, 
                    0.8, (255, 255, 255), 2)
        cv2.putText(frame, str(i), (290, 330), cv2.FONT_HERSHEY_SIMPLEX, 
                    1.3, (255, 255, 255), 3)

        cv2.imshow('frame', frame)
        cv2.waitKey(1000)  # show each countdown number for one second
        i -= 1
        if i < 0:
            break
    start_time = time.time()        
    img_path = './saved_image/1.jpg'

    ## Face recognition 
    while True:
        curr_time = time.time()
        _, frame = cap.read()
        frame = cv2.flip(frame, 1, 0)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        face = face_cascade.detectMultiScale(gray, 1.3, 5)
        if len(face) == 1:
            for(x, y, w, h) in face:
                roi = frame[y-10:y+h+10, x-10:x+w+10]

                fh, fw = roi.shape[:2]

                #make sure the face roi is of required height and width
                if fh < 20 and fw < 20:
                    continue

                face_found = True
                #cv2.imwrite(img_path, roi)

                cv2.rectangle(frame, (x-10,y-10), (x+w+10, y+h+10), (255, 200, 200), 2)

        #keep the camera open for 3 seconds
        if curr_time - start_time >= 3:
            break
        cv2.imshow('frame', frame)
        cv2.waitKey(1)

    cap.release()
    cv2.destroyAllWindows()

    if face_found:
        img = cv2.resize(roi, (96, 96))

        db[name] = img_to_encoding(img)

        with open('./face_database/embeddings.pickle', "wb") as database:
            pickle.dump(db, database, protocol=pickle.HIGHEST_PROTOCOL)
    elif len(face) > 1:
        print("More than one faces found. Try again...")
        return
    else:
        print('There was no face found in the frame. Try again...')
        return

This first part of add_user() adds a new user's face to the database.

It first detects the face in the frame using a Haar cascade classifier. If exactly one face is found, it resizes the ROI and passes it to img_to_encoding(img), which returns a 128-dimensional facial encoding; the encoding is then stored in the face database as a pickle file.

The Haar cascade face detector is used because it detects only frontal faces and not side faces, so the system only unlocks when the user looks directly at the camera.

If more than one face or no face is found in the frame, an error message is printed.

Adding a New User's Voice

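  #Voice authentication
  FORMAT = pyaudio.paInt16
  CHANNELS = 1        # assumed value; not shown in this excerpt
  RATE = 44100
  CHUNK = 1024
  RECORD_SECONDS = 3  # assumed value; not shown in this excerpt
  source = "./voice_database/" + name
  os.makedirs(source, exist_ok=True)

  for i in range(3):
      audio = pyaudio.PyAudio()

      if i == 0:
          j = 3
          while j>=0:
              print("Speak your name in {} seconds".format(j))
              time.sleep(1.0)
              j -= 1

      elif i ==1:
          print("Speak your name one more time")

      else:
          print("Speak your name one last time")

      # start Recording
      stream = audio.open(format=FORMAT, channels=CHANNELS,
                  rate=RATE, input=True,
                  frames_per_buffer=CHUNK)

      frames = []

      for _ in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
          data = stream.read(CHUNK)
          frames.append(data)

      # stop Recording
      stream.stop_stream()
      stream.close()
      audio.terminate()

      # saving wav file of speaker
      waveFile = wave.open(source + '/' + str((i+1)) + '.wav', 'wb')
      waveFile.setnchannels(CHANNELS)
      waveFile.setsampwidth(audio.get_sample_size(FORMAT))
      waveFile.setframerate(RATE)
      waveFile.writeframes(b''.join(frames))
      waveFile.close()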

  dest =  "./gmm_models/"
  count = 1
  # initialised once, outside the loop, so features from all files accumulate
  features = np.array([])

  for path in os.listdir(source):
      path = os.path.join(source, path)

      # reading audio files of speaker
      (sr, audio) = read(path)
      # extract 40 dimensional MFCC & delta MFCC features
      vector   = extract_features(audio,sr)

      if features.size == 0:
          features = vector
      else:
          features = np.vstack((features, vector))
      # when features of 3 files of speaker are concatenated, then do model training
      if count == 3:
          gmm = GMM(n_components = 16, n_iter = 200, covariance_type='diag',n_init = 3)
          gmm.fit(features)

          # saving the trained gaussian model
          pickle.dump(gmm, open(dest + name + '.gmm', 'wb'))
          print(name + ' added successfully')
          features = np.asarray(())
          count = 0
      count = count + 1

if __name__ == '__main__':
    add_user()

The second part of add_user() adds a new user's voice to the database.

The system asks the user to speak his/her name three times, one recording at a time, and saves the three voice samples as WAV files.

The function extract_features(audio, sr) extracts 40-dimensional feature vectors (MFCC and delta MFCC) from each sample. The features of all three samples are concatenated, a GMM is trained on them, and the user's voice model is saved as a .gmm file.
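
The WAV-saving step can be tried without a microphone. The following sketch uses only the standard library and writes a synthetic 440 Hz tone instead of recorded audio; the helper name write_wav, the mono channel count, and the tone itself are assumptions for illustration. The 2-byte sample width corresponds to the 16-bit format (pyaudio.paInt16) used in the recording code.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, rate=44100, channels=1):
    """Save 16-bit integer samples as a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample, i.e. 16-bit audio
        wf.setframerate(rate)
        wf.writeframes(struct.pack("<%dh" % len(samples), *samples))


rate = 44100
# 0.1 seconds of a 440 Hz sine wave as a stand-in for a recorded voice sample.
tone = [int(10000 * math.sin(2 * math.pi * 440 * t / rate))
        for t in range(rate // 10)]

path = os.path.join(tempfile.mkdtemp(), "1.wav")
write_wav(path, tone, rate)

# Read the header back to confirm what was written.
with wave.open(path, "rb") as wf:
    info = (wf.getnchannels(), wf.getsampwidth(),
            wf.getframerate(), wf.getnframes())
```

In the project itself the frames come from stream.read(CHUNK) and are joined with b''.join(frames) before being written.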

Voice Authentication

def recognize():
   # Voice Authentication
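   FORMAT = pyaudio.paInt16
   CHANNELS = 1        # assumed value; not shown in this excerpt
   RATE = 44100
   CHUNK = 1024
   RECORD_SECONDS = 3  # assumed value; not shown in this excerpt
   FILENAME = "./test.wav"

   audio = pyaudio.PyAudio()
   # start Recording
   stream = audio.open(format=FORMAT, channels=CHANNELS,
                   rate=RATE, input=True,
                   frames_per_buffer=CHUNK)

   frames = []

   for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
       data = stream.read(CHUNK)
       frames.append(data)
   print("finished recording")

   # stop Recording
   stream.stop_stream()
   stream.close()
   audio.terminate()

   # saving wav file
   waveFile = wave.open(FILENAME, 'wb')
   waveFile.setnchannels(CHANNELS)
   waveFile.setsampwidth(audio.get_sample_size(FORMAT))
   waveFile.setframerate(RATE)
   waveFile.writeframes(b''.join(frames))
   waveFile.close()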

   modelpath = "./gmm_models/"

   gmm_files = [os.path.join(modelpath,fname) for fname in 
               os.listdir(modelpath) if fname.endswith('.gmm')]

   models    = [pickle.load(open(fname,'rb')) for fname in gmm_files]

   speakers   = [fname.split("/")[-1].split(".gmm")[0] for fname 
               in gmm_files]
   #read test file
   sr,audio = read(FILENAME)
   # extract mfcc features
   vector = extract_features(audio,sr)
   log_likelihood = np.zeros(len(models)) 

   #checking with each model one by one
   for i in range(len(models)):
       gmm = models[i]         
       scores = np.array(gmm.score(vector))
       log_likelihood[i] = scores.sum()

   pred = np.argmax(log_likelihood)
   identity = speakers[pred]
   # if voice not recognized then terminate the process
   if identity == 'unknown':
           print("Not Recognized! Try again...")
           return
   print( "Recognized as - ", identity)

This part of the function recognizes the user's voice: the system asks the user to speak his/her name.

As the user speaks, the function saves the voice sample as test.wav and then reads it back to extract the 40-dimensional MFCC features.

It then loads all the pre-trained GMM models, passes the extracted feature vectors to gmm.score(vector) for each model in turn, and sums the scores to get the log-likelihood under each model. The argmax over these log-likelihoods gives the predicted user, i.e. the speaker whose model assigns the test sample the highest likelihood.
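
The scoring idea can be sketched without scikit-learn. In the example below each speaker "model" is a single diagonal Gaussian, which is a deliberate simplification of the 16-component GMM used in the project; the speaker names, means, variances and test frames are invented for illustration.

```python
import math


def log_likelihood(frames, mean, var):
    """Summed log-likelihood of all frames under one diagonal Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total


# Hypothetical per-speaker models: (mean vector, variance vector).
models = {
    "alice": ([0.0, 0.0], [1.0, 1.0]),
    "bob": ([3.0, 3.0], [1.0, 1.0]),
}

# Two feature frames from the test recording (2 values instead of 40).
test_frames = [[2.8, 3.1], [3.2, 2.9]]

scores = {name: log_likelihood(test_frames, mean, var)
          for name, (mean, var) in models.items()}

# The predicted speaker is the model with the highest summed log-likelihood.
identity = max(scores, key=scores.get)
```

Note that an argmax always picks some enrolled speaker, so rejecting unknown voices needs an explicit 'unknown' model or a score threshold.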

If the user's voice matches, the function goes on to the face recognition part; otherwise it terminates with an appropriate message.

Face Recognition

   # face recognition
   print("Keep Your face infront of the camera")
   cap = cv2.VideoCapture(0)
   cap.set(3, 640)
   cap.set(4, 480)

   cascade = cv2.CascadeClassifier('./haarcascades/haarcascade_frontalface_default.xml')
   #loading the database 
   database = pickle.load(open('face_database/embeddings.pickle', "rb"))
   start_time = time.time()
   while True:
       curr_time = time.time()
       _, frame = cap.read()
       frame = cv2.flip(frame, 1, 0)
       gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

       face = cascade.detectMultiScale(gray, 1.3, 5)
       name = 'unknown'
       if len(face) == 1:

           for (x, y, w, h) in face:
               roi = frame[y-10:y+h+10, x-10:x+w+10]
               fh, fw = roi.shape[:2]
               min_dist = 100
               #make sure the face is of required height and width
               if fh < 20 and fw < 20:
                   continue

               #resizing image as required by the model
               img = cv2.resize(roi, (96, 96))

               #128 d encodings from pre-trained model
               encoding = img_to_encoding(img)
               # loop over all the recorded encodings in database 
               for knownName in database:
                   # find the similarity between the input encodings and recorded encodings in database using L2 norm
                   dist = np.linalg.norm(np.subtract(database[knownName], encoding) )
                   # check if minimum distance or not
                   if dist < min_dist:
                       min_dist = dist
                       name = knownName

           # if min dist is less than the threshold value and face and voice match, unlock the door
           if min_dist < 0.4 and name == identity:
               print ("Door Unlocked! Welcome " + str(name))
               break

       #open the cam for 3 seconds
       if curr_time - start_time >= 3:
           break

       cv2.imshow('frame', frame)
       cv2.waitKey(1)
   cap.release()
   cv2.destroyAllWindows()
   if len(face) == 0:
       print('There was no face found in the frame. Try again...')
   elif len(face) > 1:
       print("More than one faces found. Try again...")
   elif min_dist > 0.5 or name != identity:
       print("Not Recognized! Try again...")
if __name__ == '__main__':
    recognize()

If the user's voice matches, the function executes the face recognition part.

First it loads the cascade classifier to detect the face in the frame, and then loads the embeddings.pickle file, which holds the facial embeddings of authorized users.

If exactly one face is found, it captures the face ROI, resizes it and computes the 128-dimensional facial encoding. It then loops over all the recorded encodings in the database and measures the similarity between the input encoding and each recorded encoding using the L2 norm. If the minimum distance is less than the threshold value and both face and voice match the same user, the user is identified as authorized and the door is unlocked.

If min_dist is greater than the threshold value, the user is reported as unauthorized. If no face or more than one face is found, an error message is printed.
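
The matching rule can be summarised as a nearest-neighbour search with a distance cut-off. This is a minimal sketch in plain Python, not the project's code: the function name identify, the two-value encodings and the user names are hypothetical, and the default threshold of 0.4 is taken from the unlock check shown earlier.

```python
import math


def identify(encoding, database, threshold=0.4):
    """Return (name, distance) of the closest stored encoding, or 'unknown'."""
    best_name, min_dist = "unknown", float("inf")
    for name, known in database.items():
        # L2 norm of the difference between stored and input encodings
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(known, encoding)))
        if dist < min_dist:
            best_name, min_dist = name, dist
    if min_dist >= threshold:
        return "unknown", min_dist
    return best_name, min_dist


# Stand-in database with 2-value encodings instead of 128.
database = {"alice": [0.0, 0.0], "bob": [1.0, 0.0]}

match = identify([0.1, 0.0], database)     # close to alice
no_match = identify([0.5, 0.0], database)  # 0.5 away from both users
```

In the full system the returned name must also equal the identity from the voice step before the door is unlocked.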

Controlling the face recognition accuracy: the distance threshold controls how confident a match must be, and a lower value is stricter. In the code above the unlock check uses 0.4 and the rejection check uses 0.5; change these values to tune the behaviour.

Another version of the recognizer keeps running until the user raises a KeyboardInterrupt. It is a modified version of the recognize() function for real-time situations.

References:

Code for the FaceNet model is based on the assignment from Convolutional Neural Networks Specialization by Deeplearning.ai on Coursera: https://www.coursera.org/learn/convolutional-neural-networks/home/welcome

Florian Schroff, Dmitry Kalenichenko, James Philbin (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering.

The pretrained model used is inspired by Victor Sy Wang's implementation and was loaded using his code: https://github.com/iwantooxxoox/Keras-OpenFace

Inspiration from the official FaceNet github repository: https://github.com/davidsandberg/facenet