Live Object Detection with the Tensorflow Object Detection API

Update 04.11.19: The Tensorflow Object Detection API now works with Tensorflow 2.0. You can find the updated code on my Github.

Object detection deals with detecting instances of a certain class, such as humans, cars, or animals, in an image or video. It achieves this by learning the distinctive features of each object class.

The Tensorflow Object Detection API is an open-source framework that allows you to use pretrained object detection models or to create and train new models using transfer learning. This is extremely useful, because building an object detection model from scratch is difficult and requires significant computing power.

In this article, we will go through the process of rewriting the existing example code to detect objects in a real-time video stream.

Installation

To capture the video stream, we will use the OpenCV (Open Source Computer Vision) library, which can be installed with:

pip install opencv-python==3.4.4.19
or
conda install opencv

If you don’t have the Tensorflow Object Detection API installed yet, you can check out my article, which goes through the installation step by step and, at the end, tests the functionality by executing the example notebook.
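To verify that everything is in place, a quick sanity check like the following should run without errors (a minimal sketch; the imports are the ones the example notebook uses, and the versions are just what this article was written against):

import cv2
import numpy as np
import tensorflow as tf
from object_detection.utils import ops as utils_ops
from object_detection.utils import visualization_utils as vis_util

# Print the installed versions; this article was written against
# OpenCV 3.4.4 and Tensorflow 1.x.
print('OpenCV:', cv2.__version__)
print('Tensorflow:', tf.__version__)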

Object Detection

The example notebook can be reused for our new application, since its main part imports the needed libraries, downloads the model, and defines useful helper code.

The only section we need to modify is the detection section, which comprises the last three cells and currently detects objects in two manually loaded images. The first cell isn’t needed anymore, since its only purpose was to get the paths to the test images.

In order to create a live object detection application, we need to make minor changes to the second and third cells.

First, we need to remove from the run_inference_for_single_image method all the code that only needs to be executed once. Since we will call this method multiple times per second, re-executing redundant setup code on every call is computationally expensive. The statements to remove span everything from the with statements, which open the graph and the session, up to the start of the if statement.

The removed lines will be copied into the next cell. The finished function looks like this:

def run_inference_for_single_image(image, graph):
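    # Note: tensor_dict and sess are created once in the next cell and are
    # reused on every call here, instead of being rebuilt for each frame.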
    if 'detection_masks' in tensor_dict:
        # The following processing is only for single image
        detection_boxes = tf.squeeze(tensor_dict['detection_boxes'], [0])
        detection_masks = tf.squeeze(tensor_dict['detection_masks'], [0])
        # Reframe is required to translate mask from box coordinates to image coordinates and fit the image size.
        real_num_detection = tf.cast(tensor_dict['num_detections'][0], tf.int32)
        detection_boxes = tf.slice(detection_boxes, [0, 0], [real_num_detection, -1])
        detection_masks = tf.slice(detection_masks, [0, 0, 0], [real_num_detection, -1, -1])
        detection_masks_reframed = utils_ops.reframe_box_masks_to_image_masks(
            detection_masks, detection_boxes, image.shape[0], image.shape[1])
        detection_masks_reframed = tf.cast(
            tf.greater(detection_masks_reframed, 0.5), tf.uint8)
        # Follow the convention by adding back the batch dimension
        tensor_dict['detection_masks'] = tf.expand_dims(
            detection_masks_reframed, 0)
    image_tensor = tf.get_default_graph().get_tensor_by_name('image_tensor:0')

    # Run inference
    output_dict = sess.run(tensor_dict,
                           feed_dict={image_tensor: np.expand_dims(image, 0)})

    # all outputs are float32 numpy arrays, so convert types as appropriate
    output_dict['num_detections'] = int(output_dict['num_detections'][0])
    output_dict['detection_classes'] = output_dict[
          'detection_classes'][0].astype(np.uint8)
    output_dict['detection_boxes'] = output_dict['detection_boxes'][0]
    output_dict['detection_scores'] = output_dict['detection_scores'][0]
    if 'detection_masks' in output_dict:
        output_dict['detection_masks'] = output_dict['detection_masks'][0]
    return output_dict

In the last cell, we first include all the code we removed from the cell above. Note that run_inference_for_single_image now relies on the sess and tensor_dict variables defined here, so it must be called inside these with blocks.

with detection_graph.as_default():
    with tf.Session() as sess:
        ops = tf.get_default_graph().get_operations()
        all_tensor_names = {output.name for op in ops for output in op.outputs}
        tensor_dict = {}
        for key in [
          'num_detections', 'detection_boxes', 'detection_scores',
          'detection_classes', 'detection_masks'
          ]:
            tensor_name = key + ':0'
            if tensor_name in all_tensor_names:
                tensor_dict[key] = tf.get_default_graph().get_tensor_by_name(tensor_name)
        for image_path in TEST_IMAGE_PATHS:
            image = Image.open(image_path)
            # the array based representation of the image will be used later in order to prepare the
            # result image with boxes and labels on it.
            image_np = load_image_into_numpy_array(image)
            # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
            image_np_expanded = np.expand_dims(image_np, axis=0)
            # Actual detection.
            output_dict = run_inference_for_single_image(image_np, detection_graph)
            # Visualization of the results of a detection.
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                output_dict['detection_boxes'],
                output_dict['detection_classes'],
                output_dict['detection_scores'],
                category_index,
                instance_masks=output_dict.get('detection_masks'),
                use_normalized_coordinates=True,
                line_thickness=8)
            plt.figure(figsize=IMAGE_SIZE)
            plt.imshow(image_np)

Now we will import OpenCV, create a VideoCapture object, and change the for loop that iterates over the test images into a while True loop.

Inside the loop, we no longer load images with Image.open; instead, we use the VideoCapture object’s read method to grab the current frame.

Lastly, we change the visualization to use cv2.imshow, which opens a window showing the live video, instead of plt.imshow, which only displays a static image. We also add an if statement that checks whether the q key was pressed and, if so, releases the webcam and closes the window.

import cv2
cap = cv2.VideoCapture(0)

with detection_graph.as_default():
    with tf.Session() as sess:
        ops = tf.get_default_graph().get_operations()
        all_tensor_names = {output.name for op in ops for output in op.outputs}
        tensor_dict = {}
        for key in [
          'num_detections', 'detection_boxes', 'detection_scores',
          'detection_classes', 'detection_masks'
          ]:
            tensor_name = key + ':0'
            if tensor_name in all_tensor_names:
                tensor_dict[key] = tf.get_default_graph().get_tensor_by_name(
              tensor_name)
        while True:
            ret, image_np = cap.read()
            if not ret:  # stop if no frame could be read from the webcam
                break
            # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
            image_np_expanded = np.expand_dims(image_np, axis=0)
            # Actual detection.
            output_dict = run_inference_for_single_image(image_np, detection_graph)
            # Visualization of the results of a detection.
            vis_util.visualize_boxes_and_labels_on_image_array(
              image_np,
              output_dict['detection_boxes'],
              output_dict['detection_classes'],
              output_dict['detection_scores'],
              category_index,
              instance_masks=output_dict.get('detection_masks'),
              use_normalized_coordinates=True,
              line_thickness=8)
            cv2.imshow('object detection', cv2.resize(image_np, (800,600)))
            if cv2.waitKey(1) & 0xFF == ord('q'):
                cap.release()
                cv2.destroyAllWindows()
                break

After running this code, a new window will open in which objects are detected in real time.

Figure 2: Object Detection Example
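One caveat worth knowing: OpenCV returns frames in BGR channel order, while the detection models are trained on RGB images. Detection usually still works, but if accuracy seems off you can convert each frame before inference. A minimal sketch of the changed loop body (the rest of the loop stays as shown above):

ret, image_np = cap.read()
if not ret:
    break
# OpenCV delivers BGR frames; the model was trained on RGB images.
image_np = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)
output_dict = run_inference_for_single_image(image_np, detection_graph)
# ... visualization as before ...
# Convert back to BGR so cv2.imshow displays the colors correctly.
cv2.imshow('object detection',
           cv2.resize(cv2.cvtColor(image_np, cv2.COLOR_RGB2BGR), (800, 600)))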

Conclusion

Object detection deals with detecting instances of a certain class, such as humans, cars, or animals, in an image or video. A model achieves this by learning the distinctive features of each object class.

The Tensorflow Object Detection API allows you to easily create or use an object detection model by making use of pretrained models and transfer learning.

If you liked this article, consider subscribing to my YouTube channel and following me on social media.

The code covered in this article is available as a Github Repository.

If you have any questions, recommendations, or critiques, I can be reached via Twitter or the comment section.