Detecting Pedestrians using PyTorch – A Helpful Guide

Photo by Jacek Dylag / Unsplash

In this blog post, you will learn how to implement a Pedestrian Detection algorithm using PyTorch.

Computer vision is a field of computer science that applies artificial intelligence models to understand, reason about, and synthesize visual information. This visual information is usually a two-dimensional image, but it can also take other forms such as videos, 3-D meshes, and polyhedrons.

The most common problem addressed by computer vision is image classification, i.e. taking an image as input and returning its type. As trivial as it sounds, this was not an easy problem for computers to solve as recently as the beginning of the 21st century. Take the simple example of classifying whether an image shows a cat or not. Cats come in various shapes (intra-class variation) and sizes (scale variation), and are often found in front of cluttered indoor scenes (background clutter) or partially hidden (occlusion), among a host of other variations.

Stock image showing variations in cats: appearance, shape, size and pose

Classifying images is just the tip of the iceberg when it comes to computer vision, as the type of an image is often the most basic information we can extract from it. Some other use cases of computer vision are listed below:

  • Object detection: Determining the position and type of an object (or multiple objects) in an image
  • Semantic/Instance segmentation: Labeling all the pixels corresponding to an object type (semantic) or each occurrence of the object (instance)
  • Action recognition: Given a video, understanding the activity being performed in it
  • Pose detection: Identifying the pose of a human subject in an image/video
  • Visual question answering: Answering questions based on information provided by an image
  • Image generation: Generating novel images based on data or criteria provided to a computer vision model
Computer vision applications (clockwise from top left): Object detection, Action recognition, Relationship detection among visual entities, 3D mesh generation, Image generation, Semantic segmentation (Image source: [1])

In the past decade or so there has been tremendous progress in computer vision research and its real-world applications, enabled mostly by the success of deep learning models as well as software libraries which make it easy to implement and deploy these models.

In particular, Convolutional Neural Networks (CNNs) have been extremely successful for computer vision applications. If you are unfamiliar with CNNs, there are many excellent resources on the internet to get started with. I’d recommend going through the course notes for CS231n [1], which build up from the fundamentals and provide excellent visualizations of convolutions, the fundamental mathematical operations used in CNNs.

PyTorch is a Python library released by Facebook for building and training neural networks. PyTorch is particularly helpful for computer vision since it comes paired with the Torchvision library, which provides common CNN architectures, pre-trained models, datasets for easily loading images, as well as many other helpful features to train your computer vision model. Since we will be using the PyTorch framework, I’d recommend familiarizing yourself with the basics of building a neural network as well as a convolutional neural network in PyTorch (see references [3], [4]).

An example: Object Detection using PyTorch

Object detection is the problem of locating the pixels that belong to an object among all the pixels that constitute an image. The object detector returns a bounding box, which is a rectangle surrounding all the object pixels. Object detection is a fundamental problem in computer vision and finds applications in many fields, from robotics to autonomous driving to medical imaging.

Next, I will describe (with code) the steps involved in taking a pre-trained detection model in PyTorch and then finetuning it for your own object detection problem. This tutorial very closely follows, and borrows from, the official PyTorch tutorial [5].

But unlike the official tutorial, which is focused on instance segmentation, I will address the more accessible problem of object detection.

An example of object (pedestrians) detection. Image source: [2] 

For this tutorial, I am going to use images from the Penn-Fudan dataset [2], which you can download from the dataset page. This dataset has just two types of objects: (a) Pedestrians (b) Background objects. Since we are only concerned with pedestrians, I will disregard background objects for the sake of this tutorial. Let’s load an example image and draw its bounding boxes from the dataset for visualization:

from PIL import Image, ImageDraw

# open the example image and draw its two ground-truth pedestrian boxes
source_img = Image.open('PennFudanPed/PNGImages/FudanPed00001.png').convert("RGB")
draw = ImageDraw.Draw(source_img)
draw.rectangle(((160, 182), (302, 431)), outline="red")
draw.rectangle(((420, 171), (535, 486)), outline="red")

In order to facilitate reading in data as well as loading it into neural network models, PyTorch provides two very helpful classes in the torch.utils.data module: Dataset and DataLoader. The Dataset class can be subclassed to read in data from our own dataset. The two main methods of this class are __getitem__() and __len__(). The former tells PyTorch how to load one item from the dataset and the latter tells PyTorch how many data points are in the dataset. I have implemented a PyTorch dataset class for the Penn-Fudan dataset:

import os
import numpy as np
import torch
from PIL import Image


class PennFudanDataset(torch.utils.data.Dataset):

    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        # convert the PIL Image into a numpy array
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)
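
To make the box extraction in __getitem__() more concrete, here is a small dependency-free sketch of the same idea on a toy mask. The function name boxes_from_instance_mask and the toy values are my own, chosen only for illustration; the dataset class above does the equivalent work with NumPy.

```python
def boxes_from_instance_mask(mask):
    """Return {instance_id: [xmin, ymin, xmax, ymax]} for a 2-D mask.

    `mask` is a list of rows; 0 is background and every other value
    identifies one object instance, as in the PedMasks images.
    """
    boxes = {}
    for y, row in enumerate(mask):
        for x, obj_id in enumerate(row):
            if obj_id == 0:
                continue  # background pixel, not part of any instance
            if obj_id not in boxes:
                # first pixel seen for this instance
                boxes[obj_id] = [x, y, x, y]
            else:
                box = boxes[obj_id]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return boxes


# a 5x6 toy mask containing two instances (ids 1 and 2)
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]
boxes = boxes_from_instance_mask(toy_mask)
print(boxes)
```

Each box is stored as [xmin, ymin, xmax, ymax], which is the same ordering the dataset class uses for target["boxes"].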

Next, I implemented some pre-processing steps which transform the input training images in order to give the model more data to train on. In particular, I flip half the training images horizontally during training. The Torchvision library provides the transforms module, which has functions for this as well as several other pre-processing operations.

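# engine, utils and transforms are helper files copied from
# torchvision's references/detection, as in the official tutorial [5]
from engine import train_one_epoch, evaluate
import utils
import transforms as T

def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)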
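
One thing that is easy to miss with detection data: when an image is flipped, its bounding boxes have to be flipped too, otherwise the ground truth no longer lines up with the pedestrians. The sketch below shows the coordinate arithmetic in plain Python. The helper name hflip_box and the image width of 600 pixels are assumptions made for illustration, not values from the dataset.

```python
def hflip_box(box, image_width):
    """Mirror an [xmin, ymin, xmax, ymax] box across the vertical axis."""
    xmin, ymin, xmax, ymax = box
    # the old right edge becomes the new left edge, and vice versa;
    # the vertical coordinates are unchanged by a horizontal flip
    return [image_width - xmax, ymin, image_width - xmin, ymax]


image_width = 600  # hypothetical width, for illustration only
original = [160, 182, 302, 431]
flipped = hflip_box(original, image_width)
print(flipped)

# flipping twice gives back the original box
restored = hflip_box(flipped, image_width)
print(restored)
```

Note that xmin and xmax swap roles after the flip, which keeps xmin smaller than xmax in the result.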

Now that we have our dataset, we split it into train and test sets and create DataLoaders to easily load the data into our object detection model for training and testing:

dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

# split the dataset in train and test set
indices = torch.randperm(len(dataset)).tolist()
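dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)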
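
A detail worth knowing when batching detection data: every image can contain a different number of pedestrians, so the targets in a batch cannot simply be stacked into one tensor the way class labels can. A common workaround, and the one used by the helper utilities in the Torchvision detection references, is a collate function that keeps the batch as tuples. Here is a minimal sketch with placeholder strings standing in for image tensors; the sample values are made up for illustration.

```python
def collate_fn(batch):
    """Turn [(img, target), ...] into ((img, ...), (target, ...))."""
    return tuple(zip(*batch))


# two samples with a different number of boxes each
batch = [
    ("image_a", {"boxes": [[160, 182, 302, 431], [420, 171, 535, 486]]}),
    ("image_b", {"boxes": [[10, 20, 30, 40]]}),
]
images, targets = collate_fn(batch)
print(images)
print([len(t["boxes"]) for t in targets])
```

The model then receives a sequence of images and a matching sequence of target dictionaries, rather than one stacked tensor.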

Now that we are done with processing the dataset, I’ll move on to the actual model used for object detection. I will be using the Faster R-CNN [6] model, which is available in the torchvision.models.detection module.

The Faster R-CNN architecture. Image source: [6]

The exact configuration of Faster R-CNN and its components is beyond the scope of this tutorial (and not necessary for getting started). However, the important thing to note is that unlike a classifier, which has one final output ‘head’, a detector has two output heads: (a) the box regressor, which returns the coordinates of the bounding boxes of the detected objects, and (b) the classifier, which returns the output class of each object. Since we only have two classes in the Penn-Fudan dataset (pedestrian and background), I’ve replaced the pre-trained box predictor (model.roi_heads.box_predictor) with a new one with 2 output classes using the torchvision.models.detection.faster_rcnn.FastRCNNPredictor class.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_object_detection_model(num_classes):
    # load a Faster-RCNN object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

Now that I have my dataset and model ready, I’m going to use some of the training utilities from the Torchvision detection references (train_one_epoch and evaluate, imported above) to train our model for 25 epochs:

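# setup assumed by the loop below; hyperparameters follow the official tutorial [5]
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = get_object_detection_model(num_classes=2).to(device)  # background + pedestrian
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

num_epochs = 25

for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)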
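
The "update the learning rate" step in the loop refers to a learning-rate schedule. A common choice is step decay, where the learning rate is multiplied by a fixed factor every few epochs. The sketch below computes such a schedule in plain Python; the base rate of 0.005, step size of 3 and factor of 0.1 are illustrative assumptions, and step_decay_lr is a helper written for this example rather than a PyTorch function.

```python
def step_decay_lr(base_lr, epoch, step_size, gamma):
    """Learning rate in effect during `epoch` under step decay."""
    # the rate is multiplied by gamma once per completed block of step_size epochs
    return base_lr * gamma ** (epoch // step_size)


# learning rate for the first seven epochs (assumed hyperparameters)
schedule = [step_decay_lr(0.005, e, step_size=3, gamma=0.1) for e in range(7)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

With these values the rate stays constant for three epochs, then drops by a factor of ten, and so on. Over 25 epochs the rate becomes very small, so most of the learning happens early in training.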

Once trained, we can visualize the results of our model by switching it to evaluation mode with PyTorch’s model.eval(). You’ll notice that after just 25 epochs on a small dataset like Penn-Fudan, our model is able to get good object detection results.

# pick one image from the test set
img, _ = dataset_test[0]

# put the model in evaluation mode
model.eval()
with torch.no_grad():
    prediction = model([img.to(device)])

bbox_1 = prediction[0]['boxes'].cpu().numpy()[0]
bbox_2 = prediction[0]['boxes'].cpu().numpy()[1]

# convert the image, which has been rescaled to 0-1 and had the channels flipped
pred_img = Image.fromarray(img.mul(255).permute(1, 2, 0).byte().numpy())

draw = ImageDraw.Draw(pred_img)
draw.rectangle(((bbox_1[0], bbox_1[1]), (bbox_1[2], bbox_1[3])), outline="red")
draw.rectangle(((bbox_2[0], bbox_2[1]), (bbox_2[2], bbox_2[3])), outline="red")
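
The snippet above draws the first two predicted boxes, which assumes the model found at least two pedestrians in the image. Torchvision detection models return, for each image, a dictionary with 'boxes', 'labels' and 'scores', so a more general approach is to keep every detection above a confidence threshold. Here is a minimal sketch using plain lists in place of tensors; the function name filter_detections, the threshold of 0.5 and the example scores are all assumptions made for illustration.

```python
def filter_detections(prediction, score_threshold=0.5):
    """Keep only detections whose score is at least `score_threshold`."""
    kept = {"boxes": [], "labels": [], "scores": []}
    for box, label, score in zip(
        prediction["boxes"], prediction["labels"], prediction["scores"]
    ):
        if score >= score_threshold:
            kept["boxes"].append(box)
            kept["labels"].append(label)
            kept["scores"].append(score)
    return kept


# made-up prediction for one image: two confident boxes and one weak one
example_prediction = {
    "boxes": [[160, 182, 302, 431], [420, 171, 535, 486], [5, 5, 20, 40]],
    "labels": [1, 1, 1],
    "scores": [0.99, 0.97, 0.12],
}
confident = filter_detections(example_prediction, score_threshold=0.5)
print(confident["boxes"])
```

With a real prediction you would first convert the tensors to lists, for example with prediction[0]['boxes'].cpu().tolist(), and then draw one rectangle per remaining box.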

And there you have it folks: now you know how to take your own dataset, finetune an object detection model on it and then perform object detection using PyTorch and Torchvision.

This blog post is written by Shashank Shekhar.


[1] CS231n: Convolutional Neural Networks for Visual Recognition
[2] The Penn-Fudan Database
[3] PyTorch tutorial on Neural Networks
[4] PyTorch tutorial on Training a classifier
[5] Torchvision Object Detection Finetuning Tutorial
[6] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks