This is the second post in a series on object detection, and it covers another part of the topic.

Earlier, we covered how someone without an algorithms background can leverage the power of object detection in their projects using the Azure API. This time, we look at how an algorithm-minded person can do the same. To that end, we will cover a remarkable model known as YOLOv3, short for You Only Look Once. The architecture is described in this paper, which interested readers should look at. It contains plenty of engineering ideas to ponder.

**Outline:**

- YOLOv3 Architecture
- Using YOLOv3 with reference to COCO Dataset

**Prerequisites:**

- Knowledge about Convolutional Neural Networks
- Knowledge about popular Machine Learning algorithms like Logistic Regression
- Introductory knowledge about Deep Learning and loss function

**Recap**

In the previous blog, we saw how object detection differs from image segmentation: it outputs bounding boxes along with class labels for the objects enclosed within those boxes. We also looked at performance metrics such as IoU.

**YOLOv3 Architecture**

We will look at notable things from YOLOv3 architecture to solidify our understanding of this wonderful algorithm.

**Feature Extractor:**

In transfer learning, we typically remove the last layers of a pretrained model and use the rest as a feature extractor. Architectures built only from convolutional layers, with no fully connected layers, are referred to as fully convolutional networks (FCNs). The architecture used in YOLOv3 is called DarkNet-53. It is also referred to as the backbone network of YOLOv3, and its primary job is feature extraction. It has 53 convolutional layers and no max-pooling. Each convolution is followed by batch normalization and a leaky ReLU. Batch normalization ensures that inputs stay normalized even deep into the network. Earlier versions of YOLO relied on max-pooling for downsampling (and the first version had no batch normalization), and their results were not as good.

Because max-pooling was not working well and something was still needed to downsample the feature maps, some of the convolutions use a stride of 2. In the architecture table, a Size entry such as 3 x 3 / 2 marks one of these: the /2 stands for stride 2. The stride sets how far the kernel moves between successive positions on the pixel grid. With a stride of two, the kernel jumps two pixels at a time. In the adjoining picture, the input is a 5 x 5 matrix and the output is a 2 x 2 matrix, obtained with a 3 x 3 kernel and a stride-2 convolution.
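The 5 x 5 to 2 x 2 example follows from the usual output-size formula for a convolution. Here is a minimal sketch of that arithmetic; the function name and the padding of 1 in the second call are assumptions made for illustration, not something stated in the post.

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Spatial size of the output of a convolution over an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


# 5 x 5 input, 3 x 3 kernel, stride 2, no padding
print(conv_output_size(5, kernel=3, stride=2))

# 256 x 256 input, 3 x 3 kernel, stride 2, assumed padding of 1
print(conv_output_size(256, kernel=3, stride=2, padding=1))
```

The first call gives 2 and the second gives 128, which is the halving behaviour of a stride-2 convolution.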

A stride-2 convolution halves the input, i.e. if input_size = 256 x 256, then output_size = 128 x 128. So /2 marks a stride-2 convolution that halves the image size. Looking at the whole architecture, the convolutional layers are grouped into blocks with residual connections (an idea that comes from ResNets): the input of a block is added back to its output. In a very deep neural network, these residual or skip connections help gradients flow and so make the network easier to train. The multipliers next to the blocks (1x, 2x and so on) signify how many times that particular block is repeated in the actual architecture. Because of these repeated blocks, the total number of convolutional layers in the architecture comes out to be 53. Earlier, in YOLOv2, the authors used DarkNet-19, which did not perform as well as desired.

At the end, we have an average-pooling layer, a fully connected layer and a softmax layer. The authors pre-trained this whole model on the ImageNet dataset so that all the weights are well tuned for object recognition. ImageNet is used here only for classification: the network learns to say which object is in the image, not to create bounding boxes. The boxes come from additional layers added on top, which we cover later.

DarkNet-19 was used in YOLOv2. How does DarkNet-19 compare to ResNets on the ImageNet dataset, and why use DarkNet-53 rather than a ResNet? Take a look.

Here,

**Top-1 and Top-5:** accuracy when the correct label is the model's top prediction, or among its top 5 predictions.

**Bn Ops:** billions of operations required at predict time.

**BFlops:** billions of floating-point operations performed per second.

**FPS:** frames per second.

The first two columns measure accuracy, while the remaining three relate to speed. We want higher accuracy and faster speeds. The fewer billions of operations (BnOps) a network needs, the faster it can run. The higher the BFlops and FPS, the better. Now, looking at the results:

- DarkNet-19 needs few BnOps, which gives it a high FPS, but its accuracy is lower than that of the ResNets.
- ResNet-152 has good accuracy but needs more BnOps and achieves lower BFlops and FPS.

The authors wanted DarkNet-53 to be as accurate as ResNet-152 but faster. The results show how well it is crafted: it gives comparable accuracy with fewer BnOps and higher BFlops and FPS. It is slower than DarkNet-19, the YOLOv2 backbone, but considerably more accurate. It is always a speed versus accuracy trade-off.

Keep in mind that this is pretraining on ImageNet and not an object detection task. Let’s use 416 x 416 x 3 images for the discussion that follows. Once pretraining is complete, we remove the average-pool, fully connected and softmax layers, since the main job of this backbone network is feature extraction. Now, how do you compute the output size just before these layers? Without strided convolutions, the image size would not change. With stride-2 convolutions, the image size is halved each time one appears. Looking at the architecture, stride-2 convolutions appear 5 times. Hence, by the last layer the image is 1/32 of its input size: a 416 x 416 x 3 input is downsampled to 13 x 13 x 1024. 416 / 32 gives 13, but how did 3 change to 1024? The last convolutional layer has 1024 filters, and the depth of a layer's output equals its number of filters, so the final output size is 13 x 13 x 1024. The number of filters is increased as you go deeper in order to increase the depth of the tensor.
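To make the halving concrete, here is a small sketch that walks a 416 x 416 x 3 input through five stride-2 stages. The per-stage filter counts (64, 128, 256, 512, 1024) are an assumption taken from the usual DarkNet-53 layout, and the function name is made up for this example.

```python
def backbone_shapes(height, width, stage_filters=(64, 128, 256, 512, 1024)):
    """Return the (height, width, channels) after each stride-2 stage."""
    shapes = []
    for filters in stage_filters:
        # Each stride-2 convolution halves height and width;
        # the number of filters sets the new depth.
        height, width = height // 2, width // 2
        shapes.append((height, width, filters))
    return shapes


for shape in backbone_shapes(416, 416):
    print(shape)
```

The last shape printed is (13, 13, 1024), and the two shapes before it are the 52 x 52 and 26 x 26 maps that become useful for multi-scale prediction.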

**Bounding Box and Output:**

How can we build bounding boxes over objects in our images? Let’s try to answer that. Imagine this is my image and this is the bounding box that I want to detect.

Let’s break the problem down. What I have is the output from the feature extractor, i.e. 13 x 13 x 1024. How do I convert it to bounding boxes and confidence values? I first need to know how to represent my output, because once that is settled I can decide which operations to perform to get there from my input. As the output from the feature extractor is 13 x 13 x 1024, I break my image into a grid of 13 x 13 cells. We use 13 because that is the structure the backbone gives us: the top-left cell of the grid is computed from the top-left pixels of the image, and so on. Our output will therefore be 13 x 13 x something we don’t know yet. This something is the “depth”, in which I want to store box-level information for each cell.

(**Note:** Keep in mind that we are working with cells, not pixels: for a 416 x 416 pixel image we have created 13 x 13 cells. Working per pixel would be much harder computationally.)

Now, for each cell, how do I represent a bounding box? Using a bunch of coordinates and probability scores. One way is to use the center of the box, with coordinates $b_x$ and $b_y$, together with the height and width of the box, $b_h$ and $b_w$. If you give me these values, I can tell you where the bounding box is. Along with them, I need a probability score $p_o$, called the objectness score, signifying whether there is an object in this bounding box or not. Then we have class probabilities $p_1, \dots, p_c$ (where c = number of classes; for the COCO dataset c = 80), where $p_i$ is the probability that an object of class $i$ is in this bounding box. How do we compute it? As the product $p_i = p_o \cdot P(\text{class}_i \mid \text{object})$, where $i$ can be any class from 1 to 80. Here $P(\text{class}_i \mid \text{object})$ denotes the probability that the object is of class $i$, given that there is an object in this bounding box.

Hence, each bounding box can be represented using 85 values (for the COCO dataset). The first 5 values are $b_x$, $b_y$, $b_h$, $b_w$ and $p_o$ (measuring whether there is an object or not). Then come the class probabilities for the 80 classes.

**Note: **Replace in the diagram with

When I multiply the objectness score $p_o$ with each of the conditional class probabilities $P(\text{class}_i \mid \text{object})$, I am left with a bunch of scores, 80 of them to be precise. Each product is the probability of finding an object of class $i$ in this bounding box. The maximum score tells us which class $i$ is most likely to be in the box, and with what probability. Hence, for one box, we require 85 values.

Now, is one box sufficient? For each cell, taken as the center, there can be multiple bounding boxes with different heights and widths, in order to capture different objects. Recall the image we used in the first blog post: the handbag and the chair could share the same center pixel.

So, the same cell could be the center of multiple objects, and I should be able to draw multiple bounding boxes for it. If the number of boxes per cell is B, then the number of values we require per cell is 85 * B. Now, imagine a center cell around which there could be at most 3 boxes (as used in YOLOv3) for 3 objects. Given that the input obtained from the feature extractor is 13 x 13 x 1024, the output will be:

13 x 13 x (B * (5 + 80)) where B = 3, i.e. 13 x 13 x 255. If instead each cell had 5 bounding boxes (as used in YOLOv2; the more boxes, the more time to process), and each box is represented by 85 numbers, then the output will be:

13 x 13 x 1024 -> 13 x 13 x (5 * 85) = 13 x 13 x 425
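The depth arithmetic can be checked in a couple of lines. This is only a sketch of the counting described above; the function name is made up for this example.

```python
def head_depth(boxes_per_cell, num_classes):
    """Values stored per grid cell: each box has 4 coordinates,
    1 objectness score and one probability per class."""
    return boxes_per_cell * (4 + 1 + num_classes)


print(head_depth(3, 80))  # YOLOv3 setting: 3 boxes per cell, COCO classes
print(head_depth(5, 80))  # YOLOv2 setting: 5 boxes per cell, COCO classes
```

These give 255 and 425, matching the 13 x 13 x 255 and 13 x 13 x 425 outputs used in this post.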

Now, how do I change a tensor of 13 x 13 x 1024 to a tensor of 13 x 13 x 425? The third dimension is the depth, so in effect we want to change the depth. There is an interesting idea, used in Inception networks, of using 1 x 1 convolutions to do so: 425 filters of size 1 x 1 take the depth from 1024 to 425. This is the output we wanted from the very start. It has all the class probabilities and everything else. But not everything is over yet; there are other pieces as well. If you visualize this output, this is the intermediate result you would obtain:
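A 1 x 1 convolution is the same linear map applied independently at every grid cell, so it can be sketched as a matrix product over the depth axis. The snippet below uses NumPy with random values purely to show the shapes; the weights, the bias and the function name are placeholders, not anything taken from a trained YOLO model.

```python
import numpy as np


def conv1x1(features, weights, bias):
    """Apply a 1 x 1 convolution.

    features: (H, W, C_in), weights: (C_in, C_out), bias: (C_out,)
    Every cell's C_in-vector is mapped to a C_out-vector.
    """
    return features @ weights + bias


rng = np.random.default_rng(0)
features = rng.standard_normal((13, 13, 1024))     # backbone output
weights = rng.standard_normal((1024, 425)) * 0.01  # 425 filters of size 1 x 1
bias = np.zeros(425)

output = conv1x1(features, weights, bias)
print(output.shape)
```

The spatial grid stays 13 x 13 and only the depth changes, which is exactly what is needed here.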

You can see an image that is broken up into cells forming a grid over it. Suppose the orange cells represent a car. For a center cell, say the one in the middle of the car, there is a bounding box around it. The cell is represented by a 425-dimensional vector, which holds 5 boxes of 85 values each. Each box has its own class probabilities, and we color the box with the color of the class whose probability is highest. For this cell, there is a box whose highest class probability is ‘car’. Likewise, for another cell, you take all the bounding boxes centered on it and ask for the most probable class label, and it says traffic light.

Now, from 13 x 13 x 425 we have to design a loss function, from which we will back-propagate all the way into our pretrained model. The whole objective of deep learning is to build differentiable algorithms that are trained by backpropagation. There is no direct jump from 13 x 13 x 1024 to 13 x 13 x 425: there are a few more layers in between, apart from the 425 1 x 1 convolutions, which I’ll explain in a little while.

We had our input image of size 416 x 416 and received an output of 13 x 13 x 1024. If you look at one cell of it, there is a 1024-dimensional vector corresponding to it. But where do these values come from? We got here from the input image through multiple strided convolutions and residual connections, so are all these 1024 values dependent only on the first 32 x 32 x 3 (RGB) grid of pixels? Why 32 x 32? Because 416 / 32 = 13, each cell lines up with a 32 x 32 patch of the input. The region of the input that influences a value is known as its receptive field. (Strictly speaking, the stacked 3 x 3 convolutions make the receptive field larger than the 32 x 32 patch, but the patch is where the cell is anchored.) In the previous diagram, the whole image is 416 x 416 and each cell is 32 x 32 pixels, giving 13 x 13 such cells. Now imagine an object that spans multiple 32 x 32 cells. If these 1024 values correspond to one 32 x 32 patch, how do I take into account information from other parts of the image, covered by other cells?

**Bounding Box Representation:**

We represent it by taking the central value, and height and width.

Let’s say the blue box is our bounding box. In the first iterations of YOLO, the idea was to predict the center, height and width of the box ($b_x$, $b_y$, $b_h$ and $b_w$) directly. But it later turned out that directly predicting the center, height and width of the box was pretty hard. Several approaches were tried that did not work, but one did: anchor boxes. We’ll see how anchor boxes are used as box coordinates and how they are derived.

Imagine someone gives me an image of size 416 x 416, and let’s say I have 5 anchor boxes, i.e. 5 predefined boxes. Instead of predicting $b_x$, $b_y$, $b_w$ and $b_h$ directly, the authors encode the bounding box as a small modification of an anchor box. Anchor boxes are sometimes referred to as priors in the paper. An anchor box is described by the offset of its grid cell, $c_x$ and $c_y$, and by its width and height, $p_w$ and $p_h$. The bounding box is represented using the anchor box nearest to it.

So we express $[b_x, b_y, b_w, b_h]$ in terms of $[c_x, c_y, p_w, p_h]$ with some changes, as this bounding box is smaller than the anchor box. That is all represented using 4 very simple equations:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$

My bounding box center is the anchor's cell position plus some other value. That other value is $\sigma(t_x)$, where $t_x$ is what the network actually outputs as the box coordinate. Similarly, the width is $p_w$ times $e^{t_w}$. Take a deep breath and read it now:

- final box coordinates: $[b_x, b_y, b_w, b_h]$
- anchor box coordinates: $[c_x, c_y, p_w, p_h]$
- what we’ll learn as part of our bounding box representation: $[t_x, t_y, t_w, t_h]$

All of them are connected by the above 4 equations. In a nutshell, the bounding box is a small change to the anchor box. As directly predicting box coordinates wasn’t working well, the authors predefine a bunch of anchor boxes and represent any bounding box in terms of the anchor box position, width and height, plus the values being learned. These learned values, $[t_x, t_y, t_w, t_h]$, are not the actual physical coordinates of the bounding box. They are learnt by the model; how it learns them is explained a little later.
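The decoding step can be sketched in a few lines. This assumes the standard YOLO parameterization, $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$ and $b_h = p_h e^{t_h}$, with everything measured in grid-cell units; the function names and the sample numbers are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor):
    """Turn learned values (tx, ty, tw, th) into a box (bx, by, bw, bh).

    cell   = (cx, cy): offset of the grid cell that predicts the box
    anchor = (pw, ph): width and height of the anchor box (prior)
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx   # center stays inside the predicting cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)  # anchor width scaled by a positive factor
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# All-zero outputs give the anchor itself, centered in cell (6, 6).
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(3.0, 2.0)))
```

With all-zero network outputs the box is exactly the anchor, placed at the middle of its cell, which is why the learned values can be read as small corrections to the prior.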

How do I come up with these 5 anchor boxes? How do I design them? The authors took all the bounding boxes in the COCO training data and clustered them using k-means clustering with k = 5. That gave them 5 anchor boxes. What information do these anchor boxes carry? They are the 5 box shapes that are most typical of the objects in our training data. Hence, you can think of them as a prior for your bounding boxes. They tell you good enough regions to look at, because most of the bounding boxes in your training data resemble them. Encoding other bounding boxes relative to them leaves less room for error. Using anchor boxes, performance was seen to improve by 3-4%.
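Below is a minimal sketch of deriving anchors by clustering box widths and heights. It assumes that boxes are compared by IoU (as if they shared a corner) rather than by Euclidean distance, and it uses a plain mean update; the function names and the toy data are made up, so treat it as an illustration of the idea rather than the authors' exact procedure.

```python
import numpy as np


def iou_wh(boxes, anchors):
    """IoU between (w, h) boxes and (w, h) anchors, aligned at one corner."""
    inter_w = np.minimum(boxes[:, None, 0], anchors[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    inter = inter_w * inter_h
    area_boxes = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_anchors = (anchors[:, 0] * anchors[:, 1])[None, :]
    return inter / (area_boxes + area_anchors - inter)


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors, sorted by area."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each box to the anchor it overlaps most.
        assignment = np.argmax(iou_wh(boxes, anchors), axis=1)
        for j in range(k):
            members = boxes[assignment == j]
            if len(members) > 0:  # leave an empty cluster where it is
                anchors[j] = members.mean(axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]


# Toy data: five small boxes and five large ones.
toy_boxes = [[10, 10]] * 5 + [[100, 200]] * 5
print(kmeans_anchors(toy_boxes, k=2))
```

On real training data you would pass every ground-truth box's width and height and set k to the number of anchors you want.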

Also, we apply the sigmoid function $\sigma$ to the learned center values $t_x$ and $t_y$ because it is well studied and differentiable, and because $t_x$ and $t_y$ are arbitrary real numbers that it pushes into the range 0 to 1. We are trying to get the centers of our bounding boxes, and we want our anchor boxes to be as close to our bounding boxes as possible. Hence, by adding a value between 0 and 1 to the cell offsets $c_x$ and $c_y$, the center is off by only a small margin, at most one grid cell.

**Per-Class Sigmoids:**

We have seen how the box coordinates themselves work.

Let’s move on to the objectness score. It checks whether there is an object at all in the box described by the box coordinates, i.e. it estimates the probability of finding an object in this box. This is a simple binary classification task, so for each box's objectness score I have a logistic regression model. With the COCO dataset, I have 80 classes. The probability that this box contains an object of a particular class is obtained by multiplying the objectness score with the probability of that class.

We have a logistic regression model for the objectness score, but for the 80 classes the obvious choice is a softmax classifier, as it is a multi-class classifier. It is basically an extension of logistic regression to a multi-class setting. It was tried in YOLOv2, but it did not work very well, and digging into why turned up something interesting. Imagine that class 10 is a person and class 13 is a woman. If a lady is standing in an image and I create a bounding box around her, would I call it a person or a woman? With a softmax classifier, I can only pick one of the two categories, not both. Softmax basically says which class has the highest probability for the object. But among your classes, there can be a few that are hierarchically related. What do I mean by that? A woman is also a person. So the object within the bounding box can belong to both classes, and both should receive a high probability. Softmax does not represent this well, as it is fundamentally designed to single out one class: the probabilities have to sum to 1. In YOLOv3 they said, let’s build one logistic regression for each of the classes.

Now, for one bounding box, you have 4 box coordinates, one logistic regression for the objectness score and 80 logistic regressions, one for each class. For the box coordinates we use squared loss, because we want these numbers to be as accurate as possible. This very interesting idea is called per-class sigmoids. Having one sigmoid per class is better than a softmax classifier because some boxes need to be labeled with multiple classes.
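The difference between the two choices shows up as soon as two classes both fit the box. The sketch below compares softmax with independent sigmoids on made-up scores for three hypothetical classes (person, woman, car); the numbers are illustrative only.

```python
import numpy as np


def softmax(logits):
    exp = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp / exp.sum()


def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))


# Hypothetical raw scores for the classes [person, woman, car].
logits = np.array([4.0, 4.0, -3.0])

print(softmax(logits))  # person and woman have to split the probability mass
print(sigmoid(logits))  # person and woman can both be close to 1
```

With softmax, neither of the two fitting classes can go much above 0.5 because the outputs must sum to 1. With one sigmoid per class, each is judged on its own.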

**Loss Function:**

So, we had an input image of size 416 x 416 x 3. We applied the convolutions from DarkNet and obtained features of size 13 x 13 x 1024. Afterward, we applied some more convolutions (explained in a little while) and obtained 13 x 13 x 255 (for B = 3; a larger B means a more complex model, more kernels and more time). This can be interpreted as 13 x 13 cells, any of which could be the center of a bounding box, with at most 3 bounding boxes per cell. So, in total, I have 13 x 13 x 3 = 507 possible bounding boxes. That is quite a large number, of which only a few will actually contain an object. For any of these bounding boxes, if the objectness score $p_o$ is very small, we can ignore it.

This is what your model predicts. There must also be some ground truth (an array with the same representation as the predicted boxes). Now, you have to design your loss function so that it compares the boxes predicted by the model with the actual boxes. If the loss is differentiable, I can backpropagate the error through the convolutions over the entire network, including the pre-trained DarkNet. (**Note:** Remember that the DarkNet weights are ImageNet-trained weights and are used only for initialization.) With a sensible loss and good representations, I can backpropagate and let optimization converge toward the ground truth.

The loss has two broad summations. The first is over all bounding boxes that contain an object in the ground truth. We are predicting $[b_x, b_y, b_w, b_h]$. These are numerical values, so predicting them can be thought of as a regression problem, and a squared loss works just fine. The authors initially applied it to all 4 values directly, but later found that using the square roots of $b_w$ and $b_h$ is better, because $b_x$ and $b_y$ are mere coordinates while $b_w$ and $b_h$ describe the size of the box, and the same error matters more for a small box than for a large one.

They used log loss on the objectness score $p_o$, because we are using logistic regression for a binary classification task. Likewise, for each class probability, we use log loss, as there is one logistic regression per class in the dataset.

Now, there will be some bounding boxes that do not contain any object. These no-object boxes are the ones where my prediction says there is a box with some object, but the ground truth has no such box. What do I do? I care neither about the box coordinates nor about the class probabilities. I only care about the value of $p_o$, which must be very small, close to 0. Hence, for all the boxes that contain no object in the ground truth, I am just minimizing $p_o$. This is the second summation.

There are two interesting things to ponder here: the weights $\lambda_{coord}$ and $\lambda_{noobj}$. Our box coordinates are real values, while $p_o$ can only be between 0 and 1, as it is a probability score. In many machine learning models (logistic regression, SVMs), the objective has a loss plus a regularizer multiplied by $\lambda$, whose job is to balance minimizing the loss against regularizing the model. Because the scales of the terms here differ, it is sensible to weigh them differently. Initially, the authors didn’t do this and weren’t getting good performance, but later found that a weighted loss does the trick. These values are hyperparameters, and the best values they found were $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$. Without the weights, one summation might dominate the other.
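The pieces described in this section can be put together for a single predicted box as follows. This is a simplified sketch of the description in this post, not the authors' exact loss: the weights of 5 and 0.5 are assumed values, the function names are made up, and a real implementation sums over every box in every cell of a batch.

```python
import numpy as np

LAMBDA_COORD = 5.0  # assumed weight on the coordinate term
LAMBDA_NOOBJ = 0.5  # assumed weight on boxes with no object


def log_loss(p, y, eps=1e-7):
    """Binary cross-entropy for probability p against target y (0 or 1)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def box_loss(pred_box, pred_obj, pred_cls, true_box=None, true_cls=None):
    """Loss for one predicted box; boxes are (x, y, w, h).

    If true_box is None, the ground truth has no object here and only the
    objectness score is penalized.
    """
    if true_box is None:
        return LAMBDA_NOOBJ * log_loss(pred_obj, 0.0)
    pred_box = np.asarray(pred_box, dtype=float)
    true_box = np.asarray(true_box, dtype=float)
    center = np.sum((pred_box[:2] - true_box[:2]) ** 2)
    # Square roots of width and height, as described in the text.
    size = np.sum((np.sqrt(pred_box[2:]) - np.sqrt(true_box[2:])) ** 2)
    objectness = log_loss(pred_obj, 1.0)
    classes = np.sum(log_loss(np.asarray(pred_cls, dtype=float),
                              np.asarray(true_cls, dtype=float)))
    return LAMBDA_COORD * (center + size) + objectness + classes


# A box the ground truth says is empty: only its objectness is penalized.
print(box_loss([0, 0, 1, 1], pred_obj=0.5, pred_cls=[0.5, 0.5]))

# A slightly misplaced box around a real object of class 0.
print(box_loss([1.1, 1.0, 4.0, 4.0], 0.9, [0.8, 0.1],
               true_box=[1.0, 1.0, 4.0, 4.0], true_cls=[1.0, 0.0]))
```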

**Multi-scale Predictions:**

What was the output of DarkNet? 13 x 13 x 1024, right? That means I have a 13 x 13 cell grid, in which each cell can be the center of a bounding box. Since each cell covers 32 x 32 pixels, this representation is mostly suited to detecting large objects.

Imagine an object of really small size, like 16 x 16 pixels. We will struggle to recognize it, because the cell that holds its center is itself 32 x 32 pixels. So the idea of multi-scale prediction says: instead of predicting with only one grid size, 13 x 13, why don’t we also predict using other grid sizes? Before the last stride-2 convolution was applied, the size was 26 x 26 x 512. Likewise, before the second-to-last stride-2 convolution was applied, the size was 52 x 52 x 256. With these grid sizes, each cell covers fewer pixels, so we can detect smaller objects.

Hence, to detect smaller objects, not only do I want to predict bounding boxes with a grid of size 13 x 13, but I also want other grid sizes like 26 x 26 and 52 x 52.

Now, there is a whole network called Feature Pyramid Networks (FPNs).

The image pyramid is a concept from image processing that is some 30-40 years old. Imagine I have an image and I subsample it. If I apply multiple stride-2 convolutions to my image, its size is halved every single time, creating something like a pyramid when the results are stacked one over another. This concept is used in FPNs.

One idea is as follows. The authors of YOLOv3 said that for each grid size, we will have a convolutional network to produce a bunch of bounding boxes: 13 x 13 -> NetworkNo1, 26 x 26 -> NetworkNo2, 52 x 52 -> NetworkNo3. But there is a problem here. Whatever is learned at one level is not reused by the level below it to predict bounding boxes. I am not borrowing information from the previous cell granularity. If I could do so, I would be able to leverage information at multiple scales. That’s where the FPN comes into the picture.

It’s a very general concept, but in our architecture the pyramid is your whole DarkNet, because it uses stride-2 convolutions for downsampling. After applying a bunch of convolutions, I get a 13 x 13 x 255 grid, from which I predict bounding boxes. To predict boxes at the next level, I use information from that same level (with convolutions, of course), but I also borrow information from the level above it before making predictions. If I can borrow information from the smaller grid sizes to predict boxes at the larger grid sizes, I retain much of the information rather than throwing it away. This architectural beauty is sometimes referred to as FPNs with lateral connections.

**Combining Boxes from Various Scales:**

Let’s do a very quick computation. Imagine my image is of size 416 x 416, and I am computing 3 bounding boxes per cell on the 52 x 52 grid, the 26 x 26 grid and the 13 x 13 grid (each cell can be the center of a bounding box, and I can have at most 3 bounding boxes per cell). Then the total number of bounding boxes is 3 x (52 x 52 + 26 x 26 + 13 x 13) = 10,647. Again, this is the maximum number of bounding boxes possible across all scales, and for many of them the objectness score $p_o$ will be very small, close to 0.
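The count is quick to verify. A small sketch of the arithmetic, with a made-up function name:

```python
def total_boxes(grid_sizes=(13, 26, 52), boxes_per_cell=3):
    """Maximum number of predicted boxes across all grid scales."""
    return sum(boxes_per_cell * size * size for size in grid_sizes)


print(total_boxes())  # 3 * (169 + 676 + 2704)
```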

Now, looking at the diagram:

The 32 at the top represents the batch size used when sending batches of inputs. Everything from the first convolutional layer to the last residual block is DarkNet-53, after which you are left with 13 x 13 x 1024. I told you earlier that to get from 13 x 13 x 1024 to 13 x 13 x 255, I can use 255 1 x 1 convolutions. They used something similar in YOLO v1, but then realized that adding more convolutions gives more power. If you look at the diagram, we are using 3 x (512 x 1 x 1 + 1024 x 3 x 3) convolutions, i.e. a pair of layers repeated 3 times: 512 kernels of size 1 x 1 followed by 1024 kernels of size 3 x 3. Why couldn’t they all be 1 x 1 convolutions? With a 13 x 13 cell grid, the 3 x 3 kernels bring in information from the surrounding cells when predicting the bounding box at a given cell. This is followed by 255 kernels of size 1 x 1 to get 13 x 13 x 255.

This gets me my bounding boxes on the 13 x 13 cell grid. Using the idea of FPNs, we do the same on the 26 x 26 and 52 x 52 cell grids in order to detect objects at varying scales. To predict bounding boxes on the 26 x 26 grid, we take the 26 x 26 x 512 output of the residual block that comes just before the last stride-2 convolution. I also want to borrow information from the 13 x 13 level, as FPNs suggest. For this, we take the 13 x 13 x 1024 output from DarkNet-53 and pass it through 256 1 x 1 convolutions, which makes it 13 x 13 x 256. Thereafter we upsample. What does that do? It is the opposite of downsampling: it turns 13 x 13 x 256 into 26 x 26 x 256. For multi-scale prediction, I also have the information from the earlier residual block, i.e. 26 x 26 x 512. If I concatenate both of them, what do I get? I get 26 x 26 x 768.
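The shape bookkeeping of the upsample-and-concatenate step can be sketched with NumPy. Nearest-neighbour upsampling (each value repeated 2 x 2 times) is an assumption here, and the arrays are zero-filled placeholders standing in for real feature maps.

```python
import numpy as np


def upsample2x(features):
    """Nearest-neighbour upsampling: repeat every cell twice along height and width."""
    return features.repeat(2, axis=0).repeat(2, axis=1)


coarse = np.zeros((13, 13, 256))  # 13 x 13 x 1024 after 256 1 x 1 convolutions
fine = np.zeros((26, 26, 512))    # output of the earlier residual block

merged = np.concatenate([upsample2x(coarse), fine], axis=-1)
print(merged.shape)
```

The merged tensor has 256 + 512 = 768 channels on the 26 x 26 grid, which is then fed to further convolutions to predict boxes at that scale.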

Later, we repeat the same procedure, as in the 26 x 26 case, to predict bounding boxes on the 52 x 52 grid.

**Filtering:**

For each box, we have the objectness score $p_o$ and the class probability values. For each box, we multiply $p_o$ with each of the class probabilities, which results in 80 scores. Take the maximum over these scores, and if it comes out to be less than 0.5, just ignore the box. It is useless. We saw earlier how we got around 10k+ boxes, but we don’t want all of them: we want the useful ones and wish to throw away the rest.
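A vectorized sketch of this filtering step is shown below. The 0.5 threshold comes from the text; the function name, the array layout and the toy numbers are assumptions for illustration.

```python
import numpy as np


def filter_boxes(objectness, class_probs, threshold=0.5):
    """Keep boxes whose best class score reaches the threshold.

    objectness:  (N,) objectness score per box
    class_probs: (N, C) class probabilities per box
    Returns the indices of kept boxes, their best class and its score.
    """
    scores = objectness[:, None] * class_probs  # (N, C) class scores
    best_class = scores.argmax(axis=1)
    best_score = scores.max(axis=1)
    keep = best_score >= threshold
    return np.flatnonzero(keep), best_class[keep], best_score[keep]


objectness = np.array([0.9, 0.2])
class_probs = np.array([[0.1, 0.8, 0.1],
                        [0.9, 0.05, 0.05]])
print(filter_boxes(objectness, class_probs))
```

Here the first box survives with class 1, while the second is dropped because its low objectness drags every class score below the threshold.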

**Non-Max Suppression:**

Once I have all my boxes, I may end up with a few highly overlapping boxes for the same object. Which one should I keep, and which should I reject? I start with the box that has the highest score and compute its IoU, i.e. intersection over union, with the other boxes. A box that overlaps it heavily captures much the same information, so any box whose IoU with it exceeds a threshold is suppressed. I then repeat with the highest-scoring box that remains. Only the locally maximal boxes survive and all the others are suppressed, hence the name non-max suppression. Doing this, instead of getting 3 boxes, I end up with one clean box, which is most likely the best bounding box for my object.
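Here is a minimal sketch of greedy non-max suppression. It assumes boxes are given as corner coordinates (x1, y1, x2, y2) and uses an IoU threshold of 0.5; both are choices made for this example rather than values fixed by the post.

```python
import numpy as np


def iou(box, others):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, best score first."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept one too much.
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 11, 11],
                  [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))
```

The first two boxes overlap heavily, so only the higher-scoring one is kept, while the third box is far away and survives. In practice this is usually run separately for each class.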

In the next blog, we will look at the Keras implementation of YOLOv3 for better understanding and see how our learning translates into code. Also, we will look at the C code implementation provided by the authors in order to know how to use the model from scratch and in pretrained settings. Until then, happy learning!