# Object Detection ______ ## RetiaNet - Object detection model `RetinaNet` has been formed by making two improvements over existing single stage object detection models - Feature Pyramid Networks and Focal Loss. ![image](../assets/retianet.png) **Feature Pyramid Network** - Feature image pyramids used to detect objects with varying scales in an image. - In feature image pyramids, we take an input image and subsample it into lower resolution and smaller size images - With the advancements of deep learning, we now use the pyramidal hierarchical structure with CNNs. - In a CNN architecture, the output size of feature maps decreases after each successive block of convolutional operations, and forms a pyramidal structure. **Focal Loss** - Focal Loss is an enhancement over Cross-Entropy Loss and is introduced to handle the class imbalance problem with single-stage object detection models. - Single Stage models suffer from a extreme foreground-background class imbalance problem due to dense sampling of anchor boxes - Focal Loss reduces the loss contribution from easy examples and increases the importance of correcting misclassified examples. **Advantages** - Highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN - Surpassing the accuracy of all existing state-of-the-art two-stage detector ## UNETs **UNet** is a fully convolutional network (FCN) used for image segmentation. The goal is to predict each pixel's class in an image. **Architecture** - Three Main Components): - Encoder or Downsampling Path - Bottleneck - Decoder or Upsampling Path **Downsampling Path:** - Consists of two convolution layers each followed by a ReLU activation function and a 2x2 max pooling operation for `downsampling`. - At each `downsampling` step we double the number of feature channels **Bottleneck:** - part of the network is between the contracting and expanding paths. The bottleneck is built from 2 convolutional layers (with batch normalization) and with dropout. **Upsampling Path:** - Every step in the decoder path consists of an `upsampling` of the feature map followed by a 2x2 convolution, a concatenation with the corresponding feature map from the `downsampling` path, and two convolutions layers, each followed by a ReLU. **Final Layer:** - A 1x1 convolution is used to map each feature vector to the desired number of classes. **Loss function:** - The energy function is computed by a pixel-wise softmax over the final feature map and then applied cross-entropy loss function. ### Metrics - `IoU (intersection of union)` -> Area overlap / Area of Union **Notes** - Means converting a high resolution image to a low resolution image. By down sampling, the model better understands `What` is present in the image, but it loses the information of `Where` it is present. **Advantages** - UNet combines the location information from the downsampling path to finally obtain a general information combining localisation and context, which is necessary to predict a good segmentation map.