Flawed usage of anchor boxes introduces bias in AI
At Eydle, we protect brands from phishing attacks on websites, social media platforms, and app stores. In an earlier post on phishing, we described the emerging attack trends in phishing including impersonation on social media. Online phishing attacks often impersonate the look and feel of brand assets such as logos and images. In order to victimize users, an attacker can clone a website and copy over logos and images to their phishing page. To identify such phishing attacks, we use state-of-the-art AI and deep learning tools to analyze visual elements such as logos, color scheme, fonts and images.
To detect whether a brand logo appears within a set of webpages, we can use object detection methods such as You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), Fast R-CNN, and Region-based Fully Convolutional Network (R-FCN). Here we describe the transfer learning (TL) process for object detection using the YOLOv2 model and the Flickr-47 dataset, which consists of 47 well-known brand logos.
Every time we onboard a new customer, we need to monitor new logos and update our object detection models. This raises two challenges: (1) how do we extend the model so that it retains its accuracy on the existing logos while it learns to detect the new logo, and (2) how do we keep the computational cost of the update under control? To address these challenges, we use transfer learning (TL) of object detection models. Here, we highlight one important but overlooked aspect: the role of anchor boxes in TL. Flawed usage of anchor boxes decreases the accuracy of the trained model and introduces bias when detecting objects in the real world.
Why Transfer Learning
As we onboard a new brand/customer, we have to start monitoring brand assets for that customer. This requires retraining our AI models. One naive approach would be to train the model from scratch by initializing the model weights to random values and using the training data for all the logos, both original and new. However, we want to reduce the amount of training required every time we add a new logo. To achieve that, we use transfer learning, which means transferring what the model has already learned from previous training into the new model, which then only has to learn the features of the new logo. Such transfer of learning has to be accurate and computationally efficient for this process to work in real-world cases.
Transfer learning has emerged as a powerful method to build an AI model on new and/or sparse data by using an existing AI model pre-trained on similar input data types but in another context, for another task, or for another output. For example, an ImageNet model trained to identify cats versus dogs can be used to create artistic filters for photos or to detect counterfeits of expensive brand-name handbags with design patterns. TL for object detection has special challenges, one of which is how to define the anchor boxes.
Importance of Anchor Boxes
In object detection models such as YOLOv2, the shape, size and number of the anchor boxes play a vital role in the correct placement of bounding boxes around the objects detected when the model is applied to an image. These anchor boxes are derived from the shape and size of the bounding boxes used in labeling the training image dataset, using methods such as k-means clustering. The number of clusters is equal to the number of anchors. In the case of transfer learning from a pre-trained model, there already exists a set of anchor boxes that were found by running the clustering algorithm on the bounding boxes in the original training dataset. Those anchor boxes were used during the training of the original model. Now that we have a new set of logo images, should we use the same original anchor boxes, define new anchor boxes based on the bounding boxes in the new images, or find a middle path that combines original and new anchor boxes? Should we blindly re-run the k-means algorithm on the original and new bounding boxes?
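To make the clustering step concrete, here is a minimal pure-Python sketch of k-means over bounding-box (width, height) pairs. It is not our production code. It assumes the distance 1 - IoU that is commonly used for YOLOv2 anchors, and it uses a simple deterministic initialization (boxes evenly spaced by aspect ratio) chosen only to make the example reproducible. The function names and the sample boxes are hypothetical.

```python
def iou_wh(box, anchor):
    """IoU of two (width, height) boxes aligned at a shared corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100):
    """Cluster (width, height) pairs into k anchors using distance 1 - IoU.

    Assumption: initial anchors are k boxes evenly spaced in aspect-ratio
    order, a deterministic stand-in for the usual random initialization.
    """
    ordered = sorted(boxes, key=lambda b: b[0] / b[1])
    n = len(ordered)
    anchors = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each box joins the anchor it overlaps most.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        # Update step: each anchor becomes the mean width/height of its cluster.
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[i])  # keep the anchor of an empty cluster
                continue
            w = sum(b[0] for b in members) / len(members)
            h = sum(b[1] for b in members) / len(members)
            new_anchors.append((w, h))
        if new_anchors == anchors:
            break  # converged
        anchors = new_anchors
    return sorted(anchors)


# Hypothetical labeled boxes: three wide logos and three tall logos.
boxes = [(100, 20), (110, 22), (90, 18), (20, 100), (22, 110), (18, 90)]
print(kmeans_anchors(boxes, k=2))
```

Re-running this on the union of the original and new bounding boxes is the "blind" option mentioned above. The anchors it returns depend on how many boxes each logo contributes, so a small new class can easily be outvoted by the original data.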
The shape and size of the anchor boxes must be representative of the distinct objects in the training and test image datasets. By representative we mean that the shape and size of the anchor boxes are close to the shape and size of the bounding boxes in the labeled training images. By distinct we mean objects that look very different from each other in shape and size. For better detection accuracy in the wild, the anchor box shape and size should also be representative of the objects in the future images that the model will be applied to. This is known as out-of-distribution generalization, and achieving it is the holy grail of deep learning. Another point is that the anchor box definition (number, size and shape) should remain consistent between the model training and testing/detection stages; otherwise the detection results can be sub-optimal. See the next section for more on this.
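One way to put a number on "representative" is the average, over all labeled boxes, of the IoU with the best-matching anchor. The sketch below is illustrative only: the anchors and box sizes are made-up values, not measurements from our datasets, and `average_best_iou` is a hypothetical helper name.

```python
def iou_wh(box, anchor):
    """IoU of two (width, height) boxes aligned at a shared corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def average_best_iou(boxes, anchors):
    """Mean, over all boxes, of the IoU with the best-matching anchor."""
    return sum(max(iou_wh(b, a) for a in anchors) for b in boxes) / len(boxes)


# Hypothetical values: anchors fitted to tall and wide logos only.
original_anchors = [(20, 100), (100, 20)]
original_boxes = [(18, 90), (22, 110), (90, 18), (110, 22)]
# Hypothetical new logo that is roughly square.
new_logo_boxes = [(60, 60), (50, 55)]

fit_original = average_best_iou(original_boxes, original_anchors)
fit_new = average_best_iou(new_logo_boxes, original_anchors)
fit_new_extended = average_best_iou(new_logo_boxes, original_anchors + [(55, 58)])

print(fit_original, fit_new, fit_new_extended)
```

With these made-up numbers, the original anchors fit the original boxes well but fit the square logo poorly. Adding one anchor close to the new shape restores the fit, which is the kind of "middle path" the question above is pointing at.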
Retraining on a New Logo
Our initial model was built from a pre-trained YOLOv2 logo detection model with a network architecture derived from Darknet. We converted the pre-trained model and its weights from an OpenCV format into a TensorFlow format. The pre-trained model had been trained for 10,000 epochs on the Flickr-47 dataset. To add a new logo, we have to retrain the model. We add a new logo class for the brand, which is Chase in this example, and add 20 labeled training images for Chase to the original Flickr-47 training dataset. Adding a new logo class changes the model output size, so we modify the affected model layers to account for the change. We also adjust the anchor box definitions to be inclusive of the newly added logo class.
We freeze all the layers in the model except the last few and train the model for approximately 5,000 epochs. Then we perform fine-tuning by unfreezing all the model layers and training for approximately 200 more epochs. During fine-tuning, we keep the learning rate small enough to avoid large perturbations to the model weights, so that we do not lose what was learned during the original training. An important point to note here is that the number of training epochs for TL plus fine-tuning is much smaller than the number of epochs in the original training, which is often in the range of 10,000–20,000.
In the previous section, we mentioned that the anchor box definition should be consistent between training and testing. Here, we make that statement more precise: the number of anchor boxes must be the same during training and testing because the model output tensor size depends on the number of anchor boxes. The depth of the model output tensor at each grid cell is given by (#C + 5) * #B, where #C is the total number of object classes and #B is the total number of anchor boxes. The size and shape of the anchors can technically change between training and testing, but we recommend keeping them constant so that the Intersection-over-Union (IoU) metric, one of the accuracy metrics for object detection models, is measured and optimized consistently. If we change the anchor definition between training and testing, the detection results can be sub-optimal.
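As a quick worked example of the formula, the sketch below computes the output size before and after adding the Chase class. The number of anchors (5) and the 13 x 13 output grid are assumptions taken from common YOLOv2 configurations with a 416 x 416 input, not values stated for our model. The function names are hypothetical.

```python
def output_depth(num_classes, num_anchors):
    """Per-cell output depth: (#C + 5) * #B.

    The 5 covers the four box coordinates plus one objectness score
    predicted for every anchor.
    """
    return (num_classes + 5) * num_anchors


def output_shape(num_classes, num_anchors, grid=(13, 13)):
    """Full output tensor shape, assuming a 13 x 13 grid (an assumption)."""
    return (grid[0], grid[1], output_depth(num_classes, num_anchors))


NUM_ANCHORS = 5  # assumed; a common YOLOv2 default

print(output_depth(47, NUM_ANCHORS))  # original 47 logo classes
print(output_depth(48, NUM_ANCHORS))  # after adding the Chase class
print(output_shape(48, NUM_ANCHORS))
print(output_depth(48, 10))           # doubling the anchors doubles the depth
```

Under these assumptions the per-cell depth grows from 260 to 265 when one class is added, which is why the final layers must be rebuilt. A model trained with one anchor count cannot be run with another, because the tensor shapes no longer match.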
The first image below shows the results of testing our TL model on a Chase image using the same anchor definitions that we used during training. The second image shows the results on the same image, this time using different anchor definitions while keeping the number of anchors the same in both tests.
One may be tempted to use as many anchor boxes as possible to cover the variety of logo sizes and aspect ratios in the training dataset, but this is not advisable. As the number of anchors increases, the gain in average IoU plateaus, as shown in the following graph. Increasing the number of anchors also makes the model output tensor larger. This makes training computationally expensive and makes it harder to keep the training batch size optimal.
We took a model pre-trained on 47 logos and extended it to detect a new logo (Chase) by adjusting the anchors to be inclusive of the new class. We applied transfer learning and fine-tuning to the original model by training for a relatively small number of epochs compared to the original training. This saves a significant amount of computing resources. In this transfer learning process, we learned about the role of anchor boxes in determining both the logo detection accuracy and the computational cost of building the final model.
Real-world Implications of Anchor Boxes
An important implication of our work is in ethical and explainable AI. Consider, for example, an AI model that is used to count the number of people entering through a gate at a public facility, e.g. a shopping mall or a government building. If it is an object detection model (e.g. YOLOv2 or YOLOv3) trained on video data of people walking or running through the gate, and the anchor box shapes are based on that data, it is likely that the model will incorrectly classify someone in a wheelchair, or people with different body shapes. This is an example of a biased AI model trained on data representing the majority or “normal” cases. To rectify this model and make it more inclusive, one may collect a dataset of those minority cases and attempt to re-train the model. By definition, the minority dataset will be smaller than the original dataset, and training a new model from scratch with a balanced dataset will be very slow and expensive. Transfer learning of the original model, with anchor boxes modified based on the bounding boxes in the minority dataset, as we discussed above, can save the day.