Fixing Cardinality Errors In DETR Models
Understanding the Cardinality Error Issue in DETR Variants
DETR (DEtection TRansformer) and its variants have transformed object detection, but a persistent issue affects their training logs: the cardinality error reported during training is incorrect for DETR-derived models that lack an explicit background class. The problem stems from how these models handle the absence of objects in an image. The original DETR implementation adds an extra class embedding to model the background, so the classification head outputs num_classes + 1 logits and the last index means "no object". Later variants instead use a sigmoid/focal loss, so the head outputs only num_classes logits and the background class disappears entirely. The cardinality error, however, is computed by counting how many queries are predicted as the last class index (classes[-1]) and comparing that count to the number of ground-truth boxes. When the last index is a real class rather than "no object", this count is meaningless, making the cardinality error an invalid metric. The flawed calculation produces misleading values during training, which matters because it undermines users' ability to understand the model's performance and to trust the training logs.
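For context, the original computation can be sketched as follows (a simplified NumPy adaptation of DETR's argmax-based cardinality term; the function name and shapes are illustrative):

```python
import numpy as np

def cardinality_error(pred_logits, num_target_boxes):
    """L1 distance between predicted and ground-truth object counts.

    pred_logits: (num_queries, num_classes + 1) -- the ORIGINAL DETR layout,
    where the LAST index is the explicit "no-object" (background) class.
    """
    # A query counts as a detected object when its argmax is NOT the last index
    card_pred = int(np.sum(pred_logits.argmax(-1) != pred_logits.shape[-1] - 1))
    return abs(card_pred - num_target_boxes)

# 10 queries, 2 real classes + 1 background class
logits = np.full((10, 3), -10.0)
logits[:4, 0] = 10.0   # 4 queries confidently predict class 0
logits[4:, 2] = 10.0   # 6 queries confidently predict "no object"
err = cardinality_error(logits, num_target_boxes=4)  # → 0
```

This works only because the last logit genuinely means "no object"; the rest of the article explains what happens when that assumption breaks.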
The Root Cause: Lack of Explicit Background Class
The original DETR architecture includes an explicit background ("no object") class, occupying the last class index (classes[-1]) of the classification logits. This lets the model distinguish objects from background directly. Many DETR variants, such as Deformable DETR and Conditional DETR, remove this class and instead handle the absence of objects through other means, such as a sigmoid/focal loss over the real classes. The ImageLoss cardinality computation, however, still assumes that the last class index means "no object": it counts the predictions whose argmax is not classes[-1] and compares that count against the number of ground-truth boxes. Without an explicit background class, that assumption is false and the comparison is meaningless. As a result, the cardinality error becomes an unreliable metric that does not change meaningfully during training, which confuses users. The lack of an explicit background class makes the direct application of the original cardinality calculation incorrect and misleading.
The Impact of Incorrect Cardinality Error
The presence of an incorrect cardinality error has several negative consequences, particularly affecting model developers and those using logging frameworks.
Firstly, it can mislead users during training. Because the cardinality error appears in the loss dictionary, it gives an inaccurate picture of how well the model is learning, and it can push developers toward wrong decisions about model architecture, hyperparameter tuning, or training strategy. In particular, it can create the illusion that the model is performing poorly even when the real losses are small: the metric is out of sync with the other loss components and does not reflect the model's actual behavior.
Secondly, it breaks common logging practices. Frameworks such as TensorBoard and Weights & Biases often track training progress by summing the individual losses, and a cardinality error that is large relative to the other terms throws off that sum. The overall loss curve becomes hard to interpret, and when one component is significantly off, the ability to monitor training and diagnose issues is compromised.
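As an illustration of the logging problem, summing a loss dictionary that contains the degenerate term swamps the meaningful components (all key names and values below are made up for illustration):

```python
# Hypothetical loss dict as it might be logged during training.
# The cardinality term is degenerate: it roughly tracks the GT box count
# instead of anything the model is learning.
loss_dict = {
    "loss_ce": 0.42,
    "loss_bbox": 0.18,
    "loss_giou": 0.31,
    "cardinality_error": 27.0,
}

total = sum(loss_dict.values())
meaningful = sum(v for k, v in loss_dict.items() if k != "cardinality_error")
# total is dominated by the bogus term, hiding the real training signal
```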
Proposed Solution: Removing the Cardinality Error
The most straightforward solution is to remove the cardinality error calculation from the loss_dict entirely. This eliminates the misleading value, stops it from disrupting logging frameworks, and gives users an accurate view of the training process.
An alternative approach would be to compute the cardinality error from a score threshold, but that is really a post-processing step rather than part of the forward pass, and the added complexity is hard to justify. Since these models always emit a fixed number of predictions, the concept of cardinality only becomes meaningful after those predictions are filtered. Removing the in-loop calculation keeps the loss values accurate and the training process clean, enabling better analysis and tracking of the model's training performance.
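If a post-processing cardinality were wanted anyway, it could be sketched as below (a hypothetical helper, assuming sigmoid scores and an arbitrary threshold; this is not part of any library API):

```python
import numpy as np

def cardinality_after_threshold(class_logits, score_threshold=0.5):
    """Post-hoc object count for sigmoid/focal-loss models (no background class).

    class_logits: (num_queries, num_classes). A query counts as a detection
    when its best per-class sigmoid score clears the threshold.
    """
    scores = 1.0 / (1.0 + np.exp(-class_logits))  # element-wise sigmoid
    return int(np.sum(scores.max(-1) > score_threshold))

# 4 queries, 1 real class: only the queries with positive-enough logits count
logits = np.array([[3.0], [-2.0], [0.5], [-4.0]])
count = cardinality_after_threshold(logits)  # → 2
```

Note that the result depends entirely on the chosen threshold, which is exactly why this belongs in evaluation/post-processing rather than in the training loss dict.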
Steps to Reproduce and Expected Behavior
To reproduce the issue, run the forward pass of any affected model with targets attached; a 1-class model makes the problem especially clear. The expected behavior would be a meaningful cardinality error. However, since the model treats every query as a foreground prediction, the calculation does not make sense and provides incorrect insights.
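The degenerate behavior for a 1-class model can be demonstrated numerically without loading any real model weights (a minimal sketch with random logits, assuming the original argmax-based formula from above):

```python
import numpy as np

# A 1-class sigmoid/focal-loss model outputs logits of shape
# (num_queries, num_classes) with num_classes == 1, so argmax over the
# class axis is ALWAYS index 0 -- which the old formula treats as "no object".
num_queries, num_gt_boxes = 300, 7
logits = np.random.default_rng(0).normal(size=(num_queries, 1))

card_pred = int(np.sum(logits.argmax(-1) != logits.shape[-1] - 1))
card_err = abs(card_pred - num_gt_boxes)
# card_pred is 0 for every image, so the "error" is just the GT box count
```

Whatever the model learns, the metric never moves: it reports the number of ground-truth boxes, exactly the constant, uninformative behavior described above.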
Who Can Help and Further Information
The issue is well-documented and has been acknowledged by the Hugging Face Transformers community. The solution involves removing the error calculation and a PR is likely to be accepted. For more information, please check the following:
- The official example scripts: these provide a practical demonstration of the issue, and you can test the provided examples directly.
- Your own modified scripts: the error also reproduces when you run your custom training scripts.
This article has provided an in-depth analysis of the cardinality error issue affecting DETR-based models and offered practical solutions. By removing the incorrect metric, developers can ensure their model training and evaluation remain accurate and reliable.
For additional information, you can find further details on the issue in the Deformable DETR repository and the Hugging Face Transformers documentation.