What if you want to write the entire object detection training pipeline from scratch, so you can understand every step and be able to customize it? That’s what I set out to do. I examined several well-known object detection pipelines and designed one that best fits my needs and tasks. Thanks to the Ultralytics, YOLOx, DAMO-YOLO, RT-DETR and D-FINE repos, which I leveraged to gain a deeper understanding of various design details. I ended up implementing the SoTA real-time object detection model D-FINE in my custom pipeline.
Plan
- Dataset, Augmentations and transforms:
- Mosaic (with affine transforms)
- Mixup and Cutout
- Other augmentations with bounding boxes
- Letterbox vs simple resize
- Training:
- Optimizer
- Scheduler
- EMA
- Batch accumulation
- AMP
- Grad clipping
- Logging
- Metrics:
- mAPs from TorchMetrics / cocotools
- How to compute Precision, Recall, IoU?
- Pick a suitable solution for your case
- Experiments
- Attention to data preprocessing
- Where to start
Dataset
Dataset processing is the first thing you usually start working on. With object detection, you need to load your images and annotations. Annotations are often stored in COCO format as a json file or in YOLO format, with a txt file for each image. Let’s take a look at the YOLO format. Each line is structured as: class_id, x_center, y_center, width, height, where the bbox values are normalized between 0 and 1.
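For instance, a label file with two objects might look like the comment below (values are made up for illustration), and converting a line back to absolute pixel coordinates is a small helper:

import cv2

# image_001.txt (class_id x_center y_center width height, all normalized):
# 0 0.512 0.334 0.120 0.250
# 3 0.250 0.700 0.050 0.080

def yolo_to_xyxy(line: str, img_w: int, img_h: int):
    """Convert one YOLO-format line to (class_id, [x1, y1, x2, y2]) in pixels."""
    class_id, xc, yc, w, h = line.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(class_id), [xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2]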
Once you have your images and txt files, you can write your dataset class, nothing complicated here. Load everything, transform (augmentations included) and return it during training. I prefer splitting the data by creating a CSV file for each split and then reading it in the Dataloader class, rather than physically moving files into train/val/test folders. This is an example of a customization that helped my use case.
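A minimal sketch of that idea (the class name, CSV columns and label-loading helper are assumptions, not the pipeline’s exact code):

import cv2
import pandas as pd
from torch.utils.data import Dataset

class DetectionDataset(Dataset):
    def __init__(self, csv_path: str, transform=None):
        # one row per sample: paths to the image and to its YOLO txt file
        self.samples = pd.read_csv(csv_path)  # assumed columns: img_path, label_path
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        row = self.samples.iloc[idx]
        image = cv2.cvtColor(cv2.imread(row.img_path), cv2.COLOR_BGR2RGB)
        h, w = image.shape[:2]
        boxes, labels = [], []
        with open(row.label_path) as f:
            for line in f:
                class_id, box = yolo_to_xyxy(line, w, h)  # helper from the snippet above
                labels.append(class_id)
                boxes.append(box)
        if self.transform:  # Albumentations pipeline, see the next section
            out = self.transform(image=image, bboxes=boxes, class_labels=labels)
            image, boxes, labels = out["image"], out["bboxes"], out["class_labels"]
        return image, boxes, labels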
Augmentations
First, when augmenting images for object detection, it’s important to apply the same transformations to the bounding boxes. To do this conveniently, I use the Albumentations library. For example:
import albumentations as A
import cv2
from albumentations.pytorch import ToTensorV2

def _init_augs(self, cfg) -> None:
    if self.keep_ratio:
        resize = [
            A.LongestMaxSize(max_size=max(self.target_h, self.target_w)),
            A.PadIfNeeded(
                min_height=self.target_h,
                min_width=self.target_w,
                border_mode=cv2.BORDER_CONSTANT,
                fill=(114, 114, 114),
            ),
        ]
    else:
        resize = [A.Resize(self.target_h, self.target_w)]

    norm = [
        A.Normalize(mean=self.norm[0], std=self.norm[1]),
        ToTensorV2(),
    ]

    if self.mode == "train":
        augs = [
            A.RandomBrightnessContrast(p=cfg.train.augs.brightness),
            A.RandomGamma(p=cfg.train.augs.gamma),
            A.Blur(p=cfg.train.augs.blur),
            A.GaussNoise(p=cfg.train.augs.noise, std_range=(0.1, 0.2)),
            A.ToGray(p=cfg.train.augs.to_gray),
            A.Affine(
                rotate=[90, 90],
                p=cfg.train.augs.rotate_90,
                fit_output=True,
            ),
            A.HorizontalFlip(p=cfg.train.augs.left_right_flip),
            A.VerticalFlip(p=cfg.train.augs.up_down_flip),
        ]

        self.transform = A.Compose(
            augs + resize + norm,
            bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
        )

    elif self.mode in ["val", "test", "bench"]:
        self.mosaic_prob = 0
        self.transform = A.Compose(
            resize + norm,
            bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
        )
Second, there are a lot of interesting and non-trivial augmentations:
- Mosaic. The idea is simple: take several images (for example, 4) and stack them together in a grid (2×2). Then apply some affine transforms and feed the result to the model (a minimal sketch follows this list).
- MixUp. Originally used in image classification (it’s surprising that it works). The idea: take two images and overlay them on top of each other with some percentage of transparency. In classification models it usually means that if one image is 20% transparent and the second is 80%, then the model should predict 80% for class 1 and 20% for class 2. In object detection we simply get more objects in one image.
- Cutout. Cutout involves removing parts of the image (replacing them with black pixels) to help the model learn more robust features.
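Here is a minimal 2×2 mosaic sketch. It is simplified (class labels are omitted and the affine step is left as a comment), and the function name and resizing policy are assumptions, not the pipeline’s actual implementation:

import cv2
import numpy as np

def make_mosaic(samples, out_size=640, fill=114):
    """Naive 2x2 mosaic: `samples` is a list of 4 (image, boxes) pairs,
    with boxes in absolute pascal_voc format [x1, y1, x2, y2]."""
    s = out_size
    canvas = np.full((2 * s, 2 * s, 3), fill, dtype=np.uint8)
    offsets = [(0, 0), (s, 0), (0, s), (s, s)]  # top-left corner of each cell
    all_boxes = []
    for (img, boxes), (ox, oy) in zip(samples, offsets):
        h, w = img.shape[:2]
        scale = min(s / h, s / w)                      # fit each image into its cell
        nh, nw = int(h * scale), int(w * scale)
        canvas[oy:oy + nh, ox:ox + nw] = cv2.resize(img, (nw, nh))
        if len(boxes):
            b = np.asarray(boxes, dtype=np.float32) * scale
            b[:, [0, 2]] += ox                         # shift boxes into the cell
            b[:, [1, 3]] += oy
            all_boxes.append(b)
    boxes = np.concatenate(all_boxes) if all_boxes else np.zeros((0, 4))
    # affine transforms (and a random crop back to out_size) would follow here
    return canvas, boxes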
I often see mosaic applied with probability 1.0 for the first ~90% of epochs. Then it’s usually turned off, and lighter augmentations are used. The same idea applies to mixup, but I see it used a lot less (for the most popular detection framework, Ultralytics, it’s turned off by default; for another one, I see p=0.15). Cutout seems to be used less frequently.
You can read more about these augmentations in these two articles: 1, 2.
Results from just turning on mosaic are already quite good (the darker run without mosaic got mAP 0.89 vs 0.92 with it, tested on a real dataset)

Letterbox or simple resize?
During training, you usually resize the input image to a square. Models often use 640×640 and benchmark on the COCO dataset. There are two main ways to get there:
- Simple resize to the target size.
- Letterbox: resize the longest side to the target size (e.g., 640), preserving the aspect ratio, and pad the shorter side to reach the target dimensions.


Both approaches have advantages and disadvantages. Let’s discuss them first, and then I’ll share the results of numerous experiments I ran comparing them.
Simple resize:
- Compute goes to the whole image, with no useless padding.
- The “dynamic” aspect ratio may act as a form of regularization.
- Inference preprocessing perfectly matches training preprocessing (augmentations excluded).
- Kills real geometry. Resize distortion may affect the spatial relationships in the image. Although it might be a human bias to think that a fixed aspect ratio is important.
Letterbox:
- Preserves the real aspect ratio.
- During inference, you can cut the padding and run on a non-square image if you don’t lose accuracy (some models can degrade).
- You can train on a bigger image size, then run inference with the padding cut to get the same inference latency as with a simple resize. For example 640×640 vs 832×480. The second preserves the aspect ratios and objects appear roughly the same size.
- Part of the compute is wasted on gray padding.
- Objects get smaller.
How to test it and decide which one to use?
Train from scratch with these parameters:
- Simple resize, 640×640
- Keep aspect ratio, max side 640, and add padding (as a baseline)
- Keep aspect ratio, bigger image size (for example max side 832), and add padding. Then run inference with all 3 models. When the aspect ratio is preserved, cut the padding during inference. Compare latency and metrics. A sketch of the cut-padding preprocessing follows below.
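A sketch of the “cut padding” preprocessing used at inference: resize the longest side to the target and pad each side only up to the nearest multiple of the model stride (a stride of 32 is an assumption here), instead of padding all the way to a square:

import cv2
import numpy as np

def letterbox_min_pad(img: np.ndarray, max_side: int = 640, stride: int = 32, fill: int = 114):
    """Resize the longest side to max_side, pad only up to a multiple of stride."""
    h, w = img.shape[:2]
    scale = max_side / max(h, w)
    nh, nw = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (nw, nh))
    ph, pw = (stride - nh % stride) % stride, (stride - nw % stride) % stride
    padded = cv2.copyMakeBorder(resized, 0, ph, 0, pw, cv2.BORDER_CONSTANT, value=(fill, fill, fill))
    return padded, scale  # scale is needed to map predictions back to the original image

For a 16:9 input this yields 640×384, which matches the example shown next.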
Example of the same image from above with cut padding (640 × 384):

Here’s what happens when you preserve the aspect ratio and cut the gray padding during inference:
params              | F1 score | latency (ms) |
--------------------+----------+--------------|
ratio kept, 832     | 0.633    | 33.5         |
no ratio, 640x640   | 0.617    | 33.4         |
As shown, training with a preserved aspect ratio at a larger size (832) achieved a higher F1 score (0.633) compared to a simple 640×640 resize (F1 score of 0.617), while the latency remained similar. Note that some models may degrade if the padding is removed during inference, which kills the whole point of this trick and probably of the letterbox too.
What does this mean:
Training from scratch:
- With the same image size, simple resize gets better accuracy than letterbox.
- For letterbox, if you cut the padding during inference and your model doesn’t lose accuracy, you can train and run inference with a bigger image size to match the latency, and get slightly higher metrics (as in the example above).
Training from pre-trained weights:
- If you finetune, use the same tactic as the pre-trained model did; it should give you the best results if the datasets are not too different.
For D-FINE I see lower metrics when cutting padding during inference. Also, the model was pre-trained with a simple resize. For YOLO, a letterbox is usually a good choice.
Training
Every ML engineer should know how to implement a training loop. Although PyTorch does most of the heavy lifting, you might still feel overwhelmed by the number of design choices available. Here are some key components to consider:
- Optimizer – start with Adam/AdamW/SGD (see the setup sketch after this list).
- Scheduler – a fixed LR can be okay for the Adam family, but try StepLR, CosineAnnealingLR or OneCycleLR.
- EMA. This is a nice technique that makes training smoother and sometimes achieves higher metrics. After each batch, you update a secondary model (often called the EMA model) by computing an exponential moving average of the primary model’s weights (a minimal EMA sketch follows the training loop below).
- Batch accumulation is useful when your vRAM is very limited. Training a transformer-based object detection model means that in some cases, even with a mid-sized model, you can only fit 4 images into vRAM. By accumulating gradients over several batches before performing an optimizer step, you effectively simulate a larger batch size without exceeding your memory constraints. Another use case: when you have a lot of negatives (images without target objects) in your dataset and a small batch size, you can encounter unstable training. Batch accumulation can also help here.
- AMP uses half precision automatically where applicable. It reduces vRAM usage and makes training faster (if you have a GPU that supports it). I see 40% less vRAM usage and at least a 15% training speed increase.
- Grad clipping. Often, when you use AMP, training can become less stable. This can also happen with higher LRs. When your gradients are too large, training will fail. Gradient clipping makes sure gradients never exceed a certain value.
- Logging. Try Hydra for configs and something like Weights and Biases or ClearML for experiment tracking. Also, log everything locally. Save your best weights and metrics, so after numerous experiments you can always find all the data on the model you need.
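Before the loop itself, a rough setup sketch for these pieces might look like this (model, total_iters and amp_enabled are placeholders; the hyperparameters are not tuned values):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters)
scaler = torch.cuda.amp.GradScaler(enabled=amp_enabled)  # loss/grad scaling for AMP
clip_max_norm = 10.0                                     # threshold for gradient clipping

The training loop below ties all of these together (batch accumulation, AMP, gradient clipping and the EMA update):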
import time

import numpy as np
import torch
import wandb
from torch.amp import autocast
from tqdm import tqdm

# project helpers assumed from the pipeline: calculate_remaining_time,
# get_vram_usage, save_metrics, logger

def train(self) -> None:
    best_metric = 0
    cur_iter = 0
    ema_iter = 0
    one_epoch_time = None

    def optimizer_step(step_scheduler: bool):
        """
        Clip grads, optimizer step, scheduler step, zero grad, EMA model update
        """
        nonlocal ema_iter
        if self.amp_enabled:
            if self.clip_max_norm:
                self.scaler.unscale_(self.optimizer)
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
            self.scaler.step(self.optimizer)
            self.scaler.update()
        else:
            if self.clip_max_norm:
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
            self.optimizer.step()

        if step_scheduler:
            self.scheduler.step()
        self.optimizer.zero_grad()

        if self.ema_model:
            ema_iter += 1
            self.ema_model.update(ema_iter, self.model)

    for epoch in range(1, self.epochs + 1):
        epoch_start_time = time.time()
        self.model.train()
        self.loss_fn.train()
        losses = []

        with tqdm(self.train_loader, unit="batch") as tepoch:
            for batch_idx, (inputs, targets, _) in enumerate(tepoch):
                tepoch.set_description(f"Epoch {epoch}/{self.epochs}")
                if inputs is None:
                    continue
                cur_iter += 1

                inputs = inputs.to(self.device)
                targets = [
                    {
                        k: (v.to(self.device) if (v is not None and hasattr(v, "to")) else v)
                        for k, v in t.items()
                    }
                    for t in targets
                ]

                lr = self.optimizer.param_groups[0]["lr"]

                if self.amp_enabled:
                    with autocast(self.device, cache_enabled=True):
                        output = self.model(inputs, targets=targets)
                    with autocast(self.device, enabled=False):
                        loss_dict = self.loss_fn(output, targets)
                    loss = sum(loss_dict.values()) / self.b_accum_steps
                    self.scaler.scale(loss).backward()
                else:
                    output = self.model(inputs, targets=targets)
                    loss_dict = self.loss_fn(output, targets)
                    loss = sum(loss_dict.values()) / self.b_accum_steps
                    loss.backward()

                if (batch_idx + 1) % self.b_accum_steps == 0:
                    optimizer_step(step_scheduler=True)

                losses.append(loss.item())

                tepoch.set_postfix(
                    loss=np.mean(losses) * self.b_accum_steps,
                    eta=calculate_remaining_time(
                        one_epoch_time,
                        epoch_start_time,
                        epoch,
                        self.epochs,
                        cur_iter,
                        len(self.train_loader),
                    ),
                    vram=f"{get_vram_usage()}%",
                )

        # Final update for any leftover gradients from an incomplete accumulation step
        if (batch_idx + 1) % self.b_accum_steps != 0:
            optimizer_step(step_scheduler=False)

        wandb.log({"lr": lr, "epoch": epoch})

        metrics = self.evaluate(
            val_loader=self.val_loader,
            conf_thresh=self.conf_thresh,
            iou_thresh=self.iou_thresh,
            path_to_save=None,
        )

        best_metric = self.save_model(metrics, best_metric)
        save_metrics(
            {}, metrics, np.mean(losses) * self.b_accum_steps, epoch, path_to_save=None
        )

        if (
            epoch >= self.epochs - self.no_mosaic_epochs
            and self.train_loader.dataset.mosaic_prob
        ):
            self.train_loader.dataset.close_mosaic()

        if epoch == self.ignore_background_epochs:
            self.train_loader.dataset.ignore_background = False
            logger.info("Including background images")

        one_epoch_time = time.time() - epoch_start_time
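The loop above calls self.ema_model.update(ema_iter, self.model). A minimal EMA wrapper could look roughly like this (a sketch; the decay value and warm-up schedule are assumptions):

import copy
import math
import torch

class ModelEMA:
    """Keeps an exponential moving average copy of the model's weights."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.9998, warmup_steps: int = 2000):
        self.module = copy.deepcopy(model).eval()   # the shadow (EMA) model used for eval/export
        for p in self.module.parameters():
            p.requires_grad_(False)
        self.decay = decay
        self.warmup_steps = warmup_steps

    @torch.no_grad()
    def update(self, step: int, model: torch.nn.Module):
        # ramp the decay up from 0 so the first noisy updates don't dominate the average
        d = self.decay * (1 - math.exp(-step / self.warmup_steps))
        model_state = model.state_dict()
        for k, v in self.module.state_dict().items():
            if v.dtype.is_floating_point:
                v.mul_(d).add_(model_state[k].detach(), alpha=1 - d)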
Metrics
For object detection everyone uses mAP, and it’s already standardized how we measure it. Use pycocotools, faster-coco-eval or TorchMetrics for mAP. But mAP means that we check how good the model is overall, across all confidence levels. mAP0.5 means that the IoU threshold is 0.5 (everything lower is considered a wrong prediction). I personally don’t fully like this metric, as in production we always use one confidence threshold. So why not set the threshold and then compute metrics? That’s why I also always calculate confusion matrices, and based on that – Precision, Recall, F1-score, and IoU.
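For example, with TorchMetrics the mAP computation looks roughly like this (boxes in absolute xyxy format; the values are made up for illustration):

import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")
preds = [{
    "boxes": torch.tensor([[10.0, 20.0, 100.0, 120.0]]),
    "scores": torch.tensor([0.90]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 18.0, 98.0, 125.0]]),
    "labels": torch.tensor([0]),
}]
metric.update(preds, targets)
print(metric.compute())  # dict with "map", "map_50", "map_75", per-size mAPs, etc.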
But the matching logic can be tricky. Here is what I use (a sketch follows the list):
- 1 GT (ground truth) object = 1 predicted object, and it’s a TP if IoU > threshold. If there is no prediction for a GT object, it’s a FN. If there is no GT for a prediction, it’s a FP.
- 1 GT should be matched by a prediction only once. If there are 2 predictions for 1 GT, I count 1 TP and 1 FP.
- Class ids should also match. If the model predicts class_0 but the GT is class_1, it means FP += 1 and FN += 1.
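A simplified sketch of that matching logic for one image (greedy matching by IoU; in practice you would also sort predictions by confidence first; box_iou comes from torchvision):

import torch
from torchvision.ops import box_iou

def match_predictions(pred_boxes, pred_labels, gt_boxes, gt_labels, iou_thresh=0.5):
    """Return (tp, fp, fn) for one image; boxes are [N, 4] xyxy tensors, labels are [N] tensors."""
    tp, fp = 0, 0
    matched_gt = set()
    if len(pred_boxes) and len(gt_boxes):
        ious = box_iou(pred_boxes, gt_boxes)            # [num_preds, num_gt]
        for p_idx in range(len(pred_boxes)):
            iou, g_idx = ious[p_idx].max(dim=0)
            if (
                iou >= iou_thresh
                and g_idx.item() not in matched_gt      # each GT can only be matched once
                and pred_labels[p_idx] == gt_labels[g_idx]
            ):
                tp += 1
                matched_gt.add(g_idx.item())
            else:
                fp += 1                                  # duplicate, wrong class, or low IoU
    else:
        fp = len(pred_boxes)
    fn = len(gt_boxes) - len(matched_gt)                 # unmatched GT objects
    return tp, fp, fn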
During training, I select the best model based on the metrics that are relevant to the task. I typically consider the average of mAP50 and F1-score.
Model and loss
I haven’t discussed model architecture and loss function here. They usually go together, and you can choose any model you like and integrate it into your pipeline with everything from above. I did that with DAMO-YOLO and D-FINE, and the results were great.
Pick a suitable solution for your case
Many people use Ultralytics, however it is under a copyleft license (AGPL-3.0), and you can’t use it in commercial projects unless your code is open source. So people often look into Apache 2.0 and MIT licensed models. Check out D-FINE, RT-DETR2 or some YOLO models like YOLOv9.
What if you want to customize something in the pipeline? When you build everything from scratch, you have full control. Otherwise, try picking a project with a smaller codebase, as a large one can make it difficult to isolate and modify individual components.
If you don’t need anything custom and your usage is allowed by the Ultralytics license, it’s a great repo to use: it supports multiple tasks (classification, detection, instance segmentation, keypoints, oriented bounding boxes), the models are efficient and achieve good scores. To reiterate: you probably don’t need a custom training pipeline if you are not doing very specific things.
Experiments
Let me share some results I got with a custom training pipeline with the D-FINE model and compare it to the Ultralytics YOLO11 model on the VisDrone-DET2019 dataset.
Trained from scratch:
model                 | mAP 0.50 | F1-score | Latency (ms) |
----------------------+----------+----------+--------------|
YOLO11m TRT           | 0.417    | 0.568    | 15.6         |
YOLO11m TRT dynamic   | -        | 0.568    | 13.3         |
YOLO11m OV            | -        | 0.568    | 122.4        |
D-FINEm TRT           | 0.457    | 0.622    | 16.6         |
D-FINEm OV            | 0.457    | 0.622    | 115.3        |
From COCO pre-trained:
model     | mAP 0.50 | F1-score |
----------+----------+----------|
YOLO11m   | 0.456    | 0.600    |
D-FINEm   | 0.506    | 0.649    |
Latency was measured on an RTX 3060 with TensorRT (TRT), static image size 640×640, including the time for cv2.imread.
OpenVINO (OV) was measured on an i5 14000f (no iGPU). Dynamic means that during inference, the gray padding is cut for faster inference. It worked with the YOLO11 TensorRT version. More details about cutting gray padding above (Letterbox or simple resize section).
One disappointing result is the latency on an Intel N100 CPU with iGPU ($150 miniPC):
model     | Latency (ms) |
----------+--------------|
YOLO11m   | 188          |
D-FINEm   | 272          |
D-FINEs   | 11           |

Here, traditional convolutional neural networks are noticeably faster, maybe because of OpenVINO optimizations for GPUs.
Overall, I ran over 30 experiments with different datasets (including real-world ones), models, and parameters, and I can say that D-FINE gets better metrics. And it makes sense, as on COCO it is also higher than all YOLO models.

VisDrone experiments:


Example of D-FINE model predictions (green – GT, blue – pred):

Final results
Knowing all the details, let’s see a final comparison with the best settings for both models on an i12400F and RTX 3060 with the VisDrone dataset:
model                 | F1-score | Latency (ms) |
----------------------+----------+--------------|
YOLO11m TRT dynamic   | 0.600    | 13.3         |
YOLO11m OV            | 0.600    | 122.4        |
D-FINEs TRT           | 0.629    | 12.3         |
D-FINEs OV            | 0.629    | 57.4         |
As shown above, I was able to use a smaller D-FINE model and achieve both faster inference and higher accuracy than YOLO11. Beating Ultralytics, the most widely used real-time object detection framework, in both speed and accuracy is quite an accomplishment, isn’t it? The same pattern is observed across several other real-world datasets.
I also tried out YOLOv12, which came out while I was writing this article. It performed similarly to YOLO11 and even achieved slightly lower metrics (mAP 0.456 vs 0.452). It seems that YOLO models have been hitting a wall for the last couple of years. D-FINE was a great update for object detection models.
Finally, let’s see the difference between YOLO11m and D-FINEs visually. YOLO11m, conf 0.25, NMS IoU 0.5, latency 13.3 ms:

D-FINEs, conf 0.5, no NMS, latency 12.3 ms:

Both Precision and Recall are higher with the D-FINE model. And it’s also faster. Here is also the “m” version of D-FINE:

Isn’t it crazy that even that one car on the left was detected?
Attention to data preprocessing
This part goes a little bit outside the scope of the article, but I want to at least quickly mention it, as some parts can be automated and used in the pipeline. What I definitely see as a Computer Vision engineer is that when engineers don’t spend time working with the data, they don’t get good models. You can have all the SoTA models and everything done right, but garbage in – garbage out. So, I always pay a lot of attention to how to approach the task and how to gather, filter, validate, and annotate the data. Don’t assume that the annotation team will do everything right. Get your hands dirty and manually check some portion of the dataset to make sure that the annotations are good and the collected images are representative.
A few quick ideas to look into:
- Remove duplicates and near duplicates from the val/test sets. The model should not be validated on one sample twice, and you definitely don’t want a data leak by having two identical images, one in the training and one in the validation set (a simple near-duplicate check is sketched after this list).
- Check how small your objects can be. Everything not visible to your eye should not be annotated. Also, remember that augmentations will make objects appear even smaller (for example, mosaic or zoom out). Configure these augmentations accordingly so you won’t end up with unusably small objects in the image.
- If you already have a model for a certain task and need more data, try using your model to pre-annotate new images. Check the cases where the model fails and gather more similar cases.
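For the duplicate check, a quick perceptual-hash pass is usually enough. Here is a sketch using the third-party imagehash library (the distance threshold is an assumption to tune per dataset):

from pathlib import Path
from PIL import Image
import imagehash

def find_near_duplicates(folder: str, max_distance: int = 5):
    """Return pairs of images whose perceptual hashes differ by <= max_distance bits."""
    seen = {}
    duplicates = []
    for path in sorted(Path(folder).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        for other_path, other_h in seen.items():
            if h - other_h <= max_distance:   # Hamming distance between hashes
                duplicates.append((str(path), str(other_path)))
                break
        else:
            seen[path] = h
    return duplicates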
Where to start
I worked a lot on this pipeline, and I’m ready to share it with everyone who wants to try it out. It uses the SoTA D-FINE model under the hood and adds some features that were absent in the original repo (mosaic augmentations, batch accumulation, scheduler, more metrics, visualization of preprocessed images and eval predictions, exporting and inference code, better logging, unified and simplified configuration file).
Here is the link to my repo. Here is the original D-FINE repo, where I also contribute. If you need any help, please contact me on LinkedIn. Thank you for your time!
Citations and acknowledgments
@article{zhu2021detection,
  title={Detection and Tracking Meet Drones Challenge},
  author={Zhu, Pengfei and Wen, Longyin and Du, Dawei and Bian, Xiao and Fan, Heng and Hu, Qinghua and Ling, Haibin},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume={44},
  number={11},
  pages={7380--7399},
  year={2021},
  publisher={IEEE}
}
@misc{peng2024dfine,
  title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
  author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
  year={2024},
  eprint={2410.13842},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}