Description
Thank you for the outstanding work. I ran into some problems when trying to reproduce the COCO training. First, I evaluated your checkpoint and got the same result, 31.3 mAP, which shows that the dataset and Python environment are set up correctly.
I then trained ViLD first with:
torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py
and afterwards ran the actual COCO training with:
torchrun --nproc_per_node=2 -m oadp.dp.train oadp_ov_coco configs/dp/oadp_ov_coco.py
However, the checkpoint from my own training does not reproduce the expected result. Here is my full result:
{'COCO_17_bbox_mAP_': '0.1495',
'COCO_17_bbox_mAP_50': '0.2830',
'COCO_17_bbox_mAP_75': '0.1398',
'COCO_17_bbox_mAP_copypaste': '0.1495 0.2830 0.1398 0.1060 0.1788 0.1816',
'COCO_17_bbox_mAP_l': '0.1816',
'COCO_17_bbox_mAP_m': '0.1788',
'COCO_17_bbox_mAP_s': '0.1060',
'COCO_48_17_bbox_mAP_': '0.2673',
'COCO_48_17_bbox_mAP_50': '0.4436',
'COCO_48_17_bbox_mAP_75': '0.2798',
'COCO_48_17_bbox_mAP_copypaste': '0.2673 0.4436 0.2798 0.1750 0.2916 0.3488',
'COCO_48_17_bbox_mAP_l': '0.3488',
'COCO_48_17_bbox_mAP_m': '0.2916',
'COCO_48_17_bbox_mAP_s': '0.1750',
'COCO_48_bbox_mAP_': '0.3090',
'COCO_48_bbox_mAP_50': '0.5005',
'COCO_48_bbox_mAP_75': '0.3293',
'COCO_48_bbox_mAP_copypaste': '0.3090 0.5005 0.3293 0.1994 0.3316 0.4080',
'COCO_48_bbox_mAP_l': '0.4080',
'COCO_48_bbox_mAP_m': '0.3316',
'COCO_48_bbox_mAP_s': '0.1994'}
By the way, I noticed some abnormal output during training: every COCO_17_bbox mAP value is -1. Here is an arbitrary excerpt of the validation output, taken around iteration 26000/40000:
2023-11-29 19:26:42,471 - mmdet - INFO - Iter(val) [2500] COCO_48_17_bbox_mAP_: 0.1982, COCO_48_17_bbox_mAP_50: 0.3539, COCO_48_17_bbox_mAP_75: 0.1999, COCO_48_17_bbox_mAP_s: 0.1101, COCO_48_17_bbox_mAP_m: 0.2075, COCO_48_17_bbox_mAP_l: 0.2655, COCO_48_17_bbox_mAP_copypaste: 0.1982 0.3539 0.1999 0.1101 0.2075 0.2655, COCO_48_bbox_mAP_: 0.1982, COCO_48_bbox_mAP_50: 0.3539, COCO_48_bbox_mAP_75: 0.1999, COCO_48_bbox_mAP_s: 0.1101, COCO_48_bbox_mAP_m: 0.2075, COCO_48_bbox_mAP_l: 0.2655, COCO_48_bbox_mAP_copypaste: 0.1982 0.3539 0.1999 0.1101 0.2075 0.2655, COCO_17_bbox_mAP_: -1.0000, COCO_17_bbox_mAP_50: -1.0000, COCO_17_bbox_mAP_75: -1.0000, COCO_17_bbox_mAP_s: -1.0000, COCO_17_bbox_mAP_m: -1.0000, COCO_17_bbox_mAP_l: -1.0000, COCO_17_bbox_mAP_copypaste: -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
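For anyone who wants to check their own logs for the same symptom, below is a minimal sketch that extracts the metrics from a validation log line and lists the ones reported as -1. The sample line is a shortened copy of the log above, and `parse_val_metrics` is a hypothetical helper, not an mmdet or OADP function. As far as I understand, pycocotools reports -1 when no ground-truth annotations match the evaluated categories, so this may only mean that the annotation file used for validation during training contains no novel-category annotations; I have not confirmed this. It would also be consistent with the COCO_48_17 and COCO_48 values being identical in the log.

```python
import re

# Shortened copy of the validation log line above.
sample_line = (
    "2023-11-29 19:26:42,471 - mmdet - INFO - Iter(val) [2500] "
    "COCO_48_bbox_mAP_: 0.1982, COCO_48_bbox_mAP_50: 0.3539, "
    "COCO_48_bbox_mAP_copypaste: 0.1982 0.3539, "
    "COCO_17_bbox_mAP_: -1.0000, COCO_17_bbox_mAP_50: -1.0000, "
    "COCO_17_bbox_mAP_copypaste: -1.0000 -1.0000"
)


def parse_val_metrics(line):
    """Map each 'COCO_*' metric name to its float value (hypothetical helper)."""
    pairs = re.findall(r'(COCO_\w+): (-?\d+\.\d+)', line)
    # The '*_copypaste' entries only repeat the other values, so skip them.
    return {k: float(v) for k, v in pairs if not k.endswith('copypaste')}


metrics = parse_val_metrics(sample_line)
# pycocotools uses -1 as a placeholder, so any negative value is invalid.
invalid = sorted(name for name, value in metrics.items() if value < 0)
print("metrics reported as -1:", invalid)
```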
When I add --override to the command, for example: torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json, the resulting checkpoint becomes unusable:

Why does this happen?
It seems that some part of my experiment is wrong. How can I fix it? Could you also tell me how to use the training commands correctly? Much appreciated!