Reproduction of oadp_ov_coco.py #15

@Lukas-Ma1

Thank you for the outstanding work. I ran into some problems when trying to reproduce the COCO training. First, I evaluated your released checkpoint and got the same result, 31.3 mAP, which suggests that the dataset and Python environment are set up correctly.

I then trained with the vild_ov_coco config first: torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py, and afterwards ran the actual COCO training: torchrun --nproc_per_node=2 -m oadp.dp.train oadp_ov_coco configs/dp/oadp_ov_coco.py. However, the checkpoint from my own training does not reproduce the expected result. Here is my full result:

{'COCO_17_bbox_mAP_': '0.1495',
'COCO_17_bbox_mAP_50': '0.2830',
'COCO_17_bbox_mAP_75': '0.1398',
'COCO_17_bbox_mAP_copypaste': '0.1495 0.2830 0.1398 0.1060 0.1788 0.1816',
'COCO_17_bbox_mAP_l': '0.1816',
'COCO_17_bbox_mAP_m': '0.1788',
'COCO_17_bbox_mAP_s': '0.1060',
'COCO_48_17_bbox_mAP_': '0.2673',
'COCO_48_17_bbox_mAP_50': '0.4436',
'COCO_48_17_bbox_mAP_75': '0.2798',
'COCO_48_17_bbox_mAP_copypaste': '0.2673 0.4436 0.2798 0.1750 0.2916 0.3488',
'COCO_48_17_bbox_mAP_l': '0.3488',
'COCO_48_17_bbox_mAP_m': '0.2916',
'COCO_48_17_bbox_mAP_s': '0.1750',
'COCO_48_bbox_mAP_': '0.3090',
'COCO_48_bbox_mAP_50': '0.5005',
'COCO_48_bbox_mAP_75': '0.3293',
'COCO_48_bbox_mAP_copypaste': '0.3090 0.5005 0.3293 0.1994 0.3316 0.4080',
'COCO_48_bbox_mAP_l': '0.4080',
'COCO_48_bbox_mAP_m': '0.3316',
'COCO_48_bbox_mAP_s': '0.1994'}

By the way, I noticed some abnormal output during training: every COCO_17_bbox mAP value is -1. Here is an excerpt of the validation output, taken around iteration 26000/40000:

2023-11-29 19:26:42,471 - mmdet - INFO - Iter(val) [2500] COCO_48_17_bbox_mAP_: 0.1982, COCO_48_17_bbox_mAP_50: 0.3539, COCO_48_17_bbox_mAP_75: 0.1999, COCO_48_17_bbox_mAP_s: 0.1101, COCO_48_17_bbox_mAP_m: 0.2075, COCO_48_17_bbox_mAP_l: 0.2655, COCO_48_17_bbox_mAP_copypaste: 0.1982 0.3539 0.1999 0.1101 0.2075 0.2655, COCO_48_bbox_mAP_: 0.1982, COCO_48_bbox_mAP_50: 0.3539, COCO_48_bbox_mAP_75: 0.1999, COCO_48_bbox_mAP_s: 0.1101, COCO_48_bbox_mAP_m: 0.2075, COCO_48_bbox_mAP_l: 0.2655, COCO_48_bbox_mAP_copypaste: 0.1982 0.3539 0.1999 0.1101 0.2075 0.2655, COCO_17_bbox_mAP_: -1.0000, COCO_17_bbox_mAP_50: -1.0000, COCO_17_bbox_mAP_75: -1.0000, COCO_17_bbox_mAP_s: -1.0000, COCO_17_bbox_mAP_m: -1.0000, COCO_17_bbox_mAP_l: -1.0000, COCO_17_bbox_mAP_copypaste: -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
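The pattern in the log line above (all COCO_17 values at -1, and COCO_48_17 identical to COCO_48) can be extracted mechanically when scanning a long training log. Below is a minimal sketch, assuming the `NAME: value` log format shown above; `parse_metrics` and `flag_missing` are hypothetical helper names, not part of OADP or mmdet. As far as I understand, the COCO evaluation code reports -1 when a metric has no valid entries to average, for example when there is no ground truth for the evaluated categories.

```python
import re

# Matches "NAME: value" pairs such as "COCO_17_bbox_mAP_50: -1.0000".
PAIR_RE = re.compile(r"(COCO_\w+): (-?\d+(?:\.\d+)?)")


def parse_metrics(line: str) -> dict:
    """Return {metric name: value} for one validation log line."""
    metrics = {}
    for name, value in PAIR_RE.findall(line):
        # "copypaste" entries only repeat the other values, so skip them.
        if name.endswith("copypaste"):
            continue
        metrics[name] = float(value)
    return metrics


def flag_missing(metrics: dict) -> list:
    """Return the sorted names of metrics reported as negative (i.e. -1)."""
    return sorted(name for name, value in metrics.items() if value < 0)


# Shortened sample in the same format as the log line above.
sample = (
    "COCO_48_bbox_mAP_: 0.1982, "
    "COCO_48_bbox_mAP_copypaste: 0.1982 0.3539, "
    "COCO_17_bbox_mAP_: -1.0000, COCO_17_bbox_mAP_50: -1.0000"
)
print(flag_missing(parse_metrics(sample)))
```

Running this over every `Iter(val)` line would show whether the -1 values appear at every validation step or only at some of them.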

When I add --override to the command, like: torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json, the resulting checkpoint becomes unusable:
[Screenshot 2023-11-30 09 44 37]
Why does this happen?
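One thing that may be worth checking, for both the -1 values and the override behaviour, is which categories the validation annotation file actually contains. Below is a minimal sketch, assuming a standard COCO-format JSON whose `annotations` entries carry a `category_id`; `annotated_category_ids` is a hypothetical helper, and the guess that instances_val2017.48.json holds only the 48 base categories comes from the file name alone.

```python
import json
from pathlib import Path


def annotated_category_ids(coco: dict) -> set:
    """Return the category ids that have at least one annotation."""
    return {ann["category_id"] for ann in coco.get("annotations", [])}


def load_coco(ann_file: str) -> dict:
    """Load a COCO-format annotation file into a dict."""
    return json.loads(Path(ann_file).read_text())


# Tiny in-memory stand-in for an annotation file (not real data).
toy = {
    "categories": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"id": 10, "image_id": 100, "category_id": 1},
        {"id": 11, "image_id": 100, "category_id": 3},
    ],
}
present = annotated_category_ids(toy)
declared = {cat["id"] for cat in toy["categories"]}
print("categories without ground truth:", sorted(declared - present))
```

For the real file, `annotated_category_ids(load_coco("data/coco/annotations/instances_val2017.48.json"))` would give the set to compare against the 17 novel category ids. If none of them have annotations, an evaluation restricted to those 17 categories has no ground truth to score against.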

It seems that some part of my experiment is wrong. How can I fix it? Could you also tell me how to use the training commands correctly? Much appreciated!
