
"free(): invalid pointer" when using "begin_aggregate()" #1

@sysu19351115


Hi,
I'm trying to train a new policy. Unfortunately, whether I train the teacher or the student, the process aborts while the environment is being created.

Taking "mlp.py" as an example, the crash happens at line 95 of "dclaw_multiobjs.py": "self.gym.begin_aggregate(env_ptr, max_agg_bodies, max_agg_shapes, True)". The full output is as follows:

Importing module 'gym_38' (/workspace/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /workspace/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
mlp.py:17: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path=dexenv.PROJECT_ROOT.joinpath('conf').as_posix(), config_name="debug_dclaw")
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
PyTorch version 2.1.0+cu118
Device count 2
/workspace/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu118/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
/workspace/isaacgym/python/isaacgym/torch_utils.py:135: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def get_axis_params(value, axis_idx, x_value=0., dtype=np.float, n_dims=3):
2024-05-05 04:18:06.870 | INFO     | dexenv.utils.create_task_env:create_task_env:19 - Creating environment DclawMultiObjs
2024-05-05 04:18:06.871 | WARNING  | MODDclawMultiObjs:parse_obj_dataset:168 - Dataset path:/workspace/dexenv/dexenv/assets/miscnet/train
2024-05-05 04:18:06.877 | INFO     | MODDclawMultiObjs:__init__:23 - Object urdf root path:/workspace/dexenv/dexenv/assets/miscnet/train.
2024-05-05 04:18:06.877 | INFO     | MODDclawMultiObjs:__init__:24 - Number of available objects:150.
2024-05-05 04:18:06.877 | WARNING  | dexenv.envs.dclaw_base:__init__:29 - Domain randomization is enabled!
Obs type: full
2024-05-05 04:18:06.879 | INFO     | dexenv.envs.base.vec_task:__init__:89 - RL device:cuda
2024-05-05 04:18:06.879 | INFO     | dexenv.envs.base.vec_task:__init__:90 - Simulation device:cuda:0
2024-05-05 04:18:06.879 | INFO     | dexenv.envs.base.vec_task:__init__:91 - Graphics device:-1
/usr/local/lib/python3.8/dist-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
[Warning] [carb.gym.plugin] useGpuPipeline is set, forcing GPU PhysX
[Warning] [carb.gym.plugin] useGpu is set, forcing single scene (0 subscenes)
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
2024-05-05 04:18:07.985 | INFO     | dexenv.envs.dclaw_base:get_dclaw_asset:455 - VHACD:True
/workspace/dexenv/dexenv/envs/dclaw_base.py:461: DeprecationWarning: an integer is required (got type isaacgym._bindings.linux-x86_64.gym_38.DofDriveMode).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
  asset_options.default_dof_drive_mode = gymapi.DOF_MODE_POS
Dclaw asset root:/workspace/dexenv/dexenv/assets/dclaw_4f robot name:dclaw_4f
D-Claw:
	 Number of bodies: 21
	 Number of shapes: 105
	 Number of dofs: 12
2024-05-05 04:18:10.423 | INFO     | dexenv.envs.dclaw_base:get_dclaw_asset:481 - Joint names:dict_keys(['four1_jnt', 'four2_jnt', 'four3_jnt', 'one1_jnt', 'one2_jnt', 'one3_jnt', 'three1_jnt', 'three2_jnt', 'three3_jnt', 'two1_jnt', 'two2_jnt', 'two3_jnt'])
Number of fingertips:4  Fingertips:['four_tip_link', 'one_tip_link', 'three_tip_link', 'two_tip_link']
Actuator   ---  DoF Index
	 four1_jnt   0
	 four2_jnt   1
	 four3_jnt   2
	 one1_jnt   3
	 one2_jnt   4
	 one3_jnt   5
	 three1_jnt   6
	 three2_jnt   7
	 three3_jnt   8
	 two1_jnt   9
	 two2_jnt   10
	 two3_jnt   11
Setting DOF velocity limit to:[6.55172413793103, 7.758620689655173, 7.7586206896551]
Setting DOF effort limit to:2.6
Setting stiffness to:[2.7724128689655174, 3.558619503448275, 3.5586195034482757]
Setting damping to:[0.273946924137931, 0.382384248275862, 0.382384248275862]
2024-05-05 04:18:10.460 | INFO     | MODDclawMultiObjs:load_object_asset:207 - Loading object IDs from 0 to 150.
Loading Asset: 100%|███████████████████████████████████| 150/150 [01:18<00:00,  1.92it/s]
free(): invalid pointer
Aborted (core dumped)

The error is intermittent: when I change the number of environments, the crash happens while aggregating a different object. I have tried this on several computers and the same thing happens every time, so I don't think the problem is specific to my machine.
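One possibility worth ruling out (an assumption, not something confirmed in this report) is that `max_agg_bodies` / `max_agg_shapes` are smaller than what some environments actually add to the aggregate. With VHACD enabled, each object may decompose into a different number of convex shapes, which would be consistent with the crash moving to a different object when the number of environments changes. The sketch below is plain Python so it runs without Isaac Gym: it sizes the aggregate from the largest object and lists the envs that would overflow a given limit. `AssetCounts`, `aggregate_capacity` and `overflowing_envs` are hypothetical helpers; in the real environment the counts would come from `gym.get_asset_rigid_body_count(asset)` and `gym.get_asset_rigid_shape_count(asset)`. Only the D-Claw counts (21 bodies, 105 shapes) come from the log above; the object counts are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AssetCounts:
    """Hypothetical container for one asset's rigid-body and rigid-shape counts.

    In Isaac Gym these would be filled from gym.get_asset_rigid_body_count(asset)
    and gym.get_asset_rigid_shape_count(asset).
    """
    bodies: int
    shapes: int


def aggregate_capacity(robot: AssetCounts, objects: List[AssetCounts],
                       extra_bodies: int = 0, extra_shapes: int = 0) -> Tuple[int, int]:
    """Return (max_agg_bodies, max_agg_shapes) large enough for the robot plus
    the largest object any env might receive, plus any extra actors (e.g. a table)."""
    max_obj_bodies = max(o.bodies for o in objects)
    max_obj_shapes = max(o.shapes for o in objects)
    return (robot.bodies + max_obj_bodies + extra_bodies,
            robot.shapes + max_obj_shapes + extra_shapes)


def overflowing_envs(robot: AssetCounts, env_objects: List[AssetCounts],
                     max_bodies: int, max_shapes: int) -> List[int]:
    """Indices of envs whose robot + object would exceed the aggregate limits."""
    bad = []
    for i, obj in enumerate(env_objects):
        if (robot.bodies + obj.bodies > max_bodies
                or robot.shapes + obj.shapes > max_shapes):
            bad.append(i)
    return bad


if __name__ == "__main__":
    dclaw = AssetCounts(bodies=21, shapes=105)  # counts printed in the log
    objs = [AssetCounts(1, 8), AssetCounts(1, 32), AssetCounts(1, 64)]  # made-up objects
    print(aggregate_capacity(dclaw, objs))
    # Objects assigned round-robin to 6 envs; limits sized for the first object only.
    envs = [objs[i % len(objs)] for i in range(6)]
    print(overflowing_envs(dclaw, envs, max_bodies=22, max_shapes=113))
```

With these numbers, `aggregate_capacity` returns `(22, 169)`, whereas limits sized only for the first object (`max_shapes=113`) would be exceeded by every env that receives the 32- or 64-shape object.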

If I remove the "begin_aggregate" call, the environment is created without errors, but training then fails with a new error: "RuntimeError: CUDA error: an illegal memory access was encountered". This still happens even after I reduce the number of environments to 1000. Using fewer environments would make training take longer, which I want to avoid. Could removing the aggregation be what causes this second error?

My machine has 250 GB of RAM and an RTX A6000 GPU with 48 GB of video memory.

Overall, I want to fix the "free(): invalid pointer" crash caused by aggregation, but I haven't managed to. Does anyone have an idea what could cause this? If more details are needed, please let me know.
Thanks for any suggestions.
