Alternative workload management in MLSpace at Cloud.RU (aka SberCloud).
The default `client_lib` is awfully bad for workload management, for the
multiple reasons enumerated below.
- Only `flags` (i.e. short `-o` or long `--option` with `-`).
- Overcomplicated API (wtf is `pytorch` and `pytorch2`?).
- Significant difference with `mpirun` and/or `slurm` but semantics the same.
- Unclear "launch protocol" (i.e. `sshd` in userspace, directory layout, etc).
- Option `-R` in `ssh` does not work. Consequently, `rsync` does not work too!
- Nothing has been changing in three years.
- Too many external Python dependencies.
A real-world usage example follows. This command spawns a job in region SR006
in a container based on the `$image` image. The job command is `env`, i.e. it
simply prints the environment variables to stdout.

    python -m mlspace -i "$image" -r SR006 -- env

Inject the environment variable `VAR=VAL` into the job's environment and run
the command locally (useful for testing).

    python -m mlspace -e VAR=VAL -l -- env

Run Gateway API service for testing.

    python -m mlspace.testing -H localhost -p 8080

Build with CMake.

    cmake -S . -B build -G 'Ninja Multi-Config' -DCMAKE_CXX_COMPILER=clang++
    cmake --build build --config Release

Build Python wheels, either locally or with `cibuildwheel`.

    python -m build -nvw
    cibuildwheel --only cp313-manylinux_x86_64

- Preparation.
  - Serialize job launch parameters to JSON.
  - Encode JSON to base64.
  - Split base64-encoded JSON into chunks of 64kB.
- Allocation (via Gateway v2 public API).
  - Specify `launch` binary as `script` parameter of `type=binary` job.
  - Enumerate all base64-encoded chunks with `--spec-part-#` options and
    combine them into the `flags` associative array.
  - Add `--spec-version` and `--spec-num-parts` options to the `flags` array.
  - Add `--spec-sha256` for checksum verification.
  - Submit job for execution.
- Launching (via `launch` binary).
  - Process command line arguments and restore original base64-encoded JSON.
  - Decode JSON to job spec.
  - Validate job spec.
  - Run target binary in common `fork`/`execvpe`/`wait` way.
    - Change working directory.
    - Update environment.
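
The preparation, allocation, and launching steps can be sketched in Python as
a round trip. This is only an illustration under stated assumptions, not the
actual implementation: the function names `spec_to_flags`, `flags_to_spec`,
and `run_target` are hypothetical, as are the 0-based part numbering, the
`SPEC_VERSION` value, the leading dashes in the `flags` keys, and the choice
to compute the SHA-256 checksum over the base64 payload rather than over the
raw JSON.

```python
import base64
import hashlib
import json
import os

CHUNK_SIZE = 64 * 1024  # 64kB chunks, as in the preparation step
SPEC_VERSION = "1"      # hypothetical value; the real one is not documented here


def spec_to_flags(spec: dict) -> dict:
    """Serialize a job spec to JSON, base64-encode it, and split it into flags."""
    payload = base64.b64encode(json.dumps(spec).encode("utf-8")).decode("ascii")
    parts = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]
    # Assumption: parts are numbered from 0 and keys keep their leading dashes.
    flags = {f"--spec-part-{i}": part for i, part in enumerate(parts)}
    flags["--spec-version"] = SPEC_VERSION
    flags["--spec-num-parts"] = str(len(parts))
    # Assumption: the checksum covers the base64 payload, not the raw JSON.
    flags["--spec-sha256"] = hashlib.sha256(payload.encode("ascii")).hexdigest()
    return flags


def flags_to_spec(flags: dict) -> dict:
    """Restore the job spec from flags, verifying the checksum first."""
    num_parts = int(flags["--spec-num-parts"])
    payload = "".join(flags[f"--spec-part-{i}"] for i in range(num_parts))
    digest = hashlib.sha256(payload.encode("ascii")).hexdigest()
    if digest != flags["--spec-sha256"]:
        raise ValueError("job spec checksum mismatch")
    return json.loads(base64.b64decode(payload))


def run_target(argv: list, cwd: str, env: dict) -> int:
    """Run argv in a child process the fork/execvpe/wait way; return its exit code."""
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.chdir(cwd)                                     # change working directory
            os.execvpe(argv[0], argv, {**os.environ, **env})  # update environment
        finally:
            os._exit(127)  # reached only if chdir or exec failed
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

For example, `flags_to_spec(spec_to_flags(spec))` should return `spec`
unchanged, and `run_target(["env"], ".", {"VAR": "VAL"})` prints the
environment with `VAR=VAL` added. Note that `os.fork` is POSIX-only and
`os.waitstatus_to_exitcode` needs Python 3.9 or newer.
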
- Create non-privileged user `user`.
  - It must have ids `1000:1000` with home at `/home/user`.
  - The following directories must exist and be owned by `user:user`:
    `/home/jovyan`, `/home/user`, `/tmp/.jupyter`, and `/tmp/.jupyter_data`.
- Set up `sshd` in userspace.
  - No `PAM`.
  - Allow `ssh-rsa` keys.
  - PID file must be in `/run/sshd/sshd.pid`.
  - Everything must be owned by `user`.
- Set up `hpc-x>=2.18` (`hpc-x>=2.21` on `ubuntu:24.04`).
  - Install directory must be `/opt/hpcx`.
  - Configure environment variables `LD_LIBRARY_PATH`, `PATH`, and
    `OPAL_PREFIX`.
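
Some of the image requirements above can be checked from inside a container
with a short script. The sketch below is an illustration only: the helper
names `check_owned_dirs` and `check_hpcx_env` are hypothetical, and the rule
that each HPC-X variable should mention the `/opt/hpcx` prefix is an
assumption, since the requirements only say that the variables must be
configured.

```python
import os


def check_owned_dirs(paths, uid, gid):
    """Return a list of problems; an empty list means every directory is fine."""
    problems = []
    for path in paths:
        if not os.path.isdir(path):
            problems.append(f"{path}: missing directory")
            continue
        st = os.stat(path)
        if (st.st_uid, st.st_gid) != (uid, gid):
            problems.append(
                f"{path}: owned by {st.st_uid}:{st.st_gid}, expected {uid}:{gid}"
            )
    return problems


def check_hpcx_env(environ, prefix="/opt/hpcx"):
    """Return a list of problems with the HPC-X related environment variables."""
    problems = []
    for name in ("LD_LIBRARY_PATH", "PATH", "OPAL_PREFIX"):
        # Assumption: a configured variable mentions the install prefix.
        if prefix not in environ.get(name, ""):
            problems.append(f"{name}: does not mention {prefix}")
    return problems


if __name__ == "__main__":
    dirs = ["/home/jovyan", "/home/user", "/tmp/.jupyter", "/tmp/.jupyter_data"]
    for problem in check_owned_dirs(dirs, 1000, 1000) + check_hpcx_env(os.environ):
        print(problem)
```

Run inside the image, the script prints one line per violated requirement and
prints nothing when these checks pass. It does not cover the `sshd` setup.
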