UT-Austin-RPL/metamon

Paper | Website | Discord


Metamon enables reinforcement learning (RL) research on Pokémon Showdown by providing:

  1. 20+ pretrained policies ranging from ~average to high-level human play.
  2. A dataset of >4M (and counting) trajectories "reconstructed" from real human battles.
  3. A dataset of >20M (and counting) trajectories generated by self-play between agents.
  4. Starting points for training (or finetuning) your own imitation learning (IL) and RL policies.
  5. A standardized suite of teams and heuristic opponents for evaluation.

Metamon is the codebase behind "Human-Level Competitive Pokémon via Scalable Offline RL and Transformers" (RLC, 2025). Please check out our project website for an overview of our original results. After the release of our conference paper, metamon served as a starter kit and winning baseline for the NeurIPS 2025 PokéAgent Challenge, which motivated significant improvements to our results and datasets.


Figure 1

Supported Rulesets

Pokémon Showdown hosts many different rulesets spanning nine generations of the video game franchise. Metamon initially focused on the most popular singles ruleset ("OverUsed") for Generations 1, 2, 3, and 4 but has recently expanded to include Generation 9 OverUsed (OU). We also support the UnderUsed (UU), NeverUsed (NU), and Ubers tiers for Generations 1, 2, 3, and 4 – though constant rule changes and small dataset sizes have always made these a bit of an afterthought.


Table of Contents

  1. Installation

  2. Quick Start

  3. Pretrained Models

  4. Battle Datasets

  5. Team Sets

  6. Baselines

  7. Observation Spaces, Action Spaces, & Reward Functions

  8. Training and Evaluation

  9. Other Datasets

  10. Battle Backends

  11. Team Preview

  12. FAQ

  13. Acknowledgements

  14. Citation




Installation

Metamon is written and tested for Linux and Python 3.10+. We recommend creating a fresh virtual environment or conda environment:

conda create -n metamon python==3.10
conda activate metamon

Then, install with:

git clone --recursive git@github.com:UT-Austin-RPL/metamon.git
cd metamon
pip install -e .

To install Pokémon Showdown, we'll need a modern version of npm / Node.js (instructions here). Note that Showdown undergoes constant updates... breaking changes are rare, but do happen. The version that downloads with this repo (metamon/server) is always supported.

cd server/pokemon-showdown
npm install

We will need to have the Showdown server running in the background while using Metamon:

# in the background (`screen`, etc.)
node pokemon-showdown start --no-security
# no-security removes battle speed throttling and password requirements on your local server

If necessary, we can customize the server settings (config/config.js) or the rules for each game mode.

Verify that installation has gone smoothly with:

# run a few test battles on the local server
python -m metamon.env

Metamon provides large datasets of Pokémon team files, human battles, and other statistics that will automatically download when requested. Specify a path with:

# add to ~/.bashrc
export METAMON_CACHE_DIR=/path/to/plenty/of/disk/space



Quick Start

Metamon makes it easy to turn Pokémon into an RL research problem. Pick a set of Pokémon teams to play with, an observation space, an action space, and a reward function:

from metamon.env import get_metamon_teams
from metamon.interface import DefaultObservationSpace, DefaultShapedReward, DefaultActionSpace

team_set = get_metamon_teams("gen1ou", "competitive")
obs_space = DefaultObservationSpace()
reward_fn = DefaultShapedReward()
action_space = DefaultActionSpace()

Then, battle against built-in baselines (or any poke_env.Player):

from metamon.env import BattleAgainstBaseline
from metamon.baselines import get_baseline

env = BattleAgainstBaseline(
    battle_format="gen1ou",
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    team_set=team_set,
    opponent_type=get_baseline("Gen1BossAI"),
)

# standard `gymnasium` environment
obs, info = env.reset()
next_obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
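A full battle against the baseline is just the usual gymnasium rollout loop. Here is a minimal sketch (assuming the five-value step API shown above), with the random action standing in for your policy:

obs, info = env.reset()
terminated, truncated = False, False
while not (terminated or truncated):
    # replace the random sample with your policy's action
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)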

The more flexible option is to request battles on our local Showdown server and battle anyone else who is online (humans, pretrained agents, or other Pokémon AI projects). If it plays Showdown, we can battle against it!

from metamon.env import QueueOnLocalLadder

env = QueueOnLocalLadder(
    battle_format="gen1ou",
    player_username="my_scary_username",
    num_battles=10,
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    player_team_set=team_set,
)

Metamon's main feature is a dataset of "reconstructed" human demonstrations that is compatible with these environments:

from metamon.data import ParsedReplayDataset

human_dset = ParsedReplayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    formats=["gen1ou"],
)
obs_seq, action_seq, reward_seq, done_seq = human_dset[0]
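Each item is one reconstructed battle under the chosen observation space, action space, and reward function. As a quick sanity check, we might loop over a few trajectories and total their rewards (assuming the dataset supports standard len()/indexing as shown above):

# inspect a few reconstructed battles
for i in range(min(3, len(human_dset))):
    obs_seq, action_seq, reward_seq, done_seq = human_dset[i]
    print(f"battle {i}: {len(action_seq)} actions, return {sum(reward_seq):.1f}")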

We can also load a starting dataset of self-play trajectories generated by the metamon project:

from metamon.data import SelfPlayDataset

selfplay_dset = SelfPlayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    formats=["gen1ou"],
    subset="pac-base",  # or "pac-exploratory"
)

We can save our own agents' experience in the same format:

from metamon.data import MetamonDataset

env = QueueOnLocalLadder(
    ...,  # rest of args as in the example above
    save_trajectories_to="my_data_path",
)
online_dset = MetamonDataset(
    dset_root="my_data_path",
    formats=["gen9ou"],  # match your env format
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
)

obs, info = env.reset()
terminated = False
while not terminated:
    *_, terminated, _, _ = env.step(env.action_space.sample())

# find completed battles before loading examples
online_dset.refresh_files()

You are free to use this data to train an agent however you'd like, but we provide starting points for smaller-scale IL (python -m metamon.il.train) and RL (python -m metamon.rl.train), and a large set of pretrained models from our paper.




Pretrained Models

We have made every checkpoint of 29 models available on huggingface at jakegrigsby/metamon. You will need to install amago, which is an RL codebase by the same authors. Follow instructions here.

Figure 1

Load and run pretrained models with metamon.rl.evaluate. For example:

python -m metamon.rl.evaluate --eval_type heuristic --agent Kakuna --gens 1 --formats ou --total_battles 100

This will run the default checkpoint of the best model for 100 battles against the set of heuristic baselines highlighted in the paper.

Or to battle against whatever is logged onto the local Showdown server (including other pretrained models that are already waiting):

python -m metamon.rl.evaluate --eval_type ladder --agent Kakuna --gens 1 --formats ou --total_battles 50 --username <pick unique username> --team_set competitive

Featured Policies

There are now 29 official metamon models. Most of them were stepping stones to later (better) versions, and are now mainly useful as baselines or extra opponents in self-play data collection. Some notable exceptions worth knowing about are:

SyntheticRLV2 (200M, Sep 2024)
    Original paper's best policy. Remains the basis of several successful third-party efforts to specialize in Gen1. Most previous models have complete human ratings (see Paper Policies below), but we have become a lot more cautious about laddering.
    Ladder ratings with sample teams (GXE): G1 77%, G2 68%, G3 64%, G4 66%

Abra (57M, Jul 2025)
    The best gen9ou agent that was open-sourced during the PokéAgent Challenge, and therefore the basis of many of the best third-party metamon extensions.
    Ladder ratings with sample teams (GXE): G9 50%

Kadabra3 (57M, Sep 2025)
    The best policy trained in time to participate in the PokéAgent Challenge (as an organizer baseline). #1 in the Gen1OU qualifier and #2 in Gen9OU behind foul-play.
    Ladder ratings with sample teams (GXE): G1 80%, G9 64%

Alakazam (57M, Sep 2025)
    The final version of the PokéAgent Challenge effort. Patched a bug that made tera types invisible to the policy, which makes it the best candidate for future work at this model size.

Kakuna (142M, Dec 2025)
    The best public metamon model, leading by nearly every metric. Trained on diverse teams to serve as a strong foundation for further research in any gen. Appears on all 5 OU leaderboards and is consistently 1500+ Elo in Gen1OU.
    Ladder ratings with sample teams (GXE): G1 82%, G2 70%, G3 63%, G4 64%, G9 71%

Models can be loosely divided into two eras of active development:

  1. RLC Paper (Jan 2024 – Feb 2025): Trained on Gen 1-4 with old versions of the replay dataset and team sets.
  2. NeurIPS PokéAgent Challenge (July – November 2025): Basically restarted from scratch. Broadly speaking, we reduced model sizes, reward shaping, and the paper's emphasis on long-term memory while improving generalization over diverse team choices and prioritizing support for gen9ou. However, it took several iterations to recover the paper's Gen 1-4 performance.

Paper Policies

Paper policies play Gens 1-4 and are discussed in detail in the RLC 2025 paper. Some model sizes have several variants testing different RL objectives. See metamon/rl/pretrained.py for a complete list.

Model Name (--agent)       Description
SmallIL (2 variants)       15M imitation learning model trained on 1M human battles
SmallRL (5 variants)       15M actor-critic model trained on 1M human battles
MediumIL                   50M imitation learning model trained on 1M human battles
MediumRL (3 variants)      50M actor-critic model trained on 1M human battles
LargeIL                    200M imitation learning model trained on 1M human battles
LargeRL                    200M actor-critic model trained on 1M human battles
SyntheticRLV0              200M actor-critic model trained on 1M human + 1M diverse self-play battles
SyntheticRLV1              200M actor-critic model trained on 1M human + 2M diverse self-play battles
SyntheticRLV1_SelfPlay     SyntheticRLV1 fine-tuned on 2M extra battles against itself
SyntheticRLV1_PlusPlus     SyntheticRLV1 finetuned on 2M extra battles against diverse opponents
SyntheticRLV2              Final 200M actor-critic model with value classification trained on 1M human + 4M diverse self-play battles

Here is a reference of human evals for key models according to our paper:

Figure 1

PokéAgent Challenge Policies

Policies trained during the PokéAgent Challenge play Gens 1-4 and 9, but have a clear bias towards Gen 1 OU and Gen 9 OU. Their docstrings in metamon/rl/pretrained.py have some extra discussion and eval metrics.

Model Name (--agent)                     Description
SmallRLGen9Beta                          Prototype 15M actor-critic model trained after the dataset was expanded to include Gen9OU.
Abra                                     57M actor-critic trained on parsed-replays v3 and a small set of synthetic battles. First of a new series of Gen9OU-compatible policies trained in a similar style to the paper's "Synthetic" agents.
Kadabra, Kadabra2, Kadabra3, Kadabra4    Further extensions of Abra to larger datasets of self-play battles (> 11M), trained and deployed as organizer baselines throughout the PokéAgent Challenge practice ladder.
Alakazam                                 Considered the final edition of the main PokéAgent Challenge effort. Patches a bug that impacted tera type visibility. Actually slightly worse than Kadabra3/4 with competitive teams, but is more robust to diverse team choices thanks to a larger dataset.
Minikazam                                4.7M RNN trained on parsed-replays v4 and a large dataset of self-play battles. Tries to compensate for low parameter count by training on Alakazam's dataset. Creates a decent starting point for finetuning on any GPU. Evals here.
Superkazam                               An attempt to revisit Alakazam's (11M self-play + 4M human replay) dataset at a model size closer to the original paper (142M). Evals here.
Kakuna                                   The best public metamon agent. Superkazam finetuned on 7M additional self-play battles collected at higher sampling temperature for improved exploration and value estimation. Reduced sampling weight of human replays to prioritize high-Elo self-play data. Compensates for our inattention to Gens 2-4 during the PokéAgent Challenge. Evals here.

Internal Leaderboards

Human ratings above are the best way to anchor performance to an external metric, but we primarily rely on self comparisons across generations and team sets to guide new research. We typically use head-to-head comparisons between key baselines: see this Kakuna eval as an example. But we can get a general sense of the relative strength of metamon over time by turning policies loose on a locally hosted Showdown ladder and sampling from the same TeamSet.

Gold = PokéAgent Challenge policy, Pink = Paper policy.

Tip

These GXE values are a measure of performance relative to the listed models and have no connection to ratings on the public ladder.

Early Gen OU Local GXE
Model Competitive TeamSet Modern Replays TeamSet Avg Rank
G1 G2 G3 G4 G1 G2 G3 G4
Kakuna 75% 66% 63% 60% 68% 71% 67% 69% 1.0
Superkazam 67% 63% 59% 58% 64% 61% 62% 61% 3.0
Kadabra4 66% 60% 58% 58% 68% 60% 66% 63% 3.5
Kadabra3 68% 61% 57% 57% 67% 60% 60% 60% 4.0
Kadabra2 67% 60% 58% 57% 64% 62% 59% 60% 4.4
Alakazam 66% 59% 56% 57% 64% 58% 61% 58% 5.5
SynRLV2 50% 59% 55% 55% 54% 61% 55% 56% 6.9
Kadabra 56% 50% 47% 47% 55% 53% 50% 54% 7.9
SynRLV1++ 43% 47% 41% 45% 47% 49% 48% 48% 10.0
SynRLV1 43% 39% 42% 46% 46% 45% 44% 49% 10.2
SynRLV0 41% 38% 48% 40% 45% 41% 49% 45% 11.1
Abra 39% 44% 44% 45% 40% 45% 48% 48% 11.2
SmallRLGen9Beta 44% 42% 45% 48% 12.0
LargeRL 25% 35% 39% 39% 30% 39% 41% 44% 13.9
Minikazam 39% 34% 34% 34% 41% 36% 36% 39% 14.6
SmallILFA 24% 36% 39% 35% 28% 35% 38% 41% 14.8

Tip

Paper Policies are (predictably) weak in Gen9OU because they were never trained to play the format and use observation spaces that assume Team Preview is not available.

Gen9OU Local GXE
Model Competitive TeamSet Modern Replays TeamSet Avg Rank
Kakuna 76% 74% 1.0
Superkazam 75% 73% 2.5
Kadabra4 75% 73% 2.5
Kadabra3 73% 71% 4.5
Kadabra2 73% 69% 5.0
Alakazam 73% 71% 5.5
Abra 61% 57% 7.0
SmallRLGen9Beta 56% 57% 8.5
Kadabra 58% 55% 8.5
Minikazam 50% 50% 10.0
SynRLV0 32% 36% 11.5
SynRLV2 32% 38% 11.5
SynRLV1++ 32% 33% 13.5
LargeRL 29% 34% 14.0
SynRLV1 31% 32% 14.5
SmallILFA 23% 27% 16.0



Battle Datasets

Metamon provides two types of offline RL datasets in a flexible format that lets you customize observations, rewards, and actions on-the-fly.

Human Replay Datasets

Showdown creates "replays" of battles that players can choose to upload to the website before they expire. We gathered all surviving historical replays for Gen 1-4 OU/NU/UU/Ubers and Gen 9 OU, and continuously save new battles to grow the dataset.

Dataset Overview

Datasets are stored on huggingface in two formats:

Name Size Description
metamon-raw-replays 2M Battles Our curated set of Pokémon Showdown replay .json files... to save the Showdown API some download requests and to maintain an official reference of our training data. Will be regularly updated as new battles are played and collected.
metamon-parsed-replays 4M Trajectories The RL-compatible version of the dataset as reconstructed by the replay parser. This dataset has been significantly expanded and improved since the original paper.

Parsed replays will download automatically when requested by the ParsedReplayDataset, but these datasets are large. Download in advance with:

python -m metamon.data.download parsed-replays

from metamon.data import ParsedReplayDataset

replay_dset = ParsedReplayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_func,
    formats=["gen1ou", "gen9ou"],
)
obs_seq, action_seq, reward_seq, done_seq = replay_dset[0]

Server/Replay Sim2Sim Gap

In Showdown RL, we have to embrace a mismatch between the trajectories we observe in our own battles and those we gather from other players' replays. In short, replays are saved from the point-of-view of a spectator rather than the point-of-view of a player. The server sends info to the players that it does not save to its replay, and we need to try and simulate that missing info. Metamon goes to great lengths to handle this, and is always improving (more info here), but there is no way to be perfect.

Therefore, replay data is perhaps best viewed as pretraining data for an offline-to-online finetuning problem. Self-collected data from the online env fixes inaccuracies and can help concentrate on teams we'll be using on the ladder. The whole project is now set up to do this (see Quick Start), and we have open-sourced large self-play sets (below).


Self-Play Datasets

Almost all improvement in metamon's performance is driven by large and diverse datasets of agent vs. agent battles. Public self-play datasets are stored on huggingface at jakegrigsby/metamon-parsed-pile. Trajectories were generated by the rl/self_play launcher with various team sets and model pools.

There are currently two subsets:

Name Size Description
pac-base 11M Trajectories Partially comprised of battles played by organizer baselines on the PokéAgent Challenge practice ladder, but the vast majority are battles collected locally for the purposes of training the Abra, Kadabra, and Alakazam line of policies. The version uploaded here trained Alakazam, and previous models were trained on subsets of this dataset.
pac-exploratory 7M Trajectories Self-play revisited after the NeurIPS challenge with higher sampling temperature (to improve value estimates of sub-optimal actions). Notably also includes battles of official metamon policies against PA-Agent (the winning team of the gen1ou tournament), who trained a great policy by (~overfitting) SynRLV2 to the "competitive" gen1ou team set. This has inspired a fresh approach of distilling specialized policies back into the main line models. Kakuna was trained on metamon-parsed-replays, pac-base, and pac-exploratory.

Self-play data will download automatically when requested by the SelfPlayDataset, but these datasets are large. Download in advance with:

python -m metamon.data.download self-play

This downloads both subsets for all available formats (gen1ou, gen2ou, gen3ou, gen4ou, gen9ou). You can also specify formats explicitly: --formats gen1ou gen9ou.

from metamon.data import SelfPlayDataset

self_play_dset = SelfPlayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_func,
    subset="pac-base",  # or "pac-exploratory"
    formats=["gen1ou", "gen9ou"],
)
obs_seq, action_seq, reward_seq, done_seq = self_play_dset[0]

Self-play datasets are currently only available in the parsed replay format, which makes them liable to be deprecated should that format change or a major bug in the battle backend be found. When/if this happens, the replay parser would be expanded to parse ground-truth battle logs and the datasets would be re-released as a noisier aggregate of all the logs from every metamon development server during the same time period.




Team Sets

Team sets are directories of Showdown team files that are randomly sampled between episodes. They are stored on huggingface at jakegrigsby/metamon-teams and can be downloaded in advance with python -m metamon.data.download teams.

metamon.env.get_metamon_teams(battle_format : str, set_name : str)
set_name              Teams Per Battle Format                                    Description
"competitive"         Varies (< 30)                                              Human-made teams scraped from forum threads. These are usually official "sample teams" designed by experts for beginners, but we are less selective for non-OU tiers. This is the set used for human ladder evaluations in the paper.
"paper_variety"       1k (Gen 1-4 Only)                                          Procedurally generated teams with unrealistic OOD lead-off Pokémon. The paper calls this the "variety set". Movesets were generated by sampling from all-time usage stats.
"paper_replays"       1k (Gen 1-4 OU Only)                                       Predicted teams from replays. The paper calls this the "replay set". Surpassed by the "modern_replays" set below. Used the original prediction strategy of sampling from all-time usage stats.
"modern_replays"      8k-20k (OU Only)                                           Predicted teams based on recent replays using the best prediction strategy we have available for each generation. The result is a diverse set representing the recent metagame with blanks filled by a mixture of historical trends.
"modern_replays_v2"   Gen1: 19k, Gen2: 13k, Gen3: 31k, Gen4: 27k, Gen9: 158k     An expanded set of replay-predicted teams; updated with Summer 2025 replays.

The HF readme has more information.

We can also use our own directory of team files with, for example:

from metamon.env import TeamSet

team_set = TeamSet("/path/to/your/team/dir", "gen3ou")  # (team_dir, battle_format)

But note that files would need to have the extension ".{battle_format}_team" (e.g., .gen3nu_team).
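If the teams were exported from the Showdown teambuilder as plain .txt files, one way to apply the required extension is a quick rename pass; this is just a generic sketch (the paths and the gen3ou format are hypothetical):

from pathlib import Path

team_dir = Path("/path/to/your/team/dir")
for team_file in team_dir.glob("*.txt"):
    # e.g. my_team.txt -> my_team.gen3ou_team
    team_file.rename(team_file.with_suffix(".gen3ou_team"))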




Baselines

baselines/ contains baseline opponents that we can battle against via BattleAgainstBaseline. baselines/heuristics provides more than a dozen heuristic opponents and starter code for developing new ones (or mixing ground-truth Pokémon knowledge into ML agents). baselines/model_based ties the simple IL model checkpoints to poke-env (with CPU inference).

Here is an overview of the opponents mentioned in the paper:

from metamon.baselines import get_baseline, get_all_baseline_names
opponent = get_baseline(name)  # Get specific baseline
available = get_all_baseline_names()  # List all available baselines
name Description
BugCatcher An actively bad trainer that always picks the least damaging move. When forced to switch, picks the Pokémon in its party with the worst type matchup vs. the player.
RandomBaseline Selects a legal move (or switch) uniformly at random and measures the most basic level of learning early in training runs.
Gen1BossAI Emulates opponents in the original Pokémon Generation 1 games. Usually chooses random moves. However, it prefers using stat-boosting moves on the second turn and “super effective” moves when available.
Grunt A maximally offensive player that selects the move dealing the greatest damage against the current opposing Pokémon (using the Pokémon damage equation and a type chart), and selects the best matchup by type when forced to switch.
GymLeader Improves upon Grunt by additionally taking into account factors such as health. It prioritizes using stat boosts when the current Pokémon is very healthy, and heal moves when unhealthy.
PokeEnvHeuristic The SimpleHeuristicsPlayer baseline provided by poke-env with configurable difficulty (shortcuts like EasyPokeEnvHeuristic).
EmeraldKaizo An adaptation of the AI in a Pokémon Emerald ROM hack intended to be as difficult as possible. It selects actions by scoring the available options against a rule set that includes handwritten conditional statements for a large portion of the moves in the game.
BaseRNN A simple RNN IL policy trained on an early version of our parsed replay dataset. Runs inference on CPU.

Compare baselines with:

python -m metamon.baselines.compete --battle_format gen2ou --player GymLeader --opponent RandomBaseline --battles 10

Here is a reference for the relative strength of some heuristic baselines from the paper:

Figure 1




Observation Spaces, Action Spaces, & Reward Functions

Metamon tries to separate the RL from Pokémon. All we need to do is pick an ObservationSpace, ActionSpace, and RewardFunction:

  1. The environment outputs a UniversalState
  2. Our ObservationSpace maps the UniversalState to the input of our agent.
  3. Our agent outputs an action however we'd like.
  4. Our ActionSpace converts the agent's choice to a UniversalAction.
  5. The environment takes the current (UniversalState, UniversalAction) and outputs the next UniversalState. Our RewardFunction gives the agent a scalar reward.
  6. Repeat until victory.

Observations

UniversalState defines all the features we have access to at each timestep.

The ObservationSpace packs those features into a policy input. We can create a custom version with more or fewer features by inheriting from metamon.interface.ObservationSpace.

Observation Space Description
DefaultObservationSpace The text/numerical observation space used in our paper.
ExpandedObservationSpace A slight improvement based on lessons learned from the paper. It also adds tera types for Gen 9.
TeamPreviewObservationSpace Further extends ExpandedObservationSpace with a preview of the opponent's team (for Gen 9).
OpponentMoveObservationSpace Modifies TeamPreviewObservationSpace to include the opponent Pokémon's revealed moves. Continues our trend of deemphasizing long-term memory.
Tokenization

Text features have variable length, but we can translate them to integer IDs using a list of known vocab words. The built-in observation spaces are designed so that the "tokenized" version has a fixed length.

from metamon.interface import TokenizedObservationSpace, DefaultObservationSpace
from metamon.tokenizer import get_tokenizer

base_obs = DefaultObservationSpace()
tokenized_space = TokenizedObservationSpace(
    base_obs_space=base_obs,
    tokenizer=get_tokenizer("DefaultObservationSpace-v0"),
)
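The tokenized space can then be dropped in wherever an observation space is expected. Here is a sketch reusing the Quick Start environment (assuming the wrapper is accepted directly by the env constructors):

from metamon.env import BattleAgainstBaseline, get_metamon_teams
from metamon.interface import DefaultActionSpace, DefaultShapedReward
from metamon.baselines import get_baseline

env = BattleAgainstBaseline(
    battle_format="gen1ou",
    observation_space=tokenized_space,  # tokenized text + numerical features
    action_space=DefaultActionSpace(),
    reward_function=DefaultShapedReward(),
    team_set=get_metamon_teams("gen1ou", "competitive"),
    opponent_type=get_baseline("Gen1BossAI"),
)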

The vocabs are in metamon/tokenizer; they are generated by tracking unique words across the entire replay dataset, with an unknown token for rare cases we may have missed.

Tokenizer Name Description
allreplays-v3 Legacy version for pre-release models.
DefaultObservationSpace-v0 Updated post-release vocabulary as of metamon-parsed-replays dataset v2.
DefaultObservationSpace-v1 Updated vocabulary as of metamon-parsed-replays dataset v3-beta (adds ~1k words for Gen 9).

Actions

Metamon uses a fixed UniversalAction space of 13 discrete choices:

  • {0, 1, 2, 3} use the active Pokémon's moves in alphabetical order.
  • {4, 5, 6, 7, 8} switch to the other Pokémon in the party in alphabetical order.
  • {9, 10, 11, 12} are wildcards for generation-specific gimmicks. Currently, they only apply to Gen 9, where they pick moves (in alphabetical order) with terastallization.
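Purely as an illustration of this indexing convention (not metamon API), here is how the 13 indices would line up for a hypothetical Gen 9 active Pokémon whose moves, alphabetically, are Earthquake, Protect, Rock Slide, and Swords Dance:

# illustrative only: meaning of each UniversalAction index for one example turn
universal_action_meaning = {
    0: "use Earthquake",             # moves in alphabetical order
    1: "use Protect",
    2: "use Rock Slide",
    3: "use Swords Dance",
    4: "switch to 1st other party Pokémon (alphabetical)",
    5: "switch to 2nd other party Pokémon",
    6: "switch to 3rd other party Pokémon",
    7: "switch to 4th other party Pokémon",
    8: "switch to 5th other party Pokémon",
    9: "terastallize + Earthquake",  # Gen 9 wildcard slots
    10: "terastallize + Protect",
    11: "terastallize + Rock Slide",
    12: "terastallize + Swords Dance",
}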

That might not be how we want to set up our agent. The ActionSpace converts between whatever the output of the policy might be and the UniversalAction.

Action Space Description
DefaultActionSpace Standard discrete space of 13 choices; supports Gen 9.
MinimalActionSpace The original space of 9 choices (4 moves + 5 switches), which is all we need for Gens 1-4.

Any new action spaces would be added to metamon.interface.ALL_ACTION_SPACES. A text action space (for LLM-Agents) is on the short-term roadmap.

Rewards

Reward functions assign a scalar reward based on consecutive states (R(s, s')).

Reward Function Description
DefaultShapedReward Shaped reward used by the paper. +/- 100 for win/loss, light shaping for damage dealt, health recovered, status received/inflicted.
BinaryReward Removes the smaller shaping terms and simply provides +/- 100 for win/loss.
AggressiveShapedReward Edits DefaultShapedReward's sparse reward to +200 for winning and 0 for losing.

Any new reward functions would be added to metamon.interface.ALL_REWARD_FUNCTIONS, and we can implement a new one by inheriting from metamon.interface.RewardFunction.
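As a rough sketch of what a custom reward might look like, here is a win-only variant with a small per-turn penalty. The reward(state, next_state) method name and the battle_won / battle_lost attributes below are assumptions made for illustration; check metamon.interface.RewardFunction for the actual abstract interface before adapting this:

from metamon.interface import RewardFunction

class TimePenaltyReward(RewardFunction):
    """+/-100 for win/loss plus a small per-turn penalty to encourage shorter games.

    NOTE: the method name and state attributes are placeholders; match them to
    the real metamon.interface.RewardFunction abstract interface.
    """

    def reward(self, state, next_state) -> float:
        r = -0.1  # small per-step penalty
        if getattr(next_state, "battle_won", False):
            r += 100.0
        elif getattr(next_state, "battle_lost", False):
            r -= 100.0
        return r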



Training and Evaluation

Metamon & Amago Diagram

We trained all of our main RL & IL models with amago. Everything you need to train your own model on metamon data and evaluate against Pokémon baselines is provided in metamon/rl/.

Configure wandb logging (optional):

cd metamon/rl/
export METAMON_WANDB_PROJECT="my_wandb_project_name"
export METAMON_WANDB_ENTITY="my_wandb_username"

Train From Scratch

See python train.py --help for options. The training script implements offline RL on the human battle dataset and an optional extra dataset of self-play battles you may have collected.

We might retrain the "SmallIL" model like this:

python -m metamon.rl.train --run_name AnyNameHere --model_gin_config small_agent.gin --train_gin_config il.gin --save_dir ~/my_checkpoint_path/ --log

"SmallRL" would be the same command with --train_gin_config exp_rl.gin. Scan rl/pretrained.py to see the configs used by each pretrained agent. Larger training runs take days to complete and can (optionally) use mulitple GPUs (link). An example of a smaller RNN config is provided in small_rnn.gin.


Finetune from HuggingFace

See python finetune_from_hf.py --help to finetune an existing model to a new dataset, training objective, or reward function!

It provides the same setup as the main train script but takes care of downloading and matching the config details of our public models. Finetuning will inherit the architecture of the base model but allows for changes to the --train_gin_config and --reward_function. Note that the best settings for quick finetuning runs are likely different from the original run!

We might finetune "SmallRL" to the new gen 9 replay dataset and custom battles like this:

python -m metamon.rl.finetune_from_hf --finetune_from_model SmallRL --run_name MyCustomSmallRL --save_dir ~/metamon_finetunes/ --custom_replay_dir /my/custom/parsed_replay_dataset --custom_replay_weight .25 --epochs 10 --steps_per_epoch 10000 --log --formats gen9ou --eval_gens 9 

You can start from any checkpoint number with --finetune_from_ckpt. See the huggingface repo for a full list. Defaults to the official eval checkpoint.


Customize

Customize the agent architecture by creating new rl/configs/models/ .gin files. Customize the RL hyperparameters by creating new rl/configs/training/ files. Here is a link to a lot more information about configuring training runs. amago is modular, and you can swap just about any piece of the agent with your own ideas. Here is a link to more information about custom components.


Evaluate a Custom Model

metamon.rl.evaluate provides quick-setup evals (pretrained_vs_baselines, pretrained_vs_local_ladder, and pretrained_vs_pokeagent_ladder). Full explanations are provided in the source file.

To eval a custom agent trained from scratch (rl.train) we'd create a LocalPretrainedModel. LocalFinetunedModel provides some quick setup for models finetuned with rl.finetune_from_hf. examples/evaluate_custom_models.py shows an example for each, and deploys them on the PokéAgent Ladder!

Standalone Toy IL (Deprecated)

il/ is old toy code that does basic behavior cloning with RNNs. We used it to train early learning-based baselines (BaseRNN, WinsOnlyRNN, and MiniRNN) that you can play against with the BattleAgainstBaseline env. We may add more of these as the dataset grows/improves and more architectures are tried. Playing around with this code might be an easier way to get started, but note that the main rl/train script can also be configured to do RNN BC... but faster and on multiple GPUs.

Get started with something like:

cd metamon/il/
python train.py --run_name any_name_will_do --model_config configs/transformer_embedding.gin  --gpu 0


Other Datasets

To support the main raw-replays, parsed-replays, and teams datasets, metamon creates a few resources that may be useful for other purposes:

Usage Stats

Showdown records the frequency of team choices (items, moves, abilities, etc.) brought to battles in a given month. The community mainly uses this data to consider rule changes, but we use it to help predict missing details of partially revealed teams. We load data for an arbitrary window of history around the date a battle was played, and fall back to all-time stats for rare Pokémon where data is limited:

from metamon.backend.team_prediction.usage_stats import get_usage_stats
from datetime import date
usage_stats = get_usage_stats("gen1ou",
    start_date=date(2017, 12, 1),
    end_date=date(2018, 3, 30)
)
alakazam_info: dict = usage_stats["Alakazam"] # non alphanum chars and case are flexible

Download usage stats in advance with:

python -m metamon.data.download usage-stats

The data is stored on huggingface at jakegrigsby/metamon-usage-stats.

Revealed Teams

One of the main problems the replay parser has to solve is predicting a player's full team based on the "partially revealed" team at the end of the battle. As part of this, we record the revealed team in the standard Showdown team builder format, but with some magic keywords for missing elements. For example:

Tyranitar @ Custap Berry
Ability: Sand Stream
EVs: $missing_ev$ HP / $missing_ev$ Atk / $missing_ev$ Def / $missing_ev$ SpA / $missing_ev$ SpD / $missing_ev$ Spe
$missing_nature$ Nature
IVs: 31 HP / 31 Atk / 31 Def / 31 SpA / 31 SpD / 31 Spe
- Stealth Rock
- Stone Edge
- Pursuit
- $missing_move$

Given the size of our replay dataset, this creates a massive set of real (but incomplete) human team choices. The files are stored alongside the parsed-replay dataset and downloaded with:

python -m metamon.data.download revealed-teams
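Because the magic keywords are plain strings, these files need nothing beyond standard text processing. A small sketch (the file path is hypothetical) that counts how much of a revealed team is missing:

import re
from collections import Counter

# hypothetical path to one revealed-team file
with open("revealed_team.txt") as f:
    team_text = f.read()

# tally the $missing_...$ placeholders left by the parser
missing = Counter(re.findall(r"\$missing_(\w+)\$", team_text))
print(missing)  # e.g. Counter({'ev': 6, 'nature': 1, 'move': 1})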

metamon/backend/team_prediction contains tools for filling in the blanks of these files, but this is all poorly documented and changes frequently, so we'll leave it at that for now.



Battle Backends

Converting Showdown messages to RL observations is hard, and there will always be bugs. Minor fixes to edge cases or rare Pokémon mechanics are fairly common and don't have a real impact on overall performance. However, a fix that directly impacts observation features (agent inputs) usually decreases performance of policies trained on older battles. We extend the lifespan of pretrained model weights by versioning the "battle backend" so that we can evaluate the agent in an environment that matches the dataset it was trained on.

battle_backend : str is an arg for all the RL environment wrappers (see Quick Start).

There are currently three versions:

battle_backend   Description / Known Bugs / When To Use

"poke-env"       Original paper version. Uses poke-env to process online battles.
                 Known bugs: creates a sim2sim gap with the replay parser that generates training data from replays; PP counting and tera types are broken.
                 When to use: evaluating the original paper policies.

"pokeagent"      Replaces poke-env's message parsing with metamon's replay parser. Maintains the version used by all the new baselines and datasets created for the PokéAgent Challenge.
                 Known bugs: Gen9 was in Beta; tera types are reported as missing.
                 When to use: evaluating a policy trained during the competition (see Pretrained Models).

"metamon"        Always the latest version.
                 When to use: collecting new self-play data and training new policies from scratch.

A PretrainedAgent saves the backend it "should" be evaluated with (if you're using one as a baseline). If you are collecting lots of new self-play data and actively working on new training runs: use "metamon". Thanks to a few hacks, it is still reasonable to use any PretrainedAgent to collect new training data in the current metamon backend.
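For example, a new data-collection run on the latest backend might look like the Quick Start ladder env with one extra kwarg (all other arguments as before):

from metamon.env import QueueOnLocalLadder

env = QueueOnLocalLadder(
    battle_format="gen1ou",
    player_username="my_scary_username",
    num_battles=10,
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    player_team_set=team_set,
    battle_backend="metamon",  # latest backend for new datasets and policies
)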


Team Preview

In Generation 9, battles begin with a "team preview" phase where both players see each other's full team and choose which Pokémon to lead with. Metamon includes a separate model for this decision.

Training: Team preview models are trained via metamon/backend/team_preview/ using supervised learning on human replay data.

Evaluation: Pass a checkpoint to the evaluation script. An example checkpoint for gen9ou is included:

python -m metamon.rl.evaluate --team_preview_checkpoint metamon/backend/team_preview/gen9ou_high_elo_v4/best_model.pt --team_preview_use_argmax ...

The --team_preview_use_argmax flag selects the highest-probability lead deterministically; without it, the model samples from its predicted distribution.


FAQ

How can I contribute?

Please get in touch! Currently, the easiest place to reach us is via the PokéAgent Challenge Discord Server. You can also email the lead author.

Why do you focus on Gens 1-4?

Because there is no team preview before Gen 5, and inferring hidden information via long-term memory was our main focus from an RL research perspective. There's more about this in the paper. A common criticism was that we were avoiding the complexity that comes with later generations' increase in the number of available Pokémon, items, abilities, and so on. If this gap exists, it is more than made up for by the volume of gen9ou replays, as Gen9OU is now arguably our second best format.

Will you add support for the missing Gens 5-8?

The main engineering barrier is the replay parser and dataset, which supports gen9 but would surely need some updates for backwards-compatible edge cases. This is not a huge job... but redoing the self-play training process to catch up to the performance in existing gens would be. We would definitely accept contributions on this front, but honestly have no plans to do it ourselves: in our opinion, the expansion to gen9 answered research doubts about generality and model-free RL at low search depth, and new (singles) formats are more Showdown infra trouble than they're worth.

What about VGC (doubles)?

Support for VGC has been in development but we aren't announcing any timelines on this just yet.


Acknowledgements

This project owes a huge debt to the amazing poke-env, as well as Pokémon resources like Bulbapedia, Smogon, and of course Pokémon Showdown.



Citation

@misc{grigsby2025metamon,
      title={Human-Level Competitive Pok\'emon via Scalable Offline Reinforcement Learning with Transformers}, 
      author={Jake Grigsby and Yuqi Xie and Justin Sasek and Steven Zheng and Yuke Zhu},
      year={2025},
      eprint={2504.04395},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.04395}, 
}
