@markknoffler
Summary

Fixes #483

This PR fixes the crash in make_seq2seq_fields when an empty prompt array is passed. Instead of failing with a cryptic NumPy error (ValueError: negative dimensions are not allowed), the function now handles an empty prompt by issuing a warning and falling back to a default BOS token.

Solution Overview

The fix adds input validation that:

  1. Detects empty prompts early (before any array operations)
  2. Issues a clear UserWarning to alert users
  3. Uses BOS token (2) as a fallback prompt for seq2seq compatibility
  4. Continues execution without crashing

Changes Made

File: gemma/gm/data/_functional.py

Lines 139-148: Added empty prompt handling with warning and fallback

# Handle empty prompt: issue warning and use a default BOS token (2) as fallback.
if len(prompt) == 0:
    warnings.warn(
        'Empty prompt provided. Using default BOS token (2) as prompt. '
        'Empty prompts are not recommended for sequence-to-sequence training.',
        UserWarning,
        stacklevel=2,
    )
    # Use BOS token (2) as a default prompt token for seq2seq compatibility.
    prompt = np.array([2], dtype=np.int32)

Additional change: added import warnings at the top of the file, which the warnings.warn call above requires.
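
For readers who want to try the guard in isolation, here is a minimal sketch of the same check wrapped in a standalone helper. The helper name `ensure_non_empty_prompt`, the constant `BOS_TOKEN`, and the `np.asarray` conversion are assumptions made for illustration; the actual change lives inline in make_seq2seq_fields and may differ in detail.

```python
import warnings

import numpy as np

# Assumption: BOS id 2, as stated in this PR description.
BOS_TOKEN = 2


def ensure_non_empty_prompt(prompt):
    """Hypothetical helper mirroring the guard added in this PR."""
    # Assumption: normalise the input to an int32 array before checking it.
    prompt = np.asarray(prompt, dtype=np.int32)
    if len(prompt) == 0:
        warnings.warn(
            'Empty prompt provided. Using default BOS token (2) as prompt. '
            'Empty prompts are not recommended for sequence-to-sequence training.',
            UserWarning,
            stacklevel=2,  # Attribute the warning to the caller, not this helper.
        )
        return np.array([BOS_TOKEN], dtype=np.int32)
    return prompt
```

Non-empty prompts pass through unchanged, so only the empty-prompt edge case takes the fallback branch.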

Why This Solution?

  1. Non-breaking - Code continues execution (better for pipelines and batch processing)
  2. User-aware - Warning alerts users to the issue without crashing
  3. Reasonable fallback - BOS token (2) is the standard beginning-of-sequence token in Gemma models
  4. Better UX - Clear warning message instead of cryptic crash
  5. Backward compatible - Normal usage with non-empty prompts is completely unaffected

Behavioral Changes

Before:

result = gm.data.make_seq2seq_fields(prompt=[], response=[20, 21, 1])
# Raises: ValueError: negative dimensions are not allowed (crashes)

After:

import warnings
result = gm.data.make_seq2seq_fields(prompt=[], response=[20, 21, 1])
# Issues: UserWarning: Empty prompt provided. Using default BOS token (2)...
# Returns: Valid Seq2SeqFields result with BOS token as prompt (continues)
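
Because the fallback is silent apart from the warning, callers who would rather fail fast can promote the warning to an exception using the standard `warnings` filters. The sketch below assumes a stand-in function, `make_fields_stub`, in place of the real gm.data.make_seq2seq_fields so that it runs without Gemma installed; `strict_call` is likewise a hypothetical wrapper.

```python
import warnings


def make_fields_stub(prompt):
    """Hypothetical stand-in that only reproduces the warning and fallback."""
    if len(prompt) == 0:
        warnings.warn(
            'Empty prompt provided. Using default BOS token (2) as prompt.',
            UserWarning,
            stacklevel=2,
        )
        prompt = [2]
    return list(prompt)


def strict_call(prompt):
    """Run the stub with UserWarning escalated to an exception."""
    with warnings.catch_warnings():
        # Inside this block, any UserWarning is raised instead of printed.
        warnings.simplefilter("error", UserWarning)
        return make_fields_stub(prompt)


try:
    strict_call([])
except UserWarning as exc:
    print(f"Rejected empty prompt: {exc}")
```

The filter change is scoped to the `with` block, so the rest of the program keeps its normal warning behaviour.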

Technical Details

Why BOS Token (2) as Fallback?

  • Standard token in Gemma models (both Gemma2 and Gemma3 use BOS = 2)
  • Semantically appropriate default for seq2seq tasks
  • Ensures valid mask computation (prevents negative dimensions)
  • Consistent with tokenization practices where BOS is often prepended

Code Flow

Before (Buggy):

Empty prompt []
    ↓
np.concatenate([[], response])  ✓ Works
    ↓
np.zeros((len([]) - 1,), ...)  ✗ CRASH: len([]) - 1 = -1
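
The failing step can be reproduced with NumPy alone. This is a sketch of the two operations shown in the flow above, not the library's actual code; the variable names and the mask dtype are assumptions, since the original call elides its remaining arguments.

```python
import numpy as np

prompt = np.array([], dtype=np.int32)
response = np.array([20, 21, 1], dtype=np.int32)

# Step 1: concatenating an empty prompt with the response succeeds.
sequence = np.concatenate([prompt, response])

# Step 2: a shape derived from len(prompt) - 1 becomes (-1,), which NumPy rejects.
try:
    np.zeros((len(prompt) - 1,), dtype=np.bool_)  # Assumption: boolean mask dtype.
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed is the one quoted in this PR, which says nothing about the prompt being empty and is why the original failure was hard to diagnose.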

After (Fixed):

Empty prompt []
    ↓
Detect len(prompt) == 0  ✓ Early detection
    ↓
Issue UserWarning  ✓ Alert user
    ↓
Replace with [2] (BOS token)  ✓ Graceful fallback
    ↓
Continue normal execution  ✓ Success

Testing

The fix has been verified to:

  • ✅ Handle empty prompts gracefully with clear warnings
  • ✅ Return valid results using BOS token fallback
  • ✅ Not break normal usage with non-empty prompts
  • ✅ Provide helpful warning messages to users
  • ✅ Work correctly through both direct API and transform pipelines
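
A regression test along the following lines could pin down both the fallback and the unchanged behaviour for non-empty prompts. It is a sketch only: `guard_prompt` is a hypothetical stand-in for the guarded portion of make_seq2seq_fields, and a real test would call the library function and inspect the returned Seq2SeqFields instead.

```python
import unittest
import warnings

import numpy as np


def guard_prompt(prompt):
    """Hypothetical stand-in for the empty-prompt guard described in this PR."""
    prompt = np.asarray(prompt, dtype=np.int32)
    if len(prompt) == 0:
        warnings.warn(
            'Empty prompt provided. Using default BOS token (2) as prompt.',
            UserWarning,
            stacklevel=2,
        )
        prompt = np.array([2], dtype=np.int32)
    return prompt


class EmptyPromptTest(unittest.TestCase):

    def test_empty_prompt_warns_and_falls_back(self):
        # The empty prompt must warn and be replaced by the BOS token.
        with self.assertWarns(UserWarning):
            result = guard_prompt([])
        self.assertEqual(result.tolist(), [2])

    def test_non_empty_prompt_is_unchanged(self):
        # A normal prompt must pass through without any warning.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            result = guard_prompt([10, 11])
        self.assertEqual(caught, [])
        self.assertEqual(result.tolist(), [10, 11])
```

Running the module with `python -m unittest` would execute both cases.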

Backward Compatibility

This change is fully backward compatible:

  • No breaking changes - Normal usage with non-empty prompts works exactly as before
  • No API changes - Function signature and return types remain the same
  • No behavior changes for valid inputs - Only affects the edge case of empty prompts

Files Changed

  • gemma/gm/data/_functional.py - Added empty prompt handling logic and warnings import

Related Issues

Fixes the crash described in #483, where an empty prompt causes ValueError: negative dimensions are not allowed.

Development

Successfully merging this pull request may close these issues.

make_seq2seq_fields crashes with confusing error when empty prompt array is passed