Finetune Models
Framework Agnostic
Weaver is a framework-agnostic platform. You can choose and adapt any RL training framework you prefer. The Weaver team has integrated and deeply optimized NexRL as a reference implementation, which we'll use in the following examples.
NexRL Overview
NexRL is an RL training framework that works seamlessly with the Weaver backend for large-scale post-training. It features modular components for building custom RL pipelines, providing flexibility and extensibility while maintaining clean abstractions and ease of use.
NexRL provides an end-to-end pipeline for reinforcement learning with Weaver:
```
Data → RolloutWorker → Trajectories → Trainer → Weaver Training API
            ↓                                            ↓
         Rewards                                 Updated Weights
```

Key Features:
- Modular design: Customize rollout workers and trainers independently
- Recipe-based configuration: Reproducible experiments with version control
- Flexible rollout workers: Simple text completion or complex agents (NexAU)
- Trainer extensibility: Implement your own RL algorithms (GRPO, PPO, custom)
- Direct Weaver integration: Seamless connection to Weaver's training API
Architecture Components:
- DataLoader: Provides training examples (prompts, questions, etc.)
- RolloutWorker (Customizable): Generates responses and computes rewards
- TrajectoryPool: Collects and groups trajectories for training
- Trainer (Customizable): Implements your RL algorithm (e.g., GRPO, PPO, custom)
- WeightSync: Synchronizes updated weights back to Weaver inference service
For Weaver users, you typically customize two components:
- RolloutWorker: Trajectory generation logic (how to collect data and compute rewards)
- Trainer: Training algorithm logic (how to compute advantages and update the model)
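To make the division of labor concrete, here is a minimal, self-contained sketch of one training iteration. It is illustrative only: the class names, the `label` field, and the baseline-subtracted advantage rule are assumptions for this sketch, not the NexRL API (the real base classes appear under Customizable Components below).

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Illustrative trajectory record: token ids, a loss mask, and a scalar reward."""
    tokens: list[int]
    loss_mask: list[int]
    reward: float
    extra_fields: dict = field(default_factory=dict)

class ToyRolloutWorker:
    """Stands in for a RolloutWorker: turns one task into one rewarded trajectory."""
    def rollout(self, task: dict) -> Trajectory:
        prompt_tokens = task["prompt_tokens"]      # pretend these are tokenized prompts
        response_tokens = [101, 102, 103]          # pretend these came from the LLM
        reward = 1.0 if task.get("label") == "ok" else 0.0
        return Trajectory(
            tokens=prompt_tokens + response_tokens,
            loss_mask=[0] * len(prompt_tokens) + [1] * len(response_tokens),
            reward=reward,
        )

class ToyTrainer:
    """Stands in for a Trainer: computes advantages, then would call the Weaver training API."""
    def train_step(self, trajectories: list[Trajectory]) -> None:
        mean_reward = sum(t.reward for t in trajectories) / len(trajectories)
        for t in trajectories:
            t.extra_fields["advantage"] = t.reward - mean_reward  # simple baseline-subtracted advantage
        print(f"would send {len(trajectories)} trajectories to Weaver (mean reward {mean_reward:.2f})")

# One iteration: data → rollout → trajectory pool → trainer
tasks = [{"prompt_tokens": [1, 2, 3], "label": "ok"},
         {"prompt_tokens": [4, 5], "label": "bad"}]
worker, trainer, pool = ToyRolloutWorker(), ToyTrainer(), []
for task in tasks:
    pool.append(worker.rollout(task))   # RolloutWorker fills the TrajectoryPool
trainer.train_step(pool)                # Trainer consumes the pool and updates the model
```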
Running Training
NexRL organizes each training job as a recipe - a self-contained directory with configuration and task-specific code.
Recipe Structure
```
recipe/
└── my_task/
    ├── my_task.yaml           # Main configuration
    ├── my_task.env.sh         # Environment setup (optional)
    └── agent_workspace/       # Task-specific files (for agents)
        ├── agent_config.yaml  # Agent configuration
        ├── evaluator.py       # Reward computation
        └── tools.py           # Custom tools (optional)
```
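If you prefer to create the skeleton programmatically, a small helper like the following works. It is a convenience sketch, not part of NexRL; it only mirrors the file names listed above and leaves every file empty for you to fill in.

```python
from pathlib import Path

def scaffold_recipe(root: str, task: str) -> None:
    """Lay out the recipe skeleton shown above (hypothetical helper)."""
    task_dir = Path(root) / task
    (task_dir / "agent_workspace").mkdir(parents=True, exist_ok=True)
    for rel in (f"{task}.yaml",
                f"{task}.env.sh",
                "agent_workspace/agent_config.yaml",
                "agent_workspace/evaluator.py",
                "agent_workspace/tools.py"):
        (task_dir / rel).touch(exist_ok=True)

scaffold_recipe("recipe", "my_task")
```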
Running a Recipe
```bash
# Navigate to NexRL directory
cd NexRL
# Launch training on Kubernetes
python scripts_new/training_service/run.py \
    --train-config recipe/my_task/my_task.yaml \
    --run-nexrl \
    --tag v1
```

Options:
- --train-config: Path to the recipe configuration YAML file
- --run-nexrl: Start training automatically (omit to start training manually later)
- --tag: Optional tag added to job names for identification
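If you drive several runs from a script (for example, to sweep tags), you can invoke the same entry point programmatically. This is only a thin subprocess wrapper around the command above; `launch_run` is a hypothetical helper, and the script path and flags are taken directly from this page.

```python
import subprocess

def launch_run(train_config: str, tag: str, start_now: bool = True) -> None:
    """Invoke the NexRL launcher with the flags documented above (hypothetical wrapper)."""
    cmd = ["python", "scripts_new/training_service/run.py",
           "--train-config", train_config,
           "--tag", tag]
    if start_now:
        cmd.append("--run-nexrl")  # omit to start training manually later
    subprocess.run(cmd, cwd="NexRL", check=True)

launch_run("recipe/my_task/my_task.yaml", tag="v1")
```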
Examples
Basic Training: Pig Latin
The Pig Latin task demonstrates basic supervised learning with NexRL and Weaver: the model is fine-tuned to translate English text to Pig Latin using ground-truth labels.
Recipe Location: NexRL/recipe/weaver_pig_latin_qwen3_8b/
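The training set is a JSONL file with one example per line. The config below reads the prompt from the "input" field (see prompt_key); the label field name used in this sketch ("ground_truth") is an assumption for illustration, since the actual schema is defined by the recipe's data files.

```python
import json

# Hypothetical shape of one line in pig_latin_train.jsonl.
# "input" matches prompt_key in the config below; the label field name is assumed.
example = {
    "input": "Translate to Pig Latin: hello world",
    "ground_truth": "ellohay orldway",
}
print(json.dumps(example))
```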
Key Configuration Highlights:
```yaml
# Main config: weaver_pig_latin_qwen3_8b.yaml
project_name: "NexRL-weaver"
experiment_name: "weaver-pig_latin_qwen3_8b"

data:
  type: "torch"
  data_files:
    - "recipe/weaver_pig_latin_qwen3_8b/data/pig_latin_train.jsonl"
  batch_size: 4
  rollout_repeat_n: 1        # Supervised learning: 1 trajectory per example
  prompt_key: "input"
  max_prompt_length: 512
  max_response_length: 512

rollout_worker:
  type: "pig_latin"          # Custom rollout worker for supervised learning
  num_workers: 4
  need_llm_inference: false  # Uses ground-truth labels, no inference needed

trainer:
  type: "remote_api_cross_entropy"  # Standard supervised fine-tuning
  total_train_steps: 6

service:
  weaver_service:
    lora_rank: 32
    api_key: "your-weaver-api-key"

train_service:
  backend: weaver
  config:
    loss_fn: "cross_entropy"  # Standard supervised loss
    learning_rate: 1e-4
```

This example demonstrates:
- Supervised learning workflow with NexRL
- Custom rollout worker without LLM inference
- Cross-entropy loss for supervised fine-tuning
- Weaver integration for model training
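Because need_llm_inference is false, the pig_latin worker can build each trajectory directly from the ground-truth label instead of querying the inference service. The built-in worker's code is not reproduced here; the sketch below only illustrates the idea, and the `_tokenize` helper and data field names are assumptions.

```python
# Illustrative only: a supervised rollout worker that never calls the LLM.
# Field names ("input", "ground_truth") and self._tokenize are assumed for this sketch.
from nexrl.rollout_worker import BaseRolloutWorker
from nexrl.nexrl_types import Trajectory

class SupervisedWorker(BaseRolloutWorker):
    def rollout(self, task: dict) -> str | None:
        prompt_tokens = self._tokenize(task["input"])         # hypothetical helper
        target_tokens = self._tokenize(task["ground_truth"])  # hypothetical helper
        trajectory = Trajectory(
            tokens=prompt_tokens + target_tokens,
            loss_mask=[0] * len(prompt_tokens) + [1] * len(target_tokens),
            reward=1.0,  # no reward signal needed for cross-entropy training
        )
        return self._put_trajectory(trajectory)
```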
For more details on NexRL framework design, configuration structure, and advanced features, refer to the NexRL Documentation.
Customizable Components
RolloutWorker
The RolloutWorker component converts input data into training trajectories with rewards.
Built-in Worker Types:
- simple: Basic text completion - sends the prompt to the LLM and compares the answer directly.
- nexau: NexAU agent worker - multi-turn reasoning with tools, skills, etc.
- pig_latin: Supervised learning worker for the Pig Latin task - uses ground-truth labels (no LLM inference).
Key Function to Customize: rollout()
The rollout() function converts input data to a trajectory. Typical RL processing involves:
- Process input data: Extract prompt and metadata
- Query LLM: Generate response using inference service
- Process output: Extract answer and compute reward
- Put trajectory: Submit trajectory to pool
Example:
```python
# recipe/my_task/rollout_worker.py
from nexrl.rollout_worker import BaseRolloutWorker
from nexrl.nexrl_types import Trajectory


class MyTaskWorker(BaseRolloutWorker):
    """Custom rollout worker for my task."""

    def rollout(self, task: dict) -> str | None:
        """Convert input data to a trajectory."""
        # 1. Process input data
        prompt = task["prompt"]
        ground_truth = task.get("ground_truth", "")

        # 2. Query LLM
        completion = self._inference_client.completion(prompt)
        prompt_tokens = completion["prompt_tokens"]
        response_tokens = completion["response_tokens"]

        # 3. Process output to get response and reward
        response = completion["response"]
        extracted_answer = self._extract_answer(response)
        reward = 1.0 if extracted_answer == ground_truth else 0.0

        # 4. Put trajectory
        trajectory = Trajectory(
            tokens=prompt_tokens + response_tokens,
            loss_mask=[0] * len(prompt_tokens) + [1] * len(response_tokens),
            reward=reward,
            extra_fields={
                "response": response,
                "ground_truth": ground_truth,
            },
        )
        return self._put_trajectory(trajectory)
```

Configuration:
```yaml
rollout_worker:
  type: "custom"
  custom_rollout_worker_module_path: "recipe/my_task/rollout_worker.py"
  custom_rollout_worker_class_name: "MyTaskWorker"
```

Trainer
The Trainer component implements your RL algorithm by processing trajectories and training with Weaver.
Built-in Trainer Types:
- remote_api_grpo: Group Relative Policy Optimization (GRPO algorithm)
- remote_api_cross_entropy: Supervised fine-tuning (standard cross-entropy loss)
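The core idea behind remote_api_grpo is group-relative advantage normalization: several trajectories are sampled per prompt (see rollout_repeat_n), and each trajectory's advantage is its reward standardized against the rewards of its own group. The sketch below shows only that idea, not the built-in trainer's actual code; `grpo_advantages` and the grouping keys are assumptions.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards_by_group: dict[str, list[float]], eps: float = 1e-6) -> dict[str, list[float]]:
    """Standardize each reward against its own group (same prompt, repeated rollouts)."""
    advantages = {}
    for group_id, rewards in rewards_by_group.items():
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages[group_id] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages

# Two prompts, four rollouts each: correct answers get reward 1.0, others 0.0.
print(grpo_advantages({"prompt-a": [1.0, 0.0, 0.0, 1.0],
                       "prompt-b": [0.0, 0.0, 0.0, 1.0]}))
```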
Key Function to Customize: _prepare_trajectories()
NexRL provides shared training functionality through the base RemoteApiTrainer class. The typical workflow is:
- Prepare trajectories (compute advantages)
- Convert to Weaver format
- Train via Weaver API
You can override _prepare_trajectories() to implement custom advantage computation while using the shared training logic.
Example (Custom Trainer):
```python
# recipe/my_task/trainer.py
from nexrl.trainer import RemoteApiTrainer


class MyAlgorithmTrainer(RemoteApiTrainer):
    """Custom trainer with your algorithm."""

    def _prepare_trajectories(self, trajectories, metrics):
        """
        Implement your advantage computation here.
        The base class handles Weaver API communication.
        """
        # Your algorithm: PPO, A2C, or custom logic
        for traj in trajectories:
            # Example: Simple reward-to-go
            traj.advantage = traj.reward
            # Or: Value function bootstrapping
            # Or: GAE computation
        return trajectories
```

Configuration:
```yaml
trainer:
  type: "custom"
  custom_trainer_module_path: "recipe/my_task/trainer.py"
  custom_trainer_class_name: "MyAlgorithmTrainer"
```

Next Steps
- NexAU Agent Framework: Learn more about building agents in the NexAU guide
- NexRL Documentation: Explore framework design, advanced configuration, and distributed training in the NexRL Developer Guide
- Example Recipes: Browse complete recipe examples in the NexRL/recipe/ directory