Finetune Models
Framework Agnostic
Weaver is a framework-agnostic platform. You can choose and adapt any RL training framework you prefer. The Weaver team has integrated and deeply optimized NexRL as a reference implementation, which we'll use in the following examples.
NexRL Overview
NexRL is an RL training framework that works seamlessly with the Weaver backend for large-scale post-training. It features modular components for building custom RL pipelines, providing flexibility and extensibility while maintaining clean abstractions and ease of use.
NexRL provides an end-to-end pipeline for reinforcement learning with Weaver:
```
Data → RolloutWorker → Trajectories → Trainer → Weaver Training API
            ↓                                            ↓
         Rewards                                 Updated Weights
```

Key Features:
- Modular design: Customize rollout workers and trainers independently
- Recipe-based configuration: Reproducible experiments with version control
- Flexible rollout workers: Simple text completion or complex agents (NexAU)
- Trainer extensibility: Implement your own RL algorithms (GRPO, PPO, custom)
- Direct Weaver integration: Seamless connection to Weaver's training API
Architecture Components:
- DataLoader: Provides training examples (prompts, questions, etc.)
- RolloutWorker (Customizable): Generates responses and computes rewards
- TrajectoryPool: Collects and groups trajectories for training
- Trainer (Customizable): Implements your RL algorithm (e.g., GRPO, PPO, custom)
- WeightSync: Synchronizes updated weights back to Weaver inference service
For Weaver users, you typically customize two components:
- RolloutWorker: Trajectory generation logic (how to collect data and compute rewards)
- Trainer: Training algorithm logic (how to compute advantages and update the model)
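To make the division of labor concrete, here is a minimal, self-contained sketch of one training iteration. It is illustrative only: the class names, the `label` field, and the baseline-subtracted advantage rule are assumptions for this sketch, not the NexRL API (the real base classes appear under Customizable Components below).

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Illustrative trajectory record: token ids, a loss mask, and a scalar reward."""
    tokens: list[int]
    loss_mask: list[int]
    reward: float
    extra_fields: dict = field(default_factory=dict)

class ToyRolloutWorker:
    """Stands in for a RolloutWorker: turns one task into one rewarded trajectory."""
    def rollout(self, task: dict) -> Trajectory:
        prompt_tokens = task["prompt_tokens"]      # pretend these are tokenized prompts
        response_tokens = [101, 102, 103]          # pretend these came from the LLM
        reward = 1.0 if task.get("label") == "ok" else 0.0
        return Trajectory(
            tokens=prompt_tokens + response_tokens,
            loss_mask=[0] * len(prompt_tokens) + [1] * len(response_tokens),
            reward=reward,
        )

class ToyTrainer:
    """Stands in for a Trainer: computes advantages, then would call the Weaver training API."""
    def train_step(self, trajectories: list[Trajectory]) -> None:
        mean_reward = sum(t.reward for t in trajectories) / len(trajectories)
        for t in trajectories:
            t.extra_fields["advantage"] = t.reward - mean_reward  # simple baseline-subtracted advantage
        print(f"would send {len(trajectories)} trajectories to Weaver (mean reward {mean_reward:.2f})")

# One iteration: data → rollout → trajectory pool → trainer
tasks = [{"prompt_tokens": [1, 2, 3], "label": "ok"},
         {"prompt_tokens": [4, 5], "label": "bad"}]
worker, trainer, pool = ToyRolloutWorker(), ToyTrainer(), []
for task in tasks:
    pool.append(worker.rollout(task))   # RolloutWorker fills the TrajectoryPool
trainer.train_step(pool)                # Trainer consumes the pool and updates the model
```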
Running Training
NexRL organizes each training job as a recipe - a self-contained directory with configuration and task-specific code.
Recipe Structure
```
recipe/
└── my_task/
    ├── my_task.yaml           # Main configuration
    ├── my_task.env.sh         # Environment setup (optional)
    └── agent_workspace/       # Task-specific files (for agents)
        ├── agent_config.yaml  # Agent configuration
        ├── evaluator.py       # Reward computation
        └── tools.py           # Custom tools (optional)
```
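If you prefer to create the skeleton programmatically, a small helper like the following works. It is a convenience sketch, not part of NexRL; it only mirrors the file names listed above and leaves every file empty for you to fill in.

```python
from pathlib import Path

def scaffold_recipe(root: str, task: str) -> None:
    """Lay out the recipe skeleton shown above (hypothetical helper)."""
    task_dir = Path(root) / task
    (task_dir / "agent_workspace").mkdir(parents=True, exist_ok=True)
    for rel in (f"{task}.yaml",
                f"{task}.env.sh",
                "agent_workspace/agent_config.yaml",
                "agent_workspace/evaluator.py",
                "agent_workspace/tools.py"):
        (task_dir / rel).touch(exist_ok=True)

scaffold_recipe("recipe", "my_task")
```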
Running a Recipe
```bash
# Navigate to NexRL directory
cd NexRL
# Launch training on Kubernetes
python scripts_new/training_service/run.py \
    --train-config recipe/my_task/my_task.yaml \
    --run-nexrl \
    --tag v1
```

Options:
- --train-config: Path to the recipe configuration YAML file
- --run-nexrl: Start training automatically (omit to start training manually later)
- --tag: Optional tag added to job names for identification
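If you drive several runs from a script (for example, to sweep tags), you can invoke the same entry point programmatically. This is only a thin subprocess wrapper around the command above; `launch_run` is a hypothetical helper, and the script path and flags are taken directly from this page.

```python
import subprocess

def launch_run(train_config: str, tag: str, start_now: bool = True) -> None:
    """Invoke the NexRL launcher with the flags documented above (hypothetical wrapper)."""
    cmd = ["python", "scripts_new/training_service/run.py",
           "--train-config", train_config,
           "--tag", tag]
    if start_now:
        cmd.append("--run-nexrl")  # omit to start training manually later
    subprocess.run(cmd, cwd="NexRL", check=True)

launch_run("recipe/my_task/my_task.yaml", tag="v1")
```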
Examples
Basic Training: Pig Latin
The Pig Latin task demonstrates basic supervised learning with NexRL and Weaver: the model is fine-tuned to translate English text to Pig Latin using ground-truth labels.
Recipe Location: NexRL/recipe/weaver_pig_latin_qwen3_8b/
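The training set is a JSONL file with one example per line. The config below reads the prompt from the "input" field (see prompt_key); the label field name used in this sketch ("ground_truth") is an assumption for illustration, since the actual schema is defined by the recipe's data files.

```python
import json

# Hypothetical shape of one line in pig_latin_train.jsonl.
# "input" matches prompt_key in the config below; the label field name is assumed.
example = {
    "input": "Translate to Pig Latin: hello world",
    "ground_truth": "ellohay orldway",
}
print(json.dumps(example))
```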
Key Configuration Highlights:
```yaml
# Main config: weaver_pig_latin_qwen3_8b.yaml
project_name: "NexRL-weaver"
experiment_name: "weaver-pig_latin_qwen3_8b"

data:
  type: "torch"
  data_files:
    - "recipe/weaver_pig_latin_qwen3_8b/data/pig_latin_train.jsonl"
  batch_size: 4
  rollout_repeat_n: 1        # Supervised learning: 1 trajectory per example
  prompt_key: "input"
  max_prompt_length: 512
  max_response_length: 512

rollout_worker:
  type: "pig_latin"          # Custom rollout worker for supervised learning
  num_workers: 4
  need_llm_inference: false  # Uses ground-truth labels, no inference needed

trainer:
  type: "remote_api_cross_entropy"  # Standard supervised fine-tuning
  total_train_steps: 6

service:
  weaver_service:
    lora_rank: 32
    api_key: "your-weaver-api-key"

train_service:
  backend: weaver
  config:
    loss_fn: "cross_entropy"  # Standard supervised loss
    learning_rate: 1e-4
```

This example demonstrates:
- Supervised learning workflow with NexRL
- Custom rollout worker without LLM inference
- Cross-entropy loss for supervised fine-tuning
- Weaver integration for model training
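Because need_llm_inference is false, the pig_latin worker can build each trajectory directly from the ground-truth label instead of querying the inference service. The built-in worker's code is not reproduced here; the sketch below only illustrates the idea, and the `_tokenize` helper and data field names are assumptions.

```python
# Illustrative only: a supervised rollout worker that never calls the LLM.
# Field names ("input", "ground_truth") and self._tokenize are assumed for this sketch.
from nexrl.rollout_worker import BaseRolloutWorker
from nexrl.nexrl_types import Trajectory

class SupervisedWorker(BaseRolloutWorker):
    def rollout(self, task: dict) -> str | None:
        prompt_tokens = self._tokenize(task["input"])         # hypothetical helper
        target_tokens = self._tokenize(task["ground_truth"])  # hypothetical helper
        trajectory = Trajectory(
            tokens=prompt_tokens + target_tokens,
            loss_mask=[0] * len(prompt_tokens) + [1] * len(target_tokens),
            reward=1.0,  # no reward signal needed for cross-entropy training
        )
        return self._put_trajectory(trajectory)
```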
For more details on NexRL framework design, configuration structure, and advanced features, refer to the NexRL Documentation.
Customizable Components
RolloutWorker
The RolloutWorker component converts input data into training trajectories with rewards.
Built-in Worker Types:
- simple: Basic text completion - sends the prompt to the LLM and compares the answer directly.
- nexau: NexAU agent worker - multi-turn reasoning with tools, skills, etc.
- pig_latin: Supervised learning worker for the Pig Latin task - uses ground-truth labels (no LLM inference).
Key Function to Customize: rollout()
The rollout() function converts input data to a trajectory. Typical RL processing involves:
- Process input data: Extract prompt and metadata
- Query LLM: Generate response using inference service
- Process output: Extract answer and compute reward
- Put trajectory: Submit trajectory to pool
Example:
```python
# recipe/my_task/rollout_worker.py
from nexrl.rollout_worker import BaseRolloutWorker
from nexrl.nexrl_types import Trajectory


class MyTaskWorker(BaseRolloutWorker):
    """Custom rollout worker for my task."""

    def rollout(self, task: dict) -> str | None:
        """Convert input data to a trajectory."""
        # 1. Process input data
        prompt = task["prompt"]
        ground_truth = task.get("ground_truth", "")

        # 2. Query LLM
        completion = self._inference_client.completion(prompt)
        prompt_tokens = completion["prompt_tokens"]
        response_tokens = completion["response_tokens"]

        # 3. Process output to get response and reward
        response = completion["response"]
        extracted_answer = self._extract_answer(response)
        reward = 1.0 if extracted_answer == ground_truth else 0.0

        # 4. Put trajectory
        trajectory = Trajectory(
            tokens=prompt_tokens + response_tokens,
            loss_mask=[0] * len(prompt_tokens) + [1] * len(response_tokens),
            reward=reward,
            extra_fields={
                "response": response,
                "ground_truth": ground_truth,
            },
        )
        return self._put_trajectory(trajectory)
```

Configuration:
```yaml
rollout_worker:
  type: "custom"
  custom_rollout_worker_module_path: "recipe/my_task/rollout_worker.py"
  custom_rollout_worker_class_name: "MyTaskWorker"
```

Trainer
The Trainer component implements your RL algorithm by processing trajectories and training with Weaver.
Built-in Trainer Types:
- remote_api_grpo: Group Relative Policy Optimization (GRPO algorithm)
- remote_api_cross_entropy: Supervised fine-tuning (standard cross-entropy loss)
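The core idea behind remote_api_grpo is group-relative advantage normalization: several trajectories are sampled per prompt (see rollout_repeat_n), and each trajectory's advantage is its reward standardized against the rewards of its own group. The sketch below shows only that idea, not the built-in trainer's actual code; `grpo_advantages` and the grouping keys are assumptions.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards_by_group: dict[str, list[float]], eps: float = 1e-6) -> dict[str, list[float]]:
    """Standardize each reward against its own group (same prompt, repeated rollouts)."""
    advantages = {}
    for group_id, rewards in rewards_by_group.items():
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages[group_id] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages

# Two prompts, four rollouts each: correct answers get reward 1.0, others 0.0.
print(grpo_advantages({"prompt-a": [1.0, 0.0, 0.0, 1.0],
                       "prompt-b": [0.0, 0.0, 0.0, 1.0]}))
```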
Key Function to Customize: _prepare_trajectories()
NexRL provides shared training functionality through the base RemoteApiTrainer class. The typical workflow is:
- Prepare trajectories (compute advantages)
- Convert to Weaver format
- Train via Weaver API
You can override _prepare_trajectories() to implement custom advantage computation while using the shared training logic.
Example (Custom Trainer):
```python
# recipe/my_task/trainer.py
from nexrl.trainer import RemoteApiTrainer


class MyAlgorithmTrainer(RemoteApiTrainer):
    """Custom trainer with your algorithm."""

    def _prepare_trajectories(self, trajectories, metrics):
        """
        Implement your advantage computation here.
        The base class handles Weaver API communication.
        """
        # Your algorithm: PPO, A2C, or custom logic
        for traj in trajectories:
            # Example: Simple reward-to-go
            traj.advantage = traj.reward
            # Or: Value function bootstrapping
            # Or: GAE computation
        return trajectories
```

Configuration:
```yaml
trainer:
  type: "custom"
  custom_trainer_module_path: "recipe/my_task/trainer.py"
  custom_trainer_class_name: "MyAlgorithmTrainer"
```

Next Steps
- NexAU Agent Framework: Learn more about building agents in the NexAU guide
- NexRL Documentation: Explore framework design, advanced configuration, and distributed training in the NexRL Developer Guide
- Example Recipes: Browse complete recipe examples in the NexRL/recipe/ directory