Overview
Evaluations in Pulze provide a systematic way to test and benchmark AI performance. Unlike simple testing, evaluations let you assess not just individual models but also agents and entire spaces:
- AI Models: Test and compare different language models
- AI Agents/Assistants: Evaluate specialized agents with their tools and capabilities
- Entire Spaces: Test complete space configurations including models, assistants, data, and permissions
Evaluation Workflow
Pulze evaluations use a two-part approach:
- Evaluation Templates: Reusable configurations that define how to evaluate performance
- Evaluation Runs: Actual test executions using templates against datasets
Evaluation Templates
Templates are the foundation of your evaluation strategy. They define:
- What to evaluate: Models, spaces, or specific configurations
- How to evaluate: Rater model, evaluation criteria, and scoring thresholds
- Advanced settings: Feature flags, space impersonation, custom headers
Benefits of templates:
- Reusable: Create once, use multiple times
- Consistent: Ensure the same evaluation criteria across runs
- Customizable: Tailor evaluation logic to your specific needs
- Version-controlled: Track changes to evaluation standards over time
Open-Source Evaluation Templates
Pulze provides evaluation templates from the Pulze Evals open-source repository. These templates include:
- Industry-standard evaluation rubrics used across the AI community
- The same templates used to build Pulze routers - our routers were trained and optimized using these exact evaluation criteria
- Community-contributed templates for diverse use cases
The evaluation templates in the Pulze Evals repository represent battle-tested criteria. They’re the same rubrics we use internally to ensure our Pulze routers deliver high-quality results.
With these templates, you can:
- Use existing templates from the repository
- Customize templates for your specific needs
- Contribute your own evaluation templates back to the community
- Build specialized templates for your domain
Contribute Evaluation Templates
Visit the repository to explore templates or submit your own
Creating Evaluation Templates
Basic Template Configuration
Evaluation templates consist of several key components:
1. Template Identity
- Name: Descriptive identifier (e.g., “Customer Support Quality Assessment”)
- Description: Purpose and use case explanation
2. Rater Model (LLM-as-a-Judge)
The rater model is an AI model that evaluates the responses of other AI models. This “LLM-as-a-judge” approach allows for:
- Automated, consistent evaluation at scale
- Nuanced assessment of quality dimensions
- Cost-effective alternative to human evaluation
When choosing a rater model:
- Select from any available model in your organization
- Consider using stronger models (e.g., GPT-4, Claude) for more reliable judgments
- Balance cost vs. accuracy for your use case
3. Pass Threshold
Set the minimum score (0.0 to 1.0) required for an evaluation to pass:
- 0.0-0.3: Poor/failing responses
- 0.4-0.6: Acceptable but needs improvement
- 0.7-0.9: Good quality responses
- 0.9-1.0: Excellent responses
Start with a threshold around 0.7 and adjust based on your quality requirements. A threshold that is too strict (above 0.9) may flag acceptable responses as failures.
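As a rough illustration (not Pulze's internal code), the pass check is simply a comparison of the rater's 0.0-1.0 score against your configured threshold:

```python
# Illustrative only: how a 0.0-1.0 rater score is gated by the pass threshold.
def passes(score: float, threshold: float = 0.7) -> bool:
    return score >= threshold

print(passes(0.82))  # True  -- a "good quality" response clears the 0.7 bar
print(passes(0.55))  # False -- "acceptable but needs improvement" does not
```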
4. Metrics Configuration
Metrics define what dimensions to evaluate. Pulze provides predefined metrics and supports custom metrics.

Predefined Metrics:
- Accuracy: Factual correctness of the response
- Relevance: How well the response addresses the question
- Helpfulness: Practical value and usefulness to the user
To add custom metrics:
- Click “Add Metric” in the template editor
- Enter your metric name (e.g., “professionalism”, “conciseness”, “creativity”)
- The evaluation prompt automatically updates to include your metric
- The rater model will score responses on all defined metrics
Example custom metrics:
- Tone: Professional vs. casual communication style
- Conciseness: Brevity and clarity
- Empathy: Understanding of user emotions
- Technical Depth: Level of technical detail
- Safety: Absence of harmful or biased content
The evaluation JSON structure automatically adapts to include all your metrics, ensuring consistent scoring across dimensions.
5. Evaluation Prompt
The evaluation prompt instructs the rater model on how to assess responses. A good evaluation prompt includes:
- Clear criteria: Specific dimensions to evaluate
- Scoring scale: How to assign scores (0.0-1.0)
- Output format: JSON structure for consistent parsing
- Examples (optional): Sample evaluations for clarity
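Here is a minimal sketch of what such a prompt and the rater's JSON reply might look like. The wording, schema, and scores are illustrative assumptions rather than a required format; the metric names come from the predefined metrics listed above.

```python
import json

# Illustrative evaluation prompt asking the rater model for a fixed JSON schema.
EVALUATION_PROMPT = """\
You are grading an AI response against the reference answer.
Score each metric from 0.0 (poor) to 1.0 (excellent):
- accuracy: factual correctness of the response
- relevance: how well the response addresses the question
- helpfulness: practical value and usefulness to the user

Return only JSON in exactly this format:
{"accuracy": <float>, "relevance": <float>, "helpfulness": <float>, "reasoning": "<one sentence>"}
"""

# A sample rater reply, parsed the way an evaluation run would consume it.
rater_reply = '{"accuracy": 0.9, "relevance": 0.8, "helpfulness": 0.7, "reasoning": "Correct and on-topic, but omits a follow-up step."}'
scores = json.loads(rater_reply)
numeric = [v for v in scores.values() if isinstance(v, (int, float))]
overall = sum(numeric) / len(numeric)
print(f"overall={overall:.2f}, passes={overall >= 0.7}")  # overall=0.80, passes=True
```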
Advanced Template Modes
Pulze evaluation templates support powerful advanced configurations:
1. Space Impersonation
Evaluate entire space configurations by selecting a space to impersonate during evaluation:
- Tests with the space’s specific permissions
- Uses the space’s enabled models and routers
- Accesses the space’s uploaded data and documents
- Leverages the space’s configured AI agents and tools
When you select a space for evaluation, Pulze automatically generates a temporary API key with that space’s permissions. This ensures evaluations run exactly as they would for real users of that space.
2. Agentic Feature Flags
Control automatic AI agent behavior during evaluations:

Auto Tools 🛠️
- Automatically selects appropriate tools to help generate responses
- Tests whether your AI agents choose the right tools for each task
- Validates tool integration and execution

A second flag uses learned patterns from liked responses to improve outputs:
- Tests how well the system adapts to successful patterns
- Validates learning system effectiveness
Feature flags let you test different AI behaviors. For example, compare performance with and without automatic tool selection to measure the impact of agentic features.
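As a rough sketch of such a comparison, the two run configurations below differ only in the auto_tools flag. The dictionary shape and field names (other than auto_tools, which is referenced elsewhere in this guide) are assumptions for illustration; see the Developer Guide and API reference for the real request format.

```python
# Hypothetical run configurations for a with/without comparison of Auto Tools.
baseline_run = {
    "template": "customer-support-quality",   # your evaluation template
    "datasets": ["support-benchmark-v1"],     # same dataset for both runs
    "feature_flags": {"auto_tools": False},
}
agentic_run = {**baseline_run, "feature_flags": {"auto_tools": True}}

# Running both against the same template and datasets isolates the effect of
# automatic tool selection on the resulting scores.
```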
3. Additional Advanced Features
Beyond basic configuration, evaluation templates support sophisticated testing scenarios.

Targeting Specific Tools: Use custom headers to test specific tool usage (see the sketch after this list):
- Target particular tools for AI agents to use
- Validate tool selection and execution
- Test tool integration in different scenarios

Testing Specific Assistants:
- Test individual assistants within a space
- Compare assistant configurations
- Validate assistant behavior with different prompts

Custom headers and parameter overrides can be applied to both:
- Model Being Evaluated: Configure the model/space you’re testing
- Rater Model: Customize how the evaluation model behaves
For example:
- Custom routing headers for specific model behavior
- Temperature or parameter overrides for consistency

These advanced features enable:
- Organization-specific configuration testing
- A/B testing different AI configurations
- Tool availability and selection validation
- Assistant-specific prompt testing
- Multi-assistant comparison within spaces
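As a hedged illustration of the pattern only: the header and field names below are hypothetical placeholders, not documented Pulze headers (the real names are covered in the Developer Guide).

```python
# Hypothetical example of attaching custom headers to an evaluation run.
# Header names are placeholders for illustration; consult the Developer Guide.
run_config = {
    "template": "customer-support-quality",
    "datasets": ["support-benchmark-v1"],
    "model_headers": {"X-Example-Target-Tool": "web_search"},     # evaluated model/space
    "rater_headers": {"X-Example-Temperature-Override": "0.0"},   # rater model consistency
}
```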
Developer Guide - Feature Flags
See practical examples of using feature flags and custom headers in the Developer Guide
Running Evaluations
Once you have templates and datasets, you can run evaluations.

Single Dataset Evaluation
- Navigate to Evals → Evaluations
- Click Run Evaluation
- Select your evaluation template
- Choose one dataset
- Select models or spaces to evaluate
- Run the evaluation
Multi-Dataset Evaluation
Evaluate across multiple datasets simultaneously:
- Select multiple datasets when configuring your run
- Each dataset contributes to the overall score
- View aggregated results across all datasets
- Compare performance on different types of questions
Benefits of multi-dataset evaluation:
- Comprehensive Coverage: Test across diverse scenarios
- Balanced Assessment: No single dataset dominates the score
- Efficiency: Run once instead of multiple single-dataset evaluations
Multi-dataset evaluations automatically calculate total scores by averaging performance across all selected datasets. This gives you a holistic view of model performance.
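For instance, assuming the unweighted averaging described above (the numbers here are made up):

```python
# Made-up per-dataset averages; the total score is their unweighted mean,
# so no single dataset dominates the result.
dataset_scores = {
    "faq-benchmark": 0.82,
    "edge-cases": 0.64,
    "tool-usage": 0.75,
}
total_score = sum(dataset_scores.values()) / len(dataset_scores)
print(f"total score: {total_score:.2f}")  # 0.74
```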
Evaluation Results
Automatic Scoring
Pulze automatically calculates scores for each evaluation run:
- Per-Item Scores: Individual question/response scores (0.0-1.0)
- Dataset Scores: Average across all items in a dataset
- Total Score: Overall average when using multiple datasets
- Pass/Fail Status: Based on your configured threshold
Results Dashboard
The evaluation results view shows:
- Model Rankings: See which models perform best
- Score Distributions: Understand performance patterns
- Pass Rates: Track how many responses met your threshold
- Detailed Analysis: Drill down into individual responses
Comparing Models
Evaluate multiple models simultaneously to compare:
- Side-by-side scores: See which model performed better
- Cost analysis: Compare performance relative to cost
- Speed metrics: Track response times
- Quality trends: Identify consistent performers
Evaluation Purposes
1. Model Selection
Compare different AI models to find the best fit:
- Test GPT-4, Claude, Gemini, or other models
- Evaluate proprietary vs. open-source options
- Balance performance, cost, and speed
2. Assistant Validation
Test AI agents and assistants with their full capabilities:
- Validate tool usage and selection
- Ensure agents follow instructions correctly
- Test multi-step reasoning and planning
3. Space Configuration Testing
Validate entire space setups before deployment:
- Test with specific data access and permissions
- Verify assistant configurations
- Ensure tool integrations work correctly
4. Regression Testing
Catch performance degradation after changes:
- Run evaluations before and after updates
- Compare results to detect regressions
- Maintain quality standards over time
5. Quality Assurance
Maintain consistent behavior across your AI systems:
- Define quality standards via thresholds
- Ensure responses meet requirements
- Track quality metrics over time
Best Practices
Start with Clear Objectives: Define what you want to measure before creating evaluation templates. Are you testing accuracy, helpfulness, tool usage, or something else?
For Templates
- Descriptive Names: Use clear names like “Customer Support Quality” instead of “Template 1”
- Detailed Prompts: Provide comprehensive evaluation criteria to the rater model
- Appropriate Thresholds: Set realistic pass/fail thresholds based on your use case
- Relevant Metrics: Choose metrics that align with your goals
For Evaluation Runs
- Representative Datasets: Use datasets that reflect real-world usage
- Multiple Datasets: Combine different dataset types for comprehensive testing
- Regular Cadence: Schedule periodic evaluations to catch issues early
- Baseline Comparisons: Always compare against a baseline or previous version
For Space Evaluations
- Test Production Config: Use space impersonation to test exactly what users will experience
- Validate Permissions: Ensure data access controls work as expected
- Check Tool Integration: Verify AI agents use tools correctly
- Monitor Agent Behavior: Track how agents make decisions with feature flags
Evaluation Templates Library
Pulze provides predefined evaluation templates to get you started quickly:
- Quality Assessment: General-purpose quality evaluation
- Factual Accuracy: Tests for correct information
- Instruction Following: Measures adherence to prompts
- Custom Templates: Create your own for specific use cases
Integration with Data
Evaluations are tightly integrated with Pulze’s data ecosystem.

With Datasets
- Use any dataset type (Manual, Learning, or Benchmark)
- Combine multiple datasets in one evaluation
- Create datasets specifically for evaluation purposes
With Spaces
- Evaluate using space-specific data and documents
- Test with space permissions and access controls
- Validate space configurations before user deployment
Results Storage
- All evaluation results are stored and versioned
- Track performance trends over time
- Export results for external analysis
- Share findings with your team
Advanced Evaluation Patterns
A/B Testing Configurations
Use evaluation templates to A/B test different configurations:
- Create two templates with different settings (e.g., with/without auto_tools)
- Run both against the same datasets
- Compare results to determine which configuration performs better
Continuous Evaluation
Integrate evaluations into your CI/CD pipeline:
- Create datasets that represent your test cases
- Set up evaluation templates for your standards
- Run evaluations automatically on changes
- Block deployments that don’t meet thresholds
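A minimal sketch of such a deployment gate, assuming you retrieve the run's total score through the Pulze API (the retrieval itself is left as a placeholder, since the actual endpoints live in the API reference):

```python
import sys

PASS_THRESHOLD = 0.7  # align with the threshold configured in your template

def fetch_total_score(run_id: str) -> float:
    """Placeholder: fetch the evaluation run's total score via the Pulze API."""
    raise NotImplementedError("call the Pulze API here; see the API reference")

def gate_deployment(run_id: str) -> None:
    score = fetch_total_score(run_id)
    if score < PASS_THRESHOLD:
        print(f"Run {run_id} scored {score:.2f} (< {PASS_THRESHOLD}); blocking deployment.")
        sys.exit(1)
    print(f"Run {run_id} scored {score:.2f}; deployment can proceed.")
```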
Progressive Testing
Test changes incrementally:
- Start with a small dataset to validate basic functionality
- Expand to larger datasets for comprehensive testing
- Run space-impersonated evaluations for final validation
- Deploy with confidence
Monitoring and Alerts
Set up monitoring for evaluation results to catch performance degradation early. If scores drop below your threshold, investigate before the issue affects users.
- Set up alerts for failing evaluations
- Monitor score trends
- Track pass rates across models
- Identify degradation patterns
API Access
Evaluations are accessible via the Pulze API for automation:
- Create and manage templates programmatically
- Trigger evaluation runs automatically
- Retrieve results for custom analytics
- Integrate with external monitoring systems
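The sketch below shows only the general authenticated-request pattern; the endpoint path is a placeholder, not a documented route, and the base URL may differ for your environment. Consult the Pulze API reference for the real evaluation endpoints.

```python
import os
import requests

# Rough sketch of authenticated access to the evaluations API.
API_KEY = os.environ["PULZE_API_KEY"]
BASE_URL = "https://api.pulze.ai"         # adjust if your deployment differs
EVALS_PATH = "/<evaluations-endpoint>"    # placeholder: see the API reference

response = requests.get(
    f"{BASE_URL}{EVALS_PATH}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```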
Next Steps
To get started with evaluations:
- Create datasets with representative test cases (Manual, Learning, or Benchmark)
- Design evaluation templates that define your quality standards
- Run your first evaluation to establish baselines
- Compare results across models, assistants, or configurations
- Iterate and improve based on evaluation insights