Overview
Datasets are curated collections of prompts and expected responses used for systematic testing, evaluation, and benchmarking of AI performance. With datasets, you can evaluate:
- AI Models: Test and compare different language models
- AI Agents/Assistants: Evaluate specialized agents with specific tools and capabilities
- Entire Spaces: Test complete space configurations including models, assistants, data, and permissions
Dataset Types
Pulze offers three distinct types of datasets to fit your evaluation needs:
1. Manual Datasets
Manual datasets allow you to create custom test cases by manually entering prompts as if you were chatting in a space.
Key Features:
- Space Context: Select a specific space, and your dataset inherits all space configurations
- Full Space Access: Use all enabled models, assistants, tools, and data available in that space
- Custom Prompts: Write your own test prompts and define expected “golden” answers
- Flexible Creation: Add prompts one at a time or in bulk
Use Cases:
- Testing specific business scenarios or edge cases
- Quality assurance for customer-facing assistants
- Regression testing after configuration changes
- Creating domain-specific evaluation sets
To create a manual dataset:
- Navigate to Data → Datasets
- Click Create Dataset
- Choose Manual as the dataset type
- Select the space whose context you want to use
- Add prompts and their expected responses
- Save your dataset
Manual datasets are perfect when you have specific test cases in mind or want to validate particular scenarios that reflect your unique use case.
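If you draft prompts and golden answers outside Pulze before adding them in bulk, a lightweight structure like the one below can keep them organized. This is a minimal sketch only; the field names (prompt, expected_response) are illustrative and not Pulze's item schema.

```python
# Illustrative only: staging manual test cases before adding them in bulk.
# Field names are assumptions, not the exact schema Pulze uses.
from dataclasses import dataclass

@dataclass
class DatasetItem:
    prompt: str                # what a user would type in the space
    expected_response: str     # the "golden" answer to score against

items = [
    DatasetItem(
        prompt="What is our refund policy for annual plans?",
        expected_response="Annual plans can be refunded within 30 days of purchase.",
    ),
    DatasetItem(
        prompt="Summarize the onboarding checklist for new hires.",
        expected_response="A short summary covering accounts, equipment, and training steps.",
    ),
]

for item in items:
    print(f"Q: {item.prompt}\nA: {item.expected_response}\n")
```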
2. Learning Datasets (Space-Based)
Learning datasets are automatically generated from real conversations in your spaces, specifically from messages that received positive feedback (likes).
Key Features:
- Automatic Generation: Built from actual user interactions
- Quality-Driven: Only includes prompts that received positive feedback
- Space-Scoped: Limited to conversations within specific spaces
- Permission-Based: You only see spaces you have access to
- Real-World Data: Reflects actual usage patterns and questions
Use Cases:
- Capturing successful interactions for future testing
- Building evaluation sets from production usage
- Identifying patterns in high-quality responses
- Creating benchmarks based on real user satisfaction
To create a learning dataset:
- Navigate to Data → Datasets
- Click Create Dataset
- Choose Learn as the dataset type
- Select the space to pull liked conversations from
- Configure filters (date range, minimum likes, etc.)
- Generate the dataset
Learning datasets require that users have actively liked messages in your spaces. Encourage your team to use the like feature to build better evaluation sets!
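To see what the like-based filtering amounts to conceptually, here is a minimal Python sketch that keeps only prompts meeting a minimum like count within a date range. The message structure is hypothetical and not Pulze's actual data model; the platform applies these filters for you when it generates the dataset.

```python
# Minimal sketch of the filtering idea behind learning datasets: keep only
# prompts whose responses received enough likes within a date range.
# The message structure here is hypothetical, not Pulze's data model.
from datetime import date

messages = [
    {"prompt": "How do I reset my password?", "likes": 3, "sent": date(2024, 5, 2)},
    {"prompt": "Draft a release note.", "likes": 0, "sent": date(2024, 5, 10)},
    {"prompt": "Explain our SLA tiers.", "likes": 1, "sent": date(2023, 11, 1)},
]

MIN_LIKES = 1
START, END = date(2024, 1, 1), date(2024, 12, 31)

dataset_items = [
    m for m in messages
    if m["likes"] >= MIN_LIKES and START <= m["sent"] <= END
]
print(dataset_items)  # only the password-reset prompt passes both filters
```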
3. Benchmark Datasets
Benchmark datasets provide access to industry-leading, standardized evaluation benchmarks from the Pulze Evals open-source repository.
Key Features:
- Open-Source: Community-contributed benchmarks
- Industry Standard: Well-established evaluation frameworks
- Diverse Coverage: Multiple domains, subjects, and difficulty levels
- Browse & Select: Interactive browser to explore and select specific prompts
- Remix Capability: Mix and match questions from different benchmarks
- Export & Contribute: Create your own benchmarks and contribute back
Available benchmark categories include:
- Academic knowledge tests (MMLU, ARC, etc.)
- Reasoning benchmarks (HellaSwag, WinoGrande, etc.)
- Coding challenges
- Mathematical problem-solving
- Domain-specific evaluations
Browse and Select from Existing Benchmarks
- Navigate to Data → Datasets
- Click Create Dataset
- Choose Benchmarks as the dataset type
- Browse Benchmarks:
  - View all available benchmarks with descriptions
  - See total items and available subjects for each
  - Search benchmarks by name or description
- Select Benchmark Items:
  - Choose a benchmark to explore
  - Filter by subject/category
  - Configure dataset size:
    - All items: Include everything from the benchmark
    - First N items: Take the first N questions
    - Random sample: Randomly select N questions (with a configurable seed)
    - Range selection: Select items from index X to Y
  - Search specific questions or answers
  - Select individual items or bulk select
- Remix Multiple Benchmarks:
  - Add items from one benchmark
  - Switch to another benchmark and add more items
  - Build custom evaluation sets mixing different sources
- Use Selected Items: Create your dataset with all accumulated items
Dataset Size Options
Control exactly how many items you want:
- All items: Complete benchmark
- First N: Sequential selection
- Random: Reproducible random sampling
- Range: Specific index range
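The Random option is reproducible because selection is driven by a seed: the same seed over the same benchmark always yields the same items. Here is a minimal Python sketch of seeded sampling (illustrative, not Pulze's implementation):

```python
# Seeded sampling: the same seed over the same item list always yields the
# same selection, which is what makes "Random" reproducible.
import random

items = [f"question-{i}" for i in range(1000)]  # stand-in for benchmark items

def sample_items(items, n, seed=42):
    rng = random.Random(seed)          # dedicated RNG so the seed is explicit
    return rng.sample(items, n)

first_run = sample_items(items, 5)
second_run = sample_items(items, 5)
assert first_run == second_run         # identical selections, run after run
print(first_run)
```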
Bulk Subject Sampling
Efficiently sample across multiple subjects:
- Select multiple subjects
- Set sample size per subject
- Maintain balanced representation
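Conceptually, balanced sampling takes up to N items from each selected subject so no single subject dominates the dataset. A short illustrative sketch, assuming items tagged with a subject field (not Pulze's actual code):

```python
# Balanced subject sampling: take up to N items from each selected subject.
# Data shapes here are illustrative only.
import random
from collections import defaultdict

items = [
    {"subject": "biology", "question": "What is a ribosome?"},
    {"subject": "biology", "question": "Define osmosis."},
    {"subject": "history", "question": "When did WWII end?"},
    {"subject": "math", "question": "What is 7 * 8?"},
    {"subject": "math", "question": "Differentiate x^2."},
    {"subject": "math", "question": "What is a prime number?"},
]

def sample_per_subject(items, subjects, per_subject, seed=42):
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for item in items:
        by_subject[item["subject"]].append(item)
    sampled = []
    for subject in subjects:
        pool = by_subject.get(subject, [])
        sampled.extend(rng.sample(pool, min(per_subject, len(pool))))
    return sampled

print(sample_per_subject(items, ["biology", "math"], per_subject=2))
```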
Contributing Your Own Benchmarks
The Pulze Evals repository is open-source, and we welcome community contributions!
Export and Contribute:
- Create and refine your custom dataset in Pulze
- Run evaluations to validate quality
- Export your dataset as a benchmark
- Submit a pull request to github.com/pulzeai-oss/evals
- Share your benchmark with the community
Why Contribute:
- Help establish new evaluation standards
- Share domain-specific benchmarks
- Build reputation in the AI evaluation community
- Enable others to benchmark in your domain
The benchmark repository is designed for community collaboration. Whether you’re testing medical AI, legal reasoning, or creative writing, your benchmarks can help others evaluate similar systems.
Dataset Management
Viewing Datasets
All datasets are accessible from the Data section:
- View dataset type (Manual, Learn, or Benchmarks)
- See item counts and creation dates
- Filter and search across all datasets
- Track who created each dataset
Dataset Information
Each dataset includes:
- Name: Descriptive identifier
- Description: Purpose and contents
- Type: Manual, Learn, or Benchmarks
- Item Count: Number of prompt-response pairs
- Created By: Dataset author
- Status: Ready, Creating, or Failed
- Associated Space: For manual and learning datasets
- Benchmark Name: Source benchmark (for benchmark datasets)
- Subjects: Available categories or topics (for benchmark datasets)
Editing Datasets
- Update Name/Description: Keep your datasets well-organized
- Add Items: Expand existing datasets with new prompts
- Remove Items: Clean up or refine your evaluation sets
- View Details: Expand to see all items with questions and answers
Using Datasets
Datasets integrate seamlessly with evaluations:
- Select one or multiple datasets for evaluation runs
- Test different models against the same prompts
- Compare performance across assistants or configurations
- Track improvements over time
Permissions and Access
Dataset visibility follows your space permissions:
- Manual & Learning Datasets: Scoped to specific spaces
  - You can only see datasets from spaces you have access to
  - Dataset context matches the space’s configuration
- Benchmark Datasets: Available to all organization members
  - Open-source benchmarks are accessible to everyone
  - Custom benchmarks follow organization permissions
If you lose access to a space, you’ll also lose access to its associated manual and learning datasets.
Best Practices
Start Small, Iterate: Begin with a focused dataset of 50-100 carefully chosen prompts, then expand based on evaluation insights.
For Manual Datasets
- Representative Prompts: Cover common use cases and edge cases
- Clear Golden Answers: Define specific, measurable expected responses
- Regular Updates: Refresh datasets as your use case evolves
- Version Control: Create dated versions for tracking changes
For Learning Datasets
- Encourage Feedback: Train your team to like helpful responses
- Quality Threshold: Set minimum like counts to ensure quality
- Regular Generation: Update learning datasets monthly or quarterly
- Review Before Use: Manually review generated items for appropriateness
For Benchmark Datasets
- Mix Sources: Combine items from multiple benchmarks for comprehensive coverage
- Balanced Sampling: Ensure even distribution across subjects/categories
- Document Composition: Track which benchmarks and items you’ve included
- Reproducibility: Use fixed random seeds for consistent sampling
- Share Back: Export successful custom benchmarks to the community
Integration with Evaluations
Datasets power the evaluation system.
Evaluation Workflow
- Create Dataset: Build your test set using any of the three types
- Configure Evaluation: Select datasets, models, and rater configuration
- Run Evaluation: Execute tests across your chosen models
- Analyze Results: Compare performance, identify issues, track progress
Multi-Dataset Evaluations
- Combine multiple datasets in a single evaluation run
- Mix manual, learning, and benchmark datasets
- Compare model performance across different question types
- Get comprehensive evaluation coverage
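As a rough picture of what a multi-dataset run combines, here is a hypothetical configuration object. The field names and rater settings are illustrative assumptions, not Pulze's evaluation schema; in the product you select these options in the evaluation setup UI.

```python
# Hypothetical evaluation configuration combining several dataset types.
# Field names are illustrative only, not Pulze's actual schema.
evaluation_config = {
    "name": "q3-regression-run",
    "datasets": [
        "manual-support-scenarios",      # manual dataset
        "learning-liked-conversations",  # learning dataset
        "benchmark-mmlu-stem-sample",    # benchmark dataset
    ],
    "models": ["model-a", "model-b"],    # models or assistants to compare
    "rater": {"criteria": ["accuracy", "helpfulness"]},  # assumed rater options
}

# Every model answers every prompt from every dataset, so results can be
# compared across question types in a single run.
total_combinations = len(evaluation_config["datasets"]) * len(evaluation_config["models"])
print(f"{total_combinations} dataset/model combinations in this run")
```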
Learn More About Evaluations
Discover how to run comprehensive evaluations using your datasets
Technical Details
Dataset Storage
- All datasets are stored at the organization level
- Items include prompts, expected responses, and metadata
- Benchmark datasets cache items from the open-source repository
Performance Considerations
- Large datasets (1000+ items) are paginated for performance
- The benchmark browser supports benchmarks with 50,000+ items
- Search and filtering work across the entire dataset
API Access
Datasets are accessible via the Pulze API for programmatic use:
- Create datasets programmatically
- Add items in bulk
- Query dataset contents
- Integrate with CI/CD pipelines
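The sketch below shows what a typical automation flow could look like: create a dataset, add items in bulk, and query its contents from a CI job. The base URL, endpoint paths, and payload fields are assumptions for illustration only; consult the Pulze API reference for the actual routes and schemas.

```python
# Hypothetical sketch of dataset automation in a CI pipeline.
# Endpoint paths and field names are assumptions, not the documented Pulze API.
import os
import requests

BASE_URL = "https://api.pulze.ai"          # assumed base URL
HEADERS = {"Authorization": f"Bearer {os.environ['PULZE_API_KEY']}"}

# Create a dataset (hypothetical endpoint and payload).
resp = requests.post(
    f"{BASE_URL}/v1/datasets",
    headers=HEADERS,
    json={"name": "regression-suite", "type": "manual"},
)
resp.raise_for_status()
dataset_id = resp.json()["id"]

# Add items in bulk (hypothetical endpoint and payload).
items = [
    {"prompt": "What is our refund policy?", "expected_response": "30-day refund window."},
    {"prompt": "List supported SSO providers.", "expected_response": "Okta, Azure AD, Google."},
]
requests.post(
    f"{BASE_URL}/v1/datasets/{dataset_id}/items",
    headers=HEADERS,
    json={"items": items},
).raise_for_status()

# Query dataset contents (hypothetical endpoint).
contents = requests.get(
    f"{BASE_URL}/v1/datasets/{dataset_id}/items", headers=HEADERS
).json()
print(f"Dataset {dataset_id} now has {len(contents.get('items', []))} items")
```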
Open-Source Benchmarks
The Pulze Evals repository is a community-driven collection of AI evaluation benchmarks.
Repository Features:
- Well-documented benchmark formats
- Easy contribution process
- Versioned benchmark releases
- Automated validation
- Community reviews and feedback
Included Benchmarks:
- MMLU (Massive Multitask Language Understanding): 57 subjects across STEM, humanities, and social sciences
- ARC (AI2 Reasoning Challenge): Science questions requiring reasoning
- HellaSwag: Commonsense reasoning about everyday situations
- TruthfulQA: Evaluating truthfulness and reducing hallucinations
- GSM8K: Grade school math word problems
- And many more…
Contribute to Pulze Evals
Visit the open-source repository to contribute your benchmarks or explore existing ones