
Overview

Datasets are curated collections of prompts and expected responses used for systematic testing, evaluation, and benchmarking of AI performance. With datasets, you can evaluate:
  • AI Models: Test and compare different language models
  • AI Agents/Assistants: Evaluate specialized agents with specific tools and capabilities
  • Entire Spaces: Test complete space configurations including models, assistants, data, and permissions
Pulze supports multiple types of datasets, each designed for different use cases and workflows, allowing you to thoroughly validate your AI systems before deployment.

Dataset Types

Pulze offers three distinct types of datasets to fit your evaluation needs:

1. Manual Datasets

Manual datasets allow you to create custom test cases by manually entering prompts as if you were chatting in a space.
Key Features:
  • Space Context: Select a specific space, and your dataset inherits all space configurations
  • Full Space Access: Use all enabled models, assistants, tools, and data available in that space
  • Custom Prompts: Write your own test prompts and define expected “golden” answers
  • Flexible Creation: Add prompts one at a time or in bulk
Use Cases:
  • Testing specific business scenarios or edge cases
  • Quality assurance for customer-facing assistants
  • Regression testing after configuration changes
  • Creating domain-specific evaluation sets
How to Create:
  1. Navigate to Data → Datasets
  2. Click Create Dataset
  3. Choose Manual as the dataset type
  4. Select the space whose context you want to use
  5. Add prompts and their expected responses
  6. Save your dataset
Manual datasets are perfect when you have specific test cases in mind or want to validate particular scenarios that reflect your unique use case.
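At their core, a manual dataset's items are prompt/expected-answer pairs. Below is a minimal, illustrative sketch of what such pairs might look like when prepared for bulk entry; the field names and the JSON Lines format are assumptions for illustration, not Pulze's actual import schema.

```python
# Illustrative only: a handful of prompt / expected ("golden") answer pairs
# for a manual dataset, written out as JSON Lines for bulk entry.
# Field names are assumptions, not Pulze's actual import schema.
import json

test_cases = [
    {
        "prompt": "What is our refund policy for annual subscriptions?",
        "expected_response": "Annual subscriptions can be refunded pro rata within 30 days.",
    },
    {
        "prompt": "Summarize the onboarding checklist for new enterprise customers.",
        "expected_response": "Account setup, SSO configuration, data import, and a kickoff call.",
    },
]

with open("manual_dataset.jsonl", "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```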

2. Learning Datasets (Space-Based)

Learning datasets are automatically generated from real conversations in your spaces, specifically from messages that received positive feedback (likes).
Key Features:
  • Automatic Generation: Built from actual user interactions
  • Quality-Driven: Only includes prompts that received positive feedback
  • Space-Scoped: Limited to conversations within specific spaces
  • Permission-Based: You only see spaces you have access to
  • Real-World Data: Reflects actual usage patterns and questions
Use Cases:
  • Capturing successful interactions for future testing
  • Building evaluation sets from production usage
  • Identifying patterns in high-quality responses
  • Creating benchmarks based on real user satisfaction
How to Create:
  1. Navigate to Data → Datasets
  2. Click Create Dataset
  3. Choose Learn as the dataset type
  4. Select the space to pull liked conversations from
  5. Configure filters (date range, minimum likes, etc.)
  6. Generate the dataset
Learning datasets require that users have actively liked messages in your spaces. Encourage your team to use the like feature to build better evaluation sets!
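Conceptually, generation works by filtering your space's message history down to prompt/response pairs whose responses were liked within the configured window. The sketch below is purely illustrative of that idea; Pulze performs this selection for you, and the field names here are assumptions.

```python
from datetime import datetime, timezone

# Illustrative sketch of how a learning dataset is conceptually derived:
# keep only prompt/response pairs whose response received enough likes
# within the chosen date range. Field names are assumptions.
def build_learning_items(messages, min_likes=1, since=None):
    items = []
    for msg in messages:
        if msg["likes"] < min_likes:
            continue
        if since and msg["created_at"] < since:
            continue
        items.append({"prompt": msg["prompt"], "expected_response": msg["response"]})
    return items

messages = [
    {"prompt": "How do I reset my password?",
     "response": "Use the 'Forgot password' link on the sign-in page.",
     "likes": 3,
     "created_at": datetime(2024, 5, 2, tzinfo=timezone.utc)},
]
print(build_learning_items(messages, min_likes=2,
                           since=datetime(2024, 1, 1, tzinfo=timezone.utc)))
```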

3. Benchmark Datasets

Benchmark datasets provide access to industry-leading, standardized evaluation benchmarks from the Pulze Evals open-source repository.
Key Features:
  • Open-Source: Community-contributed benchmarks
  • Industry Standard: Well-established evaluation frameworks
  • Diverse Coverage: Multiple domains, subjects, and difficulty levels
  • Browse & Select: Interactive browser to explore and select specific prompts
  • Remix Capability: Mix and match questions from different benchmarks
  • Export & Contribute: Create your own benchmarks and contribute back
Available Benchmarks Include:
  • Academic knowledge tests (MMLU, ARC, etc.)
  • Reasoning benchmarks (HellaSwag, WinoGrande, etc.)
  • Coding challenges
  • Mathematical problem-solving
  • Domain-specific evaluations
How to Create:

Browse and Select from Existing Benchmarks

  1. Navigate to Data → Datasets
  2. Click Create Dataset
  3. Choose Benchmarks as the dataset type
  4. Browse Benchmarks:
    • View all available benchmarks with descriptions
    • See total items and available subjects for each
    • Search benchmarks by name or description
  5. Select Benchmark Items:
    • Choose a benchmark to explore
    • Filter by subject/category
    • Configure dataset size:
      • All items: Include everything from the benchmark
      • First N items: Take the first N questions
      • Random sample: Randomly select N questions (with a configurable seed)
      • Range selection: Select items from index X to Y
    • Search specific questions or answers
    • Select individual items or bulk select
  6. Remix Multiple Benchmarks:
    • Add items from one benchmark
    • Switch to another benchmark and add more items
    • Build custom evaluation sets mixing different sources
  7. Use Selected Items: Create your dataset with all accumulated items

Dataset Size Options

Control exactly how many items you want:
  • All items: Complete benchmark
  • First N: Sequential selection
  • Random: Reproducible random sampling
  • Range: Specific index range
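The sketch below illustrates how these four options behave on a generic list of benchmark items; it is not Pulze's internal implementation, just a way to see why a fixed seed makes random sampling reproducible.

```python
import random

# Illustrative sketch of the four size options applied to a generic list of
# benchmark items; not Pulze's internal implementation.
items = [f"question_{i}" for i in range(1000)]

all_items = items                                  # All items: complete benchmark
first_n   = items[:100]                            # First N: sequential selection
sampled   = random.Random(42).sample(items, 100)   # Random: fixed seed => reproducible
ranged    = items[250:500]                         # Range: items from index 250 to 499

print(len(all_items), len(first_n), len(sampled), len(ranged))
```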

Bulk Subject Sampling

Efficiently sample across multiple subjects:
  • Select multiple subjects
  • Set sample size per subject
  • Maintain balanced representation
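A minimal sketch of the idea, assuming items carry a subject label: group by subject, then draw the same number from each group with a fixed seed so the balanced sample is reproducible. This is illustrative only, not Pulze's internal implementation.

```python
import random
from collections import defaultdict

# Illustrative sketch of bulk subject sampling: group items by subject, then
# draw the same number from each group with a fixed seed for reproducibility.
def sample_per_subject(items, per_subject=25, seed=42):
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for item in items:
        by_subject[item["subject"]].append(item)
    balanced = []
    for subject, group in by_subject.items():
        balanced.extend(rng.sample(group, min(per_subject, len(group))))
    return balanced

items = [{"subject": s, "question": f"{s} question {i}"}
         for s in ("anatomy", "astronomy", "law") for i in range(100)]
print(len(sample_per_subject(items)))  # 75: 25 items per subject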

Contributing Your Own Benchmarks

The Pulze Evals repository is open-source, and we welcome community contributions!
Export and Contribute:
  1. Create and refine your custom dataset in Pulze
  2. Run evaluations to validate quality
  3. Export your dataset as a benchmark
  4. Submit a pull request to github.com/pulzeai-oss/evals
  5. Share your benchmark with the community
Why Contribute:
  • Help establish new evaluation standards
  • Share domain-specific benchmarks
  • Build reputation in the AI evaluation community
  • Enable others to benchmark in your domain
The benchmark repository is designed for community collaboration. Whether you’re testing medical AI, legal reasoning, or creative writing, your benchmarks can help others evaluate similar systems.

Dataset Management

Viewing Datasets

All datasets are accessible from the Data section:
  • View dataset type (Manual, Learn, or Benchmarks)
  • See item counts and creation dates
  • Filter and search across all datasets
  • Track who created each dataset

Dataset Information

Each dataset includes:
  • Name: Descriptive identifier
  • Description: Purpose and contents
  • Type: Manual, Learn, or Benchmarks
  • Item Count: Number of prompt-response pairs
  • Created By: Dataset author
  • Status: Ready, Creating, or Failed
  • Associated Space: For manual and learning datasets
For benchmark datasets, you’ll also see:
  • Benchmark Name: Source benchmark
  • Subjects: Available categories or topics

Editing Datasets

  • Update Name/Description: Keep your datasets well-organized
  • Add Items: Expand existing datasets with new prompts
  • Remove Items: Clean up or refine your evaluation sets
  • View Details: Expand to see all items with questions and answers

Using Datasets

Datasets integrate seamlessly with evaluations:
  • Select one or multiple datasets for evaluation runs
  • Test different models against the same prompts
  • Compare performance across assistants or configurations
  • Track improvements over time

Permissions and Access

Dataset visibility follows your space permissions:
  • Manual & Learning Datasets: Scoped to specific spaces
    • You can only see datasets from spaces you have access to
    • Dataset context matches the space’s configuration
  • Benchmark Datasets: Available to all organization members
    • Open-source benchmarks are accessible to everyone
    • Custom benchmarks follow organization permissions
If you lose access to a space, you’ll also lose access to its associated manual and learning datasets.

Best Practices

Start Small, Iterate: Begin with a focused dataset of 50-100 carefully chosen prompts, then expand based on evaluation insights.

For Manual Datasets

  1. Representative Prompts: Cover common use cases and edge cases
  2. Clear Golden Answers: Define specific, measurable expected responses
  3. Regular Updates: Refresh datasets as your use case evolves
  4. Version Control: Create dated versions for tracking changes

For Learning Datasets

  1. Encourage Feedback: Train your team to like helpful responses
  2. Quality Threshold: Set minimum like counts to ensure quality
  3. Regular Generation: Update learning datasets monthly or quarterly
  4. Review Before Use: Manually review generated items for appropriateness

For Benchmark Datasets

  1. Mix Sources: Combine items from multiple benchmarks for comprehensive coverage
  2. Balanced Sampling: Ensure even distribution across subjects/categories
  3. Document Composition: Track which benchmarks and items you’ve included
  4. Reproducibility: Use fixed random seeds for consistent sampling
  5. Share Back: Export successful custom benchmarks to the community

Integration with Evaluations

Datasets power the evaluation system:

Evaluation Workflow

  1. Create Dataset: Build your test set using any of the three types
  2. Configure Evaluation: Select datasets, models, and rater configuration
  3. Run Evaluation: Execute tests across your chosen models
  4. Analyze Results: Compare performance, identify issues, track progress

Multi-Dataset Evaluations

  • Combine multiple datasets in a single evaluation run
  • Mix manual, learning, and benchmark datasets
  • Compare model performance across different question types
  • Get comprehensive evaluation coverage

Learn More About Evaluations

Discover how to run comprehensive evaluations using your datasets

Technical Details

Dataset Storage

  • All datasets are stored at the organization level
  • Items include prompts, expected responses, and metadata
  • Benchmark datasets cache items from the open-source repository

Performance Considerations

  • Large datasets (1000+ items) are paginated for performance
  • The benchmark browser handles benchmarks with 50,000+ items
  • Search and filtering work across the entire dataset

API Access

Datasets are accessible via the Pulze API for programmatic use:
  • Create datasets programmatically
  • Add items in bulk
  • Query dataset contents
  • Integrate with CI/CD pipelines
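As a hypothetical sketch of programmatic dataset creation: the endpoint paths, payload fields, and auth header below are assumptions for illustration only; consult the Pulze API reference for the actual schema. A script like this can run inside a CI job so that datasets and items are created or updated automatically whenever your test prompts change.

```python
import requests

# Hypothetical sketch of programmatic dataset creation. Endpoint paths,
# payload fields, and the auth header are assumptions, not the documented API.
BASE_URL = "https://api.pulze.ai"            # assumed base URL
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Create a dataset (assumed endpoint and payload)
dataset = requests.post(
    f"{BASE_URL}/datasets",
    headers=HEADERS,
    json={"name": "Checkout flow regression", "type": "manual"},
).json()

# Add items in bulk (assumed endpoint and payload)
requests.post(
    f"{BASE_URL}/datasets/{dataset['id']}/items",
    headers=HEADERS,
    json={"items": [
        {"prompt": "How do I apply a discount code?",
         "expected_response": "Enter the code on the payment step before confirming."},
    ]},
)
```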

Open-Source Benchmarks

The Pulze Evals repository is a community-driven collection of AI evaluation benchmarks.
Repository Features:
  • Well-documented benchmark formats
  • Easy contribution process
  • Versioned benchmark releases
  • Automated validation
  • Community reviews and feedback
Popular Benchmarks:
  • MMLU (Massive Multitask Language Understanding): 57 subjects across STEM, humanities, social sciences
  • ARC (AI2 Reasoning Challenge): Science questions requiring reasoning
  • HellaSwag: Commonsense reasoning about everyday situations
  • TruthfulQA: Evaluating truthfulness and reducing hallucinations
  • GSM8K: Grade school math word problems
  • And many more…

Contribute to Pulze Evals

Visit the open-source repository to contribute your benchmarks or explore existing ones