
Overview

Datasets are curated collections of prompts and expected responses used for systematic testing, evaluation, and benchmarking of AI performance. With datasets, you can evaluate:
  • AI Models: Test and compare different language models
  • AI Agents/Assistants: Evaluate specialized agents with specific tools and capabilities
  • Entire Spaces: Test complete space configurations including models, assistants, data, and permissions
Pulze supports multiple types of datasets, each designed for different use cases and workflows, allowing you to thoroughly validate your AI systems before deployment.

Dataset Types

Pulze offers three distinct types of datasets to fit your evaluation needs:

1. Manual Datasets

Manual datasets allow you to create custom test cases by manually entering prompts as if you were chatting in a space.
Key Features:
  • Space Context: Select a specific space, and your dataset inherits all space configurations
  • Full Space Access: Use all enabled models, assistants, tools, and data available in that space
  • Custom Prompts: Write your own test prompts and define expected “golden” answers
  • Flexible Creation: Add prompts one at a time or in bulk
Use Cases:
  • Testing specific business scenarios or edge cases
  • Quality assurance for customer-facing assistants
  • Regression testing after configuration changes
  • Creating domain-specific evaluation sets
How to Create:
  1. Navigate to Data → Datasets
  2. Click Create Dataset
  3. Choose Manual as the dataset type
  4. Select the space whose context you want to use
  5. Add prompts and their expected responses
  6. Save your dataset
Manual datasets are perfect when you have specific test cases in mind or want to validate particular scenarios that reflect your unique use case.
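At their core, a manual dataset's items are prompt/expected-answer pairs. Below is a minimal, illustrative sketch of what such pairs might look like when prepared for bulk entry; the field names and the JSON Lines format are assumptions for illustration, not Pulze's actual import schema.

```python
# Illustrative only: a handful of prompt / expected ("golden") answer pairs
# for a manual dataset, written out as JSON Lines for bulk entry.
# Field names are assumptions, not Pulze's actual import schema.
import json

test_cases = [
    {
        "prompt": "What is our refund policy for annual subscriptions?",
        "expected_response": "Annual subscriptions can be refunded pro rata within 30 days.",
    },
    {
        "prompt": "Summarize the onboarding checklist for new enterprise customers.",
        "expected_response": "Account setup, SSO configuration, data import, and a kickoff call.",
    },
]

with open("manual_dataset.jsonl", "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```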

2. Learning Datasets (Space-Based)

Learning datasets are automatically generated from real conversations in your spaces, specifically from messages that received positive feedback (likes).
Key Features:
  • Automatic Generation: Built from actual user interactions
  • Quality-Driven: Only includes prompts that received positive feedback
  • Space-Scoped: Limited to conversations within specific spaces
  • Permission-Based: You only see spaces you have access to
  • Real-World Data: Reflects actual usage patterns and questions
Use Cases:
  • Capturing successful interactions for future testing
  • Building evaluation sets from production usage
  • Identifying patterns in high-quality responses
  • Creating benchmarks based on real user satisfaction
How to Create:
  1. Navigate to Data → Datasets
  2. Click Create Dataset
  3. Choose Learn as the dataset type
  4. Select the space to pull liked conversations from
  5. Configure filters (date range, minimum likes, etc.)
  6. Generate the dataset
Learning datasets require that users have actively liked messages in your spaces. Encourage your team to use the like feature to build better evaluation sets!
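Conceptually, generation works by filtering your space's message history down to prompt/response pairs whose responses were liked within the configured window. The sketch below is purely illustrative of that idea; Pulze performs this selection for you, and the field names here are assumptions.

```python
from datetime import datetime, timezone

# Illustrative sketch of how a learning dataset is conceptually derived:
# keep only prompt/response pairs whose response received enough likes
# within the chosen date range. Field names are assumptions.
def build_learning_items(messages, min_likes=1, since=None):
    items = []
    for msg in messages:
        if msg["likes"] < min_likes:
            continue
        if since and msg["created_at"] < since:
            continue
        items.append({"prompt": msg["prompt"], "expected_response": msg["response"]})
    return items

messages = [
    {"prompt": "How do I reset my password?",
     "response": "Use the 'Forgot password' link on the sign-in page.",
     "likes": 3,
     "created_at": datetime(2024, 5, 2, tzinfo=timezone.utc)},
]
print(build_learning_items(messages, min_likes=2,
                           since=datetime(2024, 1, 1, tzinfo=timezone.utc)))
```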

3. Benchmark Datasets

Benchmark datasets provide access to industry-leading, standardized evaluation benchmarks from the Pulze Evals open-source repository.
Key Features:
  • Open-Source: Community-contributed benchmarks
  • Industry Standard: Well-established evaluation frameworks
  • Diverse Coverage: Multiple domains, subjects, and difficulty levels
  • Browse & Select: Interactive browser to explore and select specific prompts
  • Remix Capability: Mix and match questions from different benchmarks
  • Export & Contribute: Create your own benchmarks and contribute back
Available Benchmarks Include:
  • Academic knowledge tests (MMLU, ARC, etc.)
  • Reasoning benchmarks (HellaSwag, WinoGrande, etc.)
  • Coding challenges
  • Mathematical problem-solving
  • Domain-specific evaluations
How to Create:

Browse and Select from Existing Benchmarks

  1. Navigate to Data → Datasets
  2. Click Create Dataset
  3. Choose Benchmarks as the dataset type
  4. Browse Benchmarks:
    • View all available benchmarks with descriptions
    • See total items and available subjects for each
    • Search benchmarks by name or description
  5. Select Benchmark Items:
    • Choose a benchmark to explore
    • Filter by subject/category
    • Configure dataset size:
      • All items: Include everything from the benchmark
      • First N items: Take the first N questions
      • Random sample: Randomly select N questions (with a configurable seed)
      • Range selection: Select items from index X to Y
    • Search specific questions or answers
    • Select individual items or bulk select
  6. Remix Multiple Benchmarks:
    • Add items from one benchmark
    • Switch to another benchmark and add more items
    • Build custom evaluation sets mixing different sources
  7. Use Selected Items: Create your dataset with all accumulated items

Dataset Size Options

Control exactly how many items you want:
  • All items: Complete benchmark
  • First N: Sequential selection
  • Random: Reproducible random sampling
  • Range: Specific index range
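The sketch below illustrates how these four options behave on a generic list of benchmark items; it is not Pulze's internal implementation, just a way to see why a fixed seed makes random sampling reproducible.

```python
import random

# Illustrative sketch of the four size options applied to a generic list of
# benchmark items; not Pulze's internal implementation.
items = [f"question_{i}" for i in range(1000)]

all_items = items                                  # All items: complete benchmark
first_n   = items[:100]                            # First N: sequential selection
sampled   = random.Random(42).sample(items, 100)   # Random: fixed seed => reproducible
ranged    = items[250:500]                         # Range: items from index 250 to 499

print(len(all_items), len(first_n), len(sampled), len(ranged))
```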

Bulk Subject Sampling

Efficiently sample across multiple subjects:
  • Select multiple subjects
  • Set sample size per subject
  • Maintain balanced representation
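A minimal sketch of the idea, assuming items carry a subject label: group by subject, then draw the same number from each group with a fixed seed so the balanced sample is reproducible. This is illustrative only, not Pulze's internal implementation.

```python
import random
from collections import defaultdict

# Illustrative sketch of bulk subject sampling: group items by subject, then
# draw the same number from each group with a fixed seed for reproducibility.
def sample_per_subject(items, per_subject=25, seed=42):
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for item in items:
        by_subject[item["subject"]].append(item)
    balanced = []
    for subject, group in by_subject.items():
        balanced.extend(rng.sample(group, min(per_subject, len(group))))
    return balanced

items = [{"subject": s, "question": f"{s} question {i}"}
         for s in ("anatomy", "astronomy", "law") for i in range(100)]
print(len(sample_per_subject(items)))  # 75: 25 items per subject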

Contributing Your Own Benchmarks

The Pulze Evals repository is open-source, and we welcome community contributions!
Export and Contribute:
  1. Create and refine your custom dataset in Pulze
  2. Run evaluations to validate quality
  3. Export your dataset as a benchmark
  4. Submit a pull request to github.com/pulzeai-oss/evals
  5. Share your benchmark with the community
Why Contribute:
  • Help establish new evaluation standards
  • Share domain-specific benchmarks
  • Build reputation in the AI evaluation community
  • Enable others to benchmark in your domain
The benchmark repository is designed for community collaboration. Whether you’re testing medical AI, legal reasoning, or creative writing, your benchmarks can help others evaluate similar systems.

Dataset Management

Viewing Datasets

All datasets are accessible from the Data section:
  • View dataset type (Manual, Learn, or Benchmarks)
  • See item counts and creation dates
  • Filter and search across all datasets
  • Track who created each dataset

Dataset Information

Each dataset includes:
  • Name: Descriptive identifier
  • Description: Purpose and contents
  • Type: Manual, Learn, or Benchmarks
  • Item Count: Number of prompt-response pairs
  • Created By: Dataset author
  • Status: Ready, Creating, or Failed
  • Associated Space: For manual and learning datasets
For benchmark datasets, you’ll also see:
  • Benchmark Name: Source benchmark
  • Subjects: Available categories or topics

Editing Datasets

  • Update Name/Description: Keep your datasets well-organized
  • Add Items: Expand existing datasets with new prompts
  • Remove Items: Clean up or refine your evaluation sets
  • View Details: Expand to see all items with questions and answers

Using Datasets

Datasets integrate seamlessly with evaluations:
  • Select one or multiple datasets for evaluation runs
  • Test different models against the same prompts
  • Compare performance across assistants or configurations
  • Track improvements over time

Permissions and Access

Dataset visibility follows your space permissions:
  • Manual & Learning Datasets: Scoped to specific spaces
    • You can only see datasets from spaces you have access to
    • Dataset context matches the space’s configuration
  • Benchmark Datasets: Available to all organization members
    • Open-source benchmarks are accessible to everyone
    • Custom benchmarks follow organization permissions
If you lose access to a space, you’ll also lose access to its associated manual and learning datasets.

Best Practices

Start Small, Iterate: Begin with a focused dataset of 50-100 carefully chosen prompts, then expand based on evaluation insights.

For Manual Datasets

  1. Representative Prompts: Cover common use cases and edge cases
  2. Clear Golden Answers: Define specific, measurable expected responses
  3. Regular Updates: Refresh datasets as your use case evolves
  4. Version Control: Create dated versions for tracking changes

For Learning Datasets

  1. Encourage Feedback: Train your team to like helpful responses
  2. Quality Threshold: Set minimum like counts to ensure quality
  3. Regular Generation: Update learning datasets monthly or quarterly
  4. Review Before Use: Manually review generated items for appropriateness

For Benchmark Datasets

  1. Mix Sources: Combine items from multiple benchmarks for comprehensive coverage
  2. Balanced Sampling: Ensure even distribution across subjects/categories
  3. Document Composition: Track which benchmarks and items you’ve included
  4. Reproducibility: Use fixed random seeds for consistent sampling
  5. Share Back: Export successful custom benchmarks to the community

Integration with Evaluations

Datasets power the evaluation system:

Evaluation Workflow

  1. Create Dataset: Build your test set using any of the three types
  2. Configure Evaluation: Select datasets, models, and rater configuration
  3. Run Evaluation: Execute tests across your chosen models
  4. Analyze Results: Compare performance, identify issues, track progress

Multi-Dataset Evaluations

  • Combine multiple datasets in a single evaluation run
  • Mix manual, learning, and benchmark datasets
  • Compare model performance across different question types
  • Get comprehensive evaluation coverage

Learn More About Evaluations

Discover how to run comprehensive evaluations using your datasets

Technical Details

Dataset Storage

  • All datasets are stored at the organization level
  • Items include prompts, expected responses, and metadata
  • Benchmark datasets cache items from the open-source repository

Performance Considerations

  • Large datasets (1000+ items) are paginated for performance
  • The benchmark browser handles benchmarks with 50,000+ items
  • Search and filtering work across the entire dataset

API Access

Datasets are accessible via the Pulze API for programmatic use:
  • Create datasets programmatically
  • Add items in bulk
  • Query dataset contents
  • Integrate with CI/CD pipelines
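As a hypothetical sketch of programmatic dataset creation: the endpoint paths, payload fields, and auth header below are assumptions for illustration only; consult the Pulze API reference for the actual schema. A script like this can run inside a CI job so that datasets and items are created or updated automatically whenever your test prompts change.

```python
import requests

# Hypothetical sketch of programmatic dataset creation. Endpoint paths,
# payload fields, and the auth header are assumptions, not the documented API.
BASE_URL = "https://api.pulze.ai"            # assumed base URL
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Create a dataset (assumed endpoint and payload)
dataset = requests.post(
    f"{BASE_URL}/datasets",
    headers=HEADERS,
    json={"name": "Checkout flow regression", "type": "manual"},
).json()

# Add items in bulk (assumed endpoint and payload)
requests.post(
    f"{BASE_URL}/datasets/{dataset['id']}/items",
    headers=HEADERS,
    json={"items": [
        {"prompt": "How do I apply a discount code?",
         "expected_response": "Enter the code on the payment step before confirming."},
    ]},
)
```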

Open-Source Benchmarks

The Pulze Evals repository is a community-driven collection of AI evaluation benchmarks.
Repository Features:
  • Well-documented benchmark formats
  • Easy contribution process
  • Versioned benchmark releases
  • Automated validation
  • Community reviews and feedback
Popular Benchmarks:
  • MMLU (Massive Multitask Language Understanding): 57 subjects across STEM, humanities, social sciences
  • ARC (AI2 Reasoning Challenge): Science questions requiring reasoning
  • HellaSwag: Commonsense reasoning about everyday situations
  • TruthfulQA: Evaluating truthfulness and reducing hallucinations
  • GSM8K: Grade school math word problems
  • And many more…

Contribute to Pulze Evals

Visit the open-source repository to contribute your benchmarks or explore existing ones