The sample command creates a subset of a dataset using various sampling strategies, with optional category filtering and format conversion.

Usage

panlabel sample -i <INPUT> -o <OUTPUT> [OPTIONS]

Parameters

--input
path
required
Input path to the source dataset. Short form: -i
--output
path
required
Output path for the sampled dataset. Short form: -o
--from
string
default:"auto"
Source format (or auto-detect). Supported values: auto, ir-json, coco, cvat, label-studio, tfod, yolo, voc
--to
string
Target format for the output. Behavior:
  • If omitted and --from is explicit, uses the same format as the input
  • If omitted and --from is auto, defaults to ir-json
Supported values: ir-json, coco, cvat, label-studio, tfod, yolo, voc
--n
integer
Number of images to sample (absolute count). Short form: -n
Note: Exactly one of --n or --fraction is required.
--fraction
float
Fraction of images to sample (0.0 to 1.0). Examples:
  • 0.1 - Sample 10% of images
  • 0.5 - Sample 50% of images
Note: Exactly one of --n or --fraction is required.
--seed
integer
Random seed for deterministic sampling. Using the same seed with the same input always produces the same sample, which is useful for reproducible experiments.
--strategy
string
default:"random"
Sampling strategy to use. Options:
  • random - Uniform random sampling
  • stratified - Category-aware stratified sampling (maintains category distribution)
--categories
string
Comma-separated list of category names to filter on.
Example: person,car,bicycle
When specified, only images containing at least one of these categories are considered for sampling (see --category-mode).
--category-mode
string
default:"images"
How to handle category filtering. Options:
  • images - Keep whole images that contain at least one selected category (all annotations preserved)
  • annotations - Keep only annotations of selected categories (other annotations removed)
--allow-lossy
flag
default:"false"
Allow lossy format conversions that may drop information. Without this flag, the command fails if the target format cannot preserve all information from the source.

Sampling Strategies

Random Sampling (--strategy random)

Selects images uniformly at random from the dataset. Characteristics:
  • Fast and simple
  • May not preserve category distribution
  • Good for general-purpose sampling
Example:
panlabel sample -i dataset.json -o sample.json -n 100 --strategy random

Stratified Sampling (--strategy stratified)

Selects images while attempting to maintain the original category distribution. Characteristics:
  • Preserves category proportions
  • Better for training/validation splits
  • Ensures rare categories are represented
Example:
panlabel sample -i dataset.json -o sample.json --fraction 0.2 --strategy stratified
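
The idea behind stratified sampling can be sketched in Python. This is an illustration only, not Panlabel's implementation: the input shape (a mapping from image ID to a set of category names), the function name, and the rule of assigning each image to the stratum of its rarest category are all assumptions made for the example. Unlike the description above, this sketch does not guarantee that every rare category appears in the output.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(image_categories, n, seed=None):
    """Pick n image IDs while roughly keeping per-stratum proportions.

    image_categories maps image ID -> set of category names (hypothetical
    input shape). Each image is placed in the stratum of its rarest category.
    """
    total = len(image_categories)
    if total == 0:
        return []
    n = max(0, min(n, total))
    rng = random.Random(seed)

    # Count how many images contain each category.
    freq = Counter(c for cats in image_categories.values() for c in cats)

    # Group images by their rarest category; unannotated images share one stratum.
    strata = defaultdict(list)
    for image_id in sorted(image_categories):
        cats = image_categories[image_id]
        key = min(cats, key=lambda c: (freq[c], c)) if cats else None
        strata[key].append(image_id)

    # Largest-remainder allocation using integer arithmetic, so the
    # per-stratum counts always sum to exactly n.
    alloc = {k: n * len(v) // total for k, v in strata.items()}
    remainder = {k: n * len(v) % total for k, v in strata.items()}
    leftover = n - sum(alloc.values())
    for k in sorted(strata, key=lambda k: (-remainder[k], str(k)))[:leftover]:
        alloc[k] += 1

    # Draw uniformly at random inside each stratum.
    picked = []
    for k in sorted(strata, key=str):
        picked.extend(rng.sample(strata[k], alloc[k]))
    return picked
```

With 80 images that contain only "car" and 20 that contain only "bike", asking for 10 images yields 8 from the first group and 2 from the second, mirroring the 80/20 proportion of the source.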

Category Filtering

Image Mode (--category-mode images)

Keeps entire images that contain at least one annotation of the specified categories. All annotations on those images are preserved, even if they’re not in the category list.
Use case: Training a detector for specific objects while keeping scene context.
Example:
# Keep all images with people or cars (and all their annotations)
panlabel sample -i dataset.json -o filtered.json \
  --categories person,car \
  --category-mode images \
  -n 500

Annotation Mode (--category-mode annotations)

Keeps only the annotations that match the specified categories. Images may end up with fewer annotations than the original.
Use case: Creating a dataset for a specific subset of classes.
Example:
# Keep only person and car annotations
panlabel sample -i dataset.json -o filtered.json \
  --categories person,car \
  --category-mode annotations \
  --fraction 1.0
With --category-mode annotations, images that lose all annotations are still kept in the output (as images without annotations).
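
The difference between the two modes can be shown with a small Python sketch. It is an illustration under assumptions, not Panlabel's code: the dataset is reduced to a hypothetical mapping from image ID to a list of category names, and the function name is made up for the example.

```python
def filter_by_categories(annotations_by_image, categories, mode="images"):
    """Apply the two category modes to a {image_id: [category, ...]} mapping."""
    if mode not in ("images", "annotations"):
        raise ValueError(f"unknown mode: {mode!r}")
    wanted = set(categories)
    result = {}
    for image_id, cats in annotations_by_image.items():
        if mode == "images":
            # Keep the whole image, with every annotation, if any category matches.
            if any(c in wanted for c in cats):
                result[image_id] = list(cats)
        else:
            # Keep every image, but drop annotations outside the selection.
            # Images may end up with an empty annotation list.
            result[image_id] = [c for c in cats if c in wanted]
    return result


dataset = {"a": ["person", "dog"], "b": ["dog"], "c": ["car"]}
by_image = filter_by_categories(dataset, ["person", "car"], mode="images")
by_annotation = filter_by_categories(dataset, ["person", "car"], mode="annotations")
```

Here `by_image` keeps images "a" and "c" with all of their annotations (including the dog on "a"), while `by_annotation` keeps all three images, strips the dog annotations, and leaves "b" with no annotations.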

Reproducibility

Use the --seed parameter to ensure reproducible sampling:
# Always produces the same sample
panlabel sample -i dataset.json -o sample.json -n 100 --seed 42
This is essential for:
  • Creating consistent train/validation splits
  • Reproducing experiments
  • Sharing sampling configurations with collaborators
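
Why a fixed seed gives a repeatable sample can be seen in a short Python sketch. This is a generic illustration of seeded sampling, not Panlabel's internals; the function name is hypothetical, and sorting the IDs first is an assumption added so that the result does not depend on input order.

```python
import random


def sample_image_ids(image_ids, n, seed=None):
    """Return n image IDs chosen uniformly at random, repeatable for a given seed."""
    ordered = sorted(image_ids)  # fixed order, so the seed alone decides the draw
    rng = random.Random(seed)    # private generator; global random state is untouched
    return rng.sample(ordered, min(n, len(ordered)))
```

Calling `sample_image_ids(ids, 100, seed=42)` twice returns the same list, even if `ids` is supplied in a different order the second time.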

Examples

Sample 20% Randomly

panlabel sample -i large_dataset.json -o small_dataset.json --fraction 0.2

Sample 1000 Images with Stratified Sampling

panlabel sample -i training_data.json -o sample.json \
  -n 1000 \
  --strategy stratified \
  --seed 42

Filter for Specific Categories

panlabel sample -i dataset.json -o people_only.json \
  --categories person,pedestrian \
  --category-mode images \
  --fraction 1.0

Sample and Convert Format

panlabel sample -i coco_dataset.json -o yolo_sample/ \
  --from coco \
  --to yolo \
  -n 500 \
  --allow-lossy

Create Validation Split

# Approximate 80/20 split. The two samples are drawn independently,
# so train.json and val.json may share images.
panlabel sample -i full_dataset.json -o train.json \
  --fraction 0.8 \
  --strategy stratified \
  --seed 42

panlabel sample -i full_dataset.json -o val.json \
  --fraction 0.2 \
  --strategy stratified \
  --seed 43  # Different seed gives a different sample, not a disjoint one

Category-Specific Subset with Annotation Filtering

panlabel sample -i multi_class.json -o binary.json \
  --categories positive_class,negative_class \
  --category-mode annotations \
  --fraction 1.0

Auto-Detect and Sample YOLO Dataset

panlabel sample -i /data/yolo_full/ -o /data/yolo_sample/ \
  --from auto \
  -n 200 \
  --seed 42

Important Notes

IDs are preserved: Sampled images and annotations retain their original IDs from the source dataset.
All categories preserved: The output dataset includes all category definitions from the source, even if some categories have no annotations after sampling.
When using --fraction with small datasets, the actual number of images may be less than expected due to rounding. Use -n for precise control.
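
The rounding effect is easy to see with a little arithmetic. The Python sketch below is an illustration only: the function name is hypothetical, and both rounding down and capping -n at the dataset size are assumptions, since this page does not state which rules Panlabel applies.

```python
import math


def resolve_sample_size(total, n=None, fraction=None):
    """Turn an absolute count or a fraction into a number of images."""
    # Mirrors the documented rule: exactly one of --n or --fraction is required.
    if (n is None) == (fraction is None):
        raise ValueError("exactly one of n or fraction is required")
    if n is not None:
        if n < 0:
            raise ValueError("n must be non-negative")
        return min(n, total)  # assumption: cannot sample more than exists
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    return math.floor(fraction * total)  # assumption: round down
```

Under these assumptions, a fraction of 0.2 on a 7-image dataset gives 1 image (1.4 rounded down), and 0.5 on 9 images gives 4.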

Output

After sampling, Panlabel prints a summary:
Sampled 1000 images -> 200 images: dataset.json (coco) -> sample.json (coco)

Conversion Report:
  ✓ Lossless conversion
  Images: 200
  Annotations: 1,456
  Categories: 8
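
If a script needs the counts from the first summary line, it can be parsed with a regular expression. The Python sketch below assumes the line looks exactly like the example above; the output format is not guaranteed to be stable, and the function and field names are hypothetical.

```python
import re

# Pattern derived from the single example line shown above (an assumption).
SUMMARY_RE = re.compile(
    r"Sampled (\d+) images -> (\d+) images: "
    r"(.+) \((\S+)\) -> (.+) \((\S+)\)"
)


def parse_summary(line):
    """Return the fields of a summary line as a dict, or None if it does not match."""
    match = SUMMARY_RE.fullmatch(line.strip())
    if match is None:
        return None
    return {
        "source_images": int(match.group(1)),
        "sampled_images": int(match.group(2)),
        "input_path": match.group(3),
        "input_format": match.group(4),
        "output_path": match.group(5),
        "output_format": match.group(6),
    }
```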

Common Workflows

Create Train/Val/Test Splits

#!/bin/bash
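SEED=42

# Note: each call samples independently from full.json, so the three
# subsets are not guaranteed to be disjoint.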

# 70% train
panlabel sample -i full.json -o train.json \
  --fraction 0.7 --strategy stratified --seed $SEED

# 20% val
panlabel sample -i full.json -o val.json \
  --fraction 0.2 --strategy stratified --seed $((SEED + 1))

# 10% test
panlabel sample -i full.json -o test.json \
  --fraction 0.1 --strategy stratified --seed $((SEED + 2))
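
Because each panlabel sample call draws from the full dataset on its own, separate calls with different seeds can select some of the same images. When strictly disjoint splits are needed, one option is to partition the image IDs outside Panlabel. The Python sketch below shows the idea with a single seeded shuffle; it is a plain random (not stratified) split, the function name is hypothetical, and how the resulting ID lists are turned back into dataset files is left open.

```python
import random


def split_ids(image_ids, fractions, seed=None):
    """Partition image IDs into disjoint splits, e.g. {"train": 0.7, "val": 0.3}."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1.0")
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)  # one shuffle shared by every split

    names = list(fractions)
    splits, start, cumulative = {}, 0, 0.0
    for name in names:
        cumulative += fractions[name]
        # The last split takes whatever remains, so no image is lost to rounding.
        end = len(ids) if name == names[-1] else round(cumulative * len(ids))
        splits[name] = ids[start:end]
        start = end
    return splits
```

Every image lands in exactly one split, and the same seed reproduces the same partition.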

Sample for Quick Experimentation

# Quick 100-image sample for testing code
panlabel sample -i huge_dataset.json -o quick_test.json -n 100

Balance Rare Classes

# First, get all images with rare class
panlabel sample -i dataset.json -o rare_class_images.json \
  --categories rare_class \
  --category-mode images \
  --fraction 1.0

# Then sample from that filtered set
panlabel sample -i rare_class_images.json -o balanced_sample.json \
  -n 500 --strategy random

Exit Codes

  • 0 - Sampling successful
  • 1 - Error occurred (invalid parameters, lossy conversion blocked, etc.)
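
In a pipeline, the exit code can be checked before any later step reads the output file. The Python sketch below is a generic wrapper around subprocess.run; the function name is hypothetical, and the commented panlabel call uses only flags documented on this page, with placeholder paths.

```python
import subprocess
import sys


def run_cli(cmd):
    """Run a command and return its stdout; raise if the exit code is non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Forward the tool's own error text, then stop the pipeline.
        print(result.stderr, file=sys.stderr, end="")
        raise RuntimeError(f"command failed with exit code {result.returncode}")
    return result.stdout


# Example (placeholder paths):
# run_cli(["panlabel", "sample", "-i", "dataset.json", "-o", "sample.json",
#          "-n", "100", "--seed", "42"])
```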

See Also

  • Convert Command - Convert without sampling
  • Stats Command - Analyze category distribution before sampling
  • Validate Command - Validate sampled output