The stats command analyzes annotation datasets and generates statistics reports covering dataset totals, category distribution, dimension analysis, quality metrics, and category co-occurrence.

Usage

panlabel stats <INPUT> [OPTIONS]

Parameters

input
path
required
Path to the dataset to analyze. Can be a file or directory depending on the format.
--format
string
Input format. If omitted, Panlabel auto-detects the format.
Supported values: ir-json, coco, cvat, label-studio, tfod, yolo, voc
Aliases: coco-json, cvat-xml, label-studio-json, ls, tfod-csv, ultralytics, yolov8, yolov5, pascal-voc, voc-xml
When auto-detection fails for a JSON file, stats falls back to reading it as ir-json.
--top
number
default:"10"
Number of top labels and label pairs to show in the report. Useful for large datasets with many categories.
--tolerance
float
default:"0.5"
Tolerance in pixels for out-of-bounds checks. Annotations within this tolerance of the image boundary are not flagged as out-of-bounds.
--output
string
default:"text"
Output format for the statistics report. Options:
  • text - Human-readable text report with ASCII visualizations
  • json - Machine-readable JSON with full statistics
  • html - Self-contained HTML report with interactive charts

Statistics Included

The stats report includes:

Dataset Overview

  • Total images, annotations, and categories
  • Images with/without annotations
  • Average annotations per image

Category Distribution

  • Annotation count per category
  • Visual bar charts (text mode) or interactive charts (HTML mode)
  • Top N most frequent categories (controlled by --top)
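
The percentages and bars in the text report can be reproduced from raw per-category counts. The following is a minimal sketch, not panlabel's rendering code: `render_bars` is a hypothetical helper, and it assumes bars are scaled relative to the most frequent category with a maximum width of 20 blocks.

```python
from collections import Counter


def render_bars(labels, top=10, width=20):
    """Return text-report style lines for the `top` most frequent labels.

    Hypothetical helper: bar scaling and layout are assumptions,
    not panlabel's actual rendering code.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = counts.most_common(top)
    largest = ranked[0][1]
    name_width = max(len(name) for name, _ in ranked)
    lines = []
    for name, count in ranked:
        # Scale each bar against the most frequent category; keep at least one block.
        bar = "█" * max(1, round(width * count / largest))
        # Percentages are taken over all annotations, not just the top N.
        share = 100.0 * count / total
        lines.append(f"{name:<{name_width}}  {bar:<{width}} {count} ({share:.1f}%)")
    return lines
```

Calling `render_bars(labels, top=20)` mirrors the effect described for --top: only the most frequent categories are listed, while each percentage still refers to the whole dataset.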

Dimension Analysis

  • Image dimension distribution
  • Bounding box size statistics (min, max, average)
  • Aspect ratio analysis
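
To make the bounding box figures concrete, here is a small sketch of how min/max/average sizes and aspect ratios could be derived. It assumes boxes are given as (xmin, ymin, xmax, ymax) in pixels; `bbox_stats` is a hypothetical helper and the box layout is an assumption, not panlabel's internal representation.

```python
from statistics import mean


def bbox_stats(boxes):
    """Summarize (xmin, ymin, xmax, ymax) boxes given in pixel coordinates.

    Hypothetical helper; returns None when there are no boxes.
    """
    if not boxes:
        return None
    widths = [xmax - xmin for xmin, _, xmax, _ in boxes]
    heights = [ymax - ymin for _, ymin, _, ymax in boxes]
    # Aspect ratio is width / height; boxes with zero height are skipped.
    ratios = [w / h for w, h in zip(widths, heights) if h > 0]

    def summary(values):
        return {"min": min(values), "max": max(values), "avg": mean(values)}

    return {
        "width": summary(widths),
        "height": summary(heights),
        "aspect_ratio": summary(ratios) if ratios else None,
    }
```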

Quality Metrics

  • Out-of-bounds annotations (beyond --tolerance)
  • Empty or zero-area bounding boxes
  • Images with duplicate annotations
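
The out-of-bounds and zero-area checks can be pictured as simple geometric tests. This is a sketch under stated assumptions, not panlabel's implementation: boxes are (xmin, ymin, xmax, ymax) in pixels, a box lying exactly on the tolerance limit counts as in bounds, and both function names are hypothetical.

```python
def is_out_of_bounds(box, image_width, image_height, tolerance=0.5):
    """Return True if any edge of the box lies beyond the image by more than `tolerance` pixels."""
    xmin, ymin, xmax, ymax = box
    return (
        xmin < -tolerance
        or ymin < -tolerance
        or xmax > image_width + tolerance
        or ymax > image_height + tolerance
    )


def is_zero_area(box):
    """Return True if the box has no positive width or no positive height."""
    xmin, ymin, xmax, ymax = box
    return xmax - xmin <= 0 or ymax - ymin <= 0
```

With the default tolerance, a box whose right edge sits at x = 640.4 in a 640-pixel-wide image is not flagged, while one ending at x = 641 is. Raising the tolerance to 1.0 accepts the second box as well.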

Co-occurrence Analysis

  • Top N category pairs that appear together (controlled by --top)
  • Useful for understanding object relationships in your dataset
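
Co-occurrence counts of this kind are usually computed per image: each unordered pair of distinct categories present in an image adds one to that pair's count. The sketch below illustrates the idea. `top_cooccurrences` and its input layout (image id mapped to a list of labels) are assumptions for illustration, not panlabel's API.

```python
from collections import Counter
from itertools import combinations


def top_cooccurrences(image_labels, top=10):
    """Count the images in which each unordered pair of categories appears together.

    `image_labels` maps an image id to the labels annotated on that image.
    """
    pair_counts = Counter()
    for labels in image_labels.values():
        # Deduplicate so an image counts once per pair; sort for a stable pair key.
        unique = sorted(set(labels))
        pair_counts.update(combinations(unique, 2))
    return pair_counts.most_common(top)
```

An image with three people and one car therefore contributes exactly one to the person/car pair, which matches the "N images" unit used in the text report.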

Examples

Basic Statistics

panlabel stats dataset.json
Auto-detects the format and prints a text report.

Explicit Format with JSON Output

panlabel stats dataset.json --format coco --output json > stats.json
Generates a machine-readable JSON report.

HTML Report

panlabel stats dataset.json --output html > report.html
Creates a self-contained HTML file with interactive visualizations.

Show Top 20 Categories

panlabel stats large_dataset.json --top 20
Displays the 20 most frequent categories and the 20 most frequent category pairs.

YOLO Dataset Statistics

panlabel stats /data/yolo_dataset --format yolo --output text
Analyzes a YOLO directory structure.

Custom Tolerance for OOB Checks

panlabel stats dataset.json --tolerance 1.0
Uses a tolerance of 1 pixel instead of the default 0.5 pixels.

Output Examples

Text Report

Dataset Statistics

Overview:
  Images:              150
  Annotations:         1,234
  Categories:          8
  Avg annotations/img: 8.2
  Images w/o annot:    12 (8.0%)

Top 10 Categories:
  person      ████████████████████ 456 (37.0%)
  car         ███████████████      298 (24.2%)
  bicycle     ██████               124 (10.1%)
  motorcycle  ████                  89 (7.2%)
  bus         ███                   67 (5.4%)
  truck       ██                    54 (4.4%)
  traffic     ██                    48 (3.9%)
  stop_sign   █                     32 (2.6%)

Top 10 Category Co-occurrences:
  person × car         89 images
  person × bicycle     54 images
  car × traffic        42 images
  person × motorcycle  38 images
  car × truck          28 images
  bus × car            24 images
  bicycle × traffic    19 images
  person × bus         15 images
  motorcycle × traffic 12 images
  truck × traffic      11 images

Dimension Analysis:
  Image sizes: 640×480 (95), 1920×1080 (45), 800×600 (10)
  BBox width:  min=12px, max=580px, avg=156px
  BBox height: min=18px, max=420px, avg=142px

Quality:
  Out-of-bounds: 3 (0.2%)
  Zero-area:     0 (0.0%)

JSON Report Structure

{
  "overview": {
    "images": 150,
    "annotations": 1234,
    "categories": 8,
    "avg_annotations_per_image": 8.227,
    "images_without_annotations": 12
  },
  "categories": [
    {
      "name": "person",
      "count": 456,
      "percentage": 37.0
    },
    {
      "name": "car",
      "count": 298,
      "percentage": 24.2
    }
  ],
  "dimensions": {
    "image_sizes": {
      "640x480": 95,
      "1920x1080": 45,
      "800x600": 10
    },
    "bbox_stats": {
      "width": {"min": 12, "max": 580, "avg": 156.4},
      "height": {"min": 18, "max": 420, "avg": 142.1}
    }
  },
  "quality": {
    "out_of_bounds": 3,
    "zero_area": 0
  },
  "co_occurrences": [
    {"pair": ["person", "car"], "count": 89},
    {"pair": ["person", "bicycle"], "count": 54}
  ]
}
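
A report with this structure is straightforward to consume from a script. The following sketch assumes the field names shown in the sample above (`overview`, `categories`, and so on); `summarize` is a hypothetical helper, and the keys should be checked against the output of your installed version.

```python
import json


def summarize(report):
    """Return one-line summaries from a parsed stats report.

    Field names follow the sample report; treat them as assumptions.
    """
    overview = report["overview"]
    annotated = overview["images"] - overview["images_without_annotations"]
    lines = [f"{annotated}/{overview['images']} images have annotations"]
    for category in report.get("categories", []):
        lines.append(f"{category['name']}: {category['count']} ({category['percentage']}%)")
    return lines


# A trimmed report that reuses values from the sample structure.
raw = (
    '{"overview": {"images": 150, "images_without_annotations": 12},'
    ' "categories": [{"name": "person", "count": 456, "percentage": 37.0}]}'
)
for line in summarize(json.loads(raw)):
    print(line)
```

In practice the JSON string would be read from a file produced with --output json rather than embedded in the script.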

HTML Report

The HTML output creates a self-contained report with:
  • Interactive bar charts for category distribution
  • Searchable/sortable tables
  • Collapsible sections
  • Responsive design for mobile viewing
  • No external dependencies (all CSS/JS embedded)
panlabel stats dataset.json --output html > report.html
open report.html  # View in browser

Use Cases

Dataset Quality Assessment

# Check if dataset is balanced
panlabel stats training_data.json --top 50 --output text
Quickly identify imbalanced categories that may need more samples.
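
If you prefer a scripted check over reading the report by eye, the JSON output can be screened for under-represented categories. This is a sketch only: `imbalance_report` is a hypothetical helper, the 5% threshold is an arbitrary illustration, and it works on whichever entries appear in the report's `categories` list as shown in the JSON sample.

```python
def imbalance_report(categories, min_share=5.0):
    """Flag categories whose share of annotations is below `min_share` percent.

    `categories` is a list of {"name", "count", "percentage"} entries.
    """
    counts = [c["count"] for c in categories if c["count"] > 0]
    # Ratio between the most and least frequent non-empty categories.
    ratio = max(counts) / min(counts) if counts else None
    rare = [c["name"] for c in categories if c["percentage"] < min_share]
    return {"max_min_ratio": ratio, "underrepresented": rare}
```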

Pre-Training Analysis

# Generate comprehensive report before training
panlabel stats dataset.json --output html > analysis.html
Share with team members or include in documentation.

Automated Monitoring

# Track stats over time
panlabel stats current_dataset.json --output json | \
  jq -c '.overview' >> stats_history.jsonl
Monitor how your dataset evolves during annotation.
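
A history file is more useful when each snapshot carries a timestamp. The sketch below appends one compact JSON object per line; `append_snapshot` is a hypothetical helper, and the `overview` key is taken from the JSON sample in this page rather than from a confirmed schema.

```python
import json
from datetime import datetime, timezone


def append_snapshot(report, history_path, now=None):
    """Append the report's overview, plus a UTC timestamp, as one JSON line."""
    now = now or datetime.now(timezone.utc)
    record = {"timestamp": now.isoformat(), **report["overview"]}
    with open(history_path, "a", encoding="utf-8") as handle:
        # json.dumps emits a single line by default, which keeps the file valid JSONL.
        handle.write(json.dumps(record) + "\n")
    return record
```

Here `report` would typically come from `json.load` applied to the output of a stats run with --output json.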

Compare Dataset Versions

# Generate stats for before/after
panlabel stats v1.json --output json > v1_stats.json
panlabel stats v2.json --output json > v2_stats.json
jd v1_stats.json v2_stats.json  # Use jd or diff tool
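
When a generic JSON diff is too noisy, a focused comparison of per-category counts can be easier to read. This is a minimal sketch assuming both reports follow the JSON structure shown earlier; `category_deltas` is a hypothetical helper, not part of panlabel.

```python
def category_deltas(old_report, new_report):
    """Return {category: change in annotation count} for categories that changed."""
    old = {c["name"]: c["count"] for c in old_report.get("categories", [])}
    new = {c["name"]: c["count"] for c in new_report.get("categories", [])}
    deltas = {}
    for name in sorted(old.keys() | new.keys()):
        # A category missing from one report is treated as having zero annotations there.
        change = new.get(name, 0) - old.get(name, 0)
        if change != 0:
            deltas[name] = change
    return deltas
```

Loading v1_stats.json and v2_stats.json with `json.load` and passing both to `category_deltas` gives a compact view of which categories gained or lost annotations between versions.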

Performance Notes

  • Stats computation is fast even for large datasets (millions of annotations)
  • JSON output is more verbose but easier to parse programmatically
  • HTML generation adds minimal overhead and produces self-contained files
  • Use --top to limit output size for datasets with hundreds of categories

See Also

Validate Command

Validate dataset quality

Diff Command

Compare two datasets

Sample Command

Create balanced subsets based on stats