Summarization operations

Summarization operations extract information and generate reports from tabular data files without modifying the original data. These operations are essential for quality assurance, understanding dataset structure, and validating annotations.

What are summarizations?

Summarizations analyze data files and accumulate results across the dataset. They can:

  • Validate data: Check for errors, inconsistencies, or missing information

  • Profile datasets: Understand column structure, value distributions, and patterns

  • Generate reports: Create text and JSON summaries for documentation

  • Extract metadata: Pull out definitions, conditions, and design information

Key characteristics:

  • Do not modify the input data files

  • Accumulate information across multiple files (stateful)

  • Generate both text and JSON format outputs

  • Can be used multiple times in a pipeline as checkpoints

Common summarization workflows

Quality assurance pipeline:

  1. Summarize column names - Verify consistent column structure

  2. Summarize column values - Check for unexpected values

  3. Summarize HED validation - Validate HED annotations
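The three quality-assurance steps above can be chained in a single remodel file. The following is a sketch: the summary names are placeholders, and operation-specific parameters (such as skip_columns for summarize_column_values) are omitted; consult each operation's reference for which parameters are required.

```json
[
    {
        "operation": "summarize_column_names",
        "description": "Verify consistent column structure",
        "parameters": {
            "summary_name": "qa_column_names",
            "summary_filename": "qa_column_names"
        }
    },
    {
        "operation": "summarize_column_values",
        "description": "Check for unexpected values",
        "parameters": {
            "summary_name": "qa_column_values",
            "summary_filename": "qa_column_values"
        }
    },
    {
        "operation": "summarize_hed_validation",
        "description": "Validate HED annotations",
        "parameters": {
            "summary_name": "qa_hed_validation",
            "summary_filename": "qa_hed_validation",
            "check_for_warnings": false
        }
    }
]
```

Because summarizations do not modify the data, the three operations can run in any order; each produces its own summary files.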

Dataset understanding:

  1. Summarize column names - Identify column patterns

  2. Summarize column values - Understand value distributions

  3. Summarize HED type - Extract experimental design
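The design-extraction step above could be written as the following sketch. The type_tag parameter is an assumption here: it names the HED tag whose usage defines the experimental design, with Condition-variable being the typical choice.

```json
[
    {
        "operation": "summarize_hed_type",
        "description": "Extract experimental design",
        "parameters": {
            "summary_name": "experiment_design",
            "summary_filename": "experiment_design",
            "type_tag": "condition-variable"
        }
    }
]
```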

HED annotation analysis:

  1. Summarize HED validation - Check annotation validity

  2. Summarize definitions - Review HED definitions

  3. Summarize HED tags - Analyze tag usage patterns
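A remodel file for this annotation-analysis workflow might look like the sketch below. The operation names summarize_definitions and summarize_hed_tags are inferred from the step names above, and some of these operations may accept additional parameters not shown here.

```json
[
    {
        "operation": "summarize_hed_validation",
        "description": "Check annotation validity",
        "parameters": {
            "summary_name": "annotation_validation",
            "summary_filename": "annotation_validation",
            "check_for_warnings": true
        }
    },
    {
        "operation": "summarize_definitions",
        "description": "Review HED definitions",
        "parameters": {
            "summary_name": "definition_review",
            "summary_filename": "definition_review"
        }
    },
    {
        "operation": "summarize_hed_tags",
        "description": "Analyze tag usage patterns",
        "parameters": {
            "summary_name": "tag_usage",
            "summary_filename": "tag_usage"
        }
    }
]
```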

Available summarizations

The summarization operations fall into three groups:

  • Summarization summary - overview of all summarization operations

  • Column analysis - summarize column names and column values

  • HED operations - summarize HED validation, HED tags, HED types, and definitions

Common parameters

All summarization operations take two required parameters and one optional parameter:

summary_name (str)

A unique identifier for this summary instance. Use descriptive names that indicate what is being summarized.

summary_filename (str)

Base filename for saving the summary. The extension (.txt or .json) is added automatically, along with a timestamp if append_timecode is true.

append_timecode (bool, optional, default: False)

If true, append a timestamp to the filename to prevent overwriting previous summaries.
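Put together, a parameters block using all three standard parameters looks like this sketch (the summary and file names are placeholders):

```json
{
    "operation": "summarize_column_names",
    "description": "Example showing the standard parameters",
    "parameters": {
        "summary_name": "my_column_summary",
        "summary_filename": "my_column_summary",
        "append_timecode": true
    }
}
```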

Output formats

Summaries are saved in two formats:

Text format (.txt)

Human-readable format with:

  • Overall dataset summary

  • Individual file details (when requested)

  • Formatted tables and lists

  • Good for documentation and manual review

JSON format (.json)

Machine-readable format with:

  • Structured data for programmatic access

  • All summary information preserved

  • Suitable for automated processing

  • Can be loaded into analysis tools
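A JSON summary can be loaded with nothing more than the standard library. The sketch below writes a mock summary file in place of a real one (the actual key layout depends on which summarization operation produced the file) and then reads it back for inspection.

```python
import json
import tempfile
from pathlib import Path

# A mock summary stands in for a real one; the actual key layout
# depends on which summarization operation produced the file.
mock_summary = {
    "Summary name": "column_name_check",
    "Dataset": {"files": 3},
}

with tempfile.TemporaryDirectory() as tmp:
    summary_path = Path(tmp) / "column_names_summary.json"
    summary_path.write_text(json.dumps(mock_summary, indent=2))

    # Load the summary for programmatic access.
    with open(summary_path) as fp:
        summary = json.load(fp)

print(sorted(summary.keys()))  # top-level sections of the summary
```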

Output location

When processing full datasets (not individual files), summaries are automatically saved to:

<dataset_root>/derivatives/remodel/summaries/

The directory structure is created automatically if it doesn’t exist.
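For scripts that need to locate the summaries afterward, the output directory can be reconstructed from the dataset root. A minimal standard-library sketch, where the dataset root is a placeholder path:

```python
from pathlib import PurePosixPath

dataset_root = PurePosixPath("/data/my_dataset")  # placeholder dataset root

# Full-dataset summaries land under derivatives/remodel/summaries.
summary_dir = dataset_root / "derivatives" / "remodel" / "summaries"

print(summary_dir)  # /data/my_dataset/derivatives/remodel/summaries
```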

HED operations note

Operations with “HED” in their name require:

  • A HED schema version specified when creating the Dispatcher

  • Often a JSON sidecar file containing HED annotations

  • Data files with HED-annotated columns

See the User guide for details on using HED operations.

Examples

Basic column profiling:

[
    {
        "operation": "summarize_column_names",
        "description": "Check column consistency across files",
        "parameters": {
            "summary_name": "column_name_check",
            "summary_filename": "column_names"
        }
    },
    {
        "operation": "summarize_column_values",
        "description": "Profile column value distributions",
        "parameters": {
            "summary_name": "column_value_profile",
            "summary_filename": "column_values",
            "skip_columns": ["onset", "duration"],
            "value_columns": ["response_time"]
        }
    }
]

HED validation:

[
    {
        "operation": "summarize_hed_validation",
        "description": "Validate HED annotations",
        "parameters": {
            "summary_name": "hed_validation_check",
            "summary_filename": "hed_validation",
            "check_for_warnings": true
        }
    }
]

See also