Summarize column values¶

The summarize column values operation provides a summary of the number of times various column values appear in event files across the dataset.

Purpose¶

Use this operation to:

Understand the distribution of categorical values in columns
Identify unexpected or erroneous column values
Count unique values per column across the dataset
Profile datasets before analysis

Parameters¶

The following table lists the parameters required for using the summary.

Parameters for the summarize_column_values operation.

Parameter	Type	Description
summary_name	str	A unique name used to identify this summary.
summary_filename	str	A unique file basename to use for saving this summary.
append_timecode	bool	(Optional: Default false) If True, append a time code to filename.
max_categorical	int	(Optional: Default 50) If given, the text summary shows top max_categorical values. Otherwise the text summary displays all categorical values.
skip_columns	list	(Optional) A list of column names to omit from the summary.
value_columns	list	(Optional) A list of columns to omit the listing unique values.
values_per_line	int	(Optional: Default 5) If given, the text summary displays this number of values per line (default is 5).

In addition to the standard parameters, summary_name and summary_filename required of all summaries, the summarize_column_values operation requires two additional lists to be supplied. The skip_columns list specifies the names of columns to skip entirely in the summary. Typically, the onset, duration, and sample columns are skipped, since they have unique values for each row and their values have limited information.

The summarize_column_values is mainly meant for creating summary information about columns containing a finite number of distinct values. Columns that contain numeric information will usually have distinct entries for each row in a tabular file and are not amenable to such summarization. These columns could be specified as skip_columns, but another option is to designate them as value_columns. The value_columns are reported in the summary, but their distinct values are not reported individually.

For datasets that include multiple tasks, the event values for each task may be distinct. The summarize_column_values operation does not separate by task, but expects the calling programs filter the files by task as desired. The run_remodel program supports selecting files corresponding to a particular task.

Two additional optional parameters are available for specifying aspects of the text summary output. The max_categorical optional parameter specifies how many unique values should be displayed for each column. The values_per_line controls how many categorical column values (with counts) are displayed on each line of the output. By default, 5 values are displayed.

Example¶

The following example shows the JSON for including this operation in a remodeling file.

A JSON file with a single summarize_column_values summarization operation.

[{
   "operation": "summarize_column_values",
   "description": "Summarize the column values in an excerpt.",
   "parameters": {
       "summary_name": "AOMIC_column_values",
       "summary_filename": "AOMIC_column_values",
       "skip_columns": ["onset", "duration"],
       "value_columns": ["response_time", "stop_signal_delay"]
   }
}]

Results¶

A text format summary of the results of executing this operation on the sample remodel event file is shown in the following example.

Sample summarize_column_values operation results in text format.

Summary name: AOMIC_column_values
Summary type: column_values
Summary filename: AOMIC_column_values

Overall summary:
Dataset: Total events=6 Total files=1
   Categorical column values[Events, Files]:
      response_accuracy:
         correct[5, 1] n/a[1, 1]
      response_hand:
         left[2, 1] n/a[1, 1] right[3, 1]
      sex:
         female[4, 1] male[2, 1]
      trial_type:
         go[3, 1] succesful_stop[1, 1] unsuccesful_stop[2, 1]
   Value columns[Events, Files]:
      response_time[6, 1]
      stop_signal_delay[6, 1]

Individual files:

sub-0013_task-stopsignal_acq-seq_events.tsv:
Total events=200
      Categorical column values[Events, Files]:
         response_accuracy:
            correct[5, 1] n/a[1, 1]
         response_hand:
            left[2, 1] n/a[1, 1] right[3, 1]
         sex:
            female[4, 1] male[2, 1]
         trial_type:
            go[3, 1] succesful_stop[1, 1] unsuccesful_stop[2, 1]
      Value columns[Events, Files]:
         response_time[6, 1]
         stop_signal_delay[6, 1]

Because the sample remodel event file only has 6 events, we expect that no value will be represented in more than 6 events. The column names corresponding to value columns just have the event counts in them.

This command was executed with the -i option in run_remodel, results from the individual data files are shown after the overall summary. The individual results are similar to the overall summary because only one data file was processed.

For a more extensive example see the text and JSON format summaries of the sample dataset ds003645s_hed using the summarize_columns_rmdl.json remodeling file.

Notes¶

Essential for understanding categorical column distributions
Use skip_columns for high-cardinality columns (onset, duration, sample)
Use value_columns for numeric columns to avoid listing all unique values
Counts show [events_with_value, files_with_value] format
Helps identify data quality issues and unexpected values
Text output formatting controlled by values_per_line and max_categorical