Custom operations¶

This guide provides comprehensive documentation for developers who want to create custom remodeling operations for Table Remodeler. It covers the architecture, requirements, and best practices for extending the remodeling framework.

Table of contents¶

Architecture overview
Operation class structure
PARAMS dictionary specification
Implementing BaseOp
Implementing summarization operations
Validator integration
Testing custom operations
Best practices

Architecture overview¶

Operations are defined as classes that extend BaseOp regardless of whether they are transformations or summaries. However, summaries must also implement an additional supporting class that extends BaseSummary to hold the summary information.

In order to be executed by the remodeling functions, an operation must appear in the valid_operations dictionary located in remodeler/operations/valid_operations.py.

Key components¶

BaseOp: Abstract base class for all operations
BaseSummary: Abstract base class for summary support classes
Dispatcher: Orchestrates operation execution on datasets
RemodelerValidator: Validates operation specifications against JSON schemas
valid_operations: Registry dictionary mapping operation names to classes

Operation class structure¶

Each operation class must have:

NAME: Class variable (string) specifying the operation name
PARAMS: Class variable containing a JSON schema dictionary of parameters
Constructor: Calls super().__init__(parameters) and sets up operation-specific properties
do_op(): Main method that performs the operation
validate_input_data(): Static method for additional validation beyond JSON schema

Basic operation template¶

from remodeler.operations.base_op import BaseOp

class MyCustomOp(BaseOp):
    """Brief description of what this operation does."""
    
    NAME = "my_custom_operation"
    
    PARAMS = {
        "type": "object",
        "properties": {
            "my_parameter": {
                "type": "string",
                "description": "Description of parameter"
            },
            "optional_param": {
                "type": "integer", 
                "description": "Optional parameter"
            }
        },
        "required": ["my_parameter"],
        "additionalProperties": False
    }
    
    def __init__(self, parameters):
        """Initialize the operation with validated parameters."""
        super().__init__(parameters)
        self.my_parameter = parameters['my_parameter']
        self.optional_param = parameters.get('optional_param', default_value)
    
    def do_op(self, dispatcher, df, name, sidecar=None):
        """
        Execute the operation on a DataFrame.
        
        Parameters:
            dispatcher (Dispatcher): The dispatcher managing operations
            df (pd.DataFrame): The tabular data to transform
            name (str): Identifier for the file (for error messages)
            sidecar (dict): Optional JSON sidecar for HED operations
            
        Returns:
            pd.DataFrame: The transformed DataFrame
        """
        # Your transformation logic here
        # For transformations, return modified DataFrame
        # For summaries, update dispatcher.summary_dict and return original df
        return df
    
    @staticmethod
    def validate_input_data(parameters):
        """
        Perform additional validation beyond JSON schema.
        
        Parameters:
            parameters (dict): The operation parameters
            
        Returns:
            list: List of error message strings (empty if no errors)
        """
        errors = []
        # Add custom validation logic here
        return errors

PARAMS dictionary specification¶

The PARAMS dictionary uses JSON Schema (draft-2020-12) to specify operation parameters.

Basic structure¶

PARAMS = {
    "type": "object",  # Always "object" at top level
    "properties": {
        # Define each parameter here
    },
    "required": [
        # List required parameter names
    ],
    "additionalProperties": False  # Recommended: disallow unexpected parameters
}

Parameter types¶

String parameter¶

"column_name": {
    "type": "string",
    "description": "The name of the column to process"
}

Integer/number parameter¶

"max_count": {
    "type": "integer",
    "minimum": 1,
    "description": "Maximum number of items"
}

Boolean parameter¶

"ignore_missing": {
    "type": "boolean",
    "description": "If true, ignore missing columns"
}

Array (list) parameter¶

"column_names": {
    "type": "array",
    "items": {
        "type": "string"
    },
    "minItems": 1,
    "description": "List of column names"
}

Object (dictionary) parameter¶

"column_mapping": {
    "type": "object",
    "patternProperties": {
        ".*": {
            "type": "string"
        }
    },
    "minProperties": 1,
    "description": "Mapping of old names to new names"
}

Example: rename_columns PARAMS¶

PARAMS = {
    "type": "object",
    "properties": {
        "column_mapping": {
            "type": "object",
            "patternProperties": {
                ".*": {
                    "type": "string"
                }
            },
            "minProperties": 1,
            "description": "Dictionary mapping old column names to new names"
        },
        "ignore_missing": {
            "type": "boolean",
            "description": "If false, raise error for missing columns"
        }
    },
    "required": [
        "column_mapping",
        "ignore_missing"
    ],
    "additionalProperties": False
}

Parameter dependencies¶

Use dependentRequired for conditional requirements:

"dependentRequired": {
    "factor_names": ["factor_values"]  # factor_names requires factor_values
}

Implementing BaseOp¶

Constructor requirements¶

Always call the superclass constructor first:

def __init__(self, parameters):
    super().__init__(parameters)
    # Then initialize operation-specific attributes
    self.column_names = parameters['column_names']

The do_op method¶

Signature:

def do_op(self, dispatcher, df, name, sidecar=None):

Parameters:

dispatcher: Instance of Dispatcher managing the remodeling process
df: Pandas DataFrame representing the tabular file
name: Identifier for the file (usually filename or relative path)
sidecar: Optional JSON sidecar containing HED annotations (for HED operations)

Returns:

Modified DataFrame (for transformations)
Original DataFrame (for summaries - they modify dispatcher.summary_dict instead)

Important Notes:

n/a Handling: The do_op method assumes that n/a values have been replaced with numpy.NaN in the incoming DataFrame. The Dispatcher handles this conversion automatically using prep_data() before operations and post_proc_data() after.
Don’t Modify in Place: Return a new DataFrame rather than modifying the input DataFrame in place.
Error Handling: Raise appropriate exceptions with clear messages for error conditions.

Example implementation¶

def do_op(self, dispatcher, df, name, sidecar=None):
    """Remove specified columns from the DataFrame."""
    return df.drop(self.column_names, axis=1, errors=self.error_handling)

The validate_input_data method¶

Use this static method for validation that cannot be expressed in JSON schema.

@staticmethod
def validate_input_data(parameters):
    """
    Validate parameters beyond JSON schema constraints.
    
    Common use cases:
    - Checking list lengths match
    - Validating value ranges based on other parameters
    - Checking for logical inconsistencies
    
    Returns:
        list: User-friendly error messages (empty if valid)
    """
    errors = []
    
    # Example: Check that two lists have the same length
    if parameters.get("query_names"):
        if len(parameters["query_names"]) != len(parameters["queries"]):
            errors.append(
                "The query_names list must have the same length as queries list."
            )
    
    return errors

Implementing summarization operations¶

Summarization operations require both an operation class (extending BaseOp) and a summary class (extending BaseSummary).

Summarization operation class¶

from remodeler.operations.base_summary import BaseSummary

class MySummarySummary(BaseSummary):
    """Holds summary information for my_summary operation."""
    
    def __init__(self, sum_op):
        """Initialize with operation parameters."""
        super().__init__(sum_op)
        self.summary_data = {}  # Store your summary data
    
    def update_summary(self, summary_dict):
        """
        Update summary with information from one file.
        
        Parameters:
            summary_dict (dict): File-specific information
                Example: {"name": "file.tsv", "column_names": [...]}
        """
        # Extract and store information
        name = summary_dict['name']
        # ... accumulate your summary data
    
    def get_summary_details(self, verbose=True):
        """
        Return the summary information as a dictionary.
        
        Parameters:
            verbose (bool): If True, include detailed information
            
        Returns:
            dict: Summary data formatted for output
        """
        return {
            'summary_type': 'my_summary',
            'data': self.summary_data
            # ... your summary structure
        }

class MySummaryOp(BaseOp):
    """Summarize some aspect of the data."""
    
    NAME = "my_summary"
    
    PARAMS = {
        "type": "object",
        "properties": {
            "summary_name": {
                "type": "string",
                "description": "Unique identifier for this summary"
            },
            "summary_filename": {
                "type": "string",
                "description": "Base filename for saving summary"
            }
        },
        "required": ["summary_name", "summary_filename"],
        "additionalProperties": False
    }
    
    def __init__(self, parameters):
        super().__init__(parameters)
        self.summary_name = parameters['summary_name']
        self.summary_filename = parameters['summary_filename']
    
    def do_op(self, dispatcher, df, name, sidecar=None):
        """Update summary without modifying DataFrame."""
        # Get or create summary object
        summary = dispatcher.summary_dict.get(self.summary_name)
        if not summary:
            summary = MySummarySummary(self)
            dispatcher.summary_dict[self.summary_name] = summary
        
        # Update with file-specific information
        summary.update_summary({
            "name": name,
            "data": self._extract_data(df)
        })
        
        # Return original DataFrame unchanged
        return df
    
    def _extract_data(self, df):
        """Helper method to extract relevant data."""
        # Your data extraction logic
        return extracted_data
    
    @staticmethod
    def validate_input_data(parameters):
        return []  # No additional validation needed

BaseSummary required methods¶

update_summary(summary_dict)¶

Called by the operation’s do_op method for each file processed. Accumulates information across all files.

def update_summary(self, summary_dict):
    file_name = summary_dict['name']
    file_data = summary_dict['data']
    
    # Store file-specific information
    self.file_summaries[file_name] = file_data
    
    # Update overall statistics
    self.total_files += 1
    # ... accumulate other data

get_summary_details(verbose=True)¶

Returns the accumulated summary data as a dictionary. This is called when saving summaries to JSON or text format.

def get_summary_details(self, verbose=True):
    details = {
        'overall': {
            'total_files': self.total_files,
            # ... overall statistics
        }
    }
    
    if verbose:
        details['individual_files'] = self.file_summaries
    
    return details

Validator integration¶

Registering your operation¶

Add your operation to remodeler/operations/valid_operations.py:

from remodeler.operations.my_custom_op import MyCustomOp

valid_operations = {
    # ... existing operations ...
    "my_custom_operation": MyCustomOp,
}

Validation stages¶

The validator processes operations in stages:

Stage 0: Top-level structure (must be array of operations)
Stage 1: Operation dictionary format (must have operation, description, parameters keys)
Stage 2: Operation dictionary values (validate types and check valid operation names)
Later Stages: Nested parameter validation using JSON schema
Final Stage: Call validate_input_data() for each operation

Error messages¶

The validator provides user-friendly error messages indicating:

Operation index in the remodeling file
Operation name
Path to the invalid value
What constraint was violated

Testing custom operations¶

Unit testing structure¶

Create test files in tests/operations/ following the pattern:

import unittest
import pandas as pd
from remodeler.operations.my_custom_op import MyCustomOp
from remodeler import Dispatcher

class TestMyCustomOp(unittest.TestCase):
    """Test cases for MyCustomOp."""
    
    def setUp(self):
        """Set up test fixtures."""
        self.sample_df = pd.DataFrame({
            'col1': [1, 2, 3],
            'col2': ['a', 'b', 'c']
        })
    
    def test_basic_functionality(self):
        """Test basic operation behavior."""
        params = {
            'my_parameter': 'value'
        }
        op = MyCustomOp(params)
        
        # Create a simple dispatcher for testing
        dispatcher = Dispatcher([], data_root=".")
        
        result = op.do_op(dispatcher, self.sample_df, "test_file.tsv")
        
        # Assert expected transformations
        self.assertIsInstance(result, pd.DataFrame)
        # ... more assertions
    
    def test_parameter_validation(self):
        """Test parameter validation."""
        # Test invalid parameters
        errors = MyCustomOp.validate_input_data({
            'my_parameter': 'value',
            'invalid_list': [1, 2]  # Wrong length
        })
        self.assertTrue(len(errors) > 0)
        
        # Test valid parameters
        errors = MyCustomOp.validate_input_data({
            'my_parameter': 'value',
            'valid_list': [1, 2, 3]
        })
        self.assertEqual(len(errors), 0)
    
    def test_error_handling(self):
        """Test error conditions."""
        params = {'my_parameter': 'value'}
        op = MyCustomOp(params)
        dispatcher = Dispatcher([], data_root=".")
        
        # Test with problematic data
        bad_df = pd.DataFrame({'wrong_col': [1, 2, 3]})
        
        with self.assertRaises(KeyError):
            op.do_op(dispatcher, bad_df, "test.tsv")

if __name__ == '__main__':
    unittest.main()

Integration testing¶

Test with complete remodeling workflows:

def test_full_remodeling_workflow(self):
    """Test operation in a complete remodeling pipeline."""
    operations = [
        {
            "operation": "my_custom_operation",
            "description": "Test my operation",
            "parameters": {
                "my_parameter": "test_value"
            }
        }
    ]
    
    dispatcher = Dispatcher(operations, data_root="path/to/test/data")
    dispatcher.run_operations()
    
    # Assert expected results

Best practices¶

Design principles¶

Single Responsibility: Each operation should do one thing well
Immutability: Don’t modify input DataFrames in place
Clear Errors: Provide helpful error messages with context
Documentation: Include docstrings for class and methods
Type Hints: Use type hints for better IDE support

Parameter design¶

Required vs Optional: Make commonly-used parameters required
Sensible Defaults: Provide defaults for optional parameters
Clear Names: Use descriptive parameter names
Consistent: Follow naming conventions from existing operations

Error handling¶

def do_op(self, dispatcher, df, name, sidecar=None):
    # Check preconditions
    if self.column_name not in df.columns:
        raise ValueError(
            f"Column '{self.column_name}' not found in {name}. "
            f"Available columns: {list(df.columns)}"
        )
    
    # Perform operation with try/except for specific errors
    try:
        result = df.copy()
        # ... transformation logic
        return result
    except Exception as e:
        raise RuntimeError(
            f"Error processing {name} with {self.NAME}: {str(e)}"
        )

Performance considerations¶

Vectorize Operations: Use pandas vectorized operations instead of loops
Avoid Copies: Only copy DataFrames when necessary
Lazy Evaluation: Compute expensive operations only when needed
Memory Efficiency: For large datasets, consider memory usage

Documentation¶

Include comprehensive docstrings:

class MyCustomOp(BaseOp):
    """
    Transform data by applying custom logic.
    
    This operation processes tabular data by...
    
    Common use cases:
    - Use case 1
    - Use case 2
    
    Example JSON:
        {
            "operation": "my_custom_operation",
            "description": "Process the data",
            "parameters": {
                "my_parameter": "value"
            }
        }
    
    See Also:
        - related_operation: For alternative approach
        - another_operation: For complementary functionality
    """

Testing checklist¶

Unit tests for basic functionality
Tests for edge cases
Tests for error conditions
Parameter validation tests
Integration tests with Dispatcher
Tests with various DataFrame structures
Tests with n/a values
Performance tests for large datasets (if applicable)

Example: complete custom operation¶

Here’s a complete example of a custom operation that converts column values to uppercase:

from remodeler.operations.base_op import BaseOp
import pandas as pd

class UppercaseColumnsOp(BaseOp):
    """
    Convert specified columns to uppercase.
    
    Example JSON:
        {
            "operation": "uppercase_columns",
            "description": "Convert trial_type to uppercase",
            "parameters": {
                "column_names": ["trial_type", "condition"],
                "ignore_missing": true
            }
        }
    """
    
    NAME = "uppercase_columns"
    
    PARAMS = {
        "type": "object",
        "properties": {
            "column_names": {
                "type": "array",
                "items": {"type": "string"},
                "minItems": 1,
                "description": "List of column names to convert to uppercase"
            },
            "ignore_missing": {
                "type": "boolean",
                "description": "If true, ignore columns not in DataFrame"
            }
        },
        "required": ["column_names", "ignore_missing"],
        "additionalProperties": False
    }
    
    def __init__(self, parameters):
        super().__init__(parameters)
        self.column_names = parameters['column_names']
        self.ignore_missing = parameters['ignore_missing']
    
    def do_op(self, dispatcher, df, name, sidecar=None):
        """Convert specified columns to uppercase."""
        result = df.copy()
        
        for col in self.column_names:
            if col not in result.columns:
                if not self.ignore_missing:
                    raise KeyError(
                        f"Column '{col}' not found in {name}. "
                        f"Available: {list(result.columns)}"
                    )
                continue
            
            # Convert to uppercase, preserving NaN values
            result[col] = result[col].str.upper()
        
        return result
    
    @staticmethod
    def validate_input_data(parameters):
        """No additional validation needed beyond JSON schema."""
        return []

# In remodeler/operations/valid_operations.py
from remodeler.operations.uppercase_columns_op import UppercaseColumnsOp

valid_operations = {
    # ... existing operations ...
    "uppercase_columns": UppercaseColumnsOp,
}

Resources¶

JSON Schema Documentation: https://json-schema.org/
Pandas DataFrame API: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Python unittest: https://docs.python.org/3/library/unittest.html
Table Remodeler Repository: https://github.com/hed-standard/table-remodeler
Operations Reference: operations/index.rst

Getting help¶

Review existing operations in remodeler/operations/ for examples
Check the test suite in tests/operations/ for testing patterns
Open an issue on GitHub for questions or feature requests
See the User Guide for operation usage examples