
Building a Simple Data Quality DSL in Python



 

Introduction

 
Data validation code in Python is often a pain to maintain. Business rules get buried in nested if statements, validation logic mixes with error handling, and adding new checks often means sifting through procedural functions to find the right place to insert code. Yes, there are data validation frameworks you can use, but we’ll focus on building something super simple yet useful with Python.

Let’s write a simple Domain-Specific Language (DSL) of sorts by creating a vocabulary specifically for data validation. Instead of writing generic Python code, you build specialized functions and classes that express validation rules in terms that match how you think about the problem.

For data validation, this means rules that read like business requirements: “customer ages must be between 18 and 120” or “email addresses must contain an @ symbol and should have a valid domain.” You’d like the DSL to handle the mechanics of checking data and reporting violations, while you focus on expressing what valid data looks like. The result is validation logic that’s readable, easy to maintain and test, and simple to extend. So, let’s start coding!

🔗 Link to the code on GitHub

 

Why Build a DSL?

 
Consider validating customer data with Python:

def validate_customers(df):
    errors = []
    if df['customer_id'].duplicated().any():
        errors.append("Duplicate IDs")
    if (df['age'] < 0).any():
        errors.append("Negative ages")
    if not df['email'].str.contains('@').all():
        errors.append("Invalid emails")
    return errors

 

This approach hardcodes validation logic, mixes business rules with error handling, and becomes unmaintainable as rules multiply. Instead, we’re looking to write a DSL that separates concerns and creates reusable validation components.

Instead of writing procedural validation functions, a DSL lets you express rules that read like business requirements:

# Traditional approach
if df['age'].min() < 0 or df['age'].max() > 120:
    raise ValueError("Invalid ages found")

# DSL approach  
validator.add_rule(Rule("Valid ages", between('age', 0, 120), "Ages must be 0-120"))

 

The DSL approach separates what you’re validating (business rules) from how violations are handled (error reporting). This makes validation logic testable, reusable, and readable by non-programmers.

 

Creating a Sample Dataset

 
Start by spinning up a small but realistic e-commerce customer dataset containing common quality issues:

import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 103, 105],
    'email': ['john@gmail.com', 'invalid-email', '', 'sarah@yahoo.com', 'mike@domain.co'],
    'age': [25, -5, 35, 200, 28],
    'total_spent': [250.50, 1200.00, 0.00, -50.00, 899.99],
    'join_date': ['2023-01-15', '2023-13-45', '2023-02-20', '2023-02-20', '']
}) # Note: 2023-13-45 is an intentionally malformed date.

 

This dataset has duplicate customer IDs, invalid email formats, impossible ages, negative spending amounts, and malformed dates. That should work pretty well for testing validation rules.

 

Writing the Validation Logic

 

// Creating the Rule Class

Let’s start by writing a simple Rule class that wraps validation logic:

class Rule:
    def __init__(self, name, condition, error_msg):
        self.name = name
        self.condition = condition
        self.error_msg = error_msg
    
    def check(self, df):
        # The condition function returns True for VALID rows.
        # We use ~ (bitwise NOT) to select the rows that VIOLATE the condition.
        violations = df[~self.condition(df)]
        if not violations.empty:
            return {
                'rule': self.name,
                'message': self.error_msg,
                'violations': len(violations),
                'sample_rows': violations.head(3).index.tolist()
            }
        return None

 

The condition parameter accepts any function that takes a DataFrame and returns a boolean Series indicating valid rows. The tilde operator (~) inverts this Boolean Series to identify violations. When violations exist, the check method returns detailed information including the rule name, error message, violation count, and sample row indices for debugging.

This design separates validation logic from error reporting. The condition function focuses purely on the business rule while the Rule class handles error details consistently.
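Before wiring rules into a validator, you can exercise a single Rule by hand. Here's a minimal sketch using the class above and the customers DataFrame from earlier:

# A minimal sketch: checking one Rule directly against the sample customers DataFrame
spending_rule = Rule(
    "Non-negative spending",
    lambda df: df['total_spent'] >= 0,   # condition: True for valid rows
    "Total spending amount cannot be negative"
)

print(spending_rule.check(customers))
# {'rule': 'Non-negative spending', 'message': 'Total spending amount cannot be negative',
#  'violations': 1, 'sample_rows': [3]}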

 

// Adding Multiple Rules

Next, let’s code up a DataValidator class that manages collections of rules:

class DataValidator:
    def __init__(self):
        self.rules = []
    
    def add_rule(self, rule):
        self.rules.append(rule)
        return self # Enables method chaining
    
    def validate(self, df):
        results = []
        for rule in self.rules:
            violation = rule.check(df)
            if violation:
                results.append(violation)
        return results

 

The add_rule method returns self to enable method chaining. The validate method executes all rules independently and collects violation reports. This approach ensures one failing rule doesn’t prevent others from running.
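Because add_rule returns self, rules can also be registered in a fluent, chained style. A quick sketch using inline lambdas as conditions:

# Method chaining: the same validator can be built fluently
chained = (
    DataValidator()
    .add_rule(Rule("No missing emails", lambda df: df['email'].notna(), "Email is required"))
    .add_rule(Rule("Reasonable ages", lambda df: df['age'].between(0, 120), "Ages must be 0-120"))
)
print(len(chained.rules))  # 2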

 

// Building Readable Conditions

Recall that when instantiating an object of the Rule class, we also need a condition function. This can be any function that takes in a DataFrame and returns a Boolean Series. While simple lambda functions work, they aren’t very easy to read. So let’s write helper functions to create a readable validation vocabulary:

def not_null(column):
    return lambda df: df[column].notna()

def unique_values(column):
    return lambda df: ~df.duplicated(subset=[column], keep=False)

def between(column, min_val, max_val):
    return lambda df: df[column].between(min_val, max_val)

 

Each helper function returns a lambda that works with pandas Boolean operations.

  • The not_null helper uses pandas’ notna() method to identify non-null values.
  • The unique_values helper uses duplicated(..., keep=False) with a subset parameter to flag all duplicate occurrences, ensuring a more accurate violation count.
  • The between helper uses the pandas between() method which handles range checks automatically.

For pattern matching, regular expressions become straightforward:

def matches_pattern(column, pattern):
    return lambda df: df[column].str.match(pattern, na=False)

 

The na=False parameter ensures missing values are treated as validation failures rather than matches, which is typically the desired behavior for required fields.
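To make that na=False behavior concrete, here's a small sketch that runs matches_pattern over a column containing a missing value:

# Missing values are treated as validation failures, not matches
emails = pd.DataFrame({'email': ['john@gmail.com', None, 'invalid-email']})
check = matches_pattern('email', r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
print(check(emails).tolist())  # [True, False, False]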

 

Building a Data Validator for the Sample Dataset

 
Let’s now build a validator for the customer dataset to see how this DSL works:

validator = DataValidator()

validator.add_rule(Rule(
   "Unique customer IDs", 
   unique_values('customer_id'),
   "Customer IDs must be unique across all records"
))

validator.add_rule(Rule(
   "Valid email format",
   matches_pattern('email', r'^[^@\s]+@[^@\s]+\.[^@\s]+$'),
   "Email addresses must contain @ symbol and domain"
))

validator.add_rule(Rule(
   "Reasonable customer age",
   between('age', 13, 120),
   "Customer age must be between 13 and 120 years"
))

validator.add_rule(Rule(
   "Non-negative spending",
   lambda df: df['total_spent'] >= 0,
   "Total spending amount cannot be negative"
))

 

Each rule follows the same pattern: a descriptive name, a validation condition, and an error message.

  • The first rule uses the unique_values helper function to check for duplicate customer IDs.
  • The second rule applies regular expression pattern matching to validate email formats. The pattern requires at least one character before and after the @ symbol, plus a domain extension.
  • The third rule uses the between helper for range validation, setting reasonable age limits for customers.
  • The final rule uses a lambda function for an inline condition checking that total_spent values are non-negative.

Notice how each rule reads almost like a business requirement. The validator collects these rules and can execute them all against any DataFrame with matching column names:

issues = validator.validate(customers)

print("Validation Results:")
for issue in issues:
    print(f"❌ Rule: {issue['rule']}")
    print(f"   Problem: {issue['message']}")
    print(f"   Violations: {issue['violations']}")
    print(f"   Affected rows: {issue['sample_rows']}")
    print()

 

The output clearly identifies specific problems and their locations in the dataset, making debugging straightforward. For the sample data, you’ll get the following output:

Validation Results:
❌ Rule: Unique customer IDs
   Problem: Customer IDs must be unique across all records
   Violations: 2
   Affected rows: [2, 3]

❌ Rule: Valid email format
   Problem: Email addresses must contain @ symbol and domain
   Violations: 2
   Affected rows: [1, 2]

❌ Rule: Reasonable customer age
   Problem: Customer age must be between 13 and 120 years
   Violations: 2
   Affected rows: [1, 3]

❌ Rule: Non-negative spending
   Problem: Total spending amount cannot be negative
   Violations: 1
   Affected rows: [3]

 

Adding Cross-Column Validations

 

Real business rules often involve relationships between columns. Custom condition functions handle this kind of cross-column logic:

def high_spender_email_required(df):
    high_spenders = df['total_spent'] > 500
    has_valid_email = df['email'].str.contains('@', na=False)
    # Passes if: (Not a high spender) OR (Has a valid email)
    return ~high_spenders | has_valid_email

validator.add_rule(Rule(
    "High Spenders Need Valid Email",
    high_spender_email_required,
    "Customers spending over $500 must have valid email addresses"
))

 

This rule uses Boolean logic where high-spending customers must have valid emails, but low spenders can have missing contact information. The expression ~high_spenders | has_valid_email translates to “not a high spender OR has valid email,” which allows low spenders to pass validation regardless of email status.

 

Handling Date Validation

 
Date validation requires careful handling since date parsing can fail:

def valid_date_format(column, date_format="%Y-%m-%d"):
    def check_dates(df):
        # pd.to_datetime with errors="coerce" turns invalid dates into NaT (Not a Time)
        parsed_dates = pd.to_datetime(df[column], format=date_format, errors="coerce")
        # A row is valid if the original value is not null AND the parsed date is not NaT
        return df[column].notna() & parsed_dates.notna()
    return check_dates

validator.add_rule(Rule(
    "Valid Join Dates",
    valid_date_format('join_date'),
    "Join dates must follow YYYY-MM-DD format"
))

 

The validation passes only when the original value is not null AND the parsed date is valid (i.e., not NaT). There is no need for a try-except block here: errors="coerce" makes pd.to_datetime handle malformed strings gracefully by converting them to NaT, which parsed_dates.notna() then flags as invalid.
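To see the coercion in action, here's a quick sketch with the malformed date from the sample data:

dates = pd.Series(['2023-01-15', '2023-13-45', ''])
print(pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce").tolist())
# [Timestamp('2023-01-15 00:00:00'), NaT, NaT]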

 

Writing Decorator Integration Patterns

 
For production pipelines, you can write decorator patterns that provide clean integration:

def validate_dataframe(validator):
    def decorator(func):
        def wrapper(df, *args, **kwargs):
            issues = validator.validate(df)
            if issues:
                error_details = [f"{issue['rule']}: {issue['violations']} violations" for issue in issues]
                raise ValueError(f"Data validation failed: {'; '.join(error_details)}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

# 'validator' is the DataValidator instance built earlier in this article
@validate_dataframe(validator)
def process_customer_data(df):
    return df.groupby('age').agg({'total_spent': 'sum'})

 

This decorator ensures data passes validation before processing begins, preventing corrupted data from propagating through the pipeline. When validation fails, it raises a descriptive error listing each violated rule and its violation count. The validator instance simply needs to be in scope where the decorator is applied.
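For example, calling the decorated function on the flawed sample data fails fast instead of producing a misleading summary:

try:
    summary = process_customer_data(customers)
except ValueError as e:
    print(e)
    # e.g. Data validation failed: Unique customer IDs: 2 violations; Valid email format: 2 violations; ...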

 

Extending the Pattern

 
You can extend the DSL to include other validation rules as needed:

# Statistical outlier detection
def within_standard_deviations(column, std_devs=3):
    # Valid if absolute difference from mean is within N standard deviations
    return lambda df: abs(df[column] - df[column].mean()) <= std_devs * df[column].std()

# Referential integrity across datasets
def foreign_key_exists(column, reference_df, reference_column):
    # Valid if value in column is present in the reference_column of the reference_df
    return lambda df: df[column].isin(reference_df[reference_column])

# Custom business logic
def profit_margin_reasonable(df):
    # Ensures 0 <= margin <= 1
    margin = (df['revenue'] - df['cost']) / df['revenue']
    return (margin >= 0) & (margin <= 1)

 

This is how you can build validation logic as composable functions that return Boolean series.
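These helpers plug into the same Rule and DataValidator machinery as everything else. Here's a sketch, assuming a hypothetical orders DataFrame whose customer_id values must exist in the customers table:

# Hypothetical orders table for illustration
orders = pd.DataFrame({'order_id': [1, 2, 3], 'customer_id': [101, 102, 999]})

order_validator = DataValidator()
order_validator.add_rule(Rule(
    "Orders reference known customers",
    foreign_key_exists('customer_id', customers, 'customer_id'),
    "Every order must belong to an existing customer"
))

for issue in order_validator.validate(orders):
    print(f"❌ {issue['rule']}: {issue['violations']} violations")  # 999 has no matching customer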

Here’s a self-contained example of how you can use the data validation DSL we’ve built, assuming the classes and helper functions live in a module called data_quality_dsl:

import pandas as pd
from data_quality_dsl import DataValidator, Rule, unique_values, between, matches_pattern

# Sample data
df = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'email': ['user@test.com', 'invalid', 'user@real.com', ''],
    'age': [25, -5, 30, 150]
})

# Build validator
validator = DataValidator()
validator.add_rule(Rule("Unique users", unique_values('user_id'), "User IDs must be unique"))
validator.add_rule(Rule("Valid emails", matches_pattern('email', r'^[^@]+@[^@]+\.[^@]+$'), "Invalid email format"))
validator.add_rule(Rule("Reasonable ages", between('age', 0, 120), "Age must be 0-120"))

# Run validation
issues = validator.validate(df)
for issue in issues:
    print(f"❌ {issue['rule']}: {issue['violations']} violations")

 

Conclusion

 
This DSL, although simple, works because it aligns with how data professionals think about validation. Rules express business logic as easy-to-understand requirements while letting pandas provide both performance and flexibility.

The separation of concerns makes validation logic testable and maintainable. This approach requires no external dependencies beyond pandas and introduces no learning curve for those already familiar with pandas operations.
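Because each helper is just a function from DataFrame to Boolean Series, the rules are easy to unit test in isolation; for instance, a minimal pytest-style check of the between helper:

def test_between_flags_out_of_range_ages():
    df = pd.DataFrame({'age': [25, -5, 200]})
    assert between('age', 0, 120)(df).tolist() == [True, False, False]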

This is something I worked on over a couple of evening coding sprints and several cups of coffee (of course!). But you can use this version as a starting point and build something much cooler. Happy coding!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.




