Wednesday, June 4, 2025

Data Validation Checks During Migration

Check Type

Description

How to Perform / Example

Row Count Validation

Verify the total number of rows in source and target tables match exactly.

SELECT COUNT(*) FROM source_table; 

SELECT COUNT(*) FROM target_table;

Data Completeness

Ensure all mandatory fields are migrated with no missing/null data unless allowed.

Check for nulls in mandatory columns after migration.

Data Type Validation

Confirm data types are preserved or correctly transformed in the target system.

Compare schema definitions before and after migration.

Data Accuracy

Validate that data values match between source and target for key fields.

Sample or full data comparison using checksums or hash functions.

Duplicate Records Check

Ensure duplicates are not introduced or lost during migration.

Use distinct counts or unique constraints validation.

Referential Integrity

Verify foreign key relationships remain intact post-migration.

Check counts of joined tables and constraint enforcement.

Data Transformation Validation

If transformation logic is applied during migration, validate the correctness.

Compare transformed data against expected output.

Null / Default Value Checks

Check if nulls and default values in source are correctly handled in target.

Validate null counts and default value assignments.

Date and Time Validation

Confirm dates/timestamps are correctly migrated and timezone handled.

Compare min/max dates and specific edge cases.

Special Characters and Encoding

Verify no data corruption occurs in text fields (e.g., encoding issues).

Validate text fields for special chars or encoding mismatches.

Index and Constraint Validation

Ensure indexes and constraints (unique, primary keys) are correctly applied.

Compare constraints and test queries on target system.

Checksum / Hash Validation

Use checksums or hashing to compare entire rows or tables for integrity.

Generate hash on source and target datasets and compare.

Sampling Validation

For large datasets, validate by random sampling or critical segments.

Query sample rows from source and target for manual comparison.

Audit Trail Verification

Check logs or audit data for records processed and any migration errors.

Review migration logs and error reports.

Performance Metrics

Measure migration speed and validate no data truncation due to performance.

Monitor ETL performance and error rates.




Data Validation Checks During Migration

(Interview Flashcard Version)

CheckWhat & WhyHow to Verify (Quick Example)
1. Row CountEnsure total rows match exactly.SELECT COUNT(*) FROM source; & target;
2. Null CheckMandatory fields must NOT be null.SELECT COUNT(*) WHERE col IS NULL;
3. Duplicate CheckNo duplicate records introduced.GROUP BY key HAVING COUNT(*) > 1
4. Data Type ValidationData types preserved or mapped right.Compare schemas before and after migration.
5. Data AccuracyData values unchanged or correctly transformed.Sample data or use checksum/hash on rows.
6. Referential IntegrityForeign keys remain intact.Left join child to parent, find orphans.
7. Date/Time ValidationDates/timestamps migrated correctly.Check min/max dates, timezone handling.
8. Encoding CheckNo text corruption (special chars).Review sample text fields for anomalies.
9. Checksum ValidationVerify row-level data integrity.MD5/SHA hash of concatenated columns.
10. Sample Spot CheckManual data spot verification.Random sample queries and manual review.

Bonus Tips for Interview:

  • Always automate these checks in your ETL pipeline.

  • Use logging and reports for audit and error tracking.

  • Perform incremental validation if data migration happens in phases.

  • Explain that data validation ensures no data loss or corruption during migration — vital for business trust and decision-making.


Sample Scripts for Key Validations

1. Row Count Validation (SQL)

-- Source
SELECT COUNT(*) AS source_count FROM source_schema.table_name; -- Target SELECT COUNT(*) AS target_count FROM target_schema.table_name;

2. Null Check on Mandatory Column (SQL)

SELECT COUNT(*) AS null_count
FROM target_schema.table_name WHERE mandatory_column IS NULL;

3. Duplicate Check (SQL)

SELECT column1, COUNT(*) AS dup_count
FROM target_schema.table_name GROUP BY column1 HAVING COUNT(*) > 1;

4. Checksum Validation (SQL)

Generate row-wise checksum/hash and compare counts of distinct hashes:

-- On source system
SELECT MD5(CONCAT_WS('|', col1, col2, col3)) AS row_hash FROM source_schema.table_name ORDER BY row_hash; -- On target system SELECT MD5(CONCAT_WS('|', col1, col2, col3)) AS row_hash FROM target_schema.table_name ORDER BY row_hash;

Compare outputs manually or export for diff tools.

5. Referential Integrity Check (SQL)

-- Find orphan records in child table
SELECT c.* FROM target_schema.child_table c LEFT JOIN target_schema.parent_table p ON c.parent_id = p.id WHERE p.id IS NULL;

6. Sample Random Data Validation (SQL)

SELECT *
FROM target_schema.table_name SAMPLE (100);

Python Script Snippet for Automated Validation (Example with Snowflake Connector)


import snowflake.connector def run_query(conn, query): cursor = conn.cursor() cursor.execute(query) result = cursor.fetchall() cursor.close() return result def validate_row_count(source_conn, target_conn, table): source_query = f"SELECT COUNT(*) FROM {table}" target_query = f"SELECT COUNT(*) FROM {table}" source_count = run_query(source_conn, source_query)[0][0] target_count = run_query(target_conn, target_query)[0][0] print(f"Row count - Source: {source_count}, Target: {target_count}") return source_count == target_count # Example usage source_conn = snowflake.connector.connect( user='source_user', password='source_password', account='source_account', warehouse='source_wh', database='source_db', schema='public' ) target_conn = snowflake.connector.connect( user='target_user', password='target_password', account='target_account', warehouse='target_wh', database='target_db', schema='public' ) table_name = 'customer' if validate_row_count(source_conn, target_conn, table_name): print("Row count validation passed.") else: print("Row count validation failed.")

No comments:

Post a Comment

Python using AI

  Python using AI - Prompts & Codes Tools useful for Python + AI ChatGPT - https://chatgpt.com/ Claude AI - https://claude.ai/new ...