Friday, June 20, 2025

Data Engineering Interview Rounds

 

Round 1: SQL + Problem Solving

  1. Find 2nd latest transaction per user


SELECT user_id, transaction_date
FROM (
    SELECT user_id, transaction_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date DESC) AS rn
    FROM transactions
) t
WHERE rn = 2;
  2. Rank records inside partitions

Use RANK() or ROW_NUMBER():

SELECT user_id, transaction_amount,
RANK() OVER (PARTITION BY user_id ORDER BY transaction_amount DESC) AS rnk FROM transactions;
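The difference between the two functions matters when amounts tie: RANK() gives tied rows the same rank and skips the next value, while ROW_NUMBER() always numbers rows consecutively. A minimal sketch using Python's built-in sqlite3 (the table and values are illustrative, not from a real dataset):

```python
import sqlite3

# In-memory demo of RANK() vs ROW_NUMBER() on tied amounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER, transaction_amount INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 100), (1, 100), (1, 50)],  # two tied top amounts for user 1
)
rows = conn.execute("""
    SELECT user_id, transaction_amount,
           RANK()       OVER (PARTITION BY user_id ORDER BY transaction_amount DESC) AS rnk,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_amount DESC) AS rn
    FROM transactions
    ORDER BY user_id, rn
""").fetchall()
for r in rows:
    print(r)
# (1, 100, 1, 1)
# (1, 100, 1, 2)
# (1, 50, 3, 3)
```

Note that the tied rows both get RANK 1 and the next rank jumps to 3, while ROW_NUMBER assigns 1, 2, 3 regardless of ties.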
  3. Explain why NULLs break NOT IN

If even one NULL is in the list, NOT IN returns no results:

SELECT 1 WHERE 10 NOT IN (1, 2, NULL); -- returns nothing

Reason: NULL means unknown, so 10 != NULL evaluates to UNKNOWN rather than TRUE, and the WHERE clause only keeps rows where the condition is TRUE.

Fix: Use NOT EXISTS or filter NULLs first.
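Both the failure and the fix can be reproduced with Python's built-in sqlite3 (the two tables here are illustrative stand-ins):

```python
import sqlite3

# Demonstrates why NOT IN yields no rows when the subquery returns a NULL,
# and why NOT EXISTS behaves as expected.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.execute("INSERT INTO a VALUES (10)")
conn.executemany("INSERT INTO b VALUES (?)", [(1,), (2,), (None,)])

# NOT IN: 10 != NULL is UNKNOWN, so the whole predicate is never TRUE.
not_in = conn.execute("SELECT x FROM a WHERE x NOT IN (SELECT x FROM b)").fetchall()
print(not_in)  # []

# NOT EXISTS compares row by row and is unaffected by the NULL row.
not_exists = conn.execute(
    "SELECT x FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x = a.x)"
).fetchall()
print(not_exists)  # [(10,)]
```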


Round 2: Data Pipeline Design

  1. Build a pipeline for 100M+ daily events

Key Points:

  • Use a distributed system: Kafka (ingestion), Spark/Beam (processing), S3/GCS or HDFS (storage), and Snowflake/BigQuery (warehouse).

  • Partition data by date/hour, compress with Parquet/ORC.

  • Optimize for scalability and retries.

  • Schedule with Airflow or Dagster.
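The partition-by-date/hour bullet above usually comes down to how object-store keys are built. A minimal sketch, assuming a Hive-style `dt=`/`hour=` layout and an `events` bucket (both illustrative, not a fixed convention):

```python
from datetime import datetime, timezone

def partition_path(event_time: datetime, bucket: str = "events") -> str:
    """Build a Hive-style dt=/hour= partition key for an event's Parquet file.

    The bucket name and file-naming scheme are illustrative assumptions.
    """
    return (
        f"s3://{bucket}/"
        f"dt={event_time:%Y-%m-%d}/hour={event_time:%H}/"
        f"part-{event_time:%Y%m%d%H}.parquet"
    )

ts = datetime(2025, 6, 20, 13, 5, tzinfo=timezone.utc)
print(partition_path(ts))
# s3://events/dt=2025-06-20/hour=13/part-2025062013.parquet
```

Partitioning this way lets the warehouse prune whole date/hour directories instead of scanning all 100M+ events.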

  2. What if one batch fails mid-way?

  • Implement checkpointing.

  • Use retry logic in orchestrator (Airflow).

  • Store raw data and reprocess from failure point.

  • Use idempotent writes to avoid duplicates.
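The last two bullets work together: if each batch is written under a stable batch ID, a replay after a mid-batch failure becomes a no-op instead of producing duplicates. A minimal sketch, where the dict stands in for a real sink and the function names are illustrative assumptions:

```python
import time

def idempotent_write(store: dict, batch_id: str, records: list,
                     max_retries: int = 3) -> bool:
    """Write a batch keyed by batch_id; re-running the same batch is a no-op.

    `store` stands in for a real sink (warehouse table, object store);
    the retry loop and naming are illustrative, not a specific API.
    """
    if batch_id in store:          # checkpoint: this batch already landed
        return False               # replay after a failure skips it safely
    for attempt in range(max_retries):
        try:
            store[batch_id] = list(records)
            return True
        except OSError:                      # transient sink error
            time.sleep(2 ** attempt)         # exponential backoff
    raise RuntimeError(f"batch {batch_id} failed after {max_retries} retries")

sink = {}
idempotent_write(sink, "2025-06-20-batch-07", [1, 2, 3])  # first run writes
idempotent_write(sink, "2025-06-20-batch-07", [1, 2, 3])  # replay is a no-op
print(sink)  # {'2025-06-20-batch-07': [1, 2, 3]}
```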

  3. Schema drift or late-arriving data

  • Use schema registry and schema evolution tools.

  • Store raw data separately for reprocessing.

  • Use late windowing in Spark or Dataflow.

  • Add versioning and data quality validation.
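A simple form of the validation bullet is to diff each incoming record against the expected schema before loading. This sketch assumes a plain name-to-type dict rather than a real schema registry:

```python
def detect_drift(expected: dict, record: dict) -> dict:
    """Compare one record against an expected schema (field name -> type name).

    Returns new fields, missing fields, and type mismatches. The dict-based
    schema is an illustrative assumption, not a registry API.
    """
    new_fields = {k: type(v).__name__ for k, v in record.items() if k not in expected}
    missing = [k for k in expected if k not in record]
    mismatched = {
        k: (expected[k], type(record[k]).__name__)
        for k in expected
        if k in record and type(record[k]).__name__ != expected[k]
    }
    return {"new": new_fields, "missing": missing, "mismatched": mismatched}

schema = {"user_id": "int", "amount": "float"}
drift = detect_drift(schema, {"user_id": 7, "amount": "12.5", "coupon": "ABC"})
print(drift)
# {'new': {'coupon': 'str'}, 'missing': [], 'mismatched': {'amount': ('float', 'str')}}
```

Drifted records can then be routed to a quarantine path for reprocessing instead of silently corrupting the table.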


Round 3: System Design / Architecture

  1. Design a scalable data lake

  • Use cloud storage (S3/GCS).

  • Partition data (by date/event type).

  • Use Delta Lake or Apache Iceberg for versioning, ACID.

  • Add metadata catalog (Glue, Hive Metastore).

  • Use Spark or Presto for querying.

  • Include monitoring, retries, alerting.

  2. Track global ride data in real time

  • Use event-driven architecture:

    • Kafka → Stream processor (Flink/Spark Structured Streaming) → NoSQL DB (Cassandra/Redis) or warehouse.

  • Low-latency writes and geo-partitioned storage.
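Geo-partitioning means the storage key includes a spatial bucket so nearby rides land on the same partition. A real system would use geohash, S2, or H3 cells; this fixed-size lat/lon grid is a simplified illustrative stand-in:

```python
import math

def geo_partition_key(lat: float, lon: float, cell_deg: float = 1.0) -> str:
    """Bucket a coordinate into a coarse grid cell for geo-partitioned storage.

    The 1-degree grid is an illustrative assumption; production systems
    typically use geohash/S2/H3 instead.
    """
    cell_lat = math.floor(lat / cell_deg)
    cell_lon = math.floor(lon / cell_deg)
    return f"cell_{cell_lat}_{cell_lon}"

print(geo_partition_key(40.7128, -74.0060))  # New York City -> cell_40_-75
```

All rides in the same cell share a partition key, so per-city aggregations read from one shard instead of scanning globally.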

  3. Tradeoffs of batch vs stream
    | Criteria   | Batch                          | Streaming                         |
    |------------|--------------------------------|-----------------------------------|
    | Latency    | High (minutes/hours)           | Low (milliseconds/seconds)        |
    | Complexity | Easier                         | Harder (state mgmt, ordering)     |
    | Cost       | Lower                          | Higher (always-on infra)          |
    | Use Cases  | Reporting, historical analysis | Fraud detection, real-time alerts |


Round 4: Stakeholder & Behaviorals

  1. Explain a failed pipeline to a non-tech manager

“We had an issue where part of the data wasn’t loaded due to a missed file drop from the source system. We implemented automated checks to alert us earlier and ensure retry happens immediately, so downstream reports stay accurate.”

  2. Time your data was wrong

“Once we onboarded a new source system, we misaligned date formats, causing incorrect reporting. I ran data audits, corrected the logic, and set up validation checks to prevent this in future pipelines.”

  3. Can analysts trust what you build?

“Yes — I prioritize data quality, version control, logging, and transparency. I document data sources, add unit tests, and implement alerting and data freshness checks to ensure reliability.”

