Round 1: SQL + Problem Solving
- **Find the 2nd latest transaction per user**: rank records inside each user's partition with `RANK()` or `ROW_NUMBER()`, then filter for rank 2.
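A minimal sketch of the ranking approach, run through Python's `sqlite3` (SQLite has supported window functions since 3.25); the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (user_id INTEGER, txn_id INTEGER, txn_time TEXT);
INSERT INTO transactions VALUES
  (1, 101, '2024-01-01'), (1, 102, '2024-01-05'), (1, 103, '2024-01-03'),
  (2, 201, '2024-02-01'), (2, 202, '2024-02-02');
""")

# Rank each user's transactions newest-first, then keep rank 2.
rows = conn.execute("""
SELECT user_id, txn_id FROM (
  SELECT user_id, txn_id,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY txn_time DESC) AS rn
  FROM transactions
)
WHERE rn = 2
ORDER BY user_id
""").fetchall()
print(rows)  # [(1, 103), (2, 201)]
```

`ROW_NUMBER()` is the safer default here; `RANK()` can return more than one row per user when two transactions share the same timestamp.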
- **Explain why NULLs break `NOT IN`**: if even one NULL is in the list, `NOT IN` returns no rows. Reason: NULL means unknown, so `10 != NULL` cannot evaluate to true, which makes the whole predicate unknown rather than true. Fix: use `NOT EXISTS` or filter out NULLs first.
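The NULL trap can be reproduced in a few lines with `sqlite3` (table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER);
CREATE TABLE blocked (id INTEGER);
INSERT INTO orders VALUES (10), (20);
INSERT INTO blocked VALUES (20), (NULL);   -- one NULL poisons NOT IN
""")

# NOT IN: "10 != NULL" is unknown, so the predicate can never be true.
not_in = conn.execute(
    "SELECT id FROM orders WHERE id NOT IN (SELECT id FROM blocked)"
).fetchall()
print(not_in)  # [] -- empty, even though 10 is not in the blocked list

# NOT EXISTS compares row by row and is unaffected by the NULL.
not_exists = conn.execute("""
    SELECT id FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM blocked b WHERE b.id = o.id)
""").fetchall()
print(not_exists)  # [(10,)]
```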
Round 2: Data Pipeline Design
- **Build a pipeline for 100M+ daily events.** Key points:
  - Use a distributed stack: Kafka (ingestion), Spark/Beam (processing), S3/GCS or HDFS (storage), and Snowflake/BigQuery (warehouse).
  - Partition data by date/hour and compress with Parquet/ORC.
  - Optimize for scalability and retries.
  - Schedule with Airflow or Dagster.
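The date/hour partitioning above usually means Hive-style paths in object storage. A small sketch of the layout logic (the bucket name and `events/` prefix are hypothetical):

```python
from datetime import datetime, timezone

def partition_path(bucket: str, event_time: datetime) -> str:
    """Hive-style date/hour partition prefix for an event batch,
    e.g. the target prefix for a compressed Parquet file."""
    return (f"s3://{bucket}/events/"
            f"date={event_time:%Y-%m-%d}/hour={event_time:%H}/")

ts = datetime(2024, 3, 1, 7, 30, tzinfo=timezone.utc)
print(partition_path("my-data-lake", ts))
# s3://my-data-lake/events/date=2024-03-01/hour=07/
```

Partitioning this way lets the warehouse or query engine prune whole date/hour prefixes instead of scanning all 100M+ events.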
- **What if one batch fails mid-way?**
  - Implement checkpointing.
  - Use the orchestrator's retry logic (e.g., Airflow).
  - Store raw data and reprocess from the failure point.
  - Use idempotent writes to avoid duplicates.
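"Idempotent writes" concretely means a retried batch must not create duplicates. One common pattern is an upsert keyed on the event ID, sketched here with SQLite (the table and keys are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def write_batch(conn, batch):
    """Upsert keyed on event_id: replaying the same batch after a failure
    leaves exactly one copy of each event."""
    conn.executemany(
        "INSERT INTO events VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET payload = excluded.payload",
        batch,
    )

batch = [("e1", "click"), ("e2", "view")]
write_batch(conn, batch)
write_batch(conn, batch)  # simulated retry after a mid-batch failure
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2 -- no duplicates despite the retry
```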
- **Schema drift or late-arriving data**
  - Use a schema registry and schema-evolution tooling.
  - Store raw data separately for reprocessing.
  - Handle late events with windowing and watermarks in Spark or Dataflow.
  - Add versioning and data-quality validation.
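A toy illustration of handling drift at ingest: conform each record to the expected schema, fill missing fields with NULLs, and set drifted fields aside for review (the schema and field names are hypothetical):

```python
EXPECTED = ("user_id", "event_type", "ts")

def conform(record: dict, expected=EXPECTED):
    """Fit an incoming record to the expected schema: missing fields
    become None, unexpected (drifted) fields are split out for review."""
    row = {field: record.get(field) for field in expected}
    drifted = {k: v for k, v in record.items() if k not in expected}
    return row, drifted

# Incoming record is missing event_type and carries a new "device" field.
row, drifted = conform({"user_id": 1, "ts": "2024-03-01", "device": "ios"})
print(row)      # {'user_id': 1, 'event_type': None, 'ts': '2024-03-01'}
print(drifted)  # {'device': 'ios'}
```

In production this role is played by a schema registry plus the table format's evolution rules; the point is that drifted fields are captured, not silently dropped.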
Round 3: System Design / Architecture
- **Design a scalable data lake**
  - Use cloud object storage (S3/GCS).
  - Partition data (by date/event type).
  - Use Delta Lake or Apache Iceberg for versioning and ACID transactions.
  - Add a metadata catalog (Glue, Hive Metastore).
  - Query with Spark or Presto.
  - Include monitoring, retries, and alerting.
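To see why the metadata catalog matters, here is a toy stand-in for what Glue or the Hive Metastore track: each table's schema plus its registered partitions, which is what lets query engines prune (all names here are illustrative):

```python
# Toy metadata catalog: table name -> schema + registered partitions.
catalog = {}

def register_table(name, schema, partition_keys):
    catalog[name] = {"schema": schema,
                     "partition_keys": partition_keys,
                     "partitions": []}

def add_partition(name, **keys):
    catalog[name]["partitions"].append(keys)

register_table("rides",
               schema={"ride_id": "string", "fare": "double"},
               partition_keys=["date", "event_type"])
add_partition("rides", date="2024-03-01", event_type="trip_start")
add_partition("rides", date="2024-03-01", event_type="trip_end")

# Partition pruning: a query filtered on event_type only touches one partition.
hits = [p for p in catalog["rides"]["partitions"]
        if p["event_type"] == "trip_start"]
print(hits)  # [{'date': '2024-03-01', 'event_type': 'trip_start'}]
```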
- **Track global ride data in real time**
  - Use an event-driven architecture: Kafka → stream processor (Flink/Spark Structured Streaming) → NoSQL store (Cassandra/Redis) or warehouse.
  - Design for low-latency writes and geo-partitioned storage.
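A toy version of the stream-processing step: consume ride events and maintain per-city (geo-partitioned) counts of active rides, the way keyed state in Flink or a Redis hash would. The event shape and city names are made up for illustration:

```python
from collections import defaultdict

# Per-city running state, keyed by the geo-partition key.
active_rides = defaultdict(int)

def process(event):
    city = event["city"]  # partition key: all state for a city stays together
    if event["type"] == "ride_start":
        active_rides[city] += 1
    elif event["type"] == "ride_end":
        active_rides[city] -= 1

stream = [
    {"city": "berlin", "type": "ride_start"},
    {"city": "berlin", "type": "ride_start"},
    {"city": "tokyo",  "type": "ride_start"},
    {"city": "berlin", "type": "ride_end"},
]
for event in stream:  # stand-in for a Kafka consumer loop
    process(event)
print(dict(active_rides))  # {'berlin': 1, 'tokyo': 1}
```

Keying state by city is what makes geo-partitioning work: each partition can live close to its region, keeping writes low-latency.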
- **Tradeoffs of batch vs stream**
| Criteria | Batch | Streaming |
|--------------|------------------------------------|------------------------------------|
| Latency | High (hours/minutes) | Low (seconds/milliseconds) |
| Complexity | Easier | Harder (state mgmt, ordering) |
| Cost | Lower | Higher (always-on infra) |
| Use Cases | Reporting, historical analysis | Fraud detection, real-time alerts |
Round 4: Stakeholder & Behaviorals
- **Explain a failed pipeline to a non-tech manager**
“We had an issue where part of the data wasn’t loaded due to a missed file drop from the source system. We implemented automated checks to alert us earlier and ensure retry happens immediately, so downstream reports stay accurate.”
- **A time your data was wrong**
“Once we onboarded a new source system, we misaligned date formats, causing incorrect reporting. I ran data audits, corrected the logic, and set up validation checks to prevent this in future pipelines.”
- **Can analysts trust what you build?**
“Yes — I prioritize data quality, version control, logging, and transparency. I document data sources, add unit tests, and implement alerting and data freshness checks to ensure reliability.”