Sanjana Venkatesh

Experience

Data Engineering, Analytics Engineering, and Applied AI: delivering end-to-end systems with measurable business impact.

Featured

  • $10M annual infra savings (cloud modernization)
  • 150M+ records/day processed (streaming)
  • 1,500+ tables migrated with 99%+ accuracy
  • 16 hrs/day manual ops eliminated (Airflow)

Applied AI & ML Infrastructure

  • Orchestrated ML training workflows across Vertex AI and Snowflake; introduced Git-based CI/CD automation to standardize runs and reduce training time by 2 hours per cycle.
  • Operationalized reproducible experimentation by versioning data extracts, feature definitions, and training artifacts to improve traceability and reviewability in research iterations.
  • Built feature engineering pipelines (Python/SQL) and collaborated with stakeholders to translate research requirements into production-ready ML features.

Analytics & Cost Optimization

  • Developed exploratory analysis in Tableau to surface signal quality issues, guide feature prioritization, and accelerate model iteration.
  • Optimized BigQuery workloads through partitioning and clustering strategies, driving a 38% reduction in query costs while preserving analytical fidelity.

Applied AI: LLM-Enabled Analytics (Research)

  • Prototyped retrieval-augmented analysis patterns (embeddings + semantic retrieval) to enable natural-language exploration of datasets, documentation, and research outputs.
  • Established lightweight evaluation checks (e.g., relevance and consistency) to keep model-assisted outputs auditable and aligned with source data.

Faster ML cycles: 2 hours saved per training cycle via CI/CD and standardization
Lower analytics spend: 38% BigQuery cost reduction through partitioning and clustering
Vertex AI, Snowflake, BigQuery, Python, SQL, MLOps, RAG, Embeddings, LangChain, Tableau, CI/CD

Cloud Cost Optimization & Reliability

  • Managed AWS infrastructure for big data analytics; optimized EC2 sizing and scheduling, reducing operating costs by $23k annually.
  • Implemented monitoring and alerting standards for pipeline health (latency, failures, SLA breaches), improving operational reliability and time-to-detect.

Data Integration & Metadata Automation

  • Automated ingestion from ThoughtSpot APIs into Snowflake via REST integrations, reducing data latency by 42%.
  • Automated ThoughtSpot worksheet metadata synchronization using Python + Snowflake business glossary inputs (TML processing), improving governance and self-serve discoverability.
AWS, EC2, Snowflake, Python, REST APIs, ThoughtSpot, Observability

Legacy Infrastructure Modernization

  • Led the migration of a legacy Hadoop estate to Google Cloud Platform, moving 1,500+ Hadoop/MapReduce jobs to Dataproc.
  • Optimized Apache Spark parameters (executors, shuffle management, SSD balancing), delivering $10M annual infrastructure savings.

Large-Scale Data Migration & Governance

  • Directed an 8-engineer program to migrate 1,500+ tables to BigQuery with zero downtime.
  • Implemented SQL validation checks and the Data Validation Tool (DVT), achieving 99%+ accuracy and audit-ready reporting.

Pipeline Orchestration & Analytics

  • Productionized 50+ ETL/ELT pipelines using Apache Airflow (GCS → BigQuery), eliminating 16 hours/day of manual operations.
  • Authored 100+ LookML models in Looker to operationalize a governed metrics layer; dashboards enabled $1.2M annual savings.
  • Presented KPI insights and an adoption plan to client VPs, enabling cross-functional rollout of governed metrics and self-serve analytics.

Real-Time Data Processing

  • Built a Kafka + PySpark streaming pipeline processing 150M+ raw records/day from 30+ sources for real-time analytics.
GCP, Dataproc, BigQuery, Airflow, Kafka, PySpark, Looker
