Sanjana Venkatesh

Selected Work

Projects

A selection of data engineering, analytics, and machine learning systems, ranging from large-scale ETL/ELT and real-time streaming to retrieval-augmented generation and predictive maintenance.


Anime Recommendation System (RAG)

Aug 2025 – Sep 2025

Built a natural-language recommendation engine using a production-ready Retrieval-Augmented Generation pipeline.

  • Orchestrated RAG with LangChain; embedded titles, genres, synopses, and metadata using OpenAI embeddings.
  • Implemented LanceDB vector store for fast similarity search and top-N retrieval; optimized indexing for low-latency queries.
  • Generated explainable recommendations with rationale and confidence ranking.
LangChain, OpenAI Embeddings, LanceDB, Python
Engagement: +25%
Latency: < 500 ms
Experimentation: 10× faster
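
The top-N retrieval step in this project can be illustrated with a minimal sketch. It is not the project's code: the real system uses OpenAI embeddings and a LanceDB index, while this example assumes tiny hand-written 3-dimensional vectors, a plain dictionary as the "store", and hypothetical helper names (`cosine_similarity`, `top_n`) to show only the ranking idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_n(query_vec, catalog, n=3):
    """Return the n (title, score) pairs most similar to query_vec."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in catalog.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical toy "embeddings"; real embeddings have hundreds of dimensions.
catalog = {
    "Title A": [0.9, 0.1, 0.0],
    "Title B": [0.0, 1.0, 0.0],
    "Title C": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

results = top_n(query, catalog, n=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

A brute-force scan like this is linear in catalog size; a vector store with an approximate nearest-neighbor index is what makes the same lookup practical at low latency on a large catalog.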

Retail Data Analytics

Mar 2024 – Jul 2024

Designed an end-to-end ETL pipeline and analytics layer to meet daily refresh SLAs reliably and support self-serve reporting.

  • Automated ingestion from Google Cloud Storage to BigQuery using Apache Airflow and Python.
  • Built dbt models with star schema design, data quality tests, and query optimization (partitioning, clustering).
  • Delivered Power BI dashboards with DAX calculations for product and market analysis.
Airflow, BigQuery, dbt, Power BI, SQL
Latency: 2 h → 12 min
Reduction: 83%
Manual reporting: -48%

Intelligent Fab Monitoring & Maintenance System

Jan 2024 – Mar 2024

Real-time geospatial analytics platform for semiconductor fab monitoring (1st Place capstone).

  • Tracked 2M+ RTLS devices; processed 100K+ logs/day with cleaning, geofencing, feature engineering, and time-series decomposition.
  • Trained XGBoost Remaining Useful Life models achieving 82% accuracy; delivered event-driven alerts and KPI dashboards.
  • Reduced downtime by replacing manual monitoring with real-time analytics.
GCP, BigQuery, XGBoost, Geospatial, Flask, React
Accuracy: 82%
Devices: 2M+
Logs/day: 100K+
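
The geofencing step mentioned above can be sketched with a standard ray-casting point-in-polygon test. This is an illustration under stated assumptions, not the project's implementation: the zone coordinates, device IDs, and the helper names `point_in_polygon` and `devices_outside_zone` are all hypothetical, and coordinates are treated as flat x/y positions on a floor plan.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a rightward ray crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def devices_outside_zone(readings, zone):
    """Return IDs of devices whose latest reading falls outside the zone."""
    return [
        device_id
        for device_id, (x, y) in readings.items()
        if not point_in_polygon(x, y, zone)
    ]


# Hypothetical rectangular zone and device positions (floor-plan metres).
zone = [(0, 0), (10, 0), (10, 5), (0, 5)]
readings = {
    "tag-001": (3, 2),   # inside the zone
    "tag-002": (12, 2),  # outside the zone
}

violations = devices_outside_zone(readings, zone)
print(violations)
```

Points that lie exactly on an edge are a known ambiguity of ray casting; a production system would need an explicit rule for them.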

Airbnb Occupancy & Revenue Analytics

Academic Project

PySpark analytics and predictive modeling for Los Angeles County listings with external feature enrichment.

  • Built distributed pipelines with PySpark and Spark SQL; integrated crime and transit accessibility datasets.
  • Achieved 76% accuracy in price forecasting through feature engineering and model optimization.
PySpark, Spark SQL, Python, Statistical Analysis
Forecast accuracy: 76%
Region: LA County

Covid-19 Violation Monitor

Academic Project

Computer vision system detecting mask compliance in surveillance videos for real-time violation monitoring.

  • Used RetinaFace for face detection and Xception classifier for mask vs. no-mask inference.
  • Processed video streams and surfaced violations for rapid response workflows.
Python, OpenCV, RetinaFace, Xception

Covid-19 Chatbot

Academic Project

Flask-based NLP chatbot for Covid-19 queries with deep learning intent classification.

  • Implemented Keras Sequential model for intent classification with NLTK preprocessing.
  • Reached 76% accuracy in question answering on benchmarked query sets.
Flask, Keras, NLTK, NLP
Accuracy: 76%

Customer Market Basket Analysis

Academic Project

Association rule mining for transaction datasets and personalized recommendations based on frequent itemsets.

  • Applied Apriori and FP-Growth to discover item associations and frequent patterns.
  • Built recommendation utilities leveraging discovered rules and customer behavior.
Python, Apriori, FP-Growth, Jupyter
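
As a rough illustration of the Apriori idea behind this project, the sketch below mines frequent itemsets level by level and computes the confidence of a rule. It is a teaching-sized version under stated assumptions: the transactions are invented, the function names `frequent_itemsets` and `rule_confidence` are hypothetical, and no attempt is made at the optimizations a library implementation or FP-Growth would provide.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: map each frequent itemset to its support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # Support = fraction of transactions containing the candidate.
        level = {}
        for candidate in current:
            count = sum(1 for t in transactions if candidate <= t)
            if count / n >= min_support:
                level[candidate] = count / n
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into size-k candidates, then prune
        # any candidate that has an infrequent (k-1)-subset.
        joined = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def rule_confidence(frequent, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical transactions for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
print(rule_confidence(frequent, {"butter"}, {"bread"}))
```

The pruning step is what distinguishes Apriori from brute-force enumeration: a size-k candidate is only counted if every one of its size-(k-1) subsets was already found frequent.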

Want to see the code?

Browse the repositories and featured work on GitHub.