AI / Data Engineer

2× Google Summer of Code · GCP Professional Data Engineer · Open Source

Hi, I’m Prathamesh. I build and own end-to-end production data systems, ML systems, and robust data pipelines under real-world constraints. I bring strong open-source experience and a formal Data Science background, with an emphasis on systems that scale, fail gracefully, and get used.

Experience /

Data Engineer

Apr 2024 — Current
Gekko
  • AI for Oil Drilling: Architected a hybrid ML system from the ground up, combining classification, entity extraction, and proprietary algorithms to infer drilling operations and parameters and to reconstruct activity timelines from unstructured reports with 95%+ accuracy, saving 4 hours per well across thousands of wells.

  • Selected a combination of small LLMs (Qwen, gpt-oss, Gemini), fast fine-tuned transformer models (BERT, GLiNER), and decision trees to meet strict latency and cost constraints. Implemented context-aware corrections and domain-specific guardrails to ensure physically consistent outputs.

  • Data Mining: Engineered a pipeline to mine structured insights from PDFs using multimodal LLMs (gemini-2.5-flash) with predefined schemas, few-shot examples, and text rules; outperformed SOTA tools like Docling for our use case. Optimized with techniques such as annotation caching and input pruning.

  • Vessel & Fuel Analytics: Designed a data warehouse and a Dagster pipeline to process vessel telemetry and geospatial data; the resulting fuel reporting saves clients $1.2M annually in fuel costs. Built with PostgreSQL, Dagster, Polars, and Power BI.

  • Leadership: Promoted to technical lead, responsible for architectural decisions and team building to drive automation and platform modernization. Led the delivery of 10+ mission-critical projects with a team of 6 engineers.

Python LangChain PyTorch Pydantic FastAPI Dagster PostgreSQL Polars Power BI GCP Docker Git

Data Engineer Intern

Dec 2023 — Mar 2024
New Engen Inc.
  • Assisted with data modelling and implementing data pipelines to extract, load, and transform raw Facebook Ads data for BI and Analytics teams.

  • Migrated complex pipelines from Salesforce Datorama and Adverity to a custom in-house solution using GCP BigQuery, dbt, Apache Airflow, and Python.

Python dbt Google Cloud Platform BigQuery Airflow Adverity Salesforce Intelligence (Datorama) Git

Contributor (Data Engineer)

May 2023 — Nov 2023
Google Summer of Code 2023 @ MetaBrainz
  • Assembled an ETL pipeline from Wikidata to the MusicBrainz database, facilitating a 60% increase in new location data and cutting manual data entry by 90% [Details].

  • Independently designed and developed a scalable, production-ready solution using Python (pandas, multiprocessing, requests, sqlalchemy), SQL (PostgreSQL), and Docker [Architecture, Code].

  • Conducted research and experimentation, optimizing SPARQL queries for Wikidata’s graph data structure to improve data quality and extraction efficiency [Details].

Python PostgreSQL SPARQL Docker Pandas Data Engineering Data Analytics

Contributor (Data Engineer)

May 2022 — Oct 2022
Google Summer of Code 2022 @ MetaBrainz
  • Enriched, cleaned, and combined 27 billion rows of music streaming data using Python (pandas, multiprocessing), SQL (PostgreSQL), and Apache Arrow, achieving high efficiency in Python without Spark. [Details]

  • Researched and implemented technologies like Zstandard and Apache Arrow to optimize data lake efficiency, resulting in a 53% reduction in storage and a 9% improvement in read/write speeds. [Details]

  • Performed data analytics and published benchmarks, dashboards, and reports to help collaborating teams better understand and utilize the data to train state-of-the-art Music Recommendation Systems. [Project Summary]

Python PostgreSQL Apache Arrow Pandas Data Engineering Data Analytics

Technical Stack /

Applied AI

Generative AI, NLP, RAG & Vector Search, LLMs & Transformer Models, Fine-Tuning, Self-Hosting, LangChain / Pydantic AI, FAISS / Milvus, PEFT / LoRA, n8n, PyTorch, OpenAI API, Prompt Engineering, Context Window Management

Data Engineering

Data Warehousing, ETL/ELT Processes, Data Integration, PySpark, Pandas, Polars, BigQuery, PostgreSQL, Dagster, Apache Airflow, dbt, Apache Arrow

Languages / Tools

Python, SQL, Bash, FastAPI, HTML, CSS, Power BI, Plotly Dash, Streamlit

Cloud / DevOps

Git, Linux, Docker, CI/CD, GitHub Actions, Google Cloud (BigQuery, Vertex AI, Composer, Dataproc, IAM), Azure (App Service, Virtual Machines), AWS (EC2, S3)

Achievements

  • 2× Google Summer of Code

    Selected twice (Top 2% of 43K+ applicants) for Google Summer of Code 2022 & 2023.

  • IEEE Leadership

    Elected President (Students’ Association) and Vice President (IEEE Student Chapter); represented the South East Asia Cluster at IEEE Asia Pacific’s CLAP (2021).

  • Speaking & Writing

    Invited speaker at IIT Madras and other institutions (1,000+ students). Wrote a blog with 35k+ LinkedIn impressions and 3.7k+ views.

Education

BTech. Artificial Intelligence

2020 — 2024

G.H. Raisoni College of Engineering & Management, Pune

  • CGPA: 8.88

BS. Data Science and Applications

2021 — 2025

Indian Institute of Technology, Madras

  • Dropped out to pursue full-time opportunities in Engineering.
  • CGPA: 8.24

Projects /

Blogs /