Nikhil Gupta

Resume

Results-driven SRE Manager with 16 years of experience in leading high-performance teams and ensuring the reliability, scalability, and performance of mission-critical systems. Adept at implementing best practices in incident management, infrastructure design, and automation to optimize system reliability and uptime. Proven track record of successfully driving projects, managing stakeholders, and fostering a culture of collaboration and innovation.

Summary

Experienced and hands-on SRE/DevOps Engineering Leader with a proven track record in managing complex microservices environments serving millions of users. Expertise in centralizing tooling, platform development and GenAI/Agentic AI to enhance developers' productivity, improve product quality, and ensure scalability and resiliency. Strong advocate for DevOps practices, zero-touch CI/CD pipelines, continuous deployment, and modernization of tech stacks. Skilled in defining SRE best practices, implementing monitoring and alerting systems, and driving incident management and root cause analysis. Collaborative leader, adept at managing teams, gathering requirements, and driving automation initiatives to optimize operational efficiency. I focus on platform thinking, observability, DR and automation, and GenAI for SRE, while maintaining consistently high AES (engagement) scores year after year.

  • Whitefield, Bangalore, India
  • (0091) 7829-4537-68
  • nikhil.vinod.gupta@gmail.com

Education

Bachelor of Engineering

2005 - 2009

SGSITS, Indore (M.P.), India (No. 1 state college in M.P.)

Achieved the 70th rank in the state-level engineering entrance exam, securing admission to the premier engineering college in Madhya Pradesh.

Higher Secondary (12th Standard)

2003 - 2004

SVHSS, Ashoknagar (M.P.)

Completed 12th grade in 2004 with 84%, ranking 1st in the school, with Physics, Chemistry, and Mathematics as major subjects.

Professional Experience

Senior Engineering Manager (SRE/DevOps)

Dec-2021 - Present

Walmart Global Tech, Bangalore, India

Staff Software Engineer/Architect (SRE/DevOps)

Jan-2020 - Dec-2021

Walmart Global Tech, Bangalore, India

Senior Software Engineer (SRE/DevOps)

Oct-2018 - Jan-2020

Walmart Global Tech, Bangalore, India

Senior DevOps Engineer

Jan-2017 - Oct-2018

Intuit, Bangalore, India

Senior DevOps & Infrastructure Engineer

Jun-2016 - Dec-2016

DataTorrent, Bangalore, India

Lead DevOps Engineer

Sep-2014 - Jun-2016

Amadeus, Bangalore, India

Senior Build & Release Engineer

May-2014 - Sep-2014

Altisource, Bangalore, India

Senior Software Engineer (Build & Release)

Dec-2009 - May-2014

Accenture, Bangalore, India

About

Experienced and hands-on SRE/DevOps Engineering Leader with a proven track record in managing large-scale systems, driving platform initiatives, and leading high-performing teams across multiple domains including Distribution Centres, Walmart Stores & Associates, and Catalog (Online & In-Store). Passionate about creating reusable SRE platforms, improving observability, and driving automation initiatives to optimize operational efficiency.

Experience Summary:

  • Led a team of SREs in a complex microservices environment serving millions of users.
  • Developed and implemented centralized tooling and platform solutions, enhancing developers' productivity and enabling focus on feature development, quality improvement, and scalability.
  • Defined and enforced SRE and DevOps best practices across teams, resulting in increased service quality and uptime.
  • Established and optimized zero-touch CI/CD pipelines with integrated code quality checks, unit tests, integration tests, end-to-end tests, performance tests, approval processes, and logging.
  • Advocated for continuous deployment and delivery, working with developers to achieve higher deployment frequency.
  • Designed and implemented robust alerting, monitoring, and logging practices, tools, and dashboards.
  • Orchestrated tech modernization efforts, successfully migrating from legacy to new tech stacks.
  • Developed deployment architectures to ensure efficient and scalable service deployments.
  • Implemented SLIs, SLOs, and SLAs for microservices, creating SLO dashboards and proactive alerting systems.
  • Designed and integrated dashboards for CI/CD, service availability, infrastructure availability, cost analysis, and network latency into a centralized SRE dashboard.
  • Implemented HA/DR strategies and automation to achieve defined business goals.
  • Introduced branching strategies to increase deployment frequency while maintaining high-quality standards.
  • Led incident response efforts, conducting thorough RCAs and implementing improvements to minimize mean time to resolution (MTTR).
  • Provided cost optimization suggestions without compromising quality, availability, or resiliency.
  • Implemented infrastructure and configuration as code principles to ensure consistency and reproducibility.
  • Facilitated sprint planning, retrospectives, and Jira dashboard management for the SRE team.
  • Collaborated with Dev Managers and Architects to gather requirements, provide feedback, and address pain points.
  • Managed 24x7 SRE on-call rotations and ensured timely incident response and resolution.
  • Proactively identified and automated repetitive tasks to improve operational efficiency.
  • Led migration initiatives from on-premises infrastructure to cloud platforms.
  • Improved SDLC for legacy products, implementing modern DevOps practices.
  • Organized and conducted DevOps and SRE tech sessions to foster knowledge sharing and skill development.
  • Established and led the DevOps and SRE forum, facilitating collaboration and the sharing of tools, best practices, and pain points across multiple teams.

Recent Key Platforms & Projects

Selected recent platforms and frameworks I have driven to reduce toil, improve reliability, and enable teams to move faster at Walmart scale.

AlertBeacon

No-code alert lifecycle platform used as a central entry point for creating, discovering, validating, and analyzing alerts across systems.

  • No-code alert creation for stacks such as K8S, Azure SQL, Kafka, Cassandra, and Cosmos.
  • Built-in checks for broken, noisy, or duplicate alerts using incident/notification data.
  • Designed to be reusable across teams without extra onboarding.

DB Inspector

Azure SQL SRE assistant that surfaces bottlenecks and suggests remediation for developers and DBAs to reduce MTTD & MTTR.

  • Shows CPU, blocking sessions, and heavy queries via UI and APIs.
  • Opinionated recommendations for indexing, query patterns, and connection behaviour.
  • Integrates with existing monitoring so teams do not need a new tool stack.

DR Playbook Automation

Framework that generates and maintains DR / failover playbooks for hundreds of applications with one click.

  • Pulls metadata from services, clusters, and databases to auto-build playbooks.
  • Keeps documentation in sync with infra changes and decommissions.
  • Makes DR a living asset instead of a one-time exercise.

One Stop Monitoring

Unified observability platform to monitor health across K8S, Kafka, Cassandra, Cosmos, Azure SQL, AlloyDB, and more.

  • Generic, reusable dashboard for the whole of Walmart, so that multiple teams do not have to build their own dashboards.
  • Reduces time to understand impact during incidents and peak events.
  • Designed so that no onboarding is needed; all new workloads are onboarded automatically.

GenAI Deliverables

Recent work where the team upskilled and delivered AI-assisted tooling on top of SRE / DevOps platforms.

Conversational Chatbots & GenAI Agents

Guided the team to design and implement conversational AI chatbots and GenAI agents that sit on top of SRE data and tools.

  • Use cases include answering platform “how-to” questions, reasoning about alerts, and recommending fixes.
  • Helped engineers move from basic scripts to agentic flows with tools, memory, and context.
  • Enabled multiple team members to upskill in AI while delivering tangible business value.

Awards & Recognitions

Appreciations received for delivering at scale and building strong, engaged teams.

  • Bravo Award – For going above and beyond in delivering critical SRE/DevOps initiatives and cross-team programs.
  • Team Award – For leading high-performing teams that executed complex projects at Walmart scale.
  • Excellence Award – For sustained excellence in reliability, platform thinking, and stakeholder partnership.
  • Multiple written appreciations from leaders and stakeholders for observability upgrades, DR readiness, and automation frameworks.

Facts

The numbers below summarize my career.

16

Years of Experience

4

Certifications

6

Organisations

7

Awards

Awards & Certifications

Below is the list of awards and certifications.

Excellence Award (Walmart)

Team Award (Walmart)

Bravo Award (Walmart)

Spot Award (Walmart)

ATLAS Conversion Award (Walmart)

Star of Quarter Award (Intuit)

Spot Award (Amadeus)

CKA (Certified Kubernetes Administrator)

Introduction to GenAI

Python 3

Responsible AI

Liquibase

Python Basics

Tech Skills

I have worked with many tools and technologies and am always open to learning a new tool or technology as requirements demand. My current tech skill set is listed below.

Scripting/Programming (Python, Shell)
GenAI (LLM, Agents, Tools, RAG, ChatBots)
Web (Flask, HTML, CSS, JS)
CI/CD (Jenkins, Bamboo)
Config Management (Ansible, Chef)
Cloud (Azure, AWS)
APM (Dynatrace)
Continuous Monitoring & Alerting (Prometheus, Grafana)
Messaging (Kafka)
Containers and Orchestration (Docker, Kubernetes, Docker Swarm)
Source Code Management (GitHub, BitBucket, Perforce, SVN)
Slack Chatbots
OS (Linux mainly, Windows)
Build Tools (Maven, Ant, Gradle)
Databases (DB2 LUW, Azure SQL, Cassandra, Cosmos, AlloyDB)
Logging (Splunk, OpenObserve)
Service Mesh (Istio)
HA/DR Strategy and Automation

Management Skills

Effective leadership skills are crucial for an SRE (Site Reliability Engineering) Manager to lead a team successfully. I have developed the following leadership skills during my tenure as SRE Manager:

Coaching and Mentoring
Team Building
Stakeholder Management
Conflict Resolution
Decision Making
Cross-Functional Collaboration
Strategic Thinking
Prioritization

Testimonials

The testimonials below are from my LinkedIn profile.

Contact

You can reach out to me:

LinkedIn:

nikhil-vinod-gupta

Call:

+91 78 2945 3768