Candidate Information

Full Name
Shivdevkumar TR
Age
32
Experience
10.5
Job Type
Part-time

Contact Details

Phone
Address
Tirupur
State
TamilNadu
Country
India

About candidate

About you
Hi, I’m Shivdev, currently working as a Senior Site Reliability Engineer (SRE) Lead at Copart, a global car auction company operating across the US, Europe, and the Middle East. In my role, I’m responsible for ensuring the reliability, scalability, and performance of mission-critical services across a highly distributed environment consisting of 200+ yards and 800+ microservices. I lead a team of 5 engineers and also act as an individual contributor, with a strong focus on automation, monitoring, and incident management. My daily tasks involve driving incident resolution by collaborating with development and DevOps teams to identify root causes and implement fixes, alongside creating automation solutions to reduce manual toil. I have led automation initiatives such as network troubleshooting, VPN access automation, and service restarts using Python, Ansible, and PowerShell. I have a deep understanding of cloud technologies, especially AWS, where I have hands-on experience with services like EC2, VPC, S3, and IAM. I also support our CI/CD pipelines using Jenkins, Spinnaker, and Docker, and handle Kubernetes-based containerized workloads. Additionally, I ensure seamless observability and monitoring with tools like Prometheus, Grafana, Dynatrace, New Relic, and CloudWatch. A key part of my role involves defining and managing SLOs, SLIs, and error budgets to drive service reliability. I’ve been instrumental in automating Kubernetes upgrades, patch management, and application deployments at scale, ensuring robust operations across on-prem data centers and cloud infrastructure (AWS and GCP). I’m also responsible for providing on-call support during major releases or production incidents, ensuring minimal downtime and effective triage of critical issues.
What are you looking for in a new role?
In an SRE role, I’m looking for opportunities to apply my experience in infrastructure reliability, automation, and incident management. I want to contribute to improving system performance and ensuring availability while continuously learning new technologies. I also seek the flexibility to balance my responsibilities with skills development.
Why are you interested in working with us?
I’m interested in working with your team because of the opportunity to apply my expertise in SRE practices to a dynamic and innovative environment. Your company’s focus on scaling infrastructure and maintaining high reliability aligns with my skills in automation, monitoring, and incident response. I’m eager to contribute to ensuring seamless operations while also learning and growing alongside the team.
What has been your most challenging experience in a past role?
One of the most challenging experiences in my past role was managing a major production outage caused by a misconfigured load balancer. During a critical auction event, the configuration error led to a system-wide outage with widespread downtime across multiple services. As the on-call SRE, I was responsible for quickly identifying the root cause and restoring service. I immediately coordinated with the network and development teams to trace the issue, identified the misconfigured settings, and rolled back the changes. I then implemented automated monitoring and failover configurations to prevent similar incidents in the future. The system was restored in under 30 minutes, minimizing business impact, and we enhanced our overall incident response process. This challenge taught me the importance of proactive monitoring and detailed root cause analysis.

Cover letter

Shivdevkumar T R

Lead Site Reliability Engineer

Hyderabad | /in/shivdevkumar/ | +91 8903432228 | shivdkum@gmail.com

Dedicated and experienced Lead Site Reliability Engineer with a proven track record of optimizing system performance and ensuring high availability of mission-critical applications. Proficient in implementing and managing monitoring solutions and Site Reliability Engineering (SRE) practices. Proficient in Python programming for automation and scripting tasks.

Work History

2021-07 - Current Lead Site Reliability Engineer

Copart, Hyderabad

• Lead a team of engineers in developing and maintaining internal applications.
• Hands-on backend development work experience.
• Experience with AWS EC2, S3, ELB, Auto Scaling, and CloudFormation.
• Automated ad hoc troubleshooting steps for resolving alerts.
• Mentored junior engineers, sharing knowledge of best practices for site reliability engineering methodologies.
• Managed and optimized infrastructure on AWS Cloud and on-prem along with Kubernetes, Docker, and VMware, resulting in improved resource utilization and cost savings of 15%.
• Automated deployment and configuration processes with Terraform, Jenkins, Ansible, and Puppet, reducing manual effort and increasing efficiency.
• Ensured high availability of services by developing comprehensive disaster recovery plans and backup procedures.
• Evaluated new technologies and tools to enhance overall system performance, stability, and security.
• Handled incident bridge calls to resolve incidents, performed root cause analysis and postmortems, and proposed solutions to mitigate potential issues in the future.

2019-01 - 2021-06 Site Reliability Engineer

Copart, Dallas

• Implemented comprehensive monitoring and logging solutions with New Relic and Prometheus on both on-premises and cloud environments, enabling real-time visibility into system health.
• Conducted regular performance tuning and capacity planning to ensure system reliability and scalability.
• Demonstrated proficiency in cloud concepts and best practices, including elasticity, scalability, security, and cost optimization.

2017-07 - 2018-12 Systems Engineer Intern

Copart, Dallas

• Assisted in the design and implementation of CI/CD pipelines using Jenkins, accelerating software delivery cycles.
• Contributed to the configuration and management of Kubernetes clusters to support containerized applications.
• Supported the development and maintenance of automation scripts using Python for various infrastructure tasks.

2014-01 - 2016-07 Technical Support Engineer

Cisco, Bangalore

• Provided technical support and troubleshooting assistance to customers on networking products and solutions.
• Collaborated with cross-functional teams to resolve complex technical issues and ensure customer satisfaction.
• Developed and delivered technical training sessions for internal teams and customers.

2013-03 - 2013-12 Customer Support Engineer

Sutherland Global Services, Chennai

• Provided remote assistance to clients, ensuring timely resolution of software and hardware concerns.
• Mentored junior members of the team on best practices in issue resolution techniques.
• Served as an escalation point for challenging technical inquiries, demonstrating expertise in product knowledge and problem-solving abilities.
• Conducted root cause analysis of technical issues, implementing preventive measures for future occurrences.

Skills

DevOps Tools: Kubernetes, Jenkins, Docker, Spinnaker, GitHub, VMware

Programming: Python, Golang

Monitoring: New Relic, Grafana, Prometheus

Infra/Config: Terraform, Ansible, Puppet, REST API

Database: SQL, Postgres, MongoDB

Cloud: AWS, GCP

Certifications

2024-06

Certified Kubernetes Administrator (CKA)

2024-08

AWS Certified Solutions Architect Associate

Education

2016-2018

MS in Computer Science

University of Texas - Dallas

GPA: 3.67

2008-2012

B.Tech in Information Technology

Anna University - Chennai

GPA: 3.86