Site Reliability Engineer II
Owned deployment automation, Kubernetes workload reliability, GitLab CI/CD workflows, rollback safety, and Linux-based operational automation for internal application environments. Focused on improving release consistency, reducing manual deployment effort, and strengthening production reliability.
- CI/CD automation
- Kubernetes reliability
- Release engineering
- Production support
- Linux automation
- Operational efficiency
- Supported production reliability for Kubernetes-based microservices running on Linux infrastructure, working across AKS, Helm, GitLab CI/CD, Docker, NGINX/Ingress, Azure networking, DNS, and TLS, and resolving container runtime issues.
- Optimized Kubernetes workloads by tuning Helm charts, CPU/memory requests and limits, liveness/readiness probes, HPA configuration, deployment strategies, pod scheduling, and ephemeral storage usage to improve stability and reduce infrastructure waste.
- Built and maintained GitLab CI/CD pipelines in YAML with Bash, Python, and Docker, incorporating artifact promotion, rollback stages, environment variables, secrets, and deployment approvals for repeatable application releases.
- Diagnosed production incidents involving CrashLoopBackOff, ImagePullBackOff, OOMKilled pods, failed readiness probes, storage pressure, node resource exhaustion, service discovery failures, ingress routing errors, DNS resolution failures, SSL/TLS certificate issues, and Azure infrastructure dependencies.
- Developed Python and Bash automation for log cleanup, health checks, service restarts, pod remediation, deployment validation, patching workflows, certificate checks, and operational runbooks to reduce manual toil.
- Automated Linux server configuration and environment provisioning with Ansible and shell scripting, covering systemd, cron, SSH, package management, file permissions, and configuration templates to reduce drift across environments.
- Investigated availability and latency issues using observability and troubleshooting tools including Grafana, Prometheus-style metrics, application logs, Kubernetes events, kubectl, journalctl, systemctl, curl, tcpdump, nslookup, and Linux performance commands.
- Participated in on-call incident response, triage, RCA, post-incident reviews, alert tuning, and reliability improvements for high-availability internal applications and distributed services.
- Improved deployment safety by implementing pre-deployment checks, smoke tests, rollback workflows, release gates, and pipeline validation to reduce failed deployments and production impact.
- Collaborated with development, platform, security, and cloud infrastructure teams to troubleshoot CI/CD failures, Kubernetes platform issues, networking dependencies, access/RBAC problems, and service reliability risks.
- AKS
- Kubernetes
- Helm
- GitLab CI/CD
- Docker
- NGINX / Ingress
- Azure Networking
- DNS
- TLS
- Linux
- Ansible
- systemd
- Bash
- Python
- HPA / Autoscaling
- Grafana
- Prometheus
- kubectl
- RBAC
- On-call / Incident Response