Senior Site Reliability Engineer

Selector Software
Full-time
Remote
India

About Us

Selector is building an operational intelligence platform for digital infrastructure. By adopting an AI/ML-based analytics approach, the platform provides actionable multi-dimensional insights to network, cloud, and application operators. It enables operations teams to meet their KPIs through seamless collaboration, a search-driven conversational user experience, and automated data engineering pipelines.


Our solutions are used by leading Telecoms, Media Providers, Retail, and Professional Sports organizations across the world. Our novel approach and rapidly expanding footprint put us in a unique position to keep growing and become a category leader. To fuel our growth, we are seeking passionate, high-energy, results-oriented individuals to join our team.


Our mission is to deliver world-class solutions for large enterprises. Supported by leading investors, Selector is uniquely positioned to address large-enterprise requirements across the globe.


Job Overview

We are seeking a highly skilled Senior Site Reliability Engineer to join our Engineering team in India. This role is a split-duty position comprising both customer-facing responsibilities and internal platform reliability initiatives.


As a Senior SRE, you will play a critical role in deploying, maintaining, and improving the reliability and scalability of Selector’s platform across on-premises and SaaS environments. You will collaborate closely with Platform Engineering, DevOps, and customer teams to ensure seamless deployments, strong system performance, and continuous platform improvement.


Key Responsibilities

  • Serve as a senior technical expert in deploying and maintaining Selector’s operational analytics platform across on-premises and SaaS environments.
  • Lead complex customer installations, including deployments in air-gapped and highly regulated environments.
  • Partner directly with customers via Zoom/Teams to troubleshoot, triage services, and resolve installation or performance issues.
  • Author, review, and maintain Infrastructure as Code (IaC) using Terraform/OpenTofu, ensuring scalable and maintainable infrastructure design.
  • Deploy and manage containerized applications using Kubernetes (including RKE) and Kustomize in production environments.
  • Triage and resolve issues across distributed systems, Kafka pipelines, CI/CD workflows (Jenkins), and Google Cloud infrastructure.
  • Provide structured, actionable feedback to Platform Engineering and DevOps teams to improve reliability, scalability, and performance.
  • Participate in and help mature on-call processes, ensuring high availability and operational excellence.
  • Perform root cause analysis for production incidents and implement long-term corrective and preventative solutions.
  • Research, evaluate, and implement new tools or architectural improvements to address infrastructure and operational challenges.
  • Mentor junior engineers and promote SRE best practices across reliability, observability, and automation.
  • Improve internal tooling, automation, and operational workflows to enhance developer productivity and system stability.


Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
  • 7+ years of hands-on experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
  • Strong experience with Git/GitHub for version control and collaborative development workflows.
  • Deep hands-on experience managing Kubernetes clusters in production environments (RKE experience preferred).
  • Strong experience with Infrastructure as Code tools such as Terraform or OpenTofu.
  • Experience working with Google Cloud Platform (GCP) in production environments.
  • Experience with CI/CD pipelines and tooling such as Jenkins.
  • Experience working with Kafka or other distributed streaming platforms.
  • Proficiency in Python for scripting, automation, and troubleshooting.
  • Strong expertise in diagnosing and resolving issues in distributed systems.
  • Experience working directly with enterprise customers in technical, customer-facing roles.
  • Strong written and verbal communication skills with the ability to explain complex technical concepts clearly.
  • Experience working in air-gapped or secure enterprise environments is highly preferred.
  • Demonstrated ability to lead initiatives, mentor engineers, and drive reliability improvements across teams.


Benefits & Perks

  • Health Insurance (GMC): Comprehensive medical coverage for employees and dependents, including hospitalization and maternity benefits.
  • Personal Accident Insurance (GPA): Coverage for accidental injury, both on and off duty.
  • Life Insurance (Term Plan): Life insurance coverage for eligible employees.
  • Provident Fund (PF): Company contribution as per statutory requirements.
  • Gratuity: As per the Payment of Gratuity Act.
  • Paid Time Off: Sick Leave, Earned Leave, and Maternity Leave in line with company policy and applicable laws.
  • Holidays: National and regional holidays as per the annual holiday calendar.