Site Reliability Engineer - Observability
CluePoints
Company Description
At CluePoints, we’re redefining how clinical trials are run. As the premier provider of Risk-Based Quality Management (RBQM) and Data Quality Oversight software, we harness advanced statistics, artificial intelligence, and machine learning to ensure the quality, accuracy, and integrity of clinical trial data, helping life sciences organizations bring safer, more effective treatments to patients faster.
We’re proud to be an ambitious, fast-growing technology scale-up with a dynamic and diverse international team representing more than 20 nationalities. Collaboration, flexibility, and continuous learning are part of our DNA.
At CluePoints, you’ll find a culture where you can grow, make an impact, and have fun along the way. Guided by our values of Care, Passion, and Smart Disruption, we’re united by a shared mission: to create smarter ways to run efficient clinical trials and deliver AI-powered insights that improve human outcomes worldwide.
Role: The Site Reliability Engineer, Observability & RUM is responsible for improving end-to-end observability across our platforms and customer-facing applications, with a particular focus on frontend and Real User Monitoring (RUM). This role combines core SRE practices with ownership of monitoring, logging, tracing, alerting, and user-experience telemetry in production.
You will help evolve our observability capabilities across Azure and Kubernetes environments, improve incident detection and diagnosis, and support decisions around managed versus self-managed observability tooling. You will partner closely with Engineering, Support, QA, and Security teams to ensure systems ship with actionable telemetry, dashboards, alerts, and operational runbooks.
- 5+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Observability Engineering roles.
- Strong hands-on experience with observability and monitoring platforms, including several of the following: Elastic, Grafana, Prometheus, OpenTelemetry, Sentry, monitoring agents, and managed APM/observability platforms.
- Experience implementing and supporting Real User Monitoring (RUM) and frontend/application observability in production environments.
- Ability to work across frontend, backend, and platform teams to improve telemetry, alerting, and incident diagnosis.
- Experience evaluating or operating managed observability platforms and understanding the trade-offs versus self-managed stacks.
- Experience supporting ML, AI, or LLM-backed services in production (RAG, LangSmith, Arize Phoenix, LangChain, LangGraph, Azure OpenAI, OpenAI, or Anthropic APIs).
- Own and improve Real User Monitoring (RUM) for customer-facing applications, including browser performance, client-side errors, user journeys, and frontend service dependencies.
- Partner with frontend, product, and engineering teams to improve visibility into user experience, JavaScript/runtime failures, page performance, and customer-impacting issues.
- Establish and maintain end-to-end observability across frontend, backend, infrastructure, and Kubernetes environments using metrics, logs, traces, dashboards, and alerting.
- Evaluate, implement, and operate managed and self-managed observability solutions, helping guide the evolution of the observability stack.
- Support and improve observability tooling such as Sentry, Elastic, Grafana, Prometheus, OpenTelemetry, monitoring agents, and related APM platforms.
- Define and maintain SLIs, SLOs, and alerting strategies that improve service reliability, reduce noise, and enable faster detection of production issues.
- Lead or support incident detection, alert triage, live production troubleshooting, and service restoration across outage, latency, batch, file transfer, and degradation scenarios, in partnership with Support and Production teams.