Site Reliability Engineer
Coralogix
- Engineering
- Remote, Europe
- Senior
Description
Coralogix is a modern, full-stack observability platform transforming how businesses process and understand their data. Our unique architecture powers in-stream analytics without reliance on expensive indexing or hot storage. We specialize in comprehensive monitoring of logs, metrics, traces, and security events, with features such as APM, RUM, SIEM, Kubernetes monitoring, and more, all enhancing operational efficiency and reducing observability spend by up to 70%.
We are seeking a skilled Site Reliability Engineer (SRE) with a strong background in Elasticsearch/OpenSearch to join our team. The ideal candidate will manage and optimize large-scale Elasticsearch/OpenSearch clusters, ensuring the infrastructure's stability, performance, and scalability. You'll work closely with development and operations teams to build robust and efficient systems.
Key Responsibilities:
- Manage & Monitor: Oversee the performance, reliability, and availability of large-scale Elasticsearch/OpenSearch clusters.
- Optimize & Scale: Implement best practices for scaling, indexing, and querying to ensure optimal performance.
- Automate & Streamline: Develop and maintain automated performance testing, benchmarking, monitoring, and alerting for Elasticsearch/OpenSearch clusters.
- Troubleshoot & Resolve: Quickly identify and resolve issues related to cluster health, data integrity, performance bottlenecks, and search accuracy.
- Collaborate: Work closely with development, DevOps, and other teams to design and implement enhancements to cluster architecture, stability, performance, and data management flows.
Requirements
- Experience: Proven experience as an SRE or in a similar role, with specific expertise in managing Elasticsearch or OpenSearch clusters.
- Technical Skills:
  - Strong knowledge of Elasticsearch/OpenSearch architecture, including index management, sharding, and replication.
  - Experience with performance tuning, scaling, and cluster optimization.
  - Understanding of JVM concepts and ability to code in Java, Scala, Python, or Go.
  - Familiarity with monitoring tools (e.g., Prometheus, Grafana).
  - Experience with configuration management and automation tools (e.g., Ansible, Terraform, Kubernetes).
- Problem Solving: Ability to diagnose and troubleshoot complex performance and stability issues in large-scale distributed systems.
- Communication: Strong verbal and written communication skills to collaborate across teams and document processes clearly.
Preferred Skills:
- Familiarity with other distributed systems (e.g., Apache Solr, Kafka).
- Knowledge of CI/CD pipelines and experience with DevOps practices.
- Experience with cloud providers (AWS, Azure, GCP).