Position Details

All Live Better,

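Olive Young, the No.1 health & beauty store, draws on more than 20 years of accumulated experience, data, and partnerships, and is now evolving beyond Korea into a lifestyle platform that leads the global market and a new kind of omnichannel business.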

Olive Young will keep changing and innovating to achieve its vision of becoming the 'Global No.1 omnichannel lifestyle platform' that curates healthy beauty for people around the world.

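We are looking for a capable person to help build an End-to-End platform that combines online services with Olive Young's unrivaled offline business and breaks down the boundary between online and offline.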

Key Responsibilities

Opportunities

• Design, build, and maintain highly available, scalable, and resilient backend infrastructure that powers critical system components.
• Partner with product managers and software engineers to ensure seamless integration of reliability and performance into core commerce functionality.
• Automate everything, from deployment pipelines and monitoring to incident response and infrastructure management.
• Implement and refine full-cycle CI/CD pipelines, ensuring rapid and stable deployments while maintaining service reliability.
• Take ownership of production systems by proactively identifying and resolving performance bottlenecks, and driving operational excellence.
• Continuously improve system observability and monitoring, leveraging metrics, logging, and tracing to enhance incident detection and resolution.
• Conduct postmortems and blameless retrospectives, applying lessons learned to prevent future incidents.
• Lead and architect scalable, self-healing systems to support multi-region, high-traffic applications.
• Mentor engineers and advocate for best practices in reliability engineering, helping shape a culture of resilience and continuous improvement.

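• Design and build highly available, scalable backend infrastructure that communicates constantly with a wide range of system components.
• Work with product managers and software engineers to deliver stable, highly reliable commerce functionality.
• Maximize the efficiency of infrastructure management through deployment and operations automation.
• Build and improve a full-cycle CI/CD environment to ensure stable service deployments.
• Maintain service reliability, resolve incidents quickly, and achieve Operational Excellence.
• Shorten incident detection and response times by improving monitoring and observability.
• When incidents occur, run productive postmortems, analyze the problem, and put long-term solutions in place.
• Design stable, scalable systems that can handle multi-region, high-volume traffic.
• Mentor engineers, spread site reliability engineering (SRE) best practices, and foster a reliability-centered culture.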

Requirements

Qualifications

• 7+ years of experience in software development, DevOps, or site reliability engineering.
• Proficiency in one or more modern programming languages (e.g., Python, Go, Java, or similar).
• Experience with cloud-native development (AWS, GCP, or Azure) and containerization technologies (Docker, Kubernetes).
• Strong understanding of modern web service architectures, distributed systems, and microservices.
• Passion for automation, observability, and performance tuning to improve reliability and scalability.
• Experience with infrastructure as code (IaC) tools such as Terraform, CloudFormation, or Helm.
• Expertise in monitoring and alerting with tools like Prometheus, Grafana, Datadog, or New Relic.
• Strong leadership in cross-functional collaboration, decision-making, and system design.

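• 7 or more years of experience in software development, DevOps, or site reliability engineering.
• Expertise in at least one modern programming language such as Python, Go, or Java.
• Experience developing and operating in cloud environments such as AWS, GCP, or Azure.
• Experience with container orchestration and cloud-native technologies such as Kubernetes and Docker.
• Deep understanding of microservices and distributed systems architecture.
• Experience improving service reliability through automation, performance optimization, and incident response.
• Experience with Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Helm.
• Experience with monitoring and logging systems such as Prometheus, Grafana, Datadog, or New Relic.
• Experience improving reliability and availability by collaborating with colleagues across different roles.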

Site Reliability Engineer (Global SRE)
