Lead Observability Engineer

Gravity IT Resources
Observability also helps teams improve their understanding of how customers use Sherwin digital products. Product teams use that awareness to influence future development. The Lead Observability Engineer contributes to the organization's overall strategic vision for Observability capabilities, processes, patterns, and tooling. This role leads efforts, working closely with the Product and Infrastructure teams, to ensure that all telemetry from applications, business events, appliances, and infrastructure is accurately received, tagged, and reported. The role also leads efforts to maintain the observability platform and ensure it is optimized and operating within its SLAs and SLOs. The engineer acts as the SME in Observability practices for the enterprise, providing services and solutions that enable businesses to achieve and sustain a higher SLA by improving software quality, reducing problem-determination time and downtime, and enhancing the overall end-user experience.
Essential Functions
Strategy & Planning
- Socializing the Observability capabilities, processes, and technology with the various application groups.
- Working with various product and business groups to help determine SLIs, SLOs and SLAs for products, applications, and services offered to the customer. Establishing strategies, processes, and tooling to adhere to the SLAs.
- Lead efforts to provide self-service capabilities to analyze and visualize Observability data, providing end-to-end visibility into product and application performance (this will include dashboards, alerting, automated incident response capabilities, etc.).
- Providing a strategic roadmap for Observability maturity at Sherwin, including recommendations on tooling and capabilities to support the ever-growing needs of the enterprise and new products.
- Create, support, and sustain methods and procedures to measure outcomes of Observability practices.
- Enable developers to use the tools to identify symptoms and diagnose application issues by providing them with the requisite access levels and training.
- Develop and document Observability standards, procedures, and best practices for using the tool, and provide education in the tool's use.
- Clearly communicate to IT and business stakeholders regarding performance-related recommendations and tradeoffs.
- Partner with the QA team to create and refine effective performance test objectives, test plans, and scenarios that help the organization achieve quality requirements for applications.
Acquisition & Deployment
- Work with the business to provide guidance for developing KPIs in support of strategic business initiatives.
- Establish measurements for KPIs and related business transactions of interest, and develop the executive dashboards required to observe application behavior, user behavior, and user interaction for business-critical functions.
- Work with development and architecture teams to manage Observability data collection, analysis, and visualization for critical applications through the lifecycle of the application.
- Working on continuous improvements of Observability capabilities, providing technical guidance to development teams, and aiding in triaging production problems.
- Independently use Observability tools to detect, isolate, and resolve issues affecting the user experience and user interaction with the applications.
- Assist in major application and/or security incident troubleshooting.
- Contribute to aspects of the solution delivery lifecycle in prototyping, capacity modeling, performance driven design, profiling, performance testing, availability management, and troubleshooting.
- Guide the operations and support teams on building and refining application behavior data capture and reporting for Production systems, and the corresponding processes.
Operational Management
- Provide and design cross-team training opportunities.
- Improve the knowledge and skills of the Enterprise DevOps team so that its members become more competent and able to accept greater responsibilities.
- Install and configure software products. Ensure compatibility between target product, operating system, and other resident software. Apply maintenance according to best practices.
- Lead in capacity planning and performance management activities.
- Contribute to the development of service level goals and objectives.
- Develop and prepare metrics that measure services rendered.
- Identify opportunities to improve service levels and/or minimize support efforts.
- Perform standard configuration, management, and maintenance tasks in support of web resources.
- Mentor and/or provide guidance to all members of the team.
Incidental Functions
- Participate in disaster planning/mitigation/recovery.
- Conduct product Proofs of Concept.
- Assist with other projects as may be required to contribute to the efficiency and effectiveness of the group and other business/technical entities.
- Assist and participate with Change Management preparations and implementations, providing technical subject matter expertise.
- Attend, and periodically lead, team meetings.
- Participate in hiring activities, fulfill affirmative action obligations, and ensure compliance with the equal employment opportunity policy.
- Provide periodic 24/7 on-call support of specific functions.
- Minimal travel as required.
- Work outside the standard 7.5-hour office workday may be required.
Position Requirements
Formal Education & Certification
- Bachelor’s degree (or foreign equivalent) in a Computer Science, Computer Engineering, or Information Technology field of study (e.g., Information Technology, Electronics and Instrumentation Engineering, Computer Systems Management, Mathematics) or equivalent experience.
Knowledge & Experience
- 8+ years IT experience.
- 5+ years of hands-on development experience with object-oriented programming (such as Java, C++, etc.).
- 3+ years of experience in Observability or Application monitoring and Log aggregation.
- Skilled in architecting, installing, configuring, and using Monitoring and Log aggregation tools.
- Demonstrated knowledge of and experience in implementing the OpenTelemetry framework.
- A solid understanding of many different types of application infrastructure in both UNIX and Windows environments.
- Experience working with multiple monitoring tools.
- Experience setting up alerting thresholds for application performance settings.
- Experience with various software development methodologies such as waterfall, agile, Scrum, and Kanban.
- Experience working with development groups and application architects.
- Proven track record of operating in multi-stakeholder environments and successfully handling delivery across multiple locations.