To Apply for this Job Click Here
Site Reliability Engineering Architect (Remote EST)
Location: Charlotte, NC (Remote, EST preferred)
Type: 12-month contract (with potential to extend)
Position Overview
The Site Reliability Engineering (SRE) Architect is a senior technical leader responsible for designing and evolving automation-first, AI-augmented reliability platforms for large-scale cloud and hybrid environments. This role defines how systems detect, decide, and act with minimal human intervention, setting the technical direction, standards, and guardrails that reduce toil while improving resilience and delivery velocity. An automation-first mindset is required. Observability alone is not sufficient: signals must drive automated or AI-assisted action.
Core Responsibilities
Reliability Architecture & Operational Design
- Define reference architectures that prioritize automated and AI-assisted fault isolation, graceful degradation, and recovery.
- Embed reliability, security, and governance into operational workflows and platforms, reducing complexity, human dependency, and operational risk.
- Establish standards so that every operational signal has a defined automated or AI-assisted response path (not just a dashboard alert).
Automation Platforms & Workflow Engineering
- Architect event-driven automation spanning detection, decisioning, and execution (e.g., health checks, enrichment, and safe remediation).
- Replace ticket-driven and manual runbooks with executable, testable automation and standardized patterns across incident response, change, and platform ops.
- Ensure automation is resilient, observable, and auditable, with clear approval paths for higher-risk actions.
AI-Driven & Agent-Based Operations
- Design and own internal AI-driven operational platforms that retrieve context, reason over signals, and invoke controlled actions across services.
- Define guardrails, approvals, observability, and auditability for AI-initiated actions; integrate AI decisioning directly into workflows.
- Enable agent coordination and capability discovery for safe execution in production.
Observability, Signal Processing, & Decision Systems
- Evolve observability from dashboards to decision systems that feed automation.
- Build signal pipelines that correlate metrics, logs, traces, and events to reduce noise and alert fatigue and to trigger context-aware remediation.
- Leverage existing tools (e.g., Dynatrace DQL, APIs, AIOps, and extensions; Zabbix; PagerDuty; Alertbot) to produce actionable context rather than standalone alerts.
Cloud & Platform Monitoring Enablement
- Support and extend AWS monitoring (e.g., CloudWatch, ECS) with automation hooks and AI-assisted triage.
- Align low-code enterprise automation (e.g., Power Platform) with code-first systems, preventing platform sprawl while accelerating safe, governed workflows.
Leadership & Technical Influence
- Serve as the architectural authority for reliability, automation, and AI-driven operations; mentor senior engineers and uplift organizational maturity.
- Partner with application, middleware, infrastructure, security, and compliance teams to deliver scalable, safety-critical operational platforms.
- Challenge designs that increase operational risk, toil, or manual dependency; champion automation-first solutions.
Required Qualifications
- 5+ years in SRE, Platform Engineering, DevOps, or Infrastructure Engineering supporting complex distributed systems.
- Proven experience designing and operating automation-heavy platforms (event-driven workflows, orchestration, policy/guardrails).
- Strong programming and automation skills (e.g., Python), plus experience with workflow orchestration and event-driven systems.
- Practical experience integrating AI or intelligent decision systems into production operations (3–5 years AI/ML preferred).
- Deep understanding of failure modes, blast radius management, and risk-aware automation.
Important: Candidates with observability-only backgrounds, without deep, hands-on automation/workflow engineering experience, will not be a fit.
Preferred Qualifications
- Experience designing or implementing agent-based or AI-assisted operational systems; familiarity with modern AI platforms and model integration for ops use cases.
- Experience with control-plane architectures for automation and intelligent systems; enterprise automation governance.
- Knowledge of cost-aware reliability (FinOps) and zero-trust principles; relevant cloud/platform certifications.
- AWS ML experience is a strong plus.
Tooling Landscape (as applicable)
- Observability & APM: Dynatrace (DQL, APIs, AIOps, extensions), Zabbix, Alertbot, Foglight.
- On-call & incident: PagerDuty.
- Cloud: AWS (CloudWatch, ECS).
- Scripting & Automation: Shell, PowerShell, YAML, Python; Power Platform for governed low-code automation.
Success Metrics
- Reduction in manual toil and human intervention across operations.
- Increased adoption of automated and AI-assisted remediation.
- Faster detection, triage, and resolution of incidents.
- Improved SLO attainment and change success rates.
- Scalable operational platforms that support growth without proportional increases in operational effort/headcount.
Equal Employment Opportunity Statement
Gravity IT Resources is an Equal Opportunity Employer. We are committed to creating an inclusive environment for all employees and applicants. We do not discriminate on the basis of race, color, religion, sex (including pregnancy, sexual orientation, or gender identity), national origin, age, disability, genetic information, veteran status, or any other legally protected characteristic. All employment decisions are based on qualifications, merit, and business needs.