Outside IR35 Role - London, Hybrid
Senior Platform Engineer - Agentic AI, Infrastructure as Code
Building and operating an internal Agentic AI Platform.
Responsible for building new workflow automation platform capabilities, keeping the platform operationally healthy, and maintaining the infrastructure-as-code and documentation that underpin it.
Build
Implement new platform capabilities from architectural designs, translating security, governance, and infrastructure requirements into production-grade infrastructure-as-code
Design and build the platform security and secrets management layer, ensuring all workloads operate with least-privilege credentials and certificates issued through a governed PKI hierarchy
Implement and enforce security policy across the cluster using admission control, covering workload configuration, image standards, network traffic, and resource constraints
Build and establish the platform observability stack, providing consistent log aggregation, metrics, distributed tracing, and alerting across all platform components
Design and implement GitOps delivery automation, ensuring all platform changes flow through version-controlled, auditable pipelines with drift reconciliation
Operate
Own the day-to-day operational health of the platform: monitor for issues, respond to incidents, conduct root-cause analysis, and implement lasting remediation
Maintain the health of platform data services -- database cluster, job queue, and object storage -- including backup schedules, failover testing, and capacity management
Monitor and tune autoscaling and resource configuration as workload patterns evolve, ensuring the platform scales responsively without over-provisioning
Manage secrets rotation, certificate lifecycle, policy drift detection, and identity configuration as ongoing operational responsibilities
Participate in planned high-stakes operational procedures -- such as secrets infrastructure initialisation and rotation events -- applying disciplined, documented execution
Experience required:
Kubernetes at depth -- production cluster operation (RKE2, EKS, GKE, or equivalent); Helm, RBAC design, multi-namespace workload management
Secrets management at scale -- production deployment of a secrets management platform (HashiCorp Vault or equivalent), covering PKI, dynamic credentials, and workload secrets injection
Policy-as-code -- admission control policy authoring and enforcement in production Kubernetes environments (OPA/Rego, Kyverno, or equivalent)
GitOps -- Fleet, ArgoCD, Flux, or equivalent at production scale; declarative drift reconciliation, rollback strategy, multi-environment targeting
API gateway engineering -- production deployment and operation of an API or AI gateway (Kong, Envoy, or equivalent); rate limiting, plugin/policy authoring, route management
Linux platform engineering -- networking fundamentals, TLS and PKI, CSI storage operations, container runtime