Platform Engineering: Building Internal Developer Platforms That Actually Get Adopted
A practical guide to platform engineering and building internal developer platforms. Covers golden paths, self-service infrastructure, Backstage, Crossplane, and measuring platform success.
Software engineering teams are drowning in infrastructure complexity. A developer who just wants to ship a feature now needs to understand Kubernetes manifests, Terraform modules, CI/CD pipelines, service mesh configurations, secret management, observability stacks, and security policies before they can get anything into production. The cognitive load is unsustainable. Teams that once deployed in hours now spend days navigating tooling decisions and configuration sprawl.
Platform engineering addresses this by building an Internal Developer Platform (IDP) that abstracts away infrastructure complexity while maintaining the guardrails organizations need. The platform team builds the paved roads, and product developers drive on them. Done well, an IDP dramatically reduces time-to-production, eliminates entire classes of misconfiguration, and lets developers focus on business logic instead of YAML. Gartner predicts that by 2026, 80% of large software engineering organizations will have established platform teams. The shift is well underway.
This guide covers the practical work of building an IDP that developers actually want to use - from golden paths and service catalogs to self-service infrastructure, GitOps delivery, and measuring whether your platform is succeeding.
What Platform Engineering Solves
The core problem is cognitive load. A study by Team Topologies found that the number of tools, technologies, and responsibilities a typical development team manages has increased by over 300% in the last decade. Developers are expected to be experts in application code, cloud infrastructure, container orchestration, networking, security, and observability simultaneously. This is not a skills problem - it is a structural problem.
The symptoms are easy to spot:
- Developers copy-paste Kubernetes manifests from other teams and hope they work
- Every new service takes weeks to get into production because of "infrastructure setup"
- Security and compliance reviews are bottlenecks because configurations vary wildly across teams
- The same infrastructure bugs get rediscovered by different teams independently
- Senior engineers spend most of their time answering infrastructure questions instead of building
Platform engineering solves this by introducing a dedicated team that treats infrastructure as a product. The platform team's customers are the product development teams. Like any product team, they conduct user research, prioritize features, iterate on feedback, and measure adoption.
What a platform team provides:
- Standardized environments that developers provision in minutes, not days
- Golden path templates that encode best practices for common service patterns
- Self-service infrastructure that eliminates ticket-driven provisioning
- Built-in security and compliance so developers get secure-by-default configurations
- Unified observability so every service ships with logging, metrics, and tracing from day one
The critical distinction between platform engineering and traditional infrastructure/DevOps teams is the product mindset. A traditional ops team responds to tickets. A platform team builds products that eliminate the need for tickets.
Golden Paths vs Guardrails
Golden paths and guardrails are complementary concepts that platform teams must implement together.
Golden paths are the recommended, paved way to accomplish common tasks. They are opinionated by design. A golden path for deploying a new microservice might include a specific language runtime, a standard project structure, pre-configured CI/CD, Kubernetes manifests, and observability instrumentation. Developers can choose to leave the golden path, but doing so requires more effort and responsibility.
Guardrails are the boundaries that all services must stay within regardless of whether they follow the golden path. Guardrails include security policies (no containers running as root), compliance requirements (all data encrypted at rest), and operational standards (every service must expose health checks).
Golden Path Templates
The most effective golden paths start with project scaffolding. When a developer creates a new service, the template generates everything they need to go from zero to production.
Here is a golden path template using Cookiecutter for a Node.js microservice:
golden-path-node-service/
cookiecutter.json
{{cookiecutter.service_name}}/
src/
index.ts
routes/
health.ts
middleware/
auth.ts
logging.ts
k8s/
base/
deployment.yaml
service.yaml
hpa.yaml
kustomization.yaml
overlays/
staging/
kustomization.yaml
production/
kustomization.yaml
.github/
workflows/
ci.yaml
deploy.yaml
Dockerfile
package.json
tsconfig.json
.eslintrc.json
catalog-info.yaml
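Several files in this tree are referenced later by the deployment manifest - notably the health endpoints behind the liveness and readiness probes. A minimal, framework-agnostic sketch of the logic src/routes/health.ts might contain (names and shapes here are illustrative, not the template's actual contents):

```typescript
// Hypothetical sketch of the logic behind src/routes/health.ts in the
// generated service. The template would wire these into Express routes at
// /health/live and /health/ready to match the Kubernetes probes.

type DependencyCheck = () => Promise<boolean>;

// Liveness: the process is up and able to respond.
export function livenessStatus(): { status: 'ok' } {
  return { status: 'ok' };
}

// Readiness: every declared dependency (database, queue, ...) is reachable.
export async function readinessStatus(
  checks: Record<string, DependencyCheck>,
): Promise<{ ready: boolean; checks: Record<string, boolean> }> {
  const results: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    try {
      results[name] = await check();
    } catch {
      results[name] = false; // a throwing check counts as not ready
    }
  }
  return { ready: Object.values(results).every(Boolean), checks: results };
}
```

Separating liveness from readiness matters: a service that cannot reach its database should stop receiving traffic (not ready) without being restarted (still live).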
The cookiecutter.json defines the parameters developers provide:
{
"service_name": "my-service",
"description": "A short description of the service",
"team_name": "backend",
"owner_email": "team-backend@company.com",
"port": "3000",
"needs_database": ["none", "postgresql", "redis", "both"],
"needs_queue": ["none", "rabbitmq", "kafka"],
"deployment_environments": ["staging-only", "staging-and-production"]
}
The generated Kubernetes deployment encodes your organization's standards:
# {{cookiecutter.service_name}}/k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ cookiecutter.service_name }}
labels:
app.kubernetes.io/name: {{ cookiecutter.service_name }}
app.kubernetes.io/managed-by: platform-team
team: {{ cookiecutter.team_name }}
spec:
replicas: 2
selector:
matchLabels:
app.kubernetes.io/name: {{ cookiecutter.service_name }}
template:
metadata:
labels:
app.kubernetes.io/name: {{ cookiecutter.service_name }}
team: {{ cookiecutter.team_name }}
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "{{ cookiecutter.port }}"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: {{ cookiecutter.service_name }}
securityContext:
runAsNonRoot: true
fsGroup: 1000
containers:
- name: {{ cookiecutter.service_name }}
image: registry.company.com/{{ cookiecutter.team_name }}/{{ cookiecutter.service_name }}:latest
ports:
- containerPort: {{ cookiecutter.port }}
protocol: TCP
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /health/live
port: {{ cookiecutter.port }}
initialDelaySeconds: 10
periodSeconds: 15
readinessProbe:
httpGet:
path: /health/ready
port: {{ cookiecutter.port }}
initialDelaySeconds: 5
periodSeconds: 10
env:
- name: SERVICE_NAME
value: {{ cookiecutter.service_name }}
- name: LOG_LEVEL
value: "info"
- name: NODE_ENV
valueFrom:
fieldRef:
fieldPath: metadata.namespace
Every golden path template should also generate a catalog-info.yaml for Backstage registration:
# {{cookiecutter.service_name}}/catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
name: {{ cookiecutter.service_name }}
description: {{ cookiecutter.description }}
annotations:
github.com/project-slug: company/{{ cookiecutter.service_name }}
backstage.io/techdocs-ref: dir:.
argocd/app-name: {{ cookiecutter.service_name }}
grafana/dashboard-selector: service={{ cookiecutter.service_name }}
tags:
- nodejs
- typescript
links:
- url: https://grafana.company.com/d/{{ cookiecutter.service_name }}
title: Grafana Dashboard
icon: dashboard
spec:
type: service
lifecycle: production
owner: team-{{ cookiecutter.team_name }}
system: {{ cookiecutter.team_name }}-platform
providesApis:
- {{ cookiecutter.service_name }}-api
Guardrail Implementation
Guardrails are enforced through policy engines. Here is a Kyverno policy that enforces key standards across all deployments:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: platform-guardrails
annotations:
policies.kyverno.io/title: Platform Engineering Guardrails
policies.kyverno.io/description: >-
Enforces minimum standards for all workloads deployed
to the cluster.
spec:
validationFailureAction: Enforce
background: true
rules:
- name: require-labels
match:
any:
- resources:
kinds:
- Deployment
- StatefulSet
validate:
message: "All workloads must have team and app.kubernetes.io/name labels."
pattern:
metadata:
labels:
team: "?*"
app.kubernetes.io/name: "?*"
- name: require-resource-limits
match:
any:
- resources:
kinds:
- Deployment
- StatefulSet
validate:
message: "All containers must specify CPU and memory resource limits."
pattern:
spec:
template:
spec:
containers:
- resources:
limits:
cpu: "?*"
memory: "?*"
requests:
cpu: "?*"
memory: "?*"
- name: restrict-privileged
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Privileged containers are not allowed."
pattern:
spec:
containers:
- securityContext:
privileged: false
The Platform Engineering Stack
A mature IDP is composed of several interconnected layers. Each layer addresses a specific concern, and together they provide a cohesive developer experience.
The reference architecture:
Developer Experience Layer
- Backstage (Service Catalog + Software Templates + TechDocs)
- Developer Portal (custom UI for self-service)
- CLI tools (scaffolding, local development)
Infrastructure Abstraction Layer
- Crossplane (declarative infrastructure API)
- Terraform (infrastructure provisioning)
- Helm/Kustomize (application packaging)
Delivery Layer
- ArgoCD (GitOps continuous delivery)
- GitHub Actions (CI pipelines)
- Container Registry (image storage and scanning)
Observability Layer
- Grafana (dashboards and alerting)
- Prometheus (metrics collection)
- Loki (log aggregation)
- Tempo (distributed tracing)
Security Layer
- OPA/Kyverno (policy enforcement)
- Vault (secret management)
- Sigstore (supply chain security)
- Trivy (vulnerability scanning)
The key integration points are:
- Backstage templates call the infrastructure abstraction layer to provision resources
- GitOps picks up the generated manifests and deploys them
- Observability is pre-wired into every golden path template
- Security policies are enforced at multiple layers (admission control, CI pipeline, runtime)
Building with Backstage
Backstage is the open source developer portal created at Spotify, now a CNCF project, that serves as the foundation for many IDPs. It provides a service catalog, software templates, TechDocs, and a plugin ecosystem. Think of it as the storefront for your platform.
Setting Up Backstage
Bootstrap a Backstage instance:
# Scaffold a new Backstage app
npx @backstage/create-app@latest
# Follow the prompts
# App name: company-developer-portal
# Select database: PostgreSQL (for production)
cd company-developer-portal
# Start in development mode
yarn dev
Configure the app-config.yaml for your organization:
# app-config.yaml
app:
title: Company Developer Portal
baseUrl: http://localhost:3000
organization:
name: Company
backend:
baseUrl: http://localhost:7007
database:
client: pg
connection:
host: ${POSTGRES_HOST}
port: ${POSTGRES_PORT}
user: ${POSTGRES_USER}
password: ${POSTGRES_PASSWORD}
integrations:
github:
- host: github.com
token: ${GITHUB_TOKEN}
catalog:
import:
entityFilename: catalog-info.yaml
pullRequestBranchName: backstage-integration
rules:
- allow: [Component, System, API, Resource, Location, Template]
locations:
- type: url
target: https://github.com/company/backstage-catalog/blob/main/catalog-info.yaml
- type: url
target: https://github.com/company/software-templates/blob/main/all-templates.yaml
proxy:
endpoints:
/argocd/api:
target: https://argocd.company.com/api/v1
headers:
Cookie:
$env: ARGOCD_AUTH_TOKEN
/grafana/api:
target: https://grafana.company.com
headers:
Authorization: Bearer ${GRAFANA_TOKEN}
Creating Software Templates
Software templates are the heart of Backstage's self-service capability. They define a multi-step wizard that collects parameters from the developer and then executes actions to create repositories, register services, and provision infrastructure.
# templates/node-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
name: node-microservice
title: Node.js Microservice
description: >
Creates a production-ready Node.js microservice with TypeScript,
Express, Kubernetes manifests, CI/CD pipelines, and observability
pre-configured.
tags:
- nodejs
- typescript
- recommended
spec:
owner: team-platform
type: service
parameters:
- title: Service Details
required:
- name
- description
- owner
properties:
name:
title: Service Name
type: string
description: Unique name for the service (lowercase, hyphens only)
pattern: "^[a-z][a-z0-9-]*$"
ui:autofocus: true
description:
title: Description
type: string
description: A brief description of what this service does
owner:
title: Owner Team
type: string
description: The team that owns this service
ui:field: OwnerPicker
ui:options:
catalogFilter:
kind: Group
- title: Infrastructure Options
properties:
database:
title: Database
type: string
description: Select a database if your service needs one
default: none
enum:
- none
- postgresql
- redis
- mongodb
enumNames:
- None
- PostgreSQL
- Redis
- MongoDB
needsQueue:
title: Message Queue
type: boolean
default: false
description: Does this service need a message queue?
- title: Deployment Configuration
properties:
environments:
title: Deployment Environments
type: array
items:
type: string
enum:
- staging
- production
uniqueItems: true
default:
- staging
replicaCount:
title: Production Replica Count
type: integer
default: 3
minimum: 2
maximum: 10
steps:
- id: fetch-template
name: Fetch Service Template
action: fetch:template
input:
url: ./skeleton
values:
name: ${{ parameters.name }}
description: ${{ parameters.description }}
owner: ${{ parameters.owner }}
database: ${{ parameters.database }}
needsQueue: ${{ parameters.needsQueue }}
environments: ${{ parameters.environments }}
replicaCount: ${{ parameters.replicaCount }}
- id: publish-repo
name: Create GitHub Repository
action: publish:github
input:
allowedHosts: ["github.com"]
repoUrl: github.com?owner=company&repo=${{ parameters.name }}
description: ${{ parameters.description }}
defaultBranch: main
protectDefaultBranch: true
requiredApprovingReviewCount: 1
topics:
- microservice
- nodejs
- platform-managed
- id: create-argocd-app
name: Register with ArgoCD
action: argocd:create-resources
input:
appName: ${{ parameters.name }}
argoInstance: main
namespace: ${{ parameters.owner }}
repoUrl: https://github.com/company/${{ parameters.name }}.git
path: k8s/overlays/staging
- id: provision-database
name: Provision Database
if: ${{ parameters.database !== 'none' }}
action: http:backstage:request
input:
method: POST
path: /api/proxy/crossplane/compositions
headers:
Content-Type: application/json
body:
apiVersion: database.platform.company.com/v1alpha1
kind: DatabaseClaim
metadata:
name: ${{ parameters.name }}-db
namespace: ${{ parameters.owner }}
spec:
engine: ${{ parameters.database }}
size: small
- id: register-catalog
name: Register in Backstage Catalog
action: catalog:register
input:
repoContentsUrl: ${{ steps['publish-repo'].output.repoContentsUrl }}
catalogInfoPath: /catalog-info.yaml
output:
links:
- title: Repository
url: ${{ steps['publish-repo'].output.remoteUrl }}
- title: Open in Backstage
icon: catalog
entityRef: ${{ steps['register-catalog'].output.entityRef }}
Building a Custom Backstage Plugin
When you need functionality beyond what existing plugins provide, Backstage's plugin architecture makes it straightforward to build your own. Here is a plugin that shows infrastructure cost for each service:
# Generate the plugin scaffold
cd company-developer-portal
yarn new --select plugin
# Enter plugin ID: cost-insights-custom
// plugins/cost-insights-custom/src/components/ServiceCostCard.tsx
import React, { useEffect, useState } from 'react';
import {
InfoCard,
Progress,
ResponseErrorPanel,
} from '@backstage/core-components';
import { useEntity } from '@backstage/plugin-catalog-react';
import { useApi, configApiRef } from '@backstage/core-plugin-api';
import {
Table,
TableBody,
TableCell,
TableHead,
TableRow,
Typography,
Chip,
} from '@material-ui/core';
interface CostBreakdown {
resource: string;
type: string;
monthlyCost: number;
trend: 'up' | 'down' | 'stable';
}
interface ServiceCost {
serviceName: string;
totalMonthlyCost: number;
previousMonthlyCost: number;
breakdown: CostBreakdown[];
lastUpdated: string;
}
export const ServiceCostCard = () => {
const { entity } = useEntity();
const config = useApi(configApiRef);
const [cost, setCost] = useState<ServiceCost | null>(null);
const [loading, setLoading] = useState(true);
const [error, setError] = useState<Error | null>(null);
const serviceName = entity.metadata.name;
const backendUrl = config.getString('backend.baseUrl');
useEffect(() => {
const fetchCost = async () => {
try {
const response = await fetch(
`${backendUrl}/api/proxy/cost-api/services/${serviceName}/cost`,
);
if (!response.ok) {
throw new Error(`Failed to fetch cost data: ${response.statusText}`);
}
const data: ServiceCost = await response.json();
setCost(data);
} catch (err) {
setError(err as Error);
} finally {
setLoading(false);
}
};
fetchCost();
}, [serviceName, backendUrl]);
if (loading) return <Progress />;
if (error) return <ResponseErrorPanel error={error} />;
if (!cost) return <Typography>No cost data available</Typography>;
const percentChange =
((cost.totalMonthlyCost - cost.previousMonthlyCost) /
cost.previousMonthlyCost) *
100;
return (
<InfoCard title="Infrastructure Cost" subheader={`Updated: ${cost.lastUpdated}`}>
<Typography variant="h4">
${cost.totalMonthlyCost.toFixed(2)}/month
</Typography>
<Chip
label={`${percentChange > 0 ? '+' : ''}${percentChange.toFixed(1)}% vs last month`}
color={percentChange > 10 ? 'secondary' : 'default'}
size="small"
style={{ marginBottom: 16 }}
/>
<Table size="small">
<TableHead>
<TableRow>
<TableCell>Resource</TableCell>
<TableCell>Type</TableCell>
<TableCell align="right">Monthly Cost</TableCell>
</TableRow>
</TableHead>
<TableBody>
{cost.breakdown.map((item) => (
<TableRow key={item.resource}>
<TableCell>{item.resource}</TableCell>
<TableCell>{item.type}</TableCell>
<TableCell align="right">${item.monthlyCost.toFixed(2)}</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</InfoCard>
);
};
Register the plugin on the entity page:
// packages/app/src/components/catalog/EntityPage.tsx
import { ServiceCostCard } from '@internal/plugin-cost-insights-custom';
// Add to the service entity page
const serviceEntityPage = (
<EntityLayout>
<EntityLayout.Route path="/" title="Overview">
<Grid container spacing={3}>
{/* existing cards */}
<Grid item md={6}>
<ServiceCostCard />
</Grid>
</Grid>
</EntityLayout.Route>
</EntityLayout>
);
Self-Service Infrastructure with Crossplane
Crossplane extends Kubernetes with the ability to provision and manage cloud infrastructure using the same declarative YAML that developers already know. Instead of writing Terraform and waiting for an ops ticket, developers submit a Kubernetes resource and Crossplane provisions the cloud resources.
Crossplane Architecture
Crossplane introduces four key concepts:
- Providers connect Crossplane to cloud APIs (AWS, GCP, Azure, etc.)
- Managed Resources are individual cloud resources (an RDS instance, an S3 bucket)
- Compositions combine multiple managed resources into higher-level abstractions
- Claims (XRCs) are the developer-facing API for requesting composed resources
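To make the layering concrete: a managed resource - the raw building block that compositions assemble - is itself just a Kubernetes object. A sketch using the Upbound AWS provider's S3 Bucket type (the bucket name is illustrative):

```yaml
# A single managed resource, applied directly with kubectl.
# Platform teams rarely expose these to developers; compositions
# wrap them behind simpler, opinionated claims.
apiVersion: s3.aws.upbound.io/v1beta2
kind: Bucket
metadata:
  name: example-raw-bucket
spec:
  forProvider:
    region: us-east-1
  providerConfigRef:
    name: default
```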
Installing Crossplane and Providers
# Install Crossplane into your Kubernetes cluster
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm install crossplane \
crossplane-stable/crossplane \
--namespace crossplane-system \
--create-namespace
# Install the AWS provider
kubectl apply -f - <<EOF
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
name: provider-aws
spec:
package: xpkg.upbound.io/upbound/provider-family-aws:v1.7.0
EOF
# Configure AWS credentials
kubectl create secret generic aws-creds \
-n crossplane-system \
--from-file=creds=./aws-credentials.txt
kubectl apply -f - <<EOF
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
name: default
spec:
credentials:
source: Secret
secretRef:
namespace: crossplane-system
name: aws-creds
key: creds
EOF
Provisioning a PostgreSQL Database
First, define a Composite Resource Definition (XRD) and Composition that abstracts away the details:
# platform/database/definition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
name: xdatabases.platform.company.com
spec:
group: platform.company.com
names:
kind: XDatabase
plural: xdatabases
claimNames:
kind: DatabaseClaim
plural: databaseclaims
versions:
- name: v1alpha1
served: true
referenceable: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
engine:
type: string
enum: ["postgresql", "mysql"]
description: Database engine type
size:
type: string
enum: ["small", "medium", "large"]
description: >
small = db.t3.medium (2 vCPU, 4 GB).
medium = db.r6g.large (2 vCPU, 16 GB).
large = db.r6g.xlarge (4 vCPU, 32 GB).
version:
type: string
default: "15"
required:
- engine
- size
status:
type: object
properties:
endpoint:
type: string
port:
type: integer
secretName:
type: string
# platform/database/composition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
name: database-aws
labels:
provider: aws
crossplane.io/xrd: xdatabases.platform.company.com
spec:
compositeTypeRef:
apiVersion: platform.company.com/v1alpha1
kind: XDatabase
resources:
- name: subnet-group
base:
apiVersion: rds.aws.upbound.io/v1beta1
kind: SubnetGroup
spec:
forProvider:
region: us-east-1
description: "Platform-managed database subnet group"
subnetIds:
- subnet-0abc123def456
- subnet-0def789abc012
patches:
- fromFieldPath: metadata.name
toFieldPath: metadata.name
transforms:
- type: string
string:
fmt: "%s-subnet-group"
- name: security-group
base:
apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroup
spec:
forProvider:
region: us-east-1
vpcId: vpc-0abc123def456
description: "Platform-managed database security group"
patches:
- fromFieldPath: metadata.name
toFieldPath: metadata.name
transforms:
- type: string
string:
fmt: "%s-sg"
- name: security-group-rule
base:
apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroupRule
spec:
forProvider:
region: us-east-1
type: ingress
fromPort: 5432
toPort: 5432
protocol: tcp
cidrBlocks:
- "10.0.0.0/8"
patches:
- fromFieldPath: metadata.name
toFieldPath: spec.forProvider.securityGroupIdSelector.matchLabels[db-name]
- name: rds-instance
base:
apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
spec:
forProvider:
region: us-east-1
allocatedStorage: 20
autoMinorVersionUpgrade: true
backupRetentionPeriod: 7
deletionProtection: true
multiAz: true
publiclyAccessible: false
storageEncrypted: true
storageType: gp3
skipFinalSnapshot: false
autoGeneratePassword: true
masterUsername: admin
masterUserPasswordSecretRef:
namespace: crossplane-system
key: password
writeConnectionSecretToRef:
namespace: crossplane-system
patches:
- fromFieldPath: spec.engine
toFieldPath: spec.forProvider.engine
- fromFieldPath: spec.version
toFieldPath: spec.forProvider.engineVersion
- fromFieldPath: spec.size
toFieldPath: spec.forProvider.instanceClass
transforms:
- type: map
map:
small: db.t3.medium
medium: db.r6g.large
large: db.r6g.xlarge
- type: ToCompositeFieldPath
fromFieldPath: status.atProvider.endpoint
toFieldPath: status.endpoint
- type: ToCompositeFieldPath
fromFieldPath: status.atProvider.port
toFieldPath: status.port
connectionDetails:
- name: endpoint
fromFieldPath: status.atProvider.endpoint
- name: port
fromFieldPath: status.atProvider.port
type: FromFieldPath
- name: username
fromFieldPath: spec.forProvider.masterUsername
type: FromFieldPath
- name: password
fromConnectionSecretKey: attribute.password
Now a developer can provision a database with a simple claim:
# developer submits this
apiVersion: platform.company.com/v1alpha1
kind: DatabaseClaim
metadata:
name: orders-db
namespace: orders-team
spec:
engine: postgresql
size: small
version: "15"
Provisioning an S3 Bucket
# platform/storage/composition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
name: bucket-aws
labels:
provider: aws
spec:
compositeTypeRef:
apiVersion: platform.company.com/v1alpha1
kind: XBucket
resources:
- name: s3-bucket
base:
apiVersion: s3.aws.upbound.io/v1beta2
kind: Bucket
spec:
forProvider:
region: us-east-1
patches:
- fromFieldPath: metadata.name
toFieldPath: metadata.name
- name: bucket-versioning
base:
apiVersion: s3.aws.upbound.io/v1beta1
kind: BucketVersioning
spec:
forProvider:
region: us-east-1
versioningConfiguration:
- status: Enabled
patches:
- fromFieldPath: metadata.name
toFieldPath: spec.forProvider.bucketSelector.matchLabels[bucket-name]
- name: bucket-encryption
base:
apiVersion: s3.aws.upbound.io/v1beta1
kind: BucketServerSideEncryptionConfiguration
spec:
forProvider:
region: us-east-1
rule:
- applyServerSideEncryptionByDefault:
- sseAlgorithm: aws:kms
patches:
- fromFieldPath: metadata.name
toFieldPath: spec.forProvider.bucketSelector.matchLabels[bucket-name]
- name: bucket-public-access-block
base:
apiVersion: s3.aws.upbound.io/v1beta1
kind: BucketPublicAccessBlock
spec:
forProvider:
region: us-east-1
blockPublicAcls: true
blockPublicPolicy: true
ignorePublicAcls: true
restrictPublicBuckets: true
patches:
- fromFieldPath: metadata.name
toFieldPath: spec.forProvider.bucketSelector.matchLabels[bucket-name]
Every bucket provisioned through the platform automatically gets versioning, encryption, and public access blocking. Developers cannot accidentally create a public, unencrypted bucket.
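As with the database, the developer-facing request stays minimal. A sketch of the claim - this assumes an XBucket XRD with claim kind BucketClaim, defined analogously to the database XRD shown earlier (that definition is not in this guide):

```yaml
# developer submits this; the composition handles versioning,
# encryption, and public access blocking behind the scenes
apiVersion: platform.company.com/v1alpha1
kind: BucketClaim
metadata:
  name: orders-artifacts
  namespace: orders-team
spec:
  compositionSelector:
    matchLabels:
      provider: aws
```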
GitOps-Driven Deployments
ArgoCD serves as the delivery mechanism for the platform. When Backstage creates a new service or a developer pushes changes, ArgoCD picks up the updated manifests from Git and deploys them.
ArgoCD Application Definition
# argocd/applications/orders-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: orders-service
namespace: argocd
labels:
team: orders
managed-by: platform
annotations:
notifications.argoproj.io/subscribe.on-sync-succeeded.slack: platform-deploys
notifications.argoproj.io/subscribe.on-sync-failed.slack: platform-alerts
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: orders-team
source:
repoURL: https://github.com/company/orders-service.git
targetRevision: main
path: k8s/overlays/staging
destination:
server: https://kubernetes.default.svc
namespace: orders
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- PruneLast=true
- ApplyOutOfSyncOnly=true
retry:
limit: 3
backoff:
duration: 5s
factor: 2
maxDuration: 3m0s
ignoreDifferences:
- group: apps
kind: Deployment
jsonPointers:
- /spec/replicas
ApplicationSets for Multi-Environment
Instead of maintaining individual Application manifests per service per environment, use ApplicationSets to generate them dynamically:
# argocd/applicationsets/all-services.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: platform-services
namespace: argocd
spec:
goTemplate: true
goTemplateOptions: ["missingkey=error"]
generators:
- matrix:
generators:
- git:
repoURL: https://github.com/company/gitops-config.git
revision: main
files:
- path: "services/*/config.json"
- list:
elements:
- environment: staging
cluster: https://staging-cluster.company.com
autoSync: true
- environment: production
cluster: https://production-cluster.company.com
autoSync: false
template:
metadata:
name: "{{ .name }}-{{ .environment }}"
namespace: argocd
labels:
team: "{{ .team }}"
environment: "{{ .environment }}"
managed-by: platform
spec:
project: "{{ .team }}-project"
source:
repoURL: "https://github.com/company/{{ .name }}.git"
targetRevision: "{{ if eq .environment \"production\" }}release{{ else }}main{{ end }}"
path: "k8s/overlays/{{ .environment }}"
destination:
server: "{{ .cluster }}"
namespace: "{{ .namespace }}"
syncPolicy:
automated:
prune: "{{ .autoSync }}"
selfHeal: "{{ .autoSync }}"
syncOptions:
- CreateNamespace=true
Each service provides a simple config file:
{
"name": "orders-service",
"team": "orders",
"namespace": "orders",
"tier": "critical"
}Automated Rollbacks
Configure ArgoCD to automatically roll back failed deployments using analysis runs with Argo Rollouts:
# k8s/base/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: orders-service
spec:
replicas: 5
strategy:
canary:
steps:
- setWeight: 10
- pause: { duration: 2m }
- analysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: orders-service
- setWeight: 30
- pause: { duration: 5m }
- analysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: orders-service
- setWeight: 60
- pause: { duration: 5m }
- setWeight: 100
rollbackWindow:
revisions: 2
selector:
matchLabels:
app: orders-service
template:
metadata:
labels:
app: orders-service
spec:
containers:
- name: orders-service
image: registry.company.com/orders/orders-service:v1.2.3
ports:
- containerPort: 3000
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate
spec:
args:
- name: service-name
metrics:
- name: success-rate
interval: 30s
successCondition: result[0] >= 0.99
failureLimit: 3
provider:
prometheus:
address: http://prometheus.monitoring:9090
query: |
sum(rate(
http_requests_total{
service="{{args.service-name}}",
status=~"2.."
}[5m]
)) /
sum(rate(
http_requests_total{
service="{{args.service-name}}"
}[5m]
))
Developer Self-Service Portal
Beyond Backstage's UI, a mature platform provides API-driven provisioning and ChatOps integration so developers can interact with the platform from wherever they work.
Platform API
A lightweight API that wraps Crossplane and ArgoCD, exposing simple operations to developers:
// platform-api/src/routes/services.ts
import { Router, Request, Response } from 'express';
import { KubernetesClient } from '../clients/kubernetes';
import { GitHubClient } from '../clients/github';
import { validateServiceRequest } from '../validators/service';
import { auditLog } from '../middleware/audit';
const router = Router();
const k8s = new KubernetesClient();
const github = new GitHubClient();
interface CreateServiceRequest {
name: string;
team: string;
template: 'node-service' | 'python-service' | 'go-service';
database?: 'postgresql' | 'redis' | 'mongodb';
environments: string[];
}
router.post(
'/services',
auditLog('service.create'),
async (req: Request, res: Response) => {
const body = req.body as CreateServiceRequest;
const validation = validateServiceRequest(body);
if (!validation.valid) {
return res.status(400).json({ errors: validation.errors });
}
try {
// Step 1: Create the repository from template
const repo = await github.createFromTemplate({
templateRepo: `golden-path-${body.template}`,
name: body.name,
owner: 'company',
description: `${body.name} - owned by ${body.team}`,
private: true,
});
// Step 2: Submit Crossplane claims for infrastructure
const resources: string[] = [];
if (body.database) {
await k8s.apply({
apiVersion: 'platform.company.com/v1alpha1',
kind: 'DatabaseClaim',
metadata: {
name: `${body.name}-db`,
namespace: body.team,
},
spec: {
engine: body.database,
size: 'small',
},
});
resources.push(`database:${body.database}`);
}
// Step 3: Create ArgoCD applications for each environment
for (const env of body.environments) {
await k8s.apply({
apiVersion: 'argoproj.io/v1alpha1',
kind: 'Application',
metadata: {
name: `${body.name}-${env}`,
namespace: 'argocd',
labels: {
team: body.team,
environment: env,
'managed-by': 'platform-api',
},
},
spec: {
project: `${body.team}-project`,
source: {
repoURL: repo.clone_url,
targetRevision: env === 'production' ? 'release' : 'main',
path: `k8s/overlays/${env}`,
},
destination: {
server: 'https://kubernetes.default.svc',
namespace: body.team,
},
syncPolicy: {
automated: env !== 'production' ? { prune: true, selfHeal: true } : undefined,
},
},
});
}
return res.status(201).json({
service: body.name,
repository: repo.html_url,
environments: body.environments,
resources,
status: 'provisioning',
estimatedReady: '5-10 minutes',
});
} catch (error) {
console.error('Service creation failed:', error);
return res.status(500).json({ error: 'Service creation failed' });
}
},
);
// Endpoint to check provisioning status
router.get('/services/:name/status', async (req: Request, res: Response) => {
const { name } = req.params;
try {
const argoApps = await k8s.listArgoApplications(name);
const claims = await k8s.listCrossplaneClaims(name);
const status = {
service: name,
deployments: argoApps.map((app: any) => ({
environment: app.metadata.labels.environment,
syncStatus: app.status?.sync?.status || 'Unknown',
healthStatus: app.status?.health?.status || 'Unknown',
})),
infrastructure: claims.map((claim: any) => ({
resource: claim.metadata.name,
kind: claim.kind,
ready: claim.status?.conditions?.find(
(c: any) => c.type === 'Ready',
)?.status === 'True',
})),
};
return res.json(status);
} catch (error) {
return res.status(500).json({ error: 'Failed to fetch status' });
}
});
export default router;

Slack/ChatOps Integration
Let developers provision and manage services directly from Slack:
// platform-bot/src/handlers/slash-commands.ts
import { App, SlashCommand, AckFn } from '@slack/bolt';
import { PlatformAPIClient } from '../clients/platform-api';
const platformApi = new PlatformAPIClient();
export function registerCommands(app: App) {
// /platform create-service orders-service --team orders --template node-service --db postgresql
app.command('/platform', async ({ command, ack, respond }) => {
await ack();
const args = parseCommand(command.text);
switch (args.action) {
case 'create-service':
await handleCreateService(args, command, respond);
break;
case 'status':
await handleStatus(args, command, respond);
break;
case 'promote':
await handlePromote(args, command, respond);
break;
default:
await respond({
text: `Unknown action: ${args.action}. Available commands: create-service, status, promote`,
});
}
});
}
async function handleCreateService(
args: ParsedCommand,
command: SlashCommand,
respond: Function,
) {
const requiredArgs = ['name', 'team', 'template'];
const missing = requiredArgs.filter((arg) => !args.options[arg]);
if (missing.length > 0) {
await respond({
blocks: [
{
type: 'section',
text: {
type: 'mrkdwn',
text: `Missing required arguments: ${missing.join(', ')}\n\nUsage: \`/platform create-service --name my-service --team my-team --template node-service [--db postgresql]\``,
},
},
],
});
return;
}
await respond({
blocks: [
{
type: 'section',
text: {
type: 'mrkdwn',
text: `Creating service *${args.options.name}*...\nTemplate: ${args.options.template}\nTeam: ${args.options.team}\nDatabase: ${args.options.db || 'none'}`,
},
},
],
});
try {
const result = await platformApi.createService({
name: args.options.name,
team: args.options.team,
template: args.options.template,
database: args.options.db,
environments: ['staging'],
});
await respond({
blocks: [
{
type: 'section',
text: {
type: 'mrkdwn',
text: `Service *${result.service}* created successfully.\n\nRepository: ${result.repository}\nEnvironments: ${result.environments.join(', ')}\nResources: ${result.resources.join(', ') || 'none'}\n\nEstimated ready: ${result.estimatedReady}`,
},
},
{
type: 'actions',
elements: [
{
type: 'button',
text: { type: 'plain_text', text: 'View in Backstage' },
url: `https://backstage.company.com/catalog/default/component/${result.service}`,
},
{
type: 'button',
text: { type: 'plain_text', text: 'Check Status' },
action_id: `check_status_${result.service}`,
},
],
},
],
});
} catch (error) {
await respond({
text: `Failed to create service: ${(error as Error).message}`,
});
}
}
async function handlePromote(
args: ParsedCommand,
command: SlashCommand,
respond: Function,
) {
const serviceName = args.options.name;
// Production promotions require approval
await respond({
blocks: [
{
type: 'section',
text: {
type: 'mrkdwn',
text: `*Production Promotion Request*\nService: ${serviceName}\nRequested by: <@${command.user_id}>\n\nThis requires approval from a team lead.`,
},
},
{
type: 'actions',
elements: [
{
type: 'button',
text: { type: 'plain_text', text: 'Approve' },
style: 'primary',
action_id: `approve_promote_${serviceName}`,
},
{
type: 'button',
text: { type: 'plain_text', text: 'Deny' },
style: 'danger',
action_id: `deny_promote_${serviceName}`,
},
],
},
],
});
}
interface ParsedCommand {
action: string;
options: Record<string, string>;
}
function parseCommand(text: string): ParsedCommand {
const parts = text.trim().split(/\s+/);
const action = parts[0];
const options: Record<string, string> = {};
for (let i = 1; i < parts.length; i++) {
if (parts[i].startsWith('--') && i + 1 < parts.length) {
const key = parts[i].replace('--', '');
options[key] = parts[i + 1];
i++;
}
}
// Also support positional name
if (!options.name && parts[1] && !parts[1].startsWith('--')) {
options.name = parts[1];
}
return { action, options };
}

Security and Compliance as Platform Features
A well-built platform makes security the path of least resistance. Instead of security being a review gate that slows teams down, it is embedded into the platform itself.
Policy-as-Code with OPA Gatekeeper
# policies/require-security-context.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8srequiredsecuritycontext
spec:
crd:
spec:
names:
kind: K8sRequiredSecurityContext
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredsecuritycontext
violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
not container.securityContext.runAsNonRoot
msg := sprintf(
"Container '%v' must set securityContext.runAsNonRoot to true",
[container.name]
)
}
violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
not container.securityContext.readOnlyRootFilesystem
msg := sprintf(
"Container '%v' must set securityContext.readOnlyRootFilesystem to true",
[container.name]
)
}
violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
container.securityContext.allowPrivilegeEscalation
msg := sprintf(
"Container '%v' must not allow privilege escalation",
[container.name]
)
}
violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
not container.securityContext.capabilities.drop
msg := sprintf(
"Container '%v' must drop all capabilities",
[container.name]
)
}
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredSecurityContext
metadata:
name: must-have-security-context
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Pod"]
- apiGroups: ["apps"]
kinds: ["Deployment", "StatefulSet", "DaemonSet"]
excludedNamespaces:
- kube-system
- crossplane-system
- argocd

Automated Secret Management with Vault
Integrate HashiCorp Vault into the platform so developers never handle raw secrets:
# platform/vault/secret-store.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
name: platform-vault
spec:
provider:
vault:
server: https://vault.company.com
path: secret
version: v2
auth:
kubernetes:
mountPath: kubernetes
role: platform-external-secrets
serviceAccountRef:
name: external-secrets
namespace: external-secrets
---
# Developer-facing: request a secret from Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: orders-service-secrets
namespace: orders
spec:
refreshInterval: 1h
secretStoreRef:
name: platform-vault
kind: ClusterSecretStore
target:
name: orders-service-secrets
creationPolicy: Owner
data:
- secretKey: DATABASE_URL
remoteRef:
key: teams/orders/orders-service
property: database_url
- secretKey: API_KEY
remoteRef:
key: teams/orders/orders-service
property: api_key

Supply Chain Security with Sigstore
Enforce that only signed and verified images run in the cluster:
# policies/require-signed-images.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: verify-image-signatures
spec:
validationFailureAction: Enforce
webhookTimeoutSeconds: 30
rules:
- name: verify-cosign-signature
match:
any:
- resources:
kinds:
- Pod
verifyImages:
- imageReferences:
- "registry.company.com/*"
attestors:
- entries:
- keyless:
subject: "https://github.com/company/*"
issuer: "https://token.actions.githubusercontent.com"
rekor:
url: https://rekor.sigstore.dev
mutateDigest: true
verifyDigest: true

Add signing to your CI pipeline:
# .github/workflows/build-sign.yaml
name: Build and Sign Image
on:
push:
branches: [main]
permissions:
contents: read
id-token: write
packages: write
jobs:
build-and-sign:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to registry
uses: docker/login-action@v3
with:
registry: registry.company.com
username: ${{ secrets.REGISTRY_USER }}
password: ${{ secrets.REGISTRY_PASSWORD }}
- name: Build and push
id: build
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: registry.company.com/${{ github.repository }}:${{ github.sha }}
sbom: true
provenance: true
- name: Install cosign
uses: sigstore/cosign-installer@v3
- name: Sign the image
run: |
cosign sign --yes \
registry.company.com/${{ github.repository }}@${{ steps.build.outputs.digest }}
env:
COSIGN_EXPERIMENTAL: 1
- name: Verify the signature
run: |
cosign verify \
--certificate-oidc-issuer=https://token.actions.githubusercontent.com \
--certificate-identity-regexp="^https://github.com/company/.*" \
registry.company.com/${{ github.repository }}@${{ steps.build.outputs.digest }}
- name: Run Trivy vulnerability scan
uses: aquasecurity/trivy-action@master
with:
image-ref: registry.company.com/${{ github.repository }}:${{ github.sha }}
format: sarif
output: trivy-results.sarif
severity: CRITICAL,HIGH
exit-code: 1
- name: Upload scan results
uses: github/codeql-action/upload-sarif@v3
if: always()
with:
sarif_file: trivy-results.sarif

Measuring Platform Success
A platform without metrics is a platform without direction. You need to measure whether the platform is actually improving developer productivity and satisfaction.
Key Metrics to Track
DORA Metrics (as influenced by the platform):
- Deployment Frequency - How often teams deploy. The platform should increase this.
- Lead Time for Changes - Time from commit to production. The platform should reduce this.
- Change Failure Rate - Percentage of deployments causing failures. Golden paths should reduce this.
- Mean Time to Recovery (MTTR) - How quickly teams recover from failures. Platform observability should reduce this.
Platform-Specific Metrics:
- Time to First Deploy - How long it takes a new service to reach staging from scratch
- Infrastructure Provisioning Time - Time from request to ready
- Golden Path Adoption Rate - Percentage of new services using golden path templates
- Self-Service Ratio - Percentage of infrastructure provisioned without a support ticket
- Developer Net Promoter Score (NPS) - Quarterly survey of developer satisfaction
- Support Ticket Volume - Decrease in infrastructure-related support tickets
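To make the DORA-style metrics concrete, here is a minimal sketch of how a snapshot could be computed from raw deployment records. The `DeployRecord` shape and field names are hypothetical, not part of any standard; in practice the data would come from your CI system or ArgoCD events.

```typescript
// Hypothetical deployment record; real data would come from CI/CD events.
interface DeployRecord {
  commitAt: number;   // commit timestamp (ms since epoch)
  deployedAt: number; // production deploy timestamp (ms since epoch)
  failed: boolean;    // deploy caused an incident or rollback
}

interface DoraSnapshot {
  deploymentsPerDay: number;
  medianLeadTimeHours: number;
  changeFailureRate: number; // 0..1
}

function doraSnapshot(records: DeployRecord[], windowDays: number): DoraSnapshot {
  if (records.length === 0) {
    return { deploymentsPerDay: 0, medianLeadTimeHours: 0, changeFailureRate: 0 };
  }
  // Lead time per deploy, in hours, sorted for the median.
  const leadTimes = records
    .map((r) => (r.deployedAt - r.commitAt) / 3_600_000)
    .sort((a, b) => a - b);
  const mid = Math.floor(leadTimes.length / 2);
  const median =
    leadTimes.length % 2 === 1
      ? leadTimes[mid]
      : (leadTimes[mid - 1] + leadTimes[mid]) / 2;
  const failures = records.filter((r) => r.failed).length;
  return {
    deploymentsPerDay: records.length / windowDays,
    medianLeadTimeHours: median,
    changeFailureRate: failures / records.length,
  };
}
```

The same aggregation is what the PromQL queries in the dashboard below express declaratively; a function like this is mainly useful for backfilling history or generating reports outside Prometheus.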
Metrics Dashboard Configuration
Set up a Grafana dashboard that tracks platform health:
# monitoring/platform-metrics-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: platform-metrics-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
platform-metrics.json: |
{
"dashboard": {
"title": "Platform Engineering Metrics",
"uid": "platform-eng-metrics",
"tags": ["platform", "engineering-metrics"],
"timezone": "browser",
"refresh": "5m",
"panels": [
{
"title": "Deployment Frequency (Last 30 Days)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"targets": [
{
"expr": "sum(increase(argocd_app_sync_total{phase=\"Succeeded\"}[1d])) by (name)",
"legendFormat": "{{ name }}"
}
]
},
{
"title": "Lead Time for Changes",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
"targets": [
{
"expr": "avg(platform_lead_time_seconds) / 3600",
"legendFormat": "Hours"
}
],
"fieldConfig": {
"defaults": {
"unit": "h",
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 24 },
{ "color": "red", "value": 72 }
]
}
}
}
},
{
"title": "Change Failure Rate",
"type": "gauge",
"gridPos": { "h": 4, "w": 6, "x": 18, "y": 0 },
"targets": [
{
"expr": "sum(argocd_app_sync_total{phase=\"Failed\"}) / sum(argocd_app_sync_total) * 100"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10 },
{ "color": "red", "value": 25 }
]
}
}
}
},
{
"title": "Infrastructure Provisioning Time",
"type": "histogram",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(crossplane_claim_ready_duration_seconds_bucket[7d])) by (le, kind))",
"legendFormat": "p95 - {{ kind }}"
},
{
"expr": "histogram_quantile(0.50, sum(rate(crossplane_claim_ready_duration_seconds_bucket[7d])) by (le, kind))",
"legendFormat": "p50 - {{ kind }}"
}
]
},
{
"title": "Golden Path Adoption",
"type": "piechart",
"gridPos": { "h": 8, "w": 6, "x": 0, "y": 8 },
"targets": [
{
"expr": "count(kube_deployment_labels{label_managed_by=\"platform-team\"})",
"legendFormat": "Golden Path"
},
{
"expr": "count(kube_deployment_labels) - count(kube_deployment_labels{label_managed_by=\"platform-team\"})",
"legendFormat": "Custom"
}
]
},
{
"title": "Self-Service Ratio (Last 30 Days)",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 6, "y": 8 },
"targets": [
{
"expr": "sum(platform_self_service_provisions_total) / (sum(platform_self_service_provisions_total) + sum(platform_manual_provisions_total)) * 100"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 60 },
{ "color": "green", "value": 80 }
]
}
}
}
},
{
"title": "Support Tickets Trend",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 6, "y": 12 },
"targets": [
{
"expr": "sum(increase(platform_support_tickets_total[7d])) by (category)",
"legendFormat": "{{ category }}"
}
]
},
{
"title": "Active Services by Team",
"type": "bargauge",
"gridPos": { "h": 8, "w": 6, "x": 18, "y": 12 },
"targets": [
{
"expr": "count(argocd_app_info{health_status=\"Healthy\"}) by (project)",
"legendFormat": "{{ project }}"
}
]
}
]
}
}

Custom Prometheus Metrics
Instrument your platform API to expose custom metrics:
// platform-api/src/metrics/platform-metrics.ts
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
const register = new Registry();
export const serviceCreationCounter = new Counter({
name: 'platform_service_creations_total',
help: 'Total number of services created through the platform',
labelNames: ['template', 'team', 'status'],
registers: [register],
});
export const provisioningDuration = new Histogram({
name: 'platform_provisioning_duration_seconds',
help: 'Time taken to provision infrastructure resources',
labelNames: ['resource_type', 'provider'],
buckets: [30, 60, 120, 300, 600, 900, 1800],
registers: [register],
});
export const selfServiceProvisions = new Counter({
name: 'platform_self_service_provisions_total',
help: 'Infrastructure provisioned through self-service',
labelNames: ['resource_type'],
registers: [register],
});
export const manualProvisions = new Counter({
name: 'platform_manual_provisions_total',
help: 'Infrastructure provisioned through manual tickets',
labelNames: ['resource_type'],
registers: [register],
});
export const activeServices = new Gauge({
name: 'platform_active_services',
help: 'Number of active services managed by the platform',
labelNames: ['team', 'environment'],
registers: [register],
});
export const leadTime = new Histogram({
name: 'platform_lead_time_seconds',
help: 'Time from commit to production deployment',
labelNames: ['team', 'service'],
buckets: [600, 1800, 3600, 7200, 14400, 28800, 86400],
registers: [register],
});
export { register };

Common Anti-Patterns
Building an IDP is as much about avoiding pitfalls as it is about choosing the right tools. These are the most common mistakes platform teams make.
Building Too Much Too Soon
The most frequent failure mode is building an elaborate platform before understanding what developers actually need. Teams spend months building a sophisticated self-service portal, only to discover that developers needed better CI/CD pipelines first.
What to do instead: Start with the biggest pain point. If developers complain about slow deployments, fix deployments first. If they struggle to provision databases, start there. Use a "thin slice" approach - build a minimal solution for one use case, get feedback, iterate, then expand.
Phase 1 (Months 1-2):
- Golden path for one service type
- Basic CI/CD standardization
- Service catalog in Backstage

Phase 2 (Months 3-4):
- Self-service database provisioning
- Automated environment creation
- Observability integration

Phase 3 (Months 5-6):
- Multi-cloud support
- Security policy automation
- Cost visibility

Phase 4 (Month 7+):
- Advanced workflows
- Custom developer tools
- Platform analytics
Not Treating the Platform as a Product
Platform teams that operate like infrastructure teams - responding to tickets and building what they think developers need - consistently fail. The platform is a product, and developers are the customers.
Product practices that platform teams should adopt:
- User research. Interview developers quarterly. Shadow them as they onboard new services. Understand their frustrations firsthand.
- Feature prioritization. Use a framework (RICE, ICE, or similar) to prioritize platform features based on developer impact.
- Feedback loops. Run monthly retrospectives with platform users. Track feature requests and bug reports in a public backlog.
- Documentation. Maintain developer-facing documentation that explains how to use the platform, not how the platform works internally.
- Onboarding experience. Measure and optimize the "time to hello world" for new developers joining the organization.
Mandating Instead of Attracting
Forcing developers to use your platform by blocking alternative approaches breeds resentment and shadow IT. Developers who feel forced will find workarounds.
The attraction model works better:
- Make the golden path genuinely easier than the alternative
- Provide escape hatches for teams that need custom solutions
- Celebrate teams that adopt the platform early
- Let adoption metrics speak for themselves
If developers are not adopting your platform voluntarily, the platform is not good enough yet. Mandates mask product problems.
Ignoring Developer Feedback
Every platform team has heard "we built it but nobody uses it." This happens when the platform is designed around what the platform team thinks is elegant rather than what developers actually need.
Concrete feedback mechanisms:
- In-portal feedback widgets (thumbs up/down on every page)
- Bi-weekly office hours where developers can ask questions and share frustrations
- Anonymous surveys with specific, actionable questions
- Instrumentation that shows which features are used and which are ignored
- A public roadmap where developers can comment and vote on features
Building Everything In-House
The opposite extreme of buying everything is building everything. Platform teams sometimes build custom solutions for problems that mature open-source tools already solve.
Decision framework:
| Consideration | Build | Buy/Adopt |
|---|---|---|
| Core differentiator for your org | Yes | No |
| Mature OSS solution exists | No | Yes |
| Requires deep integration with internal systems | Yes | Maybe |
| Team has expertise to maintain it | Required | Less critical |
| Estimated build time exceeds 1 quarter | Reconsider | Likely better |
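The table above can be collapsed into a rough scoring heuristic. The weights here are illustrative assumptions, not a standard; the point is that "mature OSS exists" and "we cannot maintain it" should outweigh almost everything else.

```typescript
// Inputs mirror the rows of the decision table; weights are assumptions.
interface BuildBuyInput {
  isCoreDifferentiator: boolean;
  matureOssExists: boolean;
  needsDeepInternalIntegration: boolean;
  teamCanMaintain: boolean;
  buildExceedsOneQuarter: boolean;
}

function buildVsBuy(input: BuildBuyInput): 'build' | 'buy' {
  let buildScore = 0;
  if (input.isCoreDifferentiator) buildScore += 2;
  if (input.needsDeepInternalIntegration) buildScore += 1;
  if (input.matureOssExists) buildScore -= 2;      // strong pull toward adopting
  if (!input.teamCanMaintain) buildScore -= 2;     // building without maintainers fails
  if (input.buildExceedsOneQuarter) buildScore -= 1;
  return buildScore > 0 ? 'build' : 'buy';
}
```

For example, a custom developer portal (mature OSS exists, long build time) scores negative and lands on "buy", while org-specific glue code with deep internal integration lands on "build".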
Use Backstage instead of building a custom developer portal. Use Crossplane or Terraform instead of building a custom provisioning layer. Build the glue code and the developer experience layer on top of existing tools.
Getting Started
If you are standing up a platform team or improving an existing IDP, here is a practical starting sequence:
1. Audit current pain points. Survey 10-15 developers across different teams. Ask: "What takes longer than it should? What do you have to ask for help with?" Categorize responses by frequency and severity.

2. Pick one golden path. Choose the most common service type in your organization and build a complete golden path from repository creation to production deployment. Instrument it. Measure time-to-first-deploy before and after.

3. Stand up Backstage. Even a basic Backstage instance with a service catalog adds immediate value. Developers can discover services, find owners, and access documentation from one place.

4. Introduce one self-service capability. Whether it is database provisioning, environment creation, or secret management, pick the infrastructure request that generates the most support tickets and automate it.

5. Measure and iterate. Track the metrics described in this guide. Share them transparently with the organization. Let the numbers justify continued investment in the platform.
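For the audit in the first step, a simple frequency-times-severity score is usually enough to decide what to build first. This is a minimal sketch; the tally shape and scale are assumptions:

```typescript
// Hypothetical survey tally for one pain point.
interface PainPoint {
  name: string;
  frequency: number; // how many developers mentioned it
  severity: number;  // 1 (annoying) .. 5 (blocking)
}

// Rank pain points by frequency x severity, highest first.
function rankPainPoints(points: PainPoint[]): PainPoint[] {
  return [...points].sort(
    (a, b) => b.frequency * b.severity - a.frequency * a.severity,
  );
}
```

The top-ranked item is the candidate for your first golden path or self-service capability.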
The organizations that succeed with platform engineering share one trait: they treat it as a long-term product investment, not a one-time infrastructure project. The platform is never done - it evolves with the needs of the developers it serves.