Platform Engineering: Building Internal Developer Platforms That Actually Get Adopted

Software engineering teams are drowning in infrastructure complexity. A developer who just wants to ship a feature now needs to understand Kubernetes manifests, Terraform modules, CI/CD pipelines, service mesh configurations, secret management, observability stacks, and security policies before they can get anything into production. The cognitive load is unsustainable. Teams that once deployed in hours now spend days navigating tooling decisions and configuration sprawl.

Platform engineering addresses this by building an Internal Developer Platform (IDP) that abstracts away infrastructure complexity while maintaining the guardrails organizations need. The platform team builds the paved roads, and product developers drive on them. Done well, an IDP dramatically reduces time-to-production, eliminates entire classes of misconfiguration, and lets developers focus on business logic instead of YAML. Gartner predicts that by 2026, 80% of large software engineering organizations will have established platform engineering teams. The shift is well underway.

This guide covers the practical work of building an IDP that developers actually want to use - from golden paths and service catalogs to self-service infrastructure, GitOps delivery, and measuring whether your platform is succeeding.

What Platform Engineering Solves

The core problem is cognitive load. A study by Team Topologies found that the number of tools, technologies, and responsibilities a typical development team manages has increased by over 300% in the last decade. Developers are expected to be experts in application code, cloud infrastructure, container orchestration, networking, security, and observability simultaneously. This is not a skills problem - it is a structural problem.

The symptoms are easy to spot:

Developers copy-paste Kubernetes manifests from other teams and hope they work
Every new service takes weeks to get into production because of "infrastructure setup"
Security and compliance reviews are bottlenecks because configurations vary wildly across teams
The same infrastructure bugs get rediscovered by different teams independently
Senior engineers spend most of their time answering infrastructure questions instead of building

Platform engineering solves this by introducing a dedicated team that treats infrastructure as a product. The platform team's customers are the product development teams. Like any product team, they conduct user research, prioritize features, iterate on feedback, and measure adoption.

What a platform team provides:

Standardized environments that developers provision in minutes, not days
Golden path templates that encode best practices for common service patterns
Self-service infrastructure that eliminates ticket-driven provisioning
Built-in security and compliance so developers get secure-by-default configurations
Unified observability so every service ships with logging, metrics, and tracing from day one

The critical distinction between platform engineering and traditional infrastructure/DevOps teams is the product mindset. A traditional ops team responds to tickets. A platform team builds products that eliminate the need for tickets.

Golden Paths vs Guardrails

Golden paths and guardrails are complementary concepts that platform teams must implement together.

Golden paths are the recommended, paved way to accomplish common tasks. They are opinionated by design. A golden path for deploying a new microservice might include a specific language runtime, a standard project structure, pre-configured CI/CD, Kubernetes manifests, and observability instrumentation. Developers can choose to leave the golden path, but doing so requires more effort and responsibility.

Guardrails are the boundaries that all services must stay within regardless of whether they follow the golden path. Guardrails include security policies (no containers running as root), compliance requirements (all data encrypted at rest), and operational standards (every service must expose health checks).

Golden Path Templates

The most effective golden paths start with project scaffolding. When a developer creates a new service, the template generates everything they need to go from zero to production.

Here is a golden path template using Cookiecutter for a Node.js microservice:

golden-path-node-service/
  cookiecutter.json
  {{cookiecutter.service_name}}/
    src/
      index.ts
      routes/
        health.ts
      middleware/
        auth.ts
        logging.ts
    k8s/
      base/
        deployment.yaml
        service.yaml
        hpa.yaml
        kustomization.yaml
      overlays/
        staging/
          kustomization.yaml
        production/
          kustomization.yaml
    .github/
      workflows/
        ci.yaml
        deploy.yaml
    Dockerfile
    package.json
    tsconfig.json
    .eslintrc.json
    catalog-info.yaml

The cookiecutter.json defines the parameters developers provide:

{
  "service_name": "my-service",
  "description": "A short description of the service",
  "team_name": "backend",
  "owner_email": "team-backend@company.com",
  "port": "3000",
  "needs_database": ["none", "postgresql", "redis", "both"],
  "needs_queue": ["none", "rabbitmq", "kafka"],
  "deployment_environments": ["staging-only", "staging-and-production"]
}
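
In Cookiecutter, a list value declares a choice variable, and the first item in the list is the default. The sketch below imitates that defaulting rule in TypeScript, for example to sanity-check a template's cookiecutter.json in CI. The helper name resolveDefaults is hypothetical and not part of Cookiecutter.

```typescript
// Sketch only: mirrors how defaults are read from cookiecutter.json.
// `resolveDefaults` is a hypothetical helper, not a Cookiecutter API.
type CookiecutterValue = string | string[];
type CookiecutterConfig = Record<string, CookiecutterValue>;

function resolveDefaults(config: CookiecutterConfig): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const key of Object.keys(config)) {
    const value = config[key];
    if (Array.isArray(value)) {
      // A list is a choice variable: the first option is the default.
      if (value.length === 0) {
        throw new Error(`Choice variable "${key}" has no options`);
      }
      resolved[key] = value[0];
    } else {
      // A plain string is used as the default as-is.
      resolved[key] = value;
    }
  }
  return resolved;
}

const templateDefaults = resolveDefaults({
  service_name: "my-service",
  port: "3000",
  needs_database: ["none", "postgresql", "redis", "both"],
  needs_queue: ["none", "rabbitmq", "kafka"],
  deployment_environments: ["staging-only", "staging-and-production"],
});
```

With the parameters above, a developer who accepts every prompt gets a service with no database, no queue, and a staging-only deployment, so the safest option should always be listed first.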

The generated Kubernetes deployment encodes your organization's standards:

# {{cookiecutter.service_name}}/k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ cookiecutter.service_name }}
  labels:
    app.kubernetes.io/name: {{ cookiecutter.service_name }}
    app.kubernetes.io/managed-by: platform-team
    team: {{ cookiecutter.team_name }}
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ cookiecutter.service_name }}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ cookiecutter.service_name }}
        team: {{ cookiecutter.team_name }}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "{{ cookiecutter.port }}"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: {{ cookiecutter.service_name }}
      securityContext:
        runAsNonRoot: true
        fsGroup: 1000
      containers:
        - name: {{ cookiecutter.service_name }}
          image: registry.company.com/{{ cookiecutter.team_name }}/{{ cookiecutter.service_name }}:latest
          ports:
            - containerPort: {{ cookiecutter.port }}
              protocol: TCP
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health/live
              port: {{ cookiecutter.port }}
            initialDelaySeconds: 10
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /health/ready
              port: {{ cookiecutter.port }}
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: SERVICE_NAME
              value: {{ cookiecutter.service_name }}
            - name: LOG_LEVEL
              value: "info"
            - name: NODE_ENV
              value: "production"
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

Every golden path template should also generate a catalog-info.yaml for Backstage registration:

# {{cookiecutter.service_name}}/catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: {{ cookiecutter.service_name }}
  description: {{ cookiecutter.description }}
  annotations:
    github.com/project-slug: company/{{ cookiecutter.service_name }}
    backstage.io/techdocs-ref: dir:.
    argocd/app-name: {{ cookiecutter.service_name }}
    grafana/dashboard-selector: service={{ cookiecutter.service_name }}
  tags:
    - nodejs
    - typescript
  links:
    - url: https://grafana.company.com/d/{{ cookiecutter.service_name }}
      title: Grafana Dashboard
      icon: dashboard
spec:
  type: service
  lifecycle: production
  owner: team-{{ cookiecutter.team_name }}
  system: {{ cookiecutter.team_name }}-platform
  providesApis:
    - {{ cookiecutter.service_name }}-api

Guardrail Implementation

Guardrails are enforced through policy engines. Here is a Kyverno policy that enforces key standards across all deployments:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: platform-guardrails
  annotations:
    policies.kyverno.io/title: Platform Engineering Guardrails
    policies.kyverno.io/description: >-
      Enforces minimum standards for all workloads deployed
      to the cluster.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-labels
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "All workloads must have team and app.kubernetes.io/name labels."
        pattern:
          metadata:
            labels:
              team: "?*"
              app.kubernetes.io/name: "?*"
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "All containers must specify CPU and memory requests and limits."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        cpu: "?*"
                        memory: "?*"
                      requests:
                        cpu: "?*"
                        memory: "?*"
    - name: restrict-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
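        pattern:
          spec:
            # =() anchors apply the check only when the field is present,
            # so pods that omit securityContext are still admitted
            =(initContainers):
              - =(securityContext):
                  =(privileged): "false"
            containers:
              - =(securityContext):
                  =(privileged): "false"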
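
Admission control is the enforcement point, but the same rules can also be checked earlier, for example in CI, so developers get feedback before they deploy. The sketch below mirrors the label, resource, and privileged checks for a Deployment-shaped object. It is an illustration under simplified types, not a replacement for the Kyverno policy, and every type and function name in it is hypothetical.

```typescript
// Sketch only: a CI-side pre-check that mirrors the guardrail rules above.
// All names are hypothetical; the cluster policy remains the source of truth.
interface ResourceAmounts {
  cpu?: string;
  memory?: string;
}

interface Container {
  name: string;
  resources?: { limits?: ResourceAmounts; requests?: ResourceAmounts };
  securityContext?: { privileged?: boolean };
}

interface Workload {
  kind: string;
  metadata: { name: string; labels?: Record<string, string> };
  spec: { template: { spec: { containers: Container[] } } };
}

function checkGuardrails(workload: Workload): string[] {
  const violations: string[] = [];
  const labels = workload.metadata.labels ?? {};

  // Rule: team and app.kubernetes.io/name labels must be non-empty.
  for (const required of ["team", "app.kubernetes.io/name"]) {
    if (!labels[required]) violations.push(`missing label: ${required}`);
  }

  for (const c of workload.spec.template.spec.containers) {
    const limits = c.resources?.limits;
    const requests = c.resources?.requests;
    // Rule: CPU and memory must be set for both requests and limits.
    if (!limits?.cpu || !limits?.memory) {
      violations.push(`${c.name}: missing CPU or memory limits`);
    }
    if (!requests?.cpu || !requests?.memory) {
      violations.push(`${c.name}: missing CPU or memory requests`);
    }
    // Rule: privileged containers are rejected.
    if (c.securityContext?.privileged === true) {
      violations.push(`${c.name}: privileged containers are not allowed`);
    }
  }
  return violations;
}
```

An empty result means the manifest passes these three checks; anything else can be printed as a CI failure message.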

The Platform Engineering Stack

A mature IDP is composed of several interconnected layers. Each layer addresses a specific concern, and together they provide a cohesive developer experience.

The reference architecture:

Developer Experience Layer
  - Backstage (Service Catalog + Software Templates + TechDocs)
  - Developer Portal (custom UI for self-service)
  - CLI tools (scaffolding, local development)

Infrastructure Abstraction Layer
  - Crossplane (declarative infrastructure API)
  - Terraform (infrastructure provisioning)
  - Helm/Kustomize (application packaging)

Delivery Layer
  - ArgoCD (GitOps continuous delivery)
  - GitHub Actions (CI pipelines)
  - Container Registry (image storage and scanning)

Observability Layer
  - Grafana (dashboards and alerting)
  - Prometheus (metrics collection)
  - Loki (log aggregation)
  - Tempo (distributed tracing)

Security Layer
  - OPA/Kyverno (policy enforcement)
  - Vault (secret management)
  - Sigstore (supply chain security)
  - Trivy (vulnerability scanning)

The key integration points are:

Backstage templates call the infrastructure abstraction layer to provision resources
GitOps picks up the generated manifests and deploys them
Observability is pre-wired into every golden path template
Security policies are enforced at multiple layers (admission control, CI pipeline, runtime)

Building with Backstage

Backstage is a CNCF project, originally built at Spotify, that serves as the foundation for many IDPs. It provides a service catalog, software templates, TechDocs, and a plugin ecosystem. Think of it as the storefront for your platform.

Setting Up Backstage

Bootstrap a Backstage instance:

# Scaffold a new Backstage app (no separate CLI install is needed)
npx @backstage/create-app@latest
 
# Follow the prompts
# App name: company-developer-portal
# Select database: PostgreSQL (for production)
 
cd company-developer-portal
 
# Start in development mode
yarn dev

Configure the app-config.yaml for your organization:

# app-config.yaml
app:
  title: Company Developer Portal
  baseUrl: http://localhost:3000
 
organization:
  name: Company
 
backend:
  baseUrl: http://localhost:7007
  database:
    client: pg
    connection:
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}
 
integrations:
  github:
    - host: github.com
      token: ${GITHUB_TOKEN}
 
catalog:
  import:
    entityFilename: catalog-info.yaml
    pullRequestBranchName: backstage-integration
  rules:
    - allow: [Component, System, API, Resource, Location, Template]
  locations:
    - type: url
      target: https://github.com/company/backstage-catalog/blob/main/catalog-info.yaml
    - type: url
      target: https://github.com/company/software-templates/blob/main/all-templates.yaml
 
proxy:
  endpoints:
    /argocd/api:
      target: https://argocd.company.com/api/v1
      headers:
        # ARGOCD_AUTH_TOKEN must hold the full cookie value: argocd.token=<token>
        Cookie: ${ARGOCD_AUTH_TOKEN}
 
    /grafana/api:
      target: https://grafana.company.com
      headers:
        Authorization: Bearer ${GRAFANA_TOKEN}
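
The configuration above depends on several environment variables through ${VAR} substitution, and a missing one is easy to overlook until a plugin fails at runtime. The following sketch lists the placeholders in a config file's text and reports the ones that are not set, so a deploy script can fail fast. The helper findMissingEnvVars is hypothetical and is not part of Backstage.

```typescript
// Sketch only: finds ${VAR} placeholders in config text and reports
// those missing from the given environment map.
// `findMissingEnvVars` is a hypothetical helper, not a Backstage API.
function findMissingEnvVars(
  configText: string,
  env: Record<string, string | undefined>,
): string[] {
  const placeholder = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;
  const missing: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = placeholder.exec(configText)) !== null) {
    const name = match[1];
    const value = env[name];
    // Treat unset and empty values the same, and report each name once.
    if ((value === undefined || value === "") && missing.indexOf(name) === -1) {
      missing.push(name);
    }
  }
  return missing;
}
```

In a Node.js start-up script this could be called with the contents of app-config.yaml and process.env, exiting with an error when the returned list is not empty.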

Creating Software Templates

Software templates are the heart of Backstage's self-service capability. They define a multi-step wizard that collects parameters from the developer and then executes actions to create repositories, register services, and provision infrastructure.

# templates/node-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: node-microservice
  title: Node.js Microservice
  description: >
    Creates a production-ready Node.js microservice with TypeScript,
    Express, Kubernetes manifests, CI/CD pipelines, and observability
    pre-configured.
  tags:
    - nodejs
    - typescript
    - recommended
spec:
  owner: team-platform
  type: service
 
  parameters:
    - title: Service Details
      required:
        - name
        - description
        - owner
      properties:
        name:
          title: Service Name
          type: string
          description: Unique name for the service (lowercase, hyphens only)
          pattern: "^[a-z][a-z0-9-]*$"
          ui:autofocus: true
        description:
          title: Description
          type: string
          description: A brief description of what this service does
        owner:
          title: Owner Team
          type: string
          description: The team that owns this service
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
 
    - title: Infrastructure Options
      properties:
        database:
          title: Database
          type: string
          description: Select a database if your service needs one
          default: none
          enum:
            - none
            - postgresql
            - mysql
          enumNames:
            - None
            - PostgreSQL
            - MySQL
        needsQueue:
          title: Message Queue
          type: boolean
          default: false
          description: Does this service need a message queue?
 
    - title: Deployment Configuration
      properties:
        environments:
          title: Deployment Environments
          type: array
          items:
            type: string
            enum:
              - staging
              - production
          uniqueItems: true
          default:
            - staging
        replicaCount:
          title: Production Replica Count
          type: integer
          default: 3
          minimum: 2
          maximum: 10
 
  steps:
    - id: fetch-template
      name: Fetch Service Template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
          database: ${{ parameters.database }}
          needsQueue: ${{ parameters.needsQueue }}
          environments: ${{ parameters.environments }}
          replicaCount: ${{ parameters.replicaCount }}
 
    - id: publish-repo
      name: Create GitHub Repository
      action: publish:github
      input:
        allowedHosts: ["github.com"]
        repoUrl: github.com?owner=company&repo=${{ parameters.name }}
        description: ${{ parameters.description }}
        defaultBranch: main
        protectDefaultBranch: true
        requiredApprovingReviewCount: 1
        topics:
          - microservice
          - nodejs
          - platform-managed
 
    - id: create-argocd-app
      name: Register with ArgoCD
      action: argocd:create-resources
      input:
        appName: ${{ parameters.name }}
        argoInstance: main
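        # OwnerPicker returns an entity ref such as group:default/backend; keep only the name
        namespace: ${{ parameters.owner | parseEntityRef | pick('name') }}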
        repoUrl: https://github.com/company/${{ parameters.name }}.git
        path: k8s/overlays/staging
 
    - id: provision-database
      name: Provision Database
      if: ${{ parameters.database !== 'none' }}
      action: http:backstage:request
      input:
        method: POST
        path: /api/proxy/crossplane/compositions
        headers:
          Content-Type: application/json
        body:
          apiVersion: platform.company.com/v1alpha1
          kind: DatabaseClaim
          metadata:
            name: ${{ parameters.name }}-db
            namespace: ${{ parameters.owner | parseEntityRef | pick('name') }}
          spec:
            engine: ${{ parameters.database }}
            size: small
 
    - id: register-catalog
      name: Register in Backstage Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish-repo'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
 
  output:
    links:
      - title: Repository
        url: ${{ steps['publish-repo'].output.remoteUrl }}
      - title: Open in Backstage
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}
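
Two details of this template are easy to get wrong: the service name pattern, and the fact that OwnerPicker typically returns an entity reference such as group:default/backend, which is not a valid Kubernetes namespace as-is. The sketch below mirrors the template's pattern and shows one way to reduce an entity reference to its name. Both function names are hypothetical, and the parsing is a simplification of Backstage's own entity reference handling.

```typescript
// Sketch only: mirrors the `pattern` from the template parameters and
// reduces an entity reference ([kind:][namespace/]name) to its name part.
// Function names are hypothetical, not Backstage APIs.
const SERVICE_NAME_PATTERN = /^[a-z][a-z0-9-]*$/;

function isValidServiceName(name: string): boolean {
  return SERVICE_NAME_PATTERN.test(name);
}

function entityRefName(ref: string): string {
  // Drop the optional "kind:" prefix.
  const colon = ref.indexOf(":");
  const withoutKind = colon >= 0 ? ref.slice(colon + 1) : ref;
  // Drop the optional "namespace/" prefix.
  const slash = withoutKind.indexOf("/");
  return slash >= 0 ? withoutKind.slice(slash + 1) : withoutKind;
}
```

For example, entityRefName("group:default/backend") yields "backend", which can be used as a namespace, while isValidServiceName("Orders_Service") is rejected before any repository is created.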

Building a Custom Backstage Plugin

When you need functionality beyond what existing plugins provide, Backstage's plugin architecture makes it straightforward to build your own. Here is a plugin that shows infrastructure cost for each service:

# Generate the plugin scaffold
cd company-developer-portal
yarn new --select plugin
# Enter plugin ID: cost-insights-custom

// plugins/cost-insights-custom/src/components/ServiceCostCard.tsx
import React, { useEffect, useState } from 'react';
import {
  InfoCard,
  Progress,
  ResponseErrorPanel,
} from '@backstage/core-components';
import { useEntity } from '@backstage/plugin-catalog-react';
import {
  useApi,
  discoveryApiRef,
  fetchApiRef,
} from '@backstage/core-plugin-api';
import {
  Table,
  TableBody,
  TableCell,
  TableHead,
  TableRow,
  Typography,
  Chip,
} from '@material-ui/core';

interface CostBreakdown {
  resource: string;
  type: string;
  monthlyCost: number;
  trend: 'up' | 'down' | 'stable';
}

interface ServiceCost {
  serviceName: string;
  totalMonthlyCost: number;
  previousMonthlyCost: number;
  breakdown: CostBreakdown[];
  lastUpdated: string;
}

export const ServiceCostCard = () => {
  const { entity } = useEntity();
  // discoveryApi resolves the proxy URL; fetchApi attaches the user's credentials
  const discoveryApi = useApi(discoveryApiRef);
  const fetchApi = useApi(fetchApiRef);
  const [cost, setCost] = useState<ServiceCost | null>(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState<Error | null>(null);

  const serviceName = entity.metadata.name;

  useEffect(() => {
    const fetchCost = async () => {
      try {
        // Assumes a /cost-api endpoint is configured under proxy.endpoints
        const proxyUrl = await discoveryApi.getBaseUrl('proxy');
        const response = await fetchApi.fetch(
          `${proxyUrl}/cost-api/services/${encodeURIComponent(serviceName)}/cost`,
        );
        if (!response.ok) {
          throw new Error(`Failed to fetch cost data: ${response.statusText}`);
        }
        const data: ServiceCost = await response.json();
        setCost(data);
      } catch (err) {
        setError(err as Error);
      } finally {
        setLoading(false);
      }
    };
    fetchCost();
  }, [serviceName, discoveryApi, fetchApi]);
 
  if (loading) return <Progress />;
  if (error) return <ResponseErrorPanel error={error} />;
  if (!cost) return <Typography>No cost data available</Typography>;
 
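  // Avoid dividing by zero when there is no previous month to compare against
  const percentChange =
    cost.previousMonthlyCost > 0
      ? ((cost.totalMonthlyCost - cost.previousMonthlyCost) /
          cost.previousMonthlyCost) *
        100
      : 0;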
 
  return (
    <InfoCard title="Infrastructure Cost" subheader={`Updated: ${cost.lastUpdated}`}>
      <Typography variant="h4">
        ${cost.totalMonthlyCost.toFixed(2)}/month
      </Typography>
      <Chip
        label={`${percentChange > 0 ? '+' : ''}${percentChange.toFixed(1)}% vs last month`}
        color={percentChange > 10 ? 'secondary' : 'default'}
        size="small"
        style={{ marginBottom: 16 }}
      />
      <Table size="small">
        <TableHead>
          <TableRow>
            <TableCell>Resource</TableCell>
            <TableCell>Type</TableCell>
            <TableCell align="right">Monthly Cost</TableCell>
          </TableRow>
        </TableHead>
        <TableBody>
          {cost.breakdown.map((item) => (
            <TableRow key={item.resource}>
              <TableCell>{item.resource}</TableCell>
              <TableCell>{item.type}</TableCell>
              <TableCell align="right">${item.monthlyCost.toFixed(2)}</TableCell>
            </TableRow>
          ))}
        </TableBody>
      </Table>
    </InfoCard>
  );
};

// packages/app/src/components/catalog/EntityPage.tsx
import { ServiceCostCard } from '@internal/plugin-cost-insights-custom';
 
// Add to the service entity page
const serviceEntityPage = (
  <EntityLayout>
    <EntityLayout.Route path="/" title="Overview">
      <Grid container spacing={3}>
        {/* existing cards */}
        <Grid item md={6}>
          <ServiceCostCard />
        </Grid>
      </Grid>
    </EntityLayout.Route>
  </EntityLayout>
);

Self-Service Infrastructure with Crossplane

Crossplane extends Kubernetes with the ability to provision and manage cloud infrastructure using the same declarative YAML that developers already know. Instead of writing Terraform and waiting for an ops ticket, developers submit a Kubernetes resource and Crossplane provisions the cloud resources.

Crossplane Architecture

Crossplane introduces four key concepts:

Providers connect Crossplane to cloud APIs (AWS, GCP, Azure, etc.)
Managed Resources are individual cloud resources (an RDS instance, an S3 bucket)
Compositions combine multiple managed resources into higher-level abstractions
Claims (XRCs) are the developer-facing API for requesting composed resources

Installing Crossplane and Providers

# Install Crossplane into your Kubernetes cluster
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
 
helm install crossplane \
  crossplane-stable/crossplane \
  --namespace crossplane-system \
  --create-namespace
 
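# Install the AWS service providers used by the compositions below.
# Each one pulls in provider-family-aws (which owns ProviderConfig) as a dependency.
kubectl apply -f - <<EOF
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws-rds
spec:
  package: xpkg.upbound.io/upbound/provider-aws-rds:v1.7.0
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws-ec2
spec:
  package: xpkg.upbound.io/upbound/provider-aws-ec2:v1.7.0
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws-s3
spec:
  package: xpkg.upbound.io/upbound/provider-aws-s3:v1.7.0
EOF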
 
# Configure AWS credentials
kubectl create secret generic aws-creds \
  -n crossplane-system \
  --from-file=creds=./aws-credentials.txt
 
kubectl apply -f - <<EOF
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: creds
EOF

Provisioning a PostgreSQL Database

First, define a Composite Resource Definition (XRD) and Composition that abstracts away the details:

# platform/database/definition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xdatabases.platform.company.com
spec:
  group: platform.company.com
  names:
    kind: XDatabase
    plural: xdatabases
  claimNames:
    kind: DatabaseClaim
    plural: databaseclaims
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: ["postgresql", "mysql"]
                  description: Database engine type
                size:
                  type: string
                  enum: ["small", "medium", "large"]
                  description: >
                    small = db.t3.medium (2 vCPU, 4 GB).
                    medium = db.r6g.large (2 vCPU, 16 GB).
                    large = db.r6g.xlarge (4 vCPU, 32 GB).
                version:
                  type: string
                  default: "15"
              required:
                - engine
                - size
            status:
              type: object
              properties:
                endpoint:
                  type: string
                port:
                  type: integer
                secretName:
                  type: string

# platform/database/composition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: database-aws
  labels:
    provider: aws
    crossplane.io/xrd: xdatabases.platform.company.com
spec:
  compositeTypeRef:
    apiVersion: platform.company.com/v1alpha1
    kind: XDatabase
  resources:
    - name: subnet-group
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: SubnetGroup
        spec:
          forProvider:
            region: us-east-1
            description: "Platform-managed database subnet group"
            subnetIds:
              - subnet-0abc123def456
              - subnet-0def789abc012
      patches:
        - fromFieldPath: metadata.name
          toFieldPath: metadata.name
          transforms:
            - type: string
              string:
                fmt: "%s-subnet-group"
 
    - name: security-group
      base:
        apiVersion: ec2.aws.upbound.io/v1beta1
        kind: SecurityGroup
        spec:
          forProvider:
            region: us-east-1
            vpcId: vpc-0abc123def456
            description: "Platform-managed database security group"
      patches:
        - fromFieldPath: metadata.name
          toFieldPath: metadata.name
          transforms:
            - type: string
              string:
                fmt: "%s-sg"
 
    - name: security-group-rule
      base:
        apiVersion: ec2.aws.upbound.io/v1beta1
        kind: SecurityGroupRule
        spec:
          forProvider:
            region: us-east-1
            type: ingress
            fromPort: 5432
            toPort: 5432
            protocol: tcp
            cidrBlocks:
              - "10.0.0.0/8"
            # Attach the rule to the security group composed by this same resource
            securityGroupIdSelector:
              matchControllerRef: true
 
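    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta2
        kind: Instance
        spec:
          forProvider:
            region: us-east-1
            allocatedStorage: 20
            autoMinorVersionUpgrade: true
            backupRetentionPeriod: 7
            deletionProtection: true
            multiAz: true
            publiclyAccessible: false
            storageEncrypted: true
            storageType: gp3
            skipFinalSnapshot: false
            # Wire in the subnet group and security group composed above
            dbSubnetGroupNameSelector:
              matchControllerRef: true
            vpcSecurityGroupIdSelector:
              matchControllerRef: true
            autoGeneratePassword: true
            # "admin" is a reserved username for PostgreSQL on RDS
            username: dbadmin
            passwordSecretRef:
              namespace: crossplane-system
              key: password
          writeConnectionSecretToRef:
            namespace: crossplane-system
      patches:
        # Secret names must be unique per database instance
        - fromFieldPath: metadata.uid
          toFieldPath: spec.forProvider.passwordSecretRef.name
          transforms:
            - type: string
              string:
                fmt: "%s-password"
        - fromFieldPath: metadata.uid
          toFieldPath: spec.writeConnectionSecretToRef.name
          transforms:
            - type: string
              string:
                fmt: "%s-connection"
        # RDS names the PostgreSQL engine "postgres"
        - fromFieldPath: spec.engine
          toFieldPath: spec.forProvider.engine
          transforms:
            - type: map
              map:
                postgresql: postgres
                mysql: mysql
        - fromFieldPath: spec.version
          toFieldPath: spec.forProvider.engineVersion
        - fromFieldPath: spec.size
          toFieldPath: spec.forProvider.instanceClass
          transforms:
            - type: map
              map:
                small: db.t3.medium
                medium: db.r6g.large
                large: db.r6g.xlarge
        - type: ToCompositeFieldPath
          fromFieldPath: status.atProvider.endpoint
          toFieldPath: status.endpoint
        - type: ToCompositeFieldPath
          fromFieldPath: status.atProvider.port
          toFieldPath: status.port
      connectionDetails:
        - name: endpoint
          type: FromFieldPath
          fromFieldPath: status.atProvider.endpoint
        - name: port
          type: FromFieldPath
          fromFieldPath: status.atProvider.port
        - name: username
          type: FromFieldPath
          fromFieldPath: spec.forProvider.username
        - name: password
          type: FromConnectionSecretKey
          fromConnectionSecretKey: attribute.password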

Now a developer can provision a database with a simple claim:

# developer submits this
apiVersion: platform.company.com/v1alpha1
kind: DatabaseClaim
metadata:
  name: orders-db
  namespace: orders-team
spec:
  engine: postgresql
  size: small
  version: "15"
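
To make the mapping from claim to cloud resource concrete, the sketch below imitates two of the Composition's transforms in TypeScript: the map transform that turns a size into an instance class, and the string format transform that derives resource names. It is an illustration only. Crossplane performs these patches itself, the function name resolveDatabase is hypothetical, and the input name stands for the composite resource's name, which Crossplane generates from the claim.

```typescript
// Sketch only: imitates the Composition's map and string-format transforms.
// Crossplane applies the real patches; `resolveDatabase` is hypothetical.
type DatabaseSize = "small" | "medium" | "large";

interface DatabaseClaimSpec {
  engine: "postgresql" | "mysql";
  size: DatabaseSize;
  version?: string;
}

// Same mapping as the `map` transform on spec.size.
const INSTANCE_CLASS: Record<DatabaseSize, string> = {
  small: "db.t3.medium",
  medium: "db.r6g.large",
  large: "db.r6g.xlarge",
};

function resolveDatabase(compositeName: string, spec: DatabaseClaimSpec) {
  return {
    // fmt: "%s-subnet-group" and fmt: "%s-sg"
    subnetGroupName: `${compositeName}-subnet-group`,
    securityGroupName: `${compositeName}-sg`,
    instanceClass: INSTANCE_CLASS[spec.size],
    // The XRD defaults version to "15" when the claim omits it.
    engineVersion: spec.version ?? "15",
  };
}

const resolvedOrdersDb = resolveDatabase("orders-db-x7k2p", {
  engine: "postgresql",
  size: "small",
});
```

The developer only ever sees engine, size, and version. Instance classes, networking, encryption, and backup settings stay under the platform team's control in the Composition.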

Provisioning an S3 Bucket

# platform/storage/composition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: bucket-aws
  labels:
    provider: aws
spec:
  compositeTypeRef:
    apiVersion: platform.company.com/v1alpha1
    kind: XBucket
  resources:
    - name: s3-bucket
      base:
        apiVersion: s3.aws.upbound.io/v1beta2
        kind: Bucket
        spec:
          forProvider:
            region: us-east-1
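      patches:
        - fromFieldPath: metadata.name
          toFieldPath: metadata.name
        # Label the bucket so the bucketSelector patches below can find it
        - fromFieldPath: metadata.name
          toFieldPath: metadata.labels[bucket-name]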
 
    - name: bucket-versioning
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: BucketVersioning
        spec:
          forProvider:
            region: us-east-1
            versioningConfiguration:
              - status: Enabled
      patches:
        - fromFieldPath: metadata.name
          toFieldPath: spec.forProvider.bucketSelector.matchLabels[bucket-name]
 
    - name: bucket-encryption
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: BucketServerSideEncryptionConfiguration
        spec:
          forProvider:
            region: us-east-1
            rule:
              - applyServerSideEncryptionByDefault:
                  - sseAlgorithm: aws:kms
      patches:
        - fromFieldPath: metadata.name
          toFieldPath: spec.forProvider.bucketSelector.matchLabels[bucket-name]
 
    - name: bucket-public-access-block
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: BucketPublicAccessBlock
        spec:
          forProvider:
            region: us-east-1
            blockPublicAcls: true
            blockPublicPolicy: true
            ignorePublicAcls: true
            restrictPublicBuckets: true
      patches:
        - fromFieldPath: metadata.name
          toFieldPath: spec.forProvider.bucketSelector.matchLabels[bucket-name]

Every bucket provisioned through the platform automatically gets versioning, encryption, and public access blocking. Developers cannot accidentally create a public, unencrypted bucket.
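
That guarantee only holds as long as nobody removes one of the hardening resources from the Composition. One way to protect it is a small test in the platform repository that fails when a required kind is missing. The sketch below assumes the Composition has already been parsed into an object; the type and function names are hypothetical.

```typescript
// Sketch only: a repository-level check that a bucket Composition still
// contains every hardening resource. All names are hypothetical.
interface ComposedResource {
  name: string;
  base: { kind: string };
}

interface CompositionLike {
  metadata: { name: string };
  spec: { resources: ComposedResource[] };
}

const REQUIRED_BUCKET_KINDS = [
  "Bucket",
  "BucketVersioning",
  "BucketServerSideEncryptionConfiguration",
  "BucketPublicAccessBlock",
];

function missingBucketKinds(composition: CompositionLike): string[] {
  const present = composition.spec.resources.map((r) => r.base.kind);
  // Keep only the required kinds that do not appear in the Composition.
  return REQUIRED_BUCKET_KINDS.filter((kind) => present.indexOf(kind) === -1);
}
```

Running this against every Composition that composes an XBucket turns "secure by default" from a convention into something the pipeline verifies.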

GitOps-Driven Deployments

ArgoCD serves as the delivery mechanism for the platform. When Backstage creates a new service or a developer pushes changes, ArgoCD picks up the updated manifests from Git and deploys them.

ArgoCD Application Definition

# argocd/applications/orders-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-service
  namespace: argocd
  labels:
    team: orders
    managed-by: platform
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: platform-deploys
    notifications.argoproj.io/subscribe.on-sync-failed.slack: platform-alerts
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: orders-team
  source:
    repoURL: https://github.com/company/orders-service.git
    targetRevision: main
    path: k8s/overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
      - ApplyOutOfSyncOnly=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m0s
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas

ApplicationSets for Multi-Environment

Instead of maintaining individual Application manifests per service per environment, use ApplicationSets to generate them dynamically:

# argocd/applicationsets/all-services.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    - matrix:
        generators:
          - git:
              repoURL: https://github.com/company/gitops-config.git
              revision: main
              files:
                - path: "services/*/config.json"
          - list:
              elements:
                - environment: staging
                  cluster: https://staging-cluster.company.com
                  autoSync: true
                - environment: production
                  cluster: https://production-cluster.company.com
                  autoSync: false
  template:
    metadata:
      name: "{{ .name }}-{{ .environment }}"
      namespace: argocd
      labels:
        team: "{{ .team }}"
        environment: "{{ .environment }}"
        managed-by: platform
    spec:
      project: "{{ .team }}-project"
      source:
        repoURL: "https://github.com/company/{{ .name }}.git"
        targetRevision: "{{ if eq .environment \"production\" }}release{{ else }}main{{ end }}"
        path: "k8s/overlays/{{ .environment }}"
      destination:
        server: "{{ .cluster }}"
        namespace: "{{ .namespace }}"
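  # syncPolicy lives in templatePatch (Argo CD 2.10+) because boolean fields
  # cannot be templated as quoted strings, and an `automated` block enables
  # auto-sync even when prune and selfHeal are both false.
  templatePatch: |
    spec:
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
        {{- if .autoSync }}
        automated:
          prune: true
          selfHeal: true
        {{- end }}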

Each service provides a simple config file:

{
  "name": "orders-service",
  "team": "orders",
  "namespace": "orders",
  "tier": "critical"
}
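
To make the matrix generator concrete, the sketch below models how each service config file is combined with each environment entry to produce one Application per pair. It is a simplified illustration of the templating above, not how the ApplicationSet controller is implemented, and the type and function names are assumptions.

```typescript
// Illustrative model of the matrix generator: services x environments.
interface ServiceConfig {
  name: string;
  team: string;
  namespace: string;
  tier: string;
}

interface EnvironmentEntry {
  environment: string;
  cluster: string;
  autoSync: boolean;
}

interface GeneratedApplication {
  name: string;
  project: string;
  repoURL: string;
  targetRevision: string;
  path: string;
  server: string;
  namespace: string;
  autoSync: boolean;
}

function expandMatrix(
  services: ServiceConfig[],
  environments: EnvironmentEntry[],
): GeneratedApplication[] {
  const apps: GeneratedApplication[] = [];
  for (const svc of services) {
    for (const env of environments) {
      apps.push({
        name: `${svc.name}-${env.environment}`,
        project: `${svc.team}-project`,
        repoURL: `https://github.com/company/${svc.name}.git`,
        // Production tracks the release branch; everything else tracks main.
        targetRevision: env.environment === 'production' ? 'release' : 'main',
        path: `k8s/overlays/${env.environment}`,
        server: env.cluster,
        namespace: svc.namespace,
        autoSync: env.autoSync,
      });
    }
  }
  return apps;
}

const generatedApps = expandMatrix(
  [{ name: 'orders-service', team: 'orders', namespace: 'orders', tier: 'critical' }],
  [
    { environment: 'staging', cluster: 'https://staging-cluster.company.com', autoSync: true },
    { environment: 'production', cluster: 'https://production-cluster.company.com', autoSync: false },
  ],
);
console.log(generatedApps.map((app) => app.name));
```

One config file therefore yields two Applications, orders-service-staging and orders-service-production. Adding a service means adding one JSON file, not one manifest per environment.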

Automated Rollbacks

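Use Argo Rollouts to abort and roll back failed deployments automatically. The canary steps below run an analysis after each traffic shift; if an analysis fails, the rollout is aborted and traffic returns to the stable version: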

# k8s/base/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: orders-service
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: orders-service
        - setWeight: 30
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: orders-service
        - setWeight: 60
        - pause: { duration: 5m }
        - setWeight: 100
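  # rollbackWindow is a top-level spec field: rolling back to one of the last
  # two revisions skips the canary steps and analysis.
  rollbackWindow:
    revisions: 2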
  selector:
    matchLabels:
      app: orders-service
  template:
    metadata:
      labels:
        app: orders-service
    spec:
      containers:
        - name: orders-service
          image: registry.company.com/orders/orders-service:v1.2.3
          ports:
            - containerPort: 3000
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
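      interval: 30s
      # Without count, a metric with an interval runs indefinitely and an
      # inline analysis step never completes.
      count: 5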
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}",
                status=~"2.."
              }[5m]
            )) /
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}"
              }[5m]
            ))
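
The interaction between successCondition and failureLimit is easy to misread. The sketch below models it, assuming (as the Argo Rollouts documentation describes) that the run fails once the number of failed measurements exceeds failureLimit. The function is illustrative and is not Argo Rollouts code.

```typescript
// Illustrative model of how an analysis metric is judged.
type AnalysisOutcome = 'Successful' | 'Failed';

function evaluateAnalysis(
  measurements: number[],
  successThreshold: number,
  failureLimit: number,
): AnalysisOutcome {
  let failedMeasurements = 0;
  for (const value of measurements) {
    // A NaN ratio (0/0 when the service received no traffic) fails the
    // comparison, so this sketch counts it as a failed measurement.
    const passed = value >= successThreshold;
    if (!passed) {
      failedMeasurements++;
    }
    if (failedMeasurements > failureLimit) {
      return 'Failed';
    }
  }
  return 'Successful';
}

// Three dips below 99% are tolerated with failureLimit: 3 ...
const tolerated = evaluateAnalysis([0.995, 0.98, 0.97, 0.985, 0.999], 0.99, 3);
// ... but a fourth failed measurement fails the run.
const aborted = evaluateAnalysis([0.98, 0.97, 0.985, 0.9, 0.999], 0.99, 3);
console.log(tolerated, aborted); // Successful Failed
```

A failureLimit of 3 is fairly forgiving for a critical service. Lowering it makes the canary abort sooner, at the cost of more false alarms from short-lived error spikes.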

Developer Self-Service Portal

Beyond Backstage's UI, a mature platform provides API-driven provisioning and ChatOps integration so developers can interact with the platform from wherever they work.

Platform API

A lightweight API that wraps Crossplane and ArgoCD, exposing simple operations to developers:

// platform-api/src/routes/services.ts
import { Router, Request, Response } from 'express';
import { KubernetesClient } from '../clients/kubernetes';
import { GitHubClient } from '../clients/github';
import { validateServiceRequest } from '../validators/service';
import { auditLog } from '../middleware/audit';
 
const router = Router();
const k8s = new KubernetesClient();
const github = new GitHubClient();
 
interface CreateServiceRequest {
  name: string;
  team: string;
  template: 'node-service' | 'python-service' | 'go-service';
  database?: 'postgresql' | 'redis' | 'mongodb';
  environments: string[];
}
 
router.post(
  '/services',
  auditLog('service.create'),
  async (req: Request, res: Response) => {
    const body = req.body as CreateServiceRequest;
 
    const validation = validateServiceRequest(body);
    if (!validation.valid) {
      return res.status(400).json({ errors: validation.errors });
    }
 
    try {
      // Step 1: Create the repository from template
      const repo = await github.createFromTemplate({
        templateRepo: `golden-path-${body.template}`,
        name: body.name,
        owner: 'company',
        description: `${body.name} - owned by ${body.team}`,
        private: true,
      });
 
      // Step 2: Submit Crossplane claims for infrastructure
      const resources: string[] = [];
 
      if (body.database) {
        await k8s.apply({
          apiVersion: 'platform.company.com/v1alpha1',
          kind: 'DatabaseClaim',
          metadata: {
            name: `${body.name}-db`,
            namespace: body.team,
          },
          spec: {
            engine: body.database,
            size: 'small',
          },
        });
        resources.push(`database:${body.database}`);
      }
 
      // Step 3: Create ArgoCD applications for each environment
      for (const env of body.environments) {
        await k8s.apply({
          apiVersion: 'argoproj.io/v1alpha1',
          kind: 'Application',
          metadata: {
            name: `${body.name}-${env}`,
            namespace: 'argocd',
            labels: {
              team: body.team,
              environment: env,
              'managed-by': 'platform-api',
            },
          },
          spec: {
            project: `${body.team}-project`,
            source: {
              repoURL: repo.clone_url,
              targetRevision: env === 'production' ? 'release' : 'main',
              path: `k8s/overlays/${env}`,
            },
            destination: {
              server: 'https://kubernetes.default.svc',
              namespace: body.team,
            },
            syncPolicy: {
              automated: env !== 'production' ? { prune: true, selfHeal: true } : undefined,
            },
          },
        });
      }
 
      return res.status(201).json({
        service: body.name,
        repository: repo.html_url,
        environments: body.environments,
        resources,
        status: 'provisioning',
        estimatedReady: '5-10 minutes',
      });
    } catch (error) {
      console.error('Service creation failed:', error);
      return res.status(500).json({ error: 'Service creation failed' });
    }
  },
);
 
// Endpoint to check provisioning status
router.get('/services/:name/status', async (req: Request, res: Response) => {
  const { name } = req.params;
 
  try {
    const argoApps = await k8s.listArgoApplications(name);
    const claims = await k8s.listCrossplaneClaims(name);
 
    const status = {
      service: name,
      deployments: argoApps.map((app: any) => ({
        environment: app.metadata.labels.environment,
        syncStatus: app.status?.sync?.status || 'Unknown',
        healthStatus: app.status?.health?.status || 'Unknown',
      })),
      infrastructure: claims.map((claim: any) => ({
        resource: claim.metadata.name,
        kind: claim.kind,
        ready: claim.status?.conditions?.find(
          (c: any) => c.type === 'Ready',
        )?.status === 'True',
      })),
    };
 
    return res.json(status);
  } catch (error) {
    console.error(`Status lookup failed for ${name}:`, error);
    return res.status(500).json({ error: 'Failed to fetch status' });
  }
});
 
export default router;
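
The route above imports validateServiceRequest without showing it. A minimal sketch of what such a validator might check follows. The allowed environments, the 50-character name cap, and the error messages are assumptions made for illustration; a real platform would take them from its own conventions.

```typescript
// Hypothetical validator for the CreateServiceRequest body.
interface ValidationResult {
  valid: boolean;
  errors: string[];
}

const ALLOWED_TEMPLATES = ['node-service', 'python-service', 'go-service'];
const ALLOWED_DATABASES = ['postgresql', 'redis', 'mongodb'];
// Assumption: the platform only knows these two environments.
const ALLOWED_ENVIRONMENTS = ['staging', 'production'];
// Kubernetes DNS label syntax. The 50-character cap is an assumption that
// leaves room for suffixes such as "-production" within the 63-character limit.
const DNS_LABEL = /^[a-z0-9]([-a-z0-9]{0,48}[a-z0-9])?$/;

function validateServiceRequest(body: unknown): ValidationResult {
  if (typeof body !== 'object' || body === null) {
    return { valid: false, errors: ['Request body must be a JSON object'] };
  }
  const req = body as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof req.name !== 'string' || !DNS_LABEL.test(req.name)) {
    errors.push('name must be a lowercase DNS label of at most 50 characters');
  }
  if (typeof req.team !== 'string' || !DNS_LABEL.test(req.team)) {
    errors.push('team must be a lowercase DNS label of at most 50 characters');
  }
  if (typeof req.template !== 'string' || ALLOWED_TEMPLATES.indexOf(req.template) === -1) {
    errors.push(`template must be one of: ${ALLOWED_TEMPLATES.join(', ')}`);
  }
  // database is optional, but must be a known engine when present.
  if (
    req.database !== undefined &&
    (typeof req.database !== 'string' || ALLOWED_DATABASES.indexOf(req.database) === -1)
  ) {
    errors.push(`database must be one of: ${ALLOWED_DATABASES.join(', ')}`);
  }
  const envs = req.environments;
  const envsValid =
    Array.isArray(envs) &&
    envs.length > 0 &&
    envs.every((e: unknown) => typeof e === 'string' && ALLOWED_ENVIRONMENTS.indexOf(e) !== -1);
  if (!envsValid) {
    errors.push(`environments must be a non-empty list drawn from: ${ALLOWED_ENVIRONMENTS.join(', ')}`);
  }

  return { valid: errors.length === 0, errors };
}
```

Validating the name against Kubernetes naming rules up front matters because the same string becomes a repository name, a claim name, and an Argo CD Application name. Rejecting it early avoids a half-provisioned service.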

Slack/ChatOps Integration

Let developers provision and manage services directly from Slack:

// platform-bot/src/handlers/slash-commands.ts
import { App, SlashCommand } from '@slack/bolt';
import { PlatformAPIClient } from '../clients/platform-api';
 
const platformApi = new PlatformAPIClient();
 
export function registerCommands(app: App) {
  // /platform create-service orders-service --team orders --template node-service --db postgresql
  app.command('/platform', async ({ command, ack, respond }) => {
    await ack();
    const args = parseCommand(command.text);
 
    switch (args.action) {
      case 'create-service':
        await handleCreateService(args, command, respond);
        break;
      case 'status':
        await handleStatus(args, command, respond);
        break;
      case 'promote':
        await handlePromote(args, command, respond);
        break;
      default:
        await respond({
          text: `Unknown action: ${args.action}. Available commands: create-service, status, promote`,
        });
    }
  });
}
 
async function handleCreateService(
  args: ParsedCommand,
  command: SlashCommand,
  respond: Function,
) {
  const requiredArgs = ['name', 'team', 'template'];
  const missing = requiredArgs.filter((arg) => !args.options[arg]);
 
  if (missing.length > 0) {
    await respond({
      blocks: [
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `Missing required arguments: ${missing.join(', ')}\n\nUsage: \`/platform create-service --name my-service --team my-team --template node-service [--db postgresql]\``,
          },
        },
      ],
    });
    return;
  }
 
  await respond({
    blocks: [
      {
        type: 'section',
        text: {
          type: 'mrkdwn',
          text: `Creating service *${args.options.name}*...\nTemplate: ${args.options.template}\nTeam: ${args.options.team}\nDatabase: ${args.options.db || 'none'}`,
        },
      },
    ],
  });
 
  try {
    const result = await platformApi.createService({
      name: args.options.name,
      team: args.options.team,
      template: args.options.template,
      database: args.options.db,
      environments: ['staging'],
    });
 
    await respond({
      blocks: [
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `Service *${result.service}* created successfully.\n\nRepository: ${result.repository}\nEnvironments: ${result.environments.join(', ')}\nResources: ${result.resources.join(', ') || 'none'}\n\nEstimated ready: ${result.estimatedReady}`,
          },
        },
        {
          type: 'actions',
          elements: [
            {
              type: 'button',
              text: { type: 'plain_text', text: 'View in Backstage' },
              url: `https://backstage.company.com/catalog/default/component/${result.service}`,
            },
            {
              type: 'button',
              text: { type: 'plain_text', text: 'Check Status' },
              action_id: `check_status_${result.service}`,
            },
          ],
        },
      ],
    });
  } catch (error) {
    await respond({
      text: `Failed to create service: ${(error as Error).message}`,
    });
  }
}
 
// Reports sync and health status for a service.
// Assumes PlatformAPIClient exposes getServiceStatus(), a hypothetical wrapper
// around GET /services/:name/status.
async function handleStatus(
  args: ParsedCommand,
  command: SlashCommand,
  respond: Function,
) {
  const serviceName = args.options.name;
  if (!serviceName) {
    await respond({ text: 'Usage: `/platform status --name my-service`' });
    return;
  }
 
  try {
    const status = await platformApi.getServiceStatus(serviceName);
    const lines = status.deployments.map(
      (d: any) => `${d.environment}: ${d.syncStatus} / ${d.healthStatus}`,
    );
    await respond({
      text: `*${serviceName}*\n${lines.join('\n') || 'No deployments found'}`,
    });
  } catch (error) {
    await respond({
      text: `Failed to fetch status: ${(error as Error).message}`,
    });
  }
}
 
async function handlePromote(
  args: ParsedCommand,
  command: SlashCommand,
  respond: Function,
) {
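  const serviceName = args.options.name;
  if (!serviceName) {
    await respond({ text: 'Usage: `/platform promote --name my-service`' });
    return;
  }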
 
  // Production promotions require approval
  await respond({
    blocks: [
      {
        type: 'section',
        text: {
          type: 'mrkdwn',
          text: `*Production Promotion Request*\nService: ${serviceName}\nRequested by: <@${command.user_id}>\n\nThis requires approval from a team lead.`,
        },
      },
      {
        type: 'actions',
        elements: [
          {
            type: 'button',
            text: { type: 'plain_text', text: 'Approve' },
            style: 'primary',
            action_id: `approve_promote_${serviceName}`,
          },
          {
            type: 'button',
            text: { type: 'plain_text', text: 'Deny' },
            style: 'danger',
            action_id: `deny_promote_${serviceName}`,
          },
        ],
      },
    ],
  });
}
 
interface ParsedCommand {
  action: string;
  options: Record<string, string>;
}
 
function parseCommand(text: string): ParsedCommand {
  const parts = text.trim().split(/\s+/);
  const action = parts[0];
  const options: Record<string, string> = {};
 
  for (let i = 1; i < parts.length; i++) {
    if (parts[i].startsWith('--') && i + 1 < parts.length) {
      const key = parts[i].replace('--', '');
      options[key] = parts[i + 1];
      i++;
    }
  }
 
  // Also support positional name
  if (!options.name && parts[1] && !parts[1].startsWith('--')) {
    options.name = parts[1];
  }
 
  return { action, options };
}
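
The promote handler posts Approve and Deny buttons, but the button handler still has to decide who may click them. A minimal sketch of that check follows, assuming a team-to-leads mapping is available from somewhere such as the service catalog. The mapping, the type names, and the rule that requesters cannot approve their own promotion are all assumptions for illustration.

```typescript
// Hypothetical authorization check for production promotion approvals.
interface PromotionRequest {
  service: string;
  team: string;
  requestedBy: string; // Slack user ID of the requester
}

interface ApprovalDecision {
  allowed: boolean;
  reason: string;
}

function canApprove(
  request: PromotionRequest,
  approverId: string,
  teamLeads: Record<string, string[]>,
): ApprovalDecision {
  const leads = teamLeads[request.team] || [];
  if (leads.indexOf(approverId) === -1) {
    return {
      allowed: false,
      reason: `Only ${request.team} team leads can approve this promotion`,
    };
  }
  // Assumption: a second person must approve, even if the requester is a lead.
  if (approverId === request.requestedBy) {
    return { allowed: false, reason: 'Requesters cannot approve their own promotion' };
  }
  return { allowed: true, reason: 'Approved by team lead' };
}

// Example with made-up Slack user IDs.
const leadsByTeam: Record<string, string[]> = { orders: ['U100', 'U200'] };
const promotion: PromotionRequest = {
  service: 'orders-service',
  team: 'orders',
  requestedBy: 'U100',
};
console.log(canApprove(promotion, 'U200', leadsByTeam).allowed); // true
console.log(canApprove(promotion, 'U100', leadsByTeam).allowed); // false
```

Keeping the decision in a pure function makes it easy to unit test separately from the Slack action handler that calls it.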

Security and Compliance as Platform Features

A well-built platform makes security the path of least resistance. Instead of security being a review gate that slows teams down, it is embedded into the platform itself.

Policy-as-Code with OPA Gatekeeper

# policies/require-security-context.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredsecuritycontext
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredSecurityContext
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
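        package k8srequiredsecuritycontext
 
        # Collect containers from Pods and from workload pod templates
        # (Deployment, StatefulSet, DaemonSet), including init containers.
        input_containers[c] {
          c := input.review.object.spec.containers[_]
        }
 
        input_containers[c] {
          c := input.review.object.spec.initContainers[_]
        }
 
        input_containers[c] {
          c := input.review.object.spec.template.spec.containers[_]
        }
 
        input_containers[c] {
          c := input.review.object.spec.template.spec.initContainers[_]
        }
 
        violation[{"msg": msg}] {
          container := input_containers[_]
          not container.securityContext.runAsNonRoot
          msg := sprintf(
            "Container '%v' must set securityContext.runAsNonRoot to true",
            [container.name]
          )
        }
 
        violation[{"msg": msg}] {
          container := input_containers[_]
          not container.securityContext.readOnlyRootFilesystem
          msg := sprintf(
            "Container '%v' must set securityContext.readOnlyRootFilesystem to true",
            [container.name]
          )
        }
 
        violation[{"msg": msg}] {
          container := input_containers[_]
          # Kubernetes allows escalation unless this is explicitly false
          not container.securityContext.allowPrivilegeEscalation == false
          msg := sprintf(
            "Container '%v' must not allow privilege escalation",
            [container.name]
          )
        }
 
        violation[{"msg": msg}] {
          container := input_containers[_]
          not drops_all(container)
          msg := sprintf(
            "Container '%v' must drop all capabilities",
            [container.name]
          )
        }
 
        drops_all(container) {
          container.securityContext.capabilities.drop[_] == "ALL"
        }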
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredSecurityContext
metadata:
  name: must-have-security-context
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
      - kube-system
      - crossplane-system
      - argocd

Automated Secret Management with Vault

Integrate HashiCorp Vault into the platform so developers never handle raw secrets:

# platform/vault/secret-store.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: platform-vault
spec:
  provider:
    vault:
      server: https://vault.company.com
      path: secret
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: platform-external-secrets
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets
---
# Developer-facing: request a secret from Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-service-secrets
  namespace: orders
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: platform-vault
    kind: ClusterSecretStore
  target:
    name: orders-service-secrets
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: teams/orders/orders-service
        property: database_url
    - secretKey: API_KEY
      remoteRef:
        key: teams/orders/orders-service
        property: api_key

Supply Chain Security with Sigstore

Enforce that only signed and verified images run in the cluster:

# policies/require-signed-images.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "registry.company.com/*"
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/company/*"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
          mutateDigest: true
          verifyDigest: true

Add signing to your CI pipeline:

# .github/workflows/build-sign.yaml
name: Build and Sign Image
on:
  push:
    branches: [main]
 
permissions:
  contents: read
  id-token: write
  packages: write
  security-events: write  # required by the upload-sarif step
 
jobs:
  build-and-sign:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
 
      - name: Login to registry
        uses: docker/login-action@v3
        with:
          registry: registry.company.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
 
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: registry.company.com/${{ github.repository }}:${{ github.sha }}
          sbom: true
          provenance: true
 
      - name: Install cosign
        uses: sigstore/cosign-installer@v3
 
      - name: Sign the image
        # cosign 2.x signs keylessly by default, so COSIGN_EXPERIMENTAL is not needed
        run: |
          cosign sign --yes \
            registry.company.com/${{ github.repository }}@${{ steps.build.outputs.digest }}
 
      - name: Verify the signature
        run: |
          cosign verify \
            --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
            --certificate-identity-regexp="^https://github.com/company/.+" \
            registry.company.com/${{ github.repository }}@${{ steps.build.outputs.digest }}
 
      - name: Run Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master  # pin to a release tag or commit SHA in real pipelines
        with:
          image-ref: registry.company.com/${{ github.repository }}:${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          exit-code: 1
 
      - name: Upload scan results
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: trivy-results.sarif

Measuring Platform Success

A platform without metrics is a platform without direction. You need to measure whether the platform is actually improving developer productivity and satisfaction.

Key Metrics to Track

DORA Metrics (as influenced by the platform):

Deployment Frequency - How often teams deploy. The platform should increase this.
Lead Time for Changes - Time from commit to production. The platform should reduce this.
Change Failure Rate - Percentage of deployments causing failures. Golden paths should reduce this.
Mean Time to Recovery (MTTR) - How quickly teams recover from failures. Platform observability should reduce this.
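
The four DORA metrics can be derived from a plain list of deployment records. The sketch below shows one way to compute them. The record shape and the choice of median for lead time are assumptions, and the sample data is made up for illustration.

```typescript
// Illustrative DORA calculation over a list of deployment records.
interface DeploymentRecord {
  committedAt: number; // epoch ms of the change's commit
  deployedAt: number;  // epoch ms when it reached production
  failed: boolean;     // did the deployment cause a production failure?
  restoredAt?: number; // epoch ms when service was restored, if it failed
}

interface DoraSummary {
  deploymentsPerDay: number;
  medianLeadTimeHours: number;
  changeFailureRatePercent: number;
  meanTimeToRecoveryHours: number;
}

const HOUR_MS = 60 * 60 * 1000;

function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

function summarizeDora(deployments: DeploymentRecord[], periodDays: number): DoraSummary {
  if (deployments.length === 0 || periodDays <= 0) {
    throw new Error('Need at least one deployment and a positive period');
  }
  const leadTimes = deployments.map((d) => (d.deployedAt - d.committedAt) / HOUR_MS);
  const failures = deployments.filter((d) => d.failed);
  // Only failures that have been resolved contribute to recovery time.
  const recoveries = failures
    .filter((d) => d.restoredAt !== undefined)
    .map((d) => ((d.restoredAt as number) - d.deployedAt) / HOUR_MS);
  const totalRecovery = recoveries.reduce((sum, hours) => sum + hours, 0);
  return {
    deploymentsPerDay: deployments.length / periodDays,
    medianLeadTimeHours: median(leadTimes),
    changeFailureRatePercent: (failures.length / deployments.length) * 100,
    meanTimeToRecoveryHours: recoveries.length === 0 ? 0 : totalRecovery / recoveries.length,
  };
}

// Made-up sample: four deployments over two days, one of which failed.
const sampleDeployments: DeploymentRecord[] = [
  { committedAt: 0, deployedAt: 2 * HOUR_MS, failed: false },
  { committedAt: 0, deployedAt: 4 * HOUR_MS, failed: false },
  { committedAt: 0, deployedAt: 6 * HOUR_MS, failed: true, restoredAt: 7 * HOUR_MS },
  { committedAt: 0, deployedAt: 8 * HOUR_MS, failed: false },
];
console.log(summarizeDora(sampleDeployments, 2));
```

For this sample the summary is 2 deployments per day, a 5-hour median lead time, a 25% change failure rate, and a 1-hour mean time to recovery. Comparing the same summary before and after a team adopts the platform is what ties these metrics to the platform itself.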

Platform-Specific Metrics:

Time to First Deploy - How long it takes a new service to reach staging from scratch
Infrastructure Provisioning Time - Time from request to ready
Golden Path Adoption Rate - Percentage of new services using golden path templates
Self-Service Ratio - Percentage of infrastructure provisioned without a support ticket
Developer Net Promoter Score (NPS) - Quarterly survey of developer satisfaction
Support Ticket Volume - Decrease in infrastructure-related support tickets
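
Several of these metrics are simple ratios, and it helps to pin down the arithmetic. The sketch below uses the standard NPS definition (percentage of promoters scoring 9-10 minus percentage of detractors scoring 0-6). The survey scores and counts are made up for illustration.

```typescript
// NPS = % promoters (scores 9-10) minus % detractors (scores 0-6).
function netPromoterScore(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error('No survey responses');
  }
  const promoters = scores.filter((s) => s >= 9).length;
  const detractors = scores.filter((s) => s <= 6).length;
  return Math.round(((promoters - detractors) / scores.length) * 100);
}

// Generic "part of total" percentage used for adoption and self-service ratios.
function percentage(part: number, total: number): number {
  return total === 0 ? 0 : (part / total) * 100;
}

// Made-up quarterly survey: 5 promoters, 2 passives, 3 detractors.
const surveyScores = [10, 9, 9, 8, 7, 6, 3, 10, 9, 5];
const developerNps = netPromoterScore(surveyScores);

// Made-up counts: 42 of 50 provisioning requests needed no ticket,
// and 18 of 24 new services started from a golden path template.
const selfServiceRatio = percentage(42, 50);
const goldenPathAdoption = percentage(18, 24);

console.log(developerNps, selfServiceRatio, goldenPathAdoption); // 20 84 75
```

NPS ranges from -100 to 100, so a score of 20 means promoters outnumber detractors by 20 percentage points, not that 20% of developers are satisfied.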

Metrics Dashboard Configuration

Set up a Grafana dashboard that tracks platform health:

# monitoring/platform-metrics-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: platform-metrics-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  platform-metrics.json: |
    {
      "dashboard": {
        "title": "Platform Engineering Metrics",
        "uid": "platform-eng-metrics",
        "tags": ["platform", "engineering-metrics"],
        "timezone": "browser",
        "refresh": "5m",
        "panels": [
          {
            "title": "Deployment Frequency (Last 30 Days)",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
            "targets": [
              {
                "expr": "sum(increase(argocd_app_sync_total{phase=\"Succeeded\"}[1d])) by (name)",
                "legendFormat": "{{ name }}"
              }
            ]
          },
          {
            "title": "Lead Time for Changes",
            "type": "stat",
            "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
            "targets": [
              {
                "expr": "sum(rate(platform_lead_time_seconds_sum[30d])) / sum(rate(platform_lead_time_seconds_count[30d])) / 3600",
                "legendFormat": "Hours"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "h",
                "thresholds": {
                  "steps": [
                    { "color": "green", "value": null },
                    { "color": "yellow", "value": 24 },
                    { "color": "red", "value": 72 }
                  ]
                }
              }
            }
          },
          {
            "title": "Change Failure Rate",
            "type": "gauge",
            "gridPos": { "h": 4, "w": 6, "x": 18, "y": 0 },
            "targets": [
              {
                "expr": "sum(argocd_app_sync_total{phase=\"Failed\"}) / sum(argocd_app_sync_total) * 100"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "min": 0,
                "max": 100,
                "thresholds": {
                  "steps": [
                    { "color": "green", "value": null },
                    { "color": "yellow", "value": 10 },
                    { "color": "red", "value": 25 }
                  ]
                }
              }
            }
          },
          {
            "title": "Infrastructure Provisioning Time",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(platform_provisioning_duration_seconds_bucket[7d])) by (le, resource_type))",
                "legendFormat": "p95 - {{ resource_type }}"
              },
              {
                "expr": "histogram_quantile(0.50, sum(rate(platform_provisioning_duration_seconds_bucket[7d])) by (le, resource_type))",
                "legendFormat": "p50 - {{ resource_type }}"
              }
            ]
          },
          {
            "title": "Golden Path Adoption",
            "type": "piechart",
            "gridPos": { "h": 8, "w": 6, "x": 0, "y": 8 },
            "targets": [
              {
                "expr": "count(kube_deployment_labels{label_managed_by=\"platform-team\"})",
                "legendFormat": "Golden Path"
              },
              {
                "expr": "count(kube_deployment_labels) - count(kube_deployment_labels{label_managed_by=\"platform-team\"})",
                "legendFormat": "Custom"
              }
            ]
          },
          {
            "title": "Self-Service Ratio (Last 30 Days)",
            "type": "stat",
            "gridPos": { "h": 4, "w": 6, "x": 6, "y": 8 },
            "targets": [
              {
                "expr": "sum(increase(platform_self_service_provisions_total[30d])) / (sum(increase(platform_self_service_provisions_total[30d])) + sum(increase(platform_manual_provisions_total[30d]))) * 100"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    { "color": "red", "value": null },
                    { "color": "yellow", "value": 60 },
                    { "color": "green", "value": 80 }
                  ]
                }
              }
            }
          },
          {
            "title": "Support Tickets Trend",
            "type": "timeseries",
            "gridPos": { "h": 8, "w": 12, "x": 6, "y": 12 },
            "targets": [
              {
                "expr": "sum(increase(platform_support_tickets_total[7d])) by (category)",
                "legendFormat": "{{ category }}"
              }
            ]
          },
          {
            "title": "Active Services by Team",
            "type": "bargauge",
            "gridPos": { "h": 8, "w": 6, "x": 18, "y": 12 },
            "targets": [
              {
                "expr": "count(argocd_app_info{health_status=\"Healthy\"}) by (project)",
                "legendFormat": "{{ project }}"
              }
            ]
          }
        ]
      }
    }

Custom Prometheus Metrics

Instrument your platform API to expose custom metrics:

// platform-api/src/metrics/platform-metrics.ts
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
 
const register = new Registry();
 
export const serviceCreationCounter = new Counter({
  name: 'platform_service_creations_total',
  help: 'Total number of services created through the platform',
  labelNames: ['template', 'team', 'status'],
  registers: [register],
});
 
export const provisioningDuration = new Histogram({
  name: 'platform_provisioning_duration_seconds',
  help: 'Time taken to provision infrastructure resources',
  labelNames: ['resource_type', 'provider'],
  buckets: [30, 60, 120, 300, 600, 900, 1800],
  registers: [register],
});
 
export const selfServiceProvisions = new Counter({
  name: 'platform_self_service_provisions_total',
  help: 'Infrastructure provisioned through self-service',
  labelNames: ['resource_type'],
  registers: [register],
});
 
export const manualProvisions = new Counter({
  name: 'platform_manual_provisions_total',
  help: 'Infrastructure provisioned through manual tickets',
  labelNames: ['resource_type'],
  registers: [register],
});
 
export const activeServices = new Gauge({
  name: 'platform_active_services',
  help: 'Number of active services managed by the platform',
  labelNames: ['team', 'environment'],
  registers: [register],
});
 
export const leadTime = new Histogram({
  name: 'platform_lead_time_seconds',
  help: 'Time from commit to production deployment',
  labelNames: ['team', 'service'],
  buckets: [600, 1800, 3600, 7200, 14400, 28800, 86400],
  registers: [register],
});
 
export { register };
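
The bucket boundaries above determine what the dashboard can resolve, because Prometheus histogram buckets are cumulative: each one counts every observation less than or equal to its upper bound, plus a final +Inf bucket. The sketch below reproduces that counting in plain TypeScript, without prom-client, for a few made-up provisioning times.

```typescript
// Illustrative model of cumulative histogram buckets.
function cumulativeBucketCounts(
  upperBounds: number[],
  observations: number[],
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bound of upperBounds) {
    // Each bucket includes everything counted by the smaller buckets.
    counts[String(bound)] = observations.filter((value) => value <= bound).length;
  }
  // The +Inf bucket always equals the total number of observations.
  counts['+Inf'] = observations.length;
  return counts;
}

// Same boundaries as platform_provisioning_duration_seconds.
const provisioningBuckets = [30, 60, 120, 300, 600, 900, 1800];
// Made-up provisioning times in seconds.
const observedSeconds = [45, 95, 280, 310, 2400];

const bucketCounts = cumulativeBucketCounts(provisioningBuckets, observedSeconds);
console.log(bucketCounts);
```

The 2400-second observation appears only in the +Inf bucket, so a quantile that lands there cannot be estimated more precisely than "over 30 minutes". If provisioning regularly takes longer than the largest boundary, add larger buckets.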

Common Anti-Patterns

Building an IDP is as much about avoiding pitfalls as it is about choosing the right tools. These are the most common mistakes platform teams make.

Building Too Much Too Soon

The most frequent failure mode is building an elaborate platform before understanding what developers actually need. Teams spend months building a sophisticated self-service portal, only to discover that developers needed better CI/CD pipelines first.

What to do instead: Start with the biggest pain point. If developers complain about slow deployments, fix deployments first. If they struggle to provision databases, start there. Use a "thin slice" approach - build a minimal solution for one use case, get feedback, iterate, then expand.

Phase 1 (Month 1-2):  Golden path for one service type
                       Basic CI/CD standardization
                       Service catalog in Backstage

Phase 2 (Month 3-4):  Self-service database provisioning
                       Automated environment creation
                       Observability integration

Phase 3 (Month 5-6):  Multi-cloud support
                       Security policy automation
                       Cost visibility

Phase 4 (Month 7+):   Advanced workflows
                       Custom developer tools
                       Platform analytics

Not Treating the Platform as a Product

Platform teams that operate like infrastructure teams - responding to tickets and building what they think developers need - consistently fail. The platform is a product, and developers are the customers.

Product practices that platform teams should adopt:

User research. Interview developers quarterly. Shadow them as they onboard new services. Understand their frustrations firsthand.
Feature prioritization. Use a framework (RICE, ICE, or similar) to prioritize platform features based on developer impact.
Feedback loops. Run monthly retrospectives with platform users. Track feature requests and bug reports in a public backlog.
Documentation. Maintain developer-facing documentation that explains how to use the platform, not how the platform works internally.
Onboarding experience. Measure and optimize the "time to hello world" for new developers joining the organization.
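
For feature prioritization, RICE scores each candidate as (Reach x Impact x Confidence) / Effort. The sketch below ranks a hypothetical backlog. The feature names and numbers are invented for illustration, and the units (developers per quarter, person-months) are one common convention, not a requirement.

```typescript
// Illustrative RICE scoring for a platform backlog.
interface FeatureCandidate {
  name: string;
  reach: number;      // developers affected per quarter
  impact: number;     // 0.25 (minimal) to 3 (massive)
  confidence: number; // 0 to 1
  effort: number;     // person-months, must be greater than 0
}

function riceScore(feature: FeatureCandidate): number {
  if (feature.effort <= 0) {
    throw new Error(`Effort must be positive for ${feature.name}`);
  }
  return (feature.reach * feature.impact * feature.confidence) / feature.effort;
}

// Returns a new array sorted from highest to lowest score.
function prioritize(features: FeatureCandidate[]): FeatureCandidate[] {
  return features.slice().sort((a, b) => riceScore(b) - riceScore(a));
}

// Hypothetical backlog.
const backlog: FeatureCandidate[] = [
  { name: 'Self-service databases', reach: 120, impact: 2, confidence: 0.8, effort: 3 },
  { name: 'Preview environments', reach: 200, impact: 1, confidence: 0.5, effort: 4 },
  { name: 'Faster CI cache', reach: 300, impact: 1, confidence: 0.9, effort: 1 },
];

for (const feature of prioritize(backlog)) {
  console.log(`${feature.name}: ${riceScore(feature).toFixed(1)}`);
}
```

In this example the small CI improvement outranks the more ambitious features because it reaches more developers for far less effort. That outcome matches the "thin slice" advice above.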

Mandating Instead of Attracting

Forcing developers to use your platform by blocking alternative approaches breeds resentment and shadow IT. Developers who feel forced will find workarounds.

The attraction model works better:

Make the golden path genuinely easier than the alternative
Provide escape hatches for teams that need custom solutions
Celebrate teams that adopt the platform early
Let adoption metrics speak for themselves

If developers are not adopting your platform voluntarily, the platform is not good enough yet. Mandates mask product problems.

Ignoring Developer Feedback

Every platform team has heard "we built it but nobody uses it." This happens when the platform is designed around what the platform team thinks is elegant rather than what developers actually need.

Concrete feedback mechanisms:

In-portal feedback widgets (thumbs up/down on every page)
Bi-weekly office hours where developers can ask questions and share frustrations
Anonymous surveys with specific, actionable questions
Instrumentation that shows which features are used and which are ignored
A public roadmap that developers can comment on and vote for features

Building Everything In-House

At the opposite extreme from buying everything is building everything yourself. Platform teams sometimes build custom solutions for problems that mature open-source tools already solve.

Decision framework:

Consideration	Build	Buy/Adopt
Core differentiator for your org	Yes	No
Mature OSS solution exists	No	Yes
Requires deep integration with internal systems	Yes	Maybe
Team has expertise to maintain it	Required	Less critical
Estimated build time exceeds 1 quarter	Reconsider	Likely better

Use Backstage instead of building a custom developer portal. Use Crossplane or Terraform instead of building a custom provisioning layer. Build the glue code and the developer experience layer on top of existing tools.

Getting Started

If you are standing up a platform team or improving an existing IDP, here is a practical starting sequence:

Audit current pain points. Survey 10-15 developers across different teams. Ask: "What takes longer than it should? What do you have to ask for help with?" Categorize responses by frequency and severity.
Pick one golden path. Choose the most common service type in your organization and build a complete golden path from repository creation to production deployment. Instrument it. Measure time-to-first-deploy before and after.
Stand up Backstage. Even a basic Backstage instance with a service catalog adds immediate value. Developers can discover services, find owners, and access documentation from one place.
Introduce one self-service capability. Whether it is database provisioning, environment creation, or secret management, pick the infrastructure request that generates the most support tickets and automate it.
Measure and iterate. Track the metrics described in this guide. Share them transparently with the organization. Let the numbers justify continued investment in the platform.

The organizations that succeed with platform engineering share one trait: they treat it as a long-term product investment, not a one-time infrastructure project. The platform is never done - it evolves with the needs of the developers it serves.