rutgerblom.com

Building the Foundation for a VCF Automation All Apps Landing Zone with Terraform

May 31, 2026
In a previous post I looked at guardrails in VCF Automation 9.1. This post is related, but the angle is different.

Instead of describing the boundaries conceptually, I want to look at how the first layers of a VCF Automation All Apps landing zone can start to take shape using Terraform.

I say first layers deliberately. I am not trying to claim that the entire All Apps consumption model can be represented as one clean Terraform module today. Some parts of the model may still be better handled through the UI, API, Kubernetes provider or other automation.

The foundation I am looking at here includes the organization, identity provider configuration, initial access, regional quota, organization networking, regional networking and content sources. Together, these are the pieces that start to give a tenant a boundary, capacity, network attachment and something useful to consume.

The rest of the landing zone builds on top of that.

What are we building?

The goal is not to create the most complete production-grade implementation. The goal is to show the sequence of objects that turn an empty All Apps organization into something that starts to look like a consumable landing zone.

At a high level, the flow looks like this:
Create the organization → Add an identity provider → Add initial organization access → Assign regional quota → Enable organization networking → Configure regional networking → Add optional shared subnet → Create content library → Create a namespace
This is not meant to describe the full platform build order. In a real environment, many provider-side constructs already exist before a tenant landing zone is created. The sequence here is simply the order I use to explain the tenant-facing foundation: start with the organization, attach identity and access, assign regional capacity, connect the organization to the existing network foundation, and then move toward content and namespace consumption.

I have put the lab files used for this article in a GitHub repository. The examples are not meant to be a production module, but they show the exact resources I used while testing the landing zone flow.

Provider configuration

The first part is the Terraform provider configuration.

In my lab examples I normally use simple variable-based provider configuration. This keeps credentials and endpoints outside the actual resource definitions.
versions.tf
terraform {
  required_providers {
    vcfa = {
      source  = "vmware/vcfa"
      version = "~> 1.1.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.38"
    }
  }
}
I use two vcfa provider configurations in the lab. The default provider uses the Provider Management context, mapped to the System organization. This is used for provider-side resources such as organizations, regional quota and organization networking.
provider.tf
provider "vcfa" {
  url                  = var.vcfa_url
  org                  = "System"
  auth_type            = "integrated"
  user                 = var.vcfa_user
  password             = var.vcfa_password
  allow_unverified_ssl = var.allow_unverified_ssl
}
The second provider configuration uses the tenant organization context. I use this later when the example moves into the tenant consumption path, specifically when retrieving the kubeconfig used for Supervisor Namespace creation.
provider.tf
provider "vcfa" {
  alias                = "tenant_blue"
  url                  = var.vcfa_url
  org                  = var.org_name
  auth_type            = "integrated"
  user                 = "tenant-admin"
  password             = var.tenant_admin_password
  allow_unverified_ssl = var.allow_unverified_ssl
}
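Both provider blocks rely on input variables that are not shown above. A minimal sketch of how they could be declared follows. The variable names come from the provider blocks, but the types, descriptions and the default are my assumptions for illustration and are not taken from the lab repository.

```hcl
# variables.tf (illustrative sketch; types, descriptions and defaults are assumptions)

variable "vcfa_url" {
  type        = string
  description = "Base URL of the VCF Automation instance"
}

variable "vcfa_user" {
  type        = string
  description = "Provider administrator in the System organization"
}

variable "vcfa_password" {
  type        = string
  description = "Password for the provider administrator"
  sensitive   = true # redacted in plan and apply output
}

variable "org_name" {
  type        = string
  description = "Name of the tenant organization"
}

variable "tenant_admin_password" {
  type        = string
  description = "Password for the bootstrap tenant-admin user"
  sensitive   = true
}

variable "allow_unverified_ssl" {
  type        = bool
  description = "Skip TLS verification; only sensible in a lab"
  default     = false
}
```

Marking the passwords as sensitive keeps them out of the CLI output. They are still written to the Terraform state, so the state itself needs to be protected.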
Creating the All Apps organization

The organization is the first concrete object in the landing zone foundation.

It provides the tenant boundary and gives the consumer its own organization context and portal. Other parts of the landing zone, such as identity configuration, networking scope, projects and content visibility, are then attached to this organization.
main.tf
resource "vcfa_org" "lz" {
  name         = var.org_name
  display_name = var.org_display_name
  description  = "Terraform lab organization for VCF Automation All Apps landing zone validation"
  is_enabled   = true
}
Adding an identity provider

An organization without an identity provider is not very useful, so the next step is to connect it to one.

In this example I use OIDC. The exact identity provider is not the important part. The important part is that the IdP configuration becomes part of the landing zone definition instead of something configured manually after the fact.

This assumes that the OIDC endpoint is reachable from VCF Automation and that the certificate chain presented by the identity provider is already trusted by VCF Automation.
oidc.tf
resource "vcfa_org_oidc" "lz" {
  org_id                 = vcfa_org.lz.id
  enabled                = var.oidc_enabled
  prefer_id_token        = false
  client_id              = var.oidc_client_id
  client_secret          = var.oidc_client_secret
  max_clock_skew_seconds = 60
  wellknown_endpoint     = var.oidc_wellknown_endpoint

  claims_mapping {
    email      = "email"
    subject    = "sub"
    first_name = "given_name"
    last_name  = "family_name"
    full_name  = "name"
    groups     = "groups"
  }
}
This only configures the organization’s OIDC IdP settings. It does not, by itself, give anyone access to the organization.
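Since the OIDC settings arrive as variables, a typo in the discovery URL only shows up at apply time. A small sketch of catching it earlier with a variable validation rule follows. The variable names match the resource above, while the https requirement and the error text are my own assumptions.

```hcl
# Illustrative sketch: validate OIDC inputs before they reach the provider.

variable "oidc_wellknown_endpoint" {
  type        = string
  description = "OIDC discovery URL of the identity provider"

  validation {
    # Assumption: the endpoint is served over https and uses the standard discovery path.
    condition     = can(regex("^https://.+/\\.well-known/openid-configuration$", var.oidc_wellknown_endpoint))
    error_message = "The endpoint must be an https URL ending in /.well-known/openid-configuration."
  }
}

variable "oidc_client_secret" {
  type        = string
  description = "Client secret issued by the identity provider"
  sensitive   = true
}
```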

Adding initial organization access

Configuring an identity provider only solves the authentication part. The organization still needs an access assignment before anyone can actually use it.

I did not find a first-class resource in the current VCFA provider for assigning IdP users or groups to organization roles. For this lab, I therefore create an initial local organization administrator as a bootstrap access path.
org-access.tf
data "vcfa_role" "org_admin" {
  org_id = vcfa_org.lz.id
  name   = "Organization Administrator"
}

resource "vcfa_org_local_user" "tenant_admin" {
  org_id   = vcfa_org.lz.id
  username = "tenant-admin"
  password = var.tenant_admin_password
  role_ids = [
    data.vcfa_role.org_admin.id
  ]
}
This is not the access model I would normally use in production. There I would prefer IdP groups and role assignments based on those groups. For the purpose of this lab, the local user simply gives the new organization an initial administrator without relying on the provider administrator account.

Assigning regional quota

Once the organization exists and identity is configured, it needs access to capacity.

This is where region quota becomes important. The region connects the organization to the underlying VCF capacity, while the quota defines how much of that capacity the organization is allowed to consume.
data.tf
data "vcfa_vcenter" "vc" {
  name = var.vcenter_name
}

data "vcfa_supervisor" "supervisor" {
  name       = var.supervisor_name
  vcenter_id = data.vcfa_vcenter.vc.id
}

data "vcfa_region" "region" {
  name = var.region_name
}

data "vcfa_region_zone" "zone" {
  region_id = data.vcfa_region.region.id
  name      = var.region_zone_name
}

data "vcfa_region_vm_class" "small" {
  region_id = data.vcfa_region.region.id
  name      = "best-effort-small"
}

data "vcfa_region_storage_policy" "vsan" {
  region_id = data.vcfa_region.region.id
  name      = var.storage_policy_name
}
main.tf
resource "vcfa_org_region_quota" "lz" {
  org_id         = vcfa_org.lz.id
  region_id      = data.vcfa_region.region.id
  supervisor_ids = [data.vcfa_supervisor.supervisor.id]

  zone_resource_allocations {
    region_zone_id         = data.vcfa_region_zone.zone.id
    cpu_limit_mhz          = 20000
    cpu_reservation_mhz    = 0
    memory_limit_mib       = 65536
    memory_reservation_mib = 0
  }

  # References use the data source labels declared in data.tf ("small" and "vsan").
  region_vm_class_ids = [
    data.vcfa_region_vm_class.small.id
  ]

  region_storage_policy {
    region_storage_policy_id = data.vcfa_region_storage_policy.vsan.id
    storage_limit_mib        = 524288
  }
}
This is where the landing zone starts to become bounded.

The organization can consume the region, but only within the limits defined here. That is an important distinction. Self-service without quota is just delegated risk. Self-service with quota becomes a controlled consumption model.
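The quota arguments take MHz and MiB, which makes the raw numbers hard to review. One option, sketched here with hypothetical local names of my own, is to express the limits in larger units and let Terraform do the conversion.

```hcl
# Illustrative sketch: the local names are hypothetical, the values match the quota above.
locals {
  quota_cpu_limit_mhz     = 20 * 1000  # 20 GHz  -> 20000 MHz
  quota_memory_limit_mib  = 64 * 1024  # 64 GiB  -> 65536 MiB
  quota_storage_limit_mib = 512 * 1024 # 512 GiB -> 524288 MiB
}

# These could then replace the literals in the quota resource, for example:
#   cpu_limit_mhz     = local.quota_cpu_limit_mhz
#   memory_limit_mib  = local.quota_memory_limit_mib
#   storage_limit_mib = local.quota_storage_limit_mib
```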

Enabling organization networking

Before regional networking can be configured through the Terraform provider, organization networking has to exist for the organization.

The VCFA provider exposes this as a separate vcfa_org_networking resource:
main.tf
resource "vcfa_org_networking" "lz" {
  org_id   = vcfa_org.lz.id
  log_name = "tnblue"
}
The log_name is a provider resource argument rather than something I would treat as a major landing zone design decision. In the VCF Automation 9.1 UI, this part of the model is mostly hidden behind the organization networking and external connection workflow. In Terraform, however, it is explicit and must exist before the regional networking resource can be configured.

Configuring regional networking

After organization networking is enabled, the next logical step is to connect the organization to regional networking.

In VCF Automation 9.1, the UI model is based on external connections. I did not find a first-class Terraform resource for creating external connections in the current VCFA provider. Instead, the provider still exposes the regional networking relationship through provider_gateway_id, which appears to map to the underlying centralized connectivity object rather than the 9.1 UI terminology.

That makes this part slightly awkward. The VCF Automation UI talks in terms of external connections, while the Terraform provider still wants a provider gateway reference.

For this example, I assume that the external connection already exists in VCF Automation. In other words, I am not creating the provider network foundation itself here. I am only showing how the organization is connected to an existing provider-side network construct through the current Terraform provider model.

The Terraform configuration looks like this:
data.tf
data "vcfa_provider_gateway" "default" {
  name      = var.provider_gateway_name
  region_id = data.vcfa_region.region.id
}

data "vcfa_edge_cluster" "default" {
  name      = var.edge_cluster_name
  region_id = data.vcfa_region.region.id
}
main.tf
resource "vcfa_org_regional_networking" "lz" {
  name      = "${var.org_name}${var.region_name}"
  org_id    = vcfa_org_networking.lz.id
  region_id = data.vcfa_region.region.id

  # References use the data source labels declared in data.tf ("default").
  provider_gateway_id = data.vcfa_provider_gateway.default.id
  edge_cluster_id     = data.vcfa_edge_cluster.default.id
}
In my lab, the organization, organization networking and regional quota were created successfully. The regional networking resource was the first part of the sequence that did not apply cleanly.

The referenced provider gateway is backed by an Active/Active Tier-0, and Terraform failed with the following error:
Unable to create Regional Networking Setting because the Provider Gateway eu-north-1-t0 backing Tier 0 has an unsupported HA mode, ACTIVE-ACTIVE.
The same regional networking configuration could be created through the VCF Automation 9.1 UI. So the practical result was:
Terraform create: failed
VCF Automation UI create: worked
After creating the regional networking setting in the UI, I wanted to see whether Terraform could at least import and read the object. The generated regional networking setting name in my lab was:
tenant-blueeu-north-1
I was then able to import the UI-created regional networking setting into Terraform state:
terraform import vcfa_org_regional_networking.tenant_blue \
  'tenant-blue.tenant-blueeu-north-1'
That import worked, and with the resource name aligned to the UI-generated name, Terraform could read the imported object cleanly.
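The same import can also be written declaratively with an import block, so it becomes part of the reviewed configuration instead of a one-off CLI step. This is a sketch that assumes Terraform 1.5 or later. The ID format of organization name, a dot, and the regional networking name is inferred from the command I used, not from provider documentation.

```hcl
# imports.tf (sketch; requires Terraform 1.5 or later)
import {
  # "to" must match the address of a resource block in the configuration.
  # The command above uses the label tenant_blue; adjust it if your block is labelled differently.
  to = vcfa_org_regional_networking.tenant_blue

  # Assumed ID format: <organization name>.<regional networking name>
  id = "tenant-blue.tenant-blueeu-north-1"
}
```

With the block in place, terraform plan shows the pending import before anything is written to state.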

VCF 9.1 supports Active/Active Tier-0 for this design, and the UI can create the configuration. The Terraform provider can also import and read the object after it exists. The part that failed in my lab was creating the regional networking setting directly through Terraform when the provider gateway was backed by an Active/Active Tier-0. I opened a GitHub issue for this against the VCFA Terraform provider so the behaviour can be tracked separately.

Optional: shared subnet

A shared subnet is a way to expose a VLAN-backed network to an organization. This can be useful when a landing zone needs access to an existing network segment, for example for legacy services, migration scenarios or shared infrastructure dependencies.

A shared subnet can be created through the Terraform provider, but creating the subnet is only part of the workflow.

In the VCF Automation UI, this is a two-step workflow. The provider first creates the shared subnet, and then explicitly shares or assigns it to an organization.

In my lab, the vcfa_shared_subnet resource handled the first part. It created the shared subnet successfully and the object reached REALIZED state. I did not find an organization assignment argument on the vcfa_shared_subnet resource, so I treat the assignment step as outside this Terraform example for now.
shared-subnet.tf
resource "vcfa_shared_subnet" "legacy_services" {
  name         = "legacy-services-vlan-123"
  description  = "Shared VLAN subnet for legacy service connectivity"
  region_id    = data.vcfa_region.region.id
  subnet_type  = "VLAN"
  gateway_cidr = "10.123.0.1/24"
  vlan_id      = 123
}
Because the assignment step is missing from the resource, I treat this as a provider-side subnet object in this example, not as a fully automated tenant-consumable landing-zone step.
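The gateway_cidr value combines the gateway address with the prefix length of the subnet. If the subnet is tracked elsewhere as a plain network CIDR, the value can be derived with built-in Terraform functions. This is a sketch under the assumption that the gateway is the first host address. The variable and local names are hypothetical.

```hcl
# Illustrative sketch: derive "10.123.0.1/24" from the network CIDR "10.123.0.0/24".
variable "legacy_subnet_cidr" {
  type        = string
  description = "Network CIDR of the VLAN-backed subnet (hypothetical variable)"
  default     = "10.123.0.0/24"
}

locals {
  # Prefix length taken from the part after the slash: "24"
  legacy_prefix_length = split("/", var.legacy_subnet_cidr)[1]

  # Assumption: the gateway is the first host address in the subnet: "10.123.0.1/24"
  legacy_gateway_cidr = "${cidrhost(var.legacy_subnet_cidr, 1)}/${local.legacy_prefix_length}"
}
```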

Adding a content library

A landing zone also needs something to consume.

The content library is one way to make approved content available to the organization. The storage class must be available to the organization through the regional quota, so I make the content library depend on the quota assignment.
content-library.tf
data "vcfa_storage_class" "content_storage" {
  region_id = data.vcfa_region.region.id
  name      = var.storage_policy_name
}

resource "vcfa_content_library" "tenant_blue" {
  org_id           = vcfa_org.lz.id
  name             = "${var.org_name}-library"
  description      = "${var.org_display_name} content library"
  auto_attach      = false
  delete_recursive = true

  storage_class_ids = [
    data.vcfa_storage_class.content_storage.id
  ]

  depends_on = [
    vcfa_org_region_quota.lz
  ]
}
This worked cleanly in my lab. At this point the organization has an associated content library, but that does not mean there is a complete service catalog yet. A content library can provide images or content sources that later become part of the consumption model, while catalog items, blueprints and published services are a separate layer.

Creating a namespace

At this point the landing zone starts to move from provider-side foundation to consumer-facing workload scope.

A namespace gives an application team a bounded place to consume Supervisor-backed capabilities. Depending on how the platform is designed, that can include Kubernetes workloads, VKS clusters, VM-based workloads, storage classes, VM classes, content sources and network connectivity.

This part uses the tenant-scoped provider configuration introduced earlier. The vcfa_kubeconfig data source retrieves the kubeconfig details, and the Kubernetes provider uses those details when creating the Supervisor Namespace.
namespace.tf
data "vcfa_kubeconfig" "org" {
  provider = vcfa.tenant_blue
}

provider "kubernetes" {
  host     = data.vcfa_kubeconfig.org.host
  token    = data.vcfa_kubeconfig.org.token
  insecure = data.vcfa_kubeconfig.org.insecure_skip_tls_verify
}
The namespace resource also assumes that the target project and VPC already exist. I did not find first-class project or VPC resources in the current Terraform provider. In my lab, I used the default project and the default regional VPC that existed in the organization.
variables.tf
variable "project_name" {
  type        = string
  description = "VCF Automation project used for the Supervisor Namespace"
  default     = "default-project"
}

variable "vpc_name" {
  type        = string
  description = "VCF Automation VPC used for the Supervisor Namespace"
  default     = null
}

locals {
  namespace_vpc_name = coalesce(var.vpc_name, "default-${var.region_name}")
}
With those prerequisites in place, the namespace resource looks like this:
namespace.tf
resource "vcfa_supervisor_namespace" "payments_dev" {
  provider     = vcfa.tenant_blue
  name_prefix  = "payments-dev"
  project_name = var.project_name
  class_name   = "small"
  description  = "Payments development namespace"
  region_name  = data.vcfa_region.region.name
  vpc_name     = local.namespace_vpc_name

  storage_classes_class_config_overrides {
    name  = var.storage_policy_name
    limit = "100Gi"
  }

  vm_classes_class_config_overrides {
    name = "best-effort-small"
  }

  zones_class_config_overrides {
    name               = var.region_zone_name
    cpu_limit          = "4000M"
    cpu_reservation    = "0M"
    memory_limit       = "8192Mi"
    memory_reservation = "0Mi"
  }
}
In the VCF Automation UI, the namespace normally inherits its resource settings from the selected Namespace Class unless they are overridden. In my lab, the Terraform resource still required explicit storage and zone configuration values before it would plan successfully, so I include them here.

What this gives us

If we look at the Terraform resources together, the landing zone starts to take shape.
vcfa_org → tenant boundary
vcfa_org_oidc → identity provider configuration
vcfa_org_local_user → initial bootstrap access
vcfa_org_region_quota → capacity envelope
vcfa_org_networking
vcfa_org_regional_networking → organization network foundation
vcfa_shared_subnet → optional provider-side shared subnet object
vcfa_content_library → content source for later consumption
vcfa_supervisor_namespace → consumer-facing workload scope
That is not the entire operating model, but it is a useful foundation.

It is useful because it shows where Terraform helps and where it stops. Some parts of the landing zone foundation can be expressed quite cleanly today. Other parts still depend on existing platform objects, UI workflows, imports or other automation.

What is still outside this example?

This example is incomplete. It does not cover the full project model, VPC creation, Transit Gateway configuration, firewall policy delegation, catalog item publishing, approvals, lease policies, naming standards, DNS automation, backup integration or CMDB updates. Those are all important parts of a real platform, but they are not all part of the same Terraform workflow today.

Some of these areas may already be possible through other APIs or automation methods. Some may become better covered by the Terraform provider over time. Others probably belong in a different part of the operating model altogether. I do not think that is a problem. A landing zone is not necessarily one module, one provider or one pipeline.

The good thing here is that a significant part of the foundation can still be expressed declaratively. That makes the model more repeatable, and it also makes it easier to review. The design becomes explicit. Which organization exists? Which region is assigned? What is the quota? Which network foundation is the organization connected to? Which content sources are available? Which namespace configuration is used?

These are the same things I would want to be clear about in the design anyway. Terraform just makes the answers explicit.

Final thought

I still like the landing zone term for VCF Automation All Apps, as long as we do not treat it as a single product feature. To me, a landing zone is the assembled consumption environment: organizations, identity providers, access, quota, networking, projects, VPCs, namespaces, content sources and policies working together.

Testing this with Terraform made the current boundary pretty clear. The provider can describe parts of the foundation quite well, but it is not yet a complete representation of the VCF Automation 9.1 All Apps model. Some constructs are missing, some still use older terminology, and some workflows still need the UI, import, API calls or another automation path.

That is fine, as long as we are honest about it. Terraform can still help make parts of the landing zone repeatable, versioned and testable. It is not the whole answer today, but it is part of the answer.

Using Keycloak as an OIDC Identity Provider for a VCF Automation Organization

May 22, 2026

I’ve been spending some more time with VCF Automation 9.1 in the lab lately.

One thing I wanted to test was organization-level OIDC authentication. Not global VCF SSO this time, but OIDC directly for a VCF Automation organization.

Since I already have Keycloak running as part of my Provider Box project, it made sense to use that as the identity provider.

Provider Box is my small bootstrap stack for VCF labs. It provides DNS, NTP, PKI, identity, IPAM, object storage and SFTP from a single node. The goal is not to build a full enterprise platform around the lab, but to have enough supporting services to make the environment feel complete and repeatable.

This post walks through what finally worked for me when using Keycloak as an OIDC identity provider for a VCF Automation organization.

Lab setup

The lab setup is pretty simple:

VCF Automation: 9.1
VCF Automation organization: tenant-blue
VCF Automation URL: https://pod-240-auto.sddc.lab

Keycloak URL: https://auth.sddc.lab:8443
Keycloak realm: provider-box
Keycloak client: vcfa-tenant-blue
Keycloak user: tenant-admin
Keycloak group: vcf-tenant-blue-admins

OIDC group claim: groups
VCF Automation imported group: vcf-tenant-blue-admins

The goal was simple: authenticate the user with Keycloak and authorize access in VCF Automation based on group membership.

Keycloak realm

The first thing I needed was a Keycloak realm to represent the identity provider used by the VCF Automation organization.

In my lab I used the realm:

provider-box

The realm was created by the Provider Box setup and is exposed through the Keycloak instance at:

https://auth.sddc.lab:8443

The realm-specific OIDC well-known endpoint is therefore:

https://auth.sddc.lab:8443/realms/provider-box/.well-known/openid-configuration

This endpoint is important later because VCF Automation can use it for autodiscovery. Instead of manually entering all OIDC endpoint URLs, I can point VCF Automation to the well-known configuration endpoint and let it populate the authorization endpoint, token endpoint, issuer and logout endpoint automatically.
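Keycloak builds the discovery URL from the base URL and the realm name. A small sketch of composing it in Terraform follows, so the value can be passed wherever the well-known endpoint is needed. The local names are hypothetical; the values are the ones from my lab.

```hcl
# Illustrative sketch: compose the Keycloak OIDC URLs from base URL and realm.
locals {
  keycloak_base_url = "https://auth.sddc.lab:8443"
  keycloak_realm    = "provider-box"

  # https://auth.sddc.lab:8443/realms/provider-box
  keycloak_issuer = "${local.keycloak_base_url}/realms/${local.keycloak_realm}"

  # https://auth.sddc.lab:8443/realms/provider-box/.well-known/openid-configuration
  oidc_wellknown_endpoint = "${local.keycloak_issuer}/.well-known/openid-configuration"
}
```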

Nothing special was required in the realm itself for this setup. The important configuration was done on the Keycloak client, the user/group objects and the client scopes.

Keycloak client

Next I created an OpenID Connect client in Keycloak for the VCF Automation organization.

In my lab the client was called:

vcfa-tenant-blue

I enabled Client authentication, since VCF Automation authenticates to Keycloak using the client ID and client secret. I also kept Standard flow enabled, since VCF Automation uses the browser-based authorization code flow.

Redirect URI

The redirect URI comes from the VCF Automation organization OIDC configuration screen under:

Administer → Connections → Identity Providers → OIDC

Here VCF Automation shows the Client Configuration Redirect URI. In my lab this was:

https://pod-240-auto.sddc.lab/login/oauth?service=tenant:tenant-blue

I added this value to the Keycloak client under Valid redirect URIs.

So the flow is slightly circular:

Open the VCFA organization OIDC page.
Copy the Client Configuration Redirect URI.
Add it to the Keycloak client.
Finish the Keycloak client configuration.
Return to VCFA and complete the OIDC setup.

This is easy to miss because most of the IdP-side configuration is done in Keycloak, but the correct redirect URI is generated by VCF Automation for the specific organization.

I have also published a sanitized example of the Keycloak client configuration as a GitHub Gist:

{
  "clientId": "vcfa-tenant-blue",
  "clientAuthenticatorType": "client-secret",
  "secret": "<replace-with-generated-client-secret>",
  "redirectUris": [
    "https://<vcf-automation-fqdn>/login/oauth?service=tenant:<organization-name>"
  ],
  "publicClient": false,
  "standardFlowEnabled": true,
  "protocol": "openid-connect",
  "protocolMappers": [
    {
      "name": "groups",
      "protocol": "openid-connect",
      "protocolMapper": "oidc-group-membership-mapper",
      "config": {
        "full.path": "false",
        "id.token.claim": "true",
        "access.token.claim": "true",
        "userinfo.token.claim": "true",
        "claim.name": "groups"
      }
    }
  ]
}


Treat this as a reference rather than a file to import directly. The important parts are the confidential OIDC client configuration, the VCF Automation-generated redirect URI, the standard authorization code flow, and the groups protocol mapper used to include Keycloak group membership in the token.
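I configured the client in the Keycloak UI, but the same settings could probably be expressed with the community Keycloak Terraform provider. The following is an untested sketch, not what I ran in the lab. It assumes the keycloak/keycloak provider with its keycloak_openid_client and keycloak_openid_group_membership_protocol_mapper resources, and it leaves out the provider configuration (URL and credentials).

```hcl
# Illustrative sketch: assumes the keycloak/keycloak Terraform provider (not used in this lab).
terraform {
  required_providers {
    keycloak = {
      source = "keycloak/keycloak"
    }
  }
}

# The realm already exists; it is created by the Provider Box setup.
data "keycloak_realm" "provider_box" {
  realm = "provider-box"
}

# Confidential client using the browser-based authorization code flow.
resource "keycloak_openid_client" "vcfa_tenant_blue" {
  realm_id  = data.keycloak_realm.provider_box.id
  client_id = "vcfa-tenant-blue"
  enabled   = true

  access_type           = "CONFIDENTIAL"
  standard_flow_enabled = true

  # Redirect URI generated by VCF Automation for the organization.
  valid_redirect_uris = [
    "https://pod-240-auto.sddc.lab/login/oauth?service=tenant:tenant-blue"
  ]
}

# Adds group membership to the tokens as a "groups" claim without the leading path.
resource "keycloak_openid_group_membership_protocol_mapper" "groups" {
  realm_id   = data.keycloak_realm.provider_box.id
  client_id  = keycloak_openid_client.vcfa_tenant_blue.id
  name       = "groups"
  claim_name = "groups"
  full_path  = false
}
```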

Keycloak user and group

For the lab I created a dedicated Keycloak user for the VCF Automation organization.

The user was:

tenant-admin

In Keycloak, the only strictly required attribute when creating the user is the username. However, I recommend adding email, first name and last name immediately. Otherwise Keycloak may ask the user to update the account information during the first login.

In my lab the user had:

Username: tenant-admin
Email: tenant-admin@sddc.lab
Email verified: On
First name: Tenant
Last name: Admin

I also set a password for the user and set Temporary to Off. That means the user is not forced to change the password at first login.

I then created a Keycloak group for the VCF Automation organization administrators:

vcf-tenant-blue-admins

The tenant-admin user was added as a member of that group.

This group becomes important later when VCF Automation maps the authenticated user to an organization role.

In my lab the intended mapping was:

vcf-tenant-blue-admins → Organization Administrator

So the model is:

Keycloak user → Keycloak group → VCF Automation organization role
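The user, the group and the membership could be sketched in the same way. Again this is untested and assumes the keycloak/keycloak provider with its keycloak_user, keycloak_group and keycloak_user_groups resources. The password variable is a hypothetical input.

```hcl
# Illustrative sketch: assumes the keycloak/keycloak Terraform provider (not used in this lab).
variable "keycloak_tenant_admin_password" {
  type        = string
  description = "Initial password for tenant-admin (hypothetical variable)"
  sensitive   = true
}

data "keycloak_realm" "lab" {
  realm = "provider-box"
}

resource "keycloak_group" "tenant_blue_admins" {
  realm_id = data.keycloak_realm.lab.id
  name     = "vcf-tenant-blue-admins"
}

resource "keycloak_user" "tenant_admin" {
  realm_id       = data.keycloak_realm.lab.id
  username       = "tenant-admin"
  enabled        = true
  email          = "tenant-admin@sddc.lab"
  email_verified = true
  first_name     = "Tenant"
  last_name      = "Admin"

  initial_password {
    value     = var.keycloak_tenant_admin_password
    temporary = false # no forced password change at first login
  }
}

# Keycloak user -> Keycloak group
resource "keycloak_user_groups" "tenant_admin" {
  realm_id  = data.keycloak_realm.lab.id
  user_id   = keycloak_user.tenant_admin.id
  group_ids = [keycloak_group.tenant_blue_admins.id]
}
```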

Keycloak group mapper

The next part was one of the things that made the difference in my setup.

VCF Automation needs group information from the identity provider so it can map the authenticated user to an organization role. In Keycloak, I exposed that information by adding a group membership mapper to the client’s dedicated scope.

I opened it from the client configuration:

Clients → vcfa-tenant-blue → Client scopes → vcfa-tenant-blue-dedicated

From there, I selected:

Configure a new mapper

The mapper type was:

Group Membership

I named the mapper:

groups

The mapper name is mostly for readability in Keycloak, but I used groups to match what the mapper is doing.

The important setting is the token claim name. I set that to:

groups

This is the claim name that VCF Automation later uses in the claims mapping:

Groups → groups

I also changed Full group path to Off.

This is important. With full group path enabled, Keycloak returned the group as:

/vcf-tenant-blue-admins

With full group path disabled, Keycloak returned the cleaner group name:

vcf-tenant-blue-admins

That is the value I wanted to use later when importing the group in VCF Automation.

For my test user, the relevant Keycloak group was:

vcf-tenant-blue-admins

This group is returned in the OIDC groups claim and is later mapped to an organization role in VCF Automation.

In my lab the intended mapping was:

vcf-tenant-blue-admins → Organization Administrator

VCF Automation OIDC general settings

With the Keycloak side prepared, I moved back to the VCF Automation organization.

The OIDC configuration is done inside the tenant organization. In my lab that organization was:

tenant-blue

In VCF Automation I went to:

Administer → Connections → Identity Providers → OIDC

From there I selected:

Configure

In the first step of the wizard, I configured the general OIDC settings.

Setting	Value
Status	`Active`
Client ID	`vcfa-tenant-blue`
Client Secret	`<client secret from Keycloak>`
Configuration Discovery	`Enabled`
IDP Well-known Configuration Endpoint	`https://auth.sddc.lab:8443/realms/provider-box/.well-known/openid-configuration`

The client ID must match the Keycloak client created earlier.

The client secret comes from the Keycloak client credentials.

I enabled configuration discovery so VCF Automation could read the OIDC metadata directly from Keycloak.

After entering the well-known endpoint, VCF Automation automatically populated the authorization endpoint, token endpoint, issuer ID and logout endpoint in the next step.

VCF Automation OIDC endpoints

The second step in the wizard did not require any input in my setup.

Because configuration discovery was enabled in the previous step, VCF Automation automatically populated the endpoint values from the Keycloak well-known configuration endpoint.

In my lab the discovered values were:

Field	Value
User Authorization Endpoint	`https://auth.sddc.lab:8443/realms/provider-box/protocol/openid-connect/auth`
Access Token Endpoint	`https://auth.sddc.lab:8443/realms/provider-box/protocol/openid-connect/token`
Issuer ID	`https://auth.sddc.lab:8443/realms/provider-box`
End Session Endpoint	`https://auth.sddc.lab:8443/realms/provider-box/protocol/openid-connect/logout`

I did not change these values manually.
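The discovered values can also be checked outside the wizard. A sketch using the hashicorp/http data source to read the discovery document and output the same endpoints follows. It assumes the machine running Terraform can reach Keycloak and trusts the lab CA. The data source and output names are hypothetical.

```hcl
# Illustrative sketch: read the OIDC discovery document and expose the endpoints.
terraform {
  required_providers {
    http = {
      source  = "hashicorp/http"
      version = "~> 3.4"
    }
  }
}

data "http" "oidc_discovery" {
  url = "https://auth.sddc.lab:8443/realms/provider-box/.well-known/openid-configuration"

  request_headers = {
    Accept = "application/json"
  }
}

locals {
  oidc_metadata = jsondecode(data.http.oidc_discovery.response_body)
}

output "oidc_endpoints" {
  value = {
    issuer                 = local.oidc_metadata.issuer
    authorization_endpoint = local.oidc_metadata.authorization_endpoint
    token_endpoint         = local.oidc_metadata.token_endpoint
    end_session_endpoint   = local.oidc_metadata.end_session_endpoint
    jwks_uri               = local.oidc_metadata.jwks_uri
  }
}
```

If the output differs from what the wizard shows, the well-known URL or the realm name is the first thing I would check.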

VCF Automation OIDC scopes

The third step in the wizard defines the scopes requested by VCF Automation during login.

I kept the standard OIDC scopes:

openid
email
profile

These scopes are enough for the basic OIDC login and user profile information.

The group information used for authorization was added on the Keycloak side by configuring a group membership mapper on the client’s dedicated scope. I did not need to add a separate groups scope in this step.

VCF Automation OIDC claims mapping

The next step is claims mapping.

Most of the mappings were populated automatically after VCF Automation read the Keycloak well-known configuration.

I left the standard user mappings unchanged, but added one important mapping manually:

Groups → groups

This was easy to miss. Keycloak authentication worked, and Keycloak showed successful LOGIN and CODE_TO_TOKEN events, but VCF Automation still rejected the login.

The reason was that VCF Automation did not know which claim to use for group membership. After adding the Groups → groups mapping, VCF Automation could use the group returned by Keycloak and match it to the imported organization group.

The final claims mapping was:

VCF Automation attribute	OIDC claim	Comment
Subject	`sub`	Auto-populated
Email	`email`	Auto-populated
Full Name	`name`	Auto-populated
First Name	`given_name`	Auto-populated
Last Name	`family_name`	Auto-populated
Groups	`groups`	Added manually
Name in Source	`sub`	Auto-populated

VCF Automation populated Name in Source with sub. I left that value unchanged, since the login and group-based authorization worked with this mapping.

VCF Automation OIDC key configuration

The key configuration was also populated automatically from the Keycloak discovery information.

Since the well-known configuration endpoint includes the Keycloak JWKS endpoint, VCF Automation was able to discover the signing keys without me uploading a PEM file manually.

In the wizard, the discovered keys appeared automatically in the key list.

I left Automatic Key Refresh disabled in this setup, since the required keys were already discovered during configuration.

The important part is that VCF Automation can validate the tokens signed by Keycloak. With configuration discovery enabled, this was handled automatically in my lab.

VCF Automation OIDC button label

The last step in the wizard controls the label shown on the VCF Automation login screen.

The default label is:

Log in with OIDC

I changed it to:

Log in with Provider Box

This is optional, but it makes the login screen a bit clearer in the lab since Provider Box is where Keycloak is running.

VCF Automation group-to-role mapping

After the OIDC provider was configured, the next step was to give the external group access inside the VCF Automation organization.

The Keycloak user was a member of this group:

vcf-tenant-blue-admins

That group is returned by Keycloak in the OIDC groups claim.

In VCF Automation I went to:

Administer → Access Control → Groups

From there I selected:

Import Groups

I imported the OIDC group:

vcf-tenant-blue-admins

and assigned it the role:

Organization Administrator

So the final authorization model became:

Keycloak user → Keycloak group → OIDC groups claim → imported VCF Automation group → organization role

In this lab:

tenant-admin → vcf-tenant-blue-admins → Organization Administrator

This is the part that controls what the user can actually do after login. The OIDC configuration handles authentication, but the imported group and role assignment handle authorization inside the VCF Automation organization.
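
The chain above can be sketched in a few lines. This is only a model of the idea, with a hypothetical dictionary standing in for the groups imported into the organization; it is not how VCF Automation evaluates access internally.

```python
# Hypothetical stand-in for the groups imported into the organization.
IMPORTED_GROUPS = {"vcf-tenant-blue-admins": "Organization Administrator"}


def resolve_roles(token_claims: dict, imported_groups: dict = IMPORTED_GROUPS) -> set:
    """Return the organization roles a user gets through the groups claim."""
    groups = token_claims.get("groups") or []
    return {imported_groups[name] for name in groups if name in imported_groups}


with_groups = {"sub": "tenant-admin", "groups": ["vcf-tenant-blue-admins"]}
without_groups = {"sub": "tenant-admin"}

print(resolve_roles(with_groups))     # {'Organization Administrator'}
print(resolve_roles(without_groups))  # set()
```

The second call shows the situation where authentication succeeds but no role is resolved, because the token carries no usable groups claim.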

Testing the login

After the OIDC provider and group mapping were in place, I tested the login from the VCF Automation organization login page.

Clicking the login button redirected me to Keycloak, where I logged in as:

tenant-admin

After successful authentication, Keycloak redirected the browser back to VCF Automation.

At this point there are two things that need to work.

First, the OIDC flow itself needs to complete. In Keycloak, I could see successful events for the login and token exchange:

LOGIN
CODE_TO_TOKEN

Second, VCF Automation needs to authorize the user inside the organization. That part depends on the group claim and the imported group mapping.

In my setup, the user was a member of this Keycloak group:

vcf-tenant-blue-admins

That group was returned in the OIDC groups claim, imported into VCF Automation, and mapped to the Organization Administrator role.

Once both parts were in place, the user could log in to the VCF Automation organization successfully.

Closing thoughts

After this, the setup worked as expected.

Keycloak handled the authentication, VCF Automation consumed the OIDC claims, and access inside the organization was controlled through the imported group and role assignment.

The two things that mattered most were the group mapper in Keycloak and the claims mapping in VCF Automation.

On the Keycloak side, the group membership mapper needed to return a clean groups claim:

vcf-tenant-blue-admins

On the VCF Automation side, I had to add the mapping:

Groups → groups

Without that mapping, the OIDC login and token exchange still succeeded, but VCF Automation could not use the Keycloak group membership for authorization.

This is also a good example of why I like having Provider Box in the lab. It gives me a small and repeatable identity provider close to the rest of the VCF supporting services, without having to depend on an external IdP just to test organization-level authentication.

Not a complex setup in the end, but there were a few details that were easy to miss.

Guardrails in VCF Automation 9.1

May 20, 2026
When we talk about private cloud, we often end up talking about automation.

That makes sense. Nobody wants to build a cloud platform where every request still ends up as a ticket, a manual change, and a meeting. Self-service is part of the point.

But self-service without guardrails is not really cloud. It is delegated infrastructure access with a nicer user interface.

That is why I think this is one of the more interesting parts of VMware Cloud Foundation 9.1 and VCF Automation. Users can request virtual machines, namespaces, VKS clusters, networks, and other services, but the platform team can also define the boundaries around that consumption up front.
- Who can consume what?
- Where can workloads land?
- How much can a tenant consume?
- Which networks can they attach to?
- When should temporary workloads disappear again?
Those boundaries are what I mean by guardrails in this context.

In this article I am focusing on the All Apps organization model in VCF Automation. That is the model built around vSphere Supervisor, regions, region quotas, projects, namespaces, namespace classes, VPCs, and the IaaS services used to consume VMs and Kubernetes-based services. I am not trying to cover the VM Apps organization model or the older Aria Automation-style consumption model here.

Guardrails are not a single feature

It is tempting to look for a specific feature in VCF Automation called “guardrails”. I do not think that is the right way to look at it.

The guardrails are really the result of how the platform model is put together. Organizations, regions, quotas, projects, namespaces, namespace classes, VPCs, catalog items, policies, identity, and extensibility all contribute in different ways.

Some of this becomes hard limits. Some of it becomes access control. Some of it becomes lifecycle management or placement control. And some of it is simply about giving users sensible defaults instead of making every request a small design exercise.

That last part is easy to underestimate. A shared platform should not rely on every consumer understanding the full architecture. It should encode enough of the architecture so that normal users can consume services without constantly stepping outside the intended design.

Guardrails in the VCF Automation 9.1 All Apps model, shown as layered platform controls.

The organization is the tenant boundary

At the top of the All Apps consumption model we have the organization.

An organization can represent a tenant, a customer, a department, a line of business, or some other administrative boundary. Exactly what it represents depends on the environment.

In an enterprise, I would not automatically create an organization per application team. That would probably create too much overhead. I would rather use organizations for larger boundaries where identity, quota, catalog, networking, and governance need to be separated.

In a service provider or internal platform provider model, the mapping is more obvious. One organization can represent one customer, agency, department, or tenant.

The key point is that the organization becomes the first real consumption boundary. It groups users, policies, resources, services, catalog entities, and access control.

This is also where VCF 9 starts to feel less like infrastructure with automation added on the side, and more like an actual platform model. The provider is not just handing consumers access to vCenter and NSX objects, but exposing more controlled abstractions through VCF Automation.

Regions and quotas define what can be consumed

A region in VCF Automation represents a collection of infrastructure resources. In practice, it maps to one or more Supervisor-enabled clusters and provides compute, memory, storage, and networking capacity.

The region by itself does not give an organization access to anything. That access is controlled through a region quota. The quota defines how much of the region an organization can consume, but also which capabilities are available. CPU and memory limits, reservations, VM classes, storage classes, storage limits, and zones are all part of the model.

This is also why I see region quota as one of the defining constructs of the All Apps model. It is the mechanism that connects provider-managed infrastructure to organization-level consumption.

I think this is one of the more important guardrails in the design because it changes the nature of self-service. Without quota, self-service depends too much on trust and informal agreements. With quota, it becomes an allocation model. The provider can give an organization access to capacity without giving it unlimited access to the platform.

It also makes the conversation with the business or application teams more concrete. Instead of saying “please don’t use too much”, the platform team can say “this is the capacity envelope assigned to you”. For a shared private cloud platform, that is a much healthier starting point.
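
To illustrate the idea of a capacity envelope, here is a small sketch that models a region quota and checks a request against it. The field names, units and example values are assumptions made for the example and do not reflect the actual VCF Automation data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class RegionQuota:
    """Simplified, hypothetical region quota for one organization."""
    cpu_limit_ghz: float
    memory_limit_gib: int
    storage_limit_gib: int
    vm_classes: set = field(default_factory=set)
    storage_classes: set = field(default_factory=set)
    zones: set = field(default_factory=set)


def quota_violations(quota: RegionQuota, request: dict) -> list:
    """Return the reasons a request falls outside the quota (empty if it fits)."""
    problems = []
    if request["cpu_ghz"] > quota.cpu_limit_ghz:
        problems.append("cpu limit exceeded")
    if request["memory_gib"] > quota.memory_limit_gib:
        problems.append("memory limit exceeded")
    if request["storage_gib"] > quota.storage_limit_gib:
        problems.append("storage limit exceeded")
    if request["vm_class"] not in quota.vm_classes:
        problems.append("vm class not offered")
    if request["storage_class"] not in quota.storage_classes:
        problems.append("storage class not offered")
    if request["zone"] not in quota.zones:
        problems.append("zone not offered")
    return problems


# Made-up example values.
blue_quota = RegionQuota(
    cpu_limit_ghz=100.0,
    memory_limit_gib=512,
    storage_limit_gib=4096,
    vm_classes={"small", "medium"},
    storage_classes={"standard"},
    zones={"zone-a"},
)

request = {
    "cpu_ghz": 8.0, "memory_gib": 1024, "storage_gib": 200,
    "vm_class": "gpu-large", "storage_class": "standard", "zone": "zone-a",
}
print(quota_violations(blue_quota, request))
# ['memory limit exceeded', 'vm class not offered']
```

The point of the sketch is that the quota covers both how much can be consumed and which capabilities are offered at all.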

Projects are where teams start to consume

Inside an organization, projects are closer to the application team or workload team boundary. Users and groups are added to projects, catalog items can be shared with projects, and resources are provisioned into namespaces that belong to those projects.

I like this model because it avoids making organizations too granular. The organization can remain the tenant or major administrative boundary, while projects can represent application teams, environments, or delivery groups.

For example, one organization might represent a department, with multiple projects for application teams, separate namespaces for application environments, and different policies depending on whether the project is used for production, test, or development.

That gives the platform team a useful hierarchy without exposing every underlying infrastructure detail.

Namespaces are the application landing zone

Namespaces are where the consumption model starts to become practical for application teams.

A namespace gives a team a defined place to deploy into. Compute, memory, storage, and networking are assigned to that namespace, and VMs, VKS clusters, volumes, and other services can then be provisioned inside that boundary.

That is a cleaner model than giving every team direct access to the underlying infrastructure and expecting them to interpret the platform design correctly. The application team gets a place to land, while the platform team still controls the size of that place, which storage classes can be used, which VM classes are available, and which networks are attached.

In this context, a namespace is not just a Kubernetes construct. It becomes a cloud consumption boundary for an application or environment.

Namespace classes make the model repeatable

If every namespace is created manually with custom settings, the model will eventually drift. Namespace classes help avoid that by turning common namespace patterns into reusable templates.

A namespace class can define CPU limits, memory limits, reservations, storage limits, VM classes, storage classes, content libraries, and zone placement. This is the kind of construct I like in a platform design because it makes the intended consumption model repeatable without forcing every application team to understand all the underlying design choices.

Instead of asking each team how much CPU, memory, storage, and which capabilities they need from scratch, the platform team can define a small number of sensible options.

For example, the platform team might define classes such as dev-small, test-medium, prod-medium, prod-large, gpu-enabled, or restricted.

The exact names are not important. What matters is giving users enough choice to be useful, without turning every namespace request into a small design exercise.
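
A minimal sketch of that idea: a namespace class is a named template, and a namespace request only picks a name and a class. The class names come from the examples above, while the limits and VM class names are invented for illustration.

```python
import copy

# Hypothetical namespace classes; all values are invented for the example.
NAMESPACE_CLASSES = {
    "dev-small": {
        "cpu_limit_ghz": 10,
        "memory_limit_gib": 32,
        "storage_limit_gib": 200,
        "vm_classes": ["small"],
    },
    "prod-large": {
        "cpu_limit_ghz": 80,
        "memory_limit_gib": 512,
        "storage_limit_gib": 4000,
        "vm_classes": ["small", "medium", "large"],
    },
}


def new_namespace(name: str, class_name: str, classes: dict = NAMESPACE_CLASSES) -> dict:
    """Build a namespace definition from a class instead of ad hoc settings."""
    if class_name not in classes:
        raise ValueError(f"unknown namespace class: {class_name}")
    # Deep copy so a namespace can never modify the shared template.
    settings = copy.deepcopy(classes[class_name])
    return {"name": name, "class": class_name, **settings}


namespace = new_namespace("payments-dev", "dev-small")
print(namespace["memory_limit_gib"])  # 32
```

A request for a class that does not exist fails immediately, which is the behavior you want if the goal is to avoid drift.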

Networking guardrails are critical

Networking is one of the areas where self-service can quickly become problematic. It is useful to let application teams consume network services, but it is usually not a good idea to expose the full networking design to every consumer.

In VCF Automation, the networking model uses constructs such as VPCs, connectivity profiles, transit gateways, IP blocks, external connections, NAT, VPN, and load balancing. The provider defines the available building blocks, while organization and project users consume them through controlled abstractions.

IP consumption is part of that model as well. External IP blocks are not just pools of addresses that organizations can freely consume. The provider can apply IP quotas to control how many individual IP addresses or CIDRs an organization can allocate, and can also restrict the largest subnet size that can be requested. That matters because IP address space is often one of the easiest shared resources to waste if it is not governed.

This fits well with how I think NSX should be used in a modern VCF platform. NSX should provide the network and security foundation, but most consumers should not need to work directly with every NSX object underneath. They need a network model they can consume safely, not unrestricted access to the implementation details.

The platform team still owns the routing model, external connectivity, IP allocation boundaries, edge capacity, and security architecture. VCF Automation helps expose those capabilities at the right level.
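
The IP quota idea described above can be sketched with Python's `ipaddress` module. The quota fields, the function name and the example ranges are assumptions for illustration; the real controls are configured in VCF Automation and NSX, not in code like this.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class IpQuota:
    """Hypothetical IP quota for one organization on one external IP block."""
    max_cidrs: int          # how many CIDRs the organization may allocate
    min_prefix_length: int  # 26 means nothing larger than a /26 can be requested


def cidr_request_problems(block: str, quota: IpQuota, allocated: list, requested: str) -> list:
    """Return the reasons a CIDR request is not allowed (empty if it is)."""
    block_net = ipaddress.ip_network(block)
    req_net = ipaddress.ip_network(requested)
    problems = []
    if not req_net.subnet_of(block_net):
        problems.append("outside the external IP block")
    if req_net.prefixlen < quota.min_prefix_length:
        problems.append("subnet larger than allowed")
    if len(allocated) >= quota.max_cidrs:
        problems.append("CIDR quota exhausted")
    if any(req_net.overlaps(ipaddress.ip_network(cidr)) for cidr in allocated):
        problems.append("overlaps an existing allocation")
    return problems


quota = IpQuota(max_cidrs=2, min_prefix_length=26)
allocated = ["203.0.113.0/27"]

print(cidr_request_problems("203.0.113.0/24", quota, allocated, "203.0.113.64/26"))
# []
print(cidr_request_problems("203.0.113.0/24", quota, allocated, "203.0.113.0/25"))
# ['subnet larger than allowed', 'overlaps an existing allocation']
```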

Catalog items reduce unnecessary freedom

There is a difference between self-service and free-for-all. A catalog item gives users a controlled way to request something useful, whether that is a VM, a namespace, a VKS cluster, an application blueprint, or a workflow.

The value is not only in the request itself, but in the design work that happens before the item is published. The platform team or organization administrator can decide what a good request should look like, which inputs should be exposed, which defaults should be used, and which options should be hidden.

This is also why I think custom forms are more important than they might look at first. They are not just there to make the request page nicer. They can hide complexity, validate input, simplify choices, and reduce the number of ways users can get something wrong.

A good catalog item should be easy, almost boring, to consume. The design work should happen before the item is published, not every time someone requests it.

Policies are the visible guardrails

When people think about governance in automation platforms, they often think about policies first. That makes sense, because approval policies, lease policies, day-2 action policies, and IaaS resource policies are the most explicit governance constructs in VCF Automation.

Each of them solves a different problem. Approval policies can require human approval before a deployment or day-2 action continues. Lease policies help prevent temporary workloads from living forever. Day-2 action policies control what users are allowed to do after something has been deployed. IaaS resource policies can enforce more technical constraints around what resources are allowed inside namespaces.

They are all useful, but I would not start with approval policies. If everything requires approval, the platform probably has not encoded enough of the design into hard guardrails. At that point, the approval process becomes a substitute for platform design, and that is not where I would want to end up.

I would rather use quotas, namespace classes, project scoping, RBAC, catalog design, and IaaS resource policies to prevent bad outcomes automatically. Then use approvals for exceptions, high-cost requests, risky actions, or production-impacting changes.
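
The ordering I have in mind can be expressed as a small decision function: hard guardrails are evaluated first and reject automatically, and approval is only requested for the exceptions. The request fields, the cost threshold and the returned labels are hypothetical and only serve to show the order of evaluation.

```python
def decide(request: dict, remaining_cpu_ghz: float, cost_threshold: float = 1000.0) -> str:
    """Classify a request as rejected, needing approval, or auto-approved."""
    # Hard guardrail: the platform rejects what does not fit, no human involved.
    if request["cpu_ghz"] > remaining_cpu_ghz:
        return "rejected"
    # Approval is reserved for exceptions: production changes or high cost.
    if request["environment"] == "production" or request["monthly_cost"] > cost_threshold:
        return "needs approval"
    # Everything else is the normal path and should simply go through.
    return "auto-approved"


print(decide({"cpu_ghz": 4, "environment": "dev", "monthly_cost": 50}, remaining_cpu_ghz=10))
# auto-approved
print(decide({"cpu_ghz": 4, "environment": "production", "monthly_cost": 50}, remaining_cpu_ghz=10))
# needs approval
print(decide({"cpu_ghz": 40, "environment": "dev", "monthly_cost": 50}, remaining_cpu_ghz=10))
# rejected
```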

Day-2 matters just as much as day-1

Provisioning usually gets most of the attention. That is understandable. The first thing people want to see from an automation platform is often whether it can deploy something at all.

But the operating model does not stop when the deployment is created. The next set of questions is just as important: can the user resize the VM, delete the deployment, extend the lease, add a disk, change network settings, or run a custom action?

Those actions can have just as much impact as the original request. In some cases they are more sensitive. Creating something in a controlled way is one thing. Changing or deleting something that already exists can be a bigger operational risk.

This is where day-2 action policies become useful. They allow the platform team to define which actions are available after deployment, and to whom. If the deployment process is governed but the operational lifecycle is wide open, the platform still has a gap.
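
Conceptually, a day-2 action policy answers one question: is this persona allowed to run this action on an existing deployment? A minimal sketch, with invented persona and action names that do not correspond to actual VCF Automation identifiers:

```python
# Hypothetical policy: which day-2 actions each persona may run.
DAY2_POLICY = {
    "project-admin": {"resize", "add-disk", "extend-lease", "delete"},
    "project-user": {"extend-lease"},
    "auditor": set(),
}


def is_action_allowed(persona: str, action: str, policy: dict = DAY2_POLICY) -> bool:
    """Return True only if the policy explicitly grants the action."""
    # Unknown personas get an empty set, so the default is to deny.
    return action in policy.get(persona, set())


print(is_action_allowed("project-user", "extend-lease"))  # True
print(is_action_allowed("project-user", "delete"))        # False
```

The default-deny behavior for unknown personas is a design choice in the sketch: an action that nobody thought about should not be available by accident.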

Lease policies are simple, but powerful

Lease policies are not a new idea, but they are still one of the most useful guardrails. A lot of waste in infrastructure platforms does not come from bad intent. It comes from resources that were created for a valid reason and then forgotten.

That is especially common with test environments, proof-of-concepts, sandboxes, and other temporary deployments. The work gets delayed, the person who created the environment moves on, and eventually nobody knows whether it is still needed.

For those types of workloads, I would almost always use leases. They do not have to be aggressive, but there should be some kind of lifecycle expectation. Temporary resources should either be renewed because they are still needed, or reclaimed because they are not.

Production is different, of course. I would not want production workloads to disappear because someone forgot to click renew. But for non-production and temporary environments, lease policies are one of the easiest ways to keep the platform clean.
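
The lifecycle expectation can be sketched as a simple state calculation. The grace period, the state names and the convention that `None` means "no lease" (for example for production) are assumptions made for the example, not a description of how lease policies are implemented.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def lease_state(created: datetime, lease_days: Optional[int],
                now: datetime, grace_days: int = 7) -> str:
    """Return the lease state of a deployment at a given point in time."""
    if lease_days is None:
        return "no lease"  # e.g. production workloads
    expires = created + timedelta(days=lease_days)
    if now < expires:
        return "active"
    if now < expires + timedelta(days=grace_days):
        return "expired, awaiting renewal"
    return "reclaim"


created = datetime(2026, 1, 1, tzinfo=timezone.utc)

print(lease_state(created, 30, datetime(2026, 1, 15, tzinfo=timezone.utc)))  # active
print(lease_state(created, 30, datetime(2026, 2, 3, tzinfo=timezone.utc)))   # expired, awaiting renewal
print(lease_state(created, 30, datetime(2026, 2, 10, tzinfo=timezone.utc)))  # reclaim
```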

IaaS resource policies are where hard constraints belong

IaaS resource policies are interesting because they move beyond human approval and into admission control. Instead of asking someone to approve every exception, the platform can enforce certain rules before resources are created.

This is where you can control what types of namespace resources are allowed. For example, you might restrict which services can be used, which VM classes are available, which Kubernetes versions are acceptable, or how VKS clusters can be shaped.

That is closer to how I think platform engineering should work. The desired behavior should not only live in design documents, naming standards, or operational guidelines. Where it makes sense, it should be encoded into the platform itself.

At the same time, this is also where policy design needs some care. If policies overlap badly or contradict each other, users will not experience the platform as governed. They will experience it as broken.

For that reason, I would keep the first version simple. Start with a small number of clear policies that express real constraints, and expand when the operational consequences are better understood.
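
The kind of admission check described in this section can be illustrated with a short sketch. It validates a requested VKS cluster against a small set of rules before anything is created. The policy structure, the field names and the version numbers are hypothetical; actual IaaS resource policies are defined in VCF Automation and are not written like this.

```python
# Hypothetical policy; names and versions are invented for the example.
CLUSTER_POLICY = {
    "allowed_vm_classes": {"small", "medium"},
    "allowed_kubernetes_minors": {"1.33", "1.34"},
    "max_worker_nodes": 10,
}


def minor_version(version: str) -> str:
    """Reduce a version such as 'v1.33.1' to its 'major.minor' part."""
    return ".".join(version.lstrip("v").split(".")[:2])


def admission_denials(cluster: dict, policy: dict = CLUSTER_POLICY) -> list:
    """Return the reasons a cluster request is denied (empty if admitted)."""
    denials = []
    if cluster["vm_class"] not in policy["allowed_vm_classes"]:
        denials.append("vm class not allowed")
    if minor_version(cluster["kubernetes_version"]) not in policy["allowed_kubernetes_minors"]:
        denials.append("kubernetes version not allowed")
    if cluster["worker_nodes"] > policy["max_worker_nodes"]:
        denials.append("too many worker nodes")
    return denials


good = {"vm_class": "medium", "kubernetes_version": "v1.33.1", "worker_nodes": 3}
bad = {"vm_class": "gpu-large", "kubernetes_version": "v1.29.4", "worker_nodes": 25}

print(admission_denials(good))  # []
print(admission_denials(bad))
# ['vm class not allowed', 'kubernetes version not allowed', 'too many worker nodes']
```

Returning every reason at once, rather than stopping at the first one, is a deliberate choice here: a user who is told everything that is wrong in one go is less likely to experience the platform as broken.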

Identity and roles still matter

Guardrails are not only about resources. They are also about who can do what, and that is where the role model becomes important.

VCF Automation has provider-level roles, organization roles, and project roles. That makes it possible to separate platform administration, tenant administration, project administration, consumption, and auditing without giving everyone the same level of access.

This is another area where I would avoid over-engineering too early. I would rather start with a small number of clear personas:
- Platform provider administrator
- Organization administrator
- Project administrator
- Project user
- Auditor
Once those personas are clear, it becomes easier to decide what each of them should actually be allowed to do.

The worst outcome is usually not that you have too few roles in the beginning. It is a role model nobody understands. If people cannot explain who is allowed to do what, the access model is already too complicated.

Extensibility fills the gaps

No platform product will know every rule an organization wants to enforce. There will always be things that depend on local processes, existing systems, naming standards, ownership models, or security requirements.

That is where event subscriptions, VCF Operations Orchestrator, APIs, and external systems become useful. They allow the platform to connect with the parts of the organization that sit outside VCF Automation itself.

In practice, that might mean validating names against a standard, creating tickets, updating a CMDB, applying metadata or tags, or enriching an approval based on cost center, data classification, or environment type. These are not always native product features, but they are still real platform requirements.
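
As an example of the first item, this is roughly what a name validation step could look like when called from an extensibility workflow. The naming standard (`<environment>-<team>-<two digits>`) is invented for the example, and how the function would be wired into an event subscription is left out.

```python
import re

# Hypothetical naming standard: <environment>-<team>-<two digits>, all lowercase.
NAME_PATTERN = re.compile(r"(dev|test|prod)-[a-z][a-z0-9]{1,14}-[0-9]{2}")


def validate_name(name: str) -> bool:
    """Return True if the requested name follows the naming standard."""
    return NAME_PATTERN.fullmatch(name) is not None


print(validate_name("dev-payments-01"))  # True
print(validate_name("Prod-Payments-1"))  # False
```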

I would not use extensibility to compensate for a weak platform design. But when the main model is sound, extensibility is a good way to connect the platform to the rest of the operating model.

My preferred design approach

If I were designing this for a serious enterprise VCF platform, I would not start with approval policies. I would start with the platform boundaries.

The first step would be to define the basic consumption model: organizations, regions, zones, and region quotas. That gives the platform a structure before users start consuming anything.

After that, I would define the common landing zones and connectivity model. That means standard namespace classes, the VPC and connectivity model, projects, project roles, and the catalog items that represent the normal consumption paths.

Only then would I start adding policies. IaaS resource policies are useful for hard technical constraints. Lease policies are useful for temporary workloads. Day-2 policies are useful for operational control. Approval policies should be added where human approval actually adds value, not as the default answer to every governance question.

Finally, I would use extensibility for the things that need to connect to the rest of the organization, such as external validation, CMDB updates, ticketing, metadata, or approval enrichment.

This order matters because it changes the role of approvals. If you start with approvals, you risk building a ticketing system with a cloud portal in front of it. If you start with the platform boundaries, approvals become the exception rather than the foundation.

Closing thoughts

VCF Automation guardrails are not only about stopping users from doing things. They are mainly about making the platform easier to consume safely.

A good guardrail should not feel like unnecessary friction. It should make the intended path clearer, reduce the number of decisions an application team needs to make, protect the platform from uncontrolled growth, and make operations more predictable.

In older infrastructure designs, a lot of this discipline lived in documents, meetings, naming standards, and the heads of a few experienced engineers. That can work for a while, but it does not scale very well once more teams start consuming the platform.

What I like about the VCF Automation model is that more of this discipline can be encoded into the platform itself. Not everything, and not perfectly, but enough to make the operating model more explicit.

That is why guardrails are interesting to me. Not because they make self-service more restrictive, but because they make self-service more usable. Without them, a private cloud easily becomes just another shared infrastructure environment with a portal in front of it.

Owning the Platform on VCF 9

February 17, 2026

In my previous article, I reflected on what I would design differently if I were building an NSX platform today. That piece focused on architectural choices — fewer abstractions, clearer boundaries, stronger defaults.

But design decisions are only part of the story. What ultimately matters is who carries responsibility for how the platform behaves over time.

Much of my current work revolves around VMware Cloud Foundation and NSX. That hasn’t changed. What has changed is the altitude at which I tend to operate. The conversations are less about individual features and more about responsibility boundaries, lifecycle, and what the platform should enforce by default. VCF 9 simply makes those questions harder to ignore.

Deploying a platform is one thing. Owning it is something else entirely. And on a VCF 9 platform, that distinction becomes very visible.

Defaults and Guardrails

On a VCF 9 platform, one of the primary responsibilities of a platform engineer is defining the defaults that everything else inherits. This sounds simple, but it rarely is.

Defaults determine how namespaces are structured, how VPCs are segmented, what network isolation patterns are applied automatically, and which policies are enforced without anyone having to request them explicitly. They define what “normal” looks like.

In many organizations, the tendency is to focus on flexibility first. Every team wants options. Every application is treated as unique. Over time, this leads to exception-driven design, where the platform becomes a collection of special cases. A platform engineer has to resist that.

The goal isn’t to maximize flexibility. It’s to maximize safe autonomy. Application teams should be able to move quickly — within boundaries that are intentional and well understood. If every new workload requires new networking decisions or repeated security debates, the platform isn’t providing enough guidance.

In a VCF 9 world, constructs like VPCs and integrated lifecycle management make it easier to encode these decisions directly into the platform. That doesn’t remove responsibility. It concentrates it. The platform engineer must decide which choices are available, which are restricted, and which shouldn’t be visible at all. Those decisions shape the operational behavior of the platform far more than any single feature configuration.

Lifecycle Over Provisioning

Provisioning is visible. Lifecycle is not.

On a VCF 9 platform, provisioning a workload — whether a VM or a Kubernetes cluster — is increasingly straightforward. Templates exist. Policies are predefined. The automation layer does most of the work. That part isn’t where the platform engineer earns their keep.

The real responsibility sits in lifecycle.

Clusters need to be upgraded. Supervisor versions change. NSX components evolve. Dependencies shift. Policies that made sense six months ago may conflict with new operating models. And all of this has to happen without breaking what already runs on the platform.

A platform engineer has to think in terms of failure domains and blast radius. What happens when a management domain upgrade introduces a behavioral change? How isolated are tenant VPCs during an incident? What is the rollback strategy if a network policy update has unintended side effects?

These aren’t provisioning questions. They’re lifecycle questions.

In many environments, the temptation is to optimize for day-one deployment. The real complexity shows up in year two and year three, when the platform has grown, ownership has shifted, and the original architects are no longer deeply involved.

On VCF 9, lifecycle is integrated across compute, storage, networking, and automation. That integration is powerful, but it also means that changes ripple across layers. The platform engineer needs to understand those relationships and design with change in mind from the beginning.

Provisioning makes something exist. Lifecycle ensures it continues to operate safely over time.

What the Platform Engineer Is Not Responsible For

Clear responsibility boundaries are just as important as defined ownership.

In many organizations, “platform engineering” becomes a catch-all term. Anything that sits somewhere between infrastructure and applications tends to fall under it, and that ambiguity creates friction quickly.

On a VCF 9 platform, the platform engineer isn’t responsible for writing application code, designing CI/CD pipelines, tuning individual microservices, or debugging application-level issues. Those are different domains, owned by different teams.

The platform engineer also isn’t responsible for designing bespoke networking patterns for every workload. If each new application triggers a new NSX policy design session, the platform engineer has drifted into application ownership. The goal is to define patterns once and let teams consume them safely.

Nor is the platform engineer a ticket router between compute, storage, and networking teams. One of the reasons VCF exists is to reduce fragmentation between those domains. The responsibility is to ensure they behave coherently — not to manually coordinate every interaction.

In practice, this means saying “no” at the right time. It means pushing back when flexibility undermines operability. It means protecting the platform from well-intentioned customization that introduces long-term fragility.

Platform engineering isn’t about controlling everything. It’s about deciding what must be controlled — and leaving the rest to the teams that own it.

Closing Thoughts

On a VCF 9 platform, the role of the platform engineer is less about configuring components and more about defining how they work together over time.

The technology stack is mature. The automation layers are capable. Provisioning is no longer the hard part.

Responsibility is.

Defining defaults, designing for lifecycle, setting boundaries, and protecting coherence across compute, storage, networking, and automation aren’t glamorous tasks — but they determine whether the platform remains stable and operable years after its initial deployment.

In that sense, platform engineering on VCF 9 isn’t a new discipline. It’s a shift in focus: less about building things and more about owning how they behave over time. And that shift changes what the role is really about.

Things I Would Do Differently If I Designed an NSX Platform Today

January 2, 2026
When I started designing large NSX platforms, most of the hard problems were technical.

How far could we push microsegmentation?
How much overlay networking could we introduce?
How flexible could we make the design so it would survive future requirements?

At the time, that made a lot of sense.

Today, the situation is different. NSX is mature, stable, and extremely capable. But the environment around it has changed quite a lot. VMware Cloud Foundation 9, platform thinking, GitOps, and new operating models have shifted where things like responsibility, security, and control should live.

Looking back, I don’t think we designed things wrong. The designs were actually pretty good given the context. But if I were to design an NSX platform today, there are several choices I would make differently — not because NSX matters less, but because it needs to matter in a different way.

I would design for fewer abstractions, not more

Early NSX designs often aimed for maximum flexibility.

We built layers of abstractions: segments, groups, policies, nested constructs. The thinking was usually sound — future-proofing, reuse, and separation of concerns. And technically, this worked.

Operationally, it often didn’t.

Every abstraction adds cognitive load. Someone has to understand it, maintain it, debug it, and eventually explain it to the next team. Over time, that cost adds up faster than most people expect.

If I were designing today, I would be more conservative. I would introduce abstractions only when there is a clear need — not just because “we might need it later”.

I would stop treating microsegmentation as the primary security control

This is not an argument against microsegmentation. It is still one of the strongest features NSX offers.

But in the past, microsegmentation was often treated as the foundation of security — something every workload would eventually get, almost by default.

What has changed here is that there are now more layers that contribute meaningfully to blast-radius reduction:
- Identity-aware workloads
- Stronger application-level security controls
- Kubernetes-native isolation mechanisms
- Platform-level guardrails
In many environments, I’ve seen microsegmentation applied too early and too broadly. The result is usually slower adoption, more fragile policy sets, and eventually, a security posture that is hard to reason about during incidents.

Today, I would treat microsegmentation as a precision tool, not a baseline requirement. Extremely valuable in the right places — but not something that needs to be everywhere from day one.

The goal remains the same: reduce blast radius.
There are now simply more ways to get there.

What NSX is (and is not) meant to be today

This is something I myself underestimated early on.

NSX is very powerful, and that sometimes creates pressure to use it as a solution for problems it was never meant to solve.

NSX is not:
- A developer-facing API
- A CI/CD system
- An application security framework
- A replacement for platform governance
With Supervisor, namespaces, and Kubernetes in the picture, there is often pressure to surface NSX concepts upwards into application design. In many cases, that is a mistake.

NSX works best when it is treated as foundational infrastructure — something that provides strong defaults and guardrails, but stays largely invisible to application teams.

NSX is becoming part of the platform, not a separate design exercise

If I were designing an NSX platform today, I would lean heavily into the fact that NSX is no longer a separate layer that needs to be engineered and exposed on its own.

With the VPC construct and the VCF Automation All Apps model, NSX becomes an integrated part of the platform’s native application model. Lifecycle, isolation, and consumption patterns are defined up front, rather than stitched together through project-specific designs.

From an architectural perspective, this is an important shift. It reduces the need to surface NSX concepts upward, while still preserving strong isolation and governance underneath. It also enforces some of the discipline that previously depended entirely on human restraint.

This does not remove the need for NSX expertise — but it does change where that expertise should be applied. The focus moves away from per-application engineering and toward designing the platform’s default behavior correctly from the start.

I would align NSX more with platform lifecycle, not projects

One reason lifecycle thinking matters so much is organizational reality. Platforms rarely live in stable conditions. Teams change, people leave, responsibilities shift, and platforms get handed over.

Designs that assume long-term ownership by the same small group tend to fail quietly over time. When I design NSX platforms today, I try to assume that I won’t be there in a few years — and that the people operating it will have different constraints and priorities than I did.

This pushes me toward standardization, clearer boundaries, and designs that are easier to reason about under pressure.

Closing Thoughts

NSX still matters. A lot.

But its value today is less about how much it can do, and more about how deliberately it is used. Platform-centric and cloud-native operating models don’t make NSX irrelevant — they make its role clearer and more focused.

If I were designing an NSX platform today, I would aim for fewer concepts, clearer boundaries, and designs that survive not just technical change, but organizational change as well.

The best NSX platforms I see today are often the ones nobody talks about — because they just work, and nobody is afraid to touch them.

Avi Load Balancer Metrics with Prometheus and Grafana

January 4, 2025

Avi Load Balancer offers a wealth of valuable metrics that can be accessed directly via the Avi Controller’s UI or API.

However, there are various reasons why you might want to make these metrics available outside of its native platform. For instance, you might wish to avoid granting users or systems direct access to the Avi Load Balancer management plane solely for metric consumption. Alternatively, you might need to store and analyze metrics in a centralized system or simply back them up for future use.

Fortunately, there are several methods for fetching metrics from the Avi Load Balancer and processing or storing them externally. In this article, I’ll guide you through the process of setting up an automated workflow where Avi Load Balancer metrics are fetched by Prometheus and visualized in Grafana.

Lab Environment

My lab environment for this exercise consists of the following components:

- vSphere 8 Update 3
- A vSphere cluster with 3 ESXi hosts configured as Supervisor
- A TKG Service cluster (Kubernetes cluster) with 1 control plane node and 3 worker nodes
- NSX 4.2.1.0 as the network stack
- Avi Load Balancer 30.2.2 with DNS virtual service
- Avi Kubernetes Operator (AKO)
- vSAN storage

Apart from the Avi Load Balancer, none of these components are strictly required. For all I know, this exercise could be performed using upstream Kubernetes on bare metal instead. However, this is how my lab is currently configured, and I wanted to share that setup for your reference.

High-Level Overview

Below is a simple high-level overview illustrating the workflow we’re going to build. It demonstrates how Avi Load Balancer metrics flow through the system, from collection to visualization, using Prometheus and Grafana.

The various components—Grafana, Prometheus, and Avi API Proxy—will be deployed as pods within my Kubernetes cluster.

Let’s go!

Namespace

Keeping the components together in a dedicated namespace is my preferred approach in this case. This way, Prometheus can communicate with the Avi API Proxy using its Kubernetes-internal FQDN, and the same applies to communication between Grafana and Prometheus.

Create the observability namespace:

kubectl create ns observability

Deploying Components

Now, we can begin deploying the various components within this namespace.

Avi API Proxy

The Avi API Proxy is not a required component, but I recommend using it. Without the proxy, Prometheus would need to communicate directly with the Avi Controller. This would require enabling Basic Auth on the Avi Controller, which might not be desirable. There are additional advantages, as outlined in the official documentation. Essentially, by placing a proxy between the Avi Controller and Prometheus, we abstract away some complexity, resulting in a cleaner and more manageable solution.

The official documentation also references a Docker container. However, since I want to deploy the Avi API Proxy as a pod in Kubernetes, the manifest for the deployment I came up with (including the method to expose it) looks like this:

##
## avi-api-proxy-deployment.yaml
##
apiVersion: apps/v1
kind: Deployment
metadata:
  name: avi-api-proxy
  namespace: observability
  labels:
    app: avi-api-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: avi-api-proxy
  template:
    metadata:
      labels:
        app: avi-api-proxy
    spec:
      containers:
      - name: avi-api-proxy
        image: avinetworks/avi-api-proxy:latest
        ports:
        - containerPort: 8080
        env:
        - name: AVI_CONTROLLER
          value: "10.203.240.15"
        - name: AVI_USERNAME
          value: "prometheus"
        - name: AVI_PASSWORD
          value: "VMware1!"
        - name: AVI_TIMEOUT
          value: "60"
---
apiVersion: v1
kind: Service
metadata:
  name: avi-api-proxy-service
  namespace: observability
  labels:
    app: avi-api-proxy
spec:
  selector:
    app: avi-api-proxy
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080

You’ll have to update the values for AVI_CONTROLLER, AVI_USERNAME, and AVI_PASSWORD. After that, you should be good to go:

kubectl apply -f avi-api-proxy-deployment.yaml

Verify that the deployment and service are up and running:

kubectl get deployments -n observability avi-api-proxy && kubectl get svc -n observability avi-api-proxy-service

Looking good!
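
The Deployment above passes AVI_PASSWORD as a literal value in the manifest, which is fine for a lab. As a hedged alternative, the credentials could live in a Kubernetes Secret that the container reads through secretKeyRef. The sketch below generates both pieces as JSON, which kubectl apply -f also accepts. The Secret name avi-api-proxy-credentials and the helper functions are my own assumptions for illustration; they are not part of the original manifests.

```python
import json

# Hypothetical Secret name; not part of the original manifests.
SECRET_NAME = "avi-api-proxy-credentials"


def build_secret(username: str, password: str, namespace: str = "observability") -> dict:
    """Return a Secret manifest; stringData leaves the base64 encoding to the API server."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": SECRET_NAME, "namespace": namespace},
        "type": "Opaque",
        "stringData": {"AVI_USERNAME": username, "AVI_PASSWORD": password},
    }


def env_from_secret(var_name: str) -> dict:
    """Return a container env entry that reads var_name from the Secret."""
    return {
        "name": var_name,
        "valueFrom": {"secretKeyRef": {"name": SECRET_NAME, "key": var_name}},
    }


if __name__ == "__main__":
    # Placeholder credentials; substitute your own before applying.
    print(json.dumps(build_secret("prometheus", "change-me"), indent=2))
    print(json.dumps([env_from_secret("AVI_USERNAME"), env_from_secret("AVI_PASSWORD")], indent=2))
```

The two env entries would replace the literal AVI_USERNAME and AVI_PASSWORD entries in the Deployment, while AVI_CONTROLLER and AVI_TIMEOUT can stay as plain values.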

Prometheus ConfigMap

Prometheus uses a configuration file in YAML format. Since we’re deploying Prometheus in Kubernetes, we’ll add the contents of our specific configuration file as a ConfigMap within our namespace. We’ll then instruct Prometheus to look for the configuration in that ConfigMap:

##
## prometheus-configmap.yaml 
##
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: observability
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
      - /etc/prometheus/prometheus.rules
    scrape_configs:
      - job_name: avi_api_vs1  ## Job name
        honor_timestamps: true
        params:
          tenant:
          - admin   ## Tenant Names to be mentioned comma separated 
        scrape_interval: 1m ## scrape interval
        scrape_timeout: 45s ## scrape timeout
        metrics_path: /api/analytics/prometheus-metrics/virtualservice ## VirtualService metrics collected
        scheme: http
        follow_redirects: true
        metric_relabel_configs:  ## config to replace the controller instance name
        - source_labels: [instance]
          separator: ;
          regex: (.*)
          target_label: instance
          replacement: pod-240-alb-controller ## replacement name to be used
          action: replace
        static_configs:
        - targets:
          - avi-api-proxy-service.observability.svc:8080 ## avi-api-proxy Service DNS name and port
      - job_name: avi_api_se_specific
        honor_timestamps: true
        params:
          metric_id:
          - se_if.avg_bandwidth,se_if.avg_rx_pkts,se_if.avg_rx_bytes,se_if.avg_tx_bytes,se_if.avg_tx_pkts  ## Specific SE metrics which are collected
          tenant:
          - admin
        scrape_interval: 1m
        scrape_timeout: 45s
        metrics_path: /api/analytics/prometheus-metrics/serviceengine   ## Metrics path for  Service Engine
        scheme: http
        follow_redirects: true
        metric_relabel_configs:
        - source_labels: [instance]
          separator: ;
          regex: (.*)
          target_label: instance
          replacement: pod-240-alb-controller
          action: replace
        static_configs:
        - targets:
          - avi-api-proxy-service.observability.svc:8080
      - job_name: avi_api_se
        honor_timestamps: true
        params:
          tenant:
          - admin
        scrape_interval: 1m
        scrape_timeout: 45s
        metrics_path: /api/analytics/prometheus-metrics/serviceengine ## Metrics path for  Service Engine
        scheme: http
        follow_redirects: true
        metric_relabel_configs:
        - source_labels: [instance]
          separator: ;
          regex: (.*)
          target_label: instance
          replacement: pod-240-alb-controller
          action: replace
        static_configs:
        - targets:
          - avi-api-proxy-service.observability.svc:8080
      - job_name: avi_api_pool
        honor_timestamps: true
        params:
          tenant:
          - admin
        scrape_interval: 1m
        scrape_timeout: 45s
        metrics_path: /api/analytics/prometheus-metrics/pool  ## Metrics path for Pool  
        scheme: http
        follow_redirects: true
        metric_relabel_configs:
        - source_labels: [instance]
          separator: ;
          regex: (.*)
          target_label: instance
          replacement: pod-240-alb-controller
          action: replace
        static_configs:
        - targets:
          - avi-api-proxy-service.observability.svc:8080
      - job_name: avi_api_controller
        honor_timestamps: true
        scrape_interval: 1m
        scrape_timeout: 45s
        metrics_path: /api/analytics/prometheus-metrics/controller  ## Metrics path for Avi Controller 
        scheme: http
        follow_redirects: true
        metric_relabel_configs:
        - source_labels: [instance]
          separator: ;
          regex: (.*)
          target_label: instance
          replacement: pod-240-alb-controller
          action: replace
        static_configs:
        - targets:
          - avi-api-proxy-service.observability.svc:8080

You’ll want to replace pod-240-alb-controller with the name of your Avi Load Balancer Controller.
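
To make the effect of the metric_relabel_configs entries more tangible, here is a small sketch that mimics the replace action on a single label set. It is an approximation written for this article, not Prometheus code: Prometheus uses RE2 regular expressions while this uses Python's re module, and the function name relabel_replace is hypothetical.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label, replacement, separator=";"):
    """Mimic the Prometheus `replace` relabel action on one label set."""
    # Source label values are joined with the separator before matching.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # relabel regexes are fully anchored
    if match is None:
        return dict(labels)  # no match: labels stay untouched
    # Convert Prometheus-style $1 / ${1} references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


scraped = {"instance": "avi-api-proxy-service.observability.svc:8080", "job": "avi_api_vs1"}
print(relabel_replace(scraped, ["instance"], "(.*)", "instance", "pod-240-alb-controller"))
```

Because the regex (.*) matches any value, the instance label is always overwritten with the replacement string, which is why the dashboards show the controller name instead of the proxy's address.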

Note that we’re targeting the Avi API Proxy service and addressing it by its internal FQDN: avi-api-proxy-service.observability.svc
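
Each scrape job combines scheme, target, metrics_path, and params into the URL that Prometheus requests. The sketch below rebuilds that URL for the first job so you can test it by hand, for example with curl from a pod in the cluster. The function name scrape_url is my own, and the query-string encoding comes from Python's standard library rather than from Prometheus itself.

```python
from urllib.parse import urlencode


def scrape_url(scheme, target, metrics_path, params=None):
    """Assemble the URL requested for one target of a scrape job."""
    url = f"{scheme}://{target}{metrics_path}"
    if params:
        # params maps each name to a list of values, as in prometheus.yml
        url += "?" + urlencode(params, doseq=True)
    return url


print(scrape_url(
    "http",
    "avi-api-proxy-service.observability.svc:8080",
    "/api/analytics/prometheus-metrics/virtualservice",
    {"tenant": ["admin"]},
))
```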

Create the ConfigMap:

kubectl apply -f prometheus-configmap.yaml

Prometheus

The manifest for the Prometheus deployment looks like this:

##
## prometheus-deployment.yaml
##
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: observability
  labels:
    app: prometheus-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - "--storage.tsdb.retention.time=12h"
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: 500m
              memory: 500M
            limits:
              cpu: 1
              memory: 1Gi
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-storage-volume
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: observability
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9090'
spec:
  selector:
    app: prometheus-server
  ports:
    - port: 9090
      protocol: TCP
      targetPort: 9090

Note that the contents of the ConfigMap are made accessible to Prometheus via a volume.
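
A small detail in the manifest above: the memory request is written as 500M while the limit is 1Gi. Kubernetes treats M as a decimal suffix (10^6) and Mi/Gi as binary suffixes (2^20, 2^30), so the two are not on the same scale. The sketch below shows the difference for integer quantities only; memory_bytes is a hypothetical helper, not a Kubernetes API.

```python
# Decimal (SI) and binary suffixes accepted in Kubernetes resource quantities.
_SUFFIXES = {
    "k": 10**3, "M": 10**6, "G": 10**9,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
}


def memory_bytes(quantity: str) -> int:
    """Convert a simple integer memory quantity such as '500M' or '1Gi' to bytes."""
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):  # try 'Mi' before 'M'
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(quantity)  # plain number of bytes


for q in ("500M", "500Mi", "1Gi"):
    print(q, memory_bytes(q))
```

Neither spelling is wrong; it is just worth knowing which one you meant when sizing requests and limits.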

Deploy Prometheus:

kubectl apply -f prometheus-deployment.yaml

Check the result:

kubectl get deployments -n observability prometheus-deployment && kubectl get svc -n observability prometheus-service

We’re good.

Ingress for Prometheus (optional)

This is optional, but Prometheus has a web UI that can be quite handy from time to time. Additionally, since I have the Avi Kubernetes Operator (AKO) configured in my cluster, it’s easy to create an Ingress for the Prometheus service.

##
## prometheus-ingress.yaml
##
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: observability
spec:
  ingressClassName: avi-lb
  rules:
    - host: prometheus.ako.lab
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: prometheus-service
              port: 
                number: 9090

kubectl apply -f prometheus-ingress.yaml

The prometheus.ako.lab Ingress, along with several other Ingresses hosted by Avi Load Balancer in my lab:

Grafana ConfigMap

Grafana is the component that transforms our Prometheus metrics into visually appealing graphs, making it easier to interpret the data.

The configuration we need to inject into our Grafana instance is the Prometheus data source. To do this, we’ll use the following ConfigMap:

##
## grafana-configmap.yaml 
##
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: observability
data:
  prometheus.yaml: |-
    {
        "apiVersion": 1,
        "datasources": [
            {
               "access":"proxy",
                "editable": true,
                "name": "Prometheus",
                "orgId": 1,
                "type": "prometheus",
                "url": "http://prometheus-service.observability.svc:9090",
                "version": 1
            }
        ]
    }

Grafana will use Prometheus’s internal FQDN to access the service: prometheus-service.observability.svc
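
The name used here is one of several DNS forms that resolve to the same Service from inside the cluster. As a rough sketch, assuming the default pod DNS search path and the common cluster domain cluster.local (yours may differ), the forms can be listed like this. The function name service_dns_names is hypothetical.

```python
def service_dns_names(service: str, namespace: str, cluster_domain: str = "cluster.local"):
    """List the DNS names that resolve to a Service, shortest first."""
    return [
        service,                                        # only from pods in the same namespace
        f"{service}.{namespace}",
        f"{service}.{namespace}.svc",                   # the form used in this article
        f"{service}.{namespace}.svc.{cluster_domain}",  # fully qualified
    ]


for name in service_dns_names("prometheus-service", "observability"):
    print(name)
```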

Create the ConfigMap:

kubectl apply -f grafana-configmap.yaml

Grafana

Finally, we deploy Grafana using this manifest:

##
## grafana-deployment.yaml
##
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      name: grafana
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:latest
        ports:
        - name: grafana
          containerPort: 3000
        resources:
          limits:
            memory: "1Gi"
            cpu: "1000m"
          requests: 
            memory: 500M
            cpu: "500m"
        volumeMounts:
          - mountPath: /var/lib/grafana
            name: grafana-storage
          - mountPath: /etc/grafana/provisioning/datasources
            name: grafana-datasources
            readOnly: false
      volumes:
        - name: grafana-storage
          emptyDir: {}
        - name: grafana-datasources
          configMap:
              defaultMode: 420
              name: grafana-datasources
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: observability
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '3000'
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      protocol: TCP
      targetPort: 3000

Validate the result:

kubectl get deployments -n observability grafana && kubectl get svc -n observability grafana-service

Ingress for Grafana (optional)

Creating an Ingress for the Grafana service is also optional. If you want to access this Grafana instance from outside the Kubernetes cluster, there are several ways to achieve that. In my case, I’ll create an Ingress and let AKO handle the rest. 😉

##
## grafana-ingress.yaml
##
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: observability
spec:
  ingressClassName: avi-lb
  rules:
    - host: grafana.ako.lab
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: grafana-service
              port: 
                number: 3000

Grafana Dashboards

At this point, we should have a working solution. Prometheus is fetching and storing Avi Load Balancer metrics through the Avi API Proxy, while Grafana is configured with the Prometheus data source and ready to visualize the data using dashboards.

Speaking of dashboards, I found some deep down in the devops repository on the Avi Networks GitHub that work out-of-the-box with what we’ve set up here. There are six dashboards in total.

Make sure to download these files, then log in to Grafana and import them as dashboards:

Once all the files have been imported, you should see something similar to this:

The six dashboards are:

- Avi Controller metrics
- Pool specific metrics
- Service Engine specific metrics
- Service Engine total metrics
- Virtual Service specific metrics
- Virtual Service total metrics

Summary

In this article, we explored the steps for configuring an automated workflow that collects Avi Load Balancer metrics with Prometheus via the Avi API Proxy and visualizes them in Grafana dashboards. All components were deployed as containers within a dedicated Kubernetes namespace.

This simple solution was implemented in an isolated lab environment. Deploying it in a production environment would require additional considerations, such as externalization, persistence, security, backup strategies, and adherence to Kubernetes best practices.

The YAML manifests that I used in this exercise can be found in my GitHub repository.

Hopefully this article gave you some inspiration on your network observability journey.

Thank you for reading! Feel free to share your thoughts or ask questions in the comments below or just reach out to me directly.

References:

How to Setup Prometheus Monitoring On Kubernetes Cluster
Avi Load Balancer Prometheus Integration

Network Visibility for TKG Service Clusters

December 30, 2024
TKG Service Clusters using the default Antrea CNI can easily be configured for enhanced network visibility through flow visualization and monitoring.

The ability to monitor network traffic within your Kubernetes clusters, as well as between your Kubernetes constructs and the outside world, is essential for understanding system behavior—and especially important when things aren’t working as intended.

In this article, I’ll walk you through the steps to enable network visibility specifically for TKG Service Clusters. However, similar steps can be applied to any Kubernetes cluster that is using the Antrea CNI.

Bill of Materials

My lab environment for this exercise consists of the following components:
- vSphere 8 Update 3
- A vSphere cluster with 3 ESXi hosts configured as Supervisor
- A TKG Service cluster with 1 control plane node and 3 worker nodes
- NSX 4.2.1.0 as the network stack
- Avi Load Balancer 30.2.2 with DNS virtual service
- Avi Kubernetes Operator (AKO)
- vSAN storage
Note: Neither Avi Kubernetes Operator (AKO) nor Avi Load Balancer are required components, but you’ll most likely run into these when working with production vSphere Supervisor environments.

Diagram

The diagram below provides a high-level overview of the lab environment.

Obviously, not all details are provided in this overview, but it should at least give you an idea of how the lab environment has been configured.

Step 1 – Enable FlowExporter in AntreaConfig

The first step is to make sure that the Antrea FlowExporter feature is enabled in our cluster.

Connect to the Supervisor endpoint for our Namespace:
```
kubectl vsphere login --server=10.204.249.34 --vsphere-username administrator@vsphere.local  --insecure-skip-tls-verify
```
Switch to the context where the AntreaConfig is stored:
```
kubectl config use-context tkgs-ns-1
```
Fetch the name of the AntreaConfig:
```
kubectl get AntreaConfig -o=custom-columns=NAME:.metadata.name

tkgs-cluster-1-antrea-package
```
And finally edit the AntreaConfig:
```
kubectl edit AntreaConfig tkgs-cluster-1-antrea-package
```
Here we need to have a look at two items:
- spec.antrea.config.featureGates.FlowExporter
- spec.antrea.config.flowExporter.enable
Both of these flags must be set to true:
```
apiVersion: cni.tanzu.vmware.com/v1alpha1
kind: AntreaConfig
metadata:
  name: tkgs-cluster-1-antrea-package
  namespace: tkgs-ns-1
spec:
  antrea:
    config:
      featureGates:
        ...
        FlowExporter: true
      flowExporter:
        ...
        enable: true
```
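If you would rather check the two flags without opening an editor, something along these lines could work. It is a minimal sketch that inspects the object as a Python dictionary, for example loaded from the JSON printed by kubectl get AntreaConfig tkgs-cluster-1-antrea-package -o json. The function name flow_exporter_enabled is my own.

```python
def flow_exporter_enabled(antrea_config: dict) -> bool:
    """Return True only if both FlowExporter flags are set to true."""
    config = antrea_config.get("spec", {}).get("antrea", {}).get("config", {})
    gate = config.get("featureGates", {}).get("FlowExporter") is True
    enable = config.get("flowExporter", {}).get("enable") is True
    return gate and enable


# Trimmed-down example object with only the relevant fields.
sample = {
    "spec": {
        "antrea": {
            "config": {
                "featureGates": {"FlowExporter": True},
                "flowExporter": {"enable": True},
            }
        }
    }
}
print(flow_exporter_enabled(sample))
```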
If you had to change the AntreaConfig, the Antrea agents inside the TKG Service cluster need to be restarted. Connect to the TKG Service cluster:
```
kubectl vsphere login --server=10.204.249.34 --tanzu-kubernetes-cluster-name tkgs-cluster-1 --tanzu-kubernetes-cluster-namespace tkgs-ns-1 --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify
```
Switch to the context of the cluster:
```
kubectl config use-context tkgs-cluster-1
```
Issue the following command to restart the Antrea agent pods:
```
kubectl delete pod -n kube-system -l app=antrea
```
Step 2 – Install the Flow Aggregator

Next we install the Flow Aggregator and for that we’ll use Helm. Make sure that you’re still connected to the TKG Service cluster and in the cluster’s context.

Add the Helm repository and receive the latest information about its charts:
```
helm repo add antrea https://charts.antrea.io
helm repo update
```
Install the Flow Aggregator using the following Helm command:
```
helm install flow-aggregator antrea/flow-aggregator --set clickHouse.enable=true,recordContents.podLabels=true -n flow-aggregator --create-namespace
```
Note: After installing the Flow Aggregator you will notice that its pod moves into a CrashLoopBackOff state. This is expected behaviour as the service it’s trying to connect to is not installed yet.

Step 3 – Install and Configure Theia

Theia is installed on top of Antrea and consumes the network flows that are exported by Antrea.

What I like about Theia is that it comes with ClickHouse and Grafana pre-configured. This means that almost everything works out-of-the-box. Flow data is processed, stored, and visualized without having to spend time on manually configuring and maintaining integrations.

We’ll use Helm to install Theia as well:
```
helm install theia antrea/theia --set sparkOperator.enable=true,theiaManager.enable=true -n flow-visibility --create-namespace
```
I mentioned that “almost” everything works out of the box. Well, life is rarely perfect, and to get everything working as intended, we (unfortunately) need to roll up our sleeves and get our hands dirty.

Fortunately, it’s not too complicated. What happened is that for some reason, a couple of columns are missing for some tables in the ClickHouse database. We need to manually add these to allow the Flow Aggregator pod to exit the CrashLoopBackOff state.

Exec into the ClickHouse pod and start a shell:
```
kubectl -n flow-visibility exec --stdin --tty chi-clickhouse-clickhouse-0-0-0 -- /bin/bash
```
Start the ClickHouse client:
```
clickhouse-client
```
Add the missing columns to the tables by using these commands:
```
ALTER TABLE default.flows ADD COLUMN appProtocolName String;
ALTER TABLE default.flows_local ADD COLUMN appProtocolName String;
ALTER TABLE default.flows ADD COLUMN httpVals String;
ALTER TABLE default.flows_local ADD COLUMN httpVals String;
ALTER TABLE default.flows ADD COLUMN egressNodeName String;
ALTER TABLE default.flows_local ADD COLUMN egressNodeName String;
```
Exit from the ClickHouse client and the pod.

BTW, I did appreciate the message that was printed when exiting the ClickHouse client. All is forgiven now, well, almost all 🙂
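
The six ALTER TABLE statements follow a simple pattern: three columns, each added to two tables. As a hedged variation, the sketch below generates them with ClickHouse's IF NOT EXISTS clause so the statements can be re-run safely. This is my own addition rather than what I ran in the lab, and the function name alter_statements is hypothetical.

```python
TABLES = ["default.flows", "default.flows_local"]
COLUMNS = ["appProtocolName", "httpVals", "egressNodeName"]


def alter_statements(tables=TABLES, columns=COLUMNS):
    """Build one idempotent ADD COLUMN statement per column and table."""
    return [
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {column} String;"
        for column in columns
        for table in tables
    ]


for statement in alter_statements():
    print(statement)
```

The output can be pasted into clickhouse-client in the same way as the statements above.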

Step 4 – Consume Network Visibility

The components are in place and it’s time to have a look at what we’ve ended up with.

By default the Grafana Pod is exposed using a NodePort service:

This means that we can access Grafana on the IP address of a node using (in this case) port 32366. To help you find out which node IP and port you should use, the Theia documentation provides a series of commands that will provide that information:
```
NODE_NAME=$(kubectl get pod -l app=grafana -n flow-visibility -o jsonpath='{.items[0].spec.nodeName}')
NODE_IP=$(kubectl get nodes ${NODE_NAME} -o jsonpath='{.status.addresses[0].address}')
GRAFANA_NODEPORT=$(kubectl get svc grafana -n flow-visibility -o jsonpath='{.spec.ports[*].nodePort}')
echo "=== Grafana Service is listening on ${NODE_IP}:${GRAFANA_NODEPORT} ==="
```
```
=== Grafana Service is listening on 10.204.241.38:32366 ===
```
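One caveat with the commands above: they take the first entry in the node's address list, which is not guaranteed to be the internal IP on every cluster. A slightly more defensive approach is to pick the address by type. The sketch below does that on a Node object loaded as a dictionary, for example from kubectl get node -o json. The function name internal_ip and the sample hostname are assumptions for illustration.

```python
def internal_ip(node: dict) -> str:
    """Return the first address of type InternalIP from a Node object."""
    for address in node.get("status", {}).get("addresses", []):
        if address.get("type") == "InternalIP":
            return address["address"]
    raise LookupError("node has no InternalIP address")


# Trimmed-down Node object; the hostname is a made-up example.
sample_node = {
    "status": {
        "addresses": [
            {"type": "Hostname", "address": "worker-node-1"},
            {"type": "InternalIP", "address": "10.204.241.38"},
        ]
    }
}
print(internal_ip(sample_node))
```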
Using that IP address and port, we can connect to the Grafana service. After logging in with the default credentials (username: admin, password: admin) and changing the initial password, we arrive at the homepage. This page provides a cluster overview dashboard featuring some key metrics for the entire cluster:

There are eight dashboard pages in total. The Flow Records and the Network Topology dashboards look particularly interesting as well:

Bonus Step – Ingress for Grafana

As mentioned earlier, I deployed and configured Avi Load Balancer with a DNS virtual service for this lab. Additionally, I set up the Avi Kubernetes Operator in my TKG Service cluster (tkgs-cluster-1). With these components in place, I can create an Ingress to access the Grafana service using a proper FQDN.

First, I expose the Grafana pod as a ClusterIP service. My theia-dashboard-service.yaml contains the following:
```
---
apiVersion: v1
kind: Service
metadata:
  name: theia-dashboard-service
  namespace: flow-visibility
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
```
Creating the service:
```
kubectl apply -f theia-dashboard-service.yaml
```
Next, I create the Ingress. My theia-dashboard-ingress.yaml looks like this:
```
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: theia-dashboard-ingress
  namespace: flow-visibility
spec:
  ingressClassName: avi-lb
  rules:
    - host: theia-dashboard.ako.lab
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: theia-dashboard-service
              port:
                number: 3000
```
Creating the Ingress:
```
kubectl apply -f theia-dashboard-ingress.yaml
```
Now AKO will do its magic, creating a pool, VIP, and a virtual service. Additionally, Avi Load Balancer DNS will take care of registering the DNS record (theia-dashboard.ako.lab):
```
kubectl get ingress -n flow-visibility
NAME                      CLASS    HOSTS                     ADDRESS        PORTS   AGE
theia-dashboard-ingress   avi-lb   theia-dashboard.ako.lab   10.204.244.5   80      4h51m
```
Using the FQDN as defined in my Ingress to reach Grafana:

Summary

Adding network visibility to your TKG Service clusters—or any Kubernetes cluster using the Antrea CNI, for that matter—is not a complex task. Aside from the missing columns issue, it actually works seamlessly out of the box. These are exactly the kinds of solutions many of my customers are looking for. 😊

Hopefully this article gave you some inspiration on your network observability journey.

Thank you for reading! Feel free to share your thoughts or ask questions in the comments below or reach out to me directly.

Integrating TKG Service Clusters with NSX Security

November 23, 2024
Organizations aiming to leverage NSX for securing their TKG Service Clusters (Kubernetes clusters) can now achieve this with relative ease. In this guide, I’ll walk you through configuring the integration between a TKG Service Cluster and NSX—a required step for centrally managing security policies within TKG Service Clusters and between these clusters and external networks.

Architecture Diagram

For your reference, the diagram below, which is part of the NSX documentation, illustrates the architecture for the integration. The key component is the Antrea NSX adapter running on the control plane nodes of the TKG Service Cluster.

Bill of Materials

My lab environment for this exercise includes the following components:
- vSphere 8.0 Update 3
- A vSphere cluster with 4 ESXi hosts
- NSX 4.2.1
- vSAN storage
- NSX Networking & Security, deployed and configured
Note: For this proof-of-concept, I did not use Avi Load Balancer. However, this component is typically included in production SDDC environments.

Step 1 – Activate the Supervisor Service

Before deploying any TKG Service Clusters, you must configure and activate the Supervisor service on a vSphere cluster. This can be achieved through the vCenter GUI, API calls, or by importing a configuration file.

To save some time and space, I’ll share the contents of the Supervisor configuration file I used to activate the Supervisor service in my lab.
```
{
  "specVersion": "1.0",
  "supervisorSpec": {
    "supervisorName": "Pod-210-S1"
  },
  "envSpec": {
    "vcenterDetails": {
      "vSphereZones": [
        "domain-c9"
      ],
      "vcenterAddress": "Pod-210-vCenter.SDDC.Lab",
      "vcenterDatacenter": "Pod-210-DataCenter"
    }
  },
  "tkgsComponentSpec": {
    "tkgsStoragePolicySpec": {
      "masterStoragePolicy": "vSAN Default Storage Policy",
      "imageStoragePolicy": "vSAN Default Storage Policy",
      "ephemeralStoragePolicy": "vSAN Default Storage Policy"
    },
    "tkgsMgmtNetworkSpec": {
      "tkgsMgmtNetworkName": "SEG-VKS-Management",
      "tkgsMgmtIpAssignmentMode": "STATICRANGE",
      "tkgsMgmtNetworkStartingIp": "10.204.210.10",
      "tkgsMgmtNetworkGatewayCidr": "10.204.210.1/24",
      "tkgsMgmtNetworkDnsServers": [
        "10.203.0.5"
      ],
      "tkgsMgmtNetworkSearchDomains": [
        "sddc.lab"
      ],
      "tkgsMgmtNetworkNtpServers": [
        "10.203.0.5"
      ]
    },
    "tkgsNcpClusterNetworkInfo": {
      "tkgsClusterDistributedSwitch": "Pod-210-VDS",
      "tkgsNsxEdgeCluster": "Pod-210-T0-Edge-Cluster-01",
      "tkgsNsxTier0Gateway": "T0-Gateway-01",
      "tkgsNamespaceSubnetPrefix": 28,
      "tkgsRoutedMode": false,
      "tkgsNamespaceNetworkCidrs": [
        "10.244.0.0/19"
      ],
      "tkgsIngressCidrs": [
        "10.204.211.0/25"
      ],
      "tkgsEgressCidrs": [
        "10.204.211.128/25"
      ],
      "tkgsWorkloadDnsServers": [
        "10.203.0.5"
      ],
      "tkgsWorkloadServiceCidr": "10.96.0.0/22"
    },
    "controlPlaneSize": "MEDIUM"
  }
}
```
Note: Obviously I’m using NSX as the networking stack for the Supervisor service.
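
The network values in this file are easy to sanity-check before importing it. The sketch below counts how many /28 subnets fit into the namespace network range and verifies that the namespace, ingress, egress, and service CIDRs do not overlap. It assumes each namespace segment consumes one subnet of tkgsNamespaceSubnetPrefix size, which is my reading of the setting. The function names are hypothetical.

```python
import ipaddress


def namespace_subnet_count(cidr: str, prefix: int) -> int:
    """Number of /prefix subnets that fit into the given CIDR block."""
    network = ipaddress.ip_network(cidr)
    return 2 ** (prefix - network.prefixlen)


def overlapping_pairs(cidrs):
    """Return every pair of CIDR blocks that overlap."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for i, a in enumerate(networks)
        for b in networks[i + 1:]
        if a.overlaps(b)
    ]


# Values taken from the configuration file above.
print(namespace_subnet_count("10.244.0.0/19", 28))
print(overlapping_pairs([
    "10.244.0.0/19",      # tkgsNamespaceNetworkCidrs
    "10.204.211.0/25",    # tkgsIngressCidrs
    "10.204.211.128/25",  # tkgsEgressCidrs
    "10.96.0.0/22",       # tkgsWorkloadServiceCidr
]))
```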

Importing a configuration file is done at step 1 of the Supervisor service activation wizard:

For more details on Supervisors, TKG Service Clusters, and related concepts, check the vSphere documentation and its chapters on the vSphere IaaS Control Plane.

Step 2 – Create a vSphere Namespace

A vSphere Namespace is the runtime environment for TKG Service Clusters. You can create one using the vCenter UI, an API call, or kubectl.

In the vCenter UI:
- Navigate to Workload Management > Namespaces > New Namespace:
After creation, configure permissions, storage policies, and the VM Service. Here’s a snapshot of my vSphere Namespace configuration for this exercise:

Step 3 – Prepare the vSphere Namespace for the NSX Integration

Before deploying a TKG Service Cluster, we need to make sure the Antrea-NSX Adapter as well as Antrea Policy are enabled for any TKG Service Clusters being deployed within the Namespace. This is accomplished by adding an AntreaConfig to the Namespace. Below is the configuration file that I used in my lab:
```
# AntreaConfig.yaml
apiVersion: cni.tanzu.vmware.com/v1alpha1
kind: AntreaConfig
metadata:
  name: vks-cluster-1-antrea-package # The TKG Service Cluster name as prefix is required
  namespace: vks-ns-1
spec:
  antrea:
    config:
      featureGates:
        AntreaTraceflow: true # Facilitates network troubleshooting and visibility (Optional)
        AntreaPolicy: true # Enables advanced policy capabilities in Antrea (Required)
        NetworkPolicyStats: true # Provides visibility into the enforcement of network policies (Optional)
  antreaNSX:
    enable: true # This is the Antrea-NSX adapter which is disabled by default
```
Using kubectl, follow these steps:

Connect to the Supervisor endpoint for the Namespace:
```
kubectl vsphere login --server=10.204.212.2 --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify
```
Switch to your Namespace context:
```
kubectl config use-context vks-ns-1
```
Apply the YAML file containing the AntreaConfig:
```
kubectl apply -f AntreaConfig.yaml 
```
Step 4 – Deploy the TKG Service Cluster

Finally, deploy the TKG Service Cluster using a cluster specification file. The specification below is what I used in my lab:
```
# vks-cluster-1.yml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: vks-cluster-1 # The name of the TKG Service Cluster. Must match the prefix in AntreaConfig
  namespace: vks-ns-1 # The vSphere Namespace created in step 2
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["10.97.0.0/23"] # Internal non-routable IP CIDR for services
    pods:
      cidrBlocks: ["10.245.0.0/20"] # Internal non-routable IP CIDR for Pods
    serviceDomain: "cluster.local"
  topology:
    class: tanzukubernetescluster
    version: v1.30.1---vmware.1-fips-tkg.5
    controlPlane:
      replicas: 1 # The number of control plane nodes
    workers:
      machineDeployments:
        - class: node-pool
          name: node-pool-01
          replicas: 3 # The number of worker nodes
    variables:
      - name: vmClass
        value: best-effort-medium # The VM class you assigned during step 2
      - name: storageClass
        value: vsan-default-storage-policy # The storage policy you assigned during step 2
```
Apply the YAML file with the cluster specification to initiate the cluster deployment:
```
kubectl apply -f vks-cluster-1.yml
```
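The two manifests are tied together by a naming rule: the AntreaConfig name has to start with the cluster name. A small Python sketch of that check follows. The function names are hypothetical, and the `-antrea-package` suffix is simply taken from the AntreaConfig example in step 3.

```python
# Suffix as it appears in the AntreaConfig example from step 3.
ANTREA_SUFFIX = "-antrea-package"


def expected_antrea_config_name(cluster_name: str) -> str:
    """Return the AntreaConfig name that pairs with the given cluster name."""
    return f"{cluster_name}{ANTREA_SUFFIX}"


def names_match(antrea_config_name: str, cluster_name: str) -> bool:
    """Check that an AntreaConfig name belongs to the given cluster."""
    return antrea_config_name == expected_antrea_config_name(cluster_name)


# Names used in this post's manifests.
print(names_match("vks-cluster-1-antrea-package", "vks-cluster-1"))
```

Running a check like this before `kubectl apply` catches a renamed cluster whose AntreaConfig was left behind under the old name.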
Verifying Integration Status in NSX Manager

Once the TKG Service Cluster is fully deployed, we can head over to NSX Manager, where we should find our TKG Service Cluster under System > Configuration > Fabric > Nodes > Container Clusters > Antrea. Preferably, the cluster should appear in an “Up” state. 😊

Another place where we can verify the integration is under Inventory > Containers > Clusters:

Using the Integration

With the integration in place, NSX Manager provides centralized control over security policies within the TKG Service Cluster, leveraging Antrea-native policies. These policies are designed to enhance security by managing traffic between Kubernetes resources, such as Pods.

In NSX Manager, this is achieved by assigning Kubernetes Pods, Services, and/or Namespaces as members of NSX Antrea groups. These groups are then referenced in the rules created within the NSX Distributed Firewall interface.

Example of an NSX Antrea group with a Pod as a member:

Policies are applied to TKG Service Clusters at the policy-level “Applied To” scope. Rules within these policies can specify an Antrea group or IP addresses/CIDRs as the source. Instead of explicitly defining a destination, the “Applied To” field is used, targeting the Antrea group that contains the resources serving as the destination.

Using NSX Generic groups, you can secure traffic between virtual machines and Kubernetes constructs such as Services, Namespaces, and Ingresses, among others.

Example of an NSX Generic group using dynamic criteria to “pick up” a specific Kubernetes Service:

There are no special considerations in this scenario when it comes to the firewall policies or the rules within the policies.

Summary

That’s it! This short guide demonstrated how to integrate TKG Service Clusters with NSX, enabling centralized management of security policies across clusters and platforms like Kubernetes and vSphere. The process involves enabling the Supervisor service, creating a Namespace, preparing it for NSX, and finally deploying the TKG Service Cluster.

To learn more about the integration between the Antrea CNI and NSX you should have a look at the official documentation.

Thank you for reading! Feel free to share your thoughts or ask questions in the comments below.
SDDC.Lab v6 Released

January 26, 2024
Slow and steady. That’s how I would describe the pace and progress around making SDDC.Lab version 6 the new default and recommended version of the project.

If you’re not familiar with the SDDC.Lab project, it’s a collection of Ansible Playbooks that perform fully automated deployments of nested VMware Software Defined Data Center environments called pods. Each pod consists of solutions like vSphere, vSAN, NSX, Tanzu, NSX Advanced Load Balancer, Aria Operations for Logs, and VyOS Router.

What’s New?

Product Versions

As always a wide range of product versions can be deployed using the SDDC.Lab scripts. The latest versions that we tested are:
- vCenter Server version 8 Update 2
- ESXi version 8 Update 2
- NSX version 4.1.2.1
- vSphere with Tanzu version 8 Update 2
- Aria Operations for Logs version 8.14.1
- NSX Advanced Load Balancer version 30.1.1
- VyOS Latest Rolling Nightly Build
- Ubuntu Server 22.04 (for ISC BIND)
New Features and Improvements

Luis and I recommend that you have a look at the project’s CHANGELOG.md for a detailed list of all the new features and improvements that were added in version 6. The list below highlights some of the main ones:
- Automated deployment of NSX Advanced Load Balancer including configuration of the NSX integration (NSX-T Cloud)
- BFD is configured and enabled by default between the NSX Tier-0 Gateway and the VyOS router
- SDDC.Lab credentials file for Firefox that automates logins to the different management UIs
- Improved VyOS Router deployment process (thanks rexit1982)
- Improved DNS Server deployment process
- Added additional checks that validate prerequisites for successful SDDC.Lab pod deployments before launching a deployment
- Corrected an issue with the BGP peering between the VyOS router and the physical L3 switch
Besides this, we’ve worked on many smaller items, such as code optimizations, that improve the stability and performance of SDDC.Lab pod deployments.

How to Get Started?

Getting started with SDDC.Lab v6 is quite easy. You simply head over to the GitHub repository and read through the README.md which contains all the information you need to successfully deploy your SDDC.Lab pods.

Summary

SDDC.Lab version 6 is the most stable and mature release in the history of the project. It comes with some good improvements and new features and we really hope you will appreciate it.

We have plans and ideas for the next release and a new development branch for SDDC.Lab is in place already. Check it out if you want to follow what’s coming next in the project.
Quick Tip: NSX Advanced Load Balancer for vSphere Tanzu with NSX Networking

November 14, 2023
As of NSX version 4.1.1, NSX Advanced Load Balancer version 22.1.4, and vSphere with Tanzu version 8.0 Update 2, we have the option to use the NSX Advanced Load Balancer as the load balancer provider for new deployments of vSphere with Tanzu backed by NSX networking.

This deployment option is a very welcome addition knowing that the NSX “native” load balancer is scheduled for deprecation in a future release.

Registering NSX Advanced Load Balancer with NSX

After deployment and the initial configuration of NSX and the NSX Advanced Load Balancer (detailed steps available in the vSphere with Tanzu documentation) we register the NSX Advanced Load Balancer with NSX Manager. This is accomplished with a simple API call:
```
PUT /policy/api/v1/infra/alb-onboarding-workflow 
```
The accompanying request body contains the following keys:
```
{
  "owned_by": "LCM",
  "cluster_ip": "<nsx-alb-controller-cluster-ip>",
  "infra_admin_username": "username",
  "infra_admin_password": "password",
  "dns_servers": ["<dns-servers-ips>"],
  "ntp_servers": ["<ntp-servers-ips>"]
}
```
Bringing this together in a curl one-liner could look something like this:
```
curl -u admin --location --request PUT 'https://ams-nsxt-lm/policy/api/v1/infra/alb-onboarding-workflow' \
--header 'X-Allow-Overwrite: True' \
--header 'Content-Type: application/json' \
--data-raw '{
"owned_by": "LCM",
"cluster_ip": "10.203.200.15",
"infra_admin_username" : "admin",
"infra_admin_password" : "VMware1!",
"dns_servers": ["10.203.0.5"],
"ntp_servers": ["10.203.0.5"]
}'
```
More information about this method can be found in the NSX API documentation.
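
If you prefer scripting the call, the same request can be assembled in Python. This is a minimal sketch using only the standard library. It builds the request but deliberately does not send it, the function name `build_onboarding_request` is my own, and the passwords are placeholders you would replace with your own values.

```python
import base64
import json
import urllib.request


def build_onboarding_request(nsx_manager, username, password, body):
    """Build (but do not send) the PUT request for the ALB onboarding workflow."""
    url = f"https://{nsx_manager}/policy/api/v1/infra/alb-onboarding-workflow"
    # HTTP basic authentication, the equivalent of curl's -u option.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "X-Allow-Overwrite": "True",
            "Content-Type": "application/json",
        },
    )


# Same keys and lab values as in the curl example; passwords are placeholders.
body = {
    "owned_by": "LCM",
    "cluster_ip": "10.203.200.15",
    "infra_admin_username": "admin",
    "infra_admin_password": "<nsx-alb-admin-password>",
    "dns_servers": ["10.203.0.5"],
    "ntp_servers": ["10.203.0.5"],
}

request = build_onboarding_request("ams-nsxt-lm", "admin", "<nsx-admin-password>", body)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call. In a lab with self-signed certificates you would also need to decide how to handle TLS verification.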

When the registration is done you’ll notice that a shortcut to the NSX Advanced Load Balancer Controller UI has been added to the NSX Manager UI. Handy!

More importantly, when we enable the Tanzu Supervisor and/or deploy a Tanzu Kubernetes cluster under this Supervisor, we see that it is the NSX Advanced Load Balancer that’s hosting the VIP(s) on its Service Engine(s):

Summary

A very short article just to make you aware of this option and how it’s configured. I’m happy to see that customers can now use the NSX Advanced Load Balancer for new installations of vSphere with Tanzu backed by NSX networking.