Rebuilding a Broken AVD Estate: Fixing FSLogix Profile Failures at the Authentication Layer

A UK finance SME's Azure Virtual Desktop estate was losing FSLogix profiles on a recurring basis — the kind of fault that erodes trust in a whole platform. It had been treated as a timing problem for months. It wasn't: the profiles share had no authentication identity at all. This is the diagnosis, the rebuild around the real root cause, and the gates I left behind.

At a glance


Problem	Recurring FSLogix profile-mount failures on a pooled, multi-session Windows 11 AVD estate — worst after reboots
Wrong theory	A token "race condition" — plausible only because reboots triggered it
Root cause	The Azure Files share had no identity-based auth; profiles mounted on the storage-account key, with nothing to re-authenticate after a reboot
Fix	Rebuilt on Entra Kerberos (AADKERB) identity, configuration layered via Intune, deployed from the Marketplace image through a gated sequence
Outcome	Clean mounts, fixed at root cause; per-user ownership holds; deferred hardening logged with its gates

Anonymisation note: All client, host, tenant, storage and account identifiers are removed or generalised; the engineering is reproduced faithfully and nothing here identifies the client. Scope is the session-host build, FSLogix storage, and the hybrid-identity auth model. The network layer is deliberately thin (flat NSG, Windows App client), and tiered segmentation is named as future hardening, not claimed as built.

1. The symptom — and the wrong theory

A pooled, multi-session Windows 11 AVD estate with FSLogix containers on Azure Files. Profiles intermittently failed to mount — most visibly after the reboots Windows Update triggers. The standing theory was a token race condition: reboots firing before authentication was ready, so the mount lost its window.

It fit the timing, which is exactly why it had survived. It didn't survive assessment.

2. The real root cause

The storage account's identity-based authentication property came back null — not misconfigured, absent. No Entra Kerberos, no AD DS. Profiles were mounting on the storage-account key baked into the FSLogix config, with zero identity RBAC behind them.

That reframed everything. A key-mounted share has no resilient identity to fall back on: when a reboot disturbed the mount path, there was nothing to re-authenticate with — so it failed, and looked like a race because reboots were the trigger. The fix was never better timing. The share needed a real identity.
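
A minimal sketch of the check that exposed the root cause. It assumes the storage account's properties are available as a dictionary shaped like the JSON from `az storage account show`, with the identity setting under `azureFilesIdentityBasedAuthentication.directoryServiceOptions`. Verify those field names against your own output. The function name and return strings are illustrative, not taken from the engagement.

```python
from typing import Optional

# Assumed values of directoryServiceOptions and what each one means.
IDENTITY_METHODS = {
    "AADKERB": "Entra Kerberos",
    "AD": "on-premises AD DS",
    "AADDS": "Entra Domain Services",
}

KEY_ONLY = "none (key-only access)"


def classify_share_auth(account: dict) -> str:
    """Describe which identity-based auth method, if any, the account uses."""
    auth: Optional[dict] = account.get("azureFilesIdentityBasedAuthentication")
    if not auth:
        # Property absent or null: the state found on the broken estate.
        return KEY_ONLY
    option = auth.get("directoryServiceOptions")
    # Anything unrecognised (including an explicit "None") counts as no identity.
    return IDENTITY_METHODS.get(option, KEY_ONLY)
```

A null property and an explicit `"None"` both classify as key-only, which is the condition to treat as a finding rather than a default.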

3. The rebuild decisions

Four decisions up front — the reasoning is the substance:

Reuse the anchors, rebuild the hosts. The host pool, workspace and desktop app group are what the client is subscribed to; recreate them and every user's connection breaks. Session hosts are disposable, so I preserved the three anchors and rebuilt only the hosts. Hosts are cattle; the anchors are not.
Fix the root cause, not the symptom. Move the storage to identity-based auth with Entra Kerberos (AADKERB) instead of papering over a key mount with timing tweaks.
Layer everything through Intune — no gold image. FSLogix, locale, lockdown and the drive map are delivered by Intune policy to a dynamic device group keyed on host naming. No baked-in GPO, no captured image — hosts reproducible from a known-good Marketplace base.
Standardise both hosts identically and verify by comparison before sign-off — no "works on one, not the other" drift.
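
The "verify by comparison" step can be sketched as a plain dictionary diff. This assumes each host's effective configuration has already been exported to a flat dictionary. The setting names in the usage example are hypothetical, and how the export is produced is out of scope.

```python
MISSING = "<missing>"  # placeholder for a setting present on only one host


def config_drift(host_a: dict, host_b: dict) -> dict:
    """Return {setting: (value_on_a, value_on_b)} for every setting that differs."""
    drift = {}
    for key in sorted(set(host_a) | set(host_b)):
        a = host_a.get(key, MISSING)
        b = host_b.get(key, MISSING)
        if a != b:
            drift[key] = (a, b)
    return drift


# Hypothetical exports from two session hosts.
host_1 = {"locale": "en-GB", "fslogix_enabled": 1, "drive_map": "S:"}
host_2 = {"locale": "en-US", "fslogix_enabled": 1}

for setting, (a, b) in config_drift(host_1, host_2).items():
    print(f"{setting}: host 1 = {a}, host 2 = {b}")
```

An empty result is the sign-off condition. Anything else is drift to resolve before release.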

Why Entra Kerberos over the alternatives

The share's starting point was a storage-account key — no identity at all, which was the root cause. Replacing it meant choosing among Azure Files' three identity-based authentication methods, and the choice was driven by what the organisation actually is — a hybrid-identity shop running Entra-joined AVD hosts.

Storage key (not an identity method — the status quo being replaced) — no identity, no RBAC. This was the root cause. Rejected.
On-premises AD DS — requires the session hosts to be domain-joined to on-premises Active Directory. These hosts are Entra-joined and cloud-managed. Rejected.
Microsoft Entra Domain Services — a Microsoft-managed domain the hosts would instead join. It still imposes a domain-join model — and a managed domain to run and pay for — on an estate that is deliberately cloud-managed. Rejected.
Entra Kerberos (AADKERB) — lets an Entra-joined host authenticate to Azure Files using the user's own Entra identity, with Kerberos tickets retrieved from Entra ID, no domain join required. Exactly right for a cloud-managed AVD estate. Chosen.

[!DANGER]

Entra Kerberos (AADKERB) is load-bearing and fails silently, so understand it fully before changing anything. Three conditions take the entire profiles share offline with no honest error message: admin consent missing on the auto-created storage app registration (ticket retrieval fails quietly); the cloud-Kerberos client flag not applied, or the host not rebooted after it (mounts fail with 1326); and, in a hybrid tenant, any break in the on-premises AD anchor that underpins the whole trust (see §4). None of these presents as "authentication failed"; they look like timing, networking, or nothing at all. Treat every AADKERB change as a tested, pilot-validated change, never ad hoc.

Two permission layers — the part most people conflate

Access to an Azure Files share runs through two independent layers, and both must be right. Conflating them is the classic "it should work but doesn't":

Layer	Controls	Set by
Azure RBAC (SMB)	Whether an identity can connect to the share at all	Role assignment — Storage File Data SMB Share Contributor on the user group
NTFS ACL (filesystem)	What that identity can do inside the share	`icacls`, applied by mounting the share with a privileged identity

On the share root, the standard FSLogix per-user NTFS pattern applies: Authenticated Users create and traverse at the root only; CREATOR OWNER gets full control of its own subfolders and files; Administrators and SYSTEM get full control throughout. Each user owns their profile folder and can't touch anyone else's. RBAC gets you through the door; NTFS lets you use the room — most "phantom" permission faults live in the gap.
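
The two layers can be modelled as two independent checks that are evaluated in order. This is a sketch of the triage logic only. The class and field names are hypothetical, and the booleans would come from whatever evidence you gather (role assignments, `icacls` output).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCheck:
    rbac_role_assigned: bool  # share level: may the identity connect at all?
    ntfs_allows: bool         # filesystem level: may it act on the target path?


def diagnose(check: AccessCheck) -> str:
    """Name the first layer that blocks access, or 'ok' if both pass."""
    if not check.rbac_role_assigned:
        return "RBAC: assign the SMB share role to the user's group"
    if not check.ntfs_allows:
        return "NTFS: fix the ACL on the share root or the profile folder"
    return "ok"


# A user in the right group who still gets Access Denied points at NTFS.
print(diagnose(AccessCheck(rbac_role_assigned=True, ntfs_allows=False)))
```

Checking RBAC first matters because a failed share-level connection never reaches the NTFS evaluation.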

Rebuild the profiles, don't migrate them

The inherited containers came from a broken auth model with a documented history of mount failures — so this was an explicit choice, not a default:

Option	Pros	Cons
Migrate existing containers	Preserves user customisation and cached state; no first-login re-setup	Carries state forward from the exact system being replaced; risks inheriting corrupted or locked containers; migration tooling adds its own failure surface
Rebuild fresh (chosen)	Clean baseline under the corrected auth model; nothing carried over from the broken estate; no migration step to go wrong	Users start on new profiles; only safe where profile-resident data isn't the only copy

Decision: rebuild fresh. Re-mounting suspect containers onto the corrected model risks reintroducing the exact state I was there to eliminate. Gate — when I'd migrate instead: only when the containers are known-good and hold the sole copy of their data. Neither held here. I kept SID-first folder naming so the new containers match the old convention — consistency of standard, not of data.

4. The gated build

Several settings must land before first login, or the original failure simply replays in a new costume:

Enable AADKERB + grant admin consent on the auto-created app registration. Hard gate — nothing downstream proceeds until both are green.
Storage SMB RBAC: Storage File Data SMB Share Contributor for the user group; Storage File Data SMB Share Elevated Contributor for admins.
NTFS ACLs inside the share — the per-user filesystem layer.
Intune Settings Catalog — FSLogix config plus the cloud-Kerberos retrieval flag, to the device group, before first login.
Deploy hosts from the Marketplace image into the existing pool, Entra-joined and Intune-enrolled.
Reboot — cloud Kerberos doesn't activate until after one.
Applications, locale, lockdown.
Validate — pilot login, FSLogix Operational log, clean mount and write-back before release.
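
The flag and reboot steps in the sequence above interact: the flag must be present, and the host must have booted after it was applied. The sketch below assumes you can obtain a snapshot of the host's settings, the time the policy applied, and the last boot time. `CloudKerberosTicketRetrievalEnabled` is the Kerberos client value name as I understand it, so confirm it against your own policy. Everything else is hypothetical naming.

```python
from datetime import datetime

KERBEROS_FLAG = "CloudKerberosTicketRetrievalEnabled"  # assumed value name


def kerberos_ready(settings: dict, flag_applied_at: datetime,
                   last_boot: datetime) -> tuple[bool, str]:
    """Return (ready, reason) for the cloud-Kerberos precondition on one host."""
    if settings.get(KERBEROS_FLAG) != 1:
        return False, "flag not applied"
    if last_boot <= flag_applied_at:
        # The setting exists but the host has not restarted since it landed.
        return False, "reboot required"
    return True, "ready"


applied = datetime(2025, 1, 10, 9, 0)
print(kerberos_ready({KERBEROS_FLAG: 1}, applied, datetime(2025, 1, 10, 8, 0)))
```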

Go / no-go gates

Five of those are true gates, not just sequence. Cross one before it's green and the build fails in a way that's expensive to unwind:

Gate	Must be green before proceeding	Failure mode if skipped
Auth foundation	AADKERB enabled and admin consent granted on the auto-created app registration	Kerberos ticket retrieval fails silently — a mount error that looks like everything and nothing
Pre-login config	FSLogix config + cloud-Kerberos retrieval flag applied to the device group before first login	The original key-mount failure simply replays on the new hosts
Post-enrolment reboot	Hosts rebooted after Intune enrolment	Cloud Kerberos stays inactive — a fresh host fails to mount with error `1326`
Real-user validation	A standard, non-admin user mounts and writes back cleanly	An admin test account masks NTFS/SPN faults and the failure ships to a real user — which it nearly did
Policy proof	Disconnected-session limit fires; FSLogix shows a clean unload/compact on forced logoff	Profiles stay locked to dead sessions; write-back integrity goes unverified
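
One way to enforce "nothing downstream proceeds until green" is to evaluate the gates strictly in order and stop at the first failure. This is a sketch with hypothetical gate names. The check functions are placeholders for whatever evidence each gate really needs.

```python
from typing import Callable, Optional

Gate = tuple[str, Callable[[], bool]]


def first_red_gate(gates: list[Gate]) -> Optional[str]:
    """Run gates in order; return the name of the first that fails, else None.

    Later gates are not evaluated once one fails, so a downstream check can
    never report green on top of a broken foundation.
    """
    for name, check in gates:
        if not check():
            return name
    return None


gates = [
    ("auth foundation", lambda: True),
    ("pre-login config", lambda: True),
    ("post-enrolment reboot", lambda: False),  # simulated failure
    ("real-user validation", lambda: True),
]
blocked_at = first_red_gate(gates)
print("proceed" if blocked_at is None else f"stop at: {blocked_at}")
```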

Marketplace image, not a gold image

The plan was a captured gold image; I abandoned it at the Windows 11 24H2 sysprep wall — a protected system package can't be removed and blocks generalize with 0x80073cf2, and both standard workarounds (re-registering the package, the special-profiles registry flag) failed. Deploying both hosts from the stock Marketplace Windows 11 Enterprise multi-session image and layering through Intune removed the sysprep dependency entirely — the better long-term posture anyway.

FSLogix choices worth calling out

SID-first folder naming (FlipFlop disabled) — keeps the prior naming convention so the layout stays recognisable and any SID-first tooling still holds. Consistency of standard, not of data.
Prevent login with temp profile (fail-closed) — if a profile can't mount, the user is stopped, not silently dropped onto a throwaway temp profile that loses their work at logoff. For a finance team, failing loud is correct.
Profile container size cap (SizeInMBs) — FSLogix grows each container dynamically up to a maximum that defaults to 30 GB; a profile that reaches the cap starts throwing errors, and the value can be raised at any time but never shrunk. I sized it to 50 GB here to leave headroom for Outlook OST and OneDrive/Teams cache rather than discover the ceiling through user-facing failures. The rule is simple: size it to real data, not the default — and revisit it as that data grows.
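
The three choices above map onto a small set of FSLogix profile settings. The sketch below builds them as a dictionary and refuses to lower the size cap. `VHDLocations`, `FlipFlopProfileDirectoryName`, `PreventLoginWithTempProfile` and `SizeInMBs` are FSLogix setting names. The share path, the function name, and the 1 GB = 1000 MB convention (chosen to match the 30000 MB default) are assumptions.

```python
FSLOGIX_DEFAULT_SIZE_MB = 30_000  # default cap, described above as 30 GB


def build_profile_settings(vhd_location: str, size_gb: int,
                           current_size_mb: int = FSLOGIX_DEFAULT_SIZE_MB) -> dict:
    """Return the profile-container settings, rejecting any attempt to shrink the cap."""
    size_mb = size_gb * 1000  # assumed convention: 30 GB <-> 30000 MB
    if size_mb < current_size_mb:
        raise ValueError(
            f"cap can be raised but not shrunk: {size_mb} < {current_size_mb}"
        )
    return {
        "Enabled": 1,
        "VHDLocations": vhd_location,
        "FlipFlopProfileDirectoryName": 0,  # 0 keeps SID-first folder names
        "PreventLoginWithTempProfile": 1,   # fail closed instead of temp profile
        "SizeInMBs": size_mb,
    }


# Hypothetical share path; substitute the real UNC.
settings = build_profile_settings(r"\\storage.example\profiles", size_gb=50)
```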

Patching: an update ring with out-of-hours reboots

Reboots were the trigger for the original failures, so patching is governed, not left to each host's defaults. Windows updates are delivered through a Windows Update for Business ring — Intune-managed deferrals plus an install deadline — with installation and reboots scheduled outside working hours. That keeps the estate current without an update reboot landing mid-session on a finance user, and means the post-reboot profile remount happens when no one is on the box. Active hours and the reboot window track the client's working pattern and are adjusted as it changes.
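
A quick way to sanity-check a proposed reboot time against the client's active hours. The hours shown are placeholders, not the client's real pattern, and the function handles active windows that wrap past midnight.

```python
from datetime import time


def is_out_of_hours(t: time, active_start: time, active_end: time) -> bool:
    """True when t falls outside the active window [active_start, active_end)."""
    if active_start <= active_end:
        in_window = active_start <= t < active_end
    else:
        # Active window wraps past midnight, e.g. 22:00 to 06:00.
        in_window = t >= active_start or t < active_end
    return not in_window


# Placeholder working pattern: 08:00 to 18:00.
print(is_out_of_hours(time(2, 0), time(8, 0), time(18, 0)))
```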

Storage governance: the CannotDelete lock

The profiles account is enrolled in Azure Backup, which applies an automatic CannotDelete lock. It surprised me mid-build — a retained staging share couldn't be deleted until the backup association was understood — and it must not be removed casually, because the same lock protects the live profiles share. Deliberate, not an error; worth knowing before someone "tidies up" a share and quietly strips the protection off production.
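
Before any clean-up, it is worth listing the locks that would block a delete rather than discovering them by failure. The sketch assumes lock records shaped like the JSON from `az lock list`, each with a `name` and a `level` of `CanNotDelete` or `ReadOnly`. Confirm the shape in your tenant. The lock name in the example is made up.

```python
BLOCKING_LEVELS = {"cannotdelete", "readonly"}  # compared case-insensitively


def delete_blockers(locks: list[dict]) -> list[str]:
    """Return the names of locks that would prevent deleting the resource."""
    return [
        lock.get("name", "<unnamed>")
        for lock in locks
        if str(lock.get("level", "")).lower() in BLOCKING_LEVELS
    ]


# Hypothetical lock list for the profiles storage account.
locks = [{"name": "backup-protection", "level": "CanNotDelete"}]
blockers = delete_blockers(locks)
if blockers:
    print("Do not delete; investigate locks first:", ", ".join(blockers))
```

The point is to surface the lock and ask why it exists, not to script its removal.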

Network posture — flat, and why that's workable

Minimal by design: a single flat NSG, access via the Windows App client. Workable rather than reckless, because AVD uses an outbound reverse-connect transport — hosts dial out to the control plane, so nothing inbound is exposed. The real access boundary is therefore identity and app-group assignment (Entra-joined hosts, desktop application group membership), not network segmentation. Tiered NSGs and private endpoints for the storage are the obvious next hardening step — named as honest future work, not claimed as delivered.

The hybrid-identity dependency

Because the organisation is hybrid, the Entra Kerberos trust is anchored on an on-premises AD DS object carrying the storage account's service principal name, synced into Entra ID alongside the users. Every ticket an Entra-joined host requests to mount a profile is validated against that synced chain. The on-premises directory is not a legacy bystander — it is part of the live FSLogix authentication path.

[!DANGER]

The on-premises AD DS / AD Connect sync is part of the live authentication path — do not touch it without a migration plan. Decommissioning on-prem AD DS, breaking the directory sync, or deleting/altering the storage account's SPN object breaks Kerberos ticket validation and fails every profile mount across the estate at once — the exact failure this project eliminated, only worse, and hitting everyone simultaneously. Moving to cloud-only is achievable, but only as a planned migration: re-establish and validate the cloud-only Kerberos model against a pilot user first, then retire the anchor — never the reverse.

5. Lessons

The dead ends were as instructive as the fixes. Five worth keeping:

Cloud Kerberos needs a reboot. A fresh second host failed to mount FSLogix with error 1326; the cloud-Kerberos client setting simply hadn't taken effect yet. On a fresh host showing 1326, rule the reboot out first — cheapest check, and it was the answer.
An admin test account will hide a real permissions fault. A standard user hit Access Denied (Error 5) where my admin account had worked perfectly — admins have broad access regardless, so it had masked the fault. RBAC was fine (the user was in the right group), which pointed straight at the NTFS layer and the Kerberos SPN anchor. Always validate identity-based storage with a non-admin — the admin lies to you by succeeding.
Elevated installers can't authenticate to an AADKERB share. An MSI run directly from the share failed with 1314: the elevated context (SYSTEM/admin) can't authenticate, because the Kerberos identity belongs to the user session. Copy the installer local, then run it — true of any elevated MSI sourced from an identity-based share.
Scheduled tasks don't work for drive mapping in AVD. A logon-triggered task reported "mapped OK" but the drive was invisible to the user: a scheduled task runs in a different session and token, and mapped drives are per-session. The fix is a per-user mechanism in the user's own interactive session — with the documented trade-off that the drive takes a minute or two to appear, so the app mustn't launch before it's present. (The data sits on an interim on-prem workgroup box, reached by IP.)
The workgroup-server bridge has an expiry date. A non-domain workgroup box can't authenticate Entra identities — the same fundamental limitation as the original key-mounted share. The interim mapping uses a local service-account credential that works now and disappears cleanly once the data moves onto Azure Files with AADKERB, mirroring the FSLogix model exactly. Only an identity-based file service (Azure Files + AADKERB) or a domain gives you identity-based SMB. Everything else is a temporary bridge — document it as one.
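
The error codes from these lessons can be kept as a small triage table, so the cheapest check is tried first. The causes and first checks are taken from the lessons above. The table structure and function name are illustrative.

```python
# code -> (likely cause in this estate, first thing to check)
TRIAGE = {
    1326: ("Cloud Kerberos not yet active on the host",
           "Confirm the client flag is applied, then reboot the host"),
    5: ("NTFS ACL or SPN fault that an admin account masked",
        "Retest with a standard, non-admin user"),
    1314: ("Elevated context cannot authenticate to the share",
           "Copy the installer to local disk, then run it"),
}


def triage(code: int) -> tuple[str, str]:
    """Return (likely cause, first check) for a known error code."""
    return TRIAGE.get(code, ("Not covered by these lessons", "Diagnose from logs"))


cause, first_check = triage(1326)
print(f"{cause} -> {first_check}")
```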

6. Deferred work — and the gates

Two changes are queued for later. Both are planned migrations, not deletions, and on each the order of operations is the whole game — the gate is the deliverable.

Retire hybrid identity (move to cloud-only).

Why: removes the on-premises directory and its sync from the live auth path — fewer moving parts, no on-prem single point of failure under every mount.
Risk: the estate-wide blast radius set out in the §4 danger callout, if the steps are sequenced wrong.
Gate: migrate Azure Files auth to the cloud-only Kerberos model and validate the SPN/identity chain against a pilot user — then, and only then, retire the anchor. Never the reverse.

Decommission the interim workgroup data source.

Why: moves the line-of-business data onto Azure Files with AADKERB, retiring the last local-credential bridge and making the drive map Entra-identity based — the model already proven for FSLogix.
Why it's lower-risk: the workgroup box isn't a domain controller, so it doesn't carry the hybrid retirement's blast radius.
Gate: migrate the data and prove the new mount works before retiring the old path.

7. Outcome

Live and in production. Both rebuilt session hosts serve users, FSLogix profiles mount cleanly, and the failures that triggered the engagement are resolved at root cause — not worked around. Regular users authenticate and mount their own profiles, the per-user ownership model holds, write-back on logoff is clean and verified, and users moved onto the rebuilt estate without losing the published-connection anchors.

One further hardening item remains open: EDR exclusions for the FSLogix containers. It is logged, not lost. It is not a blocker (profiles mount without the exclusions), but their absence can later masquerade as intermittent network latency, so it's recorded as a known follow-up rather than a surprise.

The headline isn't "AVD works again." It's that a failure everyone had read as a timing problem was an authentication problem — and naming it correctly turned an endless cycle of workarounds into a one-time fix.

If you're running AVD and fighting intermittent FSLogix profile-mount failures — or planning a rebuild and want the authentication model right the first time — I'm happy to talk it through. The hard part is rarely the deployment; it's diagnosing the layer the symptom is hiding.