Fix AKS ImagePullBackOff: ACR Auth Failure

Quick answer: Run az aks update -n <cluster-name> -g <resource-group> --attach-acr <acr-name> to re-authorize the cluster against your ACR. In my experience, that fixes the root cause about 90% of the time.

If your Azure Kubernetes Service pod is stuck in a Pending state with an ImagePullBackOff status, I know the frustration. This tripped me up the first time I migrated a production workload to AKS. The status means the kubelet tried to pull your container image, failed, and is now backing off between retries. When the image lives in Azure Container Registry, the usual suspect is ACR authentication: maybe you rotated registry keys, recreated the ACR, or the cluster's managed identity lost its permissions. Or a manually created image pull secret went stale. Let's fix it step by step.

Why This Happens

Your AKS cluster uses a managed identity or service principal to pull images from ACR. If that identity loses the AcrPull role assignment, pulling fails.
You created a Kubernetes docker-registry secret manually, but the credentials expired or got corrupted.
Your pod spec references an image tag that doesn't exist in ACR (e.g., typos or wrong tag).

Step-by-Step Fix

Check the pod event log
Run kubectl describe pod <pod-name> -n <namespace> and read the Events section at the bottom. Look for "Failed to pull image" events that carry a 401 or 403 error. If you see "unauthorized: authentication required", you've got an ACR auth issue.
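If you triage these events often, a small helper can sort the message into a likely cause. This is a rough sketch, not an official tool: classify_pull_error is a hypothetical function name, and the substrings it matches are my assumptions about how registry errors are usually worded.

```shell
# Hypothetical helper: map an image-pull event message to a likely cause.
# The matched substrings are assumptions about typical error wording, so
# treat the result as a hint, not a diagnosis.
classify_pull_error() {
  local msg="$1"
  case "$msg" in
    *"unauthorized"*|*"401"*)            echo "auth" ;;
    *"403"*|*"denied"*|*"Forbidden"*)    echo "forbidden" ;;
    *"manifest unknown"*|*"not found"*)  echo "missing-image" ;;
    *"i/o timeout"*|*"no such host"*)    echo "network" ;;
    *)                                   echo "unknown" ;;
  esac
}
```

Paste the event message in as the single argument. Because the match is a plain substring check, an image tag that happens to contain 401 or 403 can fool it.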
Verify ACR exists and the image tag is correct
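Use az acr repository show-tags --name <acr-name> --repository <repository> --output table. If the command fails because the registry or repository can't be found, that's your problem right there. Otherwise, confirm the tag matches your pod spec exactly. A typo like v1.0 instead of 1.0 will trigger ImagePullBackOff.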
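To avoid eyeballing a long tag list, the same command can feed an exact-match check. A minimal sketch, assuming the az CLI is logged in; tag_exists is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the exact tag exists in the repository.
# Arguments: <acr-name> <repository> <tag>
tag_exists() {
  local acr="$1" repo="$2" tag="$3"
  # -o tsv prints one tag per line; grep -Fx requires a whole-line, literal match,
  # so "v1.0" does not match "1.0".
  az acr repository show-tags --name "$acr" --repository "$repo" --output tsv \
    | grep -Fxq -- "$tag"
}
```

Call it as tag_exists <acr-name> <repository> <tag> and check the exit status. It also returns non-zero when the az command itself fails, so a missing registry and a missing tag look the same here.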
Re-attach the ACR to your AKS cluster
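This is the most direct fix and the one I'd try first if you're sure the image tag is correct. Run:
```
az aks update -n <cluster-name> -g <resource-group> --attach-acr <acr-name>
```
This grants the AcrPull role on the registry to the cluster's kubelet managed identity, which is the identity the nodes use to pull images. It's idempotent, so there's no harm in running it multiple times. It does need an account that's allowed to create role assignments on the registry, such as Owner.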
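To confirm the role assignment actually landed, you can look it up directly. A sketch under a few assumptions: the cluster uses a managed identity (so identityProfile.kubeletidentity is populated), and has_acrpull is a hypothetical helper name.

```shell
# Hypothetical helper: check whether the kubelet identity holds AcrPull on the registry.
# Arguments: <resource-group> <cluster-name> <acr-name>
# Returns 0 if AcrPull is assigned, 1 if not, 2 if a lookup failed.
has_acrpull() {
  local rg="$1" cluster="$2" acr="$3" kubelet_id acr_id
  kubelet_id=$(az aks show -g "$rg" -n "$cluster" \
    --query "identityProfile.kubeletidentity.objectId" -o tsv) || return 2
  acr_id=$(az acr show -n "$acr" --query id -o tsv) || return 2
  # Only assignments scoped to the registry itself are listed here; add
  # --include-inherited to also see ones granted at the resource group or subscription.
  az role assignment list --assignee "$kubelet_id" --scope "$acr_id" \
    --query "[].roleDefinitionName" -o tsv | grep -Fxq "AcrPull"
}
```

The CLI also ships az aks check-acr, which tests the pull path from the cluster side and is worth running if this check passes but pulls still fail.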
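Delete any stale pull secrets
If you manually created a pull secret, it may hold expired credentials. Delete it:
```
kubectl delete secret <secret-name> -n <namespace>
```
Also remove the matching imagePullSecrets entry from your pod spec or service account, then restart the deployment: kubectl rollout restart deployment <deployment-name> -n <namespace>. Once the ACR is attached you don't need a pull secret at all: the kubelet on each node authenticates to the registry with the cluster's managed identity.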
Force a pod restart
After re-attaching, delete the stuck pod: kubectl delete pod <pod-name> -n <namespace>. If the pod belongs to a Deployment, its ReplicaSet will create a new one that should pull the image successfully.
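Once the new pods come up, it helps to check that nothing in the namespace is still failing to pull. Here's a minimal sketch; list_pull_failures is a hypothetical helper, and it only looks at regular containers, not init containers.

```shell
# Hypothetical helper: print pods whose containers are waiting on a failed image pull.
# Argument: <namespace>
list_pull_failures() {
  local ns="$1"
  # Emit "pod-name<TAB>waiting reasons" per pod, then keep the pull failures.
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 ~ /ImagePullBackOff|ErrImagePull/ { print $1 }'
}
```

Empty output means no pod in that namespace is currently stuck on a pull.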

Alternative Fixes (If the Main Fix Fails)

Manually create an ACR pull secret using admin credentials
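If managed identity isn't your thing (or you're on an older AKS version), you can use the ACR admin account. Keep in mind that the admin account is a single shared credential, so it's better suited to testing than production. Enable admin in the Azure portal under ACR > Access keys, then run: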
```
kubectl create secret docker-registry acr-secret \
  --docker-server=<acr-name>.azurecr.io \
  --docker-username=<acr-name> \
  --docker-password=$(az acr credential show --name <acr-name> --query "passwords[0].value" -o tsv) \
  --namespace <namespace>
```
Then reference this secret in your pod spec under imagePullSecrets.
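For reference, this is roughly where imagePullSecrets sits in a pod spec. The sketch prints a bare-bones Pod manifest from a shell function; render_pod_with_pull_secret and the container name app are hypothetical, and a real workload would more likely put the same field under a Deployment's pod template.

```shell
# Hypothetical helper: print a minimal Pod manifest that pulls with a named secret.
# Arguments: <pod-name> <image> <secret-name>
render_pod_with_pull_secret() {
  local name="$1" image="$2" secret="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
    - name: app
      image: ${image}
  imagePullSecrets:
    - name: ${secret}
EOF
}
```

Pipe the output into kubectl apply -n <namespace> -f - to try it. The secret has to live in the same namespace as the pod, or the kubelet won't find it.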
Check if your ACR is behind a firewall or private endpoint
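If your ACR restricts network access, the AKS nodes need a network path to it. With a private endpoint, the nodes must sit in the same VNet or a peered one and be able to resolve the registry's private DNS name. With firewall rules, the node subnet (through a service endpoint rule) or the cluster's outbound IP must be allowed. Run az acr show --name <acr-name> --query networkRuleSet to see restrictions. If needed, add the AKS node subnet to the ACR firewall allow list.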
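A quick way to see which of those situations you're in is to read two properties off the registry. This is a sketch with a hypothetical helper name, acr_network_summary, and the three labels it prints are my own shorthand, not Azure terminology.

```shell
# Hypothetical helper: summarize how the registry's network settings affect pulls.
# Argument: <acr-name>
# Prints "private-only", "restricted", or "open"; returns 2 if a lookup fails.
acr_network_summary() {
  local acr="$1" public action
  public=$(az acr show --name "$acr" --query publicNetworkAccess -o tsv) || return 2
  action=$(az acr show --name "$acr" --query networkRuleSet.defaultAction -o tsv) || return 2
  if [ "$public" = "Disabled" ]; then
    echo "private-only"   # reachable only through a private endpoint
  elif [ "$action" = "Deny" ]; then
    echo "restricted"     # public endpoint, but only allow-listed networks get in
  else
    echo "open"           # no network restriction found on the registry
  fi
}
```

If it prints open and pulls still time out, the block is more likely on the cluster side, such as an NSG or egress firewall.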
Regenerate the service principal credentials
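If you're using a service principal instead of managed identity, its secret might have expired. Generate a new secret for the service principal first (for example with az ad app credential reset), then pass the application ID and the new secret to the cluster:
```
az aks update-credentials -g <resource-group> -n <cluster-name> --reset-service-principal --service-principal <app-id> --client-secret <new-secret>
```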
Then re-attach the ACR.

Prevention Tips

Always use managed identity for ACR access—it's simpler and auto-rotates credentials. Enable it when creating your AKS cluster with --enable-managed-identity.
Set up AKS cluster auto-upgrade so you don't fall behind on patches that might fix auth issues.
Use a CI/CD pipeline that tags images with commit hashes or semantic versions, not latest. That way you never accidentally overwrite a tag.
Monitor pod events with Azure Monitor or a tool like kubectl events to catch ImagePullBackOff before it impacts users.
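As one way to act on the tagging tip, a build script can derive the tag from the current commit. A minimal sketch, assuming it runs inside a git checkout; image_ref is a hypothetical helper name.

```shell
# Hypothetical helper: build an image reference tagged with the short commit hash.
# Arguments: <acr-name> <repository>
image_ref() {
  local acr="$1" repo="$2" sha
  sha=$(git rev-parse --short HEAD) || return 1
  echo "${acr}.azurecr.io/${repo}:${sha}"
}
```

A pipeline could then run docker build -t "$(image_ref <acr-name> <repository>)" . and push that exact reference, so every deploy points at an image that never gets overwritten.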

I've seen this error pop up after AKS node pool upgrades or when someone accidentally deleted the ACR. The --attach-acr flag is your best friend: it re-creates the AcrPull role assignment that lets the cluster pull from the registry. If that doesn't work, check the firewall rules or rotate your secrets. You'll be back up in minutes.