Yes, the root cause of all security issues is that the Service Principal's contributor role assignment is at the subscription level/scope, which enables it to do quite damage specially if multiple applications are deployed to the same subscription (e.g. delete any resource group).
One approach would be:
- Provision one resource group for the Azure Key Vault specific to the application and region (the latter in case of geo-distributed applications).
- Provision the Azure Key Vault on the resource group created on the previous step.
In our case, the Security Office was responsible for the first 2 steps, where they had monitoring (e.g. email, text-messages, etc.) for any change in the Azure Key Vault (e.g. new keys/secrets/certificates added/deleted/changed, permission changes, etc.).
- Provision a second resource group, which will serve as a container for the application components (e.g. Azure Function, Azure SQL Server/Database, Azure Service Bus Namespace/Queue, etc.).
- Create the Service Principal and assign the Contributor role to the
application resource group only, for example:
scope =
/subscriptions/{Subscription Id}/resourceGroups/{Resource Group
Name}
Find a sample PS script to provision a Service Principal with custom scope at https://github.com/evandropaula/Azure/blob/master/ServicePrincipal/PS/Create-ServicePrincipal.ps1.
- Give appropriate permissions for the Service Principal in the Azure
Key Vault. In our case, we decided to have separate Service
Principal accounts for deployment (Read-Write permissions on keys/secrets/certificates) and runtime (Read-Only permissions on keys/secrets/certificates);
Find a sample PS script to set Service Principal permission on an Azure Key Vault at https://github.com/evandropaula/Azure/blob/master/KeyVault/PS/Set-ServicePrincipalPermissions.ps1.
Having that said, there are lots of inconveniences with this approach, such as:
- The process (e.g. via runbook) to provision the Azure Key Vault (including its resource group) and the application resource group will be outside of the main Terraform template responsible for the application components, which requires coordination with different teams/processes/tools/etc.
- Live site involving connectivity often involves coordination among multiple teams to ensure RTO and MTTM (Mean Time To Mitigate) goals are achieved.
- The Service Principal will be able to delete the application specific resource group when terraform destroy is executed, but it will fail to recreate it when running terraform apply after that due to lack of permission at the subscription level/scope. Here is the error:
provider.azurerm: Unable to list provider registration status, it is possible that this is due to invalid credentials or the service principal does not have permission to use the Resource Manager API, Azure error: resources.ProvidersClient#List: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client '' with object id '' does not have authorization to perform action 'Microsoft.Resources/subscriptions/providers/read' over scope '/subscriptions/{Subscription Id}'.".
Yeah, I know, this is a long answer, but the topic usually requires lots of cross-team discussions/brainstorming to make sure the security controls established by the Security Office are met, Developer productivity is not affected to the point that it will impact release schedules and RTO/MTTM goals are met. I hope this helps a bit!