A nostalgic prelude
Anyone remember building web apps in the old days? Write a PHP script, FTP it up to a server, hit refresh. No having to juggle React, npm, RubyGems, version locking, or incompatible versions of Python or Ruby or Java or whatever. No Ansible, no Chef or Puppet, no Docker containers, no Terraform, no Kubernetes. If you came from that world, the idea of infrastructure you could spin up and down as you needed it was a fairly tantalising prospect when Amazon launched AWS back in 2006. But then you dived in, and the sheer complexity made you pine for the old days of FTPing a PHP script up to a $5/mo LAMP server.
But slowly you adapted. Now everything, for better or worse, runs in someone's cloud: most likely Amazon's, or maybe Google's, Microsoft's, DigitalOcean's or Heroku's. Some of these services, like Heroku, Netlify and Zeit (or Vercel or whatever it calls itself now), hold your hand quite nicely and make deploying simple. You push the button (or git push), we do the rest (and charge you for the privilege).
If you've used AWS, though, you are left to do this yourself. In doing so, you've probably found yourself running up against the bureaucratic complexity of AWS Identity & Access Management, or IAM.
What IAM solves
Let's imagine you are running a company that does a few different things. You've got a web application written in something like Rails or Django. That runs on EC2, or maybe Beanstalk or ECS. It talks to a database running on RDS (or DynamoDB). You might have a message queue in SQS, and you plonk out some emails via SES or SendGrid.
But you’ve also got your employee database in an RDS instance. There’s some web UI that’s running on an EC2 box that talks to that employee database. Only your back office people (HR, finance, legal etc.) should be able to touch that database, because it is really sensitive and, well, the sort of thing the GDPR was enacted to try and protect.
A cloud platform like AWS gives you a big box of tools that you can use in coordination with one another to build a useful service for your users. A cloud platform like AWS gives an attacker a big box of tools they can use in coordination with one another to attack your users.
IAM policies let you manage access to your various AWS services by declaring a security policy that stops one service from reaching another unless you allow it. In the scenario described above, your web application servers should absolutely never be able to talk to your employee database.
(I’ve said AWS here, but both Google Cloud and Microsoft Azure have equivalents.)
The great thing about IAM policies is they can be really detailed and granular. The really quite annoying thing about IAM policies is they can be really detailed and granular.
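To make "granular" concrete, here is a hypothetical policy written as a Python dict. The queue name, region and account ID are invented for illustration. It allows exactly one action on exactly one queue. IAM implicitly denies anything that no Allow statement matches.

```python
import json

# Hypothetical identifiers, for illustration only.
REGION = "eu-west-1"
ACCOUNT_ID = "123456789012"
QUEUE_NAME = "orders"

# One statement, one action, one resource. Any request not matched by an
# Allow statement is implicitly denied by IAM.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sqs:SendMessage"],
            "Resource": [f"arn:aws:sqs:{REGION}:{ACCOUNT_ID}:{QUEUE_NAME}"],
        }
    ],
}

print(json.dumps(policy, indent=2))
```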
Why IAM can be frustrating
There are a variety of reasons why IAM policies are quite annoying to write.
Firstly, nobody likes writing configuration files. There are so many formats to choose from, and they all have some level of nasty attached. XML is too verbose, XSD is an over-engineered mess, and few people actually use a schema-aware editor (because that basically means either paying for something like Oxygen or using Emacs). JSON sucks because it doesn't allow comments, and nobody has really adopted JSON5 or the other variants that do. TOML and other INI-style formats are okay, I guess, but they feel a bit constrained. YAML is a whole mess of implicit type conversions. Yuck. Whatever format we use, it feels necessary to bolt on linters and Git commit hooks to try to catch human errors.
In addition, all of these configuration file formats tend to end up with some pre-compile step. Someone concludes "yeah, but I need a dynamic value here, so let's just run it through ERB or Jinja2 so we can render a value for this one particular thing". And now you are on a journey to a very bad place.
The wider problem is that IAM policies are basically a metadata layer between your code and your infrastructure. Metadata about code has a nasty habit of falling out of sync with the code. Partly that is because it is invisible, and invisible data tends to end up wrong. Partly it is because it repeats things you've already said in your code, usually implicitly.
You then change your code, and the policy ends up being wrong.
How do we cope with this?
Firstly, we can allow AWS to generate policies for us. This is fairly common: AWS provides a number of useful “wizard”-style tools that generate policies from inside the AWS Console. These are useful, but they are the policies AWS has written, and may not be what you want.
Secondly, we can create a very general and broad policy. A general and broad policy is better than no policy at all, but we get the most security value from locking our policies down as tightly as we can. There is a balance to be struck between developer productivity and security here, but if we make the policy too general, we end up prioritising productivity over security.
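The difference between "broad" and "locked down" is easy to see side by side. The first statement below allows every SQS action on every resource. The second names one action and one queue. The is_overly_broad helper is hypothetical: it is a crude heuristic that only looks at the Action and Resource fields, not a real policy analyser.

```python
def as_list(value):
    """IAM accepts either a single string or a list of strings."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def is_overly_broad(statement: dict) -> bool:
    """Return True if an Allow statement uses wildcard actions or resources.

    A rough heuristic for illustration: it flags "*" and "service:*"
    actions, and a bare "*" resource. It ignores Condition, NotAction, etc.
    """
    if statement.get("Effect") != "Allow":
        return False
    actions = as_list(statement.get("Action"))
    resources = as_list(statement.get("Resource"))
    wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
    wildcard_resource = any(r == "*" for r in resources)
    return wildcard_action or wildcard_resource


broad = {"Effect": "Allow", "Action": "sqs:*", "Resource": "*"}
narrow = {
    "Effect": "Allow",
    "Action": ["sqs:SendMessage"],
    "Resource": ["arn:aws:sqs:eu-west-1:123456789012:orders"],
}
print(is_overly_broad(broad))   # True
print(is_overly_broad(narrow))  # False
```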
We ideally want something that’s going to lock down each resource we create to the most restrictive security policy we can find, while also making it easy for those policies to evolve as our software changes.
A really simple approach to this is to have a policy autogenerator. AWS has built one into Chalice, a framework made by AWS to allow developers to build “serverless” functions in Python and easily deploy them to Lambda and API Gateway.
The Chalice framework replicates the familiar Flask API design, using decorators to wrap basic Python functions. It is delightfully simple to use.
Chalice's policy generator is worth reading, and it seems worth replicating in other languages and frameworks, and for cloud platforms other than AWS. The bulk of it is in policy.py and policies.json. The JSON file maps boto3 function calls to AWS IAM permissions.
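The idea behind such a lookup can be sketched in a few lines. This is not Chalice's code, and the table below is a hypothetical stand-in for policies.json. The sketch derives the API name from the boto3 method name and falls back to a table for calls whose permission name differs. S3's ListBuckets operation is one real case: it is authorised by s3:ListAllMyBuckets.

```python
# Hypothetical stand-in for a method-to-permission table. Chalice's real
# policies.json is much larger and its format may differ.
OVERRIDES = {
    # The S3 ListBuckets call is authorised by s3:ListAllMyBuckets, so a
    # naive name conversion would get it wrong.
    ("s3", "list_buckets"): "s3:ListAllMyBuckets",
}


def to_api_name(method_name: str) -> str:
    """Convert a boto3-style snake_case method name to CamelCase."""
    return "".join(part.capitalize() for part in method_name.split("_"))


def iam_action_for(service: str, method_name: str) -> str:
    """Map a boto3 client call to the IAM action it most likely needs."""
    override = OVERRIDES.get((service, method_name))
    if override is not None:
        return override
    return f"{service}:{to_api_name(method_name)}"


print(iam_action_for("sqs", "create_queue"))  # sqs:CreateQueue
print(iam_action_for("sqs", "send_message"))  # sqs:SendMessage
print(iam_action_for("s3", "list_buckets"))   # s3:ListAllMyBuckets
```

The naive conversion breaks on acronyms (describe_db_instances would become DescribeDbInstances rather than DescribeDBInstances). A real tool is better off taking operation names from botocore's service model than guessing them.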
So, for instance, consider this code to add an item to an SQS queue.
Chalice will read this and spot the SQS API calls you have made, such as sqs.send_message. It takes a method name like create_queue and turns it into CreateQueue (the underlying AWS API function name), then looks that up in the policies.json file and maps it to the corresponding IAM permission.
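The discovery step can be approximated with Python's ast module. This is a simplified sketch, not Chalice's analyser. It assumes a client is bound at module level with name = boto3.client("service"), and it then collects every name.method(...) call on that variable. The embedded Chalice app is hypothetical and is held as a string, so the sketch runs without chalice or boto3 installed.

```python
from __future__ import annotations

import ast

# A hypothetical Chalice app, kept as text so it is parsed, never executed.
APP_SOURCE = '''
import boto3
from chalice import Chalice

app = Chalice(app_name="queue-demo")
sqs = boto3.client("sqs")

@app.lambda_function()
def enqueue(event, context):
    queue = sqs.create_queue(QueueName="my-queue")
    sqs.send_message(QueueUrl=queue["QueueUrl"], MessageBody="hello")
'''


def find_client_calls(source: str) -> set[tuple[str, str]]:
    """Return (service, method) pairs for calls on boto3 clients.

    Only handles clients assigned as `name = boto3.client("service")`.
    """
    tree = ast.parse(source)
    clients: dict[str, str] = {}

    # First pass: remember which variable names hold which boto3 client.
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Attribute)
            and node.value.func.attr == "client"
            and isinstance(node.value.func.value, ast.Name)
            and node.value.func.value.id == "boto3"
            and node.value.args
            and isinstance(node.value.args[0], ast.Constant)
        ):
            clients[node.targets[0].id] = node.value.args[0].value

    # Second pass: find calls of the form <client_var>.<method>(...).
    calls = set()
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id in clients
        ):
            calls.add((clients[node.func.value.id], node.func.attr))
    return calls


for service, method in sorted(find_client_calls(APP_SOURCE)):
    action = service + ":" + "".join(p.capitalize() for p in method.split("_"))
    print(f"{service}.{method} -> {action}")
```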
If you run the policy autogenerator, you'll end up with an IAM policy that contains the specific permissions used by this script and no more. Here's what Chalice generated for me:
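The sketch below gives a rough, hypothetical picture of the shape such a policy takes. It is not Chalice's verbatim output. It wraps a set of discovered actions in a single Allow statement. The Resource is "*", so the permissions are narrowed by action but not by resource, which is the gap the next section deals with.

```python
from __future__ import annotations

import json


def build_policy(actions: set[str]) -> dict:
    """Wrap a set of IAM actions in a single Allow statement.

    Resource is left as "*": the actions are narrowed, but not the
    resources they apply to.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),  # sorted for stable, diffable output
                "Resource": ["*"],
            }
        ],
    }


print(json.dumps(build_policy({"sqs:SendMessage", "sqs:CreateQueue"}), indent=2))
```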
Customising the policy with more AST parsing
What Chalice is doing is pretty great. It means we can write Lambda functions that have quite tightly restricted policies, and those policies get modified as we change our code.
But let's go back to our initial problem: in a complex organisation, we might have a whole bunch of different systems sitting in our AWS account. My Lambda function can currently create any queue and send messages to any queue. A malicious actor who somehow got access to this system could use it to put malicious messages into queues for other systems: our orders database, our HR system, whatever. That's bad. We want to narrow the code's access to the queue it actually uses.
We can use AST parsing to generate a much more specific policy, by writing code that reads values from our code and uses them to pin down exactly which resource we're dealing with.
The first thing we do is take that queue name out and make it a constant.
Then we can use AST parsing to extract that value back out of our code.
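Here is a minimal sketch of that extraction. It assumes the constant is a plain string assigned at module level, as in QUEUE_NAME = "...". The helper name get_module_constant and the sample source are hypothetical. Only top-level statements are inspected, so a local variable with the same name inside a function is not picked up by mistake.

```python
import ast
from typing import Optional

# Hypothetical app source; in practice you would read app.py from disk.
SAMPLE_APP = '''
import boto3
from chalice import Chalice

QUEUE_NAME = "orders-to-fulfil"

app = Chalice(app_name="queue-demo")
sqs = boto3.client("sqs")
'''


def get_module_constant(source: str, name: str) -> Optional[str]:
    """Return the string assigned to `name` at module level, if any."""
    tree = ast.parse(source)
    for node in tree.body:  # top-level statements only
        if (
            isinstance(node, ast.Assign)
            and any(isinstance(t, ast.Name) and t.id == name for t in node.targets)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            return node.value.value
    return None


print(get_module_constant(SAMPLE_APP, "QUEUE_NAME"))  # orders-to-fulfil
```

Because the module is parsed rather than imported, none of its imports (chalice, boto3) need to be installed on the machine that generates the policy.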
This function will get the value of QUEUE_NAME from our code. Now we need some code that'll take our Chalice-generated policy and modify it.
We want to generate a very specific ARN selector to put in the policy, so that it only covers our queue.
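Here is a sketch of the modification step. It assumes a policy made of statements with Action and Resource lists. The hypothetical restrict_sqs_resources helper replaces the Resource on any statement made up only of SQS actions with a single queue ARN. SQS queue ARNs take the form arn:aws:sqs:REGION:ACCOUNT_ID:QUEUE_NAME. The region and account ID below are placeholders.

```python
import copy


def sqs_queue_arn(region: str, account_id: str, queue_name: str) -> str:
    """Build the ARN of a standard SQS queue."""
    return f"arn:aws:sqs:{region}:{account_id}:{queue_name}"


def restrict_sqs_resources(policy: dict, queue_arn: str) -> dict:
    """Return a copy of `policy` with SQS-only statements scoped to one queue.

    Statements that mix SQS with other services are left untouched, so
    the sketch never rewrites permissions it does not understand.
    """
    result = copy.deepcopy(policy)
    for statement in result.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if actions and all(a.startswith("sqs:") for a in actions):
            statement["Resource"] = [queue_arn]
    return result


# Placeholder region and account ID, for illustration only.
queue_arn = sqs_queue_arn("eu-west-1", "123456789012", "orders-to-fulfil")

generated = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sqs:CreateQueue", "sqs:SendMessage"],
            "Resource": ["*"],
        }
    ],
}
restricted = restrict_sqs_resources(generated, queue_arn)
print(restricted["Statement"][0]["Resource"])
```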
You then need to plumb this policy modification code into your application. I did this by creating a file called policy.py, which does all the policy modification and then prints the modified policy out, and a Makefile that builds the policy file.
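The core of such a policy.py could be a single text-in, text-out function. This is a hypothetical sketch: it assumes the generated policy arrives as JSON text and that the queue ARN is already known. Writing the output with indentation and sorted keys keeps committed policy files stable from run to run.

```python
import json


def render_policy(generated_json: str, queue_arn: str) -> str:
    """Turn generated policy JSON into a queue-scoped policy JSON string.

    Any statement whose actions are all SQS actions has its Resource
    replaced by the given queue ARN.
    """
    policy = json.loads(generated_json)
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if actions and all(action.startswith("sqs:") for action in actions):
            statement["Resource"] = [queue_arn]
    # Indented, key-sorted output gives small, readable diffs.
    return json.dumps(policy, indent=2, sort_keys=True) + "\n"


# Placeholder ARN; a real policy.py would build it from QUEUE_NAME.
QUEUE_ARN = "arn:aws:sqs:eu-west-1:123456789012:orders-to-fulfil"

sample = (
    '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", '
    '"Action": ["sqs:SendMessage"], "Resource": ["*"]}]}'
)
print(render_policy(sample, QUEUE_ARN), end="")
```

In the pipeline, policy.py would read _policy.json, pass its contents to render_policy and print the result. That matches the python policy.py > _policy_modified.json step.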
The commands below run the policy generation and deployment.
    chalice gen-policy > _policy.json
    python policy.py > _policy_modified.json
    mv _policy_modified.json .chalice/policy-dev.json
    cp .chalice/policy-dev.json .chalice/policy-prod.json
    chalice deploy --no-autogen-policy
Including --no-autogen-policy is important. Without it, Chalice will generate its own policy during deployment (just as chalice gen-policy does) and deploy that unmodified policy instead of yours.
We can also store .chalice/policy-prod.json (etc.) in version control, so we can see clearly what has changed over time. For this reason, I also remove the Sid (Statement ID) fields from the policy files as part of post-processing, just so we have clean diffs. Sid is an optional element in IAM policy statements, so removing it does not change what the policy allows.
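Stripping the Sid fields takes one comprehension per statement. The strip_sids helper is hypothetical. It returns a new dict rather than mutating the original, and it assumes Statement is a list. IAM also accepts a single statement object, which this sketch does not handle.

```python
def strip_sids(policy: dict) -> dict:
    """Return a copy of `policy` with every statement's Sid removed."""
    return {
        **policy,
        "Statement": [
            {key: value for key, value in statement.items() if key != "Sid"}
            for statement in policy.get("Statement", [])
        ],
    }


example = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "abc123",
            "Effect": "Allow",
            "Action": ["sqs:SendMessage"],
            "Resource": ["*"],
        }
    ],
}
print(strip_sids(example))
```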
We’re combining two basic patterns here: using metaprogramming techniques to avoid writing configuration files containing redundant and potentially out-of-date information, and using our codebase as a place to store details of infrastructure.
So far, I've detailed a constrained and fairly minimalist way to refine IAM policies using code parsing/AST techniques. There are other techniques you could use, in a variety of different languages.
In Ruby, there is a whole variety of techniques you can use to parse your own source code. In Python, the built-in ast module is not the most user-friendly, so it is worth reading Green Tree Snakes and popping your code into the AST Explorer to see the data structure it creates.
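For a quick look without leaving the terminal, Python's own ast.dump prints a text rendering of the tree. This helps when working out which node types a matcher needs to look for (here Assign, Name and Constant). The indent argument requires Python 3.9 or later.

```python
import ast

source = 'QUEUE_NAME = "orders-to-fulfil"'
tree = ast.parse(source)

# indent= needs Python 3.9+; drop it on older versions for a one-line dump.
print(ast.dump(tree, indent=2))
```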
(Obviously, the Lisp/Clojure folks have been doing the code-as-data thing since 1958.)
This is an approach that can be scaled up. I’d really like to see more frameworks and libraries like Chalice, where configuring infrastructure becomes part of the codebase that developers build along with the logic of the code itself, rather than being hidden away in unloved configuration repositories, or generically autogenerated by cloud platforms.
Where does that leave tools like Terraform? The general idea is that you should build applications that manage their own infrastructure where you can. Where they can't, that's when you need a tool like Terraform, to manage the infrastructure of those applications.1
There's also nothing stopping you from using autogeneration and AST parsing to generate your IAM policies while you are developing. Then, once your code has matured and the policies are less likely to change, you can move them into your infrastructure management code (in, say, Terraform).
The important part of this is a mental shift: in a post-GDPR, post-Equifax world, building secure infrastructure is as much the job of a good back-end developer as writing secure code. Have fun with it, because with a bit of creative thinking, you can find better ways of doing it.