We’ve made great strides in ops automation, but there’s no one-size-fits-all approach to ops because abstractions have limitations.
Perhaps it’s my Industrial Engineering background, but I’m a huge fan of operational automation and tooling. I can remember my first experience with VMware ESX and thinking that it cried out for automation tooling. Since then, I’ve watched cloud ops revolutionize application development and deployment. We are just at the beginning of the cloud automation curve, and our continuous deployment tooling and platform services are delivering exponential increases in value.
These cloud breakthroughs are fundamental to ops and have uncovered real best practices for operators. Unfortunately, many of the cloud-specific scripts and tools do not translate well to physical ops. How can we correct that?
Now that I focus on physical ops, I’m in awe of the capabilities being unleashed by cloud ops. Looking at Netflix’s Chaos Monkey pattern alone, we’ve reached a point where it’s practical to inject artificial failures to improve application robustness. The idea of breaking things on purpose as an optimization is both terrifying and exhilarating.
In the last few years, I’ve watched (and led) the application of these cloud tool chains to physical infrastructure. Fundamentally, there’s a great fit between DevOps configuration management tooling (Chef, Puppet, Salt, Ansible) and physical ops. Most of the configuration and installation work (post-ready state) is fundamentally the same regardless of whether the services are physical, virtual or containerized. Installing PostgreSQL is pretty much the same on any platform.
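To make that portability concrete, here is a minimal, hypothetical sketch (my own illustration, not code from any of the tools above) of why the install step travels so well: the high-level intent stays constant and only the package-manager command varies by platform.

```python
import shutil


def install_command(package, which=shutil.which):
    """Map the platform-neutral intent 'install this package' to the
    command this particular host needs. Configuration management tools
    like Chef, Puppet, Salt and Ansible perform this mapping for us,
    which is why one recipe can serve physical, virtual and container
    hosts alike. The `which` parameter exists so the lookup can be
    stubbed out for testing.
    """
    if which("apt-get"):
        return ["apt-get", "install", "-y", package]
    if which("dnf"):
        return ["dnf", "install", "-y", package]
    if which("yum"):
        return ["yum", "install", "-y", package]
    raise RuntimeError("no supported package manager found")


# On a Debian-family host this resolves to the apt-get form:
print(install_command("postgresql", which=lambda cmd: cmd == "apt-get"))
```

The point of the sketch is the shape, not the three branches: the caller expresses intent once, and the platform-specific translation lives in exactly one place.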
But pretty much the same is not exactly the same. The differences between platforms often prevent us from translating useful work between them. In engineering, we’d call that an impedance mismatch: similar devices that cannot work together due to minor variations.
A good example of this ops impedance mismatch is networking. Virtual systems present interfaces and networks that are specific to the desired workload, while physical systems present all the available physical interfaces plus additional system interfaces like VLANs, bridges and teams. On a typical server, there are at least 10 available interfaces, and you don’t get to choose which ones are connected – you have to discover the topology. To complicate matters, the interface list will vary depending on both the server model and the site requirements.
Virtual environments are trivial by comparison: you get only the NICs you need, and they are ordered consistently based on your network requests. While the basic script is the same everywhere, it’s essential that it identify the correct interface. That’s simple in cloud scripting and highly variable for physical!
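To illustrate the discovery problem, here is a hedged sketch of interface selection. The selection policy and all names are my own invention; on Linux the raw facts would come from files like `/sys/class/net/<nic>/carrier` and `/sys/class/net/<nic>/speed`.

```python
def pick_admin_interface(interfaces):
    """Choose which NIC should carry admin traffic on a physical box.

    `interfaces` maps name -> {'carrier': bool, 'speed': Mb/s}.
    In a cloud VM this function is unnecessary: you requested one NIC
    and got it. On metal you must discover which of the ~10 interfaces
    is actually cabled, and the answer varies with both the server
    model and the site wiring.
    """
    # Skip loopback and virtual devices, and anything without link.
    candidates = {name: nic for name, nic in interfaces.items()
                  if nic["carrier"] and not name.startswith(("lo", "veth", "virbr"))}
    if not candidates:
        raise LookupError("no connected physical interface found")
    # Hypothetical policy: fastest link wins; name breaks ties deterministically.
    return min(candidates, key=lambda name: (-candidates[name]["speed"], name))


server = {
    "lo":   {"carrier": True,  "speed": 0},
    "eno1": {"carrier": False, "speed": 1000},   # not cabled at this site
    "eno2": {"carrier": True,  "speed": 1000},
    "ens3": {"carrier": True,  "speed": 10000},  # the 10 GbE uplink
}
print(pick_admin_interface(server))  # -> ens3
```

The cloud version of this function is a one-liner (“use eth0”); the physical version has to encode policy, and that policy is exactly where site-to-site variation leaks into otherwise reusable scripts.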
Another example is drive configuration. Hardware presents a nearly limitless set of options: RAID versus JBOD, SSD versus HDD. These differences have dramatic performance and density implications that are, by design, completely obscured in cloud resources.
The solution is to create functional abstractions between the application configuration and the networking configuration. The abstraction isolates configuration differences between the scripts, so the application setup can be reused even if the networking is radically different.
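A minimal sketch of that separation (all names here are hypothetical illustrations, not OpenCrowbar’s actual API): the application layer codes against a small contract, and each platform supplies its own implementation of it.

```python
from abc import ABC, abstractmethod


class NetworkLayer(ABC):
    """The contract between the networking layer and the application layer.

    The application script asks only 'what address do I serve on?';
    whether that address came from a cloud-assigned NIC or from
    discovering and teaming physical interfaces is hidden behind it.
    """

    @abstractmethod
    def service_address(self) -> str: ...


class CloudNetwork(NetworkLayer):
    def __init__(self, assigned_ip):
        self._ip = assigned_ip  # the cloud hands us exactly what we requested

    def service_address(self):
        return self._ip


class PhysicalNetwork(NetworkLayer):
    def __init__(self, discovered):
        # An earlier layer discovered the topology and built bonds/VLANs;
        # by this point the result looks as tidy as a virtual NIC.
        self._ip = discovered["team0"]

    def service_address(self):
        return self._ip


def configure_postgres(net: NetworkLayer) -> str:
    # The application-layer step is identical on every platform.
    return f"listen_addresses = '{net.service_address()}'"
```

Because `configure_postgres` depends only on the abstraction, the same script runs unchanged whether the layer beneath it is a cloud VM or a discovered physical topology.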
With some of our latest OpenCrowbar work, we’re finally able to create practical abstractions for physical ops that are repeatable from site to site. For example, we have patterns that allow us to functionally separate the network from the application layer. Using that separation, we can build network interfaces in one layer and allow the next layer to assume the networking is correct, as if it were a virtual machine. That’s a very important advance because it finally allows us to share and reuse operational scripts.
We’ll never fully eliminate the physical-vs-cloud impedance mismatch, but I think we can make the gaps increasingly small if we continue to 1) isolate automation layers with clear APIs and 2) tune operational abstractions so they can be reused.