How I managed to save 1500+ hours of Engineering Time every year.
DevOps and SRE cultures are all about automation and making your life easier. There is no debate on that, but what about solving chaos between internal teams?
Once in a while you stumble upon problems that sit in plain sight but that people just aren’t able to see. That’s when you can experiment with your skillset. Even if it doesn’t work, you will learn a few things along the way.
I earlier wrote an article about solving problems that have the greatest impact.
SRE 103: Find & solve problems that have the greatest impact.
“Problems are not stop signs, they are guidelines.” Robert Schuller.
This is a more detailed writeup of what was built and how it came to be.
At my current workplace, we have 175+ services and jobs running as microservices. We host our CI/CD on GitLab, which we migrated to from GitHub + Jenkins a while back. Neither before nor after the migration was there a clear picture of who owned which service.
Any engineer on any team could enter any repo, create an MR/PR and publish their changes without anyone’s consent! (I KNOW!). I reached a crossroads one fine day when, while on call, I committed a quick fix to the main branch and rolled it out. This audacious and bold move was not taken lightly by the manager of the team owning that service, and I received a nice email in my inbox, cc’d to my then manager.
Well, I understood the sentiment about someone else blatantly editing your codebase, but the problem was not with what I did; it was with the process.
Soon after, GitLab released two major features:
2] Protected Branches
One assigns a team to a particular repository, and the other requires approval from the assigned team before changes are accepted.
Solved right? Not quite. The problem was far from over.
We are a Python + Golang shop. We also have multiple requirements, dependencies and different versions of protobufs.
Almost every single day, new builds are pushed out, and version and library changes happen throughout working hours.
The ownership prior to all this was maintained in a Google Sheet which was rarely updated. So I started building a serverless application.
It has three major components.
- Frontend in HTML / JS / Ajax DataTables
- API Gateway + DynamoDB
- A job that runs every day/week to update all the required info and load it into DynamoDB.
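To make the third component a bit more concrete, here is a minimal Python sketch of what such a scheduled job might do for a Python service: read the pinned entries from a `requirements.txt` and write one record per service. This is not the actual implementation. The field names (`service_name`, `repo_url`, `libraries`, and so on) and the helper names are hypothetical, and `table` is assumed to be any object with a `put_item(Item=...)` method, which is how a boto3 DynamoDB `Table` resource is called.

```python
import re

# Matches simple pinned lines such as "requests==2.31.0".
# Ranges (>=), extras and include lines (-r other.txt) are skipped in this sketch.
PINNED = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;#]+)")


def parse_requirements(text):
    """Return {library: version} for the pinned entries in requirements.txt content."""
    libraries = {}
    for line in text.splitlines():
        match = PINNED.match(line)
        if match:
            libraries[match.group(1).lower()] = match.group(2)
    return libraries


def build_item(name, repo_url, language, team, requirements_text):
    """Assemble one dashboard record (hypothetical field names)."""
    return {
        "service_name": name,
        "repo_url": repo_url,
        "language": language,
        "team": team,
        "libraries": parse_requirements(requirements_text),
    }


def sync(services, table):
    """Write one record per service to the table-like object passed in."""
    for service in services:
        table.put_item(Item=build_item(**service))
```

Passing the table in as a parameter keeps the parsing logic testable without AWS credentials; in a real job it would be created once and reused across services.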
So that's that: every single job is listed on the dashboard, including information like name, GitLab link, service language, team name, the libraries in use with their versions, the protobufs in use with their versions, code coverage, etc.
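As an illustration of the kind of question this data can answer, the sketch below groups services by the version of a given library they depend on. It assumes each record is a plain dictionary with hypothetical `service_name` and `libraries` fields; the real dashboard schema may well differ.

```python
from collections import defaultdict


def versions_in_use(records, library):
    """Group service names by the version of `library` they depend on."""
    by_version = defaultdict(list)
    for record in records:
        version = record.get("libraries", {}).get(library)
        if version is not None:
            by_version[version].append(record["service_name"])
    return dict(by_version)


# Made-up sample records, purely for illustration.
records = [
    {"service_name": "billing", "libraries": {"protobuf": "4.25.1"}},
    {"service_name": "search", "libraries": {"protobuf": "3.20.3"}},
    {"service_name": "alerts", "libraries": {"requests": "2.31.0"}},
]
print(versions_in_use(records, "protobuf"))
```

A lookup like this replaces clicking through each repository to see which version a neighbouring service is on.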
So to think about the time saved:
- We have 75+ engineers.
- 5 Days a week
- 4 Weeks
- 12 months
- Let’s assume an engineer spends at least 5 minutes of their time each day clicking around to see what got updated in all services that are connected to or interacting with their service. (It’s def more than 5)
- That is approximately 90,000 minutes (75 × 5 × 4 × 12 × 5), meaning 1,500 hours every year.
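The estimate above can be reproduced in a few lines of Python. The inputs are the rough assumptions listed in the bullets (75 engineers, 5 minutes per working day), not measured values.

```python
engineers = 75
days_per_week = 5
weeks_per_month = 4
months_per_year = 12
minutes_per_engineer_per_day = 5  # assumed lower bound

minutes_per_year = (
    engineers
    * days_per_week
    * weeks_per_month
    * months_per_year
    * minutes_per_engineer_per_day
)
hours_per_year = minutes_per_year / 60

print(minutes_per_year, hours_per_year)  # 90000 1500.0
```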