Feb 01, 2024
Antti Seppälä & Juha Leino
There were times, as some may recall, when Site Reliability Engineering wasn’t even a thing. Back then, Elisa was planning to organise its operations better to respond to changing business demands – something that obviously happens from time to time.
Part of this was to create a business unit dedicated to building software and services for customers. Thus Software Services (SoSe) was born.
When our SRE team came into existence (a little more on that in a while), its natural home seemed to be SoSe. But software development is not limited to the SoSe unit at Elisa, and we naturally want to spread all good practices throughout the whole house.
This brought us to ponder – where in an organisation should SRE actually be located? And what is its true role?
To better tell the whole tale, we have to go back and tell the origin story – you know, like the first part of a superhero film saga.
When something like SRE was already in the blueprints, there was a shared understanding at Elisa that there should be a team to take care of common tooling. Back then, the need for that was driven by the diverse set of ways people used to develop mobile applications.
It took only a little forward thinking to build common tools not only for mobile application development, but for all software development: CI/CD pipelines, test automation, and so on. After all, the same principles apply to all of it.
Almost as if on cue, talk of DevOps started spreading in the IT world at the same time. This new practice focused on bringing development and operations closer together. So, the first iteration of our team was called – ta-dah – the DevOps team.
And of course, later on that proved to be a hasty decision.
While doing DevOpsy stuff, the team discovered that they could not really build DevOps, as there is no such thing per se. DevOps is not tangible, but rather a part of an organisation's culture: a way to operate, implement solutions, and cross boundaries in development.
Hence a name change was in order.
This time we turned to the wisdom of Google, which had launched the concept of Site Reliability Engineering a couple of years earlier, and the term somehow stuck.
As site reliability is something we can build, it gives the team a much more concrete mission. We renamed the team the SRE team and gave its members the job title of Site Reliability Engineer.
To get back to the original question: what, then, is the role of SRE in a software development organisation such as Elisa's?
We have seen that if software development is not supported throughout the organisation in a consistent manner, SRE will be asked to bridge quite a lot of gaps that it probably shouldn't.
As an example, well-designed, cloud-native software architecture is key to having reliable services. If one business unit supports development with multi-tiered in-house architects, both lead and service-specific, while another relies on external development teams, is it fair to expect the SRE team to help teams in both business units achieve similar end results?
The things we have learned about utilising the SRE team in the most beneficial way are manifold, but for the purpose of this blog, we condense them into five simple findings.
The number one function of our SRE team is to support all Elisa developers in being more productive, producing reliable services, and efficiently utilising our own services while doing so.
As we develop and operate Elisa container platforms, version control, and common CI pipelines, along with most of the relevant tooling integrated into said CI, our hands are already quite full.
SREs are constantly looking at how to automate tasks, or develop software further to make those tasks unnecessary, be it operations or guidance.
The same mindset as in software development, regarding automation and operations, applies when discussing services developed by other teams or when encountering production issues.
To that end, we aim to keep our communication open, on company-wide public channels, rather than in DMs between individual developers and SRE people.
We have noticed that if we are able to create platform-specific, stack-specific, or tooling-specific communities, that significantly relieves Site Reliability Engineers of being the sole point of contact. Trailblazers and other users can instead answer questions they have themselves faced before.
Shared channels also help keep the communication kind and encouraging, because individual SREs or developers are not constantly answering the same, sometimes quite simple (we all must start somewhere, right?) questions.
Providing and utilising various methods of training, such as self-paced study, classroom sessions, and peer-to-peer learning, is a skill set of its own. Luckily, we figured out early on that we needed external help, and that proved to be the correct assessment. Tailor-made training is supplemented with easy access to vetted training material from vendors.
As most of our cloud-native operations utilise Kubernetes in one way or another, the service-specific problems are extremely varied. We have first focused on providing training which aims at adopting new tooling, platforms, or methods.
This has been an attempt to cover all relevant aspects that developers in other teams should at least anticipate encountering in their day-to-day work, without the training being too exhaustive.
Training people well takes more than good training. It takes good training provided at the right time. When teams first look at a new stack, they benefit more from simple primers than from material that focuses on more advanced topics.
As an example, with our Kubernetes as a Service (KaaS) training material, we often find people asking precisely the questions it covers, or taking the training a second time once they know which information they are missing. So, at least a significant part of the first training was probably wasted.
Perhaps they were sent to the training as new employees and would have been better off with a 30-minute introduction to the basics of the service. After that familiarisation, they could have gone through the actual training when setting out to tackle their first tasks.
The SRE team is not there to perform all kinds of common functions. Instead, SREs should focus on improving reliability, anti-fragility, and instrumentation, and on lowering MTTR, while other common needs should be handled differently or by a separate team.
If this is not adhered to, the meaning of SRE quickly gets muddled, and the team ends up performing operations on various common SaaS offerings, workstation software, security tooling, and so on.
That is not SRE. SREs should be able to do their work by analysing what has happened or will happen and by developing software accordingly.
Our SRE team currently has responsibilities that go beyond the scope of SRE, handling the duties of tooling or platform teams, for example. To tackle the ever-increasing complexity of modern IT, it is necessary to have as much simplicity and uniformity of stack as possible.
That increases the number of shared components and thus the amount of pressure on common teams. If your only common team is the SRE team, the core of the SRE function is in peril.
An organisation should be careful about what it calls SRE and how it is defined in the context of development work. Even when SRE is shared and perhaps supported with methods that are at least superficially similar to those of other teams, it is its own beast after all. And by taking good care of that beast, the whole organisation will reap the benefits.
And to make a confession: to be able to focus on the core of actual SRE work, we have not tackled this problem sufficiently. If you have experience with the matter, please get in touch with us to open a dialogue!