- 24 minutes read #developer , #business organisation , #pontification

The role of enterprise architect and principal engineer

This post was originally written for 101 Ways

How to read this doc

We’re all impatient and time-poor, and this document has ended up with a lot of words, so each section ends with a “Summary” block. You can read each summary to get the gist of the preceding section, and then only dive into the section itself if you need the reasoning behind the conclusions…

The balance between vertical and horizontal teams and the Role of the Principal Engineer in delivering value

Before we talk about roles of people, let’s look at how software systems evolve over time in large organisations.

Once a software service of any kind has been deployed within an organisation, either to be used internally or as part of the company’s public product, we need to know who will support it. Who adds new features? And, particularly when a bug or failure is noticed, who do we call up and prioritise to get things fixed in a hurry?

The idea of “it’s everyone’s responsibility” doesn’t work because if everyone is responsible then no-one is: if you have a set of teams each with their own OKRs or other measures to meet, then which product manager is going to volunteer to fail at their team’s goals in order to fix a problem that another team should fix? And even in a world of perfect altruism there is the practical consideration that fixing systems takes deep knowledge, so a request to fix a problem will gravitate towards the team that’s done it before and who will have a head start on understanding the fault.

So this drift towards each service being owned by a single team is completely natural and shouldn’t be resisted - indeed it’s recognised as best practice in some descriptions of microservice architecture (and there’s frustration when a service doesn’t have a clear owner), e.g.:

“With small, long lived teams in place, we can begin to improve the ownership of software. Team ownership helps to provide the vital ‘continuity of care’ that modern systems need in order to retain their operability and stay fit for purpose” - Team Topologies: Matthew Skelton, Manuel Pais

Services without an owning team are tech debt and a potential risk.
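As a rough illustration of how that risk could be made visible, here is a minimal sketch that audits a service catalogue for services with no owner, or with more than one. The catalogue format, the service names and the team names are all assumptions made up for the example; a real organisation would read this from whatever registry it already keeps.

```python
# Hypothetical catalogue: (service, owning team) pairs.
# None means no owner has been recorded for that service.
CATALOGUE = [
    ("checkout-api", "payments"),
    ("invoice-worker", "payments"),
    ("legacy-reports", None),
    ("config-service", "platform"),
    ("config-service", "data"),
]


def audit_ownership(catalogue):
    """Return (orphaned, contested) lists of service names, each sorted."""
    owners = {}
    for service, team in catalogue:
        teams = owners.setdefault(service, set())
        if team is not None:
            teams.add(team)
    orphaned = sorted(s for s, teams in owners.items() if not teams)
    contested = sorted(s for s, teams in owners.items() if len(teams) > 1)
    return orphaned, contested


orphaned, contested = audit_ownership(CATALOGUE)
print("No owner:", orphaned)
print("More than one owner:", contested)
```

Both lists are worth chasing: a service with two owners tends to behave like a service with none.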

Note that these teams are often miscategorised as “horizontal” teams. In theory a team should own a set of services that form a coherent sub-domain within the business. But in practice the need to have deep understanding of software ecosystems means that services get parcelled out to teams based at least as much on the implementation tech choices as the domain. Which is why teams end up labelled “back end”, “front end”, “api”, “data” etc. Teams can be organised in a “Reverse Conway” to include tech choices but it’s quite difficult.

But the notion of team autonomy - each team working to their own roadmap - has its downsides. If you want to deliver a new large piece of functionality within a business you tend to touch more than one service and therefore more than one team will be likely to be involved. Unfortunately what happens is the communication between the teams, and the siloisation of their roadmaps, gets in the way of progress - you end up with the product managers of each team trying to align the request against their own roadmaps and multiple other demands which are seen as being “external” to their team.

In the face of this the senior leadership get frustrated with the seeming inefficiency of the engineering capability and decide, for a new product or functionality, to create a vertical team to deliver it. Note this may also be true for implementing any standards, conventions or governance that are supposed to be applied organisation-wide. This new team is often given a groovy, dynamic name like “Task Force”, “Hit Squad” or “Tiger Team” and is meant to cut across all the communication and planning barriers. The team is created by taking experienced developers from every service involved and they are highly motivated to deliver something new instead of the mundane day-to-day work.

The thing is, this works extremely well for the first feature delivered!

So the approach is seen as a success and then, upon completion, the team is disbanded and, maybe, some members are reconstituted into a new “Hit Squad” assembled for the next feature - or, worse, two or more are created at the same time.

After a year or so of this, certainly within two years, the system landscape of the company is a complete mess - it’s not that vertical teams are bad, it’s that they get disbanded! After a while no team owns anything, no experience is retained in the company and there is no clear responsibility for maintenance. There is “drive by coding”, the engineers describe everything as “being held together by chewing gum” or some other such idiom, and a “rationalisation project” is started to migrate the organisation back to STOSA (Single Team Oriented Service Architecture). The pendulum swings…

So if neither vertical nor horizontal teams are the way to deliver value in large-scale projects what can we do about this? How can we keep the best points of service ownership with capability-organised teams, plus the scale of vision and delivery effectiveness of an end-to-end team?

What works well is to create a virtual team using just the senior staff you need - for example only an experienced principal engineer and product manager. They can then act as shepherds or ambassadors for the project. Between them they should be able to articulate the new product or business function they wish to create and see how the different parts of the current landscape should fit together, be amended, extended or new services created in order to fill the gaps needed by the function.

The characteristics of a good Principal Engineer include [1]:

  • Strong multitasking ability - balancing a couple of “hot” projects with some growing future projects.
  • Strong analytical skills and outcome oriented.
  • Strong communication skills, the ability to explain complex technical issues in a way the listener can understand - a C-suite listener will be hearing very different things to a developer.
  • Excellent organisation and leadership skills, in particular the ability to network and “lead upwards”.
  • Expertise in multiple technical domains, shown through proven experience in building complex systems.

In particular the principal engineer will bring their experience of the different architectural options that are available, and know how to choose the most appropriate one given the constraints of the current landscape - both technical and social, such as the knowledge of the engineering teams available. In some cases they may determine that a whole new subdomain is required and a new team should be formed to implement and support it long term - but this is not the same as a “hit squad” that gets disbanded!

The pair will then be able to negotiate with and empower the tech leads of each team to contribute to the larger project, to make sure their roadmaps are aligned, that APIs will have the required capabilities, that the solutions in general are supportable within the company and align with agreed best practices, check up on progress and hold individual teams accountable for contributing to the greater whole. With careful alignment, and a sense of urgency, the new feature will emerge out of the parts created by each involved team.

Summary

  • Engineering teams always end up service aligned, it’s natural and inevitable to support long term maintenance.
  • “Hit squads”, created to cut through the silos, end up producing a mess by the third project as services end up with no clear owner - particularly if the hit squad has indulged in drive-by coding and even more so when the squad is disbanded.
  • For a large scale project in a large organisation, a combination of product manager and principal engineer, empowered to sit across teams, will make sure that the various team roadmaps and outputs agree with the business strategy and that standards and conventions are being followed. The teams are therefore aligned, energised and valuable.

The Role of the Enterprise Architect

Given what we’ve said in the previous section, it seems that a healthy community of Principal Engineers should fill the function of Enterprise Architecture, ensuring a coherent engineering practice and delivery. So is there a specific role for Enterprise Architects as such?

We wish we had tools that would allow engineers to document their systems in such a way that any team can use it without help, or the same team can use it at some point in the future. But these tools can be difficult to use, still need alignment across the company, and the act of documenting can be uninspiring to the sort of person who enjoys creating and building.

One solution is the architecture librarian. They can be empowered to research the whole systems landscape and form a clear picture of how the systems fit together and depend on each other. Or they may have been at the company for so long that they naturally have all the retained knowledge.

There are many tools that can help the architecture function of a company document the whole system landscape:

But the value of the documentation seems to decrease the more distant the documenters are from the people writing the systems in the first place. So a balance needs to be reached whereby tech team leads, or Principal Engineers, are assisted in producing, maintaining and extending this documentation themselves. Architects should not be seen as senior to the Principal Engineers - or vice versa - they each perform a different function in mutual support with each other as peers.

The customers/consumers of the software architecture function will be the principal engineers and tech leads. And for every team in an engineering department the goal is to automate away work as much as possible - to enable self-service by the consumers - not to make oneself redundant but: 1. to enable greater scale, 2. to free oneself up for more interesting things.

Self-service at the level of abstraction of a Principal Engineer means being able to make systems design choices without having to wait for approval. This implies a well agreed set of standards - the right way is the easy way.

So another role for the Enterprise Architect function is enabling processes for reaching agreements on architectural solutions. Note, it shouldn’t be the role of EA to impose any particular solutions - no matter how experienced the EAs are, that way leads to “ivory tower” style behaviour and diminishing respect.

Summary

  • Self-service exists for Principal Engineers to achieve scale - at this level of abstraction that involves systems design choices and architecture.
  • This abstraction level can often be agreed, in advance, between Principal Engineers and Architects using tools like Domain Driven Design combined with some agreed standards and conventions.
  • The correct approach is trust but verify - all senior members of the company are highly experienced: let them innovate when necessary.

The balance of tools and the paved road

One part of the art of leadership is to hire people smarter than yourself and then get out of their way. And this applies to the strategy of Enterprise Architecture too. But that doesn’t mean there should be a free-for-all when it comes to the choices of tools, languages and techniques.

In general there is a balance to be found between the chaos of everyone doing their own thing, and the inadvertent shackles of central planning.

  • Engineers want “The best tool for the job” - this is chaos, every job ends up with its own tool and there’s minimal sharing of knowledge across the company. CV driven development is also a significant risk.
  • Enterprise Architecture tends towards Command and Control style management and wants to declare “This is the tool to use”. This doesn’t scale, and it doesn’t empower creative solutions.
  • The Goldilocks solution is “The best set of tools for all the jobs” - the “paved road”, with some scope for bespoke solutions if there’s an articulated need. As with all defined processes, a way to challenge the chosen tools must be created at the same time that the tools are defined.

Teams are empowered to use whatever tech they see fit, but it is made clear to all the stakeholders (i.e. product managers) that there are consequences of being first and using stuff away from the paved road, and the product teams need to be accountable for such choices. For example, if the platform support team is not familiar with a technology choice then the team that’s chosen to use it may have to provide on-call support (or the Product Manager has to acknowledge there may be downtime out of office hours). However they are also empowered to engage their peers and see if multiple teams want that feature/capability and then either make a request against the platform team’s roadmap, or use an “internal open-source” model to enhance the platform for all.

The platform team makes the building blocks of the technical paved road - make the right way the easy way.

“The same principles of good design and functional architecture apply in the world of choices as well. Our primary mantra is a simple one: if you want to encourage some action or activity, Make it Easy” - Nudge: Richard Thaler, Cass Sunstein

The customers of these building blocks are the EAs and Principal Engineers. Technical teams code the products with the PEs and EAs ensuring teams don’t fall into known traps, which involves knowledge and experience of system design and understanding the tools.

The role of Enterprise Architecture is to guide and enforce the method for coming up with standards - the EA may have an influence on the resulting standards, but shouldn’t be coming up with them in isolation.

Any standards determined by any Enterprise Architecture function need to be empowering in some way. I have experience of an EA team writing out a set of standards like:

  • “All software should be designed to be flexible and allow change”
  • “All services should minimise the cost of infrastructure”

and so on… The EA team spent weeks writing them out into the internal wiki. The thing is they were all platitudes: nothing untrue about them in any way, but none of them useful. That EA team lost the respect of the Principal Engineers and any influence they may have had (NB: influence is about being likeable, connected, and credible).

Summary

  • If you apply control you freeze progress.
  • Make the right way the easy way - the paved road: show the general direction and let smart people charge forward.
  • Platitudes are no help, standards should be informative and enabling.

Internal Open Source

Internal open source (“Inner source”) is often discussed but rarely practised. There’s no magic to it but people who haven’t run or significantly contributed to an open source project always underestimate how much management it requires. Linus Torvalds invented an entirely new source control solution just to help him manage the Linux Kernel project (fortunately, because it is open source, Git is available to everyone).

To effectively manage inner source every project needs several practical things including:

  • Good documentation that details the purpose of the project
  • Installation instructions and usage instructions - these will be required anyway by any new member joining the team, so why not make them readable by other teams too.
  • Contribution instructions for people not in the team - e.g. communication channels used, how to find the product manager for that service, minimum requirements for raising issues or pull requests.
  • The documentation, or pointers to the documentation, should live with the code where it’s obvious - often in the README file.

This is already more documentation than most teams are willing to do for their projects. However, by agreeing to some organisation-wide conventions, creating these guides can become a template exercise, particularly if there’s a high degree of similarity in the tech stacks.
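Once the convention is a template it can also be checked mechanically. The following is a minimal sketch, assuming a made-up convention that every README carries four markdown-style headings; the section names and the example README are illustrative, not a recommendation.

```python
import re

# Assumed organisation-wide convention: every README carries these headings.
REQUIRED_SECTIONS = ["Purpose", "Installation", "Usage", "Contributing"]


def missing_sections(readme_text, required=REQUIRED_SECTIONS):
    """List the required section names that have no matching heading."""
    # A heading here is a markdown-style line such as "# Purpose" or "## Usage".
    headings = {
        match.group(1).lower()
        for match in re.finditer(
            r"^#+[ \t]+(.+?)[ \t]*$", readme_text, flags=re.MULTILINE
        )
    }
    return [name for name in required if name.lower() not in headings]


example = """# config-service
## Purpose
Serves configuration to other services.
## Installation
See the team wiki.
"""

print(missing_sections(example))
```

A check like this could run in a pipeline as a gentle nudge rather than a gate, which is more in keeping with making the right way the easy way.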

“Every part of the software system needs to be owned by exactly one team. This means there should be no shared ownership of components, libraries, or code. Teams may use shared services at runtime, but every running service, application, or subsystem is owned by only one team. Outside teams may submit pull requests or suggestions for change to the owning team, but they cannot make the changes themselves. The owning team may even trust another so much that they grant them access to the code for a period of time, but only the original team retains ownership.

Note that team ownership of code should not be a territorial thing. The team takes responsibility for the code and cares for it, but individual team members should not feel like the code is theirs to the exclusion of others. Instead, teams should view themselves as stewards or caretakers as opposed to private owners. Think of code as gardening, not policing.” - Team Topologies: Matthew Skelton, Manuel Pais

In a practical sense, when code is hosted on a git platform such as GitHub or GitLab, the phrase “grant access to the code” means having one’s ID added to the CODEOWNERS file so one can approve pull requests. This implies a lot of trust by the owning team and is a double-edged sword: it is flattering, but it also implies an obligation to care.
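To make that concrete, here is a simplified sketch of how CODEOWNERS-style rules resolve to approvers, with the last matching rule winning. It is only an approximation: real CODEOWNERS files use gitignore-style patterns, which differ from the `fnmatch` globbing used here, and the team and user names are invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical CODEOWNERS-style content; team and user names are made up.
CODEOWNERS = """
# Default owners for everything in the repository
*            @org/config-team
# A trusted outside contributor is granted approval rights on the docs
docs/*       @org/config-team @alice-from-payments
"""


def parse_rules(text):
    """Turn CODEOWNERS-style text into a list of (pattern, owners) rules."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules


def owners_for(path, rules):
    """Return the owners from the last rule whose pattern matches the path."""
    matched = []
    for pattern, owners in rules:
        if fnmatchcase(path, pattern):
            matched = owners  # later rules override earlier ones
    return matched


rules = parse_rules(CODEOWNERS)
print(owners_for("src/main.py", rules))
print(owners_for("docs/guide.md", rules))
```

In this sketch the outside contributor can approve changes under `docs/` only, which is one way an owning team might extend trust gradually.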

As an aside, the term “inner source” seems to have been coined by Tim O’Reilly in December 2000. That post also details some virtues of the “open source” development style, which include being robust, well-designed and carefully documented, having an available specification/extension process and an existing reference implementation, and an open and responsive stewardship of the software and the standard by those who control it. All these virtues also seem appropriate to software systems created within the community of a company.

Small, nimble teams inspired to fulfil mission objectives with the freedom, flexibility, and empowerment to get it done under any circumstances, contributing where necessary, while respecting the long-term stewardship of code are the key to making rapid progress on a number of opportunities simultaneously.

Summary

  • If your team is spending all its time reviewing external contributions then that is a success - you’re enabling and leveraging the whole company to support the growth and value of your services.

Things that don’t (often) work

Architecture Review Boards

These are the meetings where, when a principal engineer or tech lead comes up with a system design for a new function, they have to get it “approved” at the fortnightly governance meeting. This may be slightly controversial as ARBs can be valuable under certain circumstances, but those circumstances are difficult to get right.

In my experience architecture governance boards often end up as low-value talking shops and get disbanded. Symptoms of a bad ARB include:

  • Attendees of the meeting don’t read the proposal beforehand so a presentation has to be done in the meeting (effectively reading the proposal aloud)
  • Attendees end up bickering about some tiny detail.

In the face of this, proposers will end up gaming the meeting in order to get a design through to meet their targets - figuring out what they need to say, and also what to leave out, such that meeting attendees are given the impression they’ve been consulted, and given the pleasing opportunity to exercise their authority, but without being given the opportunity to be obstructive.

ARBs can work where they are seen more as a mechanism for exploring possibilities, i.e. behave more like systems analysis than a clearing house. If there is an agreed set of loose principles to follow then the solution-space may focus onto a clear outcome more quickly - once a problem has been analysed into a domain then only minimal architectural exploration should be needed.

Summary

  • ARBs can descend into politicking.
  • They can work if positioned as agreeing and disseminating loose standards for thinking through a solution space, rather than mandating particular solutions.
  • Need to be enablers rather than controllers.
  • A good tech strategy with buy-in enables self-service at the Principal Engineering level.

Architecture Decision Records

When looking at a particular piece of a system, the engineer may be asking themselves “Why was it done THIS way?”. If no-one is around who worked on the system at the time then it may be difficult to find any answer that hasn’t been passed down as some kind of folklore by word-of-mouth…

Architecture Decision Records (ADRs) have been suggested as a solution to this problem - unfortunately they don’t work. I’ve seen it attempted 4 times in my career and, in every case, the person who proposed it was the only one to write any, and after a few months they gave up. Even in the case where a senior leader insisted on them, they only got written (as a backfill) when that senior outsider was having something explained to them, and no-one else ever found them useful.

There are several reasons for this:

Firstly: there is a great mismatch of incentives / value for writing ADRs. The cost of writing them is incurred now - and it’s a very boring task - whereas the value is gained by someone else or even your future self, but only in some far off future which possibly may never happen.

Secondly: what constitutes a valuable decision worth writing down? The choice of event streaming vs. REST? Using Python vs. Java vs. TypeScript? How can you tell the difference between something that is new and innovative vs. something that’s obvious? Are ADRs just making up for experience? You will end up with arguments about bothering to write the ADR (remember, they’re boring and, at the time of writing the ADR, the question they answer is now obvious) and they get done either as a labour of love or under compulsion by a senior colleague (with all the reluctance and bad feeling that comes from that…).

Thirdly: even if the first two were not an issue, human beings cannot produce a projection on an event stream. ADRs are written as a sequence of events over time, whereas we want to know why a certain system is designed the way it is now - i.e. the accumulation of all those decisions over the whole time. But ADRs are not indexed that way, particularly when a subsequent ADR can supersede and invalidate a previous one. So, just like producing a “projection” in a CQRS system, they have to all be read in sequence with the reader keeping track of all the consequences until they can understand the whole picture. The value of this mental effort is too low given the cost.
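To show what that projection involves, here is a minimal sketch that folds a log of ADRs into the set of decisions currently in force. It assumes each ADR carries a structured topic and an explicit `supersedes` field, which real ADRs frequently lack; the record shape and the example log are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ADR:
    number: int
    topic: str
    decision: str
    supersedes: Optional[int] = None  # number of the ADR this one replaces


# Hypothetical decision log, in the order it was written.
LOG = [
    ADR(1, "service-to-service calls", "Use REST"),
    ADR(2, "implementation language", "Use Python"),
    ADR(3, "service-to-service calls", "Use event streaming", supersedes=1),
]


def project(log):
    """Fold the log, oldest first, into the decisions still in force by topic."""
    in_force = {}
    for adr in sorted(log, key=lambda a: a.number):
        if adr.supersedes is not None:
            # Drop the superseded record, whichever topic it was filed under.
            in_force = {
                topic: kept
                for topic, kept in in_force.items()
                if kept.number != adr.supersedes
            }
        in_force[adr.topic] = adr
    return in_force


current = project(LOG)
for topic, adr in current.items():
    print(f"{topic}: {adr.decision} (ADR {adr.number})")
```

The code is trivial precisely because the records are structured; a human reader has to do the same fold from free-text documents, which is where the cost comes from.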

The solution is the architecture librarian. While this sounds like some multi-class Dungeons & Dragons character, it is perhaps the most important role an enterprise architect can play. Consider them the village shaman: they retain the long history of decision making within their team, and their oral history is far richer and more valuable than anything they might write. They should be consulted before important decisions are made.

Summary

  • The incentives for writing ADRs are misaligned.
  • Humans cannot produce projections on an event stream easily (it’s done very slowly over time at great cost and called “experience”).
  • The answer is an enterprise architect.

Working to specification

Reputedly a main problem with outsourcing was that suppliers always delivered what you asked for, rather than what you needed. Detailing the specification correctly became such an issue that outsourcing was no longer a benefit.

I’ve worked at a couple of dysfunctional places where “working to spec” was a strong smell: At one place, a team always demanded very detailed specs in their definition of done and delivered against that - because they were fed up with being told they’d done the wrong thing against vague problem descriptions…

At another, a team delivered a service and the team leader took the attitude “this is the spec of the micro-service, we won’t consider changing it, we have other things to do” (even though it was a core config service and not fit for purpose. I had to build a parallel config service that ended up being used by a third of the company). That particular team lead had, shall we say, some inappropriately domineering behaviours in other areas too…

Specs are needed of course: they form the basis of a service/data contract and enable testing etc. But they work well when all sides are contributing to the spec and collaborating on creating a solution to a business problem (“we all work for the same company”). When a spec is created one-sidedly it results in silos and the worst kind of “ownership” (teams being blocked etc.)…

Specifications can take many forms. They might be product specifications expressed as stories in the team’s backlog. Equally they might be in the form of engineering principles the engineering community has agreed to abide by, such as the minimum-necessary responsibility principle [2]. Specifications can come from standards, such as defining a particular language to be used for specific types of work. They might come from an artefact repository where security-signed versions of packages reside, to ensure teams are using the right libraries. They might come from the available SaaS solutions, such as the cloud the company uses. All of these things and more can form specifications.

Summary

  • Working to spec is a smell of a team protecting itself for some reason or attempting to dominate.
  • The ethos of “we’re all one team” has been destroyed somehow and you need to fix that root cause before the symptom of self-protection or dominance can be addressed.
  • Fixing the root cause can mean taking some of the burden away from teams. This is where an Enterprise Architect, working alongside Principal Engineers can coalesce specifications from the business strategy, product designers, and the engineering community.

Business Lifecycle Model

One way of modelling the stages of a business includes:

| Stage | Seed | Start-up | Scale-up | Growth | Maturity | Transform or Decline |
|---|---|---|---|---|---|---|
| Annual Revenue | 0m | 0m - 10m | 5m - 100m | 50m - 500m | 100m - 1bn | 10m - n bn |
| No. of people | 1 - 3 | 2 - 20 | 10 - 500 | 200 - 2000 | Thousands | Thousands |
| Character | What are we doing? | Say yes to everything | Can start to say no, focus and get big quick | Dominate a market | Optimise efficiency | What do we do next? |

As companies grow in size and tackle grander projects, the number of people in the engineering department tends to grow too. As this happens the communication structures within the department will change; this can be loosely modelled by the Dunbar Number.

Aside: The actual numbers of where the boundaries lie are hugely debatable (the original ethnographic animal studies were extrapolated to humans), and the association between brain size, intelligence (what kind of intelligence, there are many), and group size is somewhat dodgy and, shall we say, “of its time”… But under all that is just a very general observation about the different kinds of communication styles that happen in groups of various sizes which shouldn’t get lost. There’s definitely a change in communication style within a community as it grows from “close friends / confidants” to “family” to “clan” etc.

As a company moves into the “Scale-up”, “Growth”, “Maturity” and “Transform” stages the roles of the Principal Engineers and Architects will become more and more valuable. They will be the people that hold the long term knowledge, and have the leadership skills, to open the communication channels and help the teams in the clan to move forward in the same direction.

Summary

  • In the early startup and scaleup stages of a company all engineers are effectively potential Architects simply because they are the ones making all the decisions.
  • As a company grows it will be more and more useful to consider explicit Principal and Architectural roles to: 1. Hire in a wider range of technical experience, 2. Keep open the needed communication channels as the group dynamic changes.

Overall Summary

There are no easy fixes to the problems of delivering value in software engineering. But there are a few principles we can draw out of the above that may help:

  • Principal Engineers can keep tech teams aligned and magnify the value of their output. They will do this by sitting across teams as needed, probably varying project by project.
  • In particular it’s better to create virtual teams with e.g. a Principal Engineer and a Product Manager, who can shepherd a project, than it is to deal with the fallout of a disbanded full end-to-end team.
  • Both Enterprise Architecture and Platform Engineering need to take on an enabling rather than controlling role. In fact this is a good general principle for all leadership roles.
  • Standards need to pave the road rather than be either restrictions or empty platitudes.
  • A platform engineering team should make the right building blocks available and the easiest to use - where “right” is agreed across the Principal Engineers and Architects - while still enabling teams to explore novel or edge-case solutions.
  • Command and Control is a dysfunctional leadership style, better to “Inspire, Align and Empower” (alignment and control are very different things).

If you have any comments or feedback about this article, please use the LinkedIn thread


  1. https://handbook.gitlab.com/job-families/engineering/development/management/principal-engineer/
    https://www.linkedin.com/pulse/what-principal-engineer-anyway-douglas-w-arcuri/
    https://engineering-manager.com/2020-03-21/what-is-principal-engineer-role

  2. The phrase “single responsibility principle” has become over-emphasised and quite damaging in some circumstances - search: single responsibility principle considered harmful

This page and its contents are copyright © 2024, Ian Rogers. Theme derived from Prav