Posts on Locally Optimal

Nobody will take care of your career like you will

Thu, 16 May 2024 00:08:09 +0000

The reality as you get more and more senior is your manager will most likely have more reports, and almost certainly have more problems under their scope. In fact it gets to the point where they are almost guaranteed to have at least one enormous burning fire active at any given time, simply due to the size of their org.

To navigate this, you must take the initiative in your career development and communication with managers. Understanding what you want to be optimizing for (career growth, work/life balance, new skills, etc.) and clearly communicating your desires is crucial. Be proactive in bringing information and suggestions to your manager so they can focus their limited time and energy on solving problems for you instead of guessing what you need.

You also (at least in my experience) likely have to admit that your manager or a single perfect mentor is not going to be capable of providing all you need. Building a rich network of peers and role models helps in understanding opportunities and challenges. You can also assemble a board of advisors (or a Voltron, depending on your preferred analogy) to help develop very specific skills from multiple people where no single person can provide them all.

Building a detailed mental model of the company's landscape can aid explicit career planning. Are there teams, projects, or areas that you’d like to steer your career toward? Treat your career like an enormous (hopefully very seaworthy) ship – the best way to avoid icebergs and end up where you want to be going is to notice issues early and start turning the ship, always keeping in mind the big picture of where you’re headed. Even the very best managers have limited time for your individual development – don’t outsource being the captain of your own career.

Guard your time jealously, use most of it for high leverage work

Thu, 16 May 2024 00:07:42 +0000

As you become more senior in your career, demands on your time will outstrip the hours you’re willing to work, no matter what. Your time, focus, and energy are perhaps the most immovable constraints on your output. There is a very real ceiling on the useful output hours you get per week – while you can raise the raw quantity of hours easily enough, it almost certainly comes at a cost to quality in the short term and to sustainable energy in the longer term. So instead of adding hours, focus on making a fixed set of hours really count.

It's imperative for you to be an active owner/director of what you spend your time working on. This includes what to set aside, as not all demands can be met. As you get more senior, you’ll increasingly need to be an active participant in helping your manager and team prioritize what is right for you.

Focus on identifying the top one to three big efforts you want to consume the bulk of your project time. You should be in the details of these things, as is appropriate for your level, of course. Utilize your calendar effectively to ensure your time allocation matches agreed priorities and be ruthless in prioritizing essential participation and eliminating unnecessary commitments.

If you need or want frameworks, you can try the Eisenhower Matrix or LNO, but recognize that all frameworks are useful only insofar as they actually work for you. At the end of the day you should be using whatever tools in the toolbox get the best outcomes for your working style.

Finally, realize this is not a one-time intervention; it’s an ongoing process. Check in with your manager every so often and confirm these are actually the most important problems. I find my own “top 3” tend to change roughly once per quarter. The single best predictor of my priorities needing a reassessment is feeling overwhelmed or like I'm working on the wrong projects. You should expect and plan for periodic reassessments of your time distribution as priorities shift.

Take a moment to identify your current top three priorities. Does your calendar reflect them? Are they still aligned with your biggest impact areas? Schedule a 15-minute review with yourself this week and consider bringing any findings to your manager to discuss.

Ask your manager for 5 growth areas, so they can pick 1-2 that actually work

Wed, 20 Mar 2024 04:56:35 +0000

I always try to remember two golden rules of managers – they’re often busy juggling many problems unrelated to you, and they cannot read your mind.

I have heard (and have also sometimes personally felt!) complaints like “my manager isn’t offering me the opportunities I want” or “my manager doesn’t get what I care about” or “I don’t care about my job” or “this company doesn’t support my growth”. These are both 1/ valid feelings I don’t want to pretend aren’t real and 2/ frequently unhelpful framings. They are also something you can directly improve with a little bit of honest communication.

Your manager is not a mind reader. They are probably burdened (like you are!) with a metric ton of other problems. Your career growth doesn’t have to be more important than the severe incident they’re dealing with, but it sure needs to be loud enough to be noticed over the general din and busyness of an average month at your company. Tell your manager what you want. If they’re a phenomenal people manager, have the spare energy, and feel like taking a fun risk in a 1-1 sometime, they might call you out unprompted and guess at what would make you happier. But in all likelihood, they’re exhausted, they're dealing with something outside work with their kids/partner/parents, and/or they just aren’t sure enough to justify guessing at what you need.

If you want to change your work or grow your skills, set aside explicit time in a 1-1 and give your manager a too-large set of things you are interested in improving/practicing/whatever. I like to say “5 skills I’d be willing to learn” as a rule of thumb. The point is your manager has their own universe of constraints they’re juggling – people, roadmaps, project deadlines, planning overhead, and a bunch of other reports’ interests to balance. Inevitably your list of 5 growth areas will not all be realistic right now, and that’s okay. You come with several options, your boss counters with the subset of those that actually match with constraints of reality, and everyone leaves with a better understanding of your interests and a mutually beneficial outcome.

Bonus points for making sure you tie the skills to specific behaviors or roles. Recently I did this for myself – I wanted to absorb and learn more of Stripe’s secret sauce for building and releasing great product APIs, and concretely proposed that I get more involved in the approval process for APIs. Being specific makes it that much easier for your manager to understand how to help you, and has the bonus of you getting to name your (ideal) instantiation of the skill.

A company (at least a growing/expanding company like Stripe) is always looking for people to step up and handle more. If that’s interesting to you, help your manager help you by explicitly listing what you’d like, giving a few extra options for flexibility, and accepting that your manager has their own constraints to respect. The outcome should at minimum be a better “mental model of what you want” for your boss, and very likely a good project/role match for you.

Make sure your foundations can support your high velocity product growth

Sat, 02 Mar 2024 20:10:30 +0000

You own a system. You are obsessed with your users. You are building value as fast as you can for them, your laggard dependencies be damned. Where dependencies can't keep up, you solve problems yourself!

You have seen the dire warnings from stagnant, slow bureaucracies. You've seen them calcify because they were too tied to other parts of the business. And you know that you won't end up that way. "Decoupled" is your watchword and you're determined to stay nimble however you can.

Maybe you haven't found PMF yet, so you need to keep experimenting quickly. Or you have and things are all the more dire because you're barely able to keep scaling ahead of success better than your wildest dreams. Either way, you are shipping new features as fast as you can, because feature quantity has a quality all its own and you need to keep learning/stay ahead/make money.

There's just one problem. You've become dangerously overextended. You're building the 9th gorgeous wing of a castle on top of increasingly unstable sands. Or your product's trees are growing wildly and successfully but getting top heavy, and your shallow root systems increasingly can't keep you upright. You know what else is awesomely decoupled, moving rapidly, and completely independent of pesky constraints? Wile E. Coyote suspended 200ft above a gorge with nothing under his feet.

Instead, you must think equally about your velocity and your overhang. Worry about that brick wall you're headed toward at full steam -- don't just get excited for how fast you're moving. Think early about the length of your runway, and talk to your dependencies so they can hopefully anticipate your needs.

Code, by volume, is tech debt. If you can build atop a well aligned dependency, you get to recruit its developers to build features (and do ongoing maintenance) for free. Their work is your value and leverage. If you lose this alignment, you simply now own more code. It will eventually hurt even if you can't yet see it today.

Watch for your progress vs foundations. Work with dependency teams to make sure the runway stays ahead of your features. Bite the bullet and admit when the forest is about to topple – slow new features, solidify the foundations, and get back into a good place. Leverage is absolutely magical when your dependencies are squarely supporting the direction you're headed. If that's not the case right now -- look at it honestly, figure out how to address it, and go ensure the foundations get fixed.

Know your neighbors

Sat, 02 Mar 2024 20:05:21 +0000

It's not enough to know your own scope, your own team, and your immediate surroundings. Your work inherently succeeds or fails based on the first (and even second) degree removed participants – customers of your APIs, stakeholders in the invariants you ensure, the hard and soft dependencies you build atop, and the broader ecosystem you thrive or wither within.

Do you know who your neighbors are? Have you met them, talked to them, built rapport?

Can you ask them for urgent favors, or lean on them to help you out of a jam, or trust that they will assume best intentions from that rushed, slightly-too-harsh Slack message you sent in haste? Do you know what matters to them, where you could improve things for them, and what their most important request is for your team/system?

If you aren't sure who your neighbors are or where things stand, put in the time now to fix that while the pressure is relatively low. Relationships (in real life and at work) live or die in hard times based on the investment you put in during the other seasons. There's no time like now to meet that coworker one team over, have a casual curiosity-first conversation about what is going well/poorly for them, and talk about something you have in common.

You never know when it will matter until you know what matters to them. Start learning.

Building consensus iteratively with feedback spirals

Thu, 11 Jul 2019 07:00:00 +0000

Building true consensus is hard work! By its nature, it implies you’ve taken time to hear lots of opinions, convince people on the margins by addressing their concerns, and probably spent some of your time discussing with those who will never agree with you. Often the goal of getting true consensus and buy-in from a broad group of people can feel impossibly hard.

One of the paths toward consensus is eliciting and incorporating feedback regularly. Unfortunately, many projects do this too little, and too late. A classic failure mode is gathering feedback after decisions have been made and executed on, like a long argument in code review about the fundamental architectural choices underpinning the whole design. This produces wasted effort, frustrated people, and overwhelmed reviewers.

Wouldn’t it be better if we treated feedback like we do agile engineering projects — getting feedback early and often instead of in a huge big-bang at the final approval step?

Gathering feedback iteratively

So this is my shot at trying to codify how I approach gathering iterative, small feedback for my own controversial architectures or decisions.

The main mistake we’re trying to avoid is waiting too long for feedback. It is a natural mistake — releasing things for feedback early feels risky! It’s easier for your reviewers to not understand (“why is this so messy?” or “this is incomplete, I can’t review?”) — you must work harder to share the same context and expectations.

If you haven’t seen it before, the concept of “30% feedback” proposes getting earlier feedback on a messier version of your final idea. I really like this, but the most obvious problem is it only splits your feedback into two big chunks (the 30% mark and the final 100% mark). I want to use that same reasoning, but make it fundamentally continuous and let myself regularly check in for feedback whenever I need it along the path from raw idea to final consensus.

Without any further ado, here’s our approach to building consensus with iterative feedback:

Start small with deciding on the basic thrust of your idea and what problem it solves. Share with a core group of project peers or respected advisors. Get feedback on the clarity of problem statement and your directional plan of attack.
Expand the feedback group to include your planned final stakeholders and decision makers. This is the 30% feedback point. Ask for preliminary agreement that this solution looks promising. Get feedback on the shape of your solution and anything you’ve de-scoped from the problem.
Reach the widest point of feedback. Be pretty sure of the details of what you want to do and why. Offer a chance for feedback to everyone, especially those who won’t be in the final decision making group.
Start shrinking the feedback circle toward the core group of decision makers, making fewer, smaller changes, and finalizing the plan.
Finally the decision makers make the final go/no-go call, based on the proposal crystallized from all this feedback.

With this pool of reviewers first expanding and then shrinking, you can imagine why I call this “feedback spirals” 😄

Let’s dive into each of these stages in more depth.

Finding your MVP and getting directional feedback

There’s great literature (The Lean Startup, any book about agile, etc.) endorsing early feedback, cheap MVPs, and iterative work for engineering projects, so let’s steal those concepts.

Our goal is twofold: discover the simplest statement of the problem we’re trying to tackle and form a directional opinion about how we’re going to solve it.

I recommend using a structured format and keeping your first approach small. I like the Amazon 2 pager in this case, with a focus on keeping my proposed solution high level. The written format is especially good at forcing a clear problem statement up front and offering a concrete artifact of the plan as you go along.

The MVP I like is a directional position on how we’re going to solve our problem. This should not be a concrete, detailed proposal at this point, but rather a “big idea” or directional solution. If a later proposal will be “30% done”, this is more like 10%. Our goal is to frame the very general target of our solution, of course with the knowledge that future investigation may well change our opinion on the best approach.

Once we are satisfied with our initial 2 pager, we get the very earliest feedback! I recommend the first reviewers be a small group of supporters and interested parties with enough experience to give loose, directional feedback about an idea that is barely even defined. Use this feedback to build a consensus direction for the start of your project.

Expanding to 30% feedback and beyond

After we’ve settled on the problem we’re trying to solve and a directional solution, it is time for us to iteratively work out the details, getting feedback as we go.

The 30% mark is arbitrary here. Remember, the value of our fully iterated approach is that there isn’t a single magical point where we need to get our early feedback exactly right. Instead treat it as a continuum, where we’re regularly expanding whom we get feedback from and hopefully also regularly reducing how much change this feedback causes in our project’s design.

Our goal is a narrowing of changes, where over time we can be building confidence in our project’s proposal. Though you might worry about all this feedback slowing a project down, in practice I’ve found it actually increases how quickly I can safely move, secure in the belief that there’s growing momentum behind what I’m proposing.

Reaching the point of maximum feedback

At this point, your proposal should be mostly unchanging. I aim now for the largest circle of reviewers, including the majority of the engineers or other teams who will be most impacted by the change. This offers me the most chances for hearing novel feedback and lets the whole team buy into the idea. My goal here is not to get review from “experts only” — but to now confirm with the larger team that the idea makes sense and will work. Often this is where you’ll discover last minute details that seemed fine in the abstract but don’t work in the detail.

This is not only the largest reviewer group, it’s also hopefully the last time our project’s actual details change. From here on out we’ll be moving from building consensus to acting on the final result of that consensus.

Making a decision and committing to it

Now, after all that hard work, we return to a small group to make the final call. I like this group to contain a lot of the original 10–30% reviewers, since they’ve had the most context over the lifetime of the design. The upside of this process is we’ve formed a proposal that includes feedback from a huge diversity of sources, and we’ve gotten repeated, ongoing consensus as we went along.

In a more traditional, waterfall approach to project decisions, this would be the moment of truth where you throw the project to reviewers and cross your fingers. Some feedback would be minor, but some would second-guess the very underpinnings of your solution. Reviewers in this situation can slip into a role that is more antagonist than collaborator, bringing objections that are too fundamental to be solved with any change except going back to the drawing board.

However that’s the beauty of our iterative approach. The final call is made on the strength of layered feedback at every step of the process, including an original direction set and ratified by at least some members of the final decision committee. The goal is always to discover disagreement or fundamental concerns as early as possible.

Put another way, I strongly believe that every final decision point should be as boring and pre-determined as possible. Be skeptical of long periods without feedback, and of a decision process that produces surprises at the end.

In conclusion (what does this get us)

Consensus is incredibly valuable, but it’s also hard won. A smooth feedback process gives you the best of both worlds — smooth momentum that isn’t delayed by arguing over proposal fundamentals and a final decision that feels effortless and widely agreed upon.

To get there, avoid “design everything up front” quagmires by focusing on a problem statement and directional MVP and immediately getting feedback from a core group.

Build regular consensus up gradually over time, growing confidence that your idea makes sense in the concrete, and broadening your feedback circle to build consensus wider and watch for mistakes others might catch.

As feedback starts to solidify behind a consistent solution, start shrinking the feedback circle. Watch to make sure the magnitude of changes is reducing regularly, and build confidence that this is a good enough “first approach” to commit to executing.

Finally return to a relatively small decision making group you trust to make the final call. Running this process in the open and including feedback from a large group balances transparency for big decisions with an efficient decision process that doesn’t devolve into design-by-committee. Document your final decision and the justification for it in public, so those not in the deciding group can follow your reasoning.

In the end, hopefully you get the best of all worlds with excellent decisions and easy consensus. There’s a reason we prefer agile projects over waterfall ones — your decision making process deserves the same. Try out feedback spirals and let me know how well it works for you, or what process you’ve used to avoid these same pitfalls.

Originally published at http://www.locallyoptimal.com on July 11, 2019.

Delegating safely and successfully

Thu, 28 Mar 2019 00:00:00 +0000

At Yelp, I’ve been in a role we call Group Technical Lead (GTL) for a little over a year now. In short, it involves being the technical (aka non-people-manager) leader of a technical space that spans teams. The closest industry standard analog is probably Senior Staff Engineer. I worked in 2017–2018 as the GTL of Yelp’s commerce platform — payments and billing infrastructure that powers food delivery, advertisements, and other paid products. However this year I’ve just transitioned to serving in the same GTL role for Yelp’s ad platform. This is a pretty dramatic shift in teams and results in me owning a very different set of tasks.

One of the obvious changes caused by me leaving Commerce is the work I used to do needs an owner! The one most at the front of my mind is the group’s technical roadmap for the next couple years, but there are a number of other responsibilities that were very central to my job description and are left unowned in my absence.

There are strong technical leaders staying in Commerce who I want to grow into my old role(s), but there’s a fair concern that they might not be ready to immediately perform the same function. This comes up a lot even when you aren’t switching teams. As you regularly acquire new, harder tasks the old work you used to do has to go somewhere. The question is how to delegate work you used to do while setting its new owner up for success.

Delegating a task doesn’t imply zero involvement

A normal fear whenever you try to delegate something you used to own is that whoever you give it to might not know how to do it well. This is even more common when (like for me right now) you’re delegating a task that wasn’t easy for you down to someone who has less experience doing it than you.

A common (but bad!) response to this is to just not delegate your task. This results in you getting overloaded (more work always arrives), the people around you never getting challenging opportunities to grow, and silos of knowledge where expertise doesn’t get shared.

I like to hack around this by frequently reminding myself that delegating a task doesn’t need to imply having zero involvement! The RACI model encourages thinking of four possible ways to be involved in a project:

Responsible: You do the work to complete the task
Accountable: You are ultimately answerable for the correct and thorough completion of the deliverable or task
Consulted: You are consulted for your expert opinions when needed
Informed: You just hear about the task’s progress (without blocking any decisions)

Partial delegation via RACI

Thinking about this, you can see that I can hand off Accountability for the Commerce Roadmap while still helping (lightly Responsible), and definitely offering lots of Consulting help. This lets me not lead the effort, but still be around a lot for advice and keep an eye on any worst-case outcomes. My normal transition from full ownership to full delegation looks like this over time:

I start fully Accountable, and probably Responsible too.
I involve my replacement first, making them also co-Responsible for the task. We focus on talking through and pairing up for any work I do in my Accountable capacity.
Next I delegate Accountable, but stay Consulted for sure and possibly Responsible if needed. This is the training wheels period — I’m still very involved and can spot mistakes, but I’m reducing the choices I make and expecting my replacement to be more self sufficient.
I try to let go of Responsible once I’m sure my day-to-day involvement isn’t necessary, but stay Consulted. This is easiest if you slowly reduce your involvement over a period of time.
Consulted often lasts for a while, but I aim to make sure the amount of time I’m spending is reducing over time.
And eventually I can drop Consulted and just trust the delegate to handle it completely. If I long term need to care about the task, I can choose to stay Informed.
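
For readers who think better in code, here is a minimal sketch of the transition above as data. It is only an illustration under assumptions: the stage names, the `HANDOFF_STAGES` table, and the `accountable_owner` helper are hypothetical, and the delegate's roles at each stage are my reading of the steps rather than anything RACI itself prescribes. The one rule it checks is that exactly one party is Accountable at every stage.

```python
from enum import Enum


class Role(Enum):
    RESPONSIBLE = "R"
    ACCOUNTABLE = "A"
    CONSULTED = "C"
    INFORMED = "I"


# Hypothetical stage table: (stage name, my roles, delegate's roles).
# The ongoing "Consulted, but shrinking" period is folded into "advisor".
HANDOFF_STAGES = [
    ("full ownership", {Role.ACCOUNTABLE, Role.RESPONSIBLE}, set()),
    ("pairing", {Role.ACCOUNTABLE, Role.RESPONSIBLE}, {Role.RESPONSIBLE}),
    ("training wheels", {Role.CONSULTED, Role.RESPONSIBLE},
     {Role.ACCOUNTABLE, Role.RESPONSIBLE}),
    ("advisor", {Role.CONSULTED}, {Role.ACCOUNTABLE, Role.RESPONSIBLE}),
    ("fully delegated", {Role.INFORMED}, {Role.ACCOUNTABLE, Role.RESPONSIBLE}),
]


def accountable_owner(mine, theirs):
    """Return who is Accountable at a stage; exactly one party should be."""
    holders = [
        name
        for name, roles in (("me", mine), ("delegate", theirs))
        if Role.ACCOUNTABLE in roles
    ]
    if len(holders) != 1:
        raise ValueError(f"expected exactly one Accountable party, got {holders}")
    return holders[0]


def format_roles(roles):
    """Render a role set as sorted letters, or '-' when there is no involvement."""
    return "".join(sorted(role.value for role in roles)) or "-"


def describe(stages):
    """Build one summary line per stage of the handoff."""
    return [
        f"{name}: me={format_roles(mine)} delegate={format_roles(theirs)} "
        f"(accountable: {accountable_owner(mine, theirs)})"
        for name, mine, theirs in stages
    ]


if __name__ == "__main__":
    for line in describe(HANDOFF_STAGES):
        print(line)
```

The point of writing it down this way is that the handoff becomes a sequence of small, checkable steps: Accountability moves exactly once, and my own role set only ever shrinks after that.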

Even though using acronyms runs the risk of making me start wearing a suit to work and saying SYNERGY a lot, I really like RACI as a way of helping me remember delegation isn’t a big-bang binary choice. If you find this useful, you might like this talk on surviving overload at work, where I cover it along with other ideas. It’s certainly feeling very relevant to me maintaining sanity during this team transition.

Originally published at www.locallyoptimal.com on March 28, 2019.

If it isn’t scheduled, it won’t happen

Tue, 26 Mar 2019 00:00:00 +0000

I’ve had a pretty good run of writing something (internally, for Yelp) every week for a while now, averaging ~3 posts per month since August. These posts are usually focused on what I’ve been thinking about this past week, but I try to include a dedicated non-status-update section in each one. Writing these sections is the majority of my effort for each post and also the most common reason I fail to publish something — they’re hard!

When writing isn’t scheduled….

Lately I’ve noticed a bad pattern (including this last Friday) that goes something like this:

My week is a little crazy and I feel behind
I don’t write the weekly post until Friday
Friday morning quickly fills up with all the other things I’m behind on
The post either happens very late Friday, or I write it on the weekend, or it doesn’t happen at all that week

This isn’t occurring every week, but when I miss a post it’s nearly always due to this sequence of problems. I’ve given a talk at PyCon with a variety of ideas for not getting overloaded, and part of that strategy is to identify important work and prioritize it explicitly.

Parkinson’s Law and the value of explicitly blocked time

Parkinson’s Law says “work expands to fill the time available” and the only way I know to protect tasks against encroachment by other tasks is proactively setting aside time.

I strongly encourage explicitly blocking time for the most important work you want to do, and scheduling it up front where you’re forced to work on it before anything else can claim your limited time. Leo Babauta advocates regularly identifying your most important tasks (MITs) and attacking them immediately and directly, which is another way to avoid unintentionally prioritizing the tiny “junk food” tasks that are easy to crank out but relatively unimportant.

For my writing, this means no “squeezing writing in after work” or “when I have free time on a Saturday”, but blocking dedicated time (1 hour a week for now, Thursday mornings) in my calendar. Inevitably I’m tired on a weeknight or I have competing life plans on a weekend (like spending quality dog park time with a certain puppy).

Applying it in practice

So I’m going to try setting aside time for writing less informally and more explicitly. Major changes:

Set aside 1 hour Thursday mornings
Check in after another couple of weeks and see if this is enough time or I need to adjust the duration. Maybe try one longer block versus more, shorter blocks to see which works best.
Build a feedback loop of making sure every month or so that whatever setup I have is working (am I actually writing and publishing regularly?)

The goal is to also make a point of more regularly writing here in a public and visible way. I’ve often felt frustrated at the lack of technical leadership writing on the web — time to make sure mine is generally available at least.

Originally published at www.locallyoptimal.com on March 26, 2019.

Publish independently and publish often

Sun, 24 Mar 2019 00:00:00 +0000

I got sideswiped by a confluence of factors that finally convinced me to resurrect a decent static site generator (hi Pelican!) and bring this blog back from the relative dead.

In fast succession: Medium finally locked all useful distribution they do behind an even stronger paywall, I read this article by Fred Wilson on the value of being self-sufficient, and the one and only patio11 kindly responded to some of my questions about how to bootstrap a fledgling website no one visits out of the cold start problem.

I’m going to primarily be writing on my own blog (e.g. this post you're reading now), syndicating posts across to Medium for at least a bit, and seeing how it goes.

A new site generator

I got fed up with fighting a packaging ecosystem I didn’t know well in a language I didn’t understand, so I picked this up and took it from Octopress to Pelican. So far there seems to be actively less magic (mostly because I can probably read Makefiles and Python much more natively) and it was easy enough to port.

Credit to an old friend’s blog post showing up unexpectedly in Google results while I tried to work through the port 😄

Publish independently

On both Patrick’s and Fred’s advice, I’m going to try and have this website be the first publishing platform and, where it seems useful, I can syndicate out elsewhere. Seems that importing content into Medium w/ canonical links isn’t too hard, so I got that going for me. In some ways it seems a shame to lose the very small amount of momentum I had on Medium, but I suppose that makes it cheaper to do now than later.

Publish often

If history is any indication, this will be the hard part. I’ve had a pretty good run publishing nearly-weekly posts internal to Yelp in the last 3 months or so, but I’ve been quite awful at putting them out on the greater internet. That seems like a shame in retrospect, because I’ve often been frustrated at the extreme rarity of any decent writing on technical leadership for non-managers. Except this week of course — see Jessie Frazelle’s excellent post on distinguished engineers which has immediately become my new favorite post describing the holistic set of skills behind excellent senior engineers.

And to better defend my ability to publish often, there’s probably going to be a bias here to just text and links for a while. Something I can crank out in a markdown editor without worrying about any technical issues and throw onto the website without effort.

So here’s to nothing?

Wish me luck — we’ll see if a better generator, some reaffirmed intent, and a coat of new paint do the trick.

Ask the Tech Lead: How should I approach simplifying a complex system?

Fri, 14 Dec 2018 01:08:01 +0000

I’m writing a series of these posts, discussing the unwritten advice for excelling in highly technical leadership and tackling some of the hardest questions I’ve faced in my time as a lead engineer for teams and groups at Yelp.

The question: Why is my code so complex? How can I fix it?

My team owns a pile of code that has a bit of a reputation for being extra confusing and hard to work with. We often experience our Product Managers rattling off “simple” feature changes that should be easy but we know would take months of effort to execute on. I want to simplify the system but it’s a mess and I’m not sure where to realistically start.

How do I identify what parts of my system are essential complexity vs accidental complexity? And once identified, how do I remove accidental complexity?

The solution: Learn more, gain context, find opportunity, then execute

If there’s one lesson I’ve learned over and over again, it’s that complexity is the death of a software system.

Sometimes the complexity is hard to avoid (IAM permissions are horribly painful, but also extremely expressive and powerful!) but often the complexity is accidental, coming from the normal wear and tear of time, the organic evolution of requirements, and (most frequently) a lot of people making good local choices in one part of the system while failing to contain the whole of the complexity.

A large part of my job as a technical lead is applying wise, targeted back-pressure to contain the increasing complexity of the software my group produces. By default, all systems tend toward a tangled mess of code that nobody understands.

I get excited when I see a project proposal where there’s a huge gap between the complexity to fully explain the feature (it might take an expert ~10 minutes to give you all the details) and the complexity of the underlying systems of code (I’ve seen cases where all the experts in a single room couldn’t collectively give a coherent explanation of the current implementation no matter how much time they had).

An engineering lead’s role is to ask (pointedly and repeatedly): “why can’t this software be as simple to own and operate as it is to explain?”.

Learn the “why” behind your system’s complexity

The answer is nearly always extremely interesting and informative — get curious! A few reasons for complexity I’ve heard before:

It integrates with $ANCIENT_SYSTEM and that system is very confusing and gnarled. (How can we either not rely on that system, or simplify the interaction model between them? Lots of complexity comes from the naive or outdated integration seam between two systems. Does an interface need to be rewritten to hide its implementation details?)
I don’t know why it’s so complex. (Ah! Probably our complete lack of understanding is hiding some opportunities for simplifications. Dig deeper and ask “why” recursively until you get some hard lessons. Low hanging fruit often lives in the shadow of lack of knowledge)
It is complex because this monolithic codebase is big and terrible. (Now we’re getting down to brass tacks. What is keeping this feature in the monolith? Is there a game plan to move it out we can help execute on with this project? Is there a way we can partially migrate in the right direction and actually speed up the delivery of the product win we’re building toward? Get curious!)
My team’s part of it is simple, but $OTHER_TEAM’s half is a mess. (Be very cautious here. Sometimes your team’s side is simple because you abandoned complexity to the other team’s half. Sometimes the interface between teams can be easily reworked to reduce complexity for both sides, if only you discussed it! Dig deep, learn more, stay curious, and fight cynicism.)

Why curiosity works

You might be noting that “curiosity” is a key part of how I discover and dispel unnecessary complexity.

One non-technical reason I like staying curious is that it helps me avoid the danger of the Fundamental Attribution Error: if I purposefully react with curiosity when I don’t understand, I can counteract the normal human temptation to be “sure” the other person is wrong (or worse, stupid!).

I try to avoid imitating Homer in my professional life

Without some sort of active strategy to prevent it, it’s disturbingly easy for us all to fall prey to the idea that we’re the only ones who have thought through a problem “properly”. Reminding myself to stay curious forces me to actively engage with people who disagree with me, dig into why they feel that way, and very likely learn something important along the way.

It’s not just about what I’m learning though. Being authentically interested and open to someone else’s opinions makes the projects you’re a part of more welcoming and inclusive for everyone involved. In the most extreme cases, being closed off or hostile to dissenting views can completely warp the way people interact with you. Actively being curious instead of dismissive of other views does wonders for encouraging contribution from everyone involved in a project. Encourage participation!

Curiosity also helps me build a mental model of the “why” behind the systems we have. Telling the difference between “historical quirks” and “crucial design or product decisions” is incredibly important for navigating the complexity of large production systems, and the only way to know which is which is to dig deep and ask questions when you start projects.

No silver bullet

There’s no single bit of advice for diagnosing and fixing unnecessary complexity in systems. But I believe very strongly that in nearly all cases, the best thing you can do to improve the state of your code is to stay curious and learn more.

I’ve lost count of the number of times a deep dive into a system (and the people who use it!) has revealed opportunity for improvement. Applying this approach for the last couple years has produced multiple projects where we took years of accumulated complexity and reduced it to something much closer to what the feature actually required.

Engineering thrives when we can produce systems that are as close to their essential complexity as possible. Shine a light in the dark corners of your codebase and you’ll be amazed at the volume of easy-to-fix technical debt you find.

If you found this interesting, follow Scott Triglia here or on Twitter (https://twitter.com/scott_triglia).

Ask a Tech Lead: I have to make a technical decision but I can’t know the right answer

Mon, 26 Nov 2018 22:31:48 +0000

I’m hoping to make this a series of posts, discussing the unwritten advice for excelling in highly technical leadership. In the spirit of Camille Fournier’s excellent series, I’ll tackle some of the hardest questions I’ve heard from coworkers and mentees in my time as a lead engineer for teams and groups at Yelp. Huge credit to Jonathan Maltz for help refining this particular post and the overall format.

The question: How do I make an impossible technical decision wisely?

I’m struggling with making a choice about the technical direction of an upcoming project. This choice will involve significant expense: multiple engineer-months worth of work either way. I need to make a call, but I don’t feel like I have enough information to guarantee what the right direction is.

What’s the right approach when I need to make a big bet but have little guarantee of the right call in the end?

The Solution: Think like a scientist

Major decisions aren’t easy. There are many reasons a technical choice might feel impossible:

The decision is often between the status quo and some option we have little or no experience with.
Sometimes the best and worst case outcomes are either unknown or have a scary amount of variance.
Sufficiently new or different ideas often imply unknown challenges lurking. What if those unknown challenges are really bad and make it unappealing in hindsight?
New ideas might imply major architecture changes that are outright dangerous. If we fail to control risk, it may not matter how good a choice we make.

The reality is you still have to make impossible decisions, even if the right choice can’t be known until later.

My advice is to work your decision like you would a scientific experiment: deeply invest in learning about the problem, build a hypothesis about the right call, and then, most importantly, propose a series of small, incremental experiments to build confidence that your choice is correct.

As a whole, my approach to these big questions looks like:

1. Frame the big question and take an opinionated stance on the answer based on whatever data is currently available.
2. Come up with an initial experiment to partially vet that stance. It should be quick (~1 quarter) to accomplish, give meaningful directional feedback on whether the opinionated stance is still correct, and hopefully provide engineering leverage to test the next experiment more easily/quickly/safely.
3. Evaluate the experiment’s result. Does it suggest our opinionated answer to the original big question was right? Wrong? Have you learned something that needs to alter your answer to the big question?
4. If needed, update our best current proposal from this feedback.
5. Rinse and repeat steps 2–4 until you’ve answered the big question empirically.

Applying this to a concrete problem

In my own work, I’ve been digging into whether we should invest in using AWS Lambda more within Step Functions. Unfortunately we don’t yet have much experience using Lambdas and this would imply a pretty big technical effort to make the switch. Is it worth it? This is a big, uncertain question, so let’s see the framework in action.

Frame the big question, take a stance

We’ve been using an API-driven pull architecture up until now, but nearly all companies in the industry use Lambdas. Let’s pick the largest change for our framed question: “should we use Lambdas for all of our Step Functions tasks”?

After a little research and my anecdotal survey of industry, I’m sufficiently curious about the alleged development velocity and ease-of-use of Lambdas to take an opinionated stance: “we should use Lambdas by default with Step Functions, with the pull-based architecture only used as a fallback in rare cases”.

Come up with an initial experiment

Our initial experiment should have a few traits: it should be quite small, give us meaningful feedback, and be technically feasible. In particular, the size and scope of our experiments should start very small and grow larger as we gain confidence in our overall hypothesis.

So for our first experiment, I aimed small and replaced a trivial task with a Lambda: all it did was reformat some JSON and log the result. This is incredibly “boring” from a technical perspective, but still required making some important directional choices. Namely:

How do we deploy and monitor lambdas?
Should we use any frameworks?

This line of thought led us to discover a few important hypotheses I wanted to prove/disprove with our first experiment:

Is the Serverless Framework a good tool for building Lambdas?
Can we build a CI/CD pipeline that feels like best-in-class service tooling that actually deploys Lambdas instead of our normal SOA setup?
Are Lambdas performant in production?
In the end, will this change let us get a functional change through the coding lifecycle faster, without sacrificing safety or architectural sanity?

Evaluate the experiment’s result

For my particular experiment, production experimentation suggests the answer to all of these questions is “yes, Lambdas seem to do as well as or better than the status quo”.

Our hypotheses happened to mostly be answering yes/no questions, but this isn’t the only way to measure success. Depending on your particular experiment you may want to check business metrics or even squishier human ideas like “oncall happiness”. The important thing is that you have a clear idea of what you want to learn from your experiment and a clear sense of how you plan to measure that learning once the experiment is live.

Update our overall proposal and iterate

After our first Lambda experiment, the result reinforced the direction of the overall plan, and no major updates were required. Excitingly, we did learn enough to reprioritize the next steps: several problems we thought would require entire future experiments to vet were actually completely solved by Serverless framework plugins. This feedback made us more confident betting on the Serverless Framework as a technology and let us tweak our roadmap to actually be more aggressive on what we tried next.

This feedback loop (experiment, see results, modify hypothesis, experiment again) is crucial to noticing problems and course correcting. Make sure that even your big nebulous bets have a clear way to learn + iterate and you’re comfortable with the cost of the worst-case outcome. This helps protect your project from losing momentum halfway due to an unsuccessful experiment.

Final thoughts

As the problems get messier and more complex, it can help to centralize this process in a single long-running document. It’s a great scratch space for thinking through future experiments and making sure others can follow along with your thought process while you’re at it. My rough format:

What’s the big idea? (add historical context, frame the opportunity)
Who are the owners of the project(s)? (clear ownership helps avoid death by analysis and offers clear points of contact)
What’s my best guess at the right answer? (doesn’t have to be correct in hindsight, just has to be a useful, opinionated stance that implies action)
What experiments have we already done in this effort? What did we learn from them?
What is the next, most valuable experiment we should try?

Taking on a vague project of potentially massive scope without succumbing to blind guesswork is quite challenging. But on the upside, it’s also a rare skill that you’ll use more and more as you become a more senior technical leader.

Luckily for all of us, the key isn’t to magically know the right answer up front. Instead you need to build deep context in the problem space — really live and breathe it — until an initial educated guess forms itself. Your first experiment should provide directional feedback and force you to take theoretical ideas and build them for real. Then each ensuing bet shapes your hypotheses about the overall question, and guides you down the rest of the decision tree for your project.

Building self-healing, observable systems with AWS Step Functions

Thu, 06 Sep 2018 02:18:44 +0000

Modern highly-distributed application architectures solve real problems, but they bring with them novel challenges. Knowing what is happening in production and which part of the system needs fixing is non-trivial, which makes understanding outages challenging even for the most informed experts.

AWS Step Functions is targeted at coordinating the tangle of microservices and serverless functions that make up the cutting edge backend architectures. I’ve been using it at scale for over a year now and have been impressed with the confidence and deep understanding it gives both developers and the on-call engineers who keep it running in production.

Let’s dive into a few techniques for using it more successfully.

Workflows are everywhere and you don’t even know it

The core concept of Step Functions is a single execution of a configured workflow (state machine). Though you may not think of it, many applications are built by assembling a series of tasks together, perhaps passing some context between them. These are workflows!

Step Functions’s core contract involves the developer defining the steps of a workflow and the code to run each step. Then whenever this workflow is executed, AWS manages its state and calls the right step at the right time.

Courtesy of the high quality docs at https://aws.amazon.com/step-functions/

This decomposition lets the developer focus on writing small tasks that do one thing well (where have I heard that before?) and compose them together into production workflows that accomplish business goals. And, in grand tradition, AWS handles the undifferentiated heavy lifting of managing concurrent executions, storing a small in-execution JSON data blob, and knowing which task to invoke when.
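To make that contract concrete, here is a toy sketch in Python of the division of labor: small task functions, a workflow definition that names the order, and a runner that threads a JSON-like blob between steps. Every name here (run_workflow, validate_email, add_user, WORKFLOW) is invented for illustration, and the runner only stands in for the part AWS manages; the real service also handles persistence, concurrency, and retries.

```python
from typing import Callable, Dict, Optional, Tuple

# A task takes the execution's JSON-like blob and returns an updated blob.
Task = Callable[[dict], dict]


def validate_email(data: dict) -> dict:
    # Hypothetical task: checks one thing and records the answer.
    return {**data, "email_valid": "@" in data["email"]}


def add_user(data: dict) -> dict:
    # Hypothetical task: pretends to store the user and records an id.
    return {**data, "user_id": 1}


# Toy workflow definition: state name -> (task, next state or None to end).
WORKFLOW: Dict[str, Tuple[Task, Optional[str]]] = {
    "ValidateEmail": (validate_email, "AddUser"),
    "AddUser": (add_user, None),
}


def run_workflow(workflow, start: str, data: dict):
    """Call each step in order, passing the blob along between steps."""
    state = start
    visited = []
    while state is not None:
        task, next_state = workflow[state]
        data = task(data)
        visited.append(state)
        state = next_state
    return data, visited


result, visited = run_workflow(WORKFLOW, "ValidateEmail", {"email": "a@example.com"})
print(visited)
print(result)
```

The task functions know nothing about ordering or about each other, which is what makes them easy to reuse across workflows.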

Resilient right out of the box

In large production systems, it’s easy to accidentally implement important business processes like Rube Goldberg machines: if any part of the system goes wrong the total result won’t work properly and figuring out what broke after the fact is a total mystery.

One of the immediate benefits of Step Functions is its resiliency and auditability.

Resiliency comes from the first-class error handling in the workflow definition language. You can define retries, exponential backoff, state transitions, and more all in your workflow’s JSON definition. Best of all, this behavior is defined entirely outside the code implementing each step in your workflow — this lets your main code focus on behavior and compose that with your workflow’s retry configuration. Great separation!
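As a rough illustration, a Retry clause for a single state might look like the following, written as a Python dict so it can be dumped to JSON. The field names (ErrorEquals, IntervalSeconds, MaxAttempts, BackoffRate) come from the workflow definition language; the specific values and the retry_delays helper are assumptions made up for this sketch.

```python
import json

# Hypothetical Retry clause for one Task state. The values are illustrative.
retry_clause = [
    {
        "ErrorEquals": ["States.Timeout"],
        "IntervalSeconds": 2,   # wait before the first retry
        "MaxAttempts": 4,       # number of retries before giving up
        "BackoffRate": 2.0,     # multiplier applied to the wait after each retry
    }
]


def retry_delays(retrier: dict) -> list:
    """Seconds waited before each retry under exponential backoff."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


print(json.dumps({"Retry": retry_clause}, indent=2))
print(retry_delays(retry_clause[0]))
```

None of this appears in the task’s own code: the function behind the state just raises an error, and the definition decides what happens next.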

But even in the most well-defined workflow, you’ll inevitably have executions that fail for reasons you don’t immediately understand. Luckily Step Functions provides a built-in audit log of every execution’s progress through your workflow. This includes inputs, outputs, errors, timings, retries, and timeouts for the whole execution. This kind of detailed debugging info makes noticing issues and pinning them down to individual workflow steps trivially easy with no additional effort on your part.

You can see a nice visual of all this in the AWS console, but the same data is available via the GetExecutionHistory API call.
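Here is a minimal sketch of turning that history into a readable trail. It works on a hand-written, heavily trimmed sample shaped like the response’s events list, so it runs without AWS credentials; summarize_history and the sample data are assumptions for illustration. With boto3, the real list would come from the Step Functions client’s get_execution_history call, which is paginated for long executions.

```python
def summarize_history(events: list) -> list:
    """Reduce raw history events to a trail of (event type, detail) pairs."""
    trail = []
    for event in events:
        kind = event["type"]
        if "stateEnteredEventDetails" in event:
            # Which state the execution just entered.
            trail.append((kind, event["stateEnteredEventDetails"]["name"]))
        elif "executionFailedEventDetails" in event:
            # Why the execution as a whole failed.
            details = event["executionFailedEventDetails"]
            trail.append((kind, details.get("error", "unknown")))
    return trail


# Hand-written sample; real events carry more fields (timestamps, ids, input).
sample_events = [
    {"id": 1, "type": "ExecutionStarted"},
    {
        "id": 2,
        "type": "TaskStateEntered",
        "stateEnteredEventDetails": {"name": "ValidateEmail", "input": "{}"},
    },
    {
        "id": 3,
        "type": "ExecutionFailed",
        "executionFailedEventDetails": {"error": "ValidationError", "cause": "bad email"},
    },
]

print(summarize_history(sample_events))
```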

Was that error your fault or mine?

The only problem left is creating a workflow design that clearly communicates all this great context to your oncall engineers so they can rapidly tell whether workflows are failing correctly (bad input maybe?) or due to a bug (your datastore is on the fritz again).

The most successful way I’ve seen this done is by separating workflows into three conceptual kinds of terminal states — success, known failure modes, and unknown failure modes. In service terms, these correspond to 2xx, 4xx, and 5xx HTTP status codes.

Applying these ideas to a workflow for storing user information might look something like this:

Now if we notice input issues (e.g. someone entered garbage for the email address), we transition to the ValidationError state and end the workflow. On the other hand if we get to AddUser and our datastore is unavailable, we might end up in the CriticalError state.

The benefit of this effort is that the terminal state you end up in tells you something about the health of the workflow:

The “success” end states signal the workflow executed cleanly
The “validation error” end state may not inherently signal problems (sometimes input is invalid!), but the rate at which executions end there should be monitored. Often you can establish a baseline frequency for these cases and alert when it exceeds a threshold.
The “critical error” is a problem! Any executions that end up in this state either indicate a significant dependency failing entirely or the presence of a novel bug in your workflow.
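The monitoring rules above can be sketched as a simple rate check over recent executions. The state names, the threshold numbers, and the failing_states helper are all assumptions for illustration; in practice the counts would come from your metrics system rather than an in-memory list.

```python
from collections import Counter

# Hypothetical alerting rules: the share of executions allowed to end in each
# failure state. Validation errors get a baseline; critical errors get none.
THRESHOLDS = {"ValidationError": 0.05, "CriticalError": 0.0}


def failing_states(terminal_states: list, thresholds: dict = THRESHOLDS) -> list:
    """Return the failure states whose share of executions exceeds its threshold."""
    total = len(terminal_states)
    if total == 0:
        return []
    counts = Counter(terminal_states)
    return sorted(
        state
        for state, limit in thresholds.items()
        if counts[state] / total > limit
    )


# 4% validation errors stays under the baseline; one critical error does not.
recent = ["Success"] * 95 + ["ValidationError"] * 4 + ["CriticalError"]
print(failing_states(recent))
```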

If we look at the Step Functions workflow DSL for this, you can see how these transitions look in code:

    "States": {
      "ValidateEmail": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-west-2:1234:function:myfunc",
        "Catch": [
          {
            "ErrorEquals": ["ValidationError", "SerializationError"],
            "Next": "ValidationError",
            "ResultPath": "$.exception_details"
          },
          {
            "ErrorEquals": ["States.ALL"],
            "Next": "CriticalError",
            "ResultPath": "$.exception_details"
          }
        ],
        "Next": "AddUser"
      },
      ..

We create a custom list of errors we consider “validation errors” and catch those and transition to the ValidationError state, and then catch any other errors and head to CriticalError. If we want, a separate Retry clause can let us retry on some errors while immediately transitioning for others. Finally the ResultPath ensures that we’re capturing the exception context and handing it along to the next state without clobbering the rest of the input.
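To see what the catchers and ResultPath do to the data, here is a simplified simulation in Python. It is not how the service is implemented: route_error is a made-up helper, it only understands a single-level path like $.field, and it ignores the error names the service treats specially. Each error name is its own string in the list because names are compared exactly.

```python
CATCHERS = [
    {
        "ErrorEquals": ["ValidationError", "SerializationError"],
        "Next": "ValidationError",
        "ResultPath": "$.exception_details",
    },
    {
        "ErrorEquals": ["States.ALL"],
        "Next": "CriticalError",
        "ResultPath": "$.exception_details",
    },
]


def route_error(state_input: dict, error: str, cause: str, catchers: list = CATCHERS):
    """Pick the first matching catcher and attach the error under its ResultPath."""
    for catcher in catchers:
        names = catcher["ErrorEquals"]
        if error in names or "States.ALL" in names:
            key = catcher["ResultPath"][len("$."):]  # handles only "$.field"
            # The original input survives; the error details are added beside it.
            output = {**state_input, key: {"Error": error, "Cause": cause}}
            return catcher["Next"], output
    raise RuntimeError(f"uncaught error: {error}")


print(route_error({"email": "nope"}, "ValidationError", "bad email"))
print(route_error({"email": "a@example.com"}, "DatastoreDown", "timeout"))
```

The order of the catchers matters: the wildcard goes last so that known failure modes are matched first.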

This pattern of activities gives you crystal clear signals to focus alerting on (too many workflows ending in either kind of failure mode) so you can notice production problems promptly and diagnose them accurately. As a bonus, the unknown failure mode state acts as a very fast feedback loop for any kinds of behaviors you’ve forgotten to build into your main workflow. Simply notice a new type of unknown failure mode, triage how to either eliminate it or make it a known failure mode, rinse and repeat.

Leaving the world a little cleaner than you found it

There’s just one hitch with this plan: some failure modes require cleanup. In a distributed system, we can’t rely on enormous database transactions to keep systems consistent…we have to implement that ourselves.

The most general version of this is often referred to as the Saga Pattern. That pattern holds that for every side effect, you need a compensating “undo” equivalent side effect. You can see how this general case might look in Yan Cui’s Medium post about implementing Saga Patterns with Step Functions.

But in most cases you can simplify this approach and make relatively few steps which undo the most critical side effects your workflow performs.

Whether you go for the minimal or large-scale version of these undo steps, the goal is for your workflow to clean up after itself. Ideally when you reach any terminal state (even “unknown failure mode”!), your data is in a consistent state and doesn’t require additional cleanup.
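The core idea of compensating steps can be sketched in plain Python, outside of Step Functions entirely. Each action is paired with an undo, and a failure unwinds the completed actions in reverse order. The function names and the in-memory “datastore” are hypothetical; in a real workflow the undo steps would be their own states reached through Catch transitions.

```python
def run_with_compensation(steps: list, data: dict) -> dict:
    """Run (action, undo) pairs in order; on failure, undo completed steps in reverse."""
    completed = []
    try:
        for action, undo in steps:
            action(data)
            completed.append(undo)
    except Exception as exc:
        for undo in reversed(completed):
            undo(data)
        data["terminal_state"] = "CriticalError"
        data["error"] = str(exc)
        return data
    data["terminal_state"] = "Success"
    return data


# Hypothetical side effects against an in-memory "datastore".
def create_user(data: dict) -> None:
    data["store"].append("user")


def delete_user(data: dict) -> None:
    data["store"].remove("user")


def send_welcome_email(data: dict) -> None:
    raise RuntimeError("mail service unavailable")


def nothing_to_undo(data: dict) -> None:
    pass


result = run_with_compensation(
    [(create_user, delete_user), (send_welcome_email, nothing_to_undo)],
    {"store": []},
)
print(result)
```

Even though the run ends in a failure state, the store is empty again, which is the property that lets oncall focus on the cause rather than the cleanup.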

This consistency allows your oncall engineers to focus on understanding causes and mitigating damage, even in the worst outages, secure in the knowledge that the workflows have automatically cleaned up any bad data they created.

Iterating to stability in production

So we’ve built a production system that executes our workflow, includes retries+timeouts where we need them, has very well defined failure modes, and can give us clear feedback whenever we need to tweak our design because it isn’t robust to all possible production realities.

Using Step Functions doesn’t magically make your code stable in production, but it does allow you to easily compose your business logic with a platform that handles the resiliency and auditability for you.

If you found this interesting, follow Scott Triglia on Twitter (https://twitter.com/scott_triglia) or his blog at http://www.locallyoptimal.com/.