<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Engineering on Locally Optimal</title><link>http://www.locallyoptimal.com/tags/engineering/</link><description>Recent content in Engineering on Locally Optimal</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><copyright>© Scott Triglia</copyright><lastBuildDate>Sat, 02 Mar 2024 20:10:30 +0000</lastBuildDate><atom:link href="http://www.locallyoptimal.com/tags/engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>Make sure your foundations can support your high velocity product growth</title><link>http://www.locallyoptimal.com/make-sure-your-foundations-can-support-your-high-velocity-product-growth/</link><pubDate>Sat, 02 Mar 2024 20:10:30 +0000</pubDate><guid>http://www.locallyoptimal.com/make-sure-your-foundations-can-support-your-high-velocity-product-growth/</guid><description>&lt;p&gt;You own a system. You are obsessed with your users. You are building value as fast as you can for them, your laggard dependencies be damned. Where dependencies can't keep up, you solve problems yourself!&lt;/p&gt;&lt;p&gt;You have seen the dire warnings from stagnant, slow bureaucracies. You've seen them calcify because they were too tied to other parts of the business. And you know that &lt;em&gt;you&lt;/em&gt; won't end up that way. "Decoupled" is your watchword and you're determined to stay nimble however you can.&lt;/p&gt;&lt;p&gt;Maybe you haven't found PMF yet, so you need to keep experimenting quickly. Or you have, and things are all the more dire because you're barely able to keep scaling ahead of success beyond your wildest dreams. Either way, you are shipping new features as fast as you can, because feature quantity has a quality all its own and you need to keep learning/stay ahead/make money.&lt;/p&gt;&lt;p&gt;There's just one problem. You've become dangerously overextended. 
You're building the 9th gorgeous wing of a castle on top of increasingly unstable sands. Or your product's trees are succeeding wildly but growing top-heavy, and your shallow root systems increasingly can't keep you upright. You know what else is awesomely decoupled, moving rapidly, and completely independent of pesky constraints? Wile E. Coyote suspended 200 ft above a gorge with nothing under his feet.&lt;/p&gt;&lt;p&gt;Instead, you must think equally about your velocity and your overhang. Worry about the brick wall you're racing toward at full steam -- don't just get excited about how fast you're moving. Think early about the length of your runway, and talk to your dependencies so they can anticipate your needs.&lt;/p&gt;&lt;p&gt;Code, by volume, is tech debt. If you can build atop a well-aligned dependency, you get to recruit &lt;em&gt;its&lt;/em&gt; developers to build features (and do ongoing maintenance) for free. Their work is your value and leverage. If you lose this alignment, you simply own more code. It will eventually hurt, even if you can't see it today.&lt;/p&gt;&lt;p&gt;Watch your progress versus your foundations. Work with dependency teams to make sure the runway stays ahead of your features. Bite the bullet and admit when the forest is about to topple -- slow new features, solidify the foundations, and get back into a good place. Leverage is absolutely magical when your dependencies are squarely supporting the direction you're headed. 
If that's not the case right now -- look at it honestly, figure out how to address it, and go ensure the foundations get fixed.&lt;/p&gt;</description></item><item><title>Building self-healing, observable systems with AWS Step Functions</title><link>http://www.locallyoptimal.com/building-self-healing-observable-systems-with-aws-step-functions/</link><pubDate>Thu, 06 Sep 2018 02:18:44 +0000</pubDate><guid>http://www.locallyoptimal.com/building-self-healing-observable-systems-with-aws-step-functions/</guid><description>&lt;p&gt;Modern highly-distributed application architectures solve real problems, but they bring with them novel challenges. Knowing what is happening in production and which part of the system needs fixing is non-trivial, and understanding outages is challenging even for the most informed experts.&lt;/p&gt;&lt;figure class="kg-card kg-embed-card"&gt;&lt;blockquote class="twitter-tweet"&gt;&lt;a href="https://twitter.com/honest_update/status/651897353889259520"&gt;&lt;/a&gt;&lt;/blockquote&gt;&lt;/figure&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/step-functions/" rel="noopener"&gt;AWS Step Functions&lt;/a&gt; is targeted at coordinating the tangle of microservices and serverless functions that make up cutting-edge backend architectures. 
I’ve been &lt;a href="https://www.youtube.com/watch?v=UeYHJISlWgk" rel="noopener"&gt;using it&lt;/a&gt; at &lt;a href="https://engineeringblog.yelp.com/2017/11/breaking-down-the-monolith-with-aws-step-functions.html" rel="noopener"&gt;scale&lt;/a&gt; for over a year now and have been impressed with the confidence and deep understanding it gives both developers and the on-call engineers who keep it running in production.&lt;/p&gt;&lt;p&gt;Let’s dive into a few techniques for using it more successfully.&lt;/p&gt;&lt;hr&gt;&lt;h3 id="workflows-are-everywhere-and-you-don-t-even-know-it"&gt;Workflows are everywhere and you don’t even know it&lt;/h3&gt;&lt;p&gt;The core concept of Step Functions is a single execution of a configured workflow (state machine). Though you may not think of it, many applications are built by assembling a series of tasks together, perhaps passing some context between them. These are workflows!&lt;/p&gt;&lt;p&gt;Step Functions’s core contract involves the developer defining the steps of a workflow and the code to run each step. Then whenever this workflow is executed, AWS manages its state and calls the right step at the right time.&lt;/p&gt;&lt;figure class="kg-card kg-image-card kg-card-hascaption"&gt;&lt;img src="http://www.locallyoptimal.com/images/1-nipakxtacku_li74r1fztw.png" class="kg-image" alt loading="lazy" &gt;&lt;figcaption&gt;Courtesy of the high quality docs at &lt;a href="https://aws.amazon.com/step-functions/" data-href="https://aws.amazon.com/step-functions/" class="markup--anchor markup--figure-anchor" rel="nofollow noopener" target="_blank"&gt;https://aws.amazon.com/step-functions/&lt;/a&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;p&gt;This decomposition lets the developer focus on writing small tasks that do one thing well (&lt;a href="https://en.wikipedia.org/wiki/Unix_philosophy" rel="noopener"&gt;where have I heard that before?&lt;/a&gt;) and compose them together into production workflows that accomplish business goals. 
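&lt;/p&gt;&lt;p&gt;As a rough sketch of that contract (the state names and ARNs here are hypothetical), a workflow definition is just data -- one state per task, chained together by &lt;code&gt;Next&lt;/code&gt;:&lt;/p&gt;

```python
import json

# Hypothetical two-step signup workflow expressed in the Amazon States
# Language. Each "Task" state points at the code that runs one step.
def build_workflow(validate_arn, add_user_arn):
    return {
        "Comment": "Sketch: validate an email, then store the user",
        "StartAt": "ValidateEmail",
        "States": {
            "ValidateEmail": {
                "Type": "Task",
                "Resource": validate_arn,
                "Next": "AddUser",
            },
            "AddUser": {
                "Type": "Task",
                "Resource": add_user_arn,
                "End": True,
            },
        },
    }

definition = build_workflow(
    "arn:aws:lambda:us-west-2:1234:function:validate-email",
    "arn:aws:lambda:us-west-2:1234:function:add-user",
)
print(json.dumps(definition, indent=2))
```

&lt;p&gt;In the real service this JSON is handed over once when the state machine is created, and every execution afterward is AWS calling the right &lt;code&gt;Resource&lt;/code&gt; at the right time.&lt;/p&gt;&lt;p&gt;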
And, in grand tradition, AWS handles the undifferentiated heavy lifting of managing concurrent executions, storing a small in-execution JSON data blob, and knowing which task to invoke when.&lt;/p&gt;&lt;hr&gt;&lt;h3 id="resilient-right-out-of-the-box"&gt;Resilient right out of the box&lt;/h3&gt;&lt;p&gt;In large production systems, it’s easy to accidentally implement important business processes as Rube Goldberg machines: if any part of the system misbehaves, the overall result breaks, and figuring out what went wrong after the fact is a total mystery.&lt;/p&gt;&lt;p&gt;One of the immediate benefits of Step Functions is its resiliency and auditability.&lt;/p&gt;&lt;p&gt;Resiliency comes from the &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html" rel="noopener"&gt;first-class error handling&lt;/a&gt; in the workflow definition language. You can define retries, exponential backoff, state transitions, and more, all in your workflow’s JSON definition. Best of all, this behavior is defined entirely outside the code implementing each step in your workflow -- this lets your main code focus on behavior and compose that with your workflow’s retry configuration. Great separation!&lt;/p&gt;&lt;p&gt;But even in the most well-defined workflow, you’ll inevitably have executions that fail for reasons you don’t immediately understand. Luckily, Step Functions provides a built-in audit log of every execution’s progress through your workflow. This includes inputs, outputs, errors, timings, retries, and timeouts for the whole execution. 
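&lt;/p&gt;&lt;p&gt;To make that concrete, here is a sketch of mining an execution’s history for per-step timings. The event shapes below are simplified stand-ins for what &lt;code&gt;GetExecutionHistory&lt;/code&gt; returns (real responses carry many more fields):&lt;/p&gt;

```python
from datetime import datetime

# Simplified events in the shape returned by GetExecutionHistory.
# Timestamps and state names are hypothetical sample data.
events = [
    {"timestamp": datetime(2018, 9, 6, 2, 0, 0), "type": "TaskStateEntered",
     "stateEnteredEventDetails": {"name": "ValidateEmail"}},
    {"timestamp": datetime(2018, 9, 6, 2, 0, 1), "type": "TaskStateExited",
     "stateExitedEventDetails": {"name": "ValidateEmail"}},
    {"timestamp": datetime(2018, 9, 6, 2, 0, 1), "type": "TaskStateEntered",
     "stateEnteredEventDetails": {"name": "AddUser"}},
    {"timestamp": datetime(2018, 9, 6, 2, 0, 4), "type": "TaskStateExited",
     "stateExitedEventDetails": {"name": "AddUser"}},
]

def step_durations(events):
    """Seconds spent in each state, from entered/exited event pairs."""
    entered, durations = {}, {}
    for e in events:
        if e["type"] == "TaskStateEntered":
            entered[e["stateEnteredEventDetails"]["name"]] = e["timestamp"]
        elif e["type"] == "TaskStateExited":
            name = e["stateExitedEventDetails"]["name"]
            durations[name] = (e["timestamp"] - entered[name]).total_seconds()
    return durations

print(step_durations(events))  # {'ValidateEmail': 1.0, 'AddUser': 3.0}
```

&lt;p&gt;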
This kind of detailed debugging info makes noticing issues and pinning them down to individual workflow steps trivially easy with no additional effort on your part.&lt;/p&gt;&lt;figure class="kg-card kg-image-card"&gt;&lt;img src="http://www.locallyoptimal.com/images/1-q9clgs1za9glj80mcp8fwg.png" class="kg-image" alt loading="lazy" &gt;&lt;/figure&gt;&lt;p&gt;You can see a nice visual of all this in the AWS console, but the same data is available via the &lt;a href="https://docs.aws.amazon.com/step-functions/latest/apireference/API_GetExecutionHistory.html" rel="noopener"&gt;GetExecutionHistory API call&lt;/a&gt;.&lt;/p&gt;&lt;hr&gt;&lt;h3 id="was-that-error-your-fault-or-mine"&gt;Was that error your fault or mine?&lt;/h3&gt;&lt;p&gt;The only problem left is creating a workflow design that clearly communicates all this great context to your oncall engineers so they can rapidly tell whether workflows are failing as designed (bad input, maybe?) or failing due to a bug (your datastore is on the fritz again).&lt;/p&gt;&lt;p&gt;The most successful way I’ve seen this done is by separating workflows into three conceptual kinds of terminal states -- success, known failure modes, and unknown failure modes. In service terms, these correspond to 2xx, 4xx, and 5xx HTTP status codes.&lt;/p&gt;&lt;p&gt;Applying these ideas to a workflow for storing user information might look something like this:&lt;/p&gt;&lt;figure class="kg-card kg-image-card"&gt;&lt;img src="http://www.locallyoptimal.com/images/1-lfwal01xkoul5ubm6rtjjq.png" class="kg-image" alt loading="lazy" &gt;&lt;/figure&gt;&lt;p&gt;Now if we notice input issues (e.g. someone entered garbage for the email address), we transition to the &lt;code&gt;ValidationError&lt;/code&gt; state and end the workflow. 
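&lt;/p&gt;&lt;p&gt;The task code behind that step can stay tiny. A hypothetical sketch of the &lt;code&gt;ValidateEmail&lt;/code&gt; handler -- raising a named exception is what lets a &lt;code&gt;Catch&lt;/code&gt; clause match on “ValidationError” and route to that terminal state:&lt;/p&gt;

```python
import re

class ValidationError(Exception):
    """Bad input: routes the execution to the ValidationError state."""

def validate_email_handler(event, context=None):
    # Deliberately naive email check -- just enough to demo the routing.
    email = event.get("email", "")
    if not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", email):
        raise ValidationError(f"not a valid email: {email!r}")
    return event  # pass the JSON context through to the next state

print(validate_email_handler({"email": "scott@example.com"}))  # {'email': 'scott@example.com'}
```

&lt;p&gt;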
On the other hand, if we get to &lt;code&gt;AddUser&lt;/code&gt; and our datastore is unavailable, we might end up in the &lt;code&gt;CriticalError&lt;/code&gt; state.&lt;/p&gt;&lt;p&gt;The benefit of this effort is that the terminal state you end up in tells you something about the health of the workflow:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;The “success” end states signal the workflow executed cleanly&lt;/li&gt;&lt;li&gt;The “validation error” end state may not inherently signal problems (sometimes input is invalid!), but the rate at which it’s reached should be monitored. Often you can establish a baseline frequency for these cases and alert when a threshold is exceeded.&lt;/li&gt;&lt;li&gt;The “critical error” end state is a problem! Any executions that end up in this state indicate either a significant dependency failing entirely or the presence of a novel bug in your workflow.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If we look at the Step Functions workflow DSL for this, you can see how these transitions look in code:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;"States": {
    "ValidateEmail": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:1234:function:myfunc",
      "Catch": [
        {
          "ErrorEquals": ["ValidationError", "SerializationError"],
          "Next": "ValidationError",
          "ResultPath": "$.exception_details"
        },
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "CriticalError",
          "ResultPath": "$.exception_details"
        }
      ],
      "Next": "AddUser"
    },
...&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We create a custom list of errors we consider “validation errors”, catch those and transition to the &lt;code&gt;ValidationError&lt;/code&gt; state, and then catch any other errors and head to &lt;code&gt;CriticalError&lt;/code&gt;. 
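&lt;/p&gt;&lt;p&gt;On the monitoring side, a small sketch of how this split maps onto alert routing (state names hypothetical, mirroring the 2xx/4xx/5xx analogy above):&lt;/p&gt;

```python
# Hypothetical mapping from terminal state names to health signals,
# mirroring the 2xx/4xx/5xx split described above.
KNOWN_FAILURES = {"ValidationError"}
CRITICAL_FAILURES = {"CriticalError"}

def classify_terminal_state(name):
    """Bucket a terminal state like an HTTP status class."""
    if name in CRITICAL_FAILURES:
        return "5xx"   # page someone: dependency down or a novel bug
    if name in KNOWN_FAILURES:
        return "4xx"   # expected sometimes; alert on elevated rates
    return "2xx"       # clean execution

print(classify_terminal_state("Success"))          # 2xx
print(classify_terminal_state("ValidationError"))  # 4xx
print(classify_terminal_state("CriticalError"))    # 5xx
```

&lt;p&gt;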
If we want, a separate &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-errors.html#amazon-states-language-retrying-after-error" rel="noopener"&gt;&lt;code&gt;Retry&lt;/code&gt; clause&lt;/a&gt; can let us retry on some errors while immediately transitioning for others. Finally, the &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-input-output-processing.html" rel="noopener"&gt;&lt;code&gt;ResultPath&lt;/code&gt;&lt;/a&gt; ensures that we’re capturing the exception context and handing it along to the next state without clobbering the rest of the input.&lt;/p&gt;&lt;p&gt;This pattern of activities gives you crystal clear signals to focus alerting on (too many workflows ending in either kind of failure mode) so you can notice production problems promptly and diagnose them accurately. As a bonus, the unknown failure mode state acts as a very fast feedback loop for any kinds of behaviors you’ve forgotten to build into your main workflow. Simply notice a new type of unknown failure mode, triage how to either eliminate it or make it a known failure mode, rinse and repeat.&lt;/p&gt;&lt;h3 id="leaving-the-world-a-little-cleaner-than-you-found-it"&gt;Leaving the world a little cleaner than you found it&lt;/h3&gt;&lt;p&gt;There’s just one hitch with this plan: some failure modes require cleanup. In a distributed system, we can’t rely on enormous database transactions to keep systems consistent…we have to implement that ourselves.&lt;/p&gt;&lt;p&gt;The most general version of this is often referred to as the Saga Pattern. 
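&lt;/p&gt;&lt;p&gt;Stripped of any AWS specifics, the essence of that pattern can be sketched in a few lines: run side effects in order, and on failure, run the compensating undos for whatever completed, in reverse (the step names here are hypothetical):&lt;/p&gt;

```python
# Hypothetical side effects paired with compensating "undo" actions.
# On failure, completed steps are undone in reverse order.
def run_with_compensation(steps, fail_at=None):
    """steps: list of (name, do, undo) triples. Returns the action log."""
    log, done = [], []
    try:
        for name, do, undo in steps:
            if name == fail_at:  # simulate a step failing
                raise RuntimeError(f"{name} failed")
            do(log)
            done.append((name, undo))
    except RuntimeError:
        for name, undo in reversed(done):
            undo(log)  # compensate each completed step
    return log

steps = [
    ("ReserveUsername", lambda l: l.append("reserve"), lambda l: l.append("release")),
    ("AddUser",         lambda l: l.append("add"),     lambda l: l.append("remove")),
]
print(run_with_compensation(steps, fail_at="AddUser"))  # ['reserve', 'release']
```

&lt;p&gt;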
That pattern holds that for every side effect, you need a compensating “undo” side effect. You can see how this general case might look in &lt;a href="https://read.acloud.guru/how-the-saga-pattern-manages-failures-with-aws-lambda-and-step-functions-bc8f7129f900" rel="noopener"&gt;Yan Cui’s Medium post about implementing Saga Patterns with Step Functions&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;But in most cases you can simplify this approach and build a relatively small set of steps that undo the most critical side effects your workflow performs.&lt;/p&gt;&lt;p&gt;Whether you go for the minimal or large-scale version of these undo steps, the goal is for your workflow to clean up after itself. Ideally, when you reach any terminal state (even “unknown failure mode”!), your data is in a consistent state and doesn’t require additional cleanup.&lt;/p&gt;&lt;p&gt;This consistency allows your oncall engineers to focus on understanding causes and mitigating damage, even in the worst outages, secure in the knowledge that the workflows have automatically cleaned up any bad data they created.&lt;/p&gt;&lt;hr&gt;&lt;h3 id="iterating-to-stability-in-production"&gt;Iterating to stability in production&lt;/h3&gt;&lt;p&gt;So we’ve built a production system that executes our workflow, includes retries and timeouts where we need them, has well-defined failure modes, and gives us clear feedback whenever we need to tweak our design because it isn’t robust to all possible production realities.&lt;/p&gt;&lt;p&gt;Using Step Functions doesn’t magically make your code stable in production, but it does allow you to easily compose your business logic with a platform that handles the resiliency and auditability for you.&lt;/p&gt;&lt;p&gt;&lt;em&gt;If you found this interesting, follow Scott Triglia on Twitter (&lt;/em&gt;&lt;a href="https://twitter.com/scott_triglia" rel="nofollow noopener"&gt;https://twitter.com/scott_triglia&lt;/a&gt;&lt;em&gt;) or his blog at &lt;/em&gt;&lt;a 
href="http://www.locallyoptimal.com/" rel="nofollow noopener"&gt;http://www.locallyoptimal.com/&lt;/a&gt;.&lt;/p&gt;</description></item></channel></rss>