• Stars
    star
    1,286
  • Rank 36,580 (Top 0.8 %)
  • Language
    Ruby
  • License
    MIT License
  • Created about 10 years ago
  • Updated over 1 year ago

Repository Details

🐒 Resiliency toolkit for Ruby for failing fast

Semian

Semian is a library for controlling access to slow or unresponsive external services to avoid cascading failures.

When services are down they typically fail fast with errors like ECONNREFUSED and ECONNRESET which can be rescued in code. However, slow resources fail slowly. The thread serving the request blocks until it hits the timeout for the slow resource. During that time, the thread is doing nothing useful and thus the slow resource has caused a cascading failure by occupying workers and therefore losing capacity. Semian is a library for failing fast in these situations, allowing you to handle errors gracefully. Semian does this by intercepting resource access through heuristic patterns inspired by Hystrix and Release It:

  • Circuit breaker. A pattern for limiting the number of requests to a dependency that is having issues.
  • Bulkheading. Controlling concurrent access to a single resource. Access is coordinated server-wide with SysV semaphores.

Resource drivers are monkey-patched to be aware of Semian; the patched drivers are called Semian Adapters. Thus, every time resource access is requested, Semian is first queried for the status of the resource. If Semian, through the patterns above, deems the resource to be unavailable, it will raise an exception. The ultimate outcome of Semian is always an exception that can then be rescued for a graceful fallback. Instead of waiting for the timeout, Semian raises straight away.

If you are already rescuing exceptions for failing resources and timeouts, Semian is mostly a drop-in library with a little configuration that will make your code more resilient to slow resource access. But, do you even need Semian?

For an overview of building resilient Ruby applications, start by reading the Shopify blog post on Toxiproxy and Semian. For more in-depth information on Semian, see Understanding Semian. Semian is an extraction from Shopify, where it has been running successfully in production since October 2014.

The other component to your Ruby resiliency kit is Toxiproxy to write automated resiliency tests.

Usage

Install by adding the gem to your Gemfile and require the adapters you need:

gem 'semian', require: %w(semian semian/mysql2 semian/redis)

We recommend this pattern of requiring adapters directly from the Gemfile. This ensures Semian adapters are loaded as early as possible and also protects your application during boot. Please see the adapter configuration section on how to configure adapters.

Adapters

Semian works by intercepting resource access. Every time access is requested, Semian is queried, and it will raise an exception if the resource is unavailable according to the circuit breaker or bulkheads. This is done by monkey-patching the resource driver. The exception raised by the driver always inherits from the Base exception class of the driver, meaning you can always simply rescue the base class and catch both Semian and driver errors in the same rescue for fallbacks.

The following adapters are in Semian and tested heavily in production, the version is the version of the public gem with the same name:

Creating Adapters

To create a Semian adapter you must implement the following:

  1. include Semian::Adapter. Use the helpers to wrap the resource. This takes care of situations such as monitoring, nested resources, unsupported platforms, creating the Semian resource if it doesn't already exist and so on.
  2. #semian_identifier. This is responsible for returning a symbol that represents every unique resource, for example redis_master or mysql_shard_1. This is usually assembled from a name attribute on the Semian configuration hash, but could also be <host>:<port>.
  3. connect. The name of this method varies. You must override the driver's connect method with one that wraps the connect call with Semian::Resource#acquire. You should do this at the lowest possible level.
  4. query. Same as connect but for queries on the resource.
  5. Define exceptions ResourceBusyError and CircuitOpenError. These are raised when the request was rejected early because the resource is out of tickets or because the circuit breaker is open (see Understanding Semian). They should inherit from the base exception class of the raw driver, for example Mysql2::Error or Redis::BaseConnectionError for the MySQL and Redis drivers. This makes it easy to rescue and handle them gracefully in application code by rescuing the base class.

The best resource is looking at the already implemented adapters.
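
Step 5 can be sketched in plain Ruby without Semian loaded. FakeDriver and fetch_with_fallback are hypothetical names used only for illustration; the only thing taken from the text above is that the two Semian errors inherit from the driver's base exception class.

```ruby
# Hypothetical driver used only for illustration; not part of Semian.
module FakeDriver
  # Base exception class of the raw driver.
  class Error < StandardError; end

  # Adapter-defined errors inherit from the driver's base class.
  class ResourceBusyError < Error; end # rejected early: out of tickets
  class CircuitOpenError < Error; end  # rejected early: circuit is open
end

# Application code rescues only the base class and still catches both
# ordinary driver errors and the early rejections.
def fetch_with_fallback
  yield
rescue FakeDriver::Error
  :fallback
end

fetch_with_fallback { raise FakeDriver::CircuitOpenError }  # returns :fallback
fetch_with_fallback { raise FakeDriver::ResourceBusyError } # returns :fallback
fetch_with_fallback { :data }                               # returns :data
```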

Configuration

There are some global configuration options that can be set for Semian:

# Maximum size of the LRU cache (default: 500)
# Note: Setting this to 0 enables aggressive garbage collection.
Semian.maximum_lru_size = 0

# Minimum time in seconds a resource should be resident in the LRU cache (default: 300s)
Semian.minimum_lru_time = 60

Note: minimum_lru_time is a stronger guarantee than maximum_lru_size. That is, if a resource has been updated more recently than minimum_lru_time it will not be garbage collected, even if it would cause the LRU cache to grow larger than maximum_lru_size.
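
The interaction between the two settings can be illustrated with a small stand-alone cache. This is only a sketch of the rule described in the note, not Semian's own LRU implementation; the SketchLRU class and its max_size and min_time parameters are made up for the example.

```ruby
# Illustrative cache, not Semian's implementation. Entries are kept in
# least-recently-updated-first order together with their update time.
class SketchLRU
  def initialize(max_size:, min_time:)
    @max_size = max_size
    @min_time = min_time
    @entries = {} # Ruby hashes preserve insertion order
  end

  # `now` is injectable so the example does not need to sleep.
  def set(key, value, now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    @entries.delete(key) # re-inserting moves the key to the end
    @entries[key] = [value, now]
    gc(now)
  end

  def keys
    @entries.keys
  end

  private

  # Evict the oldest entries while over max_size, but never evict an entry
  # that was updated less than min_time seconds ago.
  def gc(now)
    while @entries.size > @max_size
      key, (_value, updated_at) = @entries.first
      break if now - updated_at < @min_time # too recent to evict

      @entries.delete(key)
    end
  end
end

cache = SketchLRU.new(max_size: 0, min_time: 60)
cache.set(:a, 1, now: 0)
cache.set(:b, 2, now: 10)
cache.keys # [:a, :b], both are younger than 60s, so the cache exceeds max_size
cache.set(:c, 3, now: 65)
cache.keys # [:b, :c], :a is now older than 60s and has been evicted
```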

When instantiating a resource, it now needs to be configured for Semian. This is done by passing semian as an argument when initializing the client. Examples for the built-in adapters:

# MySQL2 client
# In Rails this means having a Semian key in database.yml for each db.
client = Mysql2::Client.new(host: "localhost", username: "root", semian: {
  name: "master",
  tickets: 8, # See the Understanding Semian section on picking these values
  success_threshold: 2,
  error_threshold: 3,
  error_timeout: 10
})

# Redis client
client = Redis.new(semian: {
  name: "inventory",
  tickets: 4,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})

Thread Safety

Semian's circuit breaker implementation is thread-safe by default as of v0.7.0. If you'd like to disable it for performance reasons, pass thread_safety_disabled: true to the resource options.

Bulkheads should be disabled (pass bulkhead: false) in a threaded environment (e.g. Puma or Sidekiq), but can safely be enabled in non-threaded environments (e.g. Resque and Unicorn). As described in this document, circuit breakers alone should be adequate in most environments with reasonably low timeouts.

Internally, Semian uses SEM_UNDO for several SysV semaphore operations:

  • Acquire
  • Worker registration
  • Semaphore metadata state lock

The intention behind SEM_UNDO is that a semaphore operation is automatically undone when the process exits. This is true even if the process exits abnormally (it crashes, receives a SIGKILL, etc.), because it is handled by the operating system and not the process itself.

If, however, a thread performs a semop, the SEM_UNDO is on its parent process. This means that the operation will not be undone when the thread exits. This can result in the following unfavorable behavior when using threads:

  • Threads acquire a resource, but are killed and the resource ticket is never released. For a process, the ticket would be released by SEM_UNDO, but since it's a thread there is the potential for ticket starvation. This can result in deadlock on the resource.
  • Threads that register workers on a resource but are killed and never unregistered. For a process, the worker count would be automatically decremented by SEM_UNDO, but for threads the worker count will continue to increment, only being undone when the parent process dies. This can cause the number of tickets to dramatically exceed the quota.
  • If a thread acquires the semaphore metadata lock and dies before releasing it, semian will deadlock on anything attempting to acquire the metadata lock until the thread's parent process exits. This can prevent the ticket count from being updated.

Moreover, a strategy that utilizes SEM_UNDO is not compatible with a strategy that attempts to manage the semaphore tickets manually. In order to support threads, operations that currently use SEM_UNDO would need to use no semaphore flag, and the calling process would be responsible for ensuring that threads are appropriately cleaned up. It is still possible to implement this, but it would likely require an in-memory semaphore managed by the parent process of the threads. PRs welcome for this functionality.

Quotas

You may now set quotas per worker:

client = Redis.new(semian: {
  name: "inventory",
  quota: 0.51,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})

Per the above example, you no longer need to care about the number of tickets. Rather, the tickets shall be computed as a proportion of the number of active workers.

In this case, we'd allow roughly half (51%) of the workers on a particular host to connect to this Redis resource. So long as the workers are in their own process, they will automatically be registered. The quota will set the bulkhead threshold based on the number of registered workers, whenever a new worker registers.

This is ideal for environments with non-uniform worker distribution, and to eliminate the need to manually calculate and adjust ticket counts.

Note:

  • You must pass exactly one of the options tickets or quota.
  • The number of available tickets is the ceiling of the quota multiplied by the number of workers.
  • So, with one worker, there will always be a minimum of 1 ticket.
  • Workers in different processes will automatically unregister when the process exits.
  • If you have a small number of workers (exactly 2) it's possible that the bulkhead will be too sensitive using quotas.
  • If you use a forking web server (like unicorn) you should call Semian.unregister_all_resources before/after forking.
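
The ticket arithmetic from the notes above can be checked with a one-line helper. tickets_for is a hypothetical name for this example and is not part of Semian's API; it only applies the stated rule (ceiling of quota times registered workers).

```ruby
# Tickets derived from a quota: the ceiling of quota * registered workers.
def tickets_for(quota, workers)
  (quota * workers).ceil
end

tickets_for(0.51, 1)  # => 1 (a single worker always gets one ticket)
tickets_for(0.51, 2)  # => 2
tickets_for(0.51, 10) # => 6
```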

Net::HTTP

For the Net::HTTP specific Semian adapter, since many external libraries may create HTTP connections on the user's behalf, the parameters are instead provided by associating callback functions with Semian::NetHTTP, perhaps in an initialization file.

Naming and Options

To give Semian parameters, assign a proc to Semian::NetHTTP.semian_configuration that takes two parameters, host and port (for example 127.0.0.1,443 or github_com,80), and returns a Hash with configuration parameters as follows. The proc is used as a callback to initialize the configuration options, similar to other adapters.

SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }
Semian::NetHTTP.semian_configuration = proc do |host, port|
  # Let's make it only active for github.com
  if host == "github.com" && port.to_i == 80
    SEMIAN_PARAMETERS.merge(name: "github.com_80")
  else
    nil
  end
end

# Called from within API:
# semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
# semian_identifier = "nethttp_#{semian_options[:name]}"

The name should be carefully chosen since it identifies the resource being protected. The semian_options passed apply to that resource. Semian creates the semian_identifier from the name to look up and store changes in the circuit breaker and bulkhead states and associate successes, failures, errors with the protected resource.

We require that semian_configuration be set only once over the lifetime of the library.

If you need to return different values for the same pair of host/port value, you must include the dynamic: true option. Returning different values for the same host/port values without setting the dynamic option can lead to undesirable behavior.

A common example for dynamic options is the use of a thread local variable, such as ActiveSupport::CurrentAttributes, for requests to a service acting as a proxy.

SEMIAN_PARAMETERS = {
  # ...
  dynamic: true,
}

class CurrentSemianSubResource < ActiveSupport::CurrentAttributes
  attribute :name
end

Semian::NetHTTP.semian_configuration = proc do |host, port|
  name = "#{host}_#{port}"
  if (sub_resource_name = CurrentSemianSubResource.name)
    # Append the sub-resource name, not the base name a second time.
    name << "_#{sub_resource_name}"
  end
  SEMIAN_PARAMETERS.merge(name: name)
end

# Two requests to example.com can use two different semian resources,
# as long as `CurrentSemianSubResource.name` is set accordingly:
# CurrentSemianSubResource.set(name: "sub_resource_1") { Net::HTTP.get_response(URI("http://example.com")) }
# and:
# CurrentSemianSubResource.set(name: "sub_resource_2") { Net::HTTP.get_response(URI("http://example.com")) }

For most purposes, "#{host}_#{port}" is a good default name. Custom name formats can be useful for grouping related subdomains as one resource, so that they all contribute to the same circuit breaker and bulkhead state and fail together.

A return value of nil for semian_configuration means Semian is disabled for that HTTP endpoint. This works well since the result of a failed Hash lookup is nil also. This behavior lets the adapter default to whitelisting, although the behavior can be changed to blacklisting or even be completely disabled by varying the use of returning nil in the assigned closure.
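
A minimal sketch of the Hash-lookup approach mentioned above follows. The constant names SEMIAN_BY_ENDPOINT and ENDPOINT_CONFIGURATION and the listed endpoint are assumptions made for the example; the proc is plain Ruby and is shown without Semian loaded.

```ruby
# Hypothetical per-endpoint settings keyed by "host_port".
SEMIAN_BY_ENDPOINT = {
  "github.com_80" => {
    tickets: 1,
    success_threshold: 1,
    error_threshold: 3,
    error_timeout: 10,
  },
}.freeze

# A failed Hash lookup returns nil, which disables Semian for that endpoint.
ENDPOINT_CONFIGURATION = proc do |host, port|
  name = "#{host}_#{port}"
  SEMIAN_BY_ENDPOINT[name]&.merge(name: name)
end

# With the semian gem loaded, the proc would be assigned as:
# Semian::NetHTTP.semian_configuration = ENDPOINT_CONFIGURATION

ENDPOINT_CONFIGURATION.call("github.com", 80)   # Hash including name: "github.com_80"
ENDPOINT_CONFIGURATION.call("example.com", 443) # nil, so Semian stays disabled
```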

Additional Exceptions

Since we envision this particular adapter can be used in combination with many external libraries, that can raise additional exceptions, we added functionality to expand the Exceptions that can be tracked as part of Semian's circuit breaker. This may be necessary for libraries that introduce new exceptions or re-raise them. Add exceptions and reset to the default list using the following:

# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]

Semian::NetHTTP.reset_exceptions
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)

Mark Unsuccessful Responses as Failures

Unsuccessful responses (e.g. 5xx responses) do not raise exceptions, and as such are not marked as failures by default. The open_circuit_server_errors Semian configuration parameter may be set to enable this behaviour, to mark unsuccessful responses as failures as seen below:

SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10,
                      open_circuit_server_errors: true }

Active Record

Semian supports the trilogy Active Record adapter. It can be configured in database.yml:

semian: &semian
  success_threshold: 2
  error_threshold: 3
  error_timeout: 4
  half_open_resource_timeout: 1
  bulkhead: false # Disable bulkhead for Puma: https://github.com/shopify/semian#thread-safety
  name: semian_identifier_name

default: &default
  adapter: trilogy
  username: root
  password:
  host: localhost
  read_timeout: 2
  write_timeout: 1
  connect_timeout: 1
  semian:
    <<: *semian

Example cases for activerecord-trilogy-adapter can be run using BUNDLE_GEMFILE=gemfiles/activerecord_trilogy_adapter.gemfile bundle exec rake examples:activerecord_trilogy_adapter

Understanding Semian

Semian is a library with heuristics for failing fast. This section will explain in depth how Semian works and which situations it's applicable for. First we explain the category of problems Semian is meant to solve. Then we dive into how Semian works to solve these problems.

Do I need Semian?

Semian is not a trivial library to understand, introduces complexity and thus should be introduced with care. Remember, all Semian does is raise exceptions based on heuristics. It is paramount that you understand Semian before including it in production as you may otherwise be surprised by its behaviour.

Applications that benefit from Semian are those working on eliminating SPOFs (Single Points of Failure), and specifically are running into a wall regarding slow resources. But it is by no means a magic wand that solves all your latency problems by being added to your Gemfile. This section describes the types of problems Semian solves.

If your application is multithreaded or evented (i.e. not running on Resque or Unicorn), these problems are not as pressing. You can still get use out of Semian, however.

Real World Example

This is better illustrated with a real world example from Shopify. When you are browsing a store while signed in, Shopify stores your session in Redis. If Redis becomes unavailable, the driver will start throwing exceptions. We rescue these exceptions and simply disable all customer sign in functionality on the store until Redis is back online.

This is great if querying the resource fails instantly, because it means we fail in just a single roundtrip of ~1ms. But if the resource is unresponsive or slow, this can take as long as our timeout, which is easily 200ms. This means every request, even if it does rescue the exception, now takes an extra 200ms. Because every request takes that long, our capacity is also significantly degraded. These problems are explained in depth in the next two sections.

With Semian, the slow resource would fail instantly (after a small amount of convergence time) preventing your response time from spiking and not decreasing capacity of the cluster.

If this sounds familiar to you, Semian is what you need to be resilient to latency. You may not need the graceful fallback depending on your application, in which case it will just result in an error (e.g. a HTTP 500) faster.

We will now examine the two problems in detail.

In-depth analysis of real world example

If a single resource is slow, every single request is going to suffer. We saw this in the example before. Let's illustrate this more clearly in the following Rails example where the user session is stored in Redis:

def index
  @user = fetch_user
  @posts = Post.all
end

private
def fetch_user
  user = User.find(session[:user_id])
rescue Redis::CannotConnectError
  nil
end

Our code is resilient to a failure of the session layer: it doesn't return HTTP 500 if the session store is unavailable (this can be tested with Toxiproxy). If the User and Post data store is unavailable, the server will send back HTTP 500. We accept that, because it's our primary data store. This could be prevented with a caching tier or something else out of scope.

This code has two flaws however:

  1. What happens if the session storage is consistently slow? I.e. the majority of requests take, say, more than half the timeout time (but it should only take ~1ms)?
  2. What happens if the session storage is unavailable and is not responding at all? I.e. we hit timeouts on every request.

These two problems in turn have two related problems associated with them: response time and capacity.

Response time

Requests that attempt to access a down session storage are all gracefully handled: @user will simply be nil, which the code handles. There is still a major impact on users, however, as every request to the storage has to time out. This causes the average response time of all pages that access it to go up by however long your timeout is. The added latency is proportional to your worst-case timeout, as well as the number of attempts to hit the storage on each page. This is the problem Semian solves by using heuristics to fail these requests early, which results in a much better user experience during downtime.

Capacity loss

When your single-threaded worker is waiting for a resource to return, it's effectively doing nothing when it could be serving fast requests. To use the example from before, perhaps some actions do not access the session storage at all. These requests will pile up behind the now slow requests that are trying to access that layer, because they're failing slowly. Essentially, your capacity degrades significantly because your average response time goes up (as explained in the previous section). Capacity loss simply follows from an increase in response time. The higher your timeout and the slower your resource, the more capacity you lose.

Timeouts aren't enough

It should be clear by now that timeouts aren't enough. Consistent timeouts will increase the average response time, which causes a bad user experience, and ultimately compromise the performance of the entire system. Even if the timeout is as low as ~250ms (just enough to allow a single TCP retransmit) there's a large loss of capacity and for many applications a 100-300% increase in average response time. This is the problem Semian solves by failing fast.
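
As a rough worked example of the claim above, assume single-threaded workers, a normal average response time of 100ms, and one call per request that hits a 250ms timeout. These numbers and the timeout_impact helper are assumptions chosen for illustration, not measurements.

```ruby
# Estimates the effect of one timed-out call per request on a worker that
# handles one request at a time.
def timeout_impact(base_ms, timeout_ms)
  degraded_ms = base_ms + timeout_ms
  {
    # Relative growth of the average response time.
    response_increase_pct: (timeout_ms / base_ms * 100).round,
    # A worker serves 1/response_time requests per unit of time, so the
    # remaining capacity is the ratio of old to new response time.
    remaining_capacity_pct: (base_ms / degraded_ms * 100).round(1),
  }
end

timeout_impact(100.0, 250.0)
# => { response_increase_pct: 250, remaining_capacity_pct: 28.6 }
```

Under these assumptions a 250ms timeout more than triples the response time and leaves less than a third of the original capacity.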

How does Semian work?

Semian consists of two parts: Circuit Breaker and Bulkheading. To understand Semian, and especially how to configure it, we must understand these patterns and their implementation.

Disable Semian via environment variable SEMIAN_DISABLED=1.

Circuit Breaker

The circuit breaker pattern is based on a simple observation - if we hit a timeout or any other error for a given service one or more times, we’re likely to hit it again for some amount of time. Instead of hitting the timeout repeatedly, we can mark the resource as dead for some amount of time during which we raise an exception instantly on any call to it. This is called the circuit breaker pattern.

When we perform a Remote Procedure Call (RPC), it will first check the circuit. If the circuit is rejecting requests because of too many failures reported by the driver, it will throw an exception immediately. Otherwise the circuit will call the driver. If the driver fails to get data back from the data store, it will notify the circuit. The circuit will count the error so that if too many errors have happened recently, it will start rejecting requests immediately instead of waiting for the driver to time out. The exception will then be raised back to the original caller. If the driver’s request was successful, it will return the data back to the calling method and notify the circuit that it made a successful call.

The state of the circuit breaker is local to the worker and is not shared across all workers on a server.

Circuit Breaker Configuration

There are seven configuration parameters for circuit breakers in Semian:

  • circuit_breaker. Enable or Disable Circuit Breaker. Defaults to true if not set.
  • error_threshold. The number of errors a worker encounters within error_threshold_timeout seconds before opening the circuit, that is, before it starts rejecting requests instantly.
  • error_threshold_timeout. The time window in seconds within which error_threshold errors must occur to open the circuit. Defaults to error_timeout seconds if not set.
  • error_timeout. The amount of time in seconds until trying to query the resource again.
  • error_threshold_timeout_enabled. If set to false it will disable the time window for evicting old exceptions. error_timeout is still used and will reset the circuit. Defaults to true if not set.
  • success_threshold. The amount of successes on the circuit until closing it again, that is to start accepting all requests to the circuit.
  • half_open_resource_timeout. Timeout for the resource in seconds when the circuit is half-open (supported for MySQL, Net::HTTP and Redis).

It is possible to disable Circuit Breaker with environment variable SEMIAN_CIRCUIT_BREAKER_DISABLED=1.

For more information about configuring these parameters, please read this post.
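
To make the interplay of error_threshold, error_timeout and success_threshold concrete, here is a simplified, single-threaded circuit breaker. It is a sketch of the pattern only, not Semian's implementation: the SketchCircuitBreaker class is hypothetical, the error window is fixed to error_timeout, and half_open_resource_timeout is ignored.

```ruby
# Simplified circuit breaker for illustration; not Semian's implementation.
class SketchCircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(error_threshold:, error_timeout:, success_threshold:)
    @error_threshold = error_threshold
    @error_timeout = error_timeout
    @success_threshold = success_threshold
    @state = :closed
    @errors = []   # timestamps of recent errors
    @successes = 0 # successes seen while half-open
  end

  # `now` is injectable so the behaviour can be exercised without sleeping.
  def acquire(now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    # Once error_timeout has passed since the last error, let requests
    # through again to probe the resource.
    @state = :half_open if @state == :open && now - @errors.last >= @error_timeout
    raise OpenError, "circuit is open" if @state == :open

    begin
      result = yield
    rescue StandardError
      mark_failed(now)
      raise
    end
    mark_success
    result
  end

  private

  def mark_failed(now)
    # Only errors inside the time window count towards the threshold.
    @errors.reject! { |at| now - at > @error_timeout }
    @errors << now
    @successes = 0
    @state = :open if @state == :half_open || @errors.size >= @error_threshold
  end

  def mark_success
    return unless @state == :half_open

    @successes += 1
    return if @successes < @success_threshold

    @state = :closed
    @errors.clear
  end
end
```

With error_threshold: 3, error_timeout: 10 and success_threshold: 2, three failures within ten seconds open the circuit, calls are then rejected with OpenError without touching the resource, and ten seconds after the last error two successful calls close it again.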

Bulkheading

For some applications, circuit breakers are not enough. This is best illustrated with an example. Imagine if the timeout for our data store isn't as low as 200ms, but actually 10 seconds. For example, you might have a relational data store where for some customers, 10s queries are (unfortunately) legitimate. Reducing the time of worst case queries requires a lot of effort. Dropping the query immediately could potentially leave some customers unable to access certain functionality. High timeouts are especially critical in a non-threaded environment where blocking IO means a worker is effectively doing nothing.

In this case, circuit breakers aren't sufficient. Even assuming the circuit is shared across all processes on a server, it will still take at least 10s before the circuit is open. In that time every worker is blocked (see also the "Defense line" section for an in-depth explanation of the co-operation between circuit breakers and bulkheads). This means we're at reduced capacity for at least 20s: the circuit opens at the 10s mark, once a couple of workers have hit a timeout, and the last 10s timeouts belong to requests that started just before it opened. We thought of a number of potential solutions to this problem (stricter timeouts, grouping timeouts by section of our application, timeouts per statement), but they all still revolved around timeouts, and those are extremely hard to get right.

Instead of thinking about timeouts, we took inspiration from Hystrix by Netflix and the book Release It (the resiliency bible), and look at our services as connection pools. On a server with W workers, only a certain number of them are expected to be talking to a single data store at once. Let's say we've determined from our monitoring that there’s a 10% chance they’re talking to mysql_shard_0 at any given point in time under normal traffic. The probability that five workers are talking to it at the same time is 0.001%. If we only allow five workers to talk to a resource at any given point in time, and accept the 0.001% false positive rate—we can fail the sixth worker attempting to check out a connection instantly. This means that while the five workers are waiting for a timeout, all the other W-5 workers on the node will instantly be failing on checking out the connection and opening their circuits. Our capacity is only degraded by a relatively small amount.
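
The 0.001% figure follows from multiplying the per-worker probability five times, which assumes the workers behave independently. The helper name below is made up for the example.

```ruby
# Probability that `count` given workers are all talking to the same data
# store at once, assuming each does so independently with probability `p`.
def simultaneous_probability(p, count)
  p**count
end

p_five = simultaneous_probability(0.10, 5)
puts format("%.3f%%", p_five * 100) # prints "0.001%"
```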

We call this limitation primitive "tickets". In this case, the resource access is limited to 5 tickets (see Configuration). The timeout value specifies the maximum amount of time to block if no ticket is available.

How do we limit the access to a resource for all workers on a server when the workers do not directly share memory? This is implemented with SysV semaphores to provide server-wide access control.
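
The ticket idea itself can be sketched with an in-process counter. This stand-in only limits callers inside one Ruby process, whereas Semian uses SysV semaphores precisely so that the limit spans all worker processes on a server. SketchBulkhead and BusyError are hypothetical names, and the sketch always behaves as if the timeout were 0.

```ruby
# In-process stand-in for the ticket mechanism; not Semian's implementation.
class SketchBulkhead
  class BusyError < StandardError; end

  def initialize(tickets:)
    @available = tickets
    @lock = Mutex.new
  end

  # Runs the block while holding a ticket. Fails instantly when no ticket
  # is free instead of waiting for one.
  def acquire
    @lock.synchronize do
      raise BusyError, "no tickets available" if @available.zero?

      @available -= 1
    end
    begin
      yield
    ensure
      # Always hand the ticket back, even if the block raised.
      @lock.synchronize { @available += 1 }
    end
  end
end

bulkhead = SketchBulkhead.new(tickets: 1)
bulkhead.acquire do
  # The only ticket is taken, so a second concurrent caller is rejected.
  begin
    bulkhead.acquire { :unreachable }
  rescue SketchBulkhead::BusyError
    :rejected_immediately
  end
end
```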

Bulkhead Configuration

There are three configuration values. It's not easy to choose good values and we're still experimenting with ways to figure out optimal ticket numbers. Generally something below half the number of workers on the server for endpoints that are queried frequently has worked well for us.

  • bulkhead. Enable or Disable Bulkhead. Defaults to true if not set.
  • tickets. Number of workers that can concurrently access a resource.
  • timeout. Time to wait in seconds to acquire a ticket if there are no tickets left. We recommend this to be 0 unless you have very few workers running (i.e. fewer than ~5).

It is possible to disable Bulkhead with environment variable SEMIAN_BULKHEAD_DISABLED=1.

Note that there are system-wide limitations on how many tickets can be allocated on a system. cat /proc/sys/kernel/sem will tell you.

System-wide limit on the number of semaphore sets. On Linux systems before version 3.19, the default value for this limit was 128. Since Linux 3.19, the default value is 32,000. On Linux, this limit can be read and modified via the fourth field of /proc/sys/kernel/sem.

Bulkhead debugging on linux

Note: It is often helpful to examine the actual IPC resources on the system. Semian provides an easy way to get the semaphore key:

irb> require 'semian'
irb> puts Semian::Resource.new(:your_resource_name, tickets: 42).key # do this from a dev machine
"0x48af51ea"

This key can then be used to easily inspect the semaphore on any host machine:

ipcs -si $(ipcs -s | grep 0x48af51ea | awk '{print $2}')

Which should output something like:

Semaphore Array semid=5570729
uid=8192         gid=8192        cuid=8192       cgid=8192
mode=0660, access_perms=0660
nsems = 4
otime = Thu Mar 30 15:06:16 2017
ctime = Mon Mar 13 20:25:36 2017
semnum     value      ncount     zcount     pid
0          1          0          0          48
1          42         0          0          48
2          42         0          0          27
3          31         0          0          48

In the above example, we can see each of the semaphores. Looking at the enum code in ext/semian/sysv_semaphores.h we can see that:

  • 0: is the semian meta lock (mutex) protecting updates to the other resources. It's currently free
  • 1: is the number of available tickets - currently no tickets are in use because it's the same as 2
  • 2: is the configured (maximum) number of tickets
  • 3: is the number of registered workers (processes) that would be considered if using the quota strategy.

Defense line

The finished defense line for resource access with circuit breakers and bulkheads then looks like this:

The RPC first checks the circuit; if the circuit is open it will raise the exception straight away which will trigger the fallback (the default fallback is a 500 response). Otherwise, it will try Semian which fails instantly if too many workers are already querying the resource. Finally the driver will query the data store. If the data store succeeds, the driver will return the data back to the RPC. If the data store is slow or fails, this is our last line of defense against a misbehaving resource. The driver will raise an exception after trying to connect with a timeout or after an immediate failure. These driver actions will affect the circuit and Semian, which can make future calls fail faster.

A useful way to think about the co-operation between bulkheads and circuit breakers is through visualizing a failure scenario graphing capacity as a function of time. If an incident strikes that makes the server unresponsive with a 20s timeout on the client and you only have circuit breakers enabled--you will lose capacity until all workers have tripped their circuit breakers. The slope of this line will depend on the amount of traffic to the now unavailable service. If the slope is steep (i.e. high traffic), you'll lose capacity quicker. The higher the client driver timeout, the longer you'll lose capacity for. In the example below we have the circuit breakers configured to open after 3 failures:

[Figure: resiliency with circuit breakers]

If we imagine the same scenario but with only bulkheads, configured to have tickets for 50% of workers at any given time, we'll see the following flat-lining scenario:

[Figure: resiliency with bulkheads]

Circuit breakers have the nice property of regaining 100% capacity. Bulkheads have the desirable property of guaranteeing a minimum capacity. If we add the two graphs together, marrying bulkheads and circuit breakers, we have a plummy outcome:

[Figure: resiliency with circuit breakers and bulkheads]

This means that if the slope or client timeout is sufficiently low, bulkheads will provide little value and are likely not necessary.

Failing gracefully

Ok, great, we've got a way to fail fast with slow resources, how does that make my application more resilient?

Failing fast is only half the battle. It's up to you what you do with these errors. In the session example we handle it gracefully by signing people out and disabling all session-related functionality until the data store is back online. However, even if you don't rescue the exception, simply sending HTTP 500 back to the client faster will help with capacity loss.

Exceptions inherit from base class

It's important to understand that the exceptions raised by Semian Adapters inherit from the base class of the driver itself, meaning that if you do something like:

def posts
  Post.all
rescue Mysql2::Error
  []
end

Exceptions raised by Semian's MySQL2 adapter will also get caught.

Patterns

We do not recommend mindlessly sprinkling rescues all over the place. What you should do instead is write decorators around secondary data stores (e.g. sessions) that provide resiliency for free. For example, if we stored the tags associated with products in a secondary data store, it could look something like this:

# Resilient decorator for storing a Set in Redis.
class RedisSet
  def initialize(key)
    @key = key
  end

  def get
    redis.smembers(@key)
  rescue Redis::BaseConnectionError
    []
  end

  private

  def redis
    @redis ||= Redis.new
  end
end

class Product
  # This will simply return an empty array in the case of a Redis outage.
  def tags
    tags_set.get
  end

  private

  def tags_set
    @tags_set ||= RedisSet.new("product:tags:#{self.id}")
  end
end

These decorators can be resiliency tested with Toxiproxy. You can provide fallbacks around your primary data store as well. In our case, we simply return HTTP 500 unless the page is cached, because these pages aren't worth much without data from their primary data store.
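
For quick unit tests that don't need a proxy, one option is to inject the client into the decorator so an outage can be faked with a test double. This is a sketch under stated assumptions: `ResilientSet`, `StoreUnavailable`, `DownClient` and `HealthyClient` are hypothetical names, and `StoreUnavailable` stands in for a real connection error such as `Redis::BaseConnectionError`.

```ruby
# Hypothetical stand-in for a driver connection error.
class StoreUnavailable < StandardError; end

# Same idea as the decorator above, but the client and the list of errors
# treated as "store is down" are passed in rather than hard-coded.
class ResilientSet
  def initialize(key, client:, errors: [StoreUnavailable])
    @key = key
    @client = client
    @errors = errors
  end

  # Returns the members, or an empty array while the store is unavailable.
  def get
    @client.smembers(@key)
  rescue *@errors
    []
  end
end

# Test double that behaves like the store during an outage.
class DownClient
  def smembers(_key)
    raise StoreUnavailable, "connection refused"
  end
end

# Test double that behaves like a healthy store.
class HealthyClient
  def smembers(_key)
    ["sale", "summer"]
  end
end

ResilientSet.new("product:tags:1", client: DownClient.new).get    # => []
ResilientSet.new("product:tags:1", client: HealthyClient.new).get # => ["sale", "summer"]
```

Errors that are not in the list still propagate, so programming mistakes are not silently turned into empty results.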

Monitoring

With Semian::Instrumentable, clients can monitor Semian internals. For example, to instrument just events with statsd-instrument:

# `event` is `success`, `busy`, `circuit_open`, `state_change`, or `lru_hash_gc`.
# `resource` is the `Semian::Resource` object (or a `LRUHash` object for `lru_hash_gc`).
# `scope` is `connection` or `query` (others can be instrumented too from the adapter) (is nil for `lru_hash_gc`).
# `adapter` is the name of the adapter (mysql2, redis, ..) (is a payload hash for `lru_hash_gc`)
Semian.subscribe do |event, resource, scope, adapter|
  case event
  when :success, :busy, :circuit_open, :state_change
    StatsD.increment("semian.#{event}", tags: {
      resource: resource.name,
      adapter: adapter,
      type: scope,
    })
  else
    StatsD.increment("semian.#{event}")
  end
end
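
When no StatsD backend is available, for instance in tests, the same callback shape can feed an in-memory counter instead. The sketch below is an illustration only: `EventCounter` and `FakeResource` are hypothetical names that are not part of Semian, and the notifications are simulated by calling the counter directly so the example runs without the gem.

```ruby
# In-memory event counter (hypothetical helper, not part of Semian).
class EventCounter
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Takes the same four positional arguments as the subscriber block above.
  def record(event, resource, scope, adapter)
    # Objects without a name (assumed here for lru_hash_gc) are grouped together.
    name = resource.respond_to?(:name) ? resource.name : :unnamed
    @counts[[event, name, scope, adapter]] += 1
  end
end

counter = EventCounter.new

# With Semian loaded, the wiring would look like this (left commented out so
# the sketch runs standalone):
#   Semian.subscribe do |event, resource, scope, adapter|
#     counter.record(event, resource, scope, adapter)
#   end

# Simulated notifications, with a Struct standing in for Semian::Resource.
FakeResource = Struct.new(:name)
shard = FakeResource.new(:mysql_shard_0)
counter.record(:success, shard, :query, :mysql2)
counter.record(:circuit_open, shard, :query, :mysql2)
counter.record(:circuit_open, shard, :query, :mysql2)

counter.counts[[:circuit_open, :mysql_shard_0, :query, :mysql2]] # => 2
```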

FAQ

How does Semian work with containers? Semian uses SysV semaphores to coordinate access to a resource. The semaphore is only shared within an IPC namespace, and each container normally gets its own. Unless you are running many workers inside every container, this leaves the bulkheading pattern effectively useless. We recommend sharing the IPC namespace between all containers on your host for the best ticket economy. If you are using Docker, this can be done with the --ipc flag.

Why isn't resource access shared across the entire cluster? This implies a coordination data store. Semian would have to be resilient to failures of this data store as well, and fall back to other primitives. While it's nice to have all workers have the same view of the world, this greatly increases the complexity of the implementation which is not favourable for resiliency code.

Why isn't the circuit breaker implemented as a host-wide mechanism? No good reason. Patches welcome!

Why is there no fallback mechanism in Semian? Read the Failing Gracefully section. In short, exceptions are exactly this. We did not want to put an extra level of abstraction on top of them. The first internal implementation had one, but we later moved away from it.

Why does it not use normal Ruby semaphores? To work properly the access control needs to be performed across many workers. With MRI that means having multiple processes, not threads. Thus we need a primitive outside of the interpreter. For other Ruby implementations a driver that uses Ruby semaphores could be used (and would be accepted as a PR).
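
To make the distinction concrete, here is a minimal sketch of a bulkhead built only from in-process Ruby primitives. `ThreadBulkhead` and `BusyError` are hypothetical names, not part of Semian. Its ticket count lives in the memory of a single process, so it can limit threads within that process but cannot coordinate separate worker processes, which is the limitation described above.

```ruby
# In-process bulkhead sketch (hypothetical, not Semian's implementation).
# Tickets are shared between threads of ONE process only.
class ThreadBulkhead
  BusyError = Class.new(StandardError)

  def initialize(tickets:)
    @available = tickets
    @mutex = Mutex.new
    @released = ConditionVariable.new
  end

  # Runs the block while holding a ticket. Raises BusyError if no ticket
  # becomes available within `timeout` seconds.
  def acquire(timeout:)
    take_ticket(timeout)
    begin
      yield
    ensure
      @mutex.synchronize do
        @available += 1
        @released.signal
      end
    end
  end

  private

  def take_ticket(timeout)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @mutex.synchronize do
      while @available.zero?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        raise BusyError, "no ticket within #{timeout}s" if remaining <= 0
        # Releases the mutex while waiting and re-acquires it on wake-up.
        @released.wait(@mutex, remaining)
      end
      @available -= 1
    end
  end
end

bulkhead = ThreadBulkhead.new(tickets: 1)
bulkhead.acquire(timeout: 0.01) do
  # The only ticket is taken, so a second caller fails fast with BusyError
  # instead of waiting for the slow resource.
end
```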

Why are there three semaphores in the semaphore sets for each resource? This has to do with being able to resize the number of tickets for a resource online.

Can I change the number of tickets freely? Yes, the logic for this isn't trivial but it works well.

What is the performance overhead of Semian? Extremely minimal in comparison to going to the network. Don't worry about it unless you're instrumenting non-IO.

Developing Semian

Semian requires a Linux environment for Bulkheading. We provide a docker-compose file that runs MySQL, Redis, Toxiproxy and Ruby in containers. Use the steps below to work on Semian from a macOS environment.

Prerequisites:

# install Docker Desktop (current Homebrew uses `--cask` in place of the old `brew cask` command)
$ brew install --cask docker

# install latest docker-compose
$ brew install docker-compose

# install visual-studio-code (optional)
$ brew install --cask visual-studio-code

# clone Semian
$ git clone https://github.com/Shopify/semian.git
$ cd semian

Visual Studio Code

  • Open semian in vscode
  • Install recommended extensions (one off requirement)
  • Click reopen in container (first boot might take about a minute)

See https://code.visualstudio.com/docs/remote/containers for more details

If you make any changes to .devcontainer/ you'd need to recreate the containers:

  • Select Rebuild Container from the command palette

Running Tests:

  • $ bundle exec rake
    Run with SKIP_FLAKY_TESTS=true to skip flaky tests (CI runs all tests).

Everything else

Test semian in containers:

  • $ docker-compose -f .devcontainer/docker-compose.yml up -d
  • $ docker exec -it semian bash

If you make any changes to .devcontainer/ you'd need to recreate the containers:

  • $ docker-compose -f .devcontainer/docker-compose.yml up -d --force-recreate

Run tests in containers:

$ docker-compose -f ./.devcontainer/docker-compose.yml run --rm test

Running Tests:

  • $ bundle exec rake
    Run with SKIP_FLAKY_TESTS=true to skip flaky tests (CI runs all tests).

Running tests in batches

  • TEST_WORKERS - Total number of workers (batches). The test suite is split into this many batches so that they can be run in parallel. Default: 1
  • TEST_WORKER_NUM - Specify which batch to run. The value is between 1 and TEST_WORKERS. Default: 1
$ bundle exec rake test:parallel TEST_WORKERS=5 TEST_WORKER_NUM=1

Debug

Build a semian native extension with debug information.

$ bundle exec rake clean --trace
$ export DEBUG=1
$ bundle exec rake build
$ bundle install
