• Stars
    star
    423
  • Rank 99,027 (Top 3 %)
  • Language
  • Created about 9 years ago
  • Updated about 7 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A guide to service principles at Yelp for our service oriented architecture

Service Principles

Creation

Discuss the organization of your overall system first

As you start thinking about designing your service, talk with your teammates and other service experts before writing any code or even design documents. Think about the features you want to add in addition to how they fit in with any existing features, products, and services. Think about the most sensible way to organize the entire feature set. Sometimes adding new features warrants a reorganization of existing components.

We want to avoid "append-only" service architecture where development only happens in the form of new services.

Check whether you can add your feature to an existing service

Before writing a new service, you should check that your feature doesn’t belong in any existing service. It may overlap with existing functionality, deal with the same information, or be a natural progression of an existing service’s scope. On the other hand, if adding the new functionality to an existing service would be surprising and confusing to users of that service, it probably doesn't belong there.

Consider whether your feature is better suited to a library

Even though this is a document about services, it's important to point out that some features are better suited to libraries. To help you make the right call for your feature, we describe some of the tradeoffs between libraries and services:

Upgrade speed Libraries are best suited to situations where it's acceptable for users to upgrade over long periods of time (normally weeks to months, sometimes years). Typically, this requires that your feature is relatively small and self-contained, resulting in a low rate of change. In contrast, if you anticipate lots of ongoing development, or you may need to quickly roll out a bugfix to all users, then your feature may be better suited to a service. This will tend to occur with more complicated features, potentially with external dependencies.

Performance and reliability Libraries live in the same address space as the calling process, whereas services live on the other end of a network cable. All other things being equal, it's going to be faster and more reliable to access your feature as a library. However, if your feature has high resource requirements (e.g. CPU, memory or disk) then implementing it as a service can allow clients to run more efficiently and allow the service hardware to be scaled independently of the client hardware.

Technology independence For the most part, Yelp is a Python shop, with a few Java systems on the backend. If your feature does need to be accessed by both Python and Java clients, then writing it as a service is going to allow you to avoid writing two different library implementations (one for each of Python and Java).

Services are curated by teams, not individuals

Each service should be curated by a team instead of an individual. This prevents situations where only one developer knows how to fix the service. In practice, this means that every service should be developed by more than one person from the very beginning, and operational responsibilities must be shared.

Services are a long-term commitment

When you write any code, you are committing your team to providing ongoing maintenance and operational support. Over time you will build up a set of consumers of your service that will depend on it to stay operational. You need on-going monitoring for performance, consistency, and brown-outs. You may also need to respond to failures in downstream systems.

Factor in the overhead of deploying a distributed system

The initial rollout of a service often takes longer than you might think. Sometimes a lot longer. This could be due to unexpected interactions between other services or difficulties in setting up the right monitoring. Load testing is rarely perfect so plan on going through several stages and iterations before being 100% deployed.

Prefer larger services

Aim to place related functionality into a single, larger service instead of multiple, smaller services. Note that your larger service should be logically cohesive i.e. you should still be able to concisely describe its behavior. The rationale for this guideline is as follows:

  • In-process calls are more reliable and faster than inter-service calls.
  • It is often easier to change just a single service instead of coordinating changes across multiple services. In particular, it is relatively expensive to change service interfaces.
  • At the operations level, it tends to be easier to keep a smaller number of more uniform processes correctly running.
  • There are some maintenance tasks which must be separately performed on each service e.g. upgrading to new versions of libraries. It is easier to do this if there are fewer services.

Minimize the depth of the service call-graph

When architecting services, prefer a shallow inter-service call-graph (service A calls services B, C and D) to a deep one (A calls B calls C calls D). The rationale for this principle is:

  • It is often easier to reason about shallow call-graphs; in the shallow case, B, C and D have no external service dependencies, whereas in the deep case both B and C depend on another service. Note that this fits naturally with the three-tier architecture.
  • It tends to increase performance; in the shallow case, the service calls can be executed in parallel (as long as they are independent), whereas in the deep case all code runs serially.

Minimize the number of services owned by your team

Your team is [hopefully] organized to deliver some product or related set of products. The services you develop should be aligned with that focus. If you have too many services, it means either your team is spread too thinly across dissimilar systems, or you're not unifying your services enough.

If your team is already responsible for a large number of critical services (ex: more services than developers), and the new feature does not fit into any of the existing services, you may need to re-distribute responsibility before your team is able to take on the task of building another service. You should also avoid sharing responsibility for services across more than one team.

Interfaces

The interface to your service may be more than just its public REST interface. Logs, data dumps and streams that are consumed by other services should be considered. Your interface is the sum of all the parts that are usable by clients.

For examples of good interface design we try to follow, see the GitHub API v3 and the PayPal REST API

Interfaces should be easy to understand

When designing an interface, follow best-practices such as:

  • Use self-describing names. Be consistent internally and externally.
  • Have domain experts review your interface.
  • Have one obvious way of performing each operation.
  • When porting to a service your existing functions won’t necessarily make the best network endpoints. Remote execution changes the nature of consistency, reliability, and performance. Design your service boundary to be loosely coupled with other systems.
  • Separate the interfaces that read from those that update. (See CQRS as an example)
  • Minimise and simplify the interface as much as possible. This will also aid in testing.
  • Document any areas where confusion may arise.
  • Ask yourself: "Could a new-hire understand your interface without having to talk to you?"

Interfaces should be robust

Design your interface as if it is going to be exposed to the wider Internet. Only expose information that is strictly required by clients. Where possible, avoid providing unsafe or expensive operations.

Differentiate between read-only methods vs those that change state. Ideally your read-only methods are idempotent and cacheable.

Changes to interfaces should be backwards compatible

Your interfaces should have some versioning mechanism. When you change a service interface, pre-existing clients will still be invoking the previous version of your interface. You must not break those clients. Plan on continuing to support these older clients, or to be spending additional time coordinating with them about upgrades. It’s acceptable to have several versions of an interface in use at the same time.

Testing

Any changes to your service should be able to be tested automatically

Yelp doesn't employ separate QA engineers. Instead, we rely on computers to do our validation. You are responsible for maintaining a good test suite for your service. Your tests should run quickly and reliably in both dev and test environments.

A good test suite is like a pyramid. At the base are your unit tests: many fast, small tests that validate individual implementation details of your code. In the middle are your integration tests, in which you validate how various components of your service interact. At the top is a small but critical set of acceptance tests that validate your service works correctly with dependent services.

Your interface is the most important thing to test

Your interface is an important part of your contract. The interface is what your clients use and interact with. If you change your interface and break or change how it behaves you're affecting all clients of your service. This can have widespread impact.

This is why your interface is the most vital component to test. Your interface tests will tell you what your client actually sees, while your remaining tests will inform you on how to ensure your clients see those results. Ensure that all active versions of your interface perform consistently. Write these tests as early as possible, since the desired behavior from a blackbox testing perspective should be driving your interface design.

Operations

You are responsible for running your service

Your team must be committed to the on-going maintenance of your service. Your friendly Operations team is the first line of defense when there are overall site problems, but they're mostly doing triage when services are involved. It's your responsibility to monitor the health of your service, have meaningful alerts, and have a plan of action in the event of problems. You are the one who knows best how your service works, so you're best positioned to identify and fix problems when they arise.

Write a runbook for your service. Fill it with common operational cases (such as deployment, monitoring, and troubleshooting). Document known errors.

Guide your clients’ expectations

You need to be able to clearly communicate to clients the operational characteristics of your service. We suggest that you actively monitor and performance test your service in order to understand these characteristics and that you commit to certain thresholds that the service should maintain.

For example, if I tell my clients that "99% of requests return in under 100ms with 99.99% uptime", it means I should be actively monitoring the performance and availability of my service to ensure that I'm sticking to my guarantee.

It's okay for different services to have different operational guarantees. The uptime of a "cat picture of the day" service that's only used internally by your team for laughs may not even be actively monitored or alerted. A service that's production-facing and part of the critical path to major endpoints should have a much higher commitment to uptime, performance, and any problems that arise.

Plan for failure

One of the major defining features of a Service Oriented Architecture is a vast increase in the number of possible failure scenarios in your distributed system. Operationally, this means your service must strive to be honest about the collaborating services and datastores it requires to operate, and those whose downtime can be gracefully handled.

Plan from the start of your service to identify external dependencies and document your relationship to them. Place metrics on external links, and put in place both passive and developer-operated protections so you can protect your service against external failures. Finally, make sure you actively monitor these potential weak points, and alert your on-call team appropriately depending on their severity.

In addition, your service will likely be running across multiple machines, racks, and datacenters, each with their own possibility of failure. Understand how your service will behave under each scenario.

Well-designed services consider all and every failure points and make a conscious decision on how to degrade in every single one of these.

Additional Reading

More Repositories

1

elastalert

Easy & Flexible Alerting With ElasticSearch
Python
7,926
star
2

dumb-init

A minimal init system for Linux containers
Python
6,624
star
3

detect-secrets

An enterprise friendly way of detecting and preventing secrets in code.
Python
3,395
star
4

mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
Python
2,609
star
5

osxcollector

A forensic evidence collection & analysis toolkit for OS X
Python
1,858
star
6

paasta

An open, distributed platform as a service
Python
1,655
star
7

undebt

A fast, straightforward, reliable tool for performing massive, automated code refactoring
Python
1,632
star
8

MOE

A global, black box optimization engine for real world metric optimization.
C++
1,306
star
9

dockersh

A shell which places users into individual docker containers
Go
1,282
star
10

dataset-examples

Samples for users of the Yelp Academic Dataset
Python
1,189
star
11

yelp.github.io

A showcase of projects we've open sourced and open source projects we use
JavaScript
701
star
12

bravado

Bravado is a python client library for Swagger 2.0 services
Python
600
star
13

yelp-api

Examples of code using our v2 API
PHP
580
star
14

swagger-gradle-codegen

💫 A Gradle Plugin to generate your networking code from Swagger
Kotlin
407
star
15

pyleus

Pyleus is a Python framework for developing and launching Storm topologies.
Python
406
star
16

mysql_streamer

MySQLStreamer is a database change data capture and publish system.
Python
405
star
17

yelp-fusion

Yelp Fusion API
Python
396
star
18

docker-custodian

Keep docker hosts tidy
Python
354
star
19

android-school

The best videos from the Android community and beyond
349
star
20

Tron

Next generation batch process scheduling and management
Python
340
star
21

kafka-utils

Python
312
star
22

bento

A delicious framework for building modularized Android user interfaces, by Yelp.
Kotlin
305
star
23

Testify

A more pythonic testing framework.
Python
303
star
24

clusterman

Cluster Autoscaler for Kubernetes and Mesos
Python
295
star
25

kotlin-android-workshop

A Kotlin Workshop for engineers familiar with Java and Android development.
Kotlin
289
star
26

threat_intel

Threat Intelligence APIs
Python
264
star
27

python-gearman

Gearman API - Client, worker, and admin client interfaces
Python
242
star
28

nrtsearch

A high performance gRPC server on top of Apache Lucene
Java
239
star
29

py_zipkin

Provides utilities to facilitate the usage of Zipkin in Python
Python
223
star
30

fuzz-lightyear

A pytest-inspired, DAST framework, capable of identifying vulnerabilities in a distributed, micro-service ecosystem through chaos engineering testing and stateful, Swagger fuzzing.
Python
193
star
31

yelp-python

A Python library for the Yelp API
Python
182
star
32

venv-update

Synchronize your virtualenv quickly and exactly.
Python
178
star
33

firefly

Firefly is a web application aimed at powerful, flexible time series graphing for web developers.
JavaScript
171
star
34

amira

AMIRA: Automated Malware Incident Response & Analysis
Python
151
star
35

YLTableView

Objective-C
146
star
36

love

A system to share your appreciation
Python
141
star
37

aactivator

Automatically source and unsource a project's environment
Python
139
star
38

lemon-reset

Consistent, cross-browser React DOM tags, powered by CSS Modules. 🍋
JavaScript
131
star
39

detect-secrets-server

Python
109
star
40

bravado-core

Python
108
star
41

data_pipeline

Data Pipeline Clientlib provides an interface to tail and publish to data pipeline topics.
Python
108
star
42

dataloader-codegen

🤖 dataloader-codegen is an opinionated JavaScript library for automatically generating DataLoaders over a set of resources (e.g. HTTP endpoints).
TypeScript
107
star
43

yelp-ruby

A Ruby gem for communicating with the Yelp REST API
Ruby
105
star
44

swagger_spec_validator

Python
103
star
45

ybinlogp

A fast mysql binlog parser
C
97
star
46

beans

Bringing people together, one cup of coffee at a time
Python
90
star
47

casper

A fast web application platform built in Rust and Luau
Rust
86
star
48

schematizer

A schema store service that tracks and manages all the schemas used in the Data Pipeline
Python
85
star
49

requirements-tools

requirements-tools contains scripts for working with Python requirements, primarily in applications.
Python
81
star
50

osxcollector_output_filters

Filters that process and transform the output of osxcollector
Python
76
star
51

sensu_handlers

Custom Sensu Handlers to support a multi-tenant environment, allowing checks themselves to emit the type of handler behavior they need in the event json
Ruby
75
star
52

kegmate

Arduino/iPad powered kegerator
Objective-C
72
star
53

graphql-guidelines

GraphQL @ Yelp Schema Guidelines
Makefile
70
star
54

ephemeral-port-reserve

Find an unused port, reliably
Python
66
star
55

parcelgen

Helpful tool to make data objects easier for Android
Python
65
star
56

yelp-ios

Objective-C
62
star
57

salsa

A tool for exporting iOS components into Sketch 📱💎
Swift
62
star
58

docker-observium

Observium docker image with both professional and community edition support, ldap auth, and easy plugin support.
ApacheConf
57
star
59

yelp-android

Java
55
star
60

terraform-provider-signalform

SignalForm is a terraform provider to codify SignalFx detectors, charts and dashboards
Go
44
star
61

mycroft

Python
42
star
62

terraform-provider-gitfile

Terraform provider for checking out git repositories and making changes
Go
40
star
63

pidtree-bcc

eBPF tool for logging process ancestry of outbound TCP connections
Python
40
star
64

ffmpeg-android

Shell
39
star
65

pushmanager

Pushmanager is a web application to manage source code deployments.
Python
38
star
66

zygote

A Python HTTP process management utility.
Python
38
star
67

yelp_kafka

An extension of the kafka-python package that adds features like multiprocess consumers.
Python
38
star
68

pgctl

Manage sets of developer services -- "playground control"
Python
31
star
69

EMRio

Elastic MapReduce instance optimizer
Python
31
star
70

s3mysqldump

Dump mysql tables to s3, and parse them
Python
31
star
71

pyramid_zipkin

Pyramid tween to add Zipkin service spans
Python
28
star
72

android-varanus

A client-side Android library to monitor and limit network traffic sent by your apps
Kotlin
27
star
73

puppet-netstdlib

A collection of Puppet functions for interacting with the network
Ruby
27
star
74

sqlite3dbm

sqlite-backed dictionary conforming to the dbm interface
Python
27
star
75

send_nsca

Pure-python NSCA client
Python
26
star
76

data_pipeline_avro_util

Provides a Pythonic interface for reading and writing Avro schemas
Python
26
star
77

cocoapods-readonly

Automatically locks all CocoaPod source files.
Ruby
26
star
78

uwsgi_metrics

Python
26
star
79

docker-push-latest-if-changed

Python
25
star
80

WebImageView

An enhanced and improved ImageView for Android that displays images loaded over the interwebs
Java
25
star
81

task_processing

Interfaces and shared infrastructure for generic task processing at Yelp.
Python
23
star
82

PushmasterApp

(Legacy) Yelp pushmaster application built on Google App Engine
Python
22
star
83

tlspretense-service

A Docker container that exposes tlspretense on a port.
Makefile
20
star
84

puppet-uchiwa

Puppet module for installing Uchiwa
Ruby
20
star
85

yelp_cheetah

cheetah, hacked by yelpers
Python
20
star
86

logfeeder

Python
20
star
87

fido

Asynchronous HTTP client built on top of Crochet and Twisted
Python
20
star
88

pyramid-hypernova

A Python client for Airbnb's Hypernova server, for use with the Pyramid web framework.
Python
19
star
89

swagger-spec-compatibility

Python library to check Swagger Spec backward compatibility
Python
19
star
90

mr3po

protocols for use with mrjob
Python
16
star
91

YPFastDateParser

A class for parsing strings into NSDate instances, several times faster than NSDateFormatter
Objective-C
15
star
92

yelp_uri

Utilities for dealing with URIs, invented and maintained by Yelp.
Python
14
star
93

pysensu-yelp

A Python library to emit Sensu events that the Yelp Sensu Handlers can understand for Self-Service Sensu Monitoring
Python
14
star
94

terraform-provider-cloudhealth

Terraform provider for Cloudhealth
Go
14
star
95

yelp-rails-example

An example Rails application that uses the Yelp gem to integrate with the API
Ruby
13
star
96

named_decorator

Dynamically name wrappers based on their callees to untangle profiles of large python codebases
Python
12
star
97

pt-online-schema-change-plugins

Perl
11
star
98

puppet-cron

A super great cron Puppet module with timeouts, locking, monitoring, and more!
Ruby
11
star
99

doloop

Task loop for keeping things updated
Python
10
star
100

environment_tools

Tools for programmatically describing Yelp's different environments (prod, dev, stage)
Python
10
star