How can microservices help make systems more resilient?
I think this is a topic that needs more attention when people consider moving to microservices. There are subtle issues to address.
I’ll say that microservices can help make a system more resilient, depending on how you decompose your system into services and how you build each service. For me, the question is not about how large the service is, but how large the “failure domains” are. One microservice architecture may in fact be a single large failure domain, whereas another may be split into a dozen isolated failure domains.
There are two dimensions to consider. First, what services are needed for any given feature? I talk about an “activation graph” of services. That is, given a particular request, what services must be involved to fully deliver that request? The more overlap between the activation graphs of all your request types, the less resilient your architecture is. This is why I think that so-called “entity services” are a bad idea; they tend to produce activation graphs with huge overlap.
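As a rough illustration of the "activation graph" idea, the sketch below records which services each request type needs and then measures the overlap. The request types, service names, and both helper functions are hypothetical, invented only for this example.

```python
from itertools import combinations

# Hypothetical activation graphs: request type -> services it must touch.
ACTIVATION = {
    "browse_catalog": {"catalog", "pricing"},
    "checkout": {"cart", "pricing", "payment"},
    "track_order": {"orders", "shipping"},
}


def blast_radius(activation):
    """For each service, collect the request types that break if it is down."""
    radius = {}
    for request, services in activation.items():
        for service in services:
            radius.setdefault(service, set()).add(request)
    return radius


def overlap(activation):
    """Return the services shared by each pair of request types."""
    return {
        (a, b): activation[a] & activation[b]
        for a, b in combinations(sorted(activation), 2)
    }
```

In this made-up system, "pricing" sits in two activation graphs, so losing it takes out both browsing and checkout, while "track_order" shares nothing with the others and forms its own failure domain. A service that appears in nearly every graph is the kind of overlap the answer warns about.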
The second dimension to consider is whether individual services have an easy way to handle failure in their dependencies. The caller must still respond to requests even when some of its dependencies are timing out or refusing connections. This means that resilience should be built into every service, preferably with a common framework so that monitoring and administration are simplified.
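One way a caller might keep responding when a dependency times out or refuses connections is to bound the call with a timeout and fall back to a default answer. The following is a minimal sketch using only the Python standard library; `call_with_fallback`, the pool size, and the default timeout are assumptions made for illustration, not part of any framework mentioned here.

```python
import concurrent.futures

# Shared worker pool (hypothetical sizing) so a slow dependency
# cannot block the thread that is serving the request.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_fallback(dependency, fallback, timeout_s=0.5):
    """Run dependency(); return fallback if it is too slow or unreachable."""
    future = _pool.submit(dependency)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The dependency is too slow. cancel() only helps if the task
        # has not started yet; a running call is simply abandoned.
        future.cancel()
        return fallback
    except ConnectionError:
        # Covers refused and reset connections raised by the dependency.
        return fallback
```

A caller could use it as `call_with_fallback(fetch_recommendations, fallback=[])`, where `fetch_recommendations` is a hypothetical function: the page still renders, just without recommendations. Putting a wrapper like this in a shared library is one way to get the common framework mentioned above.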
The Circuit Breaker pattern from “Release It!” is now a major principle of resilient design. Netflix’s Hystrix is one of the most prominent implementations of it. What is the essence of this pattern?
In your house, a short circuit produces a high electrical current that can lead to a fire. To prevent that, electricians install a circuit breaker. It detects a dangerous condition (excess current) and intervenes to prevent the catastrophe. The software circuit breaker performs the same function: it prevents a partial failure from becoming a catastrophic outage.
In implementation terms, all calls to an external interface go through a component that can watch for failures in requests to the provider. If too many calls fail, the component treats the provider as unavailable. Whether that is due to the provider or the network being down is immaterial. From the caller’s perspective, all that matters is that the service cannot be reached.
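The component described above can be sketched in a few lines. This is a simplified, single-threaded illustration and not Hystrix's API: the class name, the consecutive-failure threshold, and the reset timeout (after which one trial call is allowed through) are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the provider."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout_s = reset_timeout_s      # hypothetical default
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Fail fast: do not even attempt to reach the provider.
                raise CircuitOpenError("provider presumed unavailable")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            was_open = self._opened_at is not None
            if was_open or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and handles `CircuitOpenError` by returning a degraded response. The breaker does not care why a call failed, which matches the point that only reachability matters to the caller. A production version would also need locking for concurrent callers.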