00011 : Service discovery, load balancing and routing

ServiceStack, a journey into the madness of microservices
  1. Context: the what and the why?
  2. Distributed debugging and logging
  3. Service discovery, load balancing and routing
  4. Service health, metrics and performance
  5. Configuration
  6. Documentation
  7. Versioning
  8. Security and access control
  9. Idempotency
  10. Fault-tolerance, Cascading failures
  11. Eventual consistency
  12. Caching
  13. Rate-limiting
  14. Deployment, provisioning and scaling
  15. Backups and Disaster Recovery
  16. Services Design
  17. Epilogue

In the previous post, I covered the challenges associated with debugging and logging RPC calls across distributed systems. Now let's turn our attention to how those RPC calls work in your services.

This boils down to a simple fact: services making RPC calls depend on services running in other processes, often on other machines.

As a system grows, and services are added or removed, keeping track of what services are available and where they are becomes an issue.

You could hard-code in each service the locations of the services it depends on, but that tends to break down once you have two services and need to add the third!

Once you need to run multiple instances of the same service, or use containers and elastic scaling, suddenly DNS propagates too slowly and you don't know where everything is.

Re-deploying your service every time a service it depends on is updated or moved is untenable, so these kinds of dependencies between services must be decoupled.

It quickly becomes apparent that you need a more dynamic solution.

You need service discovery.

O Services, Services, wherefore art thou Services?

There are a number of tried-and-tested methods for discovery to be found in DHCP, Bonjour, uPnP, SSDP and DNS-SD. For web-based services, UDDI and WS-Discovery have come and - for the most part - gone.

Newer solutions like Zookeeper, Etcd and Consul have emerged to offer service discovery.

Gateways like NGINX also provide routing options which can be used for decoupling service-to-service calls.

Enterprise Service Bus systems like NServiceBus and MassTransit can also be used in a pub/sub messaging pattern to decouple service-to-service calls.

I've mentioned just a few but there are many more. You have a lot of options here, so how do you choose?

Let's first briefly cover some different patterns before I cover what we have chosen to use and why.

Centralised Registry vs. Self-Discovery

There are two common patterns that you find in solutions for Service Discovery.

The first is the service registry, a centralised database that stores the location of a service.

The second is self- or auto-discovery, where there is no central database; this is often found in zero-configuration networking. Instead, clients broadcast packets across the network to request a remote service and wait for the required service to respond with its location.

The service registry is another single point of failure (SPF) in your infrastructure but can provide more operational control. When used with server-side discovery, which is often found in gateways, it can completely decouple any discovery logic from the services.

Zero-configuration networking tends to relax security within a network so that devices 'just work', but that can make it harder to secure as systems span networks. It is often more suitable for smaller networks (uPnP, Bonjour etc.).

Communication

There are four common types of service-to-service communication.

  1. Point-to-point : services talk directly to each other.

  2. Gateway : acts as the middleman, handling the routing of requests and responses between services.

  3. Gateway Request : the responding service replies directly to the calling service rather than returning through the gateway.

  4. Message Queue : services publish messages to a queue; the responding service subscribes to those messages and, in turn, publishes its response to the queue for the original service to consume.

Point-to-point involves the shortest route so is often the quickest but requires each end-point to take a dependency on your discovery mechanism.

The gateway can decouple many concerns from your services, handling not just routing, but caching, front-end to back-end bridging with HTTPS termination, transport conversions like HTTP to TCP/IP, formats, aggregation and load-balancing, to name just a few.

The message-queue pub/sub model is slower and is more suited for longer running processes.

Registration

For service registries only, the registration can be handled by each client directly or by the server.

As with server-side discovery, server-side registration completely decouples registration from your services.

Further reading

I've only scratched the surface above, keeping the explanations as brief as possible because I want to get on to some specifics, but you can find a much better, more detailed overview of Service Discovery in Chris Richardson's excellent post, part of his series on Microservices.

Chris also has many video talks and articles available online and speaks very eloquently on all matters relating to distributed design which I have greatly enjoyed during my own research. I highly recommend checking them out.

It's make your mind up time.

So this is the first critical point where we had a variety of choices to make in our design.

Do we want smart versus dumb pipes? How about decentralised control with auto-discovery? How does our communication behave? Who controls registration? Is one single approach for all scenarios even practical?

For us they are opinionated and deliberate choices.
The approach that follows is not inherently better or worse than others, but each choice has consequences for many of the subsequent design decisions; in many cases, it removes choices altogether.

We will come back to reference these choices in the rest of this series.

It is also worth pointing out that I couldn't try out everything available, so our choice is not a reflection on other solutions out there, it is just the one I felt best fit ServiceStack and suited our needs.

And the winner is

Consul. Let's cover the basics of Consul before we tackle how it fits in with ServiceStack.

Consul is a single binary executable that runs on Windows, Linux or OS X. It can run as a server node, as a local agent, or as a command-line tool for sending commands to other Consul instances.

We use it as a service registry with client-side self-registration, client-side discovery and this enables point-to-point service RPCs.

Consul Datacenter

Consul, like any service registry, is a potential SPF, but it is designed with high availability in mind.

In production, you run an odd number of Server nodes which form a DataCenter (DC), typically three or five. You can scale Consul to connect multiple datacenters.

The odd number is because Consul implements a consensus protocol based on Raft, which holds leader elections; electing a leader requires a majority (quorum), and an odd number of nodes avoids tied votes.

For the best possible resiliency, server nodes can be spread across physical hardware, network locations and operating systems. Running three instances allows a single node to fail while running five can tolerate two node failures.
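The failure-tolerance numbers above are just majority-quorum arithmetic. A standalone sketch (plain C#, nothing Consul-specific):

```csharp
using System;

class QuorumMath
{
    static void Main()
    {
        // Raft needs a majority (quorum) of server nodes to elect a leader,
        // so a cluster of n nodes tolerates (n - 1) / 2 failures.
        foreach (var n in new[] { 3, 4, 5 })
        {
            int quorum = n / 2 + 1;
            int tolerated = (n - 1) / 2;
            Console.WriteLine($"{n} nodes: quorum={quorum}, tolerates {tolerated} failure(s)");
        }
        // Note: 4 nodes tolerate no more failures than 3, which is why
        // odd-sized clusters are recommended.
    }
}
```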

Consul is actually a hybrid of the server-side and client-side models, something also found in Netflix's Eureka. This approach avoids a typical drawback of client-side discovery and self-registration systems: network availability and latency.

It avoids this by using local agents on a loopback address.

Consul DataCenter and Agent

Each service has access to an agent co-located on the same physical hardware. Consul uses a gossip protocol (Serf) for membership management, failure detection and message broadcasting, and Raft logs to keep each agent's list of services synchronised.

This means lookups and registrations are local and fast with no network hops.

ServiceStack [enters stage left]

This is my discovery solution, there are many just like it, but this one is mine.

So now we've made our first design choices, let me introduce our next plugin.

ServiceStack.Discovery.Consul

There is a detailed readme on the project which, as in previous posts, I won't cover here, but the minimum code to configure discovery in your ServiceStack AppHost is as follows:

public override void Configure(Container container)  
{
    SetConfig(new HostConfig
    {
        // the external url:port that other services will use to access this one
        WebHostUrl = "http://api.acme.com:1234",
    });

    // Register the plugin, that's it!
    Plugins.Add(new ConsulFeature());
}

Your ServiceStack instances can now communicate with each other requiring nothing more than a copy of the DTO POCO. This is where ServiceStack and its DTO message-driven style really shine.

You interact with local and remote services solely through simple DTO POCO message contracts.

For most service discovery solutions, you have to know first which service you want to call. Not so for our plugin.

The difference in calling a local or remote service is indistinguishable in your code.

public class MyService : Service  
{
    public void Any(RequestDTO dto)
    {
        // The gateway will automatically use the DTO type to find the correct service
        var internalResponse = Gateway.Send(new InternalDTO { ... });
        var externalResponse = Gateway.Send(new ExternalDTO { ... });
    }
}

This makes it easy to develop all your services in a single instance. You can then split them out as you need to scale, but your calling code remains exactly the same.

There are no references and no uris.

Just look at the code and let that all sink in for a second.... it's more ServiceStack magic and it's so simple, it has caused a few WTFs!

Behind the curtain, the wizard is revealed

So how does it work?

Discovery

When the AppHost starts up, it registers itself with Consul. In doing so it passes a list of all the DTOs it is able to process.

Combined with ServiceStack's ability to export its DTOs and its native pre-defined routes, this makes it easy to move service methods between projects.

To call a remote method, the calling service only needs a copy of the DTO (the contract) with the same name and structure as the remote service's.

The gateway will recognise any DTO it cannot process itself and instead look up the correct service from Consul.

This allows our plugin, with Consul's help, to provide automatic and completely transparent DTO routing.

This also avoids the overheads of message-bus and gateway-style discovery by allowing point-to-point communication between services.
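Pre-defined routes are what make that lookup possible: every ServiceStack service automatically answers its request DTOs at /{format}/reply/{RequestDtoName}. So once Consul has resolved a host for a DTO, building the call is trivial. A rough illustration (this helper is mine, not the plugin's actual code):

```csharp
using System;

public static class DtoRouting
{
    // Builds the pre-defined 'reply' route that ServiceStack exposes for
    // every request DTO: {baseUrl}/{format}/reply/{RequestDtoName}
    public static string ResolveUrl(Type requestDtoType, string baseUrl)
    {
        return $"{baseUrl}/json/reply/{requestDtoType.Name}";
    }
}

// e.g. DtoRouting.ResolveUrl(typeof(ExternalDTO), "http://api.acme.com:1234")
//      => "http://api.acme.com:1234/json/reply/ExternalDTO"
```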

The verbiage on verbs

It is worth expanding slightly to cover how HTTP Verbs work in ServiceStack.

By default, ServiceClient.Send(), Gateway.Send() and Gateway.SendAsync() all use the POST verb.

There are two methods by which you can control this behaviour.

The first is to use the verb specific methods available on the ServiceClient:

var externalDto = new ExternalDTO();  
var client = new JsonServiceClient("http://myservice");

// HTTP GET
client.Get(externalDto);

// HTTP PUT (async)
client.PutAsync(externalDto);

// HTTP DELETE
client.Delete(externalDto);

The second method, which is only available to the Gateway, and the one we therefore have to use, is the IVerb interface markers on the DTOs.

public class ExternalDTO : IGet, IReturn<ExternalDTOResponse>  
{
  ...
}

// Gateway.Send() + IGet is an alias for Gateway.Get()
Gateway.Send(new ExternalDTO());  

This approach also helps decouple the HTTP-verb specifics of external calls from your call sites, and instead makes the DTO responsible for defining how it is sent.

But wait, there's more...

In addition, Consul provides another piece of the infrastructure jigsaw which our plugin handles for you - service health which we will cover in our next topic.

The gateway will also select the correct format for retrieving the DTO. If your remote service only communicates in XML, it will transparently call it using XML but return you a POCO.

It will also automatically cache responses from a GET request according to the remote service's cache settings. In some cases, it will not even issue an RPC, instead returning you the DTO response straight from the cache.

Our future roadmap also includes configurable time-out, retry and cache fall-back policies.

Let's get down to brass-tacks, how much for the API..?

We think the simplicity and low-ceremony approach above is really compelling, but it doesn't come for free. There are opinionated choices we've made to allow it to work this way.

So this is where we cover the consequences of those decisions and the first one is a whopper.

We've thrown RESTful routing under a bus

Oh my!

Hiding from RESTafarians

We have reasons for this, which I cover next under routing. It may be possible to make RESTful routing work with Consul, but I don't yet see a way to make it robust or elegant.

DTOs MUST be globally unique.

This one is actually part of the ServiceStack guidelines anyway so we don't feel bad about this at all.

The third is another whopper, to which I have a whole topic devoted later on; so for now I won't clarify further, but instead lob this like a grenade into the fire-pit.

You cannot EVER make a breaking-change to a DTO

Run Away!!! <Runs away>

Routing

Instead of REST and all the great custom and fallback routing options in ServiceStack, we have chosen to use only ServiceStack's pre-defined-routes.

Together with our second consequence of globally unique DTOs, this allows the RPC routing to just work with Consul.

So let me try and explain why we've not only ignored RESTful routing, but will actively seek to prevent it being used directly in our Services.

There are a few reasons behind this but first it might help to clarify that we plan to use services internally at first, but later on expose them externally using a Gateway to be built on top of Consul.

Internally, with ServiceStack's ServiceClient and the DTOs, you already have fully end-to-end typed API calls, so you never really need to see a URI, let alone care what it is. For these callers, this isn't so bad.

We expect that most of the internal calls will use this typed approach.

You can use custom routes, and the service-to-service calls will even use them. This is not really the problem area though.

Any non-ServiceStack client that wants to consume the services would have to go via Consul to find the right service, and Consul doesn't know a thing about your custom routes.

This affects the few internal apps or services that do not use the ServiceStack client and probably the MOST important group, the external clients.

Friends don't let friends break contracts

Hey Bob,

thank you for being a loyal customer, you mean the world to us.

Because we love you so much Bob, we're super-duper excited to announce our brand new [feature] and tell you how it will change your life.

You'll literally forget your own name, that's how amazing it is!

Here is our super-secret incrementing beta code, just for our most special customers, like you Bob.

Code: 37,027,491

Thanks again Bob, you're so amazing!!!

p.s. [Feature] requires you re-write all existing integration before launch at 3pm EST tomorrow :)


$#c*$%g WHAT?!

When accessing any external resource, the last thing you want as a consumer is for that contract to change.

...ever.

It's painful, it involves additional work you can't plan for, work you don't have time to do.

In HTTP, these are contracts:

// Fragile, things which could change are both 'ordered' and 'embedded'
http://api.acme.com/account/123/orders/12352/shipped/2016/01

// Fragile, change requires running multiple endpoints and causes 'churn' for clients
http://api.v2.acme.com/anything

// predefined route *never* changes, DTO is the contract and *will not change* 
http://api.acme.com/sync/reply/accountorders  

In code, these are contracts:

// Fragile, change to signature or return type, breaks clients (see WCF, WebAPI)
public string GetAccountOrders(int id, bool includeCompleted) { ... }

// message contract, any change to DTO, does not *have* to break clients
public AccountOrdersResponse Get(AccountOrders request) { ... }  

Contract stability is of paramount importance, but addenda to contracts are OK.

So clumsily put, if we ensure our DTOs are backward-compatible, we have far more stability in our contracts. Contracts that can tolerate change. Contracts that instil confidence and the trust of consumers.
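To make "backward-compatible" concrete, here is a sketch of an additive, non-breaking change to the AccountOrders contract above (the IncludeCompleted property is my illustrative addition):

```csharp
// The same AccountOrders contract, evolved without breaking old clients.
public class AccountOrders : IGet, IReturn<AccountOrdersResponse>
{
    // v1: original property, untouched
    public int Id { get; set; }

    // v2 addendum: a new *optional* property with a safe default.
    // Old clients never send it; old servers simply ignore it.
    public bool? IncludeCompleted { get; set; }
}

// Breaking changes to avoid: renaming or removing a property, changing a
// property's type, or renaming the DTO itself (its name *is* the route).
```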

Another reason for avoiding custom routing in ServiceStack is the complexity of making it work correctly.

In what order do I add this service's routes to the routing table?

Will a fall-back or over-generous catchall route suddenly grab all other services requests?

Will the new dev/team remember to respect the guidelines?

As I mentioned previously, adding an external gateway is part of our future plans and we expect it to handle things like load-balancing, traffic shaping and SSL termination, all in one place, rather than in each service.

If in that future, we must have RESTful routing, it will be as a decoupled, globally managed concern in that gateway, carefully managing the mapping of routes to services. Even this though, by its nature, is static and prone to 'churn' in such a dynamic environment. (see schema changes in ORMs)

We are currently looking at a few options for gateways, so I'll simply mention one that stands out so far: Fabio.

It looks to have great integration with Consul and avoids the need for more complex Consul-template solutions. Another one for the roadmap.

Load-balancing

Finally, for this (not so micro)-post, we come to load-balancing: the ability to distribute requests between multiple instances of a service.

Definitely our weakest area of the three right now, we have some plans and ideas but they are still in their infancy.

Consul gives service-to-service calls a not-really-load-balancing version of load-balancing.

It keeps track of round trip times (RTT) for its agents using network co-ordinates.

If you have multiple instances of a service available to process a DTO, our plugin will sort them by agent RTT, giving you the most responsive instance first.

This isn't really load-balancing, more QoS, but it is useful nonetheless and worth mentioning.
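A sketch of the idea, with illustrative types rather than the plugin's actual ones:

```csharp
using System.Collections.Generic;
using System.Linq;

public class ServiceInstance
{
    public string BaseUrl { get; set; }
    public double EstimatedRttMs { get; set; } // from Consul's network coordinates
}

public static class InstanceSelector
{
    // Not balancing load, just preferring responsiveness: order the
    // candidates able to handle the DTO by estimated round-trip time.
    public static ServiceInstance MostResponsive(IEnumerable<ServiceInstance> candidates)
        => candidates.OrderBy(c => c.EstimatedRttMs).FirstOrDefault();
}
```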

Consul also maintains a separate service catalog per datacenter. Using this, we could locate datacenters and their services in different geographic regions to even out global traffic loads.

For true load-balancing though, we have to look for other solutions and they lie outside of each service.

A gateway is the most obvious candidate for this and Fabio allows you to split traffic between services based on rules, useful for things like canary deployments as well as more traditional load-balancing.

In the world of microservices however, we actually have all the ingredients we need to make something ourselves if we need to.

Having a service registry in Consul with RTT, Health and performance metrics information from logging for every service end-point opens up interesting possibilities for using that data. Combined with a good automated deployment pipeline, there are possibilities for elastic scaling. I'll explore this in more detail in the deployment topic.

Wrap it up, chuck!

There is a constant tension between how much 'smarts' you put into each service and how much is centrally managed. We are trying to find a good balance.

The service discovery and registry is a fundamental part of our overall design though. I think it allows us to decouple a lot of the other parts we will need on our journey to microservices.

Parts that can be independent, composable, infrastructure-centric microservices of their own because of this design.


So at last we come to the end of part III.

There was a lot to cover here and there are parts I feel I haven't explained as well as I could, and parts I have skimmed over or left out entirely.

Definitely a couple of things to divide opinions.

If I've missed anything, or you have your own great ideas or projects, let me know in the comments.

Also, we'd love others in the community to get involved with our plugins on GitHub so don't be shy.

:)

so without further ado...NEXT!

Let's do microservices!

next up: Service health, metrics and performance [coming soon]

00010 : Logging and Debugging

Foreword:

This post on microservices started innocently as a single post. They always do, don't they?

Before I knew it, it was 10k words and showing no sign of stopping and I was advised to split it up into a series, "nah, I said, it'll be fine".

I was then advised again to split it at least into two, the first few thousand in one and the rest in the other. I relented this time.

Perhaps it is just the nature of software, or perhaps it is 'scope-creep' (we've all been there right!) but, I've only gone and written a monolithic microservices post!

I see the irony here, so I've decided to break it apart into 'micro-posts'; thanks @adamralph :)

Each 'micro-topic' will become a 'micro-post' of its own in our journey; so my apologies to those who wanted the full meal, your dinner will now be served as Tapas!


ServiceStack, a journey into the madness of microservices
  1. Context: the what and the why?
  2. Distributed debugging and logging
  3. Service discovery, load balancing and routing
  4. Service health, metrics and performance
  5. Configuration
  6. Documentation
  7. Versioning
  8. Security and access control
  9. Idempotency
  10. Fault-tolerance, Cascading failures
  11. Eventual consistency
  12. Caching
  13. Rate-limiting
  14. Deployment, provisioning and scaling
  15. Backups and Disaster Recovery
  16. Services Design
  17. Epilogue

Framing the problem

In a monolith world with all the great modern tooling available to you, it is easy to load a project up in your IDE, set some breakpoints, and hit 'Debug'.

You can step through your code line-by-line and inspect its state.

Or you can write integration tests to assess the state of a system and to verify that the components of the system are interacting correctly with each other.

The single-threaded process has rock-solid reliability and low-latency for inter-process communications.

If you call a method on a class, you aren't concerned about whether execution will reach that method call, or whether it will take an unacceptable amount of time to get there.

Even at the boundaries of a process, things are often binary. If the database that runs the application or a file resource is unavailable, the application will crash or display an error.

You can check application logs, the system eventlog, the coredump or WinDbg type listeners to help you find, reproduce and fix the problem.

It's all in one place.

Next comes the multi-threaded process, to which nearly all of the above also applies.

How many of you have experienced race-conditions in one of your multi-threaded applications?

I know I have.

Even with the great tooling available, they are often subtle and hard to pin down. Introduce state mutation into the equation and you now need to worry about concurrency. There are locks and semaphores to deal with.

Is that class I used even 'thread-safe'?

Time and the order of execution between threads is no longer reliable, and the side-effects of this increase the level of complexity, of both your code and of your ability to reason about run-time state.

But they still have reliable, fast inter-process communication.

Now let's introduce asynchronous processing. Single-threaded and multi-threaded applications can be synchronous or asynchronous.

Now, within even a single thread, the order of execution is no longer reliable.

There is an excellent overview on the differences, but the take-away is that introducing multi-threading or asynchronous programming to your application can significantly increase the difficulty of reasoning about your system state as well as finding, reproducing and fixing bugs.

Yet they still have reliable, fast inter-process communication.

Now enter the distributed system, you probably see where this is going, don't you?

You are now in the world of remote procedure calls (RPC) across processes or networks. Each remote call is an order of magnitude slower, and therefore so is your application's performance.

In our case, with ServiceStack, the RPCs use a request/response message-passing style.

Just as before, RPCs can be made either synchronously or asynchronously but, given the performance of remote calls, asynchronous processing is not really a choice; it's practically a must-have.

Now we have to ask ourselves - will calls be inexplicably duplicated? Will they even be invoked at all?

We have just lost cabin pressure. Communication is no longer reliable.

Down the rabbit hole we go, this stuff is hard and it gets weird, really fast.

Debugging this kind of system, which executes piecemeal across processes and networks, just cannot be done with an IDE and access to a machine or an application's logs.

Trying to reconstruct it in a test environment is an exercise in futility.

In the distributed system, debugging must be done in production.

This realisation forces you to approach the design of the system differently from the very beginning; if, that is, you want to avoid creating the distributed equivalent of the Titanic!

So first up: logs have to be centralised so that you can reason about your system state, find errors, and trace the flow of events across each node.

Logging

Now, let's get specific shall we?

For ServiceStack, I've created a plugin on Github that builds upon the Request Logger to log to Seq.

ServiceStack.Seq.RequestLogsFeature

ServiceStack is fortunate to have some great people in its community and the plugin was quickly improved by a fellow member, thank you Richard.

Seq is an installable self-hosted service, with an HTTP API that is designed for log aggregation.

There is a readme available on the project, which I won't duplicate here, that covers setting up the plugin and using it in more detail, but I'll cover the basic code required to use it in your ServiceStack AppHost.

public override void Configure(Container container)  
{
    // Define your Seq location and add the plugin
    var settings = new SeqRequestLogsSettings("http://localhost:5341");
    Plugins.Add(new SeqRequestLogsFeature(settings));
}

That's it!

See everything!

The plugin now captures every incoming request to ServiceStack in Seq and is capable of logging every detail, not just the path but the headers, the request and response DTOs, execution timing, service exceptions and errors.

In addition, the logging detail can be modified at runtime, so when you need to debug in production, you can ramp up the level of detail logged. I'll come back to this again a few times in later posts.

So the first thing to note is that you are not storing plain text, you are storing structured data and it makes all the difference in the world.

With Seq that data is now easy to search, filter and aggregate using Seq's powerful query language and UI.

Logging in action

Having used Seq for a while now in other applications, I know how quickly it can help you identify and fix issues in your production systems. It's very easy to use and it comes with a free single-user licence. I highly recommend you try it out for yourself.

Tradeoffs:

Unless your logging receiver has high-availability (which Seq at this time does not have), we have just created our first piece of critical infrastructure as a single point of failure (SPF).

When any SPF goes down, bad things can happen, so throughout the series, I'll point these out.

Seq does however have a forwarder. This works against a local loopback address to buffer requests for forwarding onto your log server.

This helps with network unreliability by eliminating remote calls and the performance penalties that are associated with them.

An alternative is to use a UDP broadcast style of logging, like statsd. It may serve your circumstances better.

Our logging uses an async HTTP 'fire-and-forget' style, so the network performance cost is reduced, but if your service, the network or Seq fails, and you do not persist ServiceStack logging locally to disk or use a forwarder, you lose potentially valuable log data.

We could improve this plugin in the future (PRs welcome!) with more resilient, guaranteed delivery to survive network outages; but for now we are not too concerned about this.

Our rationale will become more apparent as our design is revealed.

Debugging

The second part, distributed debugging, is a much trickier problem.

In distributed systems, where there are many moving parts involved, the ability to reconstruct a timeline of events and state-mutations across your services is essential to effectively debugging and reasoning about the system's state and overall health.

Enter the correlation identifier, our next plugin on the road to microservices

ServiceStack.Request.Correlation

Again, this captures every incoming ServiceStack request and adds an id to the request header. The service gateway's internal and external calls pass this identifier on, so that you can trace service-to-service calls from their point of origin in your logs.

Correlation
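In outline, the mechanics look something like this sketch; the header name and filter below are illustrative, not the plugin's actual implementation:

```csharp
using System;
using ServiceStack;

public static class CorrelationFilter
{
    public const string HeaderName = "x-correlation-id"; // hypothetical name

    public static void Register(IAppHost appHost)
    {
        appHost.GlobalRequestFilters.Add((req, res, dto) =>
        {
            // Reuse the caller's id if one arrived, otherwise start a new
            // one, so every hop in a call chain shares the same identifier.
            var id = req.Headers[HeaderName] ?? Guid.NewGuid().ToString("N");
            req.Items[HeaderName] = id;    // available to loggers downstream
            res.AddHeader(HeaderName, id); // echoed back to the caller
        });
    }
}
```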

It is very early days for this plugin though. We need to refine it to be able to reconstruct a full map of service-to-service calls at each node, but for now, that is on our future road-map.

Being able to map out the calls is important for a couple of reasons which are worth mentioning at this stage though.

re'Curse' of the infinite loop

As your services become distributed, it is easy to create recursive calls, recursive calls, recursive calls, recu... :(

Recursive Call

Having a good timeout policy on all remote calls helps, but adding self-referencing checks to the correlation plugin could also cancel such requests outright.

Am I already in the call-map of the thing calling me?

Yes, byeee
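That check could be as simple as this sketch (the comma-separated call-map header format is hypothetical):

```csharp
using System;
using System.Linq;

public static class RecursionCheck
{
    // If this service's name already appears in the call-map carried on
    // the request, the chain has looped back on itself: cancel the call.
    public static bool IsRecursiveCall(string callMapHeader, string myServiceName)
    {
        if (string.IsNullOrEmpty(callMapHeader)) return false;
        var visited = callMapHeader.Split(','); // e.g. "orders,billing,orders"
        return visited.Contains(myServiceName);
    }
}
```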

the > never > ending > chain > of > calls > . > . >

Making remote calls easy and transparent to your services is really powerful, but it is also just as easy to abuse that power.

Long call chain

With each network hop, you increase the likelihood of timeouts, and the responsiveness of your API, and of its consumers, suffers.

Setting limits on the length of call-chains can force distributed teams to be judicious in their use of dependencies (see left-pad!) and instead foster collaboration between them, scaling horizontally and keeping the stack thin.
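A limit like that could be enforced with a simple hop counter carried alongside the correlation id; a sketch with a hypothetical header and limit:

```csharp
public static class CallChainGuard
{
    public const string DepthHeader = "x-call-depth"; // hypothetical header
    public const int MaxDepth = 5;                    // team-agreed limit

    // Increment the hop count from the incoming request; returns false
    // when the chain is already too long, so the caller can fail fast
    // instead of adding yet another hop.
    public static bool TryIncrementDepth(string incomingDepth, out int newDepth)
    {
        int.TryParse(incomingDepth, out var depth);
        newDepth = depth + 1;
        return newDepth <= MaxDepth;
    }
}
```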

Further reading

There is some great research material if you are interested in reading more on this topic.

Google has a whitepaper on Dapper, a large-scale distributed systems tracing infrastructure, and there are a few implementations out in the wild to be found.

I am also looking at Vector Clocks of which you can find a C# implementation by the brilliant @kellabyte, but the fixed length limitations of this algorithm have led me towards Interval tree clocks as another possibility.

Another task for the future, is to capture requests at a lower level, like a proxy, so that tracing can work beyond ServiceStack calls to include data stores, files and external infrastructure resources.

Finally, there are also very interesting possibilities emerging from Joyent for run-time inspection and tracing using DTrace and containers, worth keeping an eye on too.


That concludes part II, if I've missed anything, or you have your own great ideas or projects, let me know in the comments.

Also, we'd love others in the community to get involved with our plugins on GitHub so don't be shy.

:)

OK, enough of that for now, NEXT!

Let's do microservices!

next up: Service discovery, load balancing and routing

00001 : the what and the why?

I've been toying with calling this post:

ServiceStack, "talk DTO to me ;)"

...but will perhaps save that one for a t-shirt; it tickles the inner geek!

My team has recently released a number of plugins geared towards microservices for ServiceStack and has more on the way. So now is perhaps the right time to provide some context on what we are doing and why.

This is part one of a series. It covers the background and rationale behind the decision to explore microservices; in the subsequent parts, we cover the specifics of the tools, technologies and code to make it happen.

The posts are quite long and information-dense, but I fear if I unpacked all that information, it would end up as a book, which I don't have the time to write - too busy coding!

If you are interested in ServiceStack, or even just the practicalities of implementing microservices, I hope this series is of some interest or help to you, dear reader.


ServiceStack, a journey into the madness of microservices
  1. Context: the what and the why?
  2. Distributed debugging and logging
  3. Service discovery, load balancing and routing
  4. Service health, metrics and performance
  5. Configuration
  6. Documentation
  7. Versioning
  8. Security and access control
  9. Idempotency
  10. Fault-tolerance, Cascading failures
  11. Eventual consistency
  12. Caching
  13. Rate-limiting
  14. Deployment, provisioning and scaling
  15. Backups and Disaster Recovery
  16. Services Design
  17. Epilogue

Uh oh!

Now, first, the elephant in the room: we have used the term "microservices", which, I fear, has already become a dirty word in software development through overuse.

If I hear micro services one more time - Kelly Sommers

Suffice to say the gif wasn't of kittens smooching!
I'm sorry, Kelly, this post is full of them.

History is often not kind to over-hyped terms like microservices and it may end up being obscured by failed projects, poor implementations or the same vendor marketing techno-babble that SOA suffered before it is proclaimed to be dead, and Self-Contained Systems or Serverless Architectures are the new hotness!

I've read and heard many times over the past few years variations of "microservices is just describing what I always thought SOA was", and I think they're right; it is easy for these terms to become so overused that fatigue sets in and there is a backlash.

Nothing about microservices or SOA is new to me; they are just concepts that sound like the answer to every problem.

Instead of writing one large complex thing, write smaller simpler things, that do one thing well and compose them together

Both OO composition and functional composition describe this approach. In fact, the same idea is found countless times in other places, like the UNIX philosophy and the actor model, both from 1973, and its roots go even further back than that. I'll leave that exercise for the historians among you.

The take-away here is that these concepts are not new at all, so regardless of the term used, they are trying to convey concepts of composability.

Yet, I think that microservices really isn't the solution for most problems. If you have a problem and use microservices to solve it, you now have 21 problems and counting!

This is why the advice you hear often is to build a monolith first.

Only then, once it matures and you understand the problem domain and its bounded contexts, should you consider making it a distributed system by breaking it apart into smaller 'micro' services.

So what's the plan, Stan?

So broadly this post is about 'things you need to think about when building microservices', but more specifically it is about building them with ServiceStack.

I've been using ServiceStack for years and it's been running a bunch of internal business-critical APIs for us. I'm a huge fan of the DTO-first approach that is at the heart of ServiceStack's design philosophy.

So for those who aren't familiar with ServiceStack, here is an over-simplified introduction.

The premise is simple to grasp but deceptively powerful.

Your service methods take in a single parameter, the Request Data Transfer Object (DTO), which is just an implementation-free Plain Old CLR Object (POCO) class.

These DTOs are what define your services' typed API contract.

So DTOs are the means by which you send data to your services, which can optionally return DTOs in response.

Once you accept that simple but fundamental premise and start to build APIs, you will discover that it is amazingly flexible.

This is messaging.

A simple DTO can be routed to the right method depending on the HTTP Verb it is sent with.
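To give a flavour of that, here is a minimal sketch of a Request DTO and service; the `GetOrder` contract is invented for illustration, but `[Route]`, `IReturn<T>` and the `Service` base class are ServiceStack's own:

```csharp
using ServiceStack;

// The Request DTO: an implementation-free POCO that defines the contract.
// IReturn<T> declares the response type for typed clients.
[Route("/orders/{Id}")]
public class GetOrder : IReturn<GetOrderResponse>
{
    public int Id { get; set; }
}

public class GetOrderResponse
{
    public int Id { get; set; }
    public string Status { get; set; }
}

// The service: one method, one Request DTO parameter.
// Naming a method Get/Post/Put/Delete routes by HTTP verb; Any matches all.
public class OrderServices : Service
{
    public object Get(GetOrder request) =>
        new GetOrderResponse { Id = request.Id, Status = "Shipped" };
}
```

Notice there is nothing about formats or transports in the service itself; that is all handled by the framework around the DTO.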

You can send and receive XML DTOs in one app and JSON in another. Need to switch to CSV or MessagePack? No problem, it can do that too. How about asynchronously? No problem, supported out of the box.

Want to send a batch of DTOs at once? Again it will just work.

This is just the tip of the iceberg really, the DTO-first approach abstracts away details like formats, transports, persistence, caching and much, much more.

I've battle-tested it, I understand where it's coming from, and it has proved to be an excellent replacement for everything web-related that was previously prefixed "MS"; I couldn't be happier or more comfortable with that decision.

ServiceStack, however, is really only the service part. It doesn't include many of the pieces required to glue distributed systems together. That isn't a failing in any way on its part; it's simply not where it's coming from.

So, why are we trying to do mean things like implement microservices with it?

You see, creating services in ServiceStack is easy. Once you get used to the DTO-first, coarse-grained style, you find that your services are easy to extend, update, bend and reconfigure to cater for your changing needs, without your service logic having to be rewritten to support it.

Perhaps you need to experience the pain of refactoring and rolling out brittle or chatty RPC contract-based services that break consuming clients (WCF, SOAP and WebAPI, I'm looking at you!) before this approach just clicks and you finally "get it". Once you do, you're unlikely to want to go back.

Another thing I have found over the years is that API's tend to outlive applications by an order of magnitude. The well-designed API is still serving data long after the apps that it was built for are long gone.

But I digress. We are exploring microservices for two primary reasons, code complexity and scale.

Complexity

We have a mature set of monolith apps and services that operate well, but they have shared data sources and some shared dependencies in the form of NuGet-packaged libraries.

It is difficult to maintain good separation of concerns with shared libraries and shared data sources. Over time, the boundaries become blurred.

These are very real pain points in our development.

We have also reached a point where system complexity makes it challenging to reason about the current state of the system, and lastly, when releasing updates, we face lock-step deployment issues.

Moving toward a coarse-grained set of loosely coupled API services with isolated datastores and some event-sourcing is where I think we can begin to rein in these problems.

We have the following posted on our internal wiki design-notes to provide us with a constant reminder of the challenge we are facing in choosing this path.

Monolith to Microservices

Scale

Now, scale can mean different things. In this context, I am not referring to traffic scaling but to scaling development: enabling distributed development teams to collaborate on a message-driven system, where communication and development are achieved through DTO-driven contracts.

This is where ServiceStack comes in. I personally see ServiceStack more as a transport-agnostic and format-agnostic messaging platform than as a web services framework.

Oh dear, that sounds like a lot of marketing techno-babble!

I'm not saying it isn't great for serving websites and RESTful APIs, but that isn't the focus of this post.

To try to explain, then: ServiceStack might not support a nanomsg, ZeroMQ or Redis-like wire protocol at the moment, but its architecture means it could, and your core business services wouldn't have to change a single line of code. Not one line!

One example of how you can layer new functionality on top of ServiceStack's architecture is our prototype event-sourcing and CQRS plugin, delivered as a single NuGet package. It has very little API surface-area of its own, using instead what is already available in ServiceStack.

If you haven't already checked it out, it is well worth a look. It is, I think, a fantastic implementation by David Brower that distills a lot of the complexity of event-sourcing into a really natural DTO-first experience and makes it very easy to use in the right way. I'll discuss how this fits in later.

So the exact same code that consumes and produces HTTP calls can also subscribe to messages from an MQ and to events from event store streams. That, for me, is practically black magic, and it's really compelling to build systems on.

With the awesome Demis Bellot's help, we have also been able to get a really nice Service Gateway pattern into ServiceStack from v4.0.56 onwards. Although we didn't tell him at the time (shhhhh), it is exactly what we were hoping for, as it's the gateway to everything else (pardon the pun), so we were delighted when it was added to ServiceStack.

It opens up many possibilities for where we are headed.
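As a taste of why, here is a sketch of the gateway in use; the DTOs are invented for illustration, but `Gateway.Send` is ServiceStack's own API:

```csharp
using ServiceStack;

public class GetOrderStatus : IReturn<OrderStatusResponse>
{
    public int OrderId { get; set; }
}

public class OrderStatusResponse
{
    public string Status { get; set; }
}

public class CreateInvoice : IReturn<CreateInvoiceResponse>
{
    public int OrderId { get; set; }
}

public class CreateInvoiceResponse
{
    public string OrderStatus { get; set; }
}

public class InvoiceServices : Service
{
    public object Any(CreateInvoice request)
    {
        // Gateway.Send resolves to an in-process call by default, or to a
        // remote HTTP/MQ call when a custom IServiceGateway is registered,
        // so this service never knows (or cares) where Orders is hosted.
        var order = Gateway.Send(new GetOrderStatus { OrderId = request.OrderId });
        return new CreateInvoiceResponse { OrderStatus = order.Status };
    }
}
```

Because the calling code is identical either way, a service can start life inside the monolith and be split out later without touching its consumers.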

So where the heck are you heading?

Breaking systems apart (distributing them), which is really what microservices is about, introduces a lot of challenges: instead of eliminating complexity in your code, it shifts that complexity into infrastructure and operations.

It should be a conscious choice as to whether you think this trade-off is worth making, as you will encounter things like the fallacies of distributed computing, and these are often non-trivial problems. You need solid foundations on which to build, and everything is harder when you have networks sitting between your services.

Magically solving these problems for you is where the vendors will pounce, offering single-source solutions or spouting nonsense terminology that makes me chuckle. But there are people out there who hear words like hyperconverged and get so excited they just start vomiting money. Get your buzzword bingo cards out, cos we're all gonna be winners; the author of that piece really seems to like using it!

But it raises an important point, why not use an existing Enterprise Service Bus (ESB) type solution?

If that is the right fit for your circumstances, then you absolutely should, but personally, I've perhaps been burned too many times by vendors losing interest or rewriting their integration points every few years chasing new business; the prospect of coupling my systems to them is unappealing.

It isn't a coincidence that I have bug-free, dependable services running in production that I wrote ten years ago. They have no coupling to vendor-specific solutions, yet all the integration middleware I wrote for the 2007 versions was trashed in 2010 and again for 2013. I'll leave it for you to guess who that was for.

The system you write is the system you understand, debug, maintain and control.

Ultimately, I want systems where the infrastructure is composed, just like my services are.

When something better in any given area of infrastructure comes along, I want to be able to plug it in, without rewriting that whole system or trashing my business logic in 'the big rewrite'.

These, for me, are core business systems with lifespans like those of programming languages: lasting decades if designed and maintained well, long after the vendor solutions are gone, or you are forced through multiple painful upgrades because OS upgrades don't support the older versions.

No really, where ARE you heading?

OK, so to recap: microservices are bad, really hard, not the solution; OK, you just don't do them...

..

We're doing microservices :)

... or at least a form of them that makes sense to us and doesn't stray far from the DTO-first ServiceStack philosophy.

If we want to fall into the pit of success though, there are a lot of things we need to get right. The glue that binds it all together into a coherent whole is hard to get right and involves a long list of non-trivial things you need to consider:

  • Distributed debugging and logging
  • Service discovery, load balancing and routing
  • Service health, metrics and performance
  • Configuration
  • Documentation
  • Versioning
  • Security and access control
  • Idempotency
  • Fault-tolerance, Cascading failures
  • Eventual consistency
  • Caching
  • Rate-limiting
  • Deployment, provisioning and scaling
  • Backups and Disaster Recovery
  • Security and access control (twice cos it's really important)

There are probably a few missing from that list too, but the number of things required to make it all click is why there is no step-by-step guide to properly implementing microservices anywhere.

Every article, talk and presentation that discusses microservices touches on this complexity, but the variety of options means that it often ends up a lot like 'drawing the owl':
Implementing microservices

Remember, don't do microservices.

Still with me?

In each part, we will cover a topic on the list in more depth, as well as discussing how we arrived at the design decisions we made, the tradeoffs we are making and our progress so far.

So, let's do microservices!

next up: Distributed debugging and logging