00010 : Logging and Debugging

Foreword:

This post on microservices started innocently as a single post. They always do, don't they?

Before I knew it, it was 10k words, showing no sign of stopping, and I was advised to split it up into a series. "Nah," I said, "it'll be fine."

I was then advised again to split it into at least two: the first few thousand words in one post and the rest in another. I relented this time.

Perhaps it is just the nature of software, or perhaps it is 'scope-creep' (we've all been there, right?), but I've only gone and written a monolithic microservices post!

I see the irony here, so I've decided to break it apart into 'micro-posts'; thanks @adamralph :)

Each 'micro-topic' will become a 'micro-post' of its own in our journey; so my apologies to those who wanted the full meal: your dinner will now be served as Tapas!


ServiceStack, a journey into the madness of microservices
  1. Context: the what and the why?
  2. Distributed debugging and logging
  3. Service discovery, load balancing and routing
  4. Service health, metrics and performance
  5. Configuration
  6. Documentation
  7. Versioning
  8. Security and access control
  9. Idempotency
  10. Fault-tolerance, Cascading failures
  11. Eventual consistency
  12. Caching
  13. Rate-limiting
  14. Deployment, provisioning and scaling
  15. Backups and Disaster Recovery
  16. Services Design
  17. Epilogue

Framing the problem

In a monolith world with all the great modern tooling available to you, it is easy to load a project up in your IDE, set some breakpoints, and hit 'Debug'.

You can step through your code line-by-line and inspect its state.

Or you can write integration tests to assess the state of a system and to verify that the components of the system are interacting correctly with each other.

The single-threaded process has rock-solid reliability and low latency for its in-process communication.

If you call a method on a class, you aren't concerned about whether the call will reach that method, or whether it will take an unacceptable amount of time to get there.

Even at the boundaries of a process, things are often binary. If the database backing the application or a file resource is unavailable, the application will crash or display an error.

You can check the application logs, the system event log, a core dump or WinDbg-style listeners to help you find, reproduce and fix the problem.

It's all in one place.

Next comes the multi-threaded process, to which nearly all of the above also applies.

How many of you have experienced race-conditions in one of your multi-threaded applications?

I know I have.

Even with the great tooling available, they are often subtle and hard to pin down. Introduce state mutation into the equation and you now need to worry about concurrency. There are locks and semaphores to deal with.

Is that class I used even 'thread-safe'?

Time and the order of execution between threads are no longer reliable, and the side-effects of this increase the complexity of both your code and your ability to reason about run-time state.

But they still have reliable, fast in-process communication.

Now let's introduce asynchronous processing. Single-threaded and multi-threaded applications can be synchronous or asynchronous.

Now, within even a single thread, the order of execution is no longer reliable.

There is an excellent overview on the differences, but the take-away is that introducing multi-threading or asynchronous programming to your application can significantly increase the difficulty of reasoning about your system state as well as finding, reproducing and fixing bugs.

Yet they still have reliable, fast in-process communication.

Now enter the distributed system, you probably see where this is going, don't you?

You are now in the world of remote procedure calls (RPC) across processes and networks. Each remote call is an order of magnitude slower, and your application's performance suffers with it.

In our case, with ServiceStack, the RPCs use a request/response message-passing style.

Just as before, RPCs can be made either synchronously or asynchronously, but given the cost of remote calls, asynchronous processing is not really a choice; it's practically a must-have.
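To make that concrete, here is a minimal sketch of the request/response style using a ServiceStack typed client. The DTOs and the service URL are hypothetical, purely for illustration.

using System.Threading.Tasks;
using ServiceStack;

// hypothetical request/response DTOs, for illustration only
public class GetOrderStatus : IReturn<GetOrderStatusResponse>
{
    public int OrderId { get; set; }
}

public class GetOrderStatusResponse
{
    public string Status { get; set; }
}

public class OrdersClientExample
{
    public async Task<string> GetStatusAsync()
    {
        // every call now crosses a process or network boundary,
        // so we await it rather than block a thread waiting on the wire
        var client = new JsonServiceClient("http://orders.example.local:8080");
        var response = await client.SendAsync(new GetOrderStatus { OrderId = 42 });
        return response.Status;
    }
}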

Now we have to ask ourselves - will calls be inexplicably duplicated? Will they even be invoked at all?

We have just lost cabin pressure. Communication is no longer reliable.

Down the rabbit hole we go, this stuff is hard and it gets weird, really fast.

Debugging this kind of system, which executes piecemeal across processes and networks, just cannot be done with an IDE and access to a single machine or its application logs.

Trying to reconstruct it in a test environment is an exercise in futility.

In the distributed system, debugging must be done in production.

This realisation forces you to approach the design of the system differently from the very beginning; if you want to avoid creating the distributed equivalent of the Titanic, that is!

So first up, logs have to be centralised so that you can reason about your system state, find errors and trace the flow of events across each node.

Logging

Now, let's get specific, shall we?

For ServiceStack, I've created a plugin on Github that builds upon the Request Logger to log to Seq.

ServiceStack.Seq.RequestLogsFeature

ServiceStack is fortunate to have some great people in its community and the plugin was quickly improved by a fellow member, thank you Richard.

Seq is an installable self-hosted service, with an HTTP API that is designed for log aggregation.

There is a readme on the project, which I won't duplicate here, covering setup and usage of the plugin in more detail, but I'll cover the basic code required to use it in your ServiceStack AppHost.

public override void Configure(Container container)
{
    // Define your Seq location and add the plugin
    var settings = new SeqRequestLogsSettings("http://localhost:5341");
    Plugins.Add(new SeqRequestLogsFeature(settings));
}

That's it!

See everything!

The plugin now captures every incoming request to ServiceStack in Seq and can log every detail: not just the path, but the headers, the request and response DTOs, execution timings, service exceptions and errors.

In addition, the logging detail can be modified at runtime, so when you need to debug in production, you can ramp up the level of detail logged. I'll come back to this again a few times in later posts.

So the first thing to note is that you are not storing plain text; you are storing structured data, and that makes all the difference in the world.

With Seq, that data is now easy to search, filter and aggregate using its powerful query language and UI.

Logging in action

Having used Seq for a while now in other applications, I know how quickly it can help you identify and fix issues in your production systems. It's very easy to use and it comes with a free single-user licence. I highly recommend you try it out for yourself.

Tradeoffs:

Unless your logging receiver has high availability (which Seq, at this time, does not), we have just created our first piece of critical infrastructure that is a single point of failure (SPF).

When any SPF goes down, bad things can happen, so throughout the series, I'll point these out.

Seq does, however, have a forwarder, which runs against a local loopback address and buffers requests before forwarding them on to your log server.

This helps with network unreliability by eliminating remote calls and the performance penalties that are associated with them.
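If you do run the forwarder, the change is simply to point the SeqRequestLogsSettings at the local forwarder endpoint instead of the remote Seq server; the port below is illustrative, use whatever your forwarder actually listens on.

// log to the local Seq forwarder on loopback; it buffers entries and relays
// them to the real Seq server, so a network blip doesn't cost us log data
var settings = new SeqRequestLogsSettings("http://localhost:15341");
Plugins.Add(new SeqRequestLogsFeature(settings));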

An alternative is to use a UDP broadcast style of logging, like statsd. It may serve your circumstances better.

Our logging uses an asynchronous HTTP 'fire-and-forget' style, so the network performance cost is reduced; but if your service, the network or Seq fails and you do not persist ServiceStack logging locally to disk or use a forwarder, you can lose potentially valuable log data.

We could improve this plugin in future (PRs welcome!) to provide more resilient, guaranteed delivery that survives network outages; but for now we are not too concerned about this.
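To illustrate the kind of improvement we have in mind (a rough sketch only, not how the plugin works today), a more resilient sender might fall back to a local buffer file whenever the remote post fails, so that entries can be replayed later.

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public class BufferedLogSender
{
    private static readonly HttpClient Http = new HttpClient();

    public async Task SendAsync(string seqUrl, string jsonPayload)
    {
        try
        {
            var content = new StringContent(jsonPayload, Encoding.UTF8, "application/json");
            var response = await Http.PostAsync(seqUrl, content);
            response.EnsureSuccessStatusCode(); // treat a non-2xx as a failure too
        }
        catch (Exception)
        {
            // Seq or the network is down: buffer locally instead of losing the entry,
            // and replay the file contents once the endpoint is reachable again
            File.AppendAllText("seq-buffer.jsonl", jsonPayload + Environment.NewLine);
        }
    }
}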

Our rationale will become more apparent as our design is revealed.

Debugging

The second part, distributed debugging, is a much trickier problem.

In distributed systems, where there are many moving parts involved, the ability to reconstruct a timeline of events and state-mutations across your services is essential to debugging effectively and reasoning about the system's state and overall health.

Enter the correlation identifier, our next plugin on the road to microservices.

ServiceStack.Request.Correlation

Again, this captures every incoming ServiceStack request and adds an Id to the request headers. The service gateway's internal and external calls pass this identifier on, so that you can trace service-to-service calls back to their point of origin in your logs.

Correlation
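The idea is simple. Here is a minimal sketch of it as a global request filter in the AppHost (not the plugin's actual implementation; the 'x-correlation-id' header name is just an illustrative choice).

public override void Configure(Container container)
{
    GlobalRequestFilters.Add((req, res, dto) =>
    {
        // reuse the caller's correlation id if one was supplied, otherwise mint one
        var correlationId = req.Headers["x-correlation-id"]
                            ?? Guid.NewGuid().ToString("N");

        // stash it for services and loggers to use, and echo it back to the caller
        req.Items["CorrelationId"] = correlationId;
        res.AddHeader("x-correlation-id", correlationId);
    });
}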

It is very early days for this plugin, though. We need to refine it so that it can reconstruct a full map of service-to-service calls at each node, but for now that is on our future roadmap.

Being able to map out the calls is important for a couple of reasons that are worth mentioning at this stage.

re'Curse' of the infinite loop

As your services become distributed, it is easy to create recursive calls, recursive calls, recursive calls, recu... :(

Recursive Call

Having a good timeout policy on all remote calls helps, but adding self-referencing checks to the correlation plugin would let us cancel such requests outright.

Am I already in the call-map of the thing calling me?

Yes, byeee
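A hypothetical guard might look something like the following filter, registered alongside the correlation filter; the 'x-call-path' header format and the service name are made up for the sake of the sketch, and this is not in the plugin today.

GlobalRequestFilters.Add((req, res, dto) =>
{
    // each hop appends its own name to the call-path header it forwards on
    var callPath = (req.Headers["x-call-path"] ?? string.Empty)
        .Split(new[] { '>' }, StringSplitOptions.RemoveEmptyEntries);

    // am I already in the call-map of the thing calling me?
    if (Array.IndexOf(callPath, "orders-service") >= 0)
    {
        res.StatusCode = 508;   // Loop Detected
        res.EndRequest();       // yes, byeee
    }
});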

the > never > ending > chain > of > calls > . > . >

Making remote calls easy and transparent to your services is really powerful, but it is also just as easy to abuse that power.

Long call chain

With each network hop, you increase the likelihood of timeouts, and the responsiveness of your APIs, and of their consumers, suffers.

Setting limits on the length of call-chains can force distributed teams to be judicious in their use of dependencies (see left-pad!) and instead foster collaboration between them, so that the system scales horizontally and the stack stays thin.
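Again purely as a sketch, a depth limit is just a counter that each hop forwards, increments and checks; the header name and the limit below are illustrative, not something the correlation plugin defines today.

GlobalRequestFilters.Add((req, res, dto) =>
{
    const int MaxCallDepth = 5;

    int depth;
    int.TryParse(req.Headers["x-call-depth"], out depth);

    // refuse to extend an already-too-long chain rather than let timeouts stack up
    // (an outgoing gateway call would forward this header incremented by one)
    if (depth >= MaxCallDepth)
    {
        res.StatusCode = 400;
        res.EndRequest();
    }
});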

Further reading

There is some great research material if you are interested in reading more on this topic.

Google has a whitepaper on Dapper, a large-scale distributed systems tracing infrastructure, and there are a few implementations out in the wild to be found.

I am also looking at Vector Clocks of which you can find a C# implementation by the brilliant @kellabyte, but the fixed length limitations of this algorithm have led me towards Interval tree clocks as another possibility.

Another task for the future is to capture requests at a lower level, like a proxy, so that tracing can work beyond ServiceStack calls to include data stores, files and external infrastructure resources.

Finally, there are also very interesting possibilities emerging from Joyent for run-time inspection and tracing using DTrace and containers, which are worth keeping an eye on too.


That concludes part II. If I've missed anything, or you have your own great ideas or projects, let me know in the comments.

Also, we'd love others in the community to get involved with our plugins on GitHub, so don't be shy.

:)

OK, enough of that for now, NEXT!

Let's do microservices!

next up: Service discovery, load balancing and routing

How to load debug symbols in Visual Studio during debugging

If you've been working in Visual Studio for any length of time then you'll probably be familiar with the following message.

No Symbols Loaded

You've set a breakpoint in code from a 3rd-party dependency for which the debug symbol (.pdb) file has not been loaded.

How can you tell which symbols have been loaded? Easy: it's in the Modules window, found in the Debug menu.

Modules view

Once you launch the window, you can then view all the loaded modules and whether their symbols are loaded.

To load the symbols for your module, just right click on the module and select Load Symbols as below.

Load symbol

ode to the next guy

The villain

So today I'm working on a project that I have checked out but never run on my desktop before. I am primarily interested in finding a trouble spot that is spewing out a lot of exceptions and making it hard to separate the actual problems from the noise.

I open the project and, armed with a rather vague exception stacktrace, go a-wandering through the codebase in search of my foe. After a couple of minutes of squinting I am no nearer the jackpot, so I decide to run and debug the tests.

When is green, not green?

Our CI server builds and runs these tests all the time, and the builds are currently green, so I know this won't reveal much. The same goes for the integration tests, which we normally disable locally but run on a nightly CI build.

So I run the unit tests and of course they all pass.

hmm, lots of tests failed, and nothing to do with my foe. They failed as I do not have the magic setup on my machine that the project expects in order to have its jollies.

more hat-swop than hot-swap :(

We sometimes keep scripts in the repo root to do various setup bits, but I forget to look there. The Partridge 'AHA!' moment comes eventually once I see them. So, like a good citizen, the first change I make is to add a git repo readme that references the script, so the next guy doesn't just have to know.

Now I run the script and all is well

Console flash

...wait... it did something, but I can't read at console buffer-flush speed and now it's closed. I think I saw an error message.

So I open, read and mentally parse the batch script. It launches a PowerShell script to configure some stuff. Nothing terribly taxing, but it doesn't help that it errored and I don't know why.

My next task as a good citizen is to add a pause at the end of the script. You know the one: the 'press any key to exit'. Now the next guy doesn't need an eidetic memory.

I run the script again and now there are no errors. I guess that the error the first time was expected, and proceed.

So! I'm done with the setup and I run the tests again.

Of course this time they all pass

damn, what now!

ok, I dig into the failing tests and rediscover that the project needs more magic installed that isn't covered by the setup script. I knew this already, as I'm familiar with the project. I know lots of things, and I forget them too.

To deploy or not to deploy...

I'm wondering how our deployment server pushes this thing out to servers; I don't recall it needing all this ceremony around setup, so I go off to check it out.

I'm opening, reading and parsing PowerShell steps in Octopus Deploy.
Of course! There it is in the project deployment steps: the additional magic.

why doesn't the code just set up all this magic for me?

High maintenance honey badger...

Honey Badger

Back to the good-citizen routine: I update the readme with my magic gems, and open and read the config files to get the values I need. You know, for the next guy.

I install and configure more magic.

Back to update the readme as I manually create the magic that it requires.

finally I run the tests again and they all pass

oh for f#@%* sake, what now!
I'm not running Visual Studio as an administrator.

back to update the readme

I run the tests and they pass

ahhh!

except; I need the integration tests, and they don't run locally. Those are the tests I really need to run to recreate the application conditions that are generating the exceptions. The exact change I need to make to enable them is in the test output; thank you, previous guy!

I change the config option to enable local running of the integration tests.

of course now they all pass

whhhhhhhhhhhhhhhhhhhyyyyyyyyyyyyy!
I read the test output and the code, and find the integration tests require a whole bunch more magic installed and configured. I am now beyond the realm of our deployment; this stuff is only ever configured, not created, during deployment.

back to updating the readme...

more setup

Finally, after an hour of setting up magic, I run the integration tests and they all pass. I can now get to the actual code to find and fix the issue I'm interested in.

epilogging

I start out with a simple task: find and fix something in the code. I end up, several hours later, having created a readme with a getting-started guide and improved the setup, all before I actually get to do the thing I wanted. Hopefully the next guy will not have to deal with this.

If any of this is familiar to you, perhaps you should find a new machine, spin up a dev VM, or even ask the dev sitting next to you to set up and run one of your projects from scratch, especially if it isn't your current project. Getting up and running on a project from scratch should be easy, or have instructions for anything that isn't, because... you know...

for the next guy!