Monday 7 May 2012

Stuff that works - Performance Driven Development


If you have read any of my other posts, I will admit they have been a "tad negative", but with good reason. I feel the software industry needs a good kick up the backside if it's ever to become a real profession.

To that end, here is a positive post: a description of a development philosophy that I feel works for nearly all enterprise-level applications. There are no frameworks, no toolkits, no fancy technologies, no documents; it's simply an idea called "Performance Driven Development". Implement it as you wish.

So what is Performance Driven Development all about?
Well, it's the missing link between what you think is going on and what is actually going on.
Things like BDD, TDD and continuous integration have their place, but they don't address how the system is really performing during development and in production.

You can have 100% code coverage and all the continuous integration you like, but that is no guarantee that the system works as expected. In fact, I feel mystical green builds give developers an unwarranted feeling of confidence in the system they are developing.

"The build's green... it must work!" - Failed project #83571

This is where Performance Driven Development comes in; its goal is to help developers show everyone the heart and soul of what they are actually creating. It lets developers show off the fancy internal systems they have built, and as a handy side effect it also forces them to take a good hard look in the mirror at their code and themselves. When other people can easily see and judge how your creation is really working, it forces the developer to produce better quality. Pride is a big motivator.

Key points:
- PDD is a common-sense, iterative development approach
- Developers have to create Key Performance Indicator (KPI) dashboards
- These KPIs are created and updated continually throughout development, not as an afterthought
- These KPI dashboards are always available in real time
- It's about delivering real requirements with better quality and certainty
- The same performance information is available in both development and production systems
- It helps both testers and developers build and support the system
- It's NOT about micro-optimisation
- It's NOT about performance counters or WMI objects


A real-world example of this technique is building backend services.
These services could be hosted in IIS or run as Windows services, serving requests and performing batch-processing tasks. Pretty much all enterprise systems use them in some form or other.

The problem is that these services tend to be black boxes, with all their internals hidden from view.
So how do we look into the soul of a service?

How do we ask questions like:
- How is the service running at the moment?
- What are the last 10 things it did? How long did they take?
- Are its caches working as expected?
- Is it running slow? What's the slowest operation?
- Are the inputs as expected?
- What are the details of the items in the cache? Are they as expected?
- Is the cache size as expected? What was the cache size over the last two hours?
- Is database access speed as expected?
- Who are the top ten users?
- What are the top 10 widgets accessed today?

In the past many developers would add perfmon counters and WMI queries in an attempt to show what's going on inside a service. However, in my experience these values are generally useless during development and not much better in production. They are used more for production alerts than for the general running of the service, and they are hidden away from everyone except system admin types.
In short, WMI and perfmon counters are crap for development and limited in production.

What we need to do is create a window into the soul of this black box; a window that is accessible by developers, testers, support and system admins straight from their own machines (even in production). A simple way to achieve this is to create a performance KPI dashboard as a web page inside your service. This is straightforward for an IIS-hosted service; for a Windows service, however, you need to host your own HTTP handler inside the existing service. This takes about 100 lines of code, or you can use a simple embedded web server such as NancyFx (nancyfx.org) or Kayak (kayakhttp.com). A rough sketch of the hand-rolled approach is shown below.
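As a minimal sketch (not the only way to do it), here is roughly what hosting a read-only KPI page inside an existing Windows service could look like using System.Net.HttpListener from the .NET base class library. The port, the /kpi/ prefix and the RenderWidgetCachePage method are illustrative placeholders, not part of any prescribed API.

    // Minimal sketch: a read-only KPI page hosted inside an existing Windows
    // service using System.Net.HttpListener. The address and
    // RenderWidgetCachePage() are illustrative placeholders.
    using System.Net;
    using System.Text;
    using System.Threading;

    public class KpiDashboardHost
    {
        private readonly HttpListener _listener = new HttpListener();

        public void Start()
        {
            _listener.Prefixes.Add("http://+:8081/kpi/"); // hypothetical address
            _listener.Start();
            new Thread(Listen) { IsBackground = true }.Start();
        }

        public void Stop()
        {
            _listener.Stop();
        }

        private void Listen()
        {
            while (_listener.IsListening)
            {
                HttpListenerContext context = _listener.GetContext();
                byte[] page = Encoding.UTF8.GetBytes(RenderWidgetCachePage());
                context.Response.ContentType = "text/html";
                context.Response.OutputStream.Write(page, 0, page.Length);
                context.Response.Close();
            }
        }

        private string RenderWidgetCachePage()
        {
            // In a real service this reads the live KPI values; here it is a stub.
            return "<html><body><h1>Widget Cache KPIs</h1></body></html>";
        }
    }

Start() would be called from the service's OnStart; error handling and URL ACL registration are left out to keep the sketch short.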

So now we have all the power of HTML to display performance data. Using tables, links, CSS, etc. you can build drill-down functionality and even show charts of KPIs using something like Google Charts (http://developers.google.com/chart/). There are lots of possibilities.

The goal is to allow access to these dashboards in real time, in development or production, via simple HTTP addresses. All the data is read-only and easily viewed with any browser.

e.g.
http://23.23.23.23/kpi/widgetcache
http://23.23.23.23/kpi/processedcache



Ok, so what's the big deal? A simple web dashboard is not exactly groundbreaking.

This is where the Performance Driven Development KPI idea comes in. As you build your service, you think about what would be useful to see on the dashboard to help developers and testers during development, as well as support and sysadmins in production. You then build into the system ways to measure these values and add them to the dashboard as you go. The key point is that you don't add them at the end of development as an afterthought; instead you continually improve the KPI dashboards as you build the application.

The goal is to use the dashboard heavily as a debugging tool during development. You can have multiple dashboards, broken up as you see fit, e.g. each cache could have its own dashboard.

For example, say you have a cache that stores widgets loaded from a database. When a widget is loaded from the database and stored in the cache, you simply wrap it in an object that not only stores the widget but also continually updates KPI values such as (a rough sketch follows the list):
- Time it took to load the widget from the database (milliseconds)
- Create time
- Last accessed time
- Total access count
- etc.
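As a minimal sketch of such a wrapper (all the type and property names here are mine, purely for illustration):

    // Illustrative wrapper that stores a cached widget together with its KPIs.
    using System;
    using System.Threading;

    public class Widget                      // stand-in for your real domain type
    {
        public string Name { get; set; }
    }

    public class CachedWidget
    {
        private long _accessCount;

        public CachedWidget(Widget widget, TimeSpan loadTime)
        {
            Widget = widget;
            LoadTime = loadTime;             // how long the database load took
            CreateTime = DateTime.UtcNow;    // when the item entered the cache
            LastAccessed = CreateTime;
        }

        public Widget Widget { get; private set; }
        public TimeSpan LoadTime { get; private set; }
        public DateTime CreateTime { get; private set; }
        public DateTime LastAccessed { get; private set; }
        public long AccessCount { get { return Interlocked.Read(ref _accessCount); } }

        // Called by the cache on every read so the KPIs stay current.
        public Widget Access()
        {
            Interlocked.Increment(ref _accessCount);
            LastAccessed = DateTime.UtcNow;
            return Widget;
        }
    }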

The WidgetCache itself could have KPIs such as creation time, total widgets, last cache drop, total cache items added and deleted, etc. The cache manager also exposes read-only properties for things such as all cached widgets, the top 10 accessed widgets and so on. You create these as you see fit, continually improving them as you go along; a rough sketch is shown below.
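Again, a minimal sketch of what the cache side could look like, assuming the CachedWidget wrapper above (the names and the choice of ConcurrentDictionary are just illustrative):

    // Illustrative cache exposing read-only KPI properties for the dashboard.
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;

    public class WidgetCache
    {
        private readonly ConcurrentDictionary<string, CachedWidget> _items =
            new ConcurrentDictionary<string, CachedWidget>();

        public WidgetCache()
        {
            CreationTime = DateTime.UtcNow;
        }

        // KPI values the dashboard reads.
        public DateTime CreationTime { get; private set; }
        public DateTime? LastCacheDrop { get; private set; }
        public long TotalItemsAdded { get; private set; }
        public long TotalItemsDeleted { get; private set; }
        public int TotalWidgets { get { return _items.Count; } }

        public IEnumerable<CachedWidget> AllCachedWidgets
        {
            get { return _items.Values; }
        }

        public IEnumerable<CachedWidget> Top10AccessedWidgets
        {
            get { return _items.Values.OrderByDescending(w => w.AccessCount).Take(10); }
        }

        public void Add(string key, CachedWidget widget)
        {
            _items[key] = widget;
            TotalItemsAdded++;
        }

        public void Drop()
        {
            TotalItemsDeleted += _items.Count;
            _items.Clear();
            LastCacheDrop = DateTime.UtcNow;
        }
    }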

Then you simply give the widget dashboard access to these KPIs and display the basic cache values as you see fit. You could have a grid displaying the top 10 accessed widgets with columns for name, create time, access count, load time, etc., and a chart showing cache size over time. It's up to you to figure out what is useful. To view the KPI dashboard you simply refresh the web page to see the latest values.
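As a rough example of the display side (the layout is entirely up to you), the dashboard's HTTP handler could build that grid from the KPI properties sketched above:

    // Illustrative rendering of the top 10 accessed widgets as an HTML table.
    // In the earlier sketch, this string would be returned by RenderWidgetCachePage().
    using System.Text;

    public static class WidgetDashboard
    {
        public static string RenderTop10(WidgetCache cache)
        {
            var html = new StringBuilder();
            html.Append("<h2>Top 10 accessed widgets</h2>");
            html.Append("<table border='1'>");
            html.Append("<tr><th>Name</th><th>Create time</th><th>Access count</th><th>Load time (ms)</th></tr>");
            foreach (var item in cache.Top10AccessedWidgets)
            {
                html.AppendFormat("<tr><td>{0}</td><td>{1:u}</td><td>{2}</td><td>{3:0}</td></tr>",
                    item.Widget.Name, item.CreateTime, item.AccessCount, item.LoadTime.TotalMilliseconds);
            }
            html.Append("</table>");
            return html.ToString();
        }
    }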

There is a slight overhead in calculating these values, but once you start using this technique it lets you catch bugs well before you get to production. Things like low cache hits, empty caches and long cache load times for certain widgets will stick out like a sore thumb. This can point you in directions that help you find issues you never would have seen without the dashboard. For more detail you can add drill-down links on a widget that display a new dashboard with the full details of a particular widget in the cache.
The possibilities are wide open, but keep it as simple as you can. These dashboards are for internal use only; they should not be accessed by real users.


PDD Dashboard example created for a financial modelling service  

The above example dashboard shows the soul of a very complex modelling cache running on a large enterprise system. You can drill down into each instrument to show a detailed Google chart of the modelled values in real time. The great thing is that both testers and developers use this screen during development and in production.



Creating these simple PDD dashboards doesn't really add to development time; in fact, it reduces overall development time because it quickly exposes errors during development, reducing code rewrites and hard-to-find production bugs. It also simplifies handing code over to other developers, as it gives them a visual display of the inner workings of complex code and structures.

Simply put, Performance Driven Development produces better-quality code in a shorter timeframe by empowering developers to show off their skills for all to see.

Give it a go!


GCS Wiz




4 comments:

  1. I remember a discussion with the high priests when this "very complex modelling cache" was in the code base, revolving around the question "let's think about it: do we actually need this? Do we actually need real-time modelling?". Do your PDD KPIs help with situations like that? ;-)

  2. Hi Khash, it's definitely better to stop it at the source if possible :)
    However, PDD is what made it work in the end. Once we had the KPI dashboard we could easily see the bottlenecks and show the non-tech people what was going on. It was then much easier to reason with them and discuss what was and wasn't possible via the PDD dashboard.

    Remember, the dashboard development was iterative, so it quickly showed the issues that existed as we built the system. It managed to force a change in thinking by the high priests. That's a big win for everyone.

  3. From an engineer-on-the-ground perspective, the issue tends to be the K in KPI. You just want performance indicators. You can't necessarily tell what is key until you see all the things you think may be relevant. This is a great post; it describes what I've been shooting at for a long time!

  4. Hi @Gesh, thanks for your comment. The K in KPI is where the skill of the developer and their team comes in. The developer uses their own experience, plus other people's input and feedback from seeing the dashboards, to create better and better performance indicators. This iterative approach leads you towards the holy grail of Key Performance Indicators. Start with basic things like I described and add more as the system develops. It makes everyone involved think differently about the task at hand: not just code, but the actual real performance of the system.

    Possible outcomes of this approach:
    Developer #57 - "Hmm, I can't get the KPI I want. Why? Hmm... the system doesn't interact with system Z as we expected. We need a rethink of our approach here!"

    The great thing is this happened before it even got to testing. The developers flagged it themselves. Massive Win!
