Why would I use this rather than tools like jMeter or jUnitPerf?
There are a variety of performance testing tools available in the Java world, and some are more useful than others for different tasks, but we have run into a number of limitations. We have found that neither load testing tools nor micro-benchmarking tools are designed with generalized performance benchmarking in mind. To be fair, they do not intend to be generalized benchmarking tools, which is fine. A benchmark is not necessarily a load test, a micro-benchmark test, or any kind of test at all. Load testing tools like jMeter take a client-server, black-box approach, driven by the model of load testing the server side of web apps. Micro-benchmarking tools like jUnitPerf take a test-method approach, driven by the annotation-based mechanism that the unit test framework uses. This forces the performance test to include set-up and tear-down times, since it does not allow special set-up hooks for the benchmark that are excluded from the benchmark timings.
JUnitPerf has the following known limitations:
The elapsed time measured by a TimedTest decorating a single testXXX() method of a TestCase includes the total time of the setUp(), testXXX(), and tearDown() methods, as this is the granularity offered by decorating any Test instance. The expected elapsed time measurements should be adjusted accordingly to account for the set-up and tear-down costs of the decorated test. You can check out some examples of jUnitPerf style tests here: http://clarkware.com/software/JUnitPerf.html#howtouse
What is cool about speedy mcbenchmark?
- run a benchmark and report its performance
- declare a constraint that an operation must complete within a certain amount of time
- run a benchmark a given number of times in order to smooth out noise in individual runs
- compare the performance of n different concrete implementations to see which is fastest
- declare a constraint that a particular implementation is the fastest among a group of n concrete implementations
- observe how an algorithm scales relative to its theoretical "Big O" complexity and see where the scaling function breaks down
- more cool stuff in the backlog at the end of this doc - go check it out!

How do I impress my friends with the sweet nectar of benchmarking goodness?
Let's look at different benchmarking scenarios from the code base that this benchmarking project was extracted out of. These examples come from our development of a very fast K-Way Merge algorithm to serve at the core of our discrete event simulator for trading systems. The main problem in the simulator portion of a discrete event simulation environment is to take K collections of elements that are arranged temporally (in our case, this is historical time series data from stock, commodity, and currency markets) and perform a merge so that the simulator can emit the events from all the time series in the correct order. http://www.site.uottawa.ca/~nat/Courses/DFS-Course/DFS-Lecture-9/tsld013.htm
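As background, the core idea of a k-way merge can be sketched in a few lines. This is an illustrative stand-in, not the project's KWayMerge class:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative k-way merge (not the project's KWayMerge): keep one enumerator
// per time series, ordered by its current head timestamp, and repeatedly emit
// the smallest head. A tie-breaker id keeps duplicate timestamps distinct.
public static class KWayMergeSketch
{
    public static IEnumerable<long> Merge(IEnumerable<IEnumerable<long>> series)
    {
        var heads = new SortedDictionary<Tuple<long, int>, IEnumerator<long>>();
        var id = 0;
        foreach (var s in series)
        {
            var e = s.GetEnumerator();
            if (e.MoveNext()) heads.Add(Tuple.Create(e.Current, id++), e);
        }
        while (heads.Count > 0)
        {
            var min = heads.First();          // smallest (timestamp, id) pair
            heads.Remove(min.Key);
            yield return min.Key.Item1;       // emit the event's timestamp
            if (min.Value.MoveNext())         // re-insert this series' next head
                heads.Add(Tuple.Create(min.Value.Current, id++), min.Value);
        }
    }
}
```

With K series of length N each, every emit costs O(log K) to re-insert a head, which is what makes this worth benchmarking against simpler "concatenate and sort" approaches.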
You want to run a benchmark and report the timings.
public class PriorityQueueImplementationComparison
{
    private PriorityQueue queue;

    private const int YEARS = 30;
    private const int TRADING_DAYS_PER_YEAR = 265;
    private const int INSTRUMENTS = 10000;
    private const int TIMES = 1;

    [Test]
    public void EnumerateDataStructure()
    {
        var events = new Collection();
        var count = 0;
        // ... benchmark scenario elided ...
    }

    public void KWayMerge()
    {
        queue = new KWayMerge(FakeInstruments.produceNSeries(
            INSTRUMENTS, YEARS * TRADING_DAYS_PER_YEAR));
    }
}
You can call ReportAggregatedTimings or ReportIndividualTimings - aggregated timings aggregates the timings for all runs of a benchmark, whereas individual timings reports the timings for each individual run.

Notice that EnumerateDataStructure is the name of the method in which I have created a benchmark scenario. Reports and test failures include the name of the benchmark which, if you do not specify it in the constructor of Benchmark that takes a string name, will default to the last method up the call stack - the name of your test. Create a well named test to compose your benchmarks in. If you prefer, you can just construct the benchmark with a lambda and avoid creating a named method for your benchmark.

The beforeEachRun() method is a hook that allows you to run a code block before each run of the benchmark. This is where you should prepare data sets and do any other work that is required for each benchmark run, but that you do not want to include in the timings for each benchmark run.

The times() extension method on the int type is a higher order loop generator function. If you are curious about it, you can find more in the section on looping constructs at the end of this doc.

You have a benchmark that must complete within a specified time limit.
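The pieces of such a constraint that need explaining are the time-unit extension methods. The following is a sketch (assumed, not the project's exact code) of how arguments like 2.seconds() can be built, with the resulting DSL usage shown in a comment:

```csharp
using System;

// Sketch of extension methods on int for expressing durations; each simply
// builds a TimeSpan, which is what a constraint like complete.within() consumes.
public static class TimeUnitExtensions
{
    public static TimeSpan milliseconds(this int n) { return TimeSpan.FromMilliseconds(n); }
    public static TimeSpan seconds(this int n) { return TimeSpan.FromSeconds(n); }
    public static TimeSpan minutes(this int n) { return TimeSpan.FromMinutes(n); }
}

// which lets a constrained benchmark read as, for example:
//   new Benchmark(KWayMerge).whenRun(TIMES.times()).Should(complete.within(2.seconds()));
```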
Here we are introducing the Should method, which takes a constraint and applies it to the Benchmark results after the run finishes. Note the argument to the complete.within() constraint: we have added extension methods on the int type for milliseconds, seconds, minutes, etc.

You have several implementations and you want to benchmark to see which is fastest:
You can assert that one benchmark is faster than another like this:
new Benchmark(FasterBenchmark).whenRun(17.times()).isFasterThan(new Benchmark(SlowerBenchmark).whenRun(17.times()));

You can assert that a benchmark is the fastest among a list of many benchmarks like this:
var slowerBenchmark = new Benchmark(SlowerBenchmark);
var slowestBenchmark = new Benchmark(SlowestBenchmark);
var fasterBenchmark = new Benchmark(FasterBenchmark);
fasterBenchmark.isFastestAmong(new List<Benchmark> { slowerBenchmark, slowestBenchmark });
Note that isFasterThan (which takes another benchmark as an argument) and isFastestAmong (which takes a list of benchmarks as an argument) are methods called directly on the Benchmark itself, which is different from the way that we pass a constraint into the Should method. This is a bit inconsistent, and we are not sure yet whether this is the best way to do things, but it leads to a level of readability that we are happy with for now.

You want to see how the performance of an implementation scales with increasing n.
You might know the theoretical Big O complexity of your implementation and you want to see how different it is from the real life scaling properties of the implementation. You might also want to discover where performance starts to break down as n gets very large and issues such as context switching, page swapping, and being pushed up the memory hierarchy start to invalidate the performance scaling function. Ideally, you would fit a function to the sample of benchmark runs at intervals of increasing n, but for now we only provide a mechanism to run a benchmark a number of times with increasing n and report results.
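Stripped of the framework, the underlying idea can be sketched with a plain Stopwatch. This stand-in times Array.Sort at doubling n and prints elapsed time divided by the theoretical n·log2(n) cost; that ratio should stay roughly flat while the model holds and drift upward once memory-hierarchy effects kick in:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Illustrative scaling probe: time an O(n log n) operation at doubling n and
// report the per-(n log n) cost, to eyeball where real scaling departs from theory.
public static class ScalingProbe
{
    public static void Run()
    {
        var rng = new Random(42);
        for (var n = 1 << 12; n <= 1 << 18; n <<= 1)
        {
            var data = Enumerable.Range(0, n).Select(i => rng.Next()).ToArray();
            var watch = Stopwatch.StartNew();
            Array.Sort(data);
            watch.Stop();
            var perOp = watch.Elapsed.TotalMilliseconds / (n * Math.Log(n, 2));
            Console.WriteLine("n={0,8}: {1,8:F3} ms, {2:E2} ms per n*log2(n)",
                              n, watch.Elapsed.TotalMilliseconds, perOp);
        }
    }
}
```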
[Test]
public void FindOofNforJustSort()
{
    // numberOfInstruments is supplied by the from.to.by loop generator;
    // the data set is rebuilt in the beforeEachRun() hook for each n
    all_the_series = FakeInstruments.produceNSeries(numberOfInstruments, YEARS * TRADING_DAYS_PER_YEAR);
}

If you are curious about the from.to.by construct, see the section on loop generator constructs at the end of the doc.

[Test]
public void FindOofNforIteratorHeap()
{
    all_the_series = FakeInstruments.produceNSeries(numberOfInstruments, YEARS * TRADING_DAYS_PER_YEAR);
}

How do I avoid making some of the same mistakes that you gentlemen have made?
- Use the beforeEachRun() hook to reset data between runs, and make sure that you are not including the time to set up data for your benchmark in your benchmark timings.
- Put asserts in the benchmark scenario itself to make sure that the benchmark runs through all the data you expect it to. Especially when you specify multiple runs of a benchmark, you need to ensure that your benchmark runs as you expect. In one mistake, we noticed that a benchmark was getting faster and faster at an incredible rate, and we thought it must be an issue where the JIT was distorting our benchmark timings. Closer inspection revealed that we had completely forgotten to create fresh data for each run of the benchmark, and some benchmark runs had been timing runs over empty data sets. As a side effect, putting asserts in your benchmarks might also decrease the likelihood of invalid benchmarks by stopping the JIT from optimizing out code, because it is forced to execute the operations in order to produce the result for the assert at the end of the benchmark.
- Make sure that what you are benchmarking fully encompasses all the work that you want to benchmark. In one mistake, one implementation among our competing implementations of k-way merge algorithms was doing a lot of work in its constructor, which was called in the beforeEachRun() hook and therefore not included in the timing. It is bad style to be doing lots of work in constructors anyway, so we changed things to get a more accurate benchmark.
- Monitor any running processes on the machine you are using to run your benchmarks in order to ensure you are not experiencing interference.
- If you look in the core whenRun method on Benchmark, you can see that there is a call to GC.Collect before each run of the benchmark - this makes sure that inconsistent GCs don't add noise to your benchmark numbers. This can be a confusing source of inconsistent timings when benchmarking on garbage collected runtimes.
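A single timed run in this spirit can be sketched as follows (an illustration of the pattern, not the actual whenRun code): the set-up hook runs untimed, the GC is quiesced, and only the scenario itself is measured:

```csharp
using System;
using System.Diagnostics;

// Sketch of one timed run: untimed set-up, then GC quiescing so collections of
// a previous run's garbage don't pollute this run's numbers, then the timing.
public static class QuiescedTimer
{
    public static TimeSpan TimeOneRun(Action beforeEachRun, Action scenario)
    {
        beforeEachRun();                 // data set-up, excluded from the timing

        GC.Collect();                    // sweep garbage left by set-up/previous runs
        GC.WaitForPendingFinalizers();   // let finalizers drain before timing starts
        GC.Collect();                    // collect anything the finalizers freed

        var watch = Stopwatch.StartNew();
        scenario();                      // the only code included in the timing
        watch.Stop();
        return watch.Elapsed;
    }
}
```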
What sort of benchmarking process do you follow when working in a code base with very high performance requirements?
Use the profiler to isolate hot spots. Write benchmarks around the hot spot and use the benchmarks as your feedback generator while you measure, profile, and tune. Once you get to the point of acceptable performance, specify constraints on the benchmark and turn it into a regression test. This approach maximizes our development speed while tuning and stops us from experiencing performance regressions or "creep." As a side effect of working in this style, we ended up writing a lot of benchmarking code, so we just factored it out into infrastructure as we went rather than trying to copy some other performance testing framework over to .Net. The result is a flexible, lightweight declarative benchmark specification language. It leverages existing NUnit infrastructure for the test runner and asserts (and could easily do the same for xUnit.net).
What is up with these gnarly loop generator constructs?
The times method is a higher order extension method on the integer type that, when called on an integer n, returns a function encapsulating an imperative for loop from 0 to n-1 that takes another function as an argument, and executes the function n times. This allows the higher order function that times() returns to be passed around and applied at a later time - in our case, it is applied within the whenRun method in order to run the benchmark n times.
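The construct described above can be sketched as (a minimal stand-in, not the project's exact code):

```csharp
using System;

// Sketch of the times() extension: called on n, it returns a higher order
// function that, given an action, runs it n times. The returned function can
// be handed around and applied later - e.g. inside whenRun.
public static class LoopExtensions
{
    public static Action<Action> times(this int n)
    {
        return action =>
        {
            for (var i = 0; i < n; i++)
                action();
        };
    }
}
```

With this in scope, `var runThrice = 3.times();` builds the loop, and `runThrice(() => Console.WriteLine("tick"));` applies it later.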
The from.to.by construct is interesting because the by() method is a higher order function (it both takes a function as input and returns a function as output) that takes a scalar function for the iterator i and generates a looping function that takes a function and runs the function inside the loop - passing the function the value of the current iterator created by the scalar function. There is an interesting reason that we made the design choice to create the from.to.by looping structure rather than providing this functionality on the benchmark API directly. The whole point of testing the scaling properties of your benchmark is that you feed in increasing amounts of data, so you probably want to generate that data using the beforeEachRun() hook. As a result of this, you need the number, n, to be in scope for your data generator to generate increasingly large data sets in the before hook. This is also a strange brew of OO and functional style. We tried to implement the from.to.by construct in a purely functional curried style, but from(10)(20)(5) does not read as well. We also thought about a style using extension methods, for example 10.to(20).by(i => i + 5), but for now we like the "from" at the front of the construct. So we ended up with this hybrid approach using an OO style builder syntax together with higher order functions.
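A sketch of that hybrid builder (an illustration, not the project's exact code; the real construct exposes from(...) unqualified, while here it is reached as From.from(...) to stay self-contained):

```csharp
using System;

// Sketch of from(a).to(b).by(step): by() takes the scalar step function for
// the iterator and returns a looping function that feeds each generated
// iterator value to the body it is given.
public static class From
{
    public static FromTo from(int start) { return new FromTo(start); }

    public class FromTo
    {
        private readonly int start;
        private int end;
        public FromTo(int start) { this.start = start; }
        public FromTo to(int upper) { end = upper; return this; }

        public Action<Action<int>> by(Func<int, int> step)
        {
            var first = start;
            var last = end;
            return body =>
            {
                for (var i = first; i <= last; i = step(i))
                    body(i);    // i is the current n, e.g. numberOfInstruments
            };
        }
    }
}
```

So `From.from(10).to(20).by(i => i + 5)` yields a loop whose body sees 10, 15, and 20 - exactly the n that the beforeEachRun() data generator needs in scope.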
What is in the backlog for this project?
historical benchmarks with regression constraints
  o requires persisting benchmark data after each run
  o requires the ability to make asserts based on historical data, with the intent to fail if performance decreases outside acceptable bounds from the last run or from the historical best runs
constraint composition
  o right now you can only add one constraint to a Benchmark; it would be good to allow composition of arbitrary constraints
function fitting and constraints for scaling order
  o right now we only print out the list of results for increasing n for scaling benchmarks
  o it would be nice to be able to fit an actual function to the output data and compare its order against the theoretical order
load testing for web apps and such?

Should I create something like this for <foo>? where foo : programming_language
What about all those crazy issues with optimizing compilers, JIT, garbage collected runtimes, and whatnot?
We are pooling together all our links and working on munging together a killer blogpost.
What about environment-dependent timings? Don't benchmark timings depend on both the machine and what is happening on that machine at the time of the benchmarking?