Stalagmite and its speed
Russian version of this post is here
This summer I participated in Google Summer of Code with the Scala organization. My task was to write a library (stalagmite) that uses scala.meta to generate an efficient and customizable replacement for conventional case classes, with some convenient optimizations. Let's break this description down:
- Efficient: speed comparable to the methods a `case class` already provides, and reasonably fast implementations of the additional ones
- Customizable: the ability to enable and disable various code-generation modes
- Replacement: support for the methods and functionality already available
- Optimizations: reduced memory usage and boilerplate, and increased speed, via macro generation
This list looks great, but building a complete replacement for the powerful and elaborate case classes is difficult. There are many hacks and special cases handled by the Scala compiler. The aim of the project was to support the necessary subset of these "cases", along with additional settings whose implementation, without macro generation, would require boilerplate code and bulky structures inside the class. The macro-generated code also had to run acceptably fast, which I verify below with benchmarks.
Current status of the project
Currently the project provides 3 main modes of operation, together with a set of smaller improvements:
- Duplication of `case class` behavior. In this mode all the basic methods are generated: `toString`, `hashCode`, `copy`, `apply`, `unapply`, `Product` and `Serializable` are handled, plus generation of `Generic`, `LabelledGeneric` and `Typeable` from Shapeless.
- Memoisation (interning). This technique prevents duplication of objects that are equal by value, keeping only one object in memory plus multiple references to it. It reduces memory usage and the time needed to compare objects (value equality becomes reference equality), but creating each object costs more. Stalagmite supports interning both of the class itself and of the fields inside it. There are two sub-modes: weak and strong memoisation. The first stores class instances in a pool via weak references, allowing the garbage collector to remove unused instances; the second keeps normal references and prevents pooled objects from ever being removed from memory.
- "Packaging" of class fields (called heap optimization in the project). This mode reduces the memory footprint of each instance by packing `Boolean` and `Option[AnyVal]` (for primitives) fields into a bitmask, and by converting fields of type `Option[T]` into fields of type `T` that use `null` to represent the original `None`.
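To make the packing idea concrete, here is a minimal hand-written sketch of the transformation (the class and field names are hypothetical and this is not stalagmite's actual generated code): two `Option[Boolean]` fields collapse into one bitmask byte, and an `Option[String]` becomes a nullable `String`.

```scala
// Illustrative sketch of "heap optimization": the public API still exposes
// Option fields, but the stored representation is a bitmask plus a nullable
// reference.
final class PackedFoo private (private val bits: Byte, private val _s: String) {
  // bit 0: b1 defined, bit 1: b1 value, bit 2: b2 defined, bit 3: b2 value
  def b1: Option[Boolean] = if ((bits & 1) != 0) Some((bits & 2) != 0) else None
  def b2: Option[Boolean] = if ((bits & 4) != 0) Some((bits & 8) != 0) else None
  def s: Option[String]   = Option(_s) // None is stored as null
}

object PackedFoo {
  def apply(b1: Option[Boolean], b2: Option[Boolean], s: Option[String]): PackedFoo = {
    var bits = 0
    b1.foreach { v => bits |= 1; if (v) bits |= 2 }
    b2.foreach { v => bits |= 4; if (v) bits |= 8 }
    new PackedFoo(bits.toByte, s.orNull)
  }
}
```

Instead of two boxed `Option[Boolean]` objects and their wrappers, each instance carries a single byte, which is where the memory savings in the benchmarks below come from.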
Classes generated in these modes also rely on an important property: immutability. Memoisation and field packing implicitly depend on it.
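The essence of strong memoisation can be sketched in a few lines (an illustrative example, not stalagmite's generated code): `apply` consults a pool first, so value-equal instances are one shared object and value equality coincides with reference equality.

```scala
import scala.collection.mutable

// Strong memoisation sketch: a private constructor plus an interning pool
// in the companion object.
final class Point private (val x: Int, val y: Int)

object Point {
  private val pool = mutable.HashMap.empty[(Int, Int), Point]

  // Reuse an existing instance when one with the same field values exists.
  def apply(x: Int, y: Int): Point =
    pool.getOrElseUpdate((x, y), new Point(x, y))
}
```

Weak memoisation would instead store `java.lang.ref.WeakReference`s in the pool, letting the garbage collector evict entries that are no longer referenced elsewhere.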
Work on the library is not finished yet; many problems and ideas are documented in the repository's issues. Visit it if you are interested in the fate of the project.
Benchmarks
Next comes a long description of the benchmarks. For the impatient, and for those who only want the results of the performance testing, there is a TL;DR section at the end.
Working time
First, let's examine the timing benchmarks, which use JMH.
Each of them applies some method to a collection of 5000 tuples or instances of a certain class. The number of such applications to the collection per second was measured, averaged over each of 15 threads. The results were obtained on an Intel Core i5-4210U @ 1.70 GHz.
I will use the following notation for classes used in benchmarks:
- `caseClass` – a normal `case class`
- `caseClassMeta` – the macro-generated class that duplicates the functionality of `caseClass`
- `memoisedCaseClass` – also an ordinary `case class`, created for comparison with the memoised versions
- `memoisedMeta` – the generated class with strong memoisation
- `memoisedWeak` – the generated class with weak memoisation
- `optimizeHeapCaseClass` – a `case class` for comparison with heap optimization
- `optimizeHeapMeta` – the generated class with heap optimization
In short: three groups of classes were used during testing. `caseClass*` tested the duplication of case classes, `memoised*` tested memoisation, and `optimizeHeap*` tested field packaging. The `*CaseClass` variants served as the baseline for each group and were plain case classes.
The first group of benchmarks checks the speed of the main case class methods: apply, copy, field access, hashCode, the Product methods, serialization, toString, and unapply.
apply
The collection consisted of tuples whose fields were used to create the classes. For each tuple, the apply method of the corresponding class was called.
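The shape of this workload can be sketched as follows (the real benchmarks wrap such a loop in JMH annotations; the class and names here are illustrative):

```scala
// One benchmark "operation": call apply for every tuple in the collection.
object ApplyWorkload {
  case class Foo(i: Int, b: Boolean, s: String)

  // The collection of tuples the instances are built from.
  val data: Vector[(Int, Boolean, String)] =
    Vector.tabulate(5000)(n => (n, n % 2 == 0, n.toString))

  def applyAll(): Vector[Foo] =
    data.map { case (i, b, s) => Foo(i, b, s) }
}
```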
| Class | Result |
|---|---|
| ApplyBenchmark.caseClass | 19136.554 ± 637.993 ops/s |
| ApplyBenchmark.caseClassMeta | 18331.622 ± 510.110 ops/s |
| ApplyBenchmark.memoisedCaseClass | 20007.336 ± 276.490 ops/s |
| ApplyBenchmark.memoisedMeta | 2560.862 ± 30.396 ops/s |
| ApplyBenchmark.memoisedMetaWeak | 3714.472 ± 21.630 ops/s |
| ApplyBenchmark.optimizeHeapCaseClass | 19397.820 ± 509.561 ops/s |
| ApplyBenchmark.optimizeHeapMeta | 3949.541 ± 54.173 ops/s |
The memoised and optimizeHeap modes carry the overhead of memoisation and field packaging, and thus work slower than a regular case class. The caseClass mode shows the same speed as a case class.
copy
For each class instance in the collection, two copies were created, each with a different value in one of the two fields.
| Class | Result |
|---|---|
| CopyBenchmark.caseClass | 5080.050 ± 284.304 ops/s |
| CopyBenchmark.caseClassMeta | 5432.313 ± 336.334 ops/s |
| CopyBenchmark.memoisedCaseClass | 4998.074 ± 173.456 ops/s |
| CopyBenchmark.memoisedMeta | 853.461 ± 8.121 ops/s |
| CopyBenchmark.memoisedMetaWeak | 1159.225 ± 63.359 ops/s |
| CopyBenchmark.optimizeHeapCaseClass | 6899.622 ± 129.428 ops/s |
| CopyBenchmark.optimizeHeapMeta | 959.581 ± 11.433 ops/s |
The situation is similar to the previous benchmark: copy simply creates a new object using apply.
Access to the fields
Each field of each class instance in the collection was read.
| Class | Result |
|---|---|
| FieldAccessBenchmark.caseClass | 15390.926 ± 122.415 ops/s |
| FieldAccessBenchmark.caseClassMeta | 15433.422 ± 254.224 ops/s |
| FieldAccessBenchmark.optimizeHeapCaseClass | 22975.564 ± 803.286 ops/s |
| FieldAccessBenchmark.optimizeHeapMeta | 4535.168 ± 50.179 ops/s |
The caseClass mode doesn't differ from a case class: the accessor simply returns the actual field of the class. The optimizeHeap mode "unpacks" the fields on every read, so it takes longer. The memoised mode wasn't included because it is equivalent to caseClass in this context.
.hashCode and .toString
For each class instance, the methods `.hashCode` and `.toString` were called.
| Class | Result |
|---|---|
| HashCodeBenchmark.caseClass | 11307.343 ± 1278.621 ops/s |
| HashCodeBenchmark.caseClassMeta | 12106.911 ± 263.225 ops/s |
| HashCodeBenchmark.memoisedCaseClass | 16266.437 ± 292.010 ops/s |
| HashCodeBenchmark.memoisedMeta | 24262.046 ± 1876.971 ops/s |
| HashCodeBenchmark.optimizeHeapCaseClass | 4208.655 ± 323.614 ops/s |
| HashCodeBenchmark.optimizeHeapMeta | 3078.390 ± 210.247 ops/s |
| Class | Result |
|---|---|
| ToStringBenchmark.caseClass | 1145.058 ± 34.190 ops/s |
| ToStringBenchmark.caseClassMeta | 1296.309 ± 22.632 ops/s |
| ToStringBenchmark.memoisedCaseClass | 1439.810 ± 106.486 ops/s |
| ToStringBenchmark.memoisedMeta | 26745.501 ± 1026.738 ops/s |
| ToStringBenchmark.optimizeHeapCaseClass | 548.055 ± 98.772 ops/s |
| ToStringBenchmark.optimizeHeapMeta | 646.945 ± 29.493 ops/s |
For the caseClass and optimizeHeap modes the situation is similar to field access. The .hashCode and .toString methods read the class fields to build hashes and strings, but they also do a lot of other work, so the difference is small. The memoised mode works faster than the case class baseline because of two special settings, memoisedHashCode and memoisedToString: they cache the results of these two methods instead of recomputing them on every call.
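What these two settings boil down to can be sketched like this (an illustrative hand-written example with hypothetical names, not the generated code): compute each value once, lazily, and return the cached result on every subsequent call.

```scala
// A lazy val can override a def, so hashCode and toString are computed
// at most once per instance; safe because the fields are immutable.
final class User(val name: String, val age: Int) {
  override lazy val hashCode: Int    = (name, age).hashCode
  override lazy val toString: String = s"User($name, $age)"
}
```

This is why memoisedMeta posts such a large lead in the toString benchmark above: after the first call, the work is a single field read.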
productElement
This benchmark tested the productElement method.
| Class | Result |
|---|---|
| ProductElementBenchmark.caseClass | 13921.991 ± 631.164 ops/s |
| ProductElementBenchmark.caseClassMeta | 13871.084 ± 1241.714 ops/s |
| ProductElementBenchmark.optimizeHeapCaseClass | 21984.665 ± 636.940 ops/s |
| ProductElementBenchmark.optimizeHeapMeta | 4305.256 ± 73.872 ops/s |
Everything is the same as in the field access benchmark.
Serialization
The entire collection of class instances was serialized and then deserialized.
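The round trip being measured looks roughly like this (a sketch with illustrative names; the benchmark presumably uses plain Java serialization, since the generated classes support `Serializable`):

```scala
import java.io._

case class Record(i: Int, s: String) // case classes are Serializable by default

object SerializationRoundTrip {
  // Write the whole collection to a byte buffer, then read it back.
  def roundTrip(data: Vector[Record]): Vector[Record] = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(data)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    try in.readObject().asInstanceOf[Vector[Record]]
    finally in.close()
  }
}
```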
| Class | Result |
|---|---|
| SerializationBenchmark.caseClass | 190.118 ± 19.615 ops/s |
| SerializationBenchmark.caseClassMeta | 205.555 ± 29.683 ops/s |
| SerializationBenchmark.memoisedCaseClass | 250.941 ± 34.648 ops/s |
| SerializationBenchmark.memoisedMeta | 316.477 ± 30.002 ops/s |
| SerializationBenchmark.memoisedMetaWeak | 338.591 ± 39.945 ops/s |
| SerializationBenchmark.optimizeHeapCaseClass | 93.594 ± 18.978 ops/s |
| SerializationBenchmark.optimizeHeapMeta | 66.927 ± 8.449 ops/s |
There are almost no differences from the case class baselines. The heavy work of writing and reading serialized data dwarfs the quick field accesses and object creation. A nice conclusion: even the intricate field packing of optimizeHeap, and memoisation, do not noticeably affect object serialization.
unapply
Each class instance was pattern-matched.
| Class | Result |
|---|---|
| UnapplyBenchmark.caseClass | 14816.202 ± 939.746 ops/s |
| UnapplyBenchmark.caseClassMeta | 13392.518 ± 431.781 ops/s |
| UnapplyBenchmark.optimizeHeapCaseClass | 23635.361 ± 238.981 ops/s |
| UnapplyBenchmark.optimizeHeapMeta | 4082.947 ± 61.865 ops/s |
The optimizeHeap mode shows slow results due to the field unpacking; the caseClass mode performs on par with the case class.
The next group of benchmarks tested mode-specific features: the generated Shapeless support methods, and the speed of .equals with memoisation.
Shapeless
The shapeless mode generates Generic and LabelledGeneric instances that allow converting class instances to an HList and back. Such conversions were performed for the entire collection.
| Class | Result |
|---|---|
| ShapelessBenchmark.caseClass | 8406.639 ± 342.925 ops/s |
| ShapelessBenchmark.caseClassMeta | 6653.700 ± 38.209 ops/s |
The generated class works slower. The reasons aren't clear; it's probably due to Shapeless's own macro generation.
.equals within Vector
This benchmark tested the speed of the .equals method when the data is placed in a Vector. A set of 1,000,000 random index pairs was chosen for the comparisons. The indices in a pair were at most 10 apart and gradually increased, starting from 1. This preserves data locality, which the Vector data structure is able to provide.
| Class | Result |
|---|---|
| EqualsVectorBenchmark.caseClass | 29.456 ± 1.851 ops/s |
| EqualsVectorBenchmark.caseClassMeta | 30.252 ± 1.707 ops/s |
| EqualsVectorBenchmark.memoisedCaseClass | 36.041 ± 3.207 ops/s |
| EqualsVectorBenchmark.memoisedMeta | 36.401 ± 1.061 ops/s |
| EqualsVectorBenchmark.memoisedWeak | 36.737 ± 2.372 ops/s |
| EqualsVectorBenchmark.optimizeHeapCaseClass | 25.922 ± 1.398 ops/s |
| EqualsVectorBenchmark.optimizeHeapMeta | 11.473 ± 0.828 ops/s |
The caseClass and memoised modes work as well as a case class. In the first case there are no differences in the implementations; in the second, the generated classes compare by reference rather than by value. That should be faster, but it isn't: the time to access an element of the Vector dominates everything. In the optimizeHeap case, the unpacking operation slows down .equals noticeably, even compared to the element access time.
.equals within HashSet
HashSet uses .equals and .hashCode to build its hash table. The time for the entire collection of class instances to be inserted into an empty HashSet was measured.
| Class | Result |
|---|---|
| HashSetBenchmark.caseClass | 3341.499 ± 44.082 ops/s |
| HashSetBenchmark.caseClassMeta | 3665.179 ± 77.763 ops/s |
| HashSetBenchmark.memoisedCaseClass | 3858.759 ± 49.616 ops/s |
| HashSetBenchmark.memoisedIntern | 5009.943 ± 137.088 ops/s |
| HashSetBenchmark.memoisedMeta | 6776.364 ± 39.766 ops/s |
| HashSetBenchmark.memoisedWeak | 6401.936 ± 139.723 ops/s |
| HashSetBenchmark.optimizeHeapCaseClass | 1566.175 ± 70.254 ops/s |
| HashSetBenchmark.optimizeHeapMeta | 869.202 ± 22.643 ops/s |
The caseClass and optimizeHeap modes behave as expected: the first matches the baseline, the second is slower. But the memoised mode is far more interesting! A new class, memoisedIntern, appears here; it uses the memoisedHashCode setting, which caches the hash code and reduces the time needed to obtain it. But even without that, the generated class is faster than the case class thanks to the accelerated .equals method. With hash code caching the speed increases considerably.
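Why memoised classes are fast in a HashSet can be shown with a small hand-written sketch (illustrative names, not the generated code): interning makes a reference check a valid .equals, and a cached hash code avoids recomputation on every probe.

```scala
import scala.collection.mutable

final class Tag private (val name: String) {
  // With interning, value-equal instances are the same object,
  // so equals can be a plain reference comparison.
  override def equals(other: Any): Boolean =
    other match { case that: Tag => this eq that; case _ => false }
  // memoisedHashCode-style cache: computed once, reused on every probe.
  override lazy val hashCode: Int = name.hashCode
}

object Tag {
  private val pool = mutable.HashMap.empty[String, Tag]
  def apply(name: String): Tag = pool.getOrElseUpdate(name, new Tag(name))
}
```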
Memory
The memory usage benchmarks worked simply. Two characteristics were measured: how much memory was used during an iteration, and how much was in use since the start of the run. The second characteristic is needed to check how much memory that cannot be reclaimed by the garbage collector is being produced. Several iterations were performed, showing the dynamics of memory usage.
There were 4 benchmarks:
- memory usage measurement for the `case class` duplication mode
- measurement for field packing
- measurement for memoisation when the data cannot be removed by the garbage collector
- and when it can be
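One plausible way to take such measurements (an assumption on my part, not necessarily how the project's benchmarks do it) is to sample the used heap via `Runtime` before and after each iteration:

```scala
// Sample the JVM's used heap, in kilobytes.
object MemoryProbe {
  private val rt = Runtime.getRuntime

  def usedKb(): Long = {
    System.gc() // request a collection so the sample is less noisy
    (rt.totalMemory() - rt.freeMemory()) / 1024
  }
}
```

Reading `usedKb()` before and after building the 500 thousand instances approximates "memory for iteration"; reading it at the end of each iteration, relative to the start of the run, approximates "memory after iteration".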
case class
Memory usage of 500 thousand case classes and generated classes was compared. Each of them consisted of the following fields: i: Int, b: Boolean, s: String.
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 54659 kb | 54571 kb |
| 1 | 54574 kb | 54499 kb |
| 2 | 54877 kb | 54689 kb |
| 3 | 54801 kb | 54570 kb |
| 4 | 54691 kb | 54595 kb |
Generated class
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 55032 kb | 54934 kb |
| 1 | 54740 kb | 54626 kb |
| 2 | 54896 kb | 54835 kb |
| 3 | 54687 kb | 54612 kb |
| 4 | 54920 kb | 54845 kb |
Absolutely identical.
Packaging fields
A similar comparison was performed: a class with field packaging against a case class. They contain the fields: i: Option[Int], s: Option[String], b1: Option[Boolean], b2: Option[Boolean], b3: Option[Boolean], b4: Option[Boolean].
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 111830 kb | 111732 kb |
| 1 | 113230 kb | 112418 kb |
| 2 | 112626 kb | 112325 kb |
| 3 | 112698 kb | 112315 kb |
| 4 | 113117 kb | 112715 kb |
Packed fields
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 40399 kb | 40281 kb |
| 1 | 42455 kb | 39790 kb |
| 2 | 42949 kb | 40088 kb |
| 3 | 42539 kb | 39811 kb |
| 4 | 43195 kb | 40395 kb |
Field packaging reduces memory usage by almost 3 times. This benchmark clearly shows how much overhead the Option and Boolean representations carry.
Memoization without deletable data
Let's finally proceed to the most interesting and illustrative memory benchmarks. In the first one, the created class instances could not be removed by the garbage collector. There were 4 classes: a case class, one with strong memoisation, one with strong memoisation plus interning of the inner fields, and one with weak memoisation. They contain the fields b: Boolean, s: String, and the string field s was interned in the third case. Two kinds of data were used to create the instances: data with a large number of repetitions, and data with all-distinct elements.
Data with repetitions
500 thousand elements; the strings have length 2 and consist of digits and letters. The number of distinct combinations of such strings and a Boolean is much smaller than the size of the collection, so there were many duplicates.
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 45174 kb | 50249 kb |
| 1 | 46968 kb | 50343 kb |
| 2 | 47010 kb | 50478 kb |
| 3 | 46936 kb | 50540 kb |
| 4 | 46980 kb | 50646 kb |
memoisedMeta
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 11831 kb | 11537 kb |
| 1 | 11781 kb | 11570 kb |
| 2 | 11770 kb | 11601 kb |
| 3 | 11729 kb | 11601 kb |
| 4 | 11729 kb | 11601 kb |
memoisedMeta with string interning
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 11733 kb | 11732 kb |
| 1 | 11718 kb | 11732 kb |
| 2 | 11729 kb | 11742 kb |
| 3 | 11718 kb | 11742 kb |
| 4 | 11728 kb | 11753 kb |
memoisedWeak
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 11666 kb | 11659 kb |
| 1 | 11684 kb | 11680 kb |
| 2 | 11663 kb | 11680 kb |
| 3 | 11662 kb | 11680 kb |
| 4 | 11683 kb | 11700 kb |
Memoisation is doing its job. All the repetitions in the data land in the cache and are not duplicated, as they are in the case class version.
Data without repetition
Again 500 thousand elements, but the strings now have length 5, so the data contains no repetitions.
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 54937 kb | 54938 kb |
| 1 | 53696 kb | 53724 kb |
| 2 | 54271 kb | 53308 kb |
| 3 | 54590 kb | 53211 kb |
| 4 | 54667 kb | 53191 kb |
memoisedMeta
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 78619 kb | 77984 kb |
| 1 | 74883 kb | 140421 kb |
| 2 | 82160 kb | 210863 kb |
| 3 | 75171 kb | 273346 kb |
| 4 | 74882 kb | 336093 kb |
memoisedMeta with string interning
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 95091 kb | 94141 kb |
| 1 | 86867 kb | 168425 kb |
| 2 | 102367 kb | 259074 kb |
| 3 | 86810 kb | 333071 kb |
| 4 | 86753 kb | 407689 kb |
memoisedWeak
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 105562 kb | 105562 kb |
| 1 | 66527 kb | 105683 kb |
| 2 | 66392 kb | 105669 kb |
| 3 | 66423 kb | 105686 kb |
| 4 | 66406 kb | 105686 kb |
Here the case class wins. Strong memoisation continuously writes new data to the cache, consuming a large amount of memory. String interning only worsens the situation. Weak memoisation uses memory better, but it still carries the overhead of building the cache.
Memoization with the garbage collector
In this benchmark, after creating 500 thousand instances, only 20 thousand references to them were kept. This allowed the garbage collector to do its work.
Data with repetitions
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 1914 kb | 1923 kb |
| 1 | 1871 kb | 1920 kb |
| 2 | 1852 kb | 1897 kb |
| 3 | 1856 kb | 1879 kb |
| 4 | 1859 kb | 1863 kb |
memoisedMeta
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 1662 kb | 1663 kb |
| 1 | 445 kb | 1376 kb |
| 2 | 459 kb | 1367 kb |
| 3 | 448 kb | 1347 kb |
| 4 | 468 kb | 1347 kb |
memoisedMeta with string interning
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 1347 kb | 1348 kb |
| 1 | 473 kb | 1351 kb |
| 2 | 458 kb | 1341 kb |
| 3 | 458 kb | 1331 kb |
| 4 | 468 kb | 1331 kb |
memoisedWeak
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 1583 kb | 1583 kb |
| 1 | 900 kb | 1521 kb |
| 2 | 886 kb | 1496 kb |
| 3 | 906 kb | 1494 kb |
| 4 | 909 kb | 1498 kb |
All the distinct combinations of data can again fit entirely into the cache. Strong memoisation gives a good improvement over the case class; weak memoisation does a little worse.
Data without repetition
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 2299 kb | 2299 kb |
| 1 | 2228 kb | 2341 kb |
| 2 | 2187 kb | 2341 kb |
| 3 | 2207 kb | 2341 kb |
| 4 | 2187 kb | 2341 kb |
memoisedMeta
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 66468 kb | 66468 kb |
| 1 | 66920 kb | 132920 kb |
| 2 | 62015 kb | 193819 kb |
| 3 | 70686 kb | 264037 kb |
| 4 | 62875 kb | 326444 kb |
memoisedMeta with string interning
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 82753 kb | 82754 kb |
| 1 | 81617 kb | 163902 kb |
| 2 | 75799 kb | 239233 kb |
| 3 | 91769 kb | 330278 kb |
| 4 | 75296 kb | 403948 kb |
memoisedWeak
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 41861 kb | 41861 kb |
| 1 | 3089 kb | 42294 kb |
| 2 | 2744 kb | 42382 kb |
| 3 | 2676 kb | 42402 kb |
| 4 | 2676 kb | 42423 kb |
Here everything is much worse. Strong memoisation stores all the data in its cache, nothing is ever deleted, and the cache becomes huge. Weak memoisation keeps only what the current iteration needs in the cache, but that is still a lot of memory. This is the worst case for memoisation; it could easily lead to an OutOfMemoryError.
TL;DR
The speed and memory measurements show that there is no situation in which a generated class is unconditionally better than a case class. You either gain speed at the cost of more memory, or vice versa. Generated classes with no additional modes enabled match the efficiency of a case class. Field packaging consumes less memory, but is slow at field access and in the .apply method. Memoisation speeds up the .equals method, but adds overhead to .apply, and its memory usage depends on the context. With many duplicates, memoisation caches them and stores only references; when the data has almost no duplicates, especially when the data is being reclaimed by the garbage collector, memory consumption increases significantly.
In the end
The conclusion is simple. Built-in case classes work very well in most situations. Stalagmite provides an alternative that trades memory for speed and vice versa. The development plans include ideas for reducing boilerplate and adding new optimizations, but a complete replacement of the case class is impossible. So use this library wisely. I hope I have shown you the strengths and weaknesses of the current implementation :)