Stalagmite and its speed
Russian version of this post is here
This summer I participated in Google Summer of Code with the Scala organization. My task was to write a library (stalagmite) that uses scala.meta to generate an efficient and customizable replacement for conventional case classes, with some convenient optimizations. Let's break this description down:
- Efficient: speed comparable to the methods a `case class` already provides, and reasonably fast implementations of the additional ones
- Customizable: the ability to enable and disable various code-generation modes
- Replacement: support for the methods and functionality already available
- Optimizations: reduced memory usage and boilerplate, and increased speed, via macro generation
This list looks great, but building a complete replacement for the powerful and elaborate case classes is difficult. There are many hacks and special cases handled by the Scala compiler. The aim of the project was to support the necessary subset of these "cases", along with additional settings whose implementation, without macro generation, would require boilerplate code and bulky structures inside the class. The macro-generated code also had to run acceptably fast, which I verify below with benchmarks.
Current status of the project
Currently the project provides 3 main modes of operation, together with a set of smaller improvements:
- Duplication of `case class` behavior. In this mode all the basic methods are generated: `toString`, `hashCode`, `copy`, `apply`, `unapply`, `Product` and `Serializable` are handled, plus generation of `Generic`, `LabelledGeneric` and `Typeable` from Shapeless.
- Memoisation (interning). This technique prevents duplication of objects that are equal by value, keeping only one object in memory plus multiple references to it. It reduces memory usage and the time needed to compare objects (value equality becomes reference equality), but creating each object costs more. Stalagmite supports interning both of the class itself and of the fields inside it. There are two sub-modes: weak and strong memoisation. The first stores class instances in a pool via weak references, allowing the garbage collector to remove unused instances; the second keeps normal references and prevents pooled objects from ever being removed from memory.
- "Packaging" of class fields (called heap optimization in the project). This mode reduces the memory footprint of each instance by packing `Boolean` and `Option[AnyVal]` (for primitives) fields into a bitmask, and by converting fields of type `Option[T]` into fields of type `T` that use `null` to represent the original `None`.
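To make the packing idea concrete, here is a minimal hand-written sketch of the transformation (the class and field names are hypothetical and this is not stalagmite's actual generated code): two `Option[Boolean]` fields collapse into one bitmask byte, and an `Option[String]` becomes a nullable `String`.

```scala
// Illustrative sketch of "heap optimization": the public API still exposes
// Option fields, but the stored representation is a bitmask plus a nullable
// reference.
final class PackedFoo private (private val bits: Byte, private val _s: String) {
  // bit 0: b1 defined, bit 1: b1 value, bit 2: b2 defined, bit 3: b2 value
  def b1: Option[Boolean] = if ((bits & 1) != 0) Some((bits & 2) != 0) else None
  def b2: Option[Boolean] = if ((bits & 4) != 0) Some((bits & 8) != 0) else None
  def s: Option[String]   = Option(_s) // None is stored as null
}

object PackedFoo {
  def apply(b1: Option[Boolean], b2: Option[Boolean], s: Option[String]): PackedFoo = {
    var bits = 0
    b1.foreach { v => bits |= 1; if (v) bits |= 2 }
    b2.foreach { v => bits |= 4; if (v) bits |= 8 }
    new PackedFoo(bits.toByte, s.orNull)
  }
}
```

Instead of two boxed `Option[Boolean]` objects and their wrappers, each instance carries a single byte, which is where the memory savings in the benchmarks below come from.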
Classes generated in these modes also rely on an important property: immutability. Memoisation and field packing implicitly depend on it.
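The essence of strong memoisation can be sketched in a few lines (an illustrative example, not stalagmite's generated code): `apply` consults a pool first, so value-equal instances are one shared object and value equality coincides with reference equality.

```scala
import scala.collection.mutable

// Strong memoisation sketch: a private constructor plus an interning pool
// in the companion object.
final class Point private (val x: Int, val y: Int)

object Point {
  private val pool = mutable.HashMap.empty[(Int, Int), Point]

  // Reuse an existing instance when one with the same field values exists.
  def apply(x: Int, y: Int): Point =
    pool.getOrElseUpdate((x, y), new Point(x, y))
}
```

Weak memoisation would instead store `java.lang.ref.WeakReference`s in the pool, letting the garbage collector evict entries that are no longer referenced elsewhere.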
Work on the library is not finished yet; many problems and ideas are documented in the repository's issues. Visit it if you are interested in the fate of the project.
Benchmarks
Next comes a long description of the benchmarks. For the impatient, and for those who only want the results of the performance testing, there is a TL;DR section at the end.
Working time
First, let's examine the timing benchmarks, which use JMH.
Each of them applies some method to a collection of 5000 tuples or instances of a certain class. The number of such applications to the collection per second was measured, averaged over each of 15 threads. The results were obtained on an Intel Core i5-4210U @ 1.70 GHz.
I will use the following notation for classes used in benchmarks:
- `caseClass` – a normal `case class`
- `caseClassMeta` – the macro-generated class that duplicates the functionality of `caseClass`
- `memoisedCaseClass` – also an ordinary `case class`, created for comparison with the memoised versions
- `memoisedMeta` – the generated class with strong memoisation
- `memoisedWeak` – the generated class with weak memoisation
- `optimizeHeapCaseClass` – a `case class` for comparison with heap optimization
- `optimizeHeapMeta` – the generated class with heap optimization
In short: three groups of classes were used during testing. `caseClass*` tested the duplication of case classes, `memoised*` tested memoisation, and `optimizeHeap*` tested field packaging. The `*CaseClass` variants served as the baseline for each group and were plain case classes.
The first group of benchmarks checks the speed of the main case class methods: apply, copy, field access, hashCode, the Product methods, serialization, toString, and unapply.
apply
The collection consisted of tuples whose fields were used to create the classes. For each tuple, the apply method of the corresponding class was called.
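The shape of this workload can be sketched as follows (the real benchmarks wrap such a loop in JMH annotations; the class and names here are illustrative):

```scala
// One benchmark "operation": call apply for every tuple in the collection.
object ApplyWorkload {
  case class Foo(i: Int, b: Boolean, s: String)

  // The collection of tuples the instances are built from.
  val data: Vector[(Int, Boolean, String)] =
    Vector.tabulate(5000)(n => (n, n % 2 == 0, n.toString))

  def applyAll(): Vector[Foo] =
    data.map { case (i, b, s) => Foo(i, b, s) }
}
```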
| Class | Result |
|---|---|
| ApplyBenchmark.caseClass | 19136.554 ± 637.993 ops/s |
| ApplyBenchmark.caseClassMeta | 18331.622 ± 510.110 ops/s |
| ApplyBenchmark.memoisedCaseClass | 20007.336 ± 276.490 ops/s |
| ApplyBenchmark.memoisedMeta | 2560.862 ± 30.396 ops/s |
| ApplyBenchmark.memoisedMetaWeak | 3714.472 ± 21.630 ops/s |
| ApplyBenchmark.optimizeHeapCaseClass | 19397.820 ± 509.561 ops/s |
| ApplyBenchmark.optimizeHeapMeta | 3949.541 ± 54.173 ops/s |
The memoised and optimizeHeap modes carry the overhead of memoisation and field packaging, and thus work slower than a regular case class. The caseClass mode shows the same speed as a case class.
copy
For each class instance in the collection, two copies were created, each with a different value in one of the two fields.
| Class | Result |
|---|---|
| CopyBenchmark.caseClass | 5080.050 ± 284.304 ops/s |
| CopyBenchmark.caseClassMeta | 5432.313 ± 336.334 ops/s |
| CopyBenchmark.memoisedCaseClass | 4998.074 ± 173.456 ops/s |
| CopyBenchmark.memoisedMeta | 853.461 ± 8.121 ops/s |
| CopyBenchmark.memoisedMetaWeak | 1159.225 ± 63.359 ops/s |
| CopyBenchmark.optimizeHeapCaseClass | 6899.622 ± 129.428 ops/s |
| CopyBenchmark.optimizeHeapMeta | 959.581 ± 11.433 ops/s |
The situation is similar to the previous benchmark: copy simply creates a new object using apply.
Access to the fields
Each field of each class instance in the collection was read.
| Class | Result |
|---|---|
| FieldAccessBenchmark.caseClass | 15390.926 ± 122.415 ops/s |
| FieldAccessBenchmark.caseClassMeta | 15433.422 ± 254.224 ops/s |
| FieldAccessBenchmark.optimizeHeapCaseClass | 22975.564 ± 803.286 ops/s |
| FieldAccessBenchmark.optimizeHeapMeta | 4535.168 ± 50.179 ops/s |
The caseClass mode doesn't differ from a case class: the accessor simply returns the actual field of the class. The optimizeHeap mode "unpacks" the fields on every read, so it takes longer. The memoised mode wasn't included because it is equivalent to caseClass in this context.
.hashCode and .toString
For each class instance, the methods `.hashCode` and `.toString` were called.
| Class | Result |
|---|---|
| HashCodeBenchmark.caseClass | 11307.343 ± 1278.621 ops/s |
| HashCodeBenchmark.caseClassMeta | 12106.911 ± 263.225 ops/s |
| HashCodeBenchmark.memoisedCaseClass | 16266.437 ± 292.010 ops/s |
| HashCodeBenchmark.memoisedMeta | 24262.046 ± 1876.971 ops/s |
| HashCodeBenchmark.optimizeHeapCaseClass | 4208.655 ± 323.614 ops/s |
| HashCodeBenchmark.optimizeHeapMeta | 3078.390 ± 210.247 ops/s |
| Class | Result |
|---|---|
| ToStringBenchmark.caseClass | 1145.058 ± 34.190 ops/s |
| ToStringBenchmark.caseClassMeta | 1296.309 ± 22.632 ops/s |
| ToStringBenchmark.memoisedCaseClass | 1439.810 ± 106.486 ops/s |
| ToStringBenchmark.memoisedMeta | 26745.501 ± 1026.738 ops/s |
| ToStringBenchmark.optimizeHeapCaseClass | 548.055 ± 98.772 ops/s |
| ToStringBenchmark.optimizeHeapMeta | 646.945 ± 29.493 ops/s |
For the caseClass and optimizeHeap modes the situation is similar to field access. The .hashCode and .toString methods read the class fields to build hashes and strings, but they also do a lot of other work, so the difference is small. The memoised mode works faster than the case class baseline because of two special settings, memoisedHashCode and memoisedToString: they cache the results of these two methods instead of recomputing them on every call.
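What these two settings boil down to can be sketched like this (an illustrative hand-written example with hypothetical names, not the generated code): compute each value once, lazily, and return the cached result on every subsequent call.

```scala
// A lazy val can override a def, so hashCode and toString are computed
// at most once per instance; safe because the fields are immutable.
final class User(val name: String, val age: Int) {
  override lazy val hashCode: Int    = (name, age).hashCode
  override lazy val toString: String = s"User($name, $age)"
}
```

This is why memoisedMeta posts such a large lead in the toString benchmark above: after the first call, the work is a single field read.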
productElement
This benchmark tested the productElement method.
| Class | Result |
|---|---|
| ProductElementBenchmark.caseClass | 13921.991 ± 631.164 ops/s |
| ProductElementBenchmark.caseClassMeta | 13871.084 ± 1241.714 ops/s |
| ProductElementBenchmark.optimizeHeapCaseClass | 21984.665 ± 636.940 ops/s |
| ProductElementBenchmark.optimizeHeapMeta | 4305.256 ± 73.872 ops/s |
Everything is the same as in the field access benchmark.
Serialization
The entire collection of class instances was serialized and then deserialized.
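The round trip being measured looks roughly like this (a sketch with illustrative names; the benchmark presumably uses plain Java serialization, since the generated classes support `Serializable`):

```scala
import java.io._

case class Record(i: Int, s: String) // case classes are Serializable by default

object SerializationRoundTrip {
  // Write the whole collection to a byte buffer, then read it back.
  def roundTrip(data: Vector[Record]): Vector[Record] = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(data)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    try in.readObject().asInstanceOf[Vector[Record]]
    finally in.close()
  }
}
```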
| Class | Result |
|---|---|
| SerializationBenchmark.caseClass | 190.118 ± 19.615 ops/s |
| SerializationBenchmark.caseClassMeta | 205.555 ± 29.683 ops/s |
| SerializationBenchmark.memoisedCaseClass | 250.941 ± 34.648 ops/s |
| SerializationBenchmark.memoisedMeta | 316.477 ± 30.002 ops/s |
| SerializationBenchmark.memoisedMetaWeak | 338.591 ± 39.945 ops/s |
| SerializationBenchmark.optimizeHeapCaseClass | 93.594 ± 18.978 ops/s |
| SerializationBenchmark.optimizeHeapMeta | 66.927 ± 8.449 ops/s |
There are almost no differences from the case class baselines. The heavy work of writing and reading serialized data dwarfs the quick field accesses and object creation. A nice conclusion: even the intricate field packing of optimizeHeap, and memoisation, do not noticeably affect object serialization.
unapply
Each class instance was pattern-matched.
| Class | Result |
|---|---|
| UnapplyBenchmark.caseClass | 14816.202 ± 939.746 ops/s |
| UnapplyBenchmark.caseClassMeta | 13392.518 ± 431.781 ops/s |
| UnapplyBenchmark.optimizeHeapCaseClass | 23635.361 ± 238.981 ops/s |
| UnapplyBenchmark.optimizeHeapMeta | 4082.947 ± 61.865 ops/s |
The optimizeHeap mode shows slow results due to the field unpacking; the caseClass mode performs on par with the case class.
The next group of benchmarks tested mode-specific features: the generated Shapeless support methods, and the speed of .equals with memoisation.
Shapeless
The shapeless mode generates Generic and LabelledGeneric instances that allow converting class instances to an HList and back. Such conversions were performed for the entire collection.
| Class | Result |
|---|---|
| ShapelessBenchmark.caseClass | 8406.639 ± 342.925 ops/s |
| ShapelessBenchmark.caseClassMeta | 6653.700 ± 38.209 ops/s |
The generated class works slower. The reasons aren't clear; it's probably due to Shapeless's own macro generation.
.equals within Vector
This benchmark tested the speed of the .equals method when the data is placed in a Vector. A set of 1,000,000 random index pairs was chosen for the comparisons. The indices in a pair were at most 10 apart and gradually increased, starting from 1. This preserves data locality, which the Vector data structure is able to provide.
| Class | Result |
|---|---|
| EqualsVectorBenchmark.caseClass | 29.456 ± 1.851 ops/s |
| EqualsVectorBenchmark.caseClassMeta | 30.252 ± 1.707 ops/s |
| EqualsVectorBenchmark.memoisedCaseClass | 36.041 ± 3.207 ops/s |
| EqualsVectorBenchmark.memoisedMeta | 36.401 ± 1.061 ops/s |
| EqualsVectorBenchmark.memoisedWeak | 36.737 ± 2.372 ops/s |
| EqualsVectorBenchmark.optimizeHeapCaseClass | 25.922 ± 1.398 ops/s |
| EqualsVectorBenchmark.optimizeHeapMeta | 11.473 ± 0.828 ops/s |
The caseClass and memoised modes work as well as a case class. In the first case there are no differences in the implementations; in the second, the generated classes compare by reference rather than by value. That should be faster, but it isn't: the time to access an element of the Vector dominates everything. In the optimizeHeap case, the unpacking operation slows down .equals noticeably, even compared to the element access time.
.equals within HashSet
HashSet uses .equals and .hashCode to build its hash table. The time for the entire collection of class instances to be inserted into an empty HashSet was measured.
| Class | Result |
|---|---|
| HashSetBenchmark.caseClass | 3341.499 ± 44.082 ops/s |
| HashSetBenchmark.caseClassMeta | 3665.179 ± 77.763 ops/s |
| HashSetBenchmark.memoisedCaseClass | 3858.759 ± 49.616 ops/s |
| HashSetBenchmark.memoisedIntern | 5009.943 ± 137.088 ops/s |
| HashSetBenchmark.memoisedMeta | 6776.364 ± 39.766 ops/s |
| HashSetBenchmark.memoisedWeak | 6401.936 ± 139.723 ops/s |
| HashSetBenchmark.optimizeHeapCaseClass | 1566.175 ± 70.254 ops/s |
| HashSetBenchmark.optimizeHeapMeta | 869.202 ± 22.643 ops/s |
The caseClass and optimizeHeap modes behave as expected: the first matches the baseline, the second is slower. But the memoised mode is far more interesting! A new class, memoisedIntern, appears here; it uses the memoisedHashCode setting, which caches the hash code and reduces the time needed to obtain it. But even without that, the generated class is faster than the case class thanks to the accelerated .equals method. With hash code caching the speed increases considerably.
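Why memoised classes are fast in a HashSet can be shown with a small hand-written sketch (illustrative names, not the generated code): interning makes a reference check a valid .equals, and a cached hash code avoids recomputation on every probe.

```scala
import scala.collection.mutable

final class Tag private (val name: String) {
  // With interning, value-equal instances are the same object,
  // so equals can be a plain reference comparison.
  override def equals(other: Any): Boolean =
    other match { case that: Tag => this eq that; case _ => false }
  // memoisedHashCode-style cache: computed once, reused on every probe.
  override lazy val hashCode: Int = name.hashCode
}

object Tag {
  private val pool = mutable.HashMap.empty[String, Tag]
  def apply(name: String): Tag = pool.getOrElseUpdate(name, new Tag(name))
}
```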
Memory
The memory usage benchmarks worked simply. Two characteristics were measured: how much memory was used during an iteration, and how much was in use since the start of the run. The second characteristic is needed to check how much memory that cannot be reclaimed by the garbage collector is being produced. Several iterations were performed, showing the dynamics of memory usage.
There were 4 benchmarks:
- memory usage measurement for the `case class` duplication mode
- measurement for field packing
- measurement for memoisation when the data cannot be removed by the garbage collector
- and when it can be
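One plausible way to take such measurements (an assumption on my part, not necessarily how the project's benchmarks do it) is to sample the used heap via `Runtime` before and after each iteration:

```scala
// Sample the JVM's used heap, in kilobytes.
object MemoryProbe {
  private val rt = Runtime.getRuntime

  def usedKb(): Long = {
    System.gc() // request a collection so the sample is less noisy
    (rt.totalMemory() - rt.freeMemory()) / 1024
  }
}
```

Reading `usedKb()` before and after building the 500 thousand instances approximates "memory for iteration"; reading it at the end of each iteration, relative to the start of the run, approximates "memory after iteration".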
case class
Memory usage of 500 thousand case classes and generated classes was compared. Each of them consisted of the following fields: i: Int, b: Boolean, s: String.
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 54659 kb | 54571 kb |
| 1 | 54574 kb | 54499 kb |
| 2 | 54877 kb | 54689 kb |
| 3 | 54801 kb | 54570 kb |
| 4 | 54691 kb | 54595 kb |
Generated class
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 55032 kb | 54934 kb |
| 1 | 54740 kb | 54626 kb |
| 2 | 54896 kb | 54835 kb |
| 3 | 54687 kb | 54612 kb |
| 4 | 54920 kb | 54845 kb |
Absolutely identical.
Packaging fields
A similar comparison was performed: a class with field packaging against a case class. They contain the fields: i: Option[Int], s: Option[String], b1: Option[Boolean], b2: Option[Boolean], b3: Option[Boolean], b4: Option[Boolean].
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 111830 kb | 111732 kb |
| 1 | 113230 kb | 112418 kb |
| 2 | 112626 kb | 112325 kb |
| 3 | 112698 kb | 112315 kb |
| 4 | 113117 kb | 112715 kb |
Packed fields
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 40399 kb | 40281 kb |
| 1 | 42455 kb | 39790 kb |
| 2 | 42949 kb | 40088 kb |
| 3 | 42539 kb | 39811 kb |
| 4 | 43195 kb | 40395 kb |
Field packaging reduces memory usage by almost 3 times. This benchmark clearly shows how much overhead the Option and Boolean representations carry.
Memoization without deletable data
Let's finally proceed to the most interesting and illustrative memory benchmarks. In the first one, the created class instances could not be removed by the garbage collector. There were 4 classes: a case class, one with strong memoisation, one with strong memoisation plus interning of the inner fields, and one with weak memoisation. They contain the fields b: Boolean, s: String, and the string field s was interned in the third case. Two kinds of data were used to create the instances: data with a large number of repetitions, and data with all-distinct elements.
Data with repetitions
500 thousand elements; the strings have length 2 and consist of digits and letters. The number of distinct combinations of such strings and a Boolean is much smaller than the size of the collection, so there were many duplicates.
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 45174 kb | 50249 kb |
| 1 | 46968 kb | 50343 kb |
| 2 | 47010 kb | 50478 kb |
| 3 | 46936 kb | 50540 kb |
| 4 | 46980 kb | 50646 kb |
memoisedMeta
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 11831 kb | 11537 kb |
| 1 | 11781 kb | 11570 kb |
| 2 | 11770 kb | 11601 kb |
| 3 | 11729 kb | 11601 kb |
| 4 | 11729 kb | 11601 kb |
memoisedMeta with string interning
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 11733 kb | 11732 kb |
| 1 | 11718 kb | 11732 kb |
| 2 | 11729 kb | 11742 kb |
| 3 | 11718 kb | 11742 kb |
| 4 | 11728 kb | 11753 kb |
memoisedWeak
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 11666 kb | 11659 kb |
| 1 | 11684 kb | 11680 kb |
| 2 | 11663 kb | 11680 kb |
| 3 | 11662 kb | 11680 kb |
| 4 | 11683 kb | 11700 kb |
Memoisation is doing its job. All the repetitions in the data land in the cache and are not duplicated, as they are in the case class version.
Data without repetition
Again 500 thousand elements, but the strings now have length 5, so the data contains no repetitions.
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 54937 kb | 54938 kb |
| 1 | 53696 kb | 53724 kb |
| 2 | 54271 kb | 53308 kb |
| 3 | 54590 kb | 53211 kb |
| 4 | 54667 kb | 53191 kb |
memoisedMeta
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 78619 kb | 77984 kb |
| 1 | 74883 kb | 140421 kb |
| 2 | 82160 kb | 210863 kb |
| 3 | 75171 kb | 273346 kb |
| 4 | 74882 kb | 336093 kb |
memoisedMeta with string interning
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 95091 kb | 94141 kb |
| 1 | 86867 kb | 168425 kb |
| 2 | 102367 kb | 259074 kb |
| 3 | 86810 kb | 333071 kb |
| 4 | 86753 kb | 407689 kb |
memoisedWeak
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 105562 kb | 105562 kb |
| 1 | 66527 kb | 105683 kb |
| 2 | 66392 kb | 105669 kb |
| 3 | 66423 kb | 105686 kb |
| 4 | 66406 kb | 105686 kb |
Here the case class wins. Strong memoisation continuously writes new data to the cache, consuming a large amount of memory. String interning only worsens the situation. Weak memoisation uses memory better, but it still carries the overhead of building the cache.
Memoization with the garbage collector
In this benchmark, after creating 500 thousand instances, only 20 thousand references to them were kept. This allowed the garbage collector to do its work.
Data with repetitions
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 1914 kb | 1923 kb |
| 1 | 1871 kb | 1920 kb |
| 2 | 1852 kb | 1897 kb |
| 3 | 1856 kb | 1879 kb |
| 4 | 1859 kb | 1863 kb |
memoisedMeta
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 1662 kb | 1663 kb |
| 1 | 445 kb | 1376 kb |
| 2 | 459 kb | 1367 kb |
| 3 | 448 kb | 1347 kb |
| 4 | 468 kb | 1347 kb |
memoisedMeta with string interning
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 1347 kb | 1348 kb |
| 1 | 473 kb | 1351 kb |
| 2 | 458 kb | 1341 kb |
| 3 | 458 kb | 1331 kb |
| 4 | 468 kb | 1331 kb |
memoisedWeak
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 1583 kb | 1583 kb |
| 1 | 900 kb | 1521 kb |
| 2 | 886 kb | 1496 kb |
| 3 | 906 kb | 1494 kb |
| 4 | 909 kb | 1498 kb |
All the distinct combinations of data can again fit entirely into the cache. Strong memoisation gives a good improvement over the case class; weak memoisation does a little worse.
Data without repetition
caseClass
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 2299 kb | 2299 kb |
| 1 | 2228 kb | 2341 kb |
| 2 | 2187 kb | 2341 kb |
| 3 | 2207 kb | 2341 kb |
| 4 | 2187 kb | 2341 kb |
memoisedMeta
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 66468 kb | 66468 kb |
| 1 | 66920 kb | 132920 kb |
| 2 | 62015 kb | 193819 kb |
| 3 | 70686 kb | 264037 kb |
| 4 | 62875 kb | 326444 kb |
memoisedMeta with string interning
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 82753 kb | 82754 kb |
| 1 | 81617 kb | 163902 kb |
| 2 | 75799 kb | 239233 kb |
| 3 | 91769 kb | 330278 kb |
| 4 | 75296 kb | 403948 kb |
memoisedWeak
| Iteration | Memory for iteration | Memory after iteration |
|---|---|---|
| 0 | 41861 kb | 41861 kb |
| 1 | 3089 kb | 42294 kb |
| 2 | 2744 kb | 42382 kb |
| 3 | 2676 kb | 42402 kb |
| 4 | 2676 kb | 42423 kb |
Here everything is much worse. Strong memoisation stores all the data in its cache, nothing is ever deleted, and the cache becomes huge. Weak memoisation keeps only what the current iteration needs in the cache, but that is still a lot of memory. This is the worst case for memoisation; it could easily lead to an OutOfMemoryError.
TL;DR
The speed and memory measurements show that there is no situation in which a generated class is unconditionally better than a case class. You either gain speed at the cost of more memory, or vice versa. Generated classes with no additional modes enabled match the efficiency of a case class. Field packaging consumes less memory, but is slow at field access and in the .apply method. Memoisation speeds up the .equals method, but adds overhead to .apply, and its memory usage depends on the context. With many duplicates, memoisation caches them and stores only references; when the data has almost no duplicates, especially when the data is being reclaimed by the garbage collector, memory consumption increases significantly.
In the end
The conclusion is simple. Built-in case classes work very well in most situations. Stalagmite provides an alternative that trades memory for speed and vice versa. The development plans include ideas for reducing boilerplate and adding new optimizations, but a complete replacement of the case class is impossible. So use this library wisely. I hope I have shown you the strengths and weaknesses of the current implementation :)