Connecting...

W1siziisimnvbxbpbgvkx3rozw1lx2fzc2v0cy9zawduawz5lxrly2hub2xvz3kvanbnl2jhbm5lci1kzwzhdwx0lmpwzyjdxq

Scala Snacks: Part 1 - Sets are invariantly interesting by Daniel Tattan-Birch

W1siziisijiwmtgvmdgvmtuvmtuvmjmvmtevotawl3blegvscy1wag90by0ynzazndguanblzyjdlfsiccisinrodw1iiiwiotawedkwmfx1mdazzsjdxq

Here is some Wednesday Wisdom for you from Football RadarDaniel Tattan-Birch gives us some interesting informative nuggets which they have come across while building their Scala services at Football Radar. This is Part 1 so keep an eye out for more to come.

 

'As software engineers we often stumble upon interesting technical curiosities relating to the languages, libraries and infrastructure we employ in our systems. Exploring these novel behaviours can give us a better understanding of the deeper complexities of those technologies. This is the first post of a series which will explore some such informative nuggets we've come across while building our Scala services here at Football Radar. Here's our first Scala snack.

 

Setting the scene

I was recently attempting to refactor the code around data being consumed and produced by certain APIs between services. We'll use the typical inheritance model of animals and mammals to illustrate the example. Implementation details aren't important.

sealed trait Animal  
sealed trait Mammal extends Animal  
sealed trait Bird extends Animal  

Now let's suppose we're consuming some API which sends us animals in string format, for example some JSON off a Kafka queue. We have a requirement that all animals must be of the same type in the consumed data, and this validation has already been done. The function below deserialises the input.

def deserialiseAnimals(jsonBlob: String, animalType: Int): Set[Animal] = animalType match {  
  ...
  case Mammal.id =>
    val mammals: Set[Mammal] = deserialiseMammals(jsonBlob)
    mammals.map(x => mammalWithName(x))
}

Not too nice. Straight off there should probably a wildcard or bound generic type parameter in the signature, for example return a 'Set[T]' where 'T <: Animal.' This would encode the API's requirement at compile-time and avoid the pattern matching. We'll come back to that at the end. The 'Object' in which this resides is bloated due to many similar deserialisation functions. One thing we could do is move functions to the companion objects of the animals, for example:

case object Mammal {  
  def deserialise(jsonBlob: String): Set[Mammal] = { ... }
}

We would then have something like

def deserialiseAnimals(jsonBlob: String, animalType: Int): Set[Animal] = animalType match {  
  ...
  case Mammal.id => Mammal.deserialise(jsonBlob)
}

You may or may not be surprised that this doesn't compile. Interesting! There are two independent questions here: why doesn't it compile now, and why did it compile before?

 

Why doesn't it compile?

Sets in Scala are invariant. Specifically, the signature of the 'Set' trait is defined [here] as 'trait Set[A]'. This means that, using Scala type parameter syntax, if 'A >: B' holds then both 'Set[A] >: Set[B]' and 'Set[B] >: Set[A]' do not hold. Hence we cannot return a 'Set[Mammal]' or 'Set[Bird]' from this function. Why is that? The developers of Scala rarely do anything without good reason. I dug a bit deeper.

The Scala 'List[+A]' class is covariant. This seems intuitive, since we expect to be able to assign a 'List[Mammal]' to 'List[Animal]'. Well, maybe that claim itself is debatable but I'll leave that one for another time. On the REPL:

scala> trait Animal  
defined trait Animal

scala> trait Mammal extends Animal  
defined trait Mammal

scala> val animals: List[Animal] = List[Mammal](new Mammal{})  
animals: List[Animal] = List($anon$1@73b10975)  

Actually this is only safe because Scala lists are immutable. Hence if we try to do the following, we get an error.

scala> trait Bird extends Animal  
defined trait Bird

scala> animals(0) = new Bird {}  
<console>:14: error: value update is not a member of List[Animal]  
       animals(0) = new Bird {}
       ^

Well that's a relief! If that weren't the case we could end up trying to wedge a 'Bird' into a' List[Mammal]' and it would only blow up at runtime. Scala does provide a 'MutableList' as well, but that's invariant so the compiler will stop us:

scala> import scala.collection.mutable._  
import scala.collection.mutable._

scala> val animals: MutableList[Animal] = new MutableList[Mammal]()  
<console>:22: error: type mismatch;  
 found   : scala.collection.mutable.MutableList[Mammal]
 required: scala.collection.mutable.MutableList[Animal]
Note: Mammal <: Animal, but class MutableList is invariant in type A. 

We're covered either way. This is a pitfall of Java and C# arrays. They are both covariant, for [historical reasons]. Lists in both languages are invariant, because they are not immutable (by default). Put succinctly by C# guru Jon Skeet:

>No, a 'List<Dog>' is not a 'List<Animal>.' Consider what you can do with a 'List<Animal>'- you can add any animal to it... including a cat. Now, can you logically add a cat to a litter of puppies? Absolutely not.

I digress. So why are sets in Scala invariant, *despite* being immutable? Martin Odersky [assigns] the invariance of sets to their implementation details:

>On the issue of sets, I believe the non-variance stems also from the implementations. Common sets are implemented as hashtables, which are non-variant arrays of the key type. I agree it's a slightly annoying irregularity.

That's sad, but probably true. An abstraction being defined based on common implementations.

I prefer to believe there could be deeper reasoning. I like [this] justification. We can think of a 'Set[A]' as simply a mapping - for every candidate element 'A', is the element present in the set or not? [Here's an interesting question] on that topic. This representation of a set treats it as a [characteristic (indicator) function]. I'm on board with that notion.

So in explicit terms a 'Set[A]' can be thought of as a Scala function of type 'Function[A, Boolean]'. It turns out 'scala.collection.Set' extends '(A => Boolean)' so it literally is one of those! Clearly I'm not the only one who likes the idea of sets as functions.

This stackoverflow answer does, however, skirt over something I had difficulty convincing myself of:

>due to the contravariance of functions

Specifically, functions are contravariant in their input parameters. This might be obvious to some - it's something we take advantage of every day. Indeed the [Scala doc] of function types indicate as much. I'll leave that discussion for another time. For now, let's take it as a given. To state what that means explicitly: if 'A >: B' then 'Function[A, Boolean] <: Function[B, Boolean]'.

I think we could now construct a pseudo-proof of Scala 'Set' invariance. Forgive my questionable use of symbols in the following:

Let 'A >: B'. Treating sets as functions, we can say 'Set[A] = Function[A, Boolean]' and 'Set[B] = Function[B, Boolean]'[^1]. But since functions are contravariant, it follows that 'Set[A] <: Set[B]'. This means sets are at best contravariant (at best meaning they could also be invariant, since subtyping is an identity relation [^2]). Now we can make an assumption that sets are covariant, that is 'Set[A] >: Set[B]'. This is a contradiction, since covariance and contravariance are mutually exclusive. We've [proved by contradiction] that sets cannot be covariant. What about contravariance? This also seems counterintuitive in a data structure sense. It means we could make the assignment 'val bs: Set[B] = Set[A](new C{})' but when we try to read out the element as a 'B' we would get a class cast exception. We can reinforce this intuition more rigorously - a set has the requirement of implementing Iterable which is covariant. This rules out contravariance. The best we can settle for is invariance.

Ok, now I'm convinced that Scala sets (and set implementations in general) should probably be invariant. In terms of the animal example, as mentioned previously a nice solution might be to use an explicit bound type parameter to enforce the requirement of a single type of animal. The signature would be - 'def deserialiseAnimals[T <: Animal](...): Set[T]'. That's assuming we can't restructure the API in a better way, which was impractical in this particular case.

Anyway, onwards and upwards.

 

Why did it compile before?

Recall that we had:

val mammals: Set[Mammal] = deserialiseMammals(jsonBlob)
    mammals.map(x => mammalWithName(x))

When I refactored this code, it no longer compiled. What's the difference? The 'map' function seems to be the only difference_. To the REPL_:

scala> val animal: Set[Animal] = Set[Mammal](new Mammal{})  
<console>:13: error: type mismatch;  
 found   : scala.collection.immutable.Set[Mammal]
 required: Set[Animal]
Note: Mammal <: Animal, but trait Set is invariant in type A.  

Ok fair, we expect that now. We know Scala sets are invariant. What about:

scala> val animal: Set[Animal] = Set[Mammal](new Mammal{}).map(x => x)  
animal: Set[Animal] = Set($anon$1@6793f752)  

Aha! The 'map' function appears to "break" the invariance of sets in Scala. Those better versed in Scala than I will immediately know the reason for this. Let's take a look at the signature the map used by 'Set':

def map[B, That](f: A => B)(implicit bf: CanBuildFrom[This, B, That]): That

A cheeky implicit. [This] blog post by a Scala library developer gives a nice overview of how 'CanBuildFrom' works, and why they're engineering it out of Scala 2.13. In summary, it provides a way for the compiler to "choose" which type to return from a function.

In our case, 'A' is 'Mammal', 'B' is 'Mammal', 'This' is 'Set[Mammal]' and 'That' is 'Set[Animal]'. Our function 'f' is of type '(Mammal => Mammal)'. [Here's the source code] for that 'map' function. The function 'f' is iteratively applied to get us some 'B's and the 'bf' creates a builder which takes those elements of type 'B' and builds a 'That' out of them. Let's convince ourselves that this is what's actually happening in the REPL - you never know with implicits.

First we need to get our hands on a 'CanBuildFrom' instance. It's not totally obvious where the compiler is finding the in-scope implicit 'bf'. After a bit of digging it can be traced up through the 'Set''s trait hierarchy. We end up finding an instance exposed through a public function in 'scala.collection.generic.GenTraversableFactory' [^3]:

def ReusableCBF: GenericCanBuildFrom[Nothing] = ...  

This is used in various places to create 'CanBuildFrom' instances, for example on the 'GenTraversable' companion object [^4]:

implicit def canBuildFrom[A] = 
ReusableCBF.asInstanceOf[GenericCanBuildFrom[A]]  

This looks like what we need in the REPL. Let's try to hack an instance together and explicitly call 'map' with the 'CanBuildFrom' instance:

scala> import scala.collection._  
import scala.collection._

scala> import scala.collection.generic._  
import scala.collection.generic._

scala> val animalCbf = GenIterable.ReusableCBF.asInstanceOf[CanBuildFrom[Set[Mammal], Mammal, Set[Animal]]]  
animalCbf: scala.collection.generic.CanBuildFrom[scala.collection.Set[Mammal],Mammal,scala.collection.Set[Animal]] = scala.collection.generic.GenTraversableFactory$$anon$1@1132baa3

scala> val animals: Set[Animal] = Set[Mammal](new Mammal {}).map(x => x)(animalCbf)  
animals: scala.collection.Set[Animal] = Set($anon$1@7d15c513)  

Nice. Presumably that's similar to what's happening at runtime in the real code, except that the implicit is resolved by the compiler beforehand. Now let's fudge the 'CanBuildFrom' by changing the 'That' to 'Set[Mammal]':

scala> val mammalCbf = GenIterable.ReusableCBF.asInstanceOf[CanBuildFrom[Set[Mammal], Mammal, Set[Mammal]]]  
mammalCbf: scala.collection.generic.CanBuildFrom[scala.collection.Set[Mammal],Mammal,scala.collection.Set[Mammal]] = scala.collection.generic.GenTraversableFactory$$anon$1@1132baa3

scala> val animals: Set[Animal] = Set[Mammal](new Mammal {}).map(x => x)(mammalCbf)  
<console>:26: error: type mismatch;  
 found   : scala.collection.generic.CanBuildFrom[scala.collection.Set[Mammal],Mammal,scala.collection.Set[Mammal]]
 required: scala.collection.generic.CanBuildFrom[scala.collection.Set[Mammal],Mammal,scala.collection.Set[Animal]]

That proves it! The reason the code compiled before is that the compiler was resolving an implicit 'CanBuildFrom' variable somewhere in scope, which enabled it to choose the parent type 'Animal' in place of its subtype 'Mammal'. If we explicitly call 'map' with a different 'CanBuildFrom' the code doesn't compile.

It's interesting to get a glimpse into what the Scala compiler does behind the scenes. I'd be curious to know whether the author of the code I was refactoring knew that there was an implicit being pulled in, without which their code would not compile. As of Scala 2.13, code like that will continue to compile, but the means by which it's achieved will be completely different. I'll point you again to [this] blog post for more details on that subject.

Well, that was fun. Forgive me that this particular Scala snack was not so much a snack as a hearty breakfast. At least I learnt some things.

Here are a couple more of our Scala blogs for those interested:

Citations:

[^1] Probably what this "=" really means is something like sets and functions are isomorphic, meaning that there's a bijection between the set of all sets and the set of all functions. For each set there exists a function that can entirely describe the contents of the set, and vice versa. Any proper Mathematicians out there can validate whether that statement is remotely correct! 

[^2] For example see this formal mathematical definition. 

[^3] Source code: https://github.com/scala/scala/blob/v2.12.1/src/library/scala/collection/generic/GenTraversableFactory.scala#L45 

[^4] Source code: https://github.com/scala/scala/blob/v2.12.1/src/library/scala/collection/GenTraversable.scala#L31 '

 
This article was written by Daniel Tattan-Birch and posted originally on engineering.footballradar.com