Connecting...

W1siziisimnvbxbpbgvkx3rozw1lx2fzc2v0cy9zawduawz5lxrly2hub2xvz3kvanbnl2jhbm5lci1kzwzhdwx0lmpwzyjdxq

Symbols and Internalising Strings by Afsal Thaj

W1siziisijiwmtgvmtivmjevmdkvndcvmzuvnjkyl3blegvscy1wag90by03ndu3njeuanblzyjdlfsiccisinrodw1iiiwiotawedkwmfx1mdazzsjdxq

Are you looking to find out more on Symbol? Software Consultant Afsal Thaj mixes it up a bit and looks at multiple programming languages to help further our understanding of Symbol from the perspective of a much more simpler concept called String Intern.

 

'I am writing this with a hope that you would get a wide angle perception on the term “Symbol” in programming. I am deliberately mixing multiple programming languages to make our understanding better, and at-least one of them would make sense to you — if not entirely — intuitively! We are also going to explain Symbol from the perspective of a much more simpler concept called String Intern.

 

Before we talk about Symbol….

Before we explain Symbols, let’s get familiarised with the behaviour of 'strings' (instantiation, storage, etc.) in all these languages. 'let x be "afsal"' and it's a string type (not language specific). 'x' will be allocated a memory space 'M' and is identified by an 'id' (sometimes we call 'object_id'). Invoking 'x' is getting the contents of 'x' identified by 'id'.

Note: Most of you would guess this to be an intro to reference identity, reference equality and related topic. But, for the time being, let us restrict ourselves on to the terminologies mentioned above.

 

Some code in Python

Let us use Python here as an example of getting the 'id' of a variable. You may not see this feature in all languages. But the concept remains the same.

 

String with same contents share same id?

As seen above, the strings with the same content share the same id.

  • It means, when we tried to define y with the same content as that of x, the application looked up the heap/memory and identified that 'y' could reuse the contents of 'x'.
  • If 'y' needs to re-use 'x', obvious that they should have the same object_id. 'id (x) == id (y)'
  • No extra copy of 'y' is created, and we saved some space.
 

In Java/Scala?

  • In computer science, the behaviour is termed as string interning. That is, internalising the strings will ensure that all strings with same contents share the same memory.
  • One source of drawbacks is that string interning may be problematic when mixed with multithreading, but this discussion is out of scope.
  • In the above examples with python/ruby, the string x is internalized automatically, so is for Java/Scala. (However, you will find differences in behavior across these languages soon)
  • In java, String.intern() internalise the strings forcefully, but you don’t do this generally.
  • The intern() method returns a canonical representation of the string object.

Again, for demonstration I am using Scala console. I strongly recommend to view this example as a conceptual explanation of String interning, and it doesn’t intend to explain the various functions in Scala/Java and its differences. For the time being, all that you would need to know is 'eq' method in Scala verifies if two strings are pointing to the same memory location.

 

Automatic String interning: naive testing in Scala (Skip through if you don’t care)

Let’s do a simple test on the automatic interning of Strings. For simplicity, let’s use Scala console. Let’s create ten strings, each with 10000000 characters of ‘a’. All ten long strings have same contents, and ideally, they should share the same id.

 

Is interning a property of Strings?

Yes, it is a property of strings. You won’t find an 'intern' method for an Integer variable. However, you may note that ids are constant for certain primitives. Ex: 'id' of integer 1 is always the same.

Example: In python, it would look like this. You can verify the same in Ruby, and it behaves in the same manner.

 

Is this a consistent behaviour for Strings?

Surprisingly, the answer is NO. In java world, we say there is no guarantee that strings would be internalised and it depends on JVM’s whim, and probably the content itself. Python doesn’t intern strings with special characters for example.

Let’s analyse the consistency of this string interning in back to python/ruby (programming language does matter here). We will come to know String interning, being such a generalised term, behave in different ways in different languages.

In python:

 

No default string intern in Ruby?

In Ruby, you would see something known as object_id. It is the equivalent of 'id' in python. But as per documentation, the object_ids always differ for 2 active objects. Hence, the following code gives different 'object_id' (or 'id') for two variables with the same content 'afsal'. There is no string interning happening here!

 

How to intern strings in Ruby?

The answer is “Symbol”. To make it further simpler, call the method 'intern' for a string in Ruby, and you get a "Symbol" in return.

Now you must be wondering, can we solve the intern inconsistencies in any language using Symbols? Yes and No. Yes for languages who has specific Symbol type but we may not always use this in practice, and there are alternative methods in languages that don’t have symbols inbuilt. In other words, even if some languages don’t have the concept of Symbol, they have string internalizing, and there is one or other way of doing it. In Ruby/Java/Scala it is done by making use of symbols. In Python, use the method intern, and in Clojure, we could use keywords or symbols. Let us try to understand it better.

 

Symbol

The concept of symbol is not language specific, but they differ in some or other ways. Let us explore.

 

Symbol in Ruby

As mentioned in above examples, the way Ruby handles intern is by converting it into symbols. i.e, Symbol representation of String '"afsal"' is :'afsal'.

If you are not using symbols, every time you define “afsal”, Ruby instantiates a new String object, interpreter looks at the memory (heap) and allocate a new memory, assign a new object_id to keep track of the object, and interpreter marks it for destruction if not used often, resulting in the repetition of the whole steps next time the String “afsal” is defined. It can affect the performance — especially when it comes to massive datasets.

 

Some performance comparison:

Let us define a string 100000000 times and similarly a symbol 100000000 times, and see which one performs better:


Hurray! symbols are performing almost ~2 times faster than string.

 

Usage of symbols:

Many times symbols are used as identifiers.Example: Every method name in Ruby is saved as a symbol under the hood. Symbols are widely used as keys in your hash. By using symbols as keys, Ruby need to compare only the object_ids of the already stored key with the new ones, and not its contents/compute-hash-of-each-value. It could be used anywhere with-in your application.

 

Are symbols always better than strings?

Be aware using excessive use of Symbols results in lots of memory usage. The frequent casting of Symbols to Strings can also slow down your application. Said that memory leakage due to the usage of Symbols is not a concern in Ruby anymore (for version > 2.2) as there is symbol garbage collector now.

 

Symbol in Python

There is no python equivalent for Ruby’s symbols. However, as you have seen from the above examples, they are interned by default. We have also seen a few examples of strings that were not interned by default in Python. But we can force the intern of those strings using the intern function.

 

Symbols in Clojure

Clojurists have Strings, Symbols and keywords. This trichotomy may confuse many. It may partly make sense to you as we need String and Symbol.

According to documentation, the one which provides faster equality tests is given by keywords (i.e. :afsal). Also, you may observe that these keywords are not Symbols. As far as I can understand, a Symbol in Clojure/Lisp is mainly used to manipulate the function names, variables and also program forms (closely related to macros). We could use Symbols (‘afsal) to manipulate objects, but that is done less common in practice. Another difference is while Symbols are namespace qualified, keywords are not. However, we could explicitly qualify a keyword in Clojure to a particular namespace by giving an extra column

Although the difference is not explained in detail here, it is important to know at-least this much if you are into Clojure.

 

Conclusion

In dynamic languages, symbols are often used to identify things that have a stronger meaning than a string content, identifiers that are often used more than once. Moreover, in homoiconic languages like Clojure, where code can act as data, the programmer has control over manipulating functions and variables using Symbols to produce various custom behaviours. However, in statically typed languages we could argue that your comparison space is already restricted by types, and most of the times the homoiconic nature doesn’t exist. Hence, although symbols have the same meaning in the context of Java/Scala (i.e., guaranteed interning and faster equality operations), they are probably less used in practice when we compare with Ruby or Clojure.


 

This article was written by Afsal Thaj and posted originally on Medium