Connecting...

W1siziisimnvbxbpbgvkx3rozw1lx2fzc2v0cy9zawduawz5lxrly2hub2xvz3kvanbnl2jhbm5lci1kzwzhdwx0lmpwzyjdxq

Pandas vs. Spark: how to handle dataframes (Part II) by Emma Grimaldi

W1siziisijiwmtkvmdqvmjqvmtuvndevndqvnji2l3blegvscy1wag90by0xntqxndeuanblzyjdlfsiccisinrodw1iiiwiotawedkwmfx1mdazzsjdxq

How does handling dataframes compare between using Python and Scala?

Data Analyst, Emma Grimaldi investigates the difference between Panda and Spark, how does it read, display and more!

 

'A few days ago I published a post comparing the basic commands of Python and Scala: how to deal with lists and arrays, functions, loops, dictionaries and so on. As I continue practicing with Scala, it seemed appropriate to follow-up with a second part, comparing how to handle dataframes in the two programming languages, in order to get the data ready before the modeling process. In Python, we will do all this by using Pandas library, while in Scala we will use Spark.

For this exercise, I will use the Titanic train dataset that can be easily downloaded at this link. Also, I do my Scala practices in Databricks: if you do so as well, remember to import your dataset first by clicking on Data and then Add Data.

 

1. Read the dataframe

I will import and name my dataframe df, in Python this will be just two lines of code. This will work if you saved your train.csv in the same folder where your notebook is.

Scala will require more typing.

Let’s see what’s going on up here. Scala does not assume your dataset has a header, so we need to specify that. Also, Python will assign automatically a dtype to the dataframe columns, while Scala doesn’t do so, unless we specify '.option("inferSchema", "true")'. Also notice that I did not import Spark Dataframe, because I practice Scala in Databricks, and it is preloaded. Otherwise we will need to do so.

Notice: booleans are capitalized in Python, while they are all lower-case in Scala!

2. Display the first rows of the dataframe

In Python, 'df.head()' will show the first five rows by default: the output will look like this.
df.head() output in Python.
 

If you want to see a number of rows different than five, you can just pass a different number in the parenthesis. Scala, with its 'df.show()',will display the first 20 rows by default.

df.show() in Scala.
 
If we want to keep it shorter, and also get rid of the ellipsis in order to read the entire content of the columns, we can run 'df.show(5, false)'.
 

3. Dataframe Columns and Dtypes

To retrieve the column names, in both cases we can just type 'df.columns': Scala and Pandas will return an Array and an Index of strings, respectively.

If we want to check the dtypes, the command is again the same for both languages: 'df.dtypes'. Pandas will return a Series object, while Scala will return an Array of tuples, each tuple containing respectively the name of the column and the dtype. So, if we are in Python and we want to check what type is the Age column, we run 'df.dtypes['Age']', while in Scala we will need to filter and use the Tuple indexing: 'df.dtypes.filter(colTup => colTup._1 == "Age")'.

df.dtypes in Python

 

4. Summary Statistics

This is another thing that every Data Scientist does while exploring his/her data: summary statistics. For every numerical column, we can see information such as count, mean, median, deviation, so on and so forth, to see immediately if there is something that doesn’t look right. In both cases this will return a dataframe, where the columns are the numerical columns of the original dataframe, and the rows are the statistical values.

In Python, we type 'df.describe()', while in Scala 'df.describe().show()'. The reason we have to add the '.show()' in the latter case, is because Scala doesn’t output the resulting dataframe automatically, while Python does so (as long as we don’t assign it to a new variable).

 

5. Select Columns

Suppose we want to see a subset of columns, for example Name and Survived.

In Python we can use either 'df[['Name','Survived]]' or 'df.loc[:,['Name','Survived]' indistinctly. Remember that the ':' in this case means “all the rows”.

In Scala, we will type 'df.select("Name","Survived").show()'. If you want to assign the subset to a new variable, remember to omit the '.show()'.

 

6. Filtering

Let’s say we want to have a look at the Name and Pclass of the passengers who survived. We will need to filter a condition on the Survived column and then select the the other ones.

In Python, we will use '.loc' again, by passing the filter in the rows place and then selecting the columns with a list. Basically like the example above but substituting the ':' with a filter, which means 'df.loc[df['Survived'] == 1, ['Name','Pclass']]'.

In Scala we will use '.filter' followed by '.select', which will be 'df.filter("Survived = 1").select("Name","Pclass").show()'.

 

6.1. Filtering null values

If we want to check the null values, for example in the Embarked column, it will work like a normal filter, just with a different condition.

In Python, we apply the '.isnull()' when passing the condition, in this case 'df[df['Embarked'].isnull()]'. Since we didn’t specify any columns, this will return a dataframe will all the original columns, but only the rows where the Embarked values are empty.

In Scala, we will use '.filter' again: 'df.filter("Embarked IS NULL").show()'. Notice that the boolean filters we pass in Scala, kind of look like SQL queries.

 

7. Imputing Null Values

We should always give some thought before imputing null values in a dataset, because it is something that will influence our final model and we want to be careful with that. However, just for demonstrative purposes, let’s say we want to impute the string “N/A” to the null values in our dataframe.

We can do so in Python with either 'df = df.fillna('N/A')' or 'df.fillna('N/A', inplace = True)'.

In Scala, quite similarly, this would be achieved with 'df = df.na.fill("N/A")'. Remember to not use the '.show()' in this case, because we are assigning the revised dataframe to a variable.

 

8. Renaming Columns

This is something that you will need to for sure in Scala, since the machine learning models will need two columns named features and label in order to be trained. However, this is something you might want to do also in Pandas if you don’t like how a column has been named, for example. For this purpose, we want to change the Survived column into label.

In Python we will pass a dictionary, where the key and the value are respectively the old and the new name of the column. In this case, it will be 'df.rename(columns = {"Survived": "label"}, inplace = True)'.

In Scala, this equals to 'df = df.withColumnRenamed("Survived", "label")'.

 

9. Group By and Aggregation

Let’s say we want to calculate the maximum Age for men and women distinctively, in this case '.groubpby' comes in handy. Not only to retrieve the maximum value; after '.groupby' we can use all sorts of aggregation functions: mean, count, median, so on and so forth. We stick with '.max()' for this example.

In Python this will be 'df.groupby('Sex').mean()['Age']'. If we don’t specify '['Age']' after '.mean()', this will return a dataframe with the maximum values for all numerical columns, grouped by Sex.

In Scala, we will need to import the aggregation function we want to use, first.

10. Create a New Column

This is really useful for feature engineering, we might want to combine two variables to see how their interaction is related to the target. For purely demonstrative purpose, let’s see how to create a column containing the product between Age and Fare.

In Python it is pretty straightforward.

df['Age_times_Fare'] = df['Age'] * df['Fare']
In Scala, we will need to put '$' before the names of the columns we want to use, so that the column object with the corresponding name will be considered.
df = df.withColumn("AgeTimesFare", $"Age" * $"Fare")

 

11. Correlation

Exploring correlation among numerical variables and target is always convenient, and obtaining a matrix of correlation coefficients among all numeric variables is pretty easy in Python, just by running 'df.corr()'. If you want to look at the correlation, let’s say between Age and Fare, we will just need to specify the columns: 'df[['Age','Fare']].corr()'.

In Scala, we will need to import first, and then run the command by specifying the columns.

 

This article was written by Emma Grimaldi and posted originally on Medium.