Monday, January 04, 2016

Configure Standalone Spark on Windows 10


Its been almost 2 years since I wrote a blog post. Hopefully the next ones will be much more frequent. This post is about my experice of setting up Spark as a standalone instance on Windows 10 64 bit machine. I got back to bit of programming after a long gap and it was quite evident that I struggledd a bit in configuring the system. Someone else coming from a .Net background and new to Java way of working might face similar difficulties tat I faced over a day to get Spark up and running.


What is Spark

Spark is an execution engine which is  gaining popularity due to its ability to perform in memory parallel processing. It claims to be upto 100 times more faster compared to Hadoop MapReduce processing methods. It also fits more in the distributed computing paradigm related to big data world. One of the positives of Spark is that it can be run in standalone mode without having to setup nodes in the cluster. This also means that we do not need to set up Hadoop cluster to get started with Spark. Spark is written in Scala & support Scala, Java, Python and R languages as of writing this post in January 2016. Currently it is one of the most popular projects among the different tools used as part of Hadoop ecosystem.


What is the problem in installing Spark in stand alone mode on Windows machine?

I started with downloading a copy of Spark distribution 1.5.2 Nov 9 2015 from the Apache website. I chose the version which is pre-built for Hadoop 2.6 and later. If you prefer you can also download the source code & build the whole package. After extracting the contents of the downloaded file, I tried running the Spark-shell command from the commnand prompt. If everything is installed successfully, we should get a Scala shell to execute our commands. Unfortunately on Windows 10 64 bit machine, Spark does not start very well. This seems to be a known issue as there are multiple resources on the internet which talk about it. 

When the Spark-shell command is executed, there are multiple errors which are reported on the console. The error which I received showed problems with creation of SqlContext. There was a big stack trace which was difficult to understand.

Personally this is one thing which I do not like about Java. In my past exxperience I always found it very difficult to debug issues as the error messages showed some error which may not be the correct source of the problem. I wish Java based tools and applications in future will be easier to deploy. In one sense it is good that it makes us aware of many of the internal things, but on the other hand sometimes you just want to install the stuff and get startedd with it without wanting to spend days configuring it.

I was referring to the Pluralsight course relted to Apache Spark fundamentals. The getting started and the installatioon module of the course was helpful in the first step to resolve the issue related to Spark. As suggested in the course, I changed the verbosity of the output for Spark from INFO to ERROR and the amount of info on the consoe reduced a lot. With this change, I was immediately able to get the error related to missing Winutils which is like a utility required specifically for Windows systems. This is reported as an issue SPARK-2356 in the Spark issue list. 

After copying the Winutils.exe file from the pluralsight course in the Spark installation’s bin folder, I was getting the permissions error for the tmp/Hive folder error. As reccommended in different online posts, I tried changing the permissions using chmod and setting it to 777. This did not seem to fix the issue. I tried running the command with administrative previlages. Still no luck.I updated the PATH environemnt variable to point to the Spark\bin directory. As suggested, I added the Spark_HOME, HADOOP_HOME to environment variables. Initially I had put the Winutils.exe file in the Spark/bin folder. I moved it out to dedicated directory named Winutils and updated the environemnt variable for HADOOP_HOME to this directory. Still no luck.

As many people had experienced the same problem with the latest version of Spark 1.5.2, I thought of trying an older version. Even in 1.5.1 I had the same issue. I went back to 1.4.2 version released in November 2014 and that seemed to create the SqlContext correctly. but the version is more than a year old, so there wass no point sticking to the outdated version. 

At this stage I was contemplating the option of getting the source code and building it from scratch. Having read in multiple posts about setting JAVA_HOME environment variable I thought of trying this apparoach. I downloaded the Java 7 SDK and created the environment variable to point to the location where jdk was installed. Even this did not solve the problem.


Use right version of Winutils

As a last option, I decided to download the Winutils.exe from a different source. In the downloaded contents, I got Winutils and some other dlls as well like Hadoop.dll as shown in the figure below.

Winutils with hadoop dlls

After putting these contents in the Winutils directory and running the Spark-shell command everything was in place and SqlContext was successfully created.

I am not really sure which step fixed the issue. Was it the jdk and setting of JAVA_HOME environment.Or was it the update of winutils exe along with other dll. All this setup was quite time consuming. Hope this is helpful for people trying to setup standalone instance of Spark on Windows 10 machines.

While I was trying to get Spark up & running, I found following links which might be helpful in case you face similar issues

The last one was really helpful from where I took the idea of separating Winutils exe into different folder and also to install JDK & Scala. But setting scala envirnment variables were not required as I was able to get the scala prompt without scala installation.


Following are the steps I followed for installing Standalone instance of Spark on Windows 10 64 bit machine

  • JDK (6 or higher version)
  • Download Spark distribution
  • Download correct version of Winutils.exe dll
  • Set Environment variables for JAVA_HOME, SPARK_HOME & HADOOP_HOME

Note : When running the chmod command to set 777 attributes for tmp/hive directory make sure to run the command prompt with Administrative privilages.

Submit Apache Spark job from IntelliJ to HDInsight cluster

Background This is 2nd part of the Step by Step guide to run Apache Spark on HDInsight cluster. In the first part we saw how to provision t...