Thursday, March 19, 2015

Apache Spark checkpoint issue on Windows

"To keep track of the log statistics for all of time, state must be maintained between processing RDD's in a DStream.

To maintain state for key-pair values, the data may be too big to fit in memory on one machine - Spark Streaming can maintain the state for you. To do that, call the updateStateByKey function of the Spark Streaming library.

First, in order to use updateStateByKey, checkpointing must be enabled on the streaming context. To do that, just call checkpoint on the streaming context with a directory to write the checkpoint data." (from http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html)
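To make the quote a bit more concrete, here is a rough plain-Java sketch of the kind of update function that updateStateByKey expects: it receives the new values for a key plus the previous state and returns the new state. The class name RunningTotalSketch and the use of java.util.Optional are my own assumptions to keep the sketch free of dependencies; the Spark Java API uses its own function and Optional types, so treat this as an illustration of the idea only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

// Plain-Java stand-in for the function passed to updateStateByKey.
// java.util.Optional is used only to keep the sketch dependency-free;
// it is an assumption, not the type the Spark API actually uses.
class RunningTotalSketch {

    // Adds the values that arrived in the current batch to the previous total.
    static final BiFunction<List<Long>, Optional<Long>, Optional<Long>> UPDATE_TOTAL =
        (newValues, previousTotal) -> {
            long total = previousTotal.orElse(0L);
            for (long value : newValues) {
                total += value;
            }
            return Optional.of(total);
        };

    public static void main(String[] args) {
        // First batch: there is no previous state for the key yet.
        Optional<Long> state = UPDATE_TOTAL.apply(Arrays.asList(1L, 2L), Optional.empty());
        // Second batch: the previous state is carried over.
        state = UPDATE_TOTAL.apply(Arrays.asList(4L), state);
        System.out.println(state.get()); // prints 7
    }
}
```

The state carried between batches is exactly what has to be written to the checkpoint directory, which is why a broken checkpoint breaks updateStateByKey.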

When you enable checkpointing for your streaming context on Windows with ssc.checkpoint(<PATH_TO_DIRECTORY>); you may get the following error messages:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
...
Exception in thread "pool-8-thread-1" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)


Also, if you look inside the checkpoint directory, you will find some files created during the stream processing, but these files are empty. They are supposed to store the state information (they act as the checkpoint); without this data, updateStateByKey cannot work properly.
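The "null" in null\bin\winutils.exe suggests that no Hadoop home directory was configured at all, so the path to winutils.exe was built from a null value. Below is a small diagnostic sketch of that idea. It assumes the lookup order is the hadoop.home.dir system property first and then the HADOOP_HOME environment variable; the class and method names (WinutilsCheck, resolveHadoopHome, expectedWinutilsPath) are hypothetical and are not part of Hadoop.

```java
import java.io.File;

// Diagnostic sketch: reports where winutils.exe is expected for a given Hadoop home.
class WinutilsCheck {

    // Resolves the Hadoop home in the order this post assumes:
    // the hadoop.home.dir system property first, then HADOOP_HOME.
    static String resolveHadoopHome(String systemProperty, String environmentVariable) {
        return systemProperty != null ? systemProperty : environmentVariable;
    }

    // Builds the expected location of winutils.exe; a null home reproduces
    // the "null\bin\winutils.exe" text seen in the error message.
    static String expectedWinutilsPath(String hadoopHome) {
        return hadoopHome + "\\bin\\winutils.exe";
    }

    public static void main(String[] args) {
        String home = resolveHadoopHome(
            System.getProperty("hadoop.home.dir"), System.getenv("HADOOP_HOME"));
        boolean found = home != null
            && new File(new File(home, "bin"), "winutils.exe").isFile();
        System.out.println(expectedWinutilsPath(home) + (found ? " found" : " NOT found"));
    }
}
```

Running it on the affected machine before applying the fix should show whether the Hadoop home is missing or simply points to the wrong folder.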

To solve this issue, you need to:



1) Download winutils.exe and save it to local storage:

  • for Win32 (x86) you can find it using the links below:

https://repo.rrd-hadoop-win32.googlecode.com/archive/f54eb586ddb66d3a938033bf3d9272a832b8e201.zip
https://code.google.com/p/rrd-hadoop-win32/source/checkout
Download the zip file, extract it, and extract the jar files inside it as well.

  • for Win64 (x64):

http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip
There are also discussions about winutils on Stack Overflow and MSDN.

It's very important that the version of winutils matches your OS. In production code, don't forget to include a condition that picks the right version of winutils for the target platform.
If the version of winutils isn't compatible with the target platform, you can get the following message:

CreateProcess error=216, This version of %1 is not compatible with the version of Windows you're running. Check your computer's system information to see whether you need a x86 (32-bit) or x64 (64-bit) version of the program, and then contact the software publisher
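The condition mentioned above could look roughly like the sketch below. It relies on the standard Windows environment variables PROCESSOR_ARCHITECTURE and PROCESSOR_ARCHITEW6432 (the second one is set when a 32-bit process runs on 64-bit Windows). The class WinutilsSelector, its methods, and the two folder names are hypothetical; both folders are assumed to have been prepared beforehand, each with its own bin\winutils.exe.

```java
import java.io.File;

// Sketch: choose between a 32-bit and a 64-bit winutils folder at startup.
class WinutilsSelector {

    // True when Windows itself is x64, even if the JVM is a 32-bit process.
    static boolean isX64Windows(String processorArchitecture, String processorArchitew6432) {
        return "AMD64".equalsIgnoreCase(processorArchitecture)
            || "AMD64".equalsIgnoreCase(processorArchitew6432);
    }

    // Picks the Hadoop home that holds the matching bin\winutils.exe.
    static File selectHadoopHome(boolean x64, File home32, File home64) {
        return x64 ? home64 : home32;
    }

    public static void main(String[] args) {
        boolean x64 = isX64Windows(
            System.getenv("PROCESSOR_ARCHITECTURE"),
            System.getenv("PROCESSOR_ARCHITEW6432"));
        // Hypothetical folders; replace with the real locations.
        File home = selectHadoopHome(
            x64, new File("C:\\hadoop-x86"), new File("C:\\hadoop-x64"));
        System.out.println("Hadoop home for this machine: " + home);
    }
}
```

I check the environment variables rather than the os.arch system property because os.arch describes the JVM, not the operating system.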

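2) Set the environment variable "HADOOP_HOME" to the folder which contains bin/winutils.exe (the parent folder, not the bin folder itself), or set the equivalent system property:
- option A: use global environment variables. My Computer -> Properties -> Advanced system settings -> Environment variables
- option B: from your source code, call System.setProperty("hadoop.home.dir", <PATH_TO_WINUTILS>), where <PATH_TO_WINUTILS> is the same folder that contains bin/winutils.exe, not the path to the exe file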
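For option B, a minimal sketch could look like this. The helper HadoopHomeSetup.configureHadoopHome and the C:\hadoop folder are hypothetical; the only real call here is System.setProperty("hadoop.home.dir", ...). As far as I can tell, the property has to be set before the streaming context is created, so I would put it at the very start of main.

```java
import java.io.File;

// Sketch: set hadoop.home.dir from code, but only if winutils.exe is really there.
class HadoopHomeSetup {

    // Points Hadoop at the given folder if it contains bin/winutils.exe;
    // returns true when the property was set.
    static boolean configureHadoopHome(File hadoopHome) {
        File winutils = new File(new File(hadoopHome, "bin"), "winutils.exe");
        if (!winutils.isFile()) {
            return false;
        }
        System.setProperty("hadoop.home.dir", hadoopHome.getAbsolutePath());
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical location; replace with the folder that contains bin\winutils.exe.
        File hadoopHome = new File("C:\\hadoop");
        if (!configureHadoopHome(hadoopHome)) {
            System.err.println("winutils.exe not found under " + hadoopHome);
        }
        // Create the streaming context and call ssc.checkpoint(...) after this point.
    }
}
```

Checking for the file first gives a clear message instead of the NullPointerException shown earlier.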

I hope this post will save somebody's time.

3 comments:

  1. Hi,

    I could not find Win32 (x86) version of Winutils.exe on links published. Please help if you have them. I need them for Spark installation on Windows 7 32 bit machine.

    Thanks,
    Vineet

  2. Thanks Victor, yes, you saved me some time! :)

  3. Thanks a lot, It really saved my time.
