When you maintain state for key-value pairs, the data may be too big to fit in memory on one machine; in that case, Spark Streaming can maintain the state for you. To do that, call the updateStateByKey function of the Spark Streaming library.
First, in order to use updateStateByKey, checkpointing must be enabled on the streaming context. To do that, just call checkpoint on the streaming context with a directory to write the checkpoint data (paraphrased from http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html).
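Conceptually, updateStateByKey applies a user-supplied function to each key's new values and its previous state, producing the new state for that key. Ignoring Spark's distribution and checkpointing machinery, the per-key semantics can be sketched in plain Java (the class and method names here are illustrative, not Spark's API):

```java
import java.util.*;
import java.util.function.BiFunction;

public class StatefulWordCount {
    // Previous state per key, as updateStateByKey would maintain it across batches.
    public static Map<String, Integer> state = new HashMap<>();

    // The update function: (new values for a key, previous state) -> new state.
    // This mirrors the shape of the function passed to updateStateByKey.
    public static BiFunction<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunc =
        (newValues, prev) -> {
            int sum = prev.orElse(0);
            for (int v : newValues) sum += v;
            return Optional.of(sum);
        };

    // Apply the update function to one micro-batch of (key, values) pairs.
    public static void processBatch(Map<String, List<Integer>> batch) {
        for (Map.Entry<String, List<Integer>> e : batch.entrySet()) {
            Optional<Integer> prev = Optional.ofNullable(state.get(e.getKey()));
            updateFunc.apply(e.getValue(), prev).ifPresent(s -> state.put(e.getKey(), s));
        }
    }

    public static void main(String[] args) {
        processBatch(Map.of("spark", List.of(1, 1), "hadoop", List.of(1)));
        processBatch(Map.of("spark", List.of(1)));
        System.out.println(state.get("spark"));  // prints 3 (running count)
        System.out.println(state.get("hadoop")); // prints 1
    }
}
```

In real Spark the accumulated state is what gets written to the checkpoint directory, which is why checkpointing must be enabled before updateStateByKey can be used.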
When you enable checkpointing for your streaming context with ssc.checkpoint(<PATH_TO_DIRECTORY>), you may get the following error messages:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Exception in thread "pool-8-thread-1" java.lang.NullPointerException
Also, looking inside the checkpoint directory, you can find some files created at the moment of stream processing; however, these files are empty. These files store state information (they act as the checkpoint), and without this data updateStateByKey cannot work properly.
To solve this issue you need to:
1) Download winutils.exe and save it to local storage:
- for Win32 (x86) you can find it using the links below:
Download and extract the zip file, and extract the jar files as well
- for Win64 (x64):
See the discussions about winutils on Stack Overflow and MSDN
It's very important to use the winutils version that corresponds to your OS; don't forget about this in production code by including a condition that resolves the correct winutils version for the target platform.
If the version of winutils isn't compatible with the target platform, you may get the following message:
CreateProcess error=216, This version of %1 is not compatible with the version of Windows you're running. Check your computer's system information to see whether you need a x86 (32-bit) or x64 (64-bit) version of the program, and then contact the software publisher
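One way to implement such a condition is to branch on the architecture the JVM reports before pointing hadoop.home.dir at the matching winutils folder. A hedged sketch (the folder names are hypothetical placeholders, and note that os.arch reflects the JVM's bitness, not necessarily the OS's — a 32-bit JVM on 64-bit Windows reports x86):

```java
public class WinutilsSelector {
    // Returns which winutils build to use, "x64" or "x86", based on the
    // architecture the running JVM reports. Assumes a Windows x86/x64 target.
    public static String winutilsVariant() {
        String arch = System.getProperty("os.arch", "");
        // "amd64" / "x86_64" indicate a 64-bit JVM; "x86" / "i386" a 32-bit one.
        boolean is64 = arch.contains("64");
        return is64 ? "x64" : "x86";
    }

    public static void main(String[] args) {
        // Hypothetical layout: C:\hadoop\x64\bin\winutils.exe and
        // C:\hadoop\x86\bin\winutils.exe side by side.
        System.out.println("Use the " + winutilsVariant() + " winutils build");
    }
}
```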
2) Set the environment variable "HADOOP_HOME" to the folder which contains bin/winutils.exe
- option A: use global environment variables. My Computer -> Properties -> Advanced system settings -> Environment variables
- option B: from your source code, call System.setProperty("hadoop.home.dir", <PATH_TO_WINUTILS>) before the Spark context is created
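Option B in code looks like the sketch below. The path is a hypothetical placeholder; what matters is that hadoop.home.dir must point to the folder that contains bin\winutils.exe, not to the exe itself, and that the property is set before any Spark or Hadoop classes touch the filesystem:

```java
public class HadoopHomeSetup {
    public static void main(String[] args) {
        // "C:\\hadoop" is a hypothetical location with bin\winutils.exe inside it.
        // Set this before creating the SparkContext / StreamingContext.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");
        System.out.println(System.getProperty("hadoop.home.dir"));
    }
}
```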
I hope this post saves somebody some time.