How does Apache Spark achieve a 100x speedup over Hadoop MapReduce and in what scenarios?
Apache Spark [http://spark.apache.org/] claims to be 100x faster than Apache Hadoop in memory. How does it achieve this phenomenal speedup? Is this speedup only applicable to iterative machine learning algorithms, or also to ETL (extract-transform-load) tasks like JOINs and GROUP BYs? Can both Spark's RDDs (Resilient Distributed Datasets) and DataFrames provide this kind of speedup? Are there any benchmark results from the Spark community for the scenarios described above?
- Spark does data processing in-memory.
- There are no intermediate files written to disk as in MapReduce, so disk I/O is negligible or absent.
- It does not run 100x faster in all scenarios, especially those involving joins and sorting.
- Because it is memory intensive, it can saturate a cluster quickly. You might be able to run one job 100x faster at a given point in time, but you will not be able to run as many concurrent jobs/applications as you can with the traditional Hadoop approach.
- RDDs and DataFrames are the internal data structures that come in handy for processing the data. RDDs are in-memory representations of the data itself, while DataFrames add column metadata (a schema) on top of RDDs. They are how data is represented in Spark.
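To make the "no intermediary files" point concrete, here is a minimal plain-Python sketch (not actual Spark or Hadoop code, and the word-count data is made up) contrasting a MapReduce-style pipeline, which spills intermediate results to disk between stages, with a Spark-style pipeline that chains stages in memory:

```python
import json
import os
import tempfile

# Toy input: count word occurrences in two stages (map, then reduce).
records = ["spark", "hadoop", "spark", "spark"]

def mapreduce_style(data):
    """Each stage writes its intermediate result to disk and the next
    stage reads it back, as Hadoop MapReduce does between jobs."""
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    # Stage 1: map each word to a (word, 1) pair, then spill to disk.
    with open(path, "w") as f:
        json.dump([(w, 1) for w in data], f)
    # Stage 2: read the pairs back from disk, then reduce.
    with open(path) as f:
        pairs = json.load(f)
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

def spark_style(data):
    """Stages are chained in memory; no intermediate file I/O."""
    pairs = [(w, 1) for w in data]   # map stage
    counts = {}
    for word, n in pairs:            # reduce stage
        counts[word] = counts.get(word, 0) + n
    return counts

print(mapreduce_style(records))  # -> {'spark': 3, 'hadoop': 1}
print(spark_style(records))      # same answer, but no disk round trip
```

Both functions compute the same result; the only difference is the disk round trip between stages, which is exactly the overhead Spark avoids.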
The problem is that most of these claims are not benchmarked against true production use cases. The test data may have volume, but not the quality that represents actual business applications. Spark can be very handy for streaming analytics, where you want to understand data in near real time. But for true batch processing, Hadoop can be the better solution, especially on commodity hardware.
Spark is faster than Hadoop due to in-memory processing, but the headline numbers come with caveats.
Spark still has to rely on HDFS for some use cases.
Spark is faster for real-time analytics due to in-memory processing; Hadoop is good for batch processing. If you are not worried about job latency, you can still use Hadoop.
But one thing is sure: Spark and Hadoop have to co-exist. Neither can replace the other.
Apache Spark processes data in-memory, while Hadoop MapReduce persists results back to disk after each map or reduce action. But Spark needs a lot of memory.
Spark loads data into memory and keeps it there until further notice, for the sake of caching.
Its core abstraction is the Resilient Distributed Dataset (RDD), which allows you to transparently store data in memory and persist it to disk when needed.
Since Spark works in memory, there is no synchronisation barrier between stages slowing you down. This is a major reason for Spark's performance.
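The caching behaviour described above can be sketched in a few lines of plain Python (the `CachedDataset` class is a hypothetical stand-in, not Spark's real `persist()`/`cache()` API): an expensive dataset is materialized once on the first action and then served from memory by every later action.

```python
class CachedDataset:
    """Toy model of RDD-style caching: compute once, reuse from memory."""

    def __init__(self, compute_fn):
        self._compute = compute_fn
        self._cache = None
        self.compute_count = 0   # how many times the expensive work ran

    def collect(self):
        if self._cache is None:          # first action: materialize
            self.compute_count += 1
            self._cache = self._compute()
        return self._cache               # later actions: reuse memory

squares = CachedDataset(lambda: [x * x for x in range(5)])
squares.collect()                 # triggers the computation
squares.collect()                 # served from the in-memory cache
print(squares.compute_count)      # -> 1
```

In real Spark the same idea is what makes iterative algorithms fast: the working set is computed once and every subsequent iteration reads it from memory instead of recomputing or re-reading from disk.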
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (pandas), but designed from the ground up as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed.
Using RDDs, Spark simplifies complex operations like join and groupBy; behind the scenes you are dealing with partitioned data. That partitioning is what enables Spark to execute in parallel.
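Here is a minimal plain-Python sketch of how partitioned data enables a parallel groupBy (the partitions and counts are made-up toy data, and a thread pool stands in for Spark executors): each partition is aggregated independently, and only the small partial results are then merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy data split into three partitions, as Spark would distribute it.
partitions = [
    ["a", "b", "a"],   # partition 0
    ["b", "c"],        # partition 1
    ["a", "c", "c"],   # partition 2
]

def partial_count(partition):
    """Per-partition (map-side) aggregation, runnable in parallel."""
    return Counter(partition)

# Aggregate every partition concurrently (executors, in real Spark).
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_count, partitions))

# Merge the partial results (the "shuffle"/combine step in real Spark).
total = Counter()
for p in partials:
    total.update(p)

print(dict(total))  # -> {'a': 3, 'b': 2, 'c': 3}
```

The key design point is that the heavy per-record work happens inside each partition with no coordination; only compact partial aggregates cross partition boundaries.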
Spark lets you develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
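The DAG pattern can be illustrated with a toy lazy-evaluation sketch in plain Python (the `Node` class is hypothetical, not Spark's real scheduler): transformations like `map` and `filter` only record nodes in a graph, and nothing executes until an action (`collect`) walks the whole chain.

```python
class Node:
    """One node in a toy DAG of transformations."""

    def __init__(self, fn, parent=None):
        self.fn, self.parent = fn, parent

    def map(self, fn):
        # Transformation: just records a new node, runs nothing yet.
        return Node(lambda data: [fn(x) for x in data], self)

    def filter(self, pred):
        return Node(lambda data: [x for x in data if pred(x)], self)

    def collect(self):
        # Action: walk back through the DAG and execute every stage.
        data = self.parent.collect() if self.parent else []
        return self.fn(data)

source = Node(lambda _: list(range(6)))
pipeline = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(pipeline.collect())  # -> [0, 4, 16]
```

Because the full graph is known before anything runs, a real engine like Spark can optimize it as a whole (pipelining stages, minimizing shuffles) rather than executing each step blindly as MapReduce does.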
Hope this helps.