Low CPU utilization in Spark

I'm running a Spark job in local mode on an 8-core machine with a local SSD and 64GB of RAM. HDFS runs in pseudo-distributed mode on the same machine. When I run the job below, CPU utilization never gets past a single maxed-out core. RAM usage stays under 10GB, the loopback interface peaks around 333MB/s, and disk I/O is typically under 30MB/s in either direction. How can I rewrite this to make better use of my hardware?

import org.apache.spark.{SparkConf, SparkContext}
import play.api.libs.json.{JsObject, Json}

object FilterProperty {
    def main(args: Array[String]) {
        val conf = new SparkConf()
            .setAppName("Filter Claims Data for Property")
            .setMaster("local")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.cores.max", "16")
        conf.registerKryoClasses(Array(classOf[JsObject]))
        val sc = new SparkContext(conf)
        // Read the raw claims, drop blank lines, parse each line as JSON,
        // and keep only the property claims.
        val filtered = sc.textFile("hdfs://localhost:9000/user/kevin/intermediate/claims.json", 48)
            .filter(s => s != "")
            .map(s => Json.parse(s).as[JsObject])
            .filter(Util.property_filter)
        filtered.saveAsTextFile("hdfs://localhost:9000/user/kevin/intermediate/property_claims.json")
        sc.stop()
    }
}

Answers


You should change this line of code

.setMaster("local")

to

.setMaster("local[*]")

local[*] tells Spark to use as many worker threads as there are logical cores on your machine. Alternatively, you can put a specific number in place of *, which means use that many threads.
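For example, the configuration from the question could look like this after the change (a minimal sketch: only the master setting is changed, everything else is kept as in the original, and local[8] in the comment is just an illustration of a fixed thread count):

import org.apache.spark.SparkConf
import play.api.libs.json.JsObject

// Same configuration as in the question, with only the master setting changed.
val conf = new SparkConf()
    .setAppName("Filter Claims Data for Property")
    .setMaster("local[*]")  // one worker thread per logical core; "local[8]" would use exactly 8 threads
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[JsObject]))

With 48 input partitions and 8 cores, this lets up to 8 tasks run in parallel instead of one at a time.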

