Load classification test data into sparse vectors in Apache Spark

I have a classification model in Spark MLlib which was built using training data. Now I would like to use it to predict unlabeled data.

I have my features (without the labels) in LIBSVM format. This is a sample of what my unlabeled data looks like:

1:1  18:1
4:1  32:1
2:1  8:1  33:1
1:1  6:1  11:1
1:1  2:1  8:1  28:1

I have these features saved in a text file on HDFS. How can I load them into an RDD[Vector] so that I can pass them to model.predict()?

I use Scala for coding.

Thanks.

Answers


Here is a solution that assumes the indices are one-based and in ascending order.

Let's create some dummy data similar to the contents of your text file.

val data = sc.parallelize(Seq("1:1  18:1", "4:1  32:1", "2:1  8:1  33:1", "1:1  6:1  11:1", "1:1  2:1  8:1  28:1"))

We can now transform the data into an RDD of (indices, values) pairs.

val parsed = data.map(_.trim).map { line =>
  val items = line.split(' ')
  val (indices, values) = items.filter(_.nonEmpty).map { item =>
    val indexAndValue = item.split(':')
    val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
    val value = indexAndValue(1).toDouble
    (index, value)
  }.unzip

  (indices.toArray, values.toArray)
}
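
The parsing above relies on the indices in each line already being ascending, and Vectors.sparse documents its index array as needing to be strictly increasing. If the input file does not guarantee that order, a variant along the following lines might help. This is only a sketch: `parseLine` is a hypothetical helper name, not part of Spark, and it assumes every item has the form index:value.

```scala
// Parse one unlabeled LIBSVM-style line (e.g. "1:1  18:1") into
// 0-based indices and their values, sorted by index.
// `parseLine` is a hypothetical helper, not a Spark API.
def parseLine(line: String): (Array[Int], Array[Double]) = {
  val pairs = line.trim
    .split("\\s+")                      // tolerate runs of spaces or tabs
    .filter(_.nonEmpty)                 // an empty line yields no items
    .map { item =>
      val Array(index, value) = item.split(':')
      (index.toInt - 1, value.toDouble) // convert 1-based indices to 0-based
    }
    .sortBy(_._1)                       // enforce ascending index order
  pairs.unzip
}
```

With such a helper, the parsing step could presumably be written as data.map(parseLine).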

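Get the number of features, that is, the largest index seen plus one. Because the indices are ascending, the last index of each line is its largest: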

val numFeatures = parsed.map { case (indices, values) => indices.lastOption.getOrElse(0) }.reduce(math.max) + 1
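
One caveat: inferring the size from the data being predicted only gives the right answer if that data happens to contain the highest feature index used in training. A model generally expects vectors of the same size as the ones it was trained on, so when the training dimension is known it is probably safer to use it directly. A minimal sketch of that choice, where `chooseNumFeatures` and `trainingNumFeatures` are hypothetical names introduced here for illustration:

```scala
// Pick the vector size, preferring the training dimension when it is known.
// `chooseNumFeatures` and `trainingNumFeatures` are hypothetical names.
def chooseNumFeatures(maxZeroBasedIndex: Int, trainingNumFeatures: Option[Int]): Int = {
  val inferred = maxZeroBasedIndex + 1 // size implied by the data itself
  trainingNumFeatures match {
    case Some(n) =>
      // Fail early if the data refers to a feature the model has never seen.
      require(inferred <= n, s"data needs $inferred features but the model was trained with $n")
      n
    case None => inferred
  }
}
```

For the sample data, chooseNumFeatures(32, None) falls back to the inferred size of 33, while passing Some(40) would yield 40.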

And finally, create the vectors:

import org.apache.spark.mllib.linalg.Vectors
val vectors = parsed.map { case (indices, values) => Vectors.sparse(numFeatures, indices, values) }

vectors.take(10) foreach println
// (33,[0,17],[1.0,1.0])
// (33,[3,31],[1.0,1.0])
// (33,[1,7,32],[1.0,1.0,1.0])
// (33,[0,5,10],[1.0,1.0,1.0])
// (33,[0,1,7,27],[1.0,1.0,1.0,1.0])
