Tuesday, April 21, 2015

Type mismatch with identical types in Spark-shell

I have built a scripting workflow around the spark-shell, but I'm often vexed by bizarre type mismatches (probably inherited from the Scala REPL) occurring with identical found and required types. The following example illustrates the problem. Executed in paste mode, there is no problem:

scala> :paste
// Entering paste mode (ctrl-D to finish)


import org.apache.spark.rdd.RDD
case class C(S:String)
def f(r:RDD[C]): String = "hello"
val in = sc.parallelize(List(C("hi")))
f(in)

// Exiting paste mode, now interpreting.

import org.apache.spark.rdd.RDD
defined class C
f: (r: org.apache.spark.rdd.RDD[C])String
in: org.apache.spark.rdd.RDD[C] = ParallelCollectionRDD[0] at parallelize at <console>:13
res0: String = hello

but when the same call is repeated afterwards at the regular prompt:

scala> f(in)
<console>:29: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[C]
 required: org.apache.spark.rdd.RDD[C]
              f(in)
                ^ 

There are related discussions about the Scala REPL and about the spark-shell, but the issues mentioned there seem unrelated (and already resolved) to me.

This causes serious problems for writing code that can be pasted and executed interactively in the REPL, and it defeats much of the advantage of working in a REPL to begin with. Is there a solution? (And/or is it a known issue?)
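One workaround I know of is to take the case class out of the REPL altogether: compile it (together with the functions that use it) into a jar and pass that jar to the shell with --jars. The class then has a stable fully qualified name, so every REPL line sees the same type. This is only a sketch, under the assumption that the mismatch comes from the REPL wrapping each pasted block in its own synthetic object; the file, package, and object names below are hypothetical:

// Models.scala -- compiled separately and packaged into models.jar
// (hypothetical names, for illustration only)
package mymodels

import org.apache.spark.rdd.RDD

// C now lives at the stable path mymodels.C instead of inside a
// per-line REPL wrapper, so repeated references resolve identically.
case class C(S: String)

object Helpers {
  def f(r: RDD[C]): String = "hello"
}

and then, inside a shell started as spark-shell --jars models.jar:

scala> import mymodels._
scala> Helpers.f(sc.parallelize(List(C("hi"))))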

Edits:

The problem occurred with Spark 1.2 and 1.3.0. The test above was made on Spark 1.3.0 using Scala 2.10.4.

It seems that, at least in this test, repeating the statements that use the class in a paste block separate from the case class definition mitigates the problem:

scala> :paste
// Entering paste mode (ctrl-D to finish)


def f(r:RDD[C]): String = "hello"
val in = sc.parallelize(List(C("hi1")))

// Exiting paste mode, now interpreting.

f: (r: org.apache.spark.rdd.RDD[C])String
in: org.apache.spark.rdd.RDD[C] = ParallelCollectionRDD[1] at parallelize at <console>:26

scala> f(in)
res2: String = hello
