Tuesday, April 21, 2015

Scala/Spark efficient partial string match

I am writing a small Spark program in Scala and have run into a problem. I have a List/RDD of single-word strings and a List/RDD of sentences that may or may not contain words from the first list, e.g.:

val singles = Array("this", "is")
val sentence = Array("this Date", "is there something", "where are something", "this is a string")

and I want to select the sentences that contain one or more of the words from singles, so that the result looks something like:

output: Array((this, Array(this Date, this is a string)), (is, Array(is there something, this is a string)))

I thought of two approaches: one splits each sentence and filters using .contains; the other splits and reformats the sentences into an RDD of pairs and uses .join for the RDD intersection. I am looking at around 50 single words and 5 million sentences, so which method would be faster? Are there any other solutions? Could you also help me with the coding? My code compiles and runs without error, but I get no results.
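A minimal sketch of the first approach, assuming the standard Spark RDD API (object and variable names here are illustrative, not from the question): since the word list is tiny (~50 entries) compared to the 5 million sentences, it can be broadcast to the executors and each sentence flatMapped to (word, sentence) pairs, avoiding a full RDD join.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartialMatch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partial-match").setMaster("local[*]"))

    val singles = Array("this", "is")
    val sentences = sc.parallelize(Array(
      "this Date", "is there something", "where are something", "this is a string"))

    // The word list is small, so ship it to every executor as a broadcast
    // variable instead of joining two RDDs.
    val words = sc.broadcast(singles.toSet)

    // Tokenize on whitespace before matching: calling .contains on the raw
    // sentence string would wrongly match "is" inside "this".
    val matches = sentences
      .flatMap { s =>
        s.split("\\s+").toSet.intersect(words.value).map(w => (w, s))
      }
      .groupByKey()

    matches.collect().foreach { case (w, ss) =>
      println(s"$w -> ${ss.mkString(", ")}")
    }
    sc.stop()
  }
}
```

Splitting into tokens rather than using raw substring .contains is likely why a naive version returns surprising results; the broadcast-and-flatMap pattern also tends to be cheaper than a join when one side is this small, though that is a rule of thumb rather than a guarantee.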
