Progressive histograms with Spark-SQL -


i wrote little scala+spark code reads large csv file , calculates of column pair histograms (separate histogram (col1, col1), (col1, col2), (col1, col3), etc.), , stores results progressively on database (redis) other applications use.

can similar spark-sql? i'm able data dataframe , calculate histograms, i'm getting final results only.

the data can quite large i'd other applications able work on partial results while histograms being calculated (histograms of n rows n number of rows spark able process far).

edit: code example of have far:

val params = map("url" -> "jdbc:vertica:someaddress", "dbtable" -> "schema.mytable") val jdbc = sqlcontext.load("jdbc", params) val columns = jdbc.columns val columnpairs = scala.collection.mutable.arraybuffer[(string, string)]() (i <- 0 columns.size-1){     (j <- columns.size-1){         columnpairs += ((columns(i), columns(j)))     } } val histograms = columnpairs.map{case(first, second) =>      enter code herejdbc.groupby(first, second).count().collect()} histograms.foreach(histogram => mergetoredis(histogram)) 

this way column based, it'd finish 1 column pair , save database. i'd row based, take few rows, generate histograms possible pairs, merge database , move on next batch.


Comments