I wrote a little Scala+Spark program that reads a large CSV file, calculates a histogram for each pair of columns (a separate histogram for (col1, col1), (col1, col2), (col1, col3), etc.), and progressively stores the results in a database (Redis) for other applications to use.
Can I do something similar with Spark SQL? I'm able to get the data into a DataFrame and calculate the histograms, but I only get the final results.
The data can be quite large, and I'd like other applications to be able to work on partial results while the histograms are still being calculated (histograms of the first N rows, where N is the number of rows Spark has been able to process so far).
Edit: here is a code example of what I have so far:
    val params = Map("url" -> "jdbc:vertica:someaddress", "dbtable" -> "schema.mytable")
    val jdbc = sqlContext.load("jdbc", params)
    val columns = jdbc.columns

    // Build every (col_i, col_j) pair with j >= i, including (col_i, col_i)
    val columnPairs = scala.collection.mutable.ArrayBuffer[(String, String)]()
    for (i <- 0 to columns.size - 1) {
      for (j <- i to columns.size - 1) {
        columnPairs += ((columns(i), columns(j)))
      }
    }

    val histograms = columnPairs.map { case (first, second) =>
      jdbc.groupBy(first, second).count().collect()
    }
    histograms.foreach(histogram => mergeToRedis(histogram))
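mergeToRedis is a small helper along these lines (sketched here with the Jedis client; the "localhost" address, the key/field layout, and the extra pair-name parameter are illustrative, not my exact code, since the collected rows only contain the two values and the count, not the column names):

    import org.apache.spark.sql.Row
    import redis.clients.jedis.Jedis

    // Illustrative merge helper: each column pair gets a Redis hash, each
    // (value1, value2) combination a hash field, and HINCRBY adds the batch's
    // count so repeated merges accumulate instead of overwriting.
    def mergeToRedis(pairName: String, histogram: Array[Row]): Unit = {
      val redis = new Jedis("localhost") // assumed Redis address
      histogram.foreach { row =>
        val field = s"${row.get(0)}|${row.get(1)}" // the two column values
        redis.hincrBy(s"hist:$pairName", field, row.getLong(2)) // the count
      }
      redis.close()
    }

With that signature the last line above would instead be called as mergeToRedis(s"$first:$second", jdbc.groupBy(first, second).count().collect()).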
The way it's written now it's column-pair based: it finishes one column pair and then saves it to the database. I'd like it to be row based: take a few rows, generate the histograms for all possible pairs, merge them into the database, and move on to the next batch. Something along the lines of the sketch below.
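This is only a sketch, and untested: it treats Spark partitions as the "batches", reuses the jdbc DataFrame and columns from the snippet above, and assumes Jedis plus a reachable Redis instance. Since HINCRBY is atomic, readers always see consistent, monotonically growing partial histograms:

    import org.apache.spark.sql.Row
    import redis.clients.jedis.Jedis

    // All (i, j) index pairs with j >= i, the same pairs as above
    val pairIndices = for {
      i <- columns.indices
      j <- i until columns.length
    } yield (i, j)

    jdbc.foreachPartition { rows: Iterator[Row] =>
      val redis = new Jedis("localhost") // assumed Redis address
      // Count value pairs locally for this partition first...
      val local = scala.collection.mutable.Map.empty[(String, String), Long]
      rows.foreach { row =>
        pairIndices.foreach { case (i, j) =>
          val key = (s"hist:${columns(i)}:${columns(j)}", s"${row.get(i)}|${row.get(j)}")
          local(key) = local.getOrElse(key, 0L) + 1L
        }
      }
      // ...then merge them with atomic increments, so every finished partition
      // immediately contributes its rows to every pair's histogram.
      local.foreach { case ((key, field), n) => redis.hincrBy(key, field, n) }
      redis.close()
    }

Spark won't hand out "the first N rows" of a plain DataFrame incrementally, so partitions seem like the closest natural batch unit; if smaller batches were needed, a jdbc.repartition(...) before the foreachPartition would shrink them.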