hadoop - Spark - How to count the number of records by key


This should be an easy problem. I have a dataset and need to count the number of females in each country. I want to group the counts by country, but I'm unsure what to use as the value, since there is no count column in the dataset that I could pass to groupByKey or reduceByKey. I thought of using reduceByKey(), but that requires a key-value pair, and I only want to count the key and make a counter the value. How do I go about this?

val lines = sc.textFile("/home/cloudera/Desktop/file.txt")
val split_lines = lines.map(_.split(","))
val femaleOnly = split_lines.filter(x => x._10 == "female")

Here is where I am stuck. The country is at index 13 in the dataset. The output should look like this: (Australia, 201000), (America, 420000), etc. Any help would be great. Thanks.

You're almost there! You just need countByValue:

val countOfFemalesByCountry = femaleOnly.map(_(13)).countByValue()
// yields Map(Australia -> 230, America -> 23242, ...)

(In your example, I assume you meant x(10) rather than x._10; split returns an Array, which is indexed with parentheses, not a tuple accessor.)
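If you'd rather stick with the reduceByKey() approach you mentioned, the usual pattern is to map each record to a (key, 1) pair and then sum the counters. A minimal sketch, assuming femaleOnly is the filtered RDD from your question (with the x(10) fix applied):

// Emit (country, 1) for each female record, then sum the counters per country.
// Unlike countByValue, this keeps the result as a distributed RDD, which is
// safer when there are many distinct keys.
val countsByCountry = femaleOnly
  .map(x => (x(13), 1L))   // key = country (index 13), value = counter
  .reduceByKey(_ + _)      // sum the counters for each country
countsByCountry.collect().foreach(println)  // e.g. (Australia,201000)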

All together:

sc.textFile("/home/cloudera/Desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "female")
  .map(_(13))
  .countByValue()
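One caveat: countByValue() collects the result to the driver as a plain Scala Map[String, Long] rather than an RDD, so it is only suitable when the number of distinct countries is modest. Printing the results in the (country, count) format you described is then just ordinary Scala, as in this small usage sketch:

val counts = sc.textFile("/home/cloudera/Desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "female")
  .map(_(13))
  .countByValue()                 // scala.collection.Map[String, Long] on the driver
counts.foreach { case (country, n) => println(s"($country, $n)") }  // (Australia, 201000) etc.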
