hadoop - Pig how to format a semi-structured CSV with filters


I have a semi-structured CSV that looks like this.

vts,01,0099,7022606164,sp,gp,33,060646,a,1258.9805,n,07735.9303,e,0.0,278.6,280515,0000,00,4000,11,999,842,4b61
vts,01,0099,7022606164,nm,gp,20,060637,a,1258.9805,n,07735.9302,e,0.0,278.6,280515,0000,00,4000,11,999,841,7407+++
vts,66,0065,7022606164,nm,0,gp,22,060648,280515,1258.9804,n,07735.9301,e,04ae+++
vts,01,0099,7022606164,nm,gp,22,060656,a,1258.9804,n,07735.9301,e,0.0,278.6,280515,0000,00,4000,11,999,843,8feb+++
vts,01,0099,7022606164,nm,gp,22,060721,a,1258.9803,n,07735.9304,e,0.0,278.6,280515,0000,00,4000,11,999,845,044d++++++
vts,99,0065,7022606164,nm,0,a,gp,22,060648,280515,1258.9804,n,07735.9301,e,04ae+++
vts,99,0065,7022606164,nm,0,a,gp,22,060648,280515,1258.9804,n,07735.9301,e,04ae

I want to make 3 different tables from this data, i.e. one each for vts,01, vts,99 and vts,66. I also need to remove the "+++" that is attached to some lines in error. I have written this Pig script:

data = LOAD '/user/simulator/skytrack/27thmay2015' USING PigStorage('\n') AS (f1:chararray);
splt = FOREACH data GENERATE FLATTEN(STRSPLIT($0, '\\+++'));
data_pkt = FILTER splt BY $0 MATCHES '.*vts,01+.*';
sos_pkt = FILTER splt BY $1 MATCHES '.*vts,99+.*';
health_pkt = FILTER splt BY $2 MATCHES '.*vts,66+.*';

When I test the relations individually, I get output for only one of the tables and no output for the rest:

dump data_pkt; dump sos_pkt; dump health_pkt;

I am new to Pig. Can anyone help me solve this issue? It would be appreciated.

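To remove "+++", note that '\\+++' escapes only the first "+"; the remaining "++" are read as a regex quantifier, not as literal pluses. You are not specific about the meaning of these pluses, so I would rather split on a regex that matches any run of three or more of them: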

 "\\+{3,}" 

And consequently, in the Pig script:

splt = FOREACH data GENERATE FLATTEN(STRSPLIT($0, '\\+{3,}'));
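As far as I know, Pig's STRSPLIT hands its pattern to Java's regex engine, so the effect of the pattern can be checked in plain Java outside Pig. The sketch below is only an illustration under that assumption. The class name PlusSplitDemo and the method splitRecords are made up for this example, and the second input is two sample records from the question joined by hand.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Pig: shows what the pattern does to one input line.
class PlusSplitDemo {
    // "\\+" is a literal plus sign, "{3,}" means three or more in a row.
    static final String PLUS_RUN = "\\+{3,}";

    // Splits one raw line on runs of pluses; trailing empty pieces are dropped by String.split.
    static String[] splitRecords(String line) {
        return line.split(PLUS_RUN);
    }

    public static void main(String[] args) {
        // One record with the trailing "+++" from the question's sample data.
        String single = "vts,66,0065,7022606164,nm,0,gp,22,060648,280515,1258.9804,n,07735.9301,e,04ae+++";
        System.out.println(Arrays.toString(splitRecords(single)));

        // Assumption for illustration: two records joined on one line, the second ending in "++++++".
        String joined = single
                + "vts,99,0065,7022606164,nm,0,a,gp,22,060648,280515,1258.9804,n,07735.9301,e,04ae++++++";
        System.out.println(Arrays.toString(splitRecords(joined)));
    }
}
```

If every input line holds a single record, the split yields only one piece, so after FLATTEN only $0 is populated. That may be why the filters on $1 and $2 in the question return nothing.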

Although Aman is correct, I would rather use SPLIT instead of FILTER to separate the datasets:

-- load comma-separated fields so that $1 is the packet type
A = LOAD '/abc.txt' USING PigStorage(',');
SPLIT A INTO
    b01 IF $1 == 01,
    b66 IF $1 == 66,
    b99 IF $1 == 99;
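The routing rule that SPLIT applies here can be sketched in plain Java to check it against sample records before running the job. This is a rough illustration, not Pig's implementation. The class PacketRouter and the method route are hypothetical names, and the sketch assumes the "+++" suffix has already been removed from each record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pig: mimics routing records by their second field.
class PacketRouter {
    // Returns one bucket per packet type named in the question ("01", "66", "99").
    // Records with any other type, or with no second field, are left out of every bucket.
    static Map<String, List<String>> route(List<String> records) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String type : new String[] {"01", "66", "99"}) {
            buckets.put(type, new ArrayList<>());
        }
        for (String record : records) {
            String[] fields = record.split(",");
            if (fields.length > 1 && buckets.containsKey(fields[1])) {
                buckets.get(fields[1]).add(record);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Cleaned sample records taken from the question.
        List<String> records = Arrays.asList(
                "vts,01,0099,7022606164,sp,gp,33,060646,a,1258.9805,n,07735.9303,e,0.0,278.6,280515,0000,00,4000,11,999,842,4b61",
                "vts,66,0065,7022606164,nm,0,gp,22,060648,280515,1258.9804,n,07735.9301,e,04ae",
                "vts,99,0065,7022606164,nm,0,a,gp,22,060648,280515,1258.9804,n,07735.9301,e,04ae");
        route(records).forEach((type, rows) -> System.out.println(type + " -> " + rows.size() + " record(s)"));
    }
}
```

Comparing the second field as text keeps the leading zero of "01" intact, which avoids depending on how a numeric cast treats it.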
