regex - How to transform pseudo-XML into a flat structure?



I'm trying to parse a file that looks like XML but is not. It is actually a readable version of a CRD transformed from ASN.1 format. It looks like this:

<pin rownum="1">
  <cgpa tag="3100.2.960.51" value="1">
    <data tag="3100.2.962.56" name="cgpasubscriberidentifier" value="50212000000089804"/>
    <data tag="3100.2.962.60" name="cgparoaming" value="1"/>
  </cgpa>
  <aaa_common tag="3100.2.960.1" value="1">
    <data tag="3100.2.962.12" name="sigsleeid" value="watbf102"/>
    <data tag="3100.2.962.34" name="scpaddress" value="48602888950"/>
  </aaa_common>
  <evt tag="3100.2.134.28" name="unsupported" value="0"/>
  <data tag="3100.2.112.1" name="eventdatetime" value="07/05/2014 19:45:18"/>
  <data tag="3100.2.137.4" name="intriggeringkey" value="0048662221827"/>
  <evt tag="3100.2.137.5" name="typeintriggeringkey" value="1"/>
  <customerdomain tag="3100.2.134.1" value="1">
    <data tag="3100.2.133.1" name="ordinaryclientid" value="50212000000089804"/>
    <data tag="3100.2.105.1" name="customerservicename" value="so_tt_roam_voice"/>
    <accountdomain tag="3100.2.134.3" value="1">
      <data tag="3100.2.104.4" name="accountidentifier" value="50212000000089804"/>
      <data tag="3100.2.100.1" name="subscribertype" value="1"/>
      <evt tag="3100.2.139.3" name="unsupported" value="0"/>
      <tariffdomain tag="3100.2.134.11" value="1">
        <data tag="3100.2.106.10" name="tariffplannameversion" value="tt_voi_r_1_pl_1a_0_roamb - 2_tca"/>
      </tariffdomain>
      <tariffdomain tag="3100.2.134.11" value="1">
        <data tag="3100.2.106.10" name="tariffplannameversion" value="tt_voi_r_1_pl_1a_0_main - 2_tca"/>
        <data tag="3100.2.106.1" name="tariffplanname" value="tt_voi_r_1_pl_1a_0_main"/>
        <evt tag="3100.2.139.9" name="tariffcost" value="1013"/>
        <evt tag="3100.2.139.10" name="tariffcostvat" value="1013"/>
        <evt tag="3100.2.140.7" name="eventquantitypertariff1" value="614"/>
        <evt tag="3100.2.142.11" name="usedquantitypertariff1" value="614"/>
      </tariffdomain>
      <evt tag="3100.2.134.29" name="unsupported" value="1"/>
      <data tag="3100.2.124.45" name="unsupported" value="07/05/2014 19:45:18"/>
      <evt tag="3100.2.139.35" name="unsupported" value="495"/>
      <data tag="3100.2.24.11" name="unsupported" value="84490"/>
      <evt tag="3100.2.134.30" name="unsupported" value="1"/>
    </accountdomain>
  </customerdomain>
</pin>

The main tag of each record is pin; the sub-tags can appear in random order or not appear at all. The typical solution for XML cases in Pig is to use the piggybank function XMLLoader, but that assumes the order of the tags is constant; otherwise I'm unable to put a schema on it. The solution I see is to apply a regexp to each line, take the name and value, and use map[]. But what about tags that appear more than once, like tariffdomain in the example? How do I deal with that?

Regards,
Pawel

I am throwing out one idea; please let me know if it works for you.
Algorithm:
1. Parse each line and take the name and value using a regex
2. Remove the null strings (lines where the regex did not match)
3. Group the rows based on key
4. Map each key to its multiple values as a bag
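The four steps above can be sketched outside Pig as well. The following is a minimal Python illustration, not the Pig implementation: the function name `flatten_record` and the shortened `sample` lines are hypothetical, and the pattern uses `[^"]*` rather than `.*` so that each group stops at its closing quote.

```python
import re
from collections import defaultdict

# Captures the name and value attributes of one element.
PAIR = re.compile(r'name="([^"]*)"\s+value="([^"]*)"')

def flatten_record(lines):
    """Return {name: [values, ...]} for the lines of one record."""
    grouped = defaultdict(list)
    for line in lines:
        match = PAIR.search(line)       # step 1: extract name and value
        if match is None:               # step 2: drop lines without a pair
            continue
        key, value = match.groups()
        grouped[key].append(value)      # steps 3 and 4: group values per key
    return dict(grouped)

# A shortened, hypothetical excerpt of the record from the question.
sample = [
    '<pin rownum="1">',
    '<tariffdomain tag="3100.2.134.11" value="1">',
    '<data tag="3100.2.106.10" name="tariffplannameversion" value="tt_voi_r_1_pl_1a_0_roamb - 2_tca"/>',
    '</tariffdomain>',
    '<data tag="3100.2.106.10" name="tariffplannameversion" value="tt_voi_r_1_pl_1a_0_main - 2_tca"/>',
    '<evt tag="3100.2.139.9" name="tariffcost" value="1013"/>',
]

result = flatten_record(sample)
print(result)
```

Container tags such as pin and tariffdomain carry no name attribute, so they are skipped; a name that repeats simply ends up with several values in its list, which is the role the bag plays in Pig.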

Pig script:

    -- Load every line of the file as a single chararray
    A = LOAD 'input.txt' AS (line:chararray);
    -- Extract the name and value attributes; lines without a match yield nulls
    B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'.*name="(.*)"\\s+value="(.*)".*')) AS (mykey:chararray,myvalue:chararray);
    -- Drop the lines that did not match
    C = FILTER B BY mykey IS NOT NULL;
    -- Collect all values that share a name
    D = GROUP C BY mykey;
    -- Build one map per name, with a bag of values
    E = FOREACH D GENERATE TOMAP(group,C.myvalue);
    DUMP E;

Output:

    ([sigsleeid#{(watbf102)}])
    ([scpaddress#{(48602888950)}])
    ([tariffcost#{(1013)}])
    ([cgparoaming#{(1)}])
    ([unsupported#{(1),(0),(0),(1),(07/05/2014 19:45:18),(495),(84490)}])
    ([eventdatetime#{(07/05/2014 19:45:18)}])
    ([tariffcostvat#{(1013)}])
    ([subscribertype#{(1)}])
    ([tariffplanname#{(tt_voi_r_1_pl_1a_0_main)}])
    ([intriggeringkey#{(0048662221827)}])
    ([ordinaryclientid#{(50212000000089804)}])
    ([accountidentifier#{(50212000000089804)}])
    ([customerservicename#{(so_tt_roam_voice)}])
    ([typeintriggeringkey#{(1)}])
    ([tariffplannameversion#{(tt_voi_r_1_pl_1a_0_main - 2_tca),(tt_voi_r_1_pl_1a_0_roamb - 2_tca)}])
    ([usedquantitypertariff1#{(614)}])
    ([eventquantitypertariff1#{(614)}])
    ([cgpasubscriberidentifier#{(50212000000089804)}])
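One caveat about the pattern: its `.*` groups are greedy, so it relies on every element sitting on its own line. The Python sketch below is only an illustration of that regex behaviour, on the assumption that the pattern is matched against the whole line (as `re.fullmatch` does here). The names `GREEDY` and `BOUNDED` and the two-element test line are hypothetical.

```python
import re

# The pattern from the script, applied to the whole line.
GREEDY = re.compile(r'.*name="(.*)"\s+value="(.*)".*')
# A variant whose groups cannot run past a closing quote.
BOUNDED = re.compile(r'name="([^"]*)"\s+value="([^"]*)"')

# Two elements that ended up on the same line.
line = ('<data tag="3100.2.962.12" name="sigsleeid" value="watbf102"/> '
        '<data tag="3100.2.962.34" name="scpaddress" value="48602888950"/>')

greedy_pair = GREEDY.fullmatch(line).groups()
bounded_pairs = BOUNDED.findall(line)

print(greedy_pair)    # ('scpaddress', '48602888950')
print(bounded_pairs)  # [('sigsleeid', 'watbf102'), ('scpaddress', '48602888950')]
```

With the greedy pattern only the last pair on the line survives, so if the input is not strictly one element per line it may be worth splitting it first or using a quote-bounded pattern.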
