I'm new to pandas and Python, so I hope this makes sense.
I have parsed multiple tables from a website into multiple CSV files. Unfortunately, if a value was not available in the parsed data, it was omitted from the table. As a result, I have CSV files with varying numbers of columns.
I've used read_csv() and to_csv() in the past and they work like a charm when the data is clean, but I'm stumped here.
I figured there might be a way to "map" the data being read if I first feed pandas a df with only the column headers, and then map each file against the columns in that main file.
E.g. once that's set up, use read_csv() and to_csv() on the main merged file and "map" the available fields to the correct columns in the merged file, as in the sketch below.
This is a short version of the data:
file 1:
    id, price, name
    1, $800, jim

file 2:
    id, price, address, name
    2, $500, 1 main st., amanda

desired output:
    id, price, address, name
    1, $800, , jim
    2, $500, 1 main st., amanda
This is the code that got me this far:
    import os
    from os import walk

    import pandas as pd

    mypath = 'i:\\filepath\\'

    # Creating a list of files to be read and merged.
    listfiles = []
    for (dirpath, dirnames, filenames) in walk(mypath):
        listfiles.extend(filenames)
        break

    # Reading/writing the "master headers" to the new merged csv using the "master header" file.
    headers = pd.read_csv('i:\\filepath\\master_header.csv', index_col=0)
    with open('i:\\filepath\\merge.csv', 'w') as f:
        headers.to_csv(f)

    def mergefile(filenames):
        try:
            # Keeping a list of the files that were read.
            with open('i:\\filepath\\file_list.txt', 'a') as f:
                f.write(str(filenames) + '\n')
            os.chdir('i:\\filepath\\')
            # Reading the file to add.
            df = pd.read_csv(filenames, index_col=0)
            # Appending the data (w/o header) to the new merged data csv file.
            with open('i:\\filepath\\merge.csv', 'a') as f:
                df.to_csv(f, header=False)
        except Exception as e:
            with open('i:\\filepath\\all_error.txt', 'a') as f:
                f.write(str(e) + '\n')

    for eachfilenames in listfiles:
        mergefile(eachfilenames)
This code merges the data, but since the number of columns varies, the values don't end up in the right place...
Any help is appreciated.
Here is a complete example that demonstrates how to load the files and merge them using concat:
    In [297]:
    import pandas as pd
    import io

    t = """id, price, name
    1, $800, jim"""
    df = pd.read_csv(io.StringIO(t), sep=r',\s+')

    t1 = """id, price, address, name
    2, $500, 1 main st., amanda"""
    df1 = pd.read_csv(io.StringIO(t1), sep=r',\s+')

    pd.concat([df, df1], ignore_index=True)

    Out[297]:
          address  id    name price
    0         NaN   1     jim  $800
    1  1 main st.   2  amanda  $500
Note that I pass ignore_index=True, as otherwise you'd get duplicate index entries, which I assume you don't want. I'm also assuming that in your original data sample 'file 1' doesn't really have a trailing comma in the header line: id, price, name,
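As a rough sketch of how this could be applied to your folder of files (the path and glob pattern are placeholders, and I'm assuming every file has a header row and that the folder contains only the data files), you could read all the CSVs, concat them, and write the merged result out once:

    import glob
    import pandas as pd

    # Read every csv in the folder (placeholder path/pattern) into a list of DataFrames.
    frames = [pd.read_csv(fname) for fname in glob.glob('i:\\filepath\\*.csv')]

    # Concatenate them; columns missing from a given file are filled with NaN,
    # and ignore_index=True rebuilds a clean 0..n-1 index.
    merged = pd.concat(frames, ignore_index=True)

    # Write the merged data to a single csv without the index column.
    merged.to_csv('i:\\filepath\\merge.csv', index=False)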