python - Merging CSV Files with missing columns in Pandas -


i'm new pandas , python, hope make sense.

i have parsed multiple tables website multiple csv files, , unfortunately if value not available parsed data, omitted table. hence, have csv files varying number of columns.

i've used read_csv() , to_csv() in past , works charm when data clean, i'm stumped here.

i figured there might way "map" read data if first fed pandas df column headers, map each file against columns in main file.

e.g. once used read_csv(), to_csv() @ main merged file , "map" available fields correct columns in merged file.

this short version of data:

file 1: id, price, name,  1, $800, jim file 2: id, price, address, name 2, $500, 1 main st., amanda   desired output: id, price, adress, name 1, $800, , jim 2, $500, 1 main st., amanda 

this code got far.

mypath='i:\\filepath\\'  #creating list of files read, , merged.  listfiles = [] (dirpath, dirnames, filenames) in walk(mypath):     listfiles.extend(filenames)     break  # reading/writing "master headers" new csv using "master header" file      headers = pd.read_csv('i:\\filepath\\master_header.csv', index_col=0)  open('i:\\filepath\\merge.csv', 'wb') f:         headers.to_csv(f)  def mergefile(filenames):       try:     # creating list of files read.      open('i:\\filepath\\file_list.txt', 'a') f:         f.write(str(filenames)+'\n')      os.chdir('i:\\filepath\\')     # reading file add.     df = pd.read_csv(filenames, index_col=0)       # appending data (w/o header) new merged data csv file.      open('i:\\filepath\\merge.csv', 'a') f:       df.to_csv(f, header=false)   except exception, e:     open('i:\\filepath\\all_error.txt', 'a') f:         f.write(str(e)+'\n')  eachfilenames in listfiles:     mergefile(eachfilenames) 

this code merges data, since number of columns vary, not in right place...

any appreciated.

here complete example demonstrates how load files , merge them using concat:

in [297]: import pandas pd import io t="""id, price, name 1, $800, jim""" df = pd.read_csv(io.stringio(t), sep=',\s+') t1="""id, price, address, name 2, $500, 1 main st., amanda""" df1 = pd.read_csv(io.stringio(t1), sep=',\s+') pd.concat([df,df1], ignore_index=true)  out[297]:       address  id    name price 0         nan   1     jim  $800 1  1 main st.   2  amanda  $500 

note pass ignore_index=true otherwise duplicate index entries assume not want, i'm assuming in original data sample 'file 1' don't have trailing comma in header line: id, price, name, removed code above


Comments