python - How can I use melt() to reshape a pandas DataFrame to a list, creating an index from a crosstab column and creating a new variable in its place? -
i have matrix of data 29523 rows x 503 cols of 3 cols indices (below subset example).
idx1| idx2 | idx3 | 1983 q4 | x | y | z |1984 q1 | x | y | z --------------------------------------------------------------------------- | a1 | q | 10 | | f | nan | 110 | | f | nan | a2 | q | 20 | b | c | 40 | 120 | b | c | 240 | a3 | q | 30 | | f | nan | 130 | | f | nan | a4 | q | 40 | b | c | 80 | 140 | b | c | 280 | a5 | q | 50 | | f | nan | 150 | | f | nan | a6 | q | 60 | b | f | 120 | 160 | b | f | 320 i read dataframe with:
>>> df = pd.read_csv(c:\filename.csv, low_memory=false, mangle_dupe_cols=false) and use pandas.melt() pivot data:
df1 = pd.melt(df, id_vars=['idx1', 'idx2', 'idx3'], var_name='valuetype', value_name = 'value') i have tried stack() melt() proved better here.
idx1 | idx2 | idx3 | valuetype | value --------------------------------------------------------------- | a1 | q | 1983 q4 | 10 | a1 | q | x | a | a1 | q | y | f | a1 | q | z | nan | a1 | q | 1984 q1 | 110 | a1 | q | x | a | a1 | q | y | f | a1 | q | z | nan | a2 | q | 1983 q4 | 20 | a2 | q | x | b | a2 | q | y | c | a2 | q | z | 40 the option mangle_dupe_cols on read_csv if true place .int suffix against valuetypes duplicated. not ideal, without there no way of linking values variables correct period.
what prefer instead of having period (1984 q1) valuetype, give periods corresponding value variable 'w' , have each period form part of idx below:
idx1 | idx2 | idx3 | idx4 | valuetype | value --------------------------------------------------------------- | a1 | q | 1983 q4| w | 10 | a1 | q | 1983 q4| x | a | a1 | q | 1983 q4| y | f | a1 | q | 1983 q4| z | nan | a1 | q | 1984 q1| w | 110 | a1 | q | 1984 q1| x | a | a1 | q | 1984 q1| y | f | a1 | q | 1984 q1| z | nan | a2 | q | 1983 q4| w | 20 | a2 | q | 1983 q4| x | b | a2 | q | 1983 q4| y | c | a2 | q | 1983 q4| z | 40 is above possible pandas or numpy?
my final dataframe going 14,761,500 rows x 6 cols.
given
in [189]: df out[189]: idx1 idx2 idx3 1983 q4 x y z 1984 q1 x.1 y.1 z.1 0 a1 q 10 f nan 110 f nan 1 a2 q 20 b c 40 120 b c 240 2 a3 q 30 f nan 130 f nan 3 a4 q 40 b c 80 140 b c 280 4 a5 q 50 f nan 150 f nan 5 a6 q 60 b f 120 160 b f 320 let first set ['idx1', 'idx2', 'idx3'] index.
df = df.set_index(['idx1', 'idx2', 'idx3']) the other columns have periodic quality them; want handle every 4 columns group. idea of "handling group" leads naturally assigning new index level column index; value same every 4 columns. ideal:
1983 q4 1984 q1 w x y z w x y z idx1 idx2 idx3 a1 q 10 f nan 110 f nan a2 q 20 b c 240 120 b c 240 a3 q 30 f nan 130 f nan a4 q 40 b c 280 140 b c 280 a5 q 50 f nan 150 f nan a6 q 60 b f 320 160 b f 320 we can achieve building multiindex , assigning df.columns:
columns = [col col in df.columns if col[0] not in set(list('xyz'))] df.columns = pd.multiindex.from_product([columns, list('wxyz')]) now desired long-format dataframe can obtained calling df.stack move column levels row index:
df.columns.names = ['idx4', 'valuetype'] series = df.stack(['idx4', 'valuetype'], dropna=false) note when mangle_dupe_cols=false, duplicate columns, x, y, z, get overwritten. lose data mangle_dupe_cols=false. example, when use mangle_dupe_cols=false last row's z value gets assigns every z column regardless of period.
so must use mangle_dupe_cols=true, (or leave out since default) , adjust code accordingly. that, fortunately, not hard since reassigning df.columns custom-build multiindex anyway.
putting together:
import numpy np import pandas pd df = pd.read_table('data', sep=r'\s*[|]\s*') df = df.set_index(['idx1', 'idx2', 'idx3']) columns = [col col in df.columns if col[0] not in set(list('xyz'))] df.columns = pd.multiindex.from_product([columns, list('wxyz')]) df.columns.names = ['idx4', 'valuetype'] series = df.stack(['idx4', 'valuetype'], dropna=false) print(series.head()) yields
idx1 idx2 idx3 idx4 valuetype a1 q 1983 q4 w 10 x y f z nan 1984 q1 w 110 dtype: object note since we've removed column levels, result series. if want dataframe 6 columns, should follow with:
series.name = 'value' df = series.reset_index() print(df.head()) which yields
idx1 idx2 idx3 idx4 valuetype value 0 a1 q 1983 q4 w 10 1 a1 q 1983 q4 x 2 a1 q 1983 q4 y f 3 a1 q 1983 q4 z nan 4 a1 q 1984 q1 w 110 ...
Comments
Post a Comment