python - How can I use melt() to reshape a pandas DataFrame to a list, creating an index from a crosstab column and creating a new variable in its place? -

September 15, 2013

i have matrix of data 29523 rows x 503 cols of 3 cols indices (below subset example).

idx1|  idx2  | idx3 | 1983 q4   |  x  | y |  z  |1984 q1 |   x  | y | z  ---------------------------------------------------------------------------   |   a1   |  q   |   10      |   | f | nan | 110    |    | f | nan   |   a2   |  q   |   20      |  b  | c | 40  | 120    |   b  | c | 240   |   a3   |  q   |   30      |   | f | nan | 130    |    | f | nan   |   a4   |  q   |   40      |  b  | c | 80  | 140    |   b  | c | 280   |   a5   |  q   |   50      |   | f | nan | 150    |    | f | nan   |   a6   |  q   |   60      |  b  | f | 120 | 160    |   b  | f | 320

i read dataframe with:

>>> df = pd.read_csv(c:\filename.csv, low_memory=false, mangle_dupe_cols=false)

and use pandas.melt() pivot data:

df1 = pd.melt(df, id_vars=['idx1', 'idx2', 'idx3'], var_name='valuetype',               value_name = 'value')

i have tried stack() melt() proved better here.

idx1    |   idx2    |   idx3    |   valuetype   |   value ---------------------------------------------------------------       |   a1      |   q       |   1983 q4     |   10       |   a1      |   q       |   x           |   a       |   a1      |   q       |   y           |   f       |   a1      |   q       |   z           |   nan       |   a1      |   q       |   1984 q1     |   110       |   a1      |   q       |   x           |   a       |   a1      |   q       |   y           |   f       |   a1      |   q       |   z           |   nan       |   a2      |   q       |   1983 q4     |   20       |   a2      |   q       |   x           |   b       |   a2      |   q       |   y           |   c       |   a2      |   q       |   z           |   40

the option mangle_dupe_cols on read_csv if true place .int suffix against valuetypes duplicated. not ideal, without there no way of linking values variables correct period.

what prefer instead of having period (1984 q1) valuetype, give periods corresponding value variable 'w' , have each period form part of idx below:

idx1    |   idx2    |   idx3 | idx4    |    valuetype   |   value ---------------------------------------------------------------       |   a1      |   q    |  1983 q4|    w           |   10       |   a1      |   q    |  1983 q4|    x           |   a       |   a1      |   q    |  1983 q4|    y           |   f       |   a1      |   q    |  1983 q4|    z           |   nan       |   a1      |   q    |  1984 q1|    w           |   110       |   a1      |   q    |  1984 q1|    x           |   a       |   a1      |   q    |  1984 q1|    y           |   f       |   a1      |   q    |  1984 q1|    z           |   nan       |   a2      |   q    |  1983 q4|    w           |   20       |   a2      |   q    |  1983 q4|    x           |   b       |   a2      |   q    |  1983 q4|    y           |   c       |   a2      |   q    |  1983 q4|    z           |   40

is above possible pandas or numpy?

my final dataframe going 14,761,500 rows x 6 cols.

given

in [189]: df out[189]:    idx1 idx2 idx3  1983 q4  x  y    z  1984 q1 x.1 y.1  z.1 0      a1    q       10   f  nan      110     f  nan 1      a2    q       20  b  c   40      120   b   c  240 2      a3    q       30   f  nan      130     f  nan 3      a4    q       40  b  c   80      140   b   c  280 4      a5    q       50   f  nan      150     f  nan 5      a6    q       60  b  f  120      160   b   f  320

let first set ['idx1', 'idx2', 'idx3'] index.

df = df.set_index(['idx1', 'idx2', 'idx3'])

the other columns have periodic quality them; want handle every 4 columns group. idea of "handling group" leads naturally assigning new index level column index; value same every 4 columns. ideal:

               1983 q4            1984 q1                                 w  x  y    z       w  x  y    z idx1 idx2 idx3                                          a1   q         10   f  nan     110   f  nan      a2   q         20  b  c  240     120  b  c  240      a3   q         30   f  nan     130   f  nan      a4   q         40  b  c  280     140  b  c  280      a5   q         50   f  nan     150   f  nan      a6   q         60  b  f  320     160  b  f  320

we can achieve building multiindex , assigning df.columns:

columns = [col col in df.columns if col[0] not in set(list('xyz'))] df.columns = pd.multiindex.from_product([columns, list('wxyz')])

now desired long-format dataframe can obtained calling df.stack move column levels row index:

df.columns.names = ['idx4', 'valuetype'] series = df.stack(['idx4', 'valuetype'], dropna=false)

note when mangle_dupe_cols=false, duplicate columns, x, y, z, get overwritten. lose data mangle_dupe_cols=false. example, when use mangle_dupe_cols=false last row's z value gets assigns every z column regardless of period.

so must use mangle_dupe_cols=true, (or leave out since default) , adjust code accordingly. that, fortunately, not hard since reassigning df.columns custom-build multiindex anyway.

putting together:

import numpy np import pandas pd df = pd.read_table('data', sep=r'\s*[|]\s*') df = df.set_index(['idx1', 'idx2', 'idx3']) columns = [col col in df.columns if col[0] not in set(list('xyz'))] df.columns = pd.multiindex.from_product([columns, list('wxyz')]) df.columns.names = ['idx4', 'valuetype'] series = df.stack(['idx4', 'valuetype'], dropna=false) print(series.head())

yields

idx1  idx2  idx3  idx4     valuetype     a1    q     1983 q4  w             10                            x                                         y              f                            z            nan                   1984 q1  w            110 dtype: object

note since we've removed column levels, result series. if want dataframe 6 columns, should follow with:

series.name = 'value' df = series.reset_index() print(df.head())

which yields

  idx1 idx2 idx3     idx4 valuetype value 0      a1    q  1983 q4         w    10 1      a1    q  1983 q4         x     2      a1    q  1983 q4         y     f 3      a1    q  1983 q4         z   nan 4      a1    q  1984 q1         w   110 ...

Search This Blog

Ruby Code

python - How can I use melt() to reshape a pandas DataFrame to a list, creating an index from a crosstab column and creating a new variable in its place? -

Comments

Post a Comment

Popular posts from this blog

php - failed to open stream: HTTP request failed! HTTP/1.0 400 Bad Request -

command line - Use qwinsta in PowerShell ISE -

java - Show Soft Keyboard when EditText Appears -