python - How can I use melt() to reshape a pandas DataFrame to a list, creating an index from a crosstab column and creating a new variable in its place? -
i have matrix of data 29523 rows x 503 cols of 3 cols indices (below subset example).
idx1| idx2 | idx3 | 1983 q4 | x | y | z |1984 q1 | x | y | z --------------------------------------------------------------------------- | a1 | q | 10 | | f | nan | 110 | | f | nan | a2 | q | 20 | b | c | 40 | 120 | b | c | 240 | a3 | q | 30 | | f | nan | 130 | | f | nan | a4 | q | 40 | b | c | 80 | 140 | b | c | 280 | a5 | q | 50 | | f | nan | 150 | | f | nan | a6 | q | 60 | b | f | 120 | 160 | b | f | 320
i read dataframe
with:
>>> df = pd.read_csv(c:\filename.csv, low_memory=false, mangle_dupe_cols=false)
and use pandas.melt()
pivot data:
df1 = pd.melt(df, id_vars=['idx1', 'idx2', 'idx3'], var_name='valuetype', value_name = 'value')
i have tried stack()
melt()
proved better here.
idx1 | idx2 | idx3 | valuetype | value --------------------------------------------------------------- | a1 | q | 1983 q4 | 10 | a1 | q | x | a | a1 | q | y | f | a1 | q | z | nan | a1 | q | 1984 q1 | 110 | a1 | q | x | a | a1 | q | y | f | a1 | q | z | nan | a2 | q | 1983 q4 | 20 | a2 | q | x | b | a2 | q | y | c | a2 | q | z | 40
the option mangle_dupe_cols
on read_csv
if true
place .int
suffix against valuetype
s duplicated. not ideal, without there no way of linking values variables correct period.
what prefer instead of having period
(1984 q1)
valuetype
, give period
s corresponding value
variable 'w'
, have each period form part of idx
below:
idx1 | idx2 | idx3 | idx4 | valuetype | value --------------------------------------------------------------- | a1 | q | 1983 q4| w | 10 | a1 | q | 1983 q4| x | a | a1 | q | 1983 q4| y | f | a1 | q | 1983 q4| z | nan | a1 | q | 1984 q1| w | 110 | a1 | q | 1984 q1| x | a | a1 | q | 1984 q1| y | f | a1 | q | 1984 q1| z | nan | a2 | q | 1983 q4| w | 20 | a2 | q | 1983 q4| x | b | a2 | q | 1983 q4| y | c | a2 | q | 1983 q4| z | 40
is above possible pandas or numpy?
my final dataframe
going 14,761,500 rows x 6 cols.
given
in [189]: df out[189]: idx1 idx2 idx3 1983 q4 x y z 1984 q1 x.1 y.1 z.1 0 a1 q 10 f nan 110 f nan 1 a2 q 20 b c 40 120 b c 240 2 a3 q 30 f nan 130 f nan 3 a4 q 40 b c 80 140 b c 280 4 a5 q 50 f nan 150 f nan 5 a6 q 60 b f 120 160 b f 320
let first set ['idx1', 'idx2', 'idx3']
index.
df = df.set_index(['idx1', 'idx2', 'idx3'])
the other columns have periodic quality them; want handle every 4 columns group. idea of "handling group" leads naturally assigning new index level column index; value same every 4 columns. ideal:
1983 q4 1984 q1 w x y z w x y z idx1 idx2 idx3 a1 q 10 f nan 110 f nan a2 q 20 b c 240 120 b c 240 a3 q 30 f nan 130 f nan a4 q 40 b c 280 140 b c 280 a5 q 50 f nan 150 f nan a6 q 60 b f 320 160 b f 320
we can achieve building multiindex , assigning df.columns
:
columns = [col col in df.columns if col[0] not in set(list('xyz'))] df.columns = pd.multiindex.from_product([columns, list('wxyz')])
now desired long-format dataframe can obtained calling df.stack
move column levels row index:
df.columns.names = ['idx4', 'valuetype'] series = df.stack(['idx4', 'valuetype'], dropna=false)
note when mangle_dupe_cols=false
, duplicate columns, x
, y
, z
, get overwritten. lose data mangle_dupe_cols=false
. example, when use mangle_dupe_cols=false
last row's z
value gets assigns every z
column regardless of period.
so must use mangle_dupe_cols=true
, (or leave out since default) , adjust code accordingly. that, fortunately, not hard since reassigning df.columns
custom-build multiindex anyway.
putting together:
import numpy np import pandas pd df = pd.read_table('data', sep=r'\s*[|]\s*') df = df.set_index(['idx1', 'idx2', 'idx3']) columns = [col col in df.columns if col[0] not in set(list('xyz'))] df.columns = pd.multiindex.from_product([columns, list('wxyz')]) df.columns.names = ['idx4', 'valuetype'] series = df.stack(['idx4', 'valuetype'], dropna=false) print(series.head())
yields
idx1 idx2 idx3 idx4 valuetype a1 q 1983 q4 w 10 x y f z nan 1984 q1 w 110 dtype: object
note since we've removed column levels, result series. if want dataframe 6 columns, should follow with:
series.name = 'value' df = series.reset_index() print(df.head())
which yields
idx1 idx2 idx3 idx4 valuetype value 0 a1 q 1983 q4 w 10 1 a1 q 1983 q4 x 2 a1 q 1983 q4 y f 3 a1 q 1983 q4 z nan 4 a1 q 1984 q1 w 110 ...
Comments
Post a Comment