Creating a partial SAS PROC SUMMARY replacement in Python/Pandas -
we working off of sas , onto python/pandas. however, 1 thing having trouble creating replacement proc summary (aka proc means) has sas routine's flexibility. non-sas users: proc summary routine produces table containing "descriptive statistics variables across observations or within groups of observations" in dataset, paraphrase sas documentation. our requirements small subset of full functionality - outputting table have:
- ability apply different stats different columns (for count, sum, mean, weighted mean)
- ability handle 0 many grouping variables
- ability specify weight variable weighted mean
we not trying else (anything graphical, etc.)
here have far:
def wmean_ungrouped (d,w): return (d.dot(w)).sum() / w.sum() def wmean_grouped (group, var_name_in, var_name_weight): d = group[var_name_in] w = group[var_name_weight] return (d * w).sum() / w.sum() funcs = { "mean" : np.mean , "sum" : np.sum , "count" : np.count_nonzero } def my_summary ( data , var_names_in , var_names_out , var_functions , var_name_weight = none , var_names_group = none ): result = dataframe() if var_names_group not none: grouped = data.groupby (var_names_group) var_name_in, var_name_out, var_function in \ zip(var_names_in,var_names_out,var_functions): if var_function == "wmean": func = lambda x : wmean_grouped (x, var_name_in, var_name_weight) result[var_name_out] = series(grouped.apply(func)) else: func = funcs[var_function] result[var_name_out] = grouped[var_name_in].apply(func) else: var_name_in, var_name_out, var_function in \ zip(var_names_in,var_names_out,var_functions): if var_function == "wmean": result[var_name_out] = \ series(wmean_ungrouped(data[var_name_in], data[var_name_weight])) else: func = funcs[var_function] result[var_name_out] = series(func(data[var_name_in])) return result here sample call my_summary() function:
my_summary ( data=df, var_names_in=["x_1","x_1","x_1","x_1"] , var_names_out=[ "x_1_c","x_1_s","x_1_m","x_1_wm" ] , var_functions=["count","sum","mean","wmean"] , var_name_weight="val_1" , var_names_group=["region","category"] ) my_summary() works, can see, implementation not prettiest. here main issues:
- two different code paths depending on grouped or ungrouped - stems fact
dataframe,dataframegroupbyhave different ways applying programmatically-selected reducing function single column.dataframe, way i've found directly invokingfunc(data[var_name_in]).data[var_name_in].apply(func)doesn't work becauseapply()onseriesnot reduce (unlikeapply()ondataframe). on other hand,dataframegroupby, have use approach:grouped[var_name_in].apply(func). that's becausefunc(grouped[var_name_in])not work (no reason should.) - special treatment weighted mean - because operates on 2 columns, unlike other calculations, operate on one; don't know if can helped.
- two different weighted mean functions - consequence of first issue. ungrouped function has
series-type parameters , needsdot()multiply , reduce them; grouped function dealsseriesgroupbyobjects , has use*operator (acknowledgements the answer post weighted average function code.)
so questions are:
- is there native pandas can of (i.e. throw out above , use instead)?
- if not, there fixes of issues mentioned above?
- by chance, there way group nothing - is, obtain
dataframegroupbyobjectdataframewithout grouping on variable? code paths reduced dealingdataframegroupbyinterface exclusively.
update / current solution
@johne's answer provides way group nothing: groupby(lambda x: true). workaround spotted in post (which, incidentally, features answer wes himself speaking of need dataframe.agg(), serve same purpose). @johne's excellent solution allows deal exclusively objects of type dataframegroupby, , instantly reduces of code paths. able reduce further using functional gimmickry possible because have dataframegroupby instances. basically, functions generated needed - "generators" (in quotes here not confused python generator expressions) take 2 parameters: value column name , weight column name, second of ignored in cases except wmean. generated functions applied on entire dataframegroupby, case wmean, parameters being correct column name(s) use. replaced np.* implementations pandas calculations, better deal nan values.
unless there native pandas can this, our solution:
func_gens = { "mean" : lambda y,z : lambda x : x[y].mean(), "sum" : lambda y,z : lambda x : x[y].sum() , "count" : lambda y,z : lambda x : x[y].count() , "wmean" : lambda y,z : lambda x : (x[y] * x[z]).sum() / x[z].sum() } def my_summary ( data , var_names_in , var_names_out , var_functions , var_name_weight = none , var_names_group = none ): result = pd.dataframe() if var_names_group none: grouped = data.groupby (lambda x: true) else: grouped = data.groupby (var_names_group) var_name_in, var_name_out, var_function in \ zip(var_names_in,var_names_out,var_functions): func_gen = func_gens[var_function] func = func_gen (var_name_in, var_name_weight) result[var_name_out] = grouped.apply(func) return result
well, here's quickie @ 2 issues (but still requires different function weighted mean). uses trick here (credit @dsm) around empty group doing groupby(lamda x: true). great if there kwarg 'weights' on stuff means there not, knowledge. apparently there package weighted quantiles mentioned here based on numpy don't know it. great project btw!
(note names same yours, added '2' wmean_grouped , my_summary, otherwise can use same calling interface)
def wmean_grouped2 (group, var_name_in, var_name_weight): d = group[var_name_in] w = group[var_name_weight] return (d * w).sum() / w.sum() funcs = { "mean" : np.mean , "sum" : np.sum , "count" : np.count_nonzero } def my_summary2 ( data , var_names_in , var_names_out , var_functions , var_name_weight = none , var_names_group = none ): result = pd.dataframe() if var_names_group none: grouped = data.groupby (lambda x: true) else: grouped = data.groupby (var_names_group) var_name_in, var_name_out, var_function in \ zip(var_names_in,var_names_out,var_functions): if var_function == "wmean": func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight) result[var_name_out] = pd.series(grouped.apply(func)) else: func = funcs[var_function] result[var_name_out] = grouped[var_name_in].apply(func) return result
Comments
Post a Comment