Creating a partial SAS PROC SUMMARY replacement in Python/Pandas -


we working off of sas , onto python/pandas. however, 1 thing having trouble creating replacement proc summary (aka proc means) has sas routine's flexibility. non-sas users: proc summary routine produces table containing "descriptive statistics variables across observations or within groups of observations" in dataset, paraphrase sas documentation. our requirements small subset of full functionality - outputting table have:

  • ability apply different stats different columns (for count, sum, mean, weighted mean)
  • ability handle 0 many grouping variables
  • ability specify weight variable weighted mean

we not trying else (anything graphical, etc.)

here have far:

def wmean_ungrouped (d,w):     return (d.dot(w)).sum() / w.sum()  def wmean_grouped (group, var_name_in, var_name_weight):     d = group[var_name_in]     w = group[var_name_weight]     return (d * w).sum() / w.sum()  funcs = {     "mean"   : np.mean ,     "sum"   : np.sum ,     "count" : np.count_nonzero }  def my_summary (         data ,         var_names_in ,         var_names_out ,         var_functions ,         var_name_weight = none ,         var_names_group = none ):     result = dataframe()      if var_names_group not none:         grouped = data.groupby (var_names_group)         var_name_in, var_name_out, var_function in \                 zip(var_names_in,var_names_out,var_functions):             if var_function == "wmean":                 func = lambda x : wmean_grouped (x, var_name_in, var_name_weight)                 result[var_name_out] = series(grouped.apply(func))             else:                 func = funcs[var_function]                 result[var_name_out] = grouped[var_name_in].apply(func)     else:         var_name_in, var_name_out, var_function in \                 zip(var_names_in,var_names_out,var_functions):             if var_function == "wmean":                 result[var_name_out] = \                     series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))             else:                 func = funcs[var_function]                 result[var_name_out] = series(func(data[var_name_in]))      return result 

here sample call my_summary() function:

    my_summary (         data=df,         var_names_in=["x_1","x_1","x_1","x_1"] ,         var_names_out=[             "x_1_c","x_1_s","x_1_m","x_1_wm"         ] ,         var_functions=["count","sum","mean","wmean"] ,         var_name_weight="val_1" ,         var_names_group=["region","category"] ) 

my_summary() works, can see, implementation not prettiest. here main issues:

  • two different code paths depending on grouped or ungrouped - stems fact dataframe , dataframegroupby have different ways applying programmatically-selected reducing function single column. dataframe, way i've found directly invoking func(data[var_name_in]). data[var_name_in].apply(func) doesn't work because apply() on series not reduce (unlike apply() on dataframe). on other hand, dataframegroupby, have use approach: grouped[var_name_in].apply(func). that's because func(grouped[var_name_in]) not work (no reason should.)
  • special treatment weighted mean - because operates on 2 columns, unlike other calculations, operate on one; don't know if can helped.
  • two different weighted mean functions - consequence of first issue. ungrouped function has series-type parameters , needs dot() multiply , reduce them; grouped function deals seriesgroupby objects , has use * operator (acknowledgements the answer post weighted average function code.)

so questions are:

  • is there native pandas can of (i.e. throw out above , use instead)?
  • if not, there fixes of issues mentioned above?
  • by chance, there way group nothing - is, obtain dataframegroupby object dataframe without grouping on variable? code paths reduced dealing dataframegroupby interface exclusively.

update / current solution

@johne's answer provides way group nothing: groupby(lambda x: true). workaround spotted in post (which, incidentally, features answer wes himself speaking of need dataframe.agg(), serve same purpose). @johne's excellent solution allows deal exclusively objects of type dataframegroupby, , instantly reduces of code paths. able reduce further using functional gimmickry possible because have dataframegroupby instances. basically, functions generated needed - "generators" (in quotes here not confused python generator expressions) take 2 parameters: value column name , weight column name, second of ignored in cases except wmean. generated functions applied on entire dataframegroupby, case wmean, parameters being correct column name(s) use. replaced np.* implementations pandas calculations, better deal nan values.

unless there native pandas can this, our solution:

func_gens = {     "mean"  : lambda y,z : lambda x : x[y].mean(),     "sum"   : lambda y,z : lambda x : x[y].sum() ,     "count" : lambda y,z : lambda x : x[y].count() ,     "wmean" : lambda y,z : lambda x : (x[y] * x[z]).sum() / x[z].sum() }  def my_summary (         data ,         var_names_in ,         var_names_out ,         var_functions ,         var_name_weight = none ,         var_names_group = none ):      result = pd.dataframe()      if var_names_group none:         grouped = data.groupby (lambda x: true)     else:         grouped = data.groupby (var_names_group)      var_name_in, var_name_out, var_function in \             zip(var_names_in,var_names_out,var_functions):         func_gen = func_gens[var_function]         func = func_gen (var_name_in, var_name_weight)         result[var_name_out] = grouped.apply(func)      return result 

well, here's quickie @ 2 issues (but still requires different function weighted mean). uses trick here (credit @dsm) around empty group doing groupby(lamda x: true). great if there kwarg 'weights' on stuff means there not, knowledge. apparently there package weighted quantiles mentioned here based on numpy don't know it. great project btw!

(note names same yours, added '2' wmean_grouped , my_summary, otherwise can use same calling interface)

def wmean_grouped2 (group, var_name_in, var_name_weight):     d = group[var_name_in]     w = group[var_name_weight]     return (d * w).sum() / w.sum()  funcs = { "mean"  : np.mean ,           "sum"   : np.sum ,           "count" : np.count_nonzero }  def my_summary2 (         data ,         var_names_in ,         var_names_out ,         var_functions ,         var_name_weight = none ,         var_names_group = none ):      result = pd.dataframe()      if var_names_group none:         grouped = data.groupby (lambda x: true)     else:         grouped = data.groupby (var_names_group)      var_name_in, var_name_out, var_function in \             zip(var_names_in,var_names_out,var_functions):         if var_function == "wmean":             func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight)             result[var_name_out] = pd.series(grouped.apply(func))         else:             func = funcs[var_function]             result[var_name_out] = grouped[var_name_in].apply(func)      return result 

Comments

Popular posts from this blog

java - Spring Data JPA: Why findOne(id) executing delete query internally? -

python - Mongodb How to add addtional information when aggregating? -

java - Incorrect order of records in M-M relationship in hibernate -