python - Avoiding double counting in pandas merge -
problem 1:
i have aggregated 2 large (for me) files date january 2014. 1 17 mil rows , other 3 mil rows. aggregated them based on date field personid field, , customerid field , summed column 1 per row.
file 1 (i dropped duplicates personid should have 1 visit per customerid per date):
date | personid | customerid | sum of visits file 2 (this not have complete record of customerid not including - want total chats first not leave out such large chunk of data):
date | personid | sum of chats when pd.merge(file1, file2, how= 'left') end severely increasing number of chats file 2. because personid might have several customerids on same date if had multiple chats adding chats each row. not work out when load data in tableau , sum up. (the end result looking dividing total visits total chats each personid create ratio). what's best approach here?
problem 2:
after finish aggregated files want again merge 2 files @ granular row level. issue file 2 can have multiple chats same personid on same date. there way join/merge file 1 has 1 record per personid+date+customerid without creating duplicate visits on first file?
file 2 :
date | personid | customerid | count of chat
assuming vision of data close, see approach.
first, reproducible data.
in [2]: d1 = {'date': {0: pd.timestamp('2010-01-01 00:00:00'), 1: pd.timestamp('2010-01-02 00:00:00'), 2: pd.timestamp('2010-01-03 00:00:00'), 3: pd.timestamp('2010-01-03 00:00:00'), 4: pd.timestamp('2010-01-03 00:00:00'), 5: pd.timestamp('2010-01-06 00:00:00'), 6: pd.timestamp('2010-01-06 00:00:00'), 7: pd.timestamp('2010-01-06 00:00:00'), 8: pd.timestamp('2010-01-09 00:00:00'), 9: pd.timestamp('2010-01-10 00:00:00'), 10: pd.timestamp('2010-01-11 00:00:00'), 11: pd.timestamp('2010-01-12 00:00:00'), 12: pd.timestamp('2010-01-12 00:00:00'), 13: pd.timestamp('2010-01-12 00:00:00'), 14: pd.timestamp('2010-01-12 00:00:00'), 15: pd.timestamp('2010-01-12 00:00:00'), 16: pd.timestamp('2010-01-17 00:00:00'), 17: pd.timestamp('2010-01-17 00:00:00'), 18: pd.timestamp('2010-01-17 00:00:00'), 19: pd.timestamp('2010-01-17 00:00:00')}, 'personid': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'bar', 4: 'foo', 5: 'bar', 6: 'foo', 7: 'bar', 8: 'foo', 9: 'bar', 10: 'foo', 11: 'bar', 12: 'foo', 13: 'bar', 14: 'foo', 15: 'bar', 16: 'foo', 17: 'bar', 18: 'foo', 19: 'bar'}, 'customerid': {0: 'aaa', 1: 'bbb', 2: 'ccc', 3: 'ddd', 4: 'eee', 5: 'fff', 6: 'ggg', 7: 'hhh', 8: 'iii', 9: 'jjj', 10: 'kkk', 11: 'lll', 12: 'mmm', 13: 'nnn', 14: 'ooo', 15: 'ppp', 16: 'qqq', 17: 'rrr', 18: 'sss', 19: 'ttt'}} ...: ...: d2 = {'date': {0: pd.timestamp('2010-01-01 00:00:00'), 1: pd.timestamp('2010-01-02 00:00:00'), 2: pd.timestamp('2010-01-03 00:00:00'), 3: pd.timestamp('2010-01-06 00:00:00'), 4: pd.timestamp('2010-01-09 00:00:00'), 5: pd.timestamp('2010-01-10 00:00:00'), 6: pd.timestamp('2010-01-11 00:00:00'), 7: pd.timestamp('2010-01-12 00:00:00'), 8: pd.timestamp('2010-01-17 00:00:00'), 9: pd.timestamp('2010-01-01 00:00:00'), 10: pd.timestamp('2010-01-02 00:00:00'), 11: pd.timestamp('2010-01-03 00:00:00'), 12: pd.timestamp('2010-01-06 00:00:00'), 13: pd.timestamp('2010-01-09 00:00:00'), 14: pd.timestamp('2010-01-10 00:00:00'), 15: pd.timestamp('2010-01-11 00:00:00'), 16: pd.timestamp('2010-01-12 00:00:00'), 17: pd.timestamp('2010-01-17 00:00:00')}, 'personid': {0: 'foo', 1: 'foo', 2: 'foo', 3: 'foo', 4: 'foo', 5: 'foo', 6: 'foo', 7: 'foo', 8: 'foo', 9: 'bar', 10: 'bar', 11: 'bar', 12: 'bar', 13: 'bar', 14: 'bar', 15: 'bar', 16: 'bar', 17: 'bar'}, 'sum of chats': {0: 5.0, 1: 3.0, 2: 24.0, 3: 7.0, 4: 15.0, 5: 9.0, 6: 16.0, 7: 22.0, 8: 14.0, 9: 8.0, 10: 15.0, 11: 14.0, 12: 29.0, 13: 11.0, 14: 6.0, 15: 14.0, 16: 30.0, 17: 12.0}} in [3]: df1 = pd.dataframe.from_dict(d1) ...: df2 = pd.dataframe.from_dict(d2) the above produces following dataframes.
# file 1 customerid date personid 0 aaa 2010-01-01 foo 1 bbb 2010-01-02 bar 2 ccc 2010-01-03 foo 3 ddd 2010-01-03 bar 4 eee 2010-01-03 foo 5 fff 2010-01-06 bar 6 ggg 2010-01-06 foo 7 hhh 2010-01-06 bar 8 iii 2010-01-09 foo 9 jjj 2010-01-10 bar 10 kkk 2010-01-11 foo 11 lll 2010-01-12 bar 12 mmm 2010-01-12 foo 13 nnn 2010-01-12 bar 14 ooo 2010-01-12 foo 15 ppp 2010-01-12 bar 16 qqq 2010-01-17 foo 17 rrr 2010-01-17 bar 18 sss 2010-01-17 foo 19 ttt 2010-01-17 bar # file 2 date personid sum of chats 0 2010-01-01 foo 5 1 2010-01-02 foo 3 2 2010-01-03 foo 24 3 2010-01-06 foo 7 4 2010-01-09 foo 15 5 2010-01-10 foo 9 6 2010-01-11 foo 16 7 2010-01-12 foo 22 8 2010-01-17 foo 14 9 2010-01-01 bar 8 10 2010-01-02 bar 15 11 2010-01-03 bar 14 12 2010-01-06 bar 29 13 2010-01-09 bar 11 14 2010-01-10 bar 6 15 2010-01-11 bar 14 16 2010-01-12 bar 30 17 2010-01-17 bar 12 if, chance, want use customerid count visits, can use pivot_table aggregate quickly.
in [4]: df1 = df1.pivot_table(index=['date','personid'], values='customerid', aggfunc=len) ...: print df1 date personid 2010-01-01 foo 1 2010-01-02 bar 1 2010-01-03 bar 1 foo 2 2010-01-06 bar 2 foo 1 2010-01-09 foo 1 2010-01-10 bar 1 2010-01-11 foo 1 2010-01-12 bar 3 foo 2 2010-01-17 bar 2 foo 2 name: customerid, dtype: int64 i prefer combine reset_index when aggregating instead of other approaches because afforded above sneak-peek of meaningful data while pivoting.
in [5]: df1 = df1.reset_index(); print df1 date personid customerid 0 2010-01-01 foo 1 1 2010-01-02 bar 1 2 2010-01-03 bar 1 3 2010-01-03 foo 2 4 2010-01-06 bar 2 5 2010-01-06 foo 1 6 2010-01-09 foo 1 7 2010-01-10 bar 1 8 2010-01-11 foo 1 9 2010-01-12 bar 3 10 2010-01-12 foo 2 11 2010-01-17 bar 2 12 2010-01-17 foo 2 so we're pretty @ start. remaining step merge second dataframe our chats per date per person.
in [6]: df = pd.merge(df1, df2, how='outer', sort=true) ...: print df date personid customerid sum of chats 0 2010-01-01 bar nan 8 1 2010-01-01 foo 1 5 2 2010-01-02 bar 1 15 3 2010-01-02 foo nan 3 4 2010-01-03 bar 1 14 5 2010-01-03 foo 2 24 6 2010-01-06 bar 2 29 7 2010-01-06 foo 1 7 8 2010-01-09 bar nan 11 9 2010-01-09 foo 1 15 10 2010-01-10 bar 1 6 11 2010-01-10 foo nan 9 12 2010-01-11 bar nan 14 13 2010-01-11 foo 1 16 14 2010-01-12 bar 3 30 15 2010-01-12 foo 2 22 16 2010-01-17 bar 2 12 17 2010-01-17 foo 2 14 granted, nans product of faulty mock data set-up on part. here, it's straight-up calculations.
let me know if helps.
Comments
Post a Comment