python - Avoiding double counting in pandas merge -


problem 1:

i have aggregated 2 large (for me) files date january 2014. 1 17 mil rows , other 3 mil rows. aggregated them based on date field personid field, , customerid field , summed column 1 per row.

file 1 (i dropped duplicates personid should have 1 visit per customerid per date):

date | personid | customerid | sum of visits 

file 2 (this not have complete record of customerid not including - want total chats first not leave out such large chunk of data):

date | personid | sum of chats 

when pd.merge(file1, file2, how= 'left') end severely increasing number of chats file 2. because personid might have several customerids on same date if had multiple chats adding chats each row. not work out when load data in tableau , sum up. (the end result looking dividing total visits total chats each personid create ratio). what's best approach here?

problem 2:

after finish aggregated files want again merge 2 files @ granular row level. issue file 2 can have multiple chats same personid on same date. there way join/merge file 1 has 1 record per personid+date+customerid without creating duplicate visits on first file?

file 2 :

date | personid | customerid | count of chat 

assuming vision of data close, see approach.

first, reproducible data.

in [2]: d1 = {'date': {0: pd.timestamp('2010-01-01 00:00:00'), 1: pd.timestamp('2010-01-02 00:00:00'), 2: pd.timestamp('2010-01-03 00:00:00'), 3: pd.timestamp('2010-01-03 00:00:00'), 4: pd.timestamp('2010-01-03 00:00:00'), 5: pd.timestamp('2010-01-06 00:00:00'), 6: pd.timestamp('2010-01-06 00:00:00'), 7: pd.timestamp('2010-01-06 00:00:00'), 8: pd.timestamp('2010-01-09 00:00:00'), 9: pd.timestamp('2010-01-10 00:00:00'), 10: pd.timestamp('2010-01-11 00:00:00'), 11: pd.timestamp('2010-01-12 00:00:00'), 12: pd.timestamp('2010-01-12 00:00:00'), 13: pd.timestamp('2010-01-12 00:00:00'), 14: pd.timestamp('2010-01-12 00:00:00'), 15: pd.timestamp('2010-01-12 00:00:00'), 16: pd.timestamp('2010-01-17 00:00:00'), 17: pd.timestamp('2010-01-17 00:00:00'), 18: pd.timestamp('2010-01-17 00:00:00'), 19: pd.timestamp('2010-01-17 00:00:00')}, 'personid': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'bar', 4: 'foo', 5: 'bar', 6: 'foo', 7: 'bar', 8: 'foo', 9: 'bar', 10: 'foo', 11: 'bar', 12: 'foo', 13: 'bar', 14: 'foo', 15: 'bar', 16: 'foo', 17: 'bar', 18: 'foo', 19: 'bar'}, 'customerid': {0: 'aaa', 1: 'bbb', 2: 'ccc', 3: 'ddd', 4: 'eee', 5: 'fff', 6: 'ggg', 7: 'hhh', 8: 'iii', 9: 'jjj', 10: 'kkk', 11: 'lll', 12: 'mmm', 13: 'nnn', 14: 'ooo', 15: 'ppp', 16: 'qqq', 17: 'rrr', 18: 'sss', 19: 'ttt'}}    ...:     ...: d2 = {'date': {0: pd.timestamp('2010-01-01 00:00:00'), 1: pd.timestamp('2010-01-02 00:00:00'), 2: pd.timestamp('2010-01-03 00:00:00'), 3: pd.timestamp('2010-01-06 00:00:00'), 4: pd.timestamp('2010-01-09 00:00:00'), 5: pd.timestamp('2010-01-10 00:00:00'), 6: pd.timestamp('2010-01-11 00:00:00'), 7: pd.timestamp('2010-01-12 00:00:00'), 8: pd.timestamp('2010-01-17 00:00:00'), 9: pd.timestamp('2010-01-01 00:00:00'), 10: pd.timestamp('2010-01-02 00:00:00'), 11: pd.timestamp('2010-01-03 00:00:00'), 12: pd.timestamp('2010-01-06 00:00:00'), 13: pd.timestamp('2010-01-09 00:00:00'), 14: pd.timestamp('2010-01-10 00:00:00'), 15: pd.timestamp('2010-01-11 00:00:00'), 16: pd.timestamp('2010-01-12 00:00:00'), 17: pd.timestamp('2010-01-17 00:00:00')}, 'personid': {0: 'foo', 1: 'foo', 2: 'foo', 3: 'foo', 4: 'foo', 5: 'foo', 6: 'foo', 7: 'foo', 8: 'foo', 9: 'bar', 10: 'bar', 11: 'bar', 12: 'bar', 13: 'bar', 14: 'bar', 15: 'bar', 16: 'bar', 17: 'bar'}, 'sum of chats': {0: 5.0, 1: 3.0, 2: 24.0, 3: 7.0, 4: 15.0, 5: 9.0, 6: 16.0, 7: 22.0, 8: 14.0, 9: 8.0, 10: 15.0, 11: 14.0, 12: 29.0, 13: 11.0, 14: 6.0, 15: 14.0, 16: 30.0, 17: 12.0}}  in [3]: df1 = pd.dataframe.from_dict(d1)    ...: df2 = pd.dataframe.from_dict(d2) 

the above produces following dataframes.

# file 1     customerid       date personid 0         aaa 2010-01-01      foo 1         bbb 2010-01-02      bar 2         ccc 2010-01-03      foo 3         ddd 2010-01-03      bar 4         eee 2010-01-03      foo 5         fff 2010-01-06      bar 6         ggg 2010-01-06      foo 7         hhh 2010-01-06      bar 8         iii 2010-01-09      foo 9         jjj 2010-01-10      bar 10        kkk 2010-01-11      foo 11        lll 2010-01-12      bar 12        mmm 2010-01-12      foo 13        nnn 2010-01-12      bar 14        ooo 2010-01-12      foo 15        ppp 2010-01-12      bar 16        qqq 2010-01-17      foo 17        rrr 2010-01-17      bar 18        sss 2010-01-17      foo 19        ttt 2010-01-17      bar  # file 2           date personid  sum of chats 0  2010-01-01      foo             5 1  2010-01-02      foo             3 2  2010-01-03      foo            24 3  2010-01-06      foo             7 4  2010-01-09      foo            15 5  2010-01-10      foo             9 6  2010-01-11      foo            16 7  2010-01-12      foo            22 8  2010-01-17      foo            14 9  2010-01-01      bar             8 10 2010-01-02      bar            15 11 2010-01-03      bar            14 12 2010-01-06      bar            29 13 2010-01-09      bar            11 14 2010-01-10      bar             6 15 2010-01-11      bar            14 16 2010-01-12      bar            30 17 2010-01-17      bar            12 

if, chance, want use customerid count visits, can use pivot_table aggregate quickly.

in [4]: df1 = df1.pivot_table(index=['date','personid'], values='customerid', aggfunc=len)    ...: print df1 date        personid 2010-01-01  foo         1 2010-01-02  bar         1 2010-01-03  bar         1             foo         2 2010-01-06  bar         2             foo         1 2010-01-09  foo         1 2010-01-10  bar         1 2010-01-11  foo         1 2010-01-12  bar         3             foo         2 2010-01-17  bar         2             foo         2 name: customerid, dtype: int64 

i prefer combine reset_index when aggregating instead of other approaches because afforded above sneak-peek of meaningful data while pivoting.

in [5]: df1 = df1.reset_index(); print df1          date personid  customerid 0  2010-01-01      foo           1 1  2010-01-02      bar           1 2  2010-01-03      bar           1 3  2010-01-03      foo           2 4  2010-01-06      bar           2 5  2010-01-06      foo           1 6  2010-01-09      foo           1 7  2010-01-10      bar           1 8  2010-01-11      foo           1 9  2010-01-12      bar           3 10 2010-01-12      foo           2 11 2010-01-17      bar           2 12 2010-01-17      foo           2 

so we're pretty @ start. remaining step merge second dataframe our chats per date per person.

in [6]: df = pd.merge(df1, df2, how='outer', sort=true)    ...: print df          date personid  customerid  sum of chats 0  2010-01-01      bar         nan             8 1  2010-01-01      foo           1             5 2  2010-01-02      bar           1            15 3  2010-01-02      foo         nan             3 4  2010-01-03      bar           1            14 5  2010-01-03      foo           2            24 6  2010-01-06      bar           2            29 7  2010-01-06      foo           1             7 8  2010-01-09      bar         nan            11 9  2010-01-09      foo           1            15 10 2010-01-10      bar           1             6 11 2010-01-10      foo         nan             9 12 2010-01-11      bar         nan            14 13 2010-01-11      foo           1            16 14 2010-01-12      bar           3            30 15 2010-01-12      foo           2            22 16 2010-01-17      bar           2            12 17 2010-01-17      foo           2            14 

granted, nans product of faulty mock data set-up on part. here, it's straight-up calculations.

let me know if helps.


Comments

Popular posts from this blog

python - Mongodb How to add addtional information when aggregating? -

java - Spring Data JPA: Why findOne(id) executing delete query internally? -

java - Incorrect order of records in M-M relationship in hibernate -