python mongodb $match and $group -
i want write simple query gives me user followers has timezone brazil , has tweeted 100 or more times:
this line :
pipeline = [{'$match':{"user.statuses_count":{"$gt":99},"user.time_zone":"brasilia"}}, {"$group":{"_id": "$user.followers_count","count" :{"$sum":1}}}, {"$sort":{"count":-1}} ]
i adapted practice problem.
this given example structure : { "_id" : objectid("5304e2e3cc9e684aa98bef97"), "text" : "first week of school on :p", "in_reply_to_status_id" : null, "retweet_count" : null, "contributors" : null, "created_at" : "thu sep 02 18:11:25 +0000 2010", "geo" : null, "source" : "web", "coordinates" : null, "in_reply_to_screen_name" : null, "truncated" : false, "entities" : { "user_mentions" : [ ], "urls" : [ ], "hashtags" : [ ] }, "retweeted" : false, "place" : null, "user" : { "friends_count" : 145, "profile_sidebar_fill_color" : "e5507e", "location" : "ireland :)", "verified" : false, "follow_request_sent" : null, "favourites_count" : 1, "profile_sidebar_border_color" : "cc3366", "profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkhoxzmam_normal.jpg", "geo_enabled" : false, "created_at" : "sun may 03 19:51:04 +0000 2009", "description" : "", "time_zone" : null, "url" : null, "screen_name" : "catherinemull", "notifications" : null, "profile_background_color" : "ff6699", "listed_count" : 77, "lang" : "en", "profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg", "statuses_count" : 2475, "following" : null, "profile_text_color" : "362720", "protected" : false, "show_all_inline_media" : false, "profile_background_tile" : true, "name" : "catherine mullane", "contributors_enabled" : false, "profile_link_color" : "b40b43", "followers_count" : 169, "id" : 37486277, "profile_use_background_image" : true, "utc_offset" : null }, "favorited" : false, "in_reply_to_user_id" : null, "id" : numberlong("22819398300") }
can spot mistakes?
suppose have couple of sample documents minimum test case. insert test documents collection in mongoshell:
db.collection.insert([ { "_id" : objectid("5304e2e3cc9e684aa98bef97"), "user" : { "friends_count" : 145, "statuses_count" : 457, "screen_name" : "catherinemull", "time_zone" : "brasilia", "followers_count" : 169, "id" : 37486277 }, "id" : numberlong(22819398300) }, { "_id" : objectid("52fd2490bac3fa1975477702"), "user" : { "friends_count" : 145, "statuses_count" : 12334, "time_zone" : "brasilia", "screen_name" : "marble", "followers_count" : 2597, "id" : 37486278 }, "id" : numberlong(22819398301) }])
for user followers in timezone "brasilia"
, has tweeted 100
or more times, pipeline achieves desired result doesn't use $group
operator:
pipeline = [ { "$match": { "user.statuses_count": { "$gt":99 }, "user.time_zone": "brasilia" } }, { "$project": { "followers": "$user.followers_count", "screen_name": "$user.screen_name", "tweets": "$user.statuses_count" } }, { "$sort": { "followers": -1 } }, {"$limit" : 1} ]
pymongo output:
{u'ok': 1.0, u'result': [{u'_id': objectid('52fd2490bac3fa1975477702'), u'followers': 2597, u'screen_name': u'marble', u'tweets': 12334}]}
the following aggregation pipeline will give desired result. in pipeline, first stage $match
operator filters documents user has got timezone
field value "brasilia"
, has tweet count (represented statuses_count
) greater or equal 100 matched via $gte
comparison operator.
the second pipeline stage has $group
operator groups filtered documents specified identifier expression $user.id
field , applies accumulator expression $max
each group on $user.followers_count
field greatest number of followers each user. system variable $$root
references root document, i.e. top-level document, being processed in $group
aggregation pipeline stage, added array field use later on. achieved using $addtoset
array operator.
the next pipeline stage $unwinds
output document each element in data
array processing in next step.
the following pipeline step, $project
, transforms each document in stream, adding new fields have values previous stream.
the last 2 pipeline stages $sort
, $limit
reorders document stream specified sort key followers
, returns 1 document contains user highest number of followers.
you final aggregation pipeline should this:
db.collection.aggregate([ { '$match': { "user.statuses_count": { "$gte": 100 }, "user.time_zone": "brasilia" } }, { "$group": { "_id": "$user.id", "max_followers": { "$max": "$user.followers_count" }, "data": { "$addtoset": "$$root" } } }, { "$unwind": "$data" }, { "$project": { "_id": "$data._id", "followers": "$max_followers", "screen_name": "$data.user.screen_name", "tweets": "$data.user.statuses_count" } }, { "$sort": { "followers": -1 } }, { "$limit" : 1 } ])
executing in robomongo gives result
/* 0 */ { "result" : [ { "_id" : objectid("52fd2490bac3fa1975477702"), "followers" : 2597, "screen_name" : "marble", "tweets" : 12334 } ], "ok" : 1 }
in python, implementation should same:
>>> pipeline = [ ... {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "brasilia"}}, ... {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addtoset": "$$roo t" }}}, ... {"$unwind": "$data"}, ... {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}}, ... {"$sort": { "followers": -1 }}, ... {"$limit" : 1} ... ] >>> >>> doc in collection.aggregate(pipeline): ... print(doc) ... {u'tweets': 12334.0, u'_id': objectid('52fd2490bac3fa1975477702'), u'followers': 2597.0, u'screen_name': u'marble'} >>>
where
pipeline = [ {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "brasilia"}}, {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addtoset": "$$root" }}}, {"$unwind": "$data"}, {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}}, {"$sort": { "followers": -1 }}, {"$limit" : 1} ]
Comments
Post a Comment