python mongodb $match and $group -

i want write simple query gives me user followers has timezone brazil , has tweeted 100 or more times:

this line :

pipeline = [{'$match':{"user.statuses_count":{"$gt":99},"user.time_zone":"brasilia"}},             {"$group":{"_id": "$user.followers_count","count" :{"$sum":1}}},             {"$sort":{"count":-1}} ]

i adapted practice problem.

this given example structure :     {     "_id" : objectid("5304e2e3cc9e684aa98bef97"),     "text" : "first week of school on :p",     "in_reply_to_status_id" : null,     "retweet_count" : null,     "contributors" : null,     "created_at" : "thu sep 02 18:11:25 +0000 2010",     "geo" : null,     "source" : "web",     "coordinates" : null,     "in_reply_to_screen_name" : null,     "truncated" : false,     "entities" : {         "user_mentions" : [ ],         "urls" : [ ],         "hashtags" : [ ]     },     "retweeted" : false,     "place" : null,     "user" : {         "friends_count" : 145,         "profile_sidebar_fill_color" : "e5507e",         "location" : "ireland :)",         "verified" : false,         "follow_request_sent" : null,         "favourites_count" : 1,         "profile_sidebar_border_color" : "cc3366",         "profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkhoxzmam_normal.jpg",         "geo_enabled" : false,         "created_at" : "sun may 03 19:51:04 +0000 2009",         "description" : "",         "time_zone" : null,         "url" : null,         "screen_name" : "catherinemull",         "notifications" : null,         "profile_background_color" : "ff6699",         "listed_count" : 77,         "lang" : "en",         "profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",         "statuses_count" : 2475,         "following" : null,         "profile_text_color" : "362720",         "protected" : false,         "show_all_inline_media" : false,         "profile_background_tile" : true,         "name" : "catherine mullane",         "contributors_enabled" : false,         "profile_link_color" : "b40b43",         "followers_count" : 169,         "id" : 37486277,         "profile_use_background_image" : true,         "utc_offset" : null     },     "favorited" : false,     "in_reply_to_user_id" : null,     "id" : numberlong("22819398300") }

can spot mistakes?

suppose have couple of sample documents minimum test case. insert test documents collection in mongoshell:

db.collection.insert([ {     "_id" : objectid("5304e2e3cc9e684aa98bef97"),     "user" : {         "friends_count" : 145,         "statuses_count" : 457,         "screen_name" : "catherinemull",         "time_zone" : "brasilia",         "followers_count" : 169,         "id" : 37486277     },     "id" : numberlong(22819398300) }, {     "_id" : objectid("52fd2490bac3fa1975477702"),     "user" : {         "friends_count" : 145,         "statuses_count" : 12334,         "time_zone" : "brasilia",         "screen_name" : "marble",         "followers_count" : 2597,         "id" : 37486278     },     "id" : numberlong(22819398301) }])

for user followers in timezone "brasilia" , has tweeted 100 or more times, pipeline achieves desired result doesn't use $group operator:

pipeline = [     {         "$match": {             "user.statuses_count": {                 "$gt":99              },              "user.time_zone": "brasilia"         }     },     {         "$project": {                             "followers": "$user.followers_count",             "screen_name": "$user.screen_name",             "tweets": "$user.statuses_count"         }     },     {         "$sort": {              "followers": -1          }     },     {"$limit" : 1} ]

pymongo output:

{u'ok': 1.0,  u'result': [{u'_id': objectid('52fd2490bac3fa1975477702'),               u'followers': 2597,               u'screen_name': u'marble',               u'tweets': 12334}]}

the following aggregation pipeline will give desired result. in pipeline, first stage $match operator filters documents user has got timezone field value "brasilia" , has tweet count (represented statuses_count) greater or equal 100 matched via $gte comparison operator.

the second pipeline stage has $group operator groups filtered documents specified identifier expression $user.id field , applies accumulator expression $max each group on $user.followers_count field greatest number of followers each user. system variable $$root references root document, i.e. top-level document, being processed in $group aggregation pipeline stage, added array field use later on. achieved using $addtoset array operator.

the next pipeline stage $unwinds output document each element in data array processing in next step.

the following pipeline step, $project, transforms each document in stream, adding new fields have values previous stream.

the last 2 pipeline stages $sort , $limit reorders document stream specified sort key followers , returns 1 document contains user highest number of followers.

you final aggregation pipeline should this:

db.collection.aggregate([     {         '$match': {              "user.statuses_count": { "$gte": 100 },             "user.time_zone": "brasilia"         }     },     {         "$group": {             "_id": "$user.id",             "max_followers": { "$max": "$user.followers_count" },             "data": { "$addtoset": "$$root" }         }     },     {         "$unwind": "$data"     },        {         "$project": {             "_id": "$data._id",             "followers": "$max_followers",             "screen_name": "$data.user.screen_name",             "tweets": "$data.user.statuses_count"         }     },      {         "$sort": { "followers": -1 }     },     {         "$limit" : 1     } ])

executing in robomongo gives result

/* 0 */ {     "result" : [          {             "_id" : objectid("52fd2490bac3fa1975477702"),             "followers" : 2597,             "screen_name" : "marble",             "tweets" : 12334         }     ],     "ok" : 1 }

in python, implementation should same:

>>> pipeline = [ ...     {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "brasilia"}}, ...     {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addtoset": "$$roo t" }}}, ...     {"$unwind": "$data"}, ...     {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets":  "$data.user.statuses_count"}}, ...     {"$sort": { "followers": -1 }}, ...     {"$limit" : 1} ... ] >>> >>> doc in collection.aggregate(pipeline): ...     print(doc) ... {u'tweets': 12334.0, u'_id': objectid('52fd2490bac3fa1975477702'), u'followers': 2597.0, u'screen_name': u'marble'} >>>

where

pipeline = [     {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "brasilia"}},     {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addtoset": "$$root" }}},     {"$unwind": "$data"},        {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}},      {"$sort": { "followers": -1 }},     {"$limit" : 1} ]

Search This Blog

Ruby Code

python mongodb $match and $group -

Comments

Post a Comment

Popular posts from this blog

php - failed to open stream: HTTP request failed! HTTP/1.0 400 Bad Request -

command line - Use qwinsta in PowerShell ISE -

java - Show Soft Keyboard when EditText Appears -