MongoDB Query Language Part 2 – Count, Distinct, Group, and Introduction to the Aggregation Framework

What we will cover:

  • Counting documents
  • Selecting distinct values
  • Grouping results
  • Aggregation framework
  • Basic Operators in the Aggregation Framework

In case you missed Part 1 in this series:

  • Make sure you view the guide Load Test Data to load a test dataset so you can follow along with our examples.
  • Download a good MongoDB IDE. I recommend Robo 3T, formerly named Robomongo. https://robomongo.org/   
  • Everything in MongoDB is case sensitive including database names, collection names, users, methods, etc.
  • It is important to familiarize yourself with the structure of a JSON document before you begin. This is a good place to start: http://www.json.org/

Counting documents

Count is an aggregation command (not a part of the aggregation framework, we’ll get to that later) that relies on the runCommand to perform a count in MongoDB. The db.runCommand is a helper used to run specific database commands like count. If we want to count all of the documents in the userprofile collection, we would run the following.

db.runCommand ({  
   count:"userprofile"
})

A more common scenario for counting records would be counting based on the value of something. In the following query we will introduce the “query” option for counting. The “query” option lets us specify a condition (all users older than 35) and count all records that meet this condition.

db.runCommand({  
   count:"userprofile",
   query:{  
      "age":{  
         $gt:35
      }
   }
})

Counting this way is not very productive; we’ll need to tap in to the aggregation framework for more advanced queries. We’ll get to that later in this guide.

Select Distinct Values

Like the counting aggregation, the distinct aggregation can also be run with the db.runCommand helper. There is also a distinct() method we can use; let's look at the two examples below. If we want to return all distinct names in our userprofile collection, we could run the following.

db.runCommand({  
   distinct:"userprofile",
   key:"name"
})

Or we could use the distinct() method:

db.userprofile.distinct("name")

In the first example we call the distinct aggregation with the userprofile collection. We have the key value we want to select distinct against, in this example the names of the users. In the second example we just call the distinct() method on the userprofile collection and use the names value to select our distinct user names.

Grouping Results and the Aggregation Framework

The grouping aggregation, which was used in the same way as the examples above, was deprecated in MongoDB version 3.4. If we want to group results we would instead use the aggregation framework. To be more specific, we would use the aggregation pipeline, which is a framework for data aggregation. The terms aggregation pipeline and aggregation framework are interchangeable, in case you hear it called one or the other. The aggregation framework is modeled on the concept of data processing pipelines, where documents enter a multi-stage pipeline that transforms the documents into aggregated results. Each stage transforms the documents as they pass through the pipeline. MongoDB provides the aggregate() method in the format of db.collection.aggregate(). Aggregation operators like group, count, sum, or max begin with a $ when using the aggregation framework. There are many aggregation pipeline operators--we are going to cover the basics.

You will recall in pervious examples in part 1 of this series, we talked about the find() method, filtering, and projection. That filter (the first set of curly braces following the find method) works the same way the $match operator does in the aggregation framework. Let’s compare a query using the two.

Using the find() method, return all documents where the age of each user is greater than 35.

db. userprofile.find({  
   "age":{  
      $gt:35
   }
})

Now we'll return the same result using the aggregation framework:

db.userprofile.aggregate ([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   }
])

In the find() method we match records in the first set of curly braces (this is the filter), no match keyword is needed in the find() method. In the aggregate() method we match records using the $match operator. Also notice that the aggregate() method begins with an open paren and an open square bracket. Now that we understand the basic format of the aggregation() method let's take the next step. In the following example we are going to return the total balance for users grouped by gender. This will use the $group operator and the $sum operator.

db.userprofile.aggregate([  
   {  
      $group:{  
         _id:"$gender",
         totalBalance:{  
            $sum:"$balance"
         }
      }
   }
])

The $group operator groups documents and then outputs the distinct grouping for each document to the next stage of the pipeline. In the example above, following the $group operator, the first thing you will see is the _id: field followed by $gender. The _id field is a mandatory field when using $group. We can think of  _id as an alias that represents the fields we want to group on. In the example above, we needed to group on the gender field so the value for _id: is "$gender". The $ preceding gender tells the operator to treat $gender as a Literal, which means that $gender evaluates to a field path—it allows us to directly access the gender field in the document. Next in our statement we see totalBalance, which is the alias name we are providing for the sum of balance. We use $sum to calculate a total of the users balance grouped by gender. In order for us to access the balance field in the document we use a Literal $balance like we did with $gender. I would encourage you to look at the results with and without the $ so you can see how it affects the output.

Now let's look at another example, this time grouping on multiple fields. We will take the same query up top but this time we'll group users by their favorite fruit and gender.

db.userprofile.aggregate([  
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         }
      }
   }
])

The main difference between this query the one before it is that the mandatory _id field is now an object, with both fields from our collection we wanted to group on. userGender is the alias for the gender field and favoriteFruits is the alias for the favorite Fruit field. This represents the grouping for the query. Run the query above to see the results and the grouping.

Using $group with embedded objects is a subject we will visit later in this series. There are additional functions needed when working with arrays of objects and the aggregation framework.

Projection

Like we discussed in part one, projection is the process of limiting what fields get returned in our query. In the find() method, the second set of curly braces represents our projection and no keyword is needed. In the aggregation framework, projection is represented by the $project operator.

db.userprofile.aggregate ([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   },
   {  
      $project:{  
         "name":1,
         "email":1,
         "_id":0
      }
   }
])

Counting

There are two ways to count in the aggregation framework, the first is using the $count operator and the second is using the $sum operator. That is not a typo, we can use $sum to return a count of records. Let's look at the different ways we can use count.

db.userprofile.aggregate([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   },
   {  
      $count:"totalCount"
   }
])

In the example above , we are returning a count of all documents where the user's age is greater than 35. This will return the alias name "totalCount" along with the number of records that match our filter. This is simple enough, but what if we want to return more than just the record count? Going back to our example where we grouped on gender and favoriteFruits, let's add a count to this query. We want to know how many records exist for each grouping. This query also returns a totalBalance using $sum, and you are about to see why this second method can be confusing.

db.userprofile.aggregate ([  
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         },
         documentCount:{  
            $sum:1
         }
      }
   }
])

In this example $sum is used both to provide a total balance as well as count the number of documents per grouping. You can see the difference in the syntax surrounding $sum. Where you see $sum:1, all this is doing is summing the value 1 for every record returned. If you change this to the number to 0 it will return 0 for the document count.

Putting it all together

Now we need to put all the pieces together. In the following query we are going to provide a total balance and document count for all active users grouped by gender and favorite fruit and we only want to return the balance and the count.

db.userprofile.aggregate ([  
   {  
      $match:{  
         "isActive":true
      }
   },
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         },
         documentCount:{  
            $sum:1
         }
      }
   },
   {  
      $project:{  
         "_id":0,
         "totalBalance":1,
         "documentCount":1
      }
   }
])

It's important to note the _id filter in this query. In our projection we specify _id:0 so it does not return the results of _id. If we were working with the find() method, this would simply suppress the object id in the result set. When used in the aggregate framework with $group, it's going to suppress the grouping fields. We can't suppress the grouping fields directly by applying a projection on those individual fields, instead we apply the projection on the _id that represents the grouping fields.

In the Part 3 of this series we will continue working with the aggregation framework and begin exploring arrays.

MongoDB Query Language Part 1 – The basics of selecting data from MongoDB

What we will cover:

  • Selecting data
  • Filtering results 
  • Sorting
  • Top N

Before you get started:

  • Make sure you load the test data in the document found here.
  • Download a good MongoDB IDE. I recommend Robo 3T, formerly named Robomongo. https://robomongo.org/
  • Once you install Robo 3T and open up the shell, make sure you set the results mode to text, this will make reading the output easier.
  •  Everything in MongoDB is case sensitive including database names, collection names, users, methods, etc.
  • It is important to familiarize yourself with the structure of a JSON document before you begin. This is a good place to start: http://www.json.org/

Selecting Data

There are several different ways you can query data in MongoDB. It is important to note early on, that there is the standard way of retrieving data and then there is the aggregation framework. This can be confusing especially when you start Googling for answers. Sometimes you will see examples that use the standard method for querying data and other times you will see examples that use the aggregation framework. We are going to start with the basic methods and move on to the aggregation framework later in this series.

There are two standard methods for querying MongoDB data, find() and findOne(). find() returns a cursor of results 20 at a time, you can type “it” to keep iterating through the results; findOne() only returns a single document. It’s important to understand the syntax of the find method. In the example below you will see two sets of curly braces, comma separated, enclosed in a set of parens—here is what all of that means:

Userprofile is our collection (table) and find is the method we will use to select data from the collection followed by a set of parens. Inside your parens the first set of curly braces represents your filters while the second set of curly braces represents the data you want to return. Either set of curly braces can be excluded, in which case both of the statements below are functionally equal. Run either of the find statements below to select all documents from the userprofile collection.

db. userprofile.find( {}, {} )

is equivalent to

db. userprofile.find()

Now let’s look at the findOne() method. This is just like find() except that findOne() only returns a single document whereas find() returns a cursor of documents. Note the letter case, findone() is not the same as findOne(). Run the statement below to select a single document from the userprofile collection.

db.userprofile.findOne()

If we want to query in to an array of objects we can use what’s called dot notation. In our userprofile collection we have an array of objects named friends. If we wanted to filter our results by a value in an embedded object we would use dot notation. In the query below we call the name from the friends array using “friends.name” where the user name is Martha Solis. This returns the document where this name exists in the array. You can use dot notation to drill down multiple levels of embedded arrays.

db. userprofile.find({"friends.name":"April Hammond"})

Filtering

Now that we can return documents from our collection, let’s add a filter to our query. You can think of this as something like a WHERE clause in SQL, in MongoDB we typically refer to this as matching. Using the first set of curly braces we are going to return only documents where the isActive status for the user is true. Notice I am excluding the second set of curly braces, if we want to return the entire document we don’t need to include the second set of curly braces.

db. userprofile.find({"isActive":true})

All of the standard comparison operators are also available to filter on such as greater than, less than, in, etc. If we wanted to find all users where the isActive status is true and their age is greater than 25 we could use the following.

db. userprofile.find({"isActive":true, "age": { $gt: 35 }}) 

Notice that all filters are still contained in that first set of curly braces. Each filter is comma separated and when using a comparison operator you will have another set of curly braces that represent the comparison object. Here is a common list of comparison operators.

$eq – Equals
$gt – Greater Than
$gte – Greater Than or Equals
$lt – Less Than
$lte – Less Than of Equal
$ne – Not Equal
$in – Where Value In
$nin – Where Value Not In

In our last example we are combining two filters to what amounts to, effectively, isActive = true AND age is > 35. Another way to write this is using a logical query operator, in the following example we will use the $and logical operator. Notice the enclosures of the code - the $and comparison object has an array of values it compares. Arrays are always contained in square brackets.

As our scripts get longer, good code formatting will be essential for reading and debugging. I would strongly encourage good formatting practices when writing code for the MongoDB shell or any programming language for that matter.

db.userprofile.find({  
   $and:[  
      {  
         "isActive":true,
         "age":{  
            $gt:35
         }
      }
   ]
})

 

Here is a list of logical operators we can use like the $and operator above:

$or – Performs a logical OR operation
$not – Performs a logical NOT operation
$nor – Performs a logical NOR operation. Only returns the results when all inputs are negative.

Now what if we don’t want the entire document returned but instead only want to return each users name, their status, and their age? Now we are going to use the second set of curly braces to specify the values we want to return. This process of limiting the columns returned is called projection in MongoDB.

db.userprofile.find({  
   $and:[  
      {  
         "isActive":true,
         "age":{  
            $gt:35
         }
      }
   ]
},
{  
   "name":1,
   "age":1,
   "isActive":1,
   "_id":0
})

To select certain name/value pairs we call the name followed by a colon and a 1 or 0, a true/false Boolean. So we are asking for the name, the age, and the isActive values in all documents in the collection. Notice we include “_id”:0, this excludes the system _id from being returned in the query. By default, the system _id will always be returned unless you exclude it as we have done above.

Sorting

Sorting a result is one of the more common tasks in data analysis. We need to be able to order the data in a certain way so that important values are at the top of the result. We can accomplish this using the sort() method. The sort method follows the end of the find() method. You specify the name/value pair you want to sort on and then the direction, ASC ( 1 ) or DESC ( -1 ). Run the statement below to sort by the age in descending order.

db.userprofile.find({  
},
{  
   "_id":0,
   "name":1,
   "age":1

}).sort ({  
   "age":-1
})

In addition to the sort, we use projection in this query but there are no data filters. Notice the empty set of curly bracers before the projection. If you recall the first set of curly braces is the filter, the second set it the projection.

Top N

What if we only want to return the first 5 documents from the collection? We can accomplish this using the limit() method. The limit() method is similar to the TOP function in SQL. The limit method is applied at the end of the find() using .limit(). The limit method is called a cursor method because it is applied to a cursor result, which is what the find() method produces. In the query below, we are only returning the first 5 documents from the table.

db.userprofile.find({  
},
{  
   "_id":0,
   "name":1,
   "gender":1,
   "company":1,
   "email":1
}).limit(5)

Putting it all together

Now we need to put all the pieces together. In the following query we will apply everything we learned in the previous steps. We are going to select the name, age, and email address of the top 5 youngest active users. This will use a filter, a projection, a sort, and limit the number of results to 5.

// Add commets to your code using a double forward slash.  
db.userprofile.find({  
   isActive:true   // This is our filter
},
{  
   "_id":0, 
 "name":1,
   "age":1,
   "email":1
}).sort({ // Here is our sort 
   "age":1
}).limit(5) // And this is limiting the results to 5

This is the first step to querying MongoDB. The above queries are the most basic building blocks for querying MongoDB. Once we get in to the aggregation framework, it gets a little more complex. We will begin to cover the aggregation framework in the next part in this series.