Remove Empty Row from CSV File Using PowerShell in SSIS

When you export data to a flat file destination in SSIS, you end up with an empty row at the end of the file. This is by design: each row is terminated with a CR LF, so the last row of data still has a carriage return and line feed at the end of it, and that is what produces the blank row that follows.

Finding a solution to this problem is not straightforward, likely because it is not really a problem, though not everyone will agree with that opinion. If you have been Googling, you have probably stumbled upon the conditional split solution, but that only works if you are starting from a CSV that already has the trailing blank row and you are writing that data somewhere other than a flat file destination. We want to remove the empty row after we write to the flat file destination.

If you are not a C# or VB developer (nobody should be one of these, actually), then you might struggle with the script task in SSIS. Instead, we are going to invoke a process: PowerShell.exe. Here are the steps you will need to follow:

  • Make sure PowerShell is installed/enabled on the server that the package runs on.
  • On the server where your SSIS package is executed, create the PowerShell script that removes the blank row (below). Replace the value in the $textFile variable with the path to your target file. Save the script in a directory where the SSIS service account can execute and modify the file, and name it PS_RemoveBlankRow.ps1 or whatever you like.
$textFile = 'C:\YourFileShare\YourTargetFile.csv'
[System.IO.File]::WriteAllText($textFile, [System.IO.File]::ReadAllText($textFile) -replace '[\r\n]+$')
  • Before you can remove the empty row from your file, you first need to save the file to its final destination. So work through all the steps in your SSIS package to get your data to the desired flat file.
  • Now we are ready to clean the file. In the control flow tab of your package, add an Execute Process Task. On the Process tab, in the Executable option, type PowerShell.exe; this will invoke PowerShell. In the Arguments option, type the following (below). The -F option is for file, followed by the directory and file name of the PowerShell script we created in the second step. You can leave the rest of the settings at their defaults.

-F C:\MyFileShare\PS_RemoveBlankRow.ps1

  • You might have to grant an exception for executing this script file if you have antivirus software that blocks it. That depends on your environment.

There you go, a simple way to remove the trailing empty row in a CSV.

Connect to Azure CosmosDB with MongoDB API from Studio 3T for Mongo

Let's assume you've created a CosmosDB database with MongoDB API in Azure and you would like to connect to it using Studio 3T for MongoDB. Follow these steps to connect.

In Studio 3T, click the Connect button and then click New Connection—this will open the new connection dialog box. At this point you will also want to connect to Azure and open your CosmosDB database.

img1.jpg

You need to get the connection details for the 3T dialog from your CosmosDB database configuration. In Azure, click on the Azure Cosmos DB link and then click the target database; this should bring you to the database overview. You will see a tree view of options. Under Settings, click the Connection String link; this is where you will get all the details to connect to your CosmosDB database.

In the 3T connection dialog, select Direct Connection for the Connection Type. For the Server, use the HOST name from Azure. CosmosDB does not use the default MongoDB port, so make sure you use the correct port from Azure.
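
For reference, the connection string Azure shows on that page looks roughly like the line below (the account name and key here are placeholders, and the exact query-string options can vary by account). The HOST is the part between the @ and the colon, and the port is the number after the colon.

mongodb://myaccount:<PRIMARY PASSWORD>@myaccount.documents.azure.com:10255/?ssl=true

Once you have finished adding your server name and port, click on the Authentication tab.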

img2.jpg

For the Authentication Mode, select the Basic MongoDB type. For the User name, copy the USERNAME from Azure. For the Password, copy the PRIMARY PASSWORD from Azure. The Authentication DB is admin, and you can optionally configure the "Manually list visible databases" option as seen in the screenshot above. Now we are ready to move on to the SSL tab.

img3.jpg

Check the box to "Use SSL protocol to connect" and then click the "Accept server SSL certificates trusted by the operating system" option.

Save the connection and then click connect.

Default Schema and Schema Scope in Stored Procedures

When working in SQL Server, if you call an object--EXEC, SELECT FROM, etc.--you can do so without specifying a schema and it will default to dbo (unless you have assigned a different default schema to that user, though it is more common not to). What happens if you have a procedure in, let's say, a dev schema and you are calling tables from both the dev and dbo schemas inside it? If you are including the schema names in your query <SchemaName.TableName>, then everything will work as expected. But what happens if you exclude the schema from a table you are calling, in hopes that it will return dbo? That depends on where the table exists (in dev, dbo, or both) and what schema the calling object belongs to.

I wanted to put together a script that explains the behavior of default schema and schema scope in stored procedures. Let's walk through the following script for a more meaningful explanation.

--======================================================
-- Default Schema and Schema Scope in Stored Procedures 
--======================================================

-- Create a dev schema if you don't have one already

CREATE SCHEMA [dev] AUTHORIZATION [dbo];
GO

--DROP TABLE IF EXISTS dbo.ordertest;
--DROP TABLE IF EXISTS dev.ordertest;

-- Create a table in the dbo schema

CREATE TABLE dbo.ordertest (
ID smallint,
Targets varchar(20) );

INSERT INTO dbo.ordertest
SELECT 1, 'dbo';

-- Create the same table, this time in the dev schema

CREATE TABLE dev.ordertest (
ID smallint,
Targets varchar(20) );

INSERT INTO dev.ordertest
SELECT 1, 'dev';
GO

-- Notice we populate dbo.ordertest with the Targets value of dbo and dev.ordertest with dev

-- Now we need to create a stored procedure in the dev schema. We create this without specifying the schema in our SELECT statement.

CREATE PROCEDURE dev.TestSchema AS
    SET NOCOUNT ON

        BEGIN
        SELECT * FROM ordertest -- No schema specified here
        END;
GO

-- Now lets do a simple query against ordertest without specifying a schema

SELECT * FROM ordertest

-- The result is dbo, because dbo is the default schema in SQL

-- This time we specify the dev schema and get the value from our dev table

SELECT * FROM dev.ordertest

-- What happens when we call our stored procedure in dev, which performs a SELECT against the table without a schema?

EXEC dev.TestSchema;

-- Returns dev, because the scope of the stored procedure is the dev schema, so an unqualified table name resolves first to the schema of the calling object.

So if we have two tables with the same name in a SQL database, belonging to two different schemas, and you reference one of them in a SELECT without specifying the schema, you will get the table in the dbo schema because that is the default schema in SQL Server. If you take that same SELECT, still without a schema, and drop it into a stored procedure that belongs to the dev schema, executing that stored procedure will return the results from the dev table, because dev is the schema scope for that procedure. This can be confusing if you are expecting the result from dbo, since that is the behavior you are used to. Now what happens if we delete the table in dev and call the procedure again?

-- What happens if we delete the dev table and then run the procedure again?

DROP TABLE IF EXISTS dev.ordertest;

EXEC dev.TestSchema;

-- Now it returns dbo.

Now the query returns the result from dbo because the table no longer exists in the dev schema (the scope of the stored procedure), so it once again defaults to dbo. One more reason to always include schema names in all of our queries; that way you never run into the situation we have illustrated here.
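
As a final illustration, here is a schema-qualified version of the procedure from the script above (a small sketch using the same objects). With the schema spelled out, the result no longer depends on the procedure's schema scope.

-- Schema-qualified version: always reads from dbo no matter which schema owns the procedure
ALTER PROCEDURE dev.TestSchema AS
    SET NOCOUNT ON

        BEGIN
        SELECT * FROM dbo.ordertest -- Schema specified, no ambiguity
        END;
GO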


    

Upgrade SQL Server SSRS to Power BI Report Server

I wanted to make a quick guide on upgrading an SSRS Report Server to Power BI Report Server. This is an in-place upgrade; we will replace the existing SSRS Report Server with Power BI Report Server. These are the upgrade steps for SQL 2016--I can't say how different this process is on an older version, though I can't imagine it would be much. Power BI Report Server is a standalone product now. Here are the steps you will need to follow.

  • Take a full backup of your ReportServer and ReportServerTempDB databases (see the T-SQL sketch after this list).
  • Backup the SSRS Encryption Key and make sure you save the password you used to back it up.
  • Stop the SSRS Report Server service
  • Detach the ReportServer and ReportServerTempDB databases (we want these in case we need to roll back; your backups would work also, up to you).
  • Restore the ReportServer and ReportServerTempDB databases you backed up. Keep the same database names, but give them new physical file names, for example, PowerBIReportServer.mdf.
  • Install the PowerBI Report Server
  • Configure the PowerBI Report Server.
    • Start the service
    • When you configure the HTTP reservations, it will override the existing SSRS Report Server reservations. That is fine and what you want; you will see a warning when this happens, which you can ignore. If you had to roll back, uninstalling Power BI Report Server would release those reservations back to SSRS.
    • Point the service at the existing databases you just restored. An important note here: because we are effectively restoring the SSRS reporting databases to the Power BI Report Server, it's important that you don't change the names of these databases. Your SSRS report subscriptions will be looking for the original database names; if they are changed, your subscriptions will not work.
    • Make sure you configure your email settings on the new server. If you fail to do this, you will get an error when you try to open existing subscriptions on the server, because it is expecting those mail settings.
  • Once you have finished configuring the server, restore the SSRS encryption key you backed up. This will restart the service, at which point the upgrade is complete.
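
For the backup step at the top of this list, a minimal T-SQL sketch is below (the backup paths are placeholders, adjust them to your environment). The encryption key itself is backed up through the Reporting Services Configuration Manager rather than T-SQL.

BACKUP DATABASE ReportServer
    TO DISK = N'D:\Backups\ReportServer.bak' WITH INIT;

BACKUP DATABASE ReportServerTempDB
    TO DISK = N'D:\Backups\ReportServerTempDB.bak' WITH INIT;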

Not a lot to it, but there were a few things not mentioned in the Microsoft guide, which is why I wanted to write this up: keeping the database names the same, and the issue with email configuration and subscriptions. It's a pretty simple upgrade, and Power BI Report Server is a nice addition to the Microsoft BI stack.

Current Year, Previous Year, The Missing Dates Problem, and LAG

When working with data that has date gaps, we need to have different ways of filling in the missing dates. This could be something like previous revenue where you want to see what a customer purchased last year but they don't have a sale in the current year. Or maybe you want to carry a value forward over a certain period where you are missing dates, like a rate value over a holiday. I want to go over several different scenarios and ways to solve these problems. We will look at issues with LAG when dealing with missing years, single missing year vs. multiple missing years, and the different methods we can use to fill in the gaps.

The first problem we will look at is the previous year issue. We want to see revenue totals and previous year's revenue totals for customers, even if they don't have a sale in the current year. For simplicity, our sample data only contains a single customer, but the solutions here have proved performant across larger data sets. We have the customer, sale year, trimester, and revenue. You will see that the customer has no sales in 2013 or 2017, so we can address the missing years. Sales are mixed across different trimesters in different years to provide all the different previous year scenarios.

DROP TABLE IF EXISTS #Sales;

CREATE TABLE #Sales (
     CustomerNumber int
    ,SaleYear int
    ,Trimester int
    ,Revenue decimal(7,2) );

INSERT INTO #Sales
SELECT 333, 2012, 1, 25.00 UNION ALL SELECT 333, 2012, 1, 44.00 UNION ALL SELECT 333, 2012, 3, 18.00 UNION ALL SELECT 333, 2012, 3, 12.00 UNION ALL
SELECT 333, 2014, 1, 18.00 UNION ALL SELECT 333, 2014, 1, 24.00 UNION ALL SELECT 333, 2014, 3, 15.00 UNION ALL SELECT 333, 2014, 3, 24.00 UNION ALL
SELECT 333, 2015, 1, 56.00 UNION ALL SELECT 333, 2015, 2, 6.00  UNION ALL SELECT 333, 2015, 3, 31.00 UNION ALL
SELECT 333, 2016, 1, 43.00 UNION ALL SELECT 333, 2016, 1, 11.00 UNION ALL SELECT 333, 2016, 2, 36.00 UNION ALL SELECT 333, 2016, 3, 31.00 UNION ALL
SELECT 333, 2018, 1, 29.00 UNION ALL SELECT 333, 2018, 1, 33.00 UNION ALL SELECT 333, 2018, 2, 17.00 UNION ALL SELECT 333, 2018, 3, 16.00 ;

In our first example we look at how to get the previous year, year over year, without addressing the missing year, using the self-join method. Because our customer has multiple sales in the same year and trimester, we summarize the data in a CTE before doing our self-join for the previous year.

-- Get total revenue by Customer, SaleYear, and Trimester

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

-- Then use self join to get revenue for previous year

SELECT 
     c.CustomerNumber
    ,c.SaleYear
    ,c.Trimester
    ,c.Revenue AS CurrentRevenue
    ,ISNULL(p.Revenue, 0) AS PreviousRevenue
FROM CTE c
LEFT JOIN CTE p ON c.CustomerNumber = p.CustomerNumber AND c.SaleYear = p.SaleYear + 1 AND c.Trimester = p.Trimester
ORDER BY c.Trimester, c.SaleYear;

In the query above we are self-joining the CTE; if we didn't have to summarize the data we could exclude the CTE and just self-join to the table (personal note: I find CTEs more readable for queries like this). The key here is the join condition c.SaleYear = p.SaleYear + 1. How this works: if c.SaleYear = 2016 and p.SaleYear = 2015, adding 1 to p.SaleYear gives 2016, so the join condition evaluates to true, which effectively gives us the previous year record if that record exists. This solution returns the correct value for the previous year when that year is missing, but it does not return the actual missing year (we will get to that shortly). Next, let's take a look at the LAG function.

Anytime we start talking about previous value it's common to think about the LAG function, and while LAG works great with contiguous date periods, it falls apart on data sets that are missing dates. Take for example:

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

SELECT 
     CustomerNumber
    ,SaleYear
    ,Trimester
    ,Revenue
    ,LAG(Revenue, 1, 0) OVER (PARTITION BY CustomerNumber, Trimester ORDER BY Trimester, SaleYear) AS PreviousRevenue 
FROM CTE c
ORDER BY c.Trimester, c.SaleYear

Where dates are contiguous it provides the correct result, but in the case of 2012 to 2014 it shows 2012 as the previous revenue for 2014, and that is not what we are looking for. If you look at the previous revenue from 2014 to 2016 the previous year data is correct, but again it breaks down between 2016 and 2018. In order to use LAG in this scenario we would need to manufacture all possible years and trimesters for each customer. We will look at that in a minute, but first let's look at a better way to fill in the missing years for the previous year problem.

In the following example we are again using a self-join, but this time a FULL OUTER JOIN, to create records for missing years where there was a sale in the previous year but not the current year. Remember, we have no sales in 2013, but we still want to see previous year sales from 2012 on a line for 2013, even though there are no sales in 2013.

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

-- Then use self join to get revenue for previous year

SELECT 
     COALESCE(c.CustomerNumber, p.CustomerNumber) AS CustomerNumber
    ,COALESCE(c.SaleYear, p.SaleYear + 1) AS SaleYear
    ,COALESCE(c.Trimester, p.Trimester) AS Trimester
    ,ISNULL(c.Revenue, 0) AS CurrentRevenue
    ,ISNULL(p.Revenue, 0) AS PreviousRevenue
FROM CTE c
FULL OUTER JOIN CTE p ON c.CustomerNumber = p.CustomerNumber AND c.SaleYear = p.SaleYear + 1 AND c.Trimester = p.Trimester
ORDER BY COALESCE(c.Trimester, p.Trimester), COALESCE(c.SaleYear, p.SaleYear);

If you look at the data you will see we now have records for the missing years and trimesters, including rows for both 2013 and 2017. This uses the same thinking as the first self-join we looked at, with a few differences. First, we are using a FULL OUTER JOIN instead of the LEFT JOIN. The missing years are filled in within the SELECT using COALESCE on SaleYear, and you can see we are also using SaleYear + 1 here; this will always give us the missing year. We use COALESCE for the customer and trimester to complete the data set. A simple way to get the previous year and fill in missing years.

What if we wanted to do something like this using LAG? Again, we would need to manufacture every possible year and trimester for each customer so we could provide a contiguous date range to LAG over. There are a few different ways to do this: a date table, a numbers table, or a recursive CTE. We are going to demonstrate this with recursive CTEs because they don't require any of those other tables if you don't have them already, though those other options will be more performant in most scenarios.

First we need to get all possible years starting from the first sale year, next we need to get all possible trimesters, then get our revenue totals, and finally, LAG on the previous year.

-- All possible SaleYears
WITH rCTEyr AS (
    SELECT DISTINCT
         CustomerNumber 
        ,MIN(SaleYear) OVER (PARTITION BY CustomerNumber) AS SaleYear
        ,CAST(0.00 AS decimal(7,2)) AS Revenue
    FROM #Sales

    UNION ALL

    SELECT 
         CustomerNumber
        ,SaleYear + 1
        ,Revenue
    FROM rCTEyr
    WHERE SaleYear <= 2017 ),

-- All possible Trimesters
rCTEQtr AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,1 AS Trimester
        ,Revenue
    FROM rCTEyr

    UNION ALL

    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester + 1
        ,Revenue
    FROM rCTEQtr
    WHERE Trimester < 3), 

-- Get our revenue totals 
CTEfinal AS (
    SELECT
         st.CustomerNumber
        ,st.SaleYear
        ,st.Trimester
        ,ISNULL(SUM(s.Revenue), 0) AS Revenue
    FROM rCTEQtr st
    LEFT JOIN #Sales s ON st.CustomerNumber = s.CustomerNumber AND s.SaleYear = st.SaleYear AND s.Trimester = st.Trimester
    GROUP BY st.CustomerNumber, st.SaleYear, st.Trimester )

SELECT DISTINCT
    CustomerNumber
    ,SaleYear
    ,Trimester
    ,Revenue
    ,LAG(Revenue, 1, 0) OVER (PARTITION BY CustomerNumber, Trimester ORDER BY Trimester, SaleYear) AS PreviousRevenue
FROM CTEfinal
ORDER BY Trimester, SaleYear

This creates a record for every possible combination of year and trimester, which is useful, but maybe not so much for the previous year problem. You'll see we now have the correct previous year's revenue as well as the missing years, but when you compare this with the simplicity of the self-join, the self-join is a faster, more efficient solution to this problem. This brings us to our second scenario: multiple missing dates or years.

Imagine you have auto finance rates that change day to day. Rates don't change on holidays, weekends, or special events. You have a report that shows the rate for every day in a given period. We created several gaps in the data set below, gaps which span multiple days. We need to carry our rate forward for each missing day to show the rate didn't change on that day. This is the missing dates issue: we need to roll the value forward for each missing day until we hit the next valid record. We are going to use a different data set this time to illustrate this.

DROP TABLE IF EXISTS #Test; 

CREATE TABLE #Test (
     RateDate date
    ,RateValue decimal(5,4) )

INSERT INTO #Test
SELECT '12/29/16', 1.2266
UNION ALL 
SELECT '12/30/16', 1.2345
UNION ALL
SELECT '01/03/17', 1.2240
UNION ALL
SELECT '01/06/17', 1.1902;

You will see that in the table above we have values for the 29th, 30th, 3rd, and the 6th, but we are missing the dates in between. We need to carry the last rate value forward for each of the missing days. Let's imagine we have a report that accepts two parameters, the start date and the end date. We could generate all the days in that range and then join them to the actual records in the table to effectively create the missing dates and roll the rate values forward. This is another example where you could use a date table, a numbers table, or a recursive CTE; we are using the recursive CTE for convenience.

DECLARE 
     @StartDate date = '12/29/16'
    ,@EndDate date = '01/06/17';

WITH rCTE AS (
    SELECT @StartDate AS RateDate
    UNION ALL 
    SELECT DATEADD(DAY, 1, RateDate) FROM rCTE
    WHERE RateDate < @EndDate )

SELECT 
     r.RateDate
    ,z.RateValue
FROM rCTE r
CROSS APPLY (
    SELECT TOP 1 RateValue
    FROM #Test t
    WHERE t.RateDate <= r.RateDate
    ORDER BY t.RateDate DESC ) AS z

The CROSS APPLY works great here and allows us to carry our rates forward for each day; it is the most performant of the different ways to solve this problem. I did want to include one more option, because it's a clever way to solve this using ROWS UNBOUNDED PRECEDING, though it might not be as performant as the CROSS APPLY.

DECLARE 
     @StartDate date = '12/29/16'
    ,@EndDate date = '01/06/17';

-- Get missing dates
WITH rCTE AS (
    SELECT @StartDate AS RateDate
    UNION ALL 
    SELECT DATEADD(DAY, 1, RateDate) FROM rCTE
    WHERE RateDate < @EndDate ),

-- Join missing dates to result
C1 AS (
    SELECT 
         c.RateDate
        ,RateValue
    FROM rCTE c
    LEFT JOIN #Test t ON c.RateDate = t.RateDate ),

-- Create date groups for your ranges
C2 AS (
    SELECT
         RateDate 
        ,RateValue 
        ,MAX(CASE WHEN RateValue IS NOT NULL THEN RateDate END) OVER(ORDER BY RateDate ROWS UNBOUNDED PRECEDING) AS DateGroup
    FROM C1 )

SELECT
     RateDate
     ,MIN(RateValue) OVER(PARTITION BY DateGroup) AS RateValue
FROM C2

In summary, there are many different ways to solve the missing dates problem. It's about knowing which patterns to use for which problems and ensuring those solutions are performant. The queries here are a great start and give you some options as you begin to explore solutions to the missing dates problem.

Microsoft Power BI Report Server Licensing - On-Prem Power BI

I wanted to make a quick post about the licensing model for Power BI on-prem, how it works with the SSRS Report Server, and a few other details you might want to know:

  • If you currently have a SQL Enterprise license with software assurance and a minimum of 4 cores, you are entitled to the on-prem Power BI Report Server at no additional cost.
  • If you do not have a SQL Enterprise license with SA, you can opt to purchase the standalone Power BI Report Server license for $5,000 a month. I think it is cheaper to buy an Enterprise license with SA.
  • The Microsoft PowerBI on-prem solution is called Power BI Report Server. This can be confusing because we already have SSRS Report Server which is often referred to as "Report Server". The Power BI people at Microsoft are also calling the Power BI Report Server "Report Server"...so much for clarity.
  • When you install Power BI Report Server, it's the same as the SSRS Report Server with the added support of publishing Power BI reports to it. When you install a new instance of Power BI Report Server, you will want to back up and restore your existing SSRS Report Server databases to the Power BI Report Server. This preserves all SSRS reporting while adding the option to publish and consume Power BI reports from "Report Manager".
  • The current version of Power BI Report Server only works with SSAS; it has no support for other data sources at this time.
  • There is no dashboard support in the initial release of Power BI Report Server.

Connect to MongoDB Atlas Using Robo 3T (Robomongo)

If you use Robomongo you probably already know that Robomongo was acquired by 3T Software. With their first release, Robo 3T v1.1, we now have support for MongoDB 3.4 and some new features for working with your clusters in Atlas. Here is a quick guide on using Robo 3T to connect to your MongoDB 3.4 replica set in Atlas.

We are going to assume you have already set up your cluster in Atlas. When you went through the configuration the first time you should have been asked to set up your Atlas admin account; we will use this to connect to the MongoDB instance.

Open Robo 3T, click the File option, and then click Connect. Click the Create button at the top of this window to create a new connection. For the Type, we are going to select Replica Set. Give it a friendly name and then enter the members of your replica set in Atlas.

IMPORTANT - As of this writing, you can only connect with the Replica Set connection type if you are connecting as an Atlas admin, which is what we are doing in the first image below. If you want to connect as a basic user, maybe a user that has only read access to a specific database, you would change the connection type to Direct Connection and use the primary member to connect. See the second image that follows for reference. All other steps in this guide are the same for either connection type.

You can find the node details in Atlas by clicking on Clusters and then clicking the Data button. Scroll down to the section titled "Connect with the Mongo Shell". If you look at the connection string on this page and then look at the example in the images above, you can see which parts we parse out for the individual replica set members. Find your replica set members and enter them like you see in the images above. The set name is optional.
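
For reference, the shell connection string Atlas displays looks something like the line below (the cluster hostnames here are made up). The comma-separated host:port pairs are the replica set members you enter in Robo 3T, and the replicaSet value is the optional set name.

mongodb://cluster0-shard-00-00-abcde.mongodb.net:27017,cluster0-shard-00-01-abcde.mongodb.net:27017,cluster0-shard-00-02-abcde.mongodb.net:27017/test?ssl=true&replicaSet=Cluster0-shard-0&authSource=admin

With the members entered, we are ready to click on the Authentication tab up top.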

On the Authentication tab, check the Perform Authentication checkbox. Our Atlas admin account is automatically added to the admin database when we stand up our first cluster, so you shouldn't have to add it. For the user credentials, enter your Atlas admin account and password, and select SCRAM-SHA-1 for the Auth Mechanism. If you were using a basic user account, you would still authenticate with the admin database, since all users authenticate with the admin db in Atlas. Another important note: all users need to be created in the Atlas security interface, you cannot create users through the shell.

Robo3T_Connect2.jpg

We can skip the SSH tab, so let's click on the SSL tab. Check the box "Use SSL protocol to connect". For your authentication method, select Self-signed Certificate.

Finally, click on the Advanced tab. For the Default Database we are going to use the admin database. 

Click the Save button and then click Connect to connect to your Atlas replica set. At this point you should see all 3 nodes of your replica set under the Replica Set folder, and under the System folder you should see your system databases. If you have created other databases you will see them listed below these folders.

IMPORTANT - If you are connecting as a user with limited access, such as a read-only user with access to a single database, when you connect to your Mongo Atlas instance you might not actually see the databases you are permissioned to. You can still type "use <database name>" to switch to the database you have permissions in, but you might not see it listed in the Robo 3T interface.

MongoDB Query Language Part 2 – Count, Distinct, Group, and Introduction to the Aggregation Framework

What we will cover:

  • Counting documents
  • Selecting distinct values
  • Grouping results
  • Aggregation framework
  • Basic Operators in the Aggregation Framework

In case you missed Part 1 in this series:

  • Make sure you view the guide Load Test Data to load a test dataset so you can follow along with our examples.
  • Download a good MongoDB IDE. I recommend Robo 3T, formerly named Robomongo. https://robomongo.org/   
  • Everything in MongoDB is case sensitive including database names, collection names, users, methods, etc.
  • It is important to familiarize yourself with the structure of a JSON document before you begin. This is a good place to start: http://www.json.org/

Counting documents

Count is an aggregation command (not a part of the aggregation framework, we’ll get to that later) that relies on the runCommand to perform a count in MongoDB. The db.runCommand is a helper used to run specific database commands like count. If we want to count all of the documents in the userprofile collection, we would run the following.

db.runCommand ({  
   count:"userprofile"
})

A more common scenario for counting records would be counting based on the value of something. In the following query we will introduce the “query” option for counting. The “query” option lets us specify a condition (all users older than 35) and count all records that meet this condition.

db.runCommand({  
   count:"userprofile",
   query:{  
      "age":{  
         $gt:35
      }
   }
})

Counting this way is not very productive; we’ll need to tap in to the aggregation framework for more advanced queries. We’ll get to that later in this guide.

Select Distinct Values

Like the counting aggregation, the distinct aggregation can also be run with the db.runCommand helper. There is also a distinct() method we can use; let's look at the two examples below. If we want to return all distinct names in our userprofile collection, we could run the following.

db.runCommand({  
   distinct:"userprofile",
   key:"name"
})

Or we could use the distinct() method:

db.userprofile.distinct("name")

In the first example we call the distinct aggregation with the userprofile collection. We have the key value we want to select distinct against, in this example the names of the users. In the second example we just call the distinct() method on the userprofile collection and use the names value to select our distinct user names.

Grouping Results and the Aggregation Framework

The grouping aggregation, which was used in the same way as the examples above, was deprecated in MongoDB version 3.4. If we want to group results we would instead use the aggregation framework. To be more specific, we would use the aggregation pipeline, which is a framework for data aggregation. The terms aggregation pipeline and aggregation framework are interchangeable, in case you hear it called one or the other. The aggregation framework is modeled on the concept of data processing pipelines, where documents enter a multi-stage pipeline that transforms the documents into aggregated results. Each stage transforms the documents as they pass through the pipeline. MongoDB provides the aggregate() method in the format of db.collection.aggregate(). Aggregation operators like group, count, sum, or max begin with a $ when using the aggregation framework. There are many aggregation pipeline operators--we are going to cover the basics.

You will recall that in previous examples in part 1 of this series, we talked about the find() method, filtering, and projection. That filter (the first set of curly braces following the find method) works the same way the $match operator does in the aggregation framework. Let's compare a query using the two.

Using the find() method, return all documents where the age of each user is greater than 35.

db.userprofile.find({  
   "age":{  
      $gt:35
   }
})

Now we'll return the same result using the aggregation framework:

db.userprofile.aggregate ([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   }
])

In the find() method we match records in the first set of curly braces (this is the filter); no match keyword is needed in the find() method. In the aggregate() method we match records using the $match operator. Also notice that the aggregate() method begins with an open paren and an open square bracket. Now that we understand the basic format of the aggregate() method, let's take the next step. In the following example we are going to return the total balance for users grouped by gender. This will use the $group operator and the $sum operator.

db.userprofile.aggregate([  
   {  
      $group:{  
         _id:"$gender",
         totalBalance:{  
            $sum:"$balance"
         }
      }
   }
])

The $group operator groups documents and then outputs the distinct grouping for each document to the next stage of the pipeline. In the example above, following the $group operator, the first thing you will see is the _id: field followed by $gender. The _id field is a mandatory field when using $group. We can think of _id as an alias that represents the fields we want to group on. In the example above, we needed to group on the gender field, so the value for _id: is "$gender". The $ preceding gender tells the operator to treat $gender as a literal, which means that $gender evaluates to a field path; it allows us to directly access the gender field in the document. Next in our statement we see totalBalance, which is the alias name we are providing for the sum of balance. We use $sum to calculate a total of the users' balance grouped by gender. In order to access the balance field in the document we use the field path $balance, just like we did with $gender. I would encourage you to look at the results with and without the $ so you can see how it affects the output.
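
For example, here is a quick sketch of what happens without the $ (this query is not part of the original example, just an illustration): the value is treated as a plain string constant instead of a field path, so every document falls into a single group whose _id is the literal string "gender".

db.userprofile.aggregate([  
   {  
      $group:{  
         _id:"gender",   // no $, so this is just the constant string "gender"
         totalBalance:{  
            $sum:"$balance"
         }
      }
   }
])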

Now let's look at another example, this time grouping on multiple fields. We will take the same query up top but this time we'll group users by their favorite fruit and gender.

db.userprofile.aggregate([  
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         }
      }
   }
])

The main difference between this query and the one before it is that the mandatory _id field is now an object, containing both fields from our collection that we wanted to group on. userGender is the alias for the gender field and favoriteFruits is the alias for the favoriteFruit field. This represents the grouping for the query. Run the query above to see the results and the grouping.

Using $group with embedded objects is a subject we will visit later in this series. There are additional functions needed when working with arrays of objects and the aggregation framework.

Projection

Like we discussed in part one, projection is the process of limiting what fields get returned in our query. In the find() method, the second set of curly braces represents our projection and no keyword is needed. In the aggregation framework, projection is represented by the $project operator.

db.userprofile.aggregate ([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   },
   {  
      $project:{  
         "name":1,
         "email":1,
         "_id":0
      }
   }
])

Counting

There are two ways to count in the aggregation framework: the first is using the $count operator and the second is using the $sum operator. That is not a typo; we can use $sum to return a count of records. Let's look at the different ways we can use count.

db.userprofile.aggregate([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   },
   {  
      $count:"totalCount"
   }
])

In the example above, we are returning a count of all documents where the user's age is greater than 35. This will return the alias name "totalCount" along with the number of records that match our filter. This is simple enough, but what if we want to return more than just the record count? Going back to our example where we grouped on gender and favoriteFruits, let's add a count to this query. We want to know how many records exist for each grouping. This query also returns totalBalance using $sum, and you are about to see why this second method can be confusing.

db.userprofile.aggregate ([  
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         },
         documentCount:{  
            $sum:1
         }
      }
   }
])

In this example $sum is used both to provide a total balance and to count the number of documents per grouping. You can see the difference in the syntax surrounding $sum. Where you see $sum:1, all this is doing is summing the value 1 for every record returned. If you change this number to 0, it will return 0 for the document count.

Putting it all together

Now we need to put all the pieces together. In the following query we are going to provide a total balance and document count for all active users, grouped by gender and favorite fruit, and we only want to return the balance and the count.

db.userprofile.aggregate ([  
   {  
      $match:{  
         "isActive":true
      }
   },
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         },
         documentCount:{  
            $sum:1
         }
      }
   },
   {  
      $project:{  
         "_id":0,
         "totalBalance":1,
         "documentCount":1
      }
   }
])

It's important to note the _id field in the projection of this query. We specify _id:0 so it does not return the contents of _id. If we were working with the find() method, this would simply suppress the object id in the result set. When used in the aggregation framework with $group, it's going to suppress the grouping fields. We can't suppress the grouping fields directly by applying a projection on those individual fields; instead we apply the projection on the _id that represents the grouping fields.
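
If you do want the grouping fields in the output under friendlier names, one option (a sketch, not part of the original example) is to promote them out of _id in the $project stage using their field paths, and then suppress _id as before.

db.userprofile.aggregate([  
   {  
      $match:{ "isActive":true }
   },
   {  
      $group:{  
         _id:{ userGender:"$gender", favoriteFruits:"$favoriteFruit" },
         totalBalance:{ $sum:"$balance" },
         documentCount:{ $sum:1 }
      }
   },
   {  
      $project:{  
         "_id":0,
         "gender":"$_id.userGender",           // promote the grouping fields to top-level fields
         "favoriteFruit":"$_id.favoriteFruits",
         "totalBalance":1,
         "documentCount":1
      }
   }
])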

In Part 3 of this series we will continue working with the aggregation framework and begin exploring arrays.

MongoDB Query Language Part 1 – The basics of selecting data from MongoDB

What we will cover:

  • Selecting data
  • Filtering results 
  • Sorting
  • Top N

Before you get started:

  • Make sure you load the test data in the document found here.
  • Download a good MongoDB IDE. I recommend Robo 3T, formerly named Robomongo. https://robomongo.org/
  • Once you install Robo 3T and open up the shell, make sure you set the results mode to text; this will make reading the output easier.
  •  Everything in MongoDB is case sensitive including database names, collection names, users, methods, etc.
  • It is important to familiarize yourself with the structure of a JSON document before you begin. This is a good place to start: http://www.json.org/

Selecting Data

There are several different ways you can query data in MongoDB. It is important to note early on, that there is the standard way of retrieving data and then there is the aggregation framework. This can be confusing especially when you start Googling for answers. Sometimes you will see examples that use the standard method for querying data and other times you will see examples that use the aggregation framework. We are going to start with the basic methods and move on to the aggregation framework later in this series.

There are two standard methods for querying MongoDB data: find() and findOne(). find() returns a cursor of results 20 at a time (you can type "it" to keep iterating through the results); findOne() only returns a single document. It's important to understand the syntax of the find method. In the example below you will see two sets of curly braces, comma separated, enclosed in a set of parens. Here is what all of that means:

userprofile is our collection (table) and find is the method we will use to select data from the collection, followed by a set of parens. Inside the parens, the first set of curly braces represents your filters while the second set of curly braces represents the data you want to return. Either set of curly braces can be excluded, in which case both of the statements below are functionally equal. Run either of the find statements below to select all documents from the userprofile collection.

db.userprofile.find( {}, {} )

is equivalent to

db.userprofile.find()

Now let’s look at the findOne() method. This is just like find() except that findOne() only returns a single document whereas find() returns a cursor of documents. Note the letter case, findone() is not the same as findOne(). Run the statement below to select a single document from the userprofile collection.

db.userprofile.findOne()

If we want to query into an array of objects we can use what's called dot notation. In our userprofile collection we have an array of objects named friends. If we wanted to filter our results by a value in an embedded object we would use dot notation. In the query below we match on the name from the friends array using "friends.name", where the friend's name is April Hammond. This returns the documents where this name exists in the array. You can use dot notation to drill down multiple levels of embedded arrays.

db.userprofile.find({"friends.name":"April Hammond"})

Filtering

Now that we can return documents from our collection, let's add a filter to our query. You can think of this as something like a WHERE clause in SQL; in MongoDB we typically refer to this as matching. Using the first set of curly braces, we are going to return only documents where the isActive status for the user is true. Notice I am excluding the second set of curly braces; if we want to return the entire document we don't need to include it.

db.userprofile.find({"isActive":true})

All of the standard comparison operators are also available to filter on, such as greater than, less than, in, etc. If we wanted to find all users where the isActive status is true and their age is greater than 35, we could use the following.

db.userprofile.find({"isActive":true, "age": { $gt: 35 }})

Notice that all filters are still contained in that first set of curly braces. Each filter is comma separated, and when using a comparison operator you will have another set of curly braces that represents the comparison object. Here is a common list of comparison operators.

$eq – Equals
$gt – Greater Than
$gte – Greater Than or Equals
$lt – Less Than
$lte – Less Than or Equal
$ne – Not Equal
$in – Where Value In
$nin – Where Value Not In
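
As a quick sketch of one of these, $in matches any value in a list (the fruit values below are just examples, substitute values that exist in your collection):

// Return users whose favorite fruit is either apple or banana
db.userprofile.find({ "favoriteFruit": { $in: [ "apple", "banana" ] } })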

Going back to the isActive and age example above, we combined two filters into what amounts to, effectively, isActive = true AND age > 35. Another way to write this is with a logical query operator; in the following example we will use the $and operator. Notice the enclosures in the code: the $and operator takes an array of conditions it compares. Arrays are always contained in square brackets.

As our scripts get longer, good code formatting will be essential for reading and debugging. I would strongly encourage good formatting practices when writing code for the MongoDB shell or any programming language for that matter.

db.userprofile.find({  
   $and:[  
      { "isActive":true },
      { "age":{ $gt:35 } }
   ]
})

 

Here is a list of logical operators we can use like the $and operator above:

$or – Performs a logical OR operation
$not – Performs a logical NOT operation
$nor – Performs a logical NOR operation; returns documents that fail to match all of the clauses.

Now what if we don't want the entire document returned, but instead only want each user's name, status, and age? Now we are going to use the second set of curly braces to specify the values we want to return. This process of limiting the fields returned is called projection in MongoDB.

db.userprofile.find({  
   $and:[  
      { "isActive":true },
      { "age":{ $gt:35 } }
   ]
},
{  
   "name":1,
   "age":1,
   "isActive":1,
   "_id":0
})

To select certain name/value pairs we list the name followed by a colon and a 1 or 0, a true/false Boolean. So we are asking for the name, the age, and the isActive values in all documents in the collection. Notice we include "_id":0; this excludes the system _id from being returned by the query. By default, the system _id will always be returned unless you exclude it as we have done above.

Sorting

Sorting a result is one of the more common tasks in data analysis. We need to be able to order the data in a certain way so that important values are at the top of the result. We can accomplish this using the sort() method, which follows the end of the find() method. You specify the name/value pair you want to sort on and then the direction, ASC ( 1 ) or DESC ( -1 ). Run the statement below to sort by age in descending order.

db.userprofile.find({  
},
{  
   "_id":0,
   "name":1,
   "age":1

}).sort ({  
   "age":-1
})

In addition to the sort, we use projection in this query, but there are no data filters. Notice the empty set of curly braces before the projection. If you recall, the first set of curly braces is the filter and the second set is the projection.

Top N

What if we only want to return the first 5 documents from the collection? We can accomplish this using the limit() method, which is similar to TOP in SQL. The limit method is applied at the end of the find() using .limit(). The limit method is called a cursor method because it is applied to a cursor result, which is what the find() method produces. In the query below, we are only returning the first 5 documents from the collection.

db.userprofile.find({  
},
{  
   "_id":0,
   "name":1,
   "gender":1,
   "company":1,
   "email":1
}).limit(5)

Putting it all together

Now we need to put all the pieces together. In the following query we will apply everything we learned in the previous steps. We are going to select the name, age, and email address of the 5 youngest active users. This will use a filter, a projection, a sort, and a limit of 5 results.

// Add comments to your code using a double forward slash.  
db.userprofile.find({  
   isActive:true   // This is our filter
},
{  
   "_id":0, 
 "name":1,
   "age":1,
   "email":1
}).sort({ // Here is our sort 
   "age":1
}).limit(5) // And this is limiting the results to 5

This is the first step to querying MongoDB. The above queries are the most basic building blocks for querying MongoDB. Once we get into the aggregation framework, it gets a little more complex. We will begin to cover the aggregation framework in the next part of this series.

Database Backup, Backup Automation, and Restore in MongoDB

Backup a single database using mongodump

Many of the examples you will find online for backing up a MongoDB database don't use authentication. If you have enabled auth on your MongoDB instance, you will need to do one of two things when backing up a database: either specify a user from the target database you are backing up and authenticate with that, or use an admin or backup account and authenticate with that instead. Notice I am calling out the admin database as my authentication database. The user here would be some kind of server-scoped admin with permissions to back up all databases, and they would reside in the admin database. This approach especially makes sense when we get into automating the backup routine for all databases in our instance. If you have not enabled auth in your instance, you can exclude the user, password, and authentication database, but I would strongly suggest you look into setting up authentication.

Without compression:

mongodump -h myMongoHost:27017 -d myDatabase -u myAdminUser -p myPassword --authenticationDatabase admin -o /data/backups/MyDatabaseBackup

With compression:

mongodump -h myMongoHost:27017 -d myDatabase -u myAdminUser -p myPassword --authenticationDatabase admin --gzip --archive=/data/backups/MyDatabaseNameBackup

Here are the options we used:

-h is the host name
-u is the username you want to log in with
-p is the password
-d is the database you want to backup
-o is the location you want to backup to ( without compression )

Compression options:

--gzip tells the backup to use compression
--archive is the location of the compressed backup

* If you exclude the -d flag, mongodump will back up all databases.
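
For example, a full-instance dump with compression might look like the following (same placeholder host and credentials as above):

mongodump -h myMongoHost:27017 -u myAdminUser -p myPassword --authenticationDatabase admin --gzip --archive=/data/backups/AllDatabasesBackup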

Automating MongoDB Backups

In order to automate the backup of databases in MongoDB, we'll need to create a shell script that can be run from a Linux cron job. So pick a directory and let's create the script. You can use whatever file editor you choose; I typically use vim. Modify the contents below, then copy and paste them into the new file you created.

#!/bin/bash
# Use mongodump to backup all databases in a Mongo instance

# Variables
MONGOHOST="myMongoServer"
MONGOPORT="27017"
MONGODUMP="/usr/bin/mongodump"
BACKUPDIR="/data/backups/"
TIMESTAMP=`date +%F-%H%M`

# Backup commands, one per database
$MONGODUMP -h $MONGOHOST:$MONGOPORT -d StackExchange -u myAdmin -p myPassword --authenticationDatabase admin --gzip --archive=${BACKUPDIR}StackExchange-$TIMESTAMP

$MONGODUMP -h $MONGOHOST:$MONGOPORT -d AdventureWorks -u myAdmin -p myPassword --authenticationDatabase admin --gzip --archive=${BACKUPDIR}AdventureWorks-$TIMESTAMP

$MONGODUMP -h $MONGOHOST:$MONGOPORT -d Census -u myAdmin -p myPassword --authenticationDatabase admin --gzip --archive=${BACKUPDIR}Census-$TIMESTAMP

# File cleanup, delete backups older than 3 days
find $BACKUPDIR -type f -mtime +2 -delete

The variables are things like your Mongo server name, port, backup directory, etc. The backup commands apply the variables, and there is one command for each database we want to back up; in this example we have three databases. Finally, we have a cleanup command that will delete backup files older than 3 days. Once you have modified and saved the file, we need to give execute permission to the script we just created. We can do this using chmod, which allows us to change the access permissions of file system objects. Modify this according to where you created your file.

chmod +x /home/YourDirectory/mongo_backups.sh

Next we need to create a cronjob to run the script nightly. Open the crontab file as the root user, we can sudo with the -u flag to access as root.

sudo crontab -e -u root

Following the last line of comments in the crontab file, add the following line to run your script every day at 3:25 PM. You can modify the numbers preceding the script path to your preferred backup time.

25 15 * * * /home/YourDirectory/mongo_backups.sh

These fields follow the order minute, hour, day of month, month, day of week. An asterisk means it will run on every division of that interval. So our job is set to run daily at 3:25 PM.
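
For instance, here are a couple of alternative schedules using the same field layout (illustrative only):

# minute hour day-of-month month day-of-week command
# Every day at 2:00 AM
0 2 * * * /home/YourDirectory/mongo_backups.sh
# Sundays at 11:30 PM (0 = Sunday in the day-of-week field)
30 23 * * 0 /home/YourDirectory/mongo_backups.sh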

If you want to check on the status of a cron job, you can use the following:

sudo grep CRON /var/log/syslog

A quick note on cron permissions: if you set up the cron job under a user other than root, you might have problems running the job, in which case you need to grant explicit permissions to that user. I would recommend you set up cron jobs as the root user.

Restore a single database using mongorestore without compression

As we explained in the database backup section of this guide, when restoring a database using mongorestore you must use an account that has permissions in the backed-up database, or else you need to specify the --authenticationDatabase option. The authentication database is the database you authenticate with. In most cases this would be the admin database, using an account that has admin permissions across the entire environment.

Backup from Server (A) and restore to Server (B), where the database does not exist on Server (B):

mongorestore --host servername:27017 -u myAdminLogin -p myPassword --authenticationDatabase admin -d myDatabase /data/backups/MyDatabaseNameBackup

Backup from Server (A) and restore to Server (B), where the database already exists on Server (B) so we want to replace it. All we need to do is follow the steps from the previous restore and this time include the --drop flag.

mongorestore --host servername:27017 -u myAdminLogin -p myPassword --authenticationDatabase admin --drop -d myDatabase /data/backups/MyDatabaseNameBackup

Restoring a compressed backup using mongorestore

It's important to note that you cannot restore a compressed backup to a database name different from the original database you backed up unless you are running version 3.4+ and use the new ns options. If you are not on 3.4+, you can still restore a compressed backup, but only to the same name as the database you backed up. Be careful not to overwrite an existing database when performing a restore from a compressed backup.

mongorestore --host servername:27017 -u mongoadmin -p mypassword --authenticationDatabase admin --gzip --archive=/data/backups/MyDatabaseNameBackup

Restoring a compressed backup to a new database name (MongoDB 3.4 and higher):

mongorestore --host servername:27017 -u mongoadmin -p mypassword --authenticationDatabase admin --gzip --archive=/data/backups/MyDatabaseNameBackup --nsInclude 'DatabaseName.*' --nsFrom 'DatabaseName.*' --nsTo 'NewDatabaseName.*'

nsInclude, nsFrom, and nsTo are new to MongoDB 3.4. The * is a wildcard for all objects in the target namespace. The example above restores a database named "DatabaseName" to a database named "NewDatabaseName".

Install and Configure MongoDB on Ubuntu

Import the MongoDB public key:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6

Create a list file for MongoDB

echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list

Reload the package database

sudo apt-get update

Install the MongoDB packages

sudo apt-get install -y mongodb-org

Create a new MongoDB systemd service if one doesn't already exist. First check to see if the file already exists:

ls /lib/systemd/system

If the file doesn't exist then create it:

sudo vim /lib/systemd/system/mongod.service

Paste the script below and save the file:

[Unit]
Description=High-performance, schema-free document-oriented database
After=network.target
Documentation=https://docs.mongodb.org/manual

[Service]
User=mongodb
Group=mongodb
ExecStart=/usr/bin/mongod --quiet --config /etc/mongod.conf

[Install]
WantedBy=multi-user.target

*Esc Key*
:w

Update the systemd service

systemctl daemon-reload

Enable the service

systemctl enable mongod

If you start your mongo service at this point (let's wait), you will see warnings like the following:

** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.

MongoDB suggests setting these to 'never'

To address this we are going to create another service that is dependent on the mongod service.

Create service to set transparent_hugepage settings to never

cd /lib/systemd/system
sudo vi mongodb-hugepage-fix.service

Paste the contents below and save the file:

[Unit]
Description="Disable Transparent Hugepage before MongoDB boots"
Before=mongod.service

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/bash -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
RequiredBy=mongod.service

*Esc Key*
:w

Update the systemd service

systemctl daemon-reload

Enable the service

systemctl enable mongodb-hugepage-fix.service

Now we are ready to start the services

systemctl start mongodb-hugepage-fix
systemctl start mongod

Optional, check netstat to see if the service is listening

netstat -plntu

There are a few more things we need to do before our install is complete. We still need to modify our config file which is used to set configuration options for MongoDB, like data directories and user authentication. Before we enable auth we first need to create the mongo admin user with the root role.

Create the mongo admin account

mongo
use admin
db.createUser({ user: "mongoadmin", pwd: "UrStrongPassword", roles: [ "root" ] })

Go ahead and exit mongo and stop the service:

*Ctrl+C*
systemctl stop mongod

Next we are going to modify the config file. There are a few things I want to do here; the most important are enabling authorization, so that users have to provide a username and password to access MongoDB, and making sure my data directories are in the correct place. Let's open the config file (this file should already exist; if not, you'll want to create it). We only want to take a quick look at the file and then close it so we can work on the data directories.

cd /etc
sudo vi mongod.conf
:q!

To see all of the configuration options, check out the MongoDB documentation:

https://docs.mongodb.com/manual/reference/configuration-options/

The first option in the config file we want to look at is the default data directory. The default directory for MongoDB data files is /var/lib/mongodb, but we don't want to store our data files there. Instead we have a RAID 1/0 partition we'd like to use for our data, mounted at /data, so we are going to move the mongo data files there.

Move our mongo data files to a new directory

* Make sure the mongod service is stopped.

cd /var/lib
sudo mv mongodb /data/mongodb
sudo ln -s /data/mongodb /var/lib/mongodb

Now we are ready to get back into the config file:

cd /etc
sudo vi mongod.conf

We need to update the data directory in the config file to point to the directory we moved our data files to in the previous step, and set the network bindIp so we can connect remotely. Below is what the configuration file looks like for those options:

# Set the location for your data
storage:
  dbPath: /data/mongodb
  journal:
    enabled: true
  engine: wiredTiger

# Set the network options
net:
  port: 27017
  bindIp: 0.0.0.0
  ssl:
    mode: disabled

*Esc*
:wq

Enable authentication for the mongod.service

Open the mongod.service file in your editor

sudo vi /lib/systemd/system/mongod.service

Add the --auth option to the ExecStart line in mongod.service and save the file; the entire line should look like this:

ExecStart=/usr/bin/mongod --quiet --auth --config /etc/mongod.conf

*Esc*
:wq

Update the systemd service

systemctl daemon-reload

Now we are ready to start the service

systemctl start mongod

Connect to mongo with the user we created

mongo YourServerName:27017/admin -u mongoadmin -p password

Load JSON Files to a MongoDB Database Using Mongoimport on Linux or Windows

Make sure you are out of the mongo shell before you start. Mongoimport is a stand-alone utility that is not part of the shell.

Linux

Import a single file

mongoimport -d databaseName -c collectionName --file /home/myFiles/myJsonFile.json -u username -p password --authenticationDatabase admin

Import all files from a folder

ls *.json | xargs -I filename mongoimport -d targetDatabase -c targetCollection --file filename -u username -p password --authenticationDatabase admin

Here is another way to import all files using a shell script.

First create a shell file

vim /targetdirectory/filename.sh


Copy the following into the file and save it. Make sure the file has a .sh extension:

#!/bin/bash
# Usage: ./filename.sh <username> <password>
for f in *.json
do
  echo "Processing $f file...";
  mongoimport --file "$f" -d testimport -c testing -u "$1" -p "$2" --authenticationDatabase admin && mv "$f" "$f.processed";
done

Make the file an executable

chmod +x filename.sh

We are now ready to execute the shell script. In the file we created we set two parameters for username and password. This allows us to use security without storing the credentials in the file itself. Call the script and pass in the username and password to import the files.

./filename.sh username password

Windows

When using the Windows CLI to work with mongoimport, make sure you are running these commands from the directory where mongoimport.exe resides, which is usually the MongoDB bin folder. You could also add that folder to your PATH environment variable.

Import a single file

mongoimport --db databaseName --collection collectionName --file C:\MongoDB\myFile.json

For all files in a folder

C:\Users\bhettrick> for %i in (C:\JSONDocuments\tmp\*) do mongoimport --file %i --type json --db MyDB --collection teststore


Windows Batch version:

@echo off
for %%f in (*.json) do (
   "mongoimport.exe" --jsonArray --db MyDB --collection collectionName --file "%%~nf.json"
)

Guide to Working With JSON in SQL 2016 Part 2 - Output and Format

In part 2 of this series we are going to explore the process of getting data out of a SQL database and into a usable JSON document. Like FOR XML in the past, Microsoft introduced the FOR JSON clause to format the result of a query as a JSON document. If you want to produce a flat document, the FOR JSON clause works just fine, but most documents are not flat and will contain nested objects and arrays, so we need to know how to deal with that. That will be the focus of this post.

Part of the trick to outputting a meaningful JSON document is understanding the limitations and behaviors of the FOR JSON clause and how to work around them. We will explain in the examples below.

First we need to create our tables that represent the relational data.

-- Create the relational tables

CREATE TABLE #Products (
   ProductNumber int
  ,ProductName varchar(25)
  ,ProductType varchar(20)
  ,Runtime smallint );

INSERT INTO #Products 
SELECT 45975, 'Spiderman', 'Movie', 110
INSERT INTO #Products 
SELECT 96587, 'Batman', 'Movie', 98
INSERT INTO #Products 
SELECT 25893, 'SuperMan', 'Movie', 102

-- For releases and assets, only include records for a single movie to keep it simple

CREATE TABLE #Releases (
   Country varchar(30)
  ,HomeReleaseDate datetime2(0)
  ,ProductNumber int )

INSERT INTO #Releases 
SELECT 'United States', DATEADD(dd, -329, GETDATE()), 96587;
INSERT INTO #Releases 
SELECT 'France', DATEADD(dd, -256, GETDATE()), 96587;
INSERT INTO #Releases 
SELECT 'Germany', DATEADD(dd, -215, GETDATE()), 96587;

CREATE TABLE #Assets (
   ProductNumber int
  ,Languages varchar(20)
  ,TitleName varchar(50) )

INSERT INTO #Assets 
SELECT 96587, 'English', 'Batman';
INSERT INTO #Assets 
SELECT 96587, 'French', 'LeBat';
INSERT INTO #Assets 
SELECT 96587, 'German', 'Bachman';

Now we need to create a table where we can combine our results. We need to produce multiple JSON outputs for a single record and then combine all of those outputs. We are going to create our combine table and then insert the first record. We will come back and update this record with supporting data for the final output.

CREATE TABLE #Combine (
   Product nvarchar(max)
  ,Releases nvarchar(max)
  ,Assets nvarchar(max) )

INSERT INTO #Combine (Product)
VALUES ((
   SELECT
      ProductNumber
     ,ProductName
     ,ProductType
     ,Runtime
   FROM #Products
   WHERE ProductNumber = 96587
   FOR JSON PATH, WITHOUT_ARRAY_WRAPPER))

The Product field in the #Combine table represents a JSON document for the product 96587, Batman. The Releases field represents an array of embedded objects that is part of the product document, and the same goes for Assets. We are producing three related JSON results and then combining them into one. FOR JSON PATH by itself cannot output the embedded documents the way we want, so we have to do this manually (well, actually it can, kind of, if you use ROOT, but you don't have much control over the output, and when multiple table joins are involved it does what it wants, which is not likely what you want). Also note that in the FOR JSON PATH above we use the WITHOUT_ARRAY_WRAPPER option. This eliminates the square brackets from the JSON output; we only want to do that for the parent document. Now we are going to update the other fields and take a look at what we have so far.

UPDATE #Combine 
   SET Releases = (
      SELECT
         ProductNumber
        ,Country
        ,HomeReleaseDate
       FROM #Releases
FOR JSON PATH, ROOT('Releases'))

UPDATE #Combine
   SET Assets = (
      SELECT 
         Languages
        ,TitleName
      FROM #Assets
FOR JSON AUTO, ROOT('Assets') )

SELECT * FROM #Combine

So we effectively have three JSON documents we need to combine into a single document. To do this we use the following:

SELECT
   STUFF(Product,LEN(Product),1, ',') + 
   STUFF(STUFF(Releases, 1, 1, ''), LEN(Releases) - 1, 1, ',') +
   STUFF(Assets, 1, 1, '')
FROM #Combine

The query above doesn't do anything special except combine the documents. Copy the result and paste it into a JSON validator to see the end result.
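
If everything lined up, the combined result will look roughly like this. This is an illustrative, trimmed sample; your HomeReleaseDate values will differ since the releases were generated with GETDATE():

{"ProductNumber":96587,"ProductName":"Batman","ProductType":"Movie","Runtime":98,
 "Releases":[{"ProductNumber":96587,"Country":"United States","HomeReleaseDate":"2017-01-02T10:15:00"}, ...],
 "Assets":[{"Languages":"English","TitleName":"Batman"}, ...]}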

It's not pretty but it works. Would I use it in production? Probably not. Keep in mind this is only the first version of JSON support in SQL Server; hopefully we will see improvements in future releases. We can hope that Microsoft follows in the footsteps of Postgres with a true native data type. Or of course you could do all of this in an application just as easily.

Guide to Working With JSON in SQL 2016 Part 1 - Storing and Querying

Microsoft SQL Server 2016 introduced support for JSON. Microsoft added the FOR JSON clause to build a JSON structured result from an existing relational DB, and they also added a handful of new functions to query a JSON document stored in a SQL table. That is the focus of Part 1 in this series, how to store JSON documents in a SQL database and how to query those documents.

Here are the new functions we will be exploring in this post:

JSON_VALUE – Extract a value from a JSON string
OPENJSON – Table Valued Function that parses JSON text. Returns objects and properties as rows and columns.

A few important notes:

  • JSON is case sensitive. If the JSON object is "FilmDetails" and you try querying "filmdetails", it will fail. Make sure you check your case when troubleshooting your queries.
  • There isn't a JSON data type in SQL Server. SQL 2016 adds JSON support, not a new data type. JSON documents are stored as nvarchar(max), but don't let that discourage you from exploring these features.
  • You can create a CHECK constraint on the table where you store your JSON documents to verify that each document is valid JSON (ISJSON(<JSONData>) > 0), but this is not required. See the sketch after this list.
  • You can create your table as memory optimized, FYI.
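
As a quick illustration of that optional CHECK constraint, here is a minimal sketch you could add to the table we create below; the constraint name is made up for the example:

ALTER TABLE JSONTable
ADD CONSTRAINT CK_JSONTable_JSONDocument_IsJSON
CHECK (ISJSON(JSONDocument) > 0);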

If you need a refresher on JSON structures, Look Here.

Here is our table definition and the JSON document we will be storing in the table:

CREATE TABLE JSONTable (
   ID int IDENTITY (1,1) PRIMARY KEY CLUSTERED
  ,JSONDocument nvarchar(max) );

INSERT INTO JSONTable
SELECT '{  
   "FilmDetails":{  
      "ProductNumber":"9912088751",
      "Title":"Brave",
      "Type":"Movie",
      "Runtime":93,
      "ReleaseYear":2012,
      "Synopses":[  
         {  
            "Locale":"en",
            "Text":"Princess Merida uses her bravery and archery skills to battle a curse and restore peace..."
         },
         {  
            "Locale":"de",
            "Text":"Animiert"
         },
         {  
            "Locale":"fr",
            "Text":"Au coeur des contrées sauvages dÉcosse, Merida, la fille du roi Fergus et de la reine Elinor..."
         }
      ],
      "Assets":[  
         {  
            "Name":"Stickers",
            "AssetType":"Stickers",
            "FileType":"PDF",
            "Location":"http://media.brave.stickers.pdf",
            "Locale":"en-US"
         },
         {  
            "Name":"Trailer - WMV",
            "AssetType":"Trailer - WMV",
            "FileType":"Video",
            "Location":"http://youtu.be/Shg79Shgn",
            "Locale":"en-US"
         }
      ]
   }
}';

First let's look at the JSON_VALUE function, which we will use to return values from our JSON document.

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Type') AS ContentType
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Runtime') AS Runtime
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.ReleaseYear') AS ReleaseYear
FROM JSONTable

In the example above, we are simply grabbing the values from the name/value pairs found in the FilmDetails object. Simple enough, but we also have a few arrays defined and we want to extract values from those as well. You could reference a specific array element by its index, but if you have many elements you might not know the position. For example:

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Synopses[0].Text') AS SynopsesTextEn
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Synopses[1].Locale') AS Locale
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Assets[1].Location') AS TrailerLocation
FROM JSONTable

See the [0] and [1] that follow the Synopses and Assets objects? That is the index of the array element we are extracting values from. While this does work, let's take a look at another option, OPENJSON. OPENJSON is a table valued function that returns objects and properties as rows and columns. You can use OPENJSON with the default schema or you can create a custom schema for your output. Each produces very different results, so let's look at both of them. First, using the default schema:

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,[key] AS JSONObject
  ,[value] AS JSONValue
  ,[type] AS TypeOfValue
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument)

When you use OPENJSON with the default schema, the function returns a table with one row for each property of the object or for each element in the array. What happens if we target an object in the function?

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,[key] AS ArrayElementKey
  ,[value] AS ArrayValue
  ,[type] AS TypeOfValue
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Synopses')

If the jsonExpression we point OPENJSON at contains a JSON object or array (in this case $.FilmDetails.Synopses), OPENJSON returns a row for each first-level property or element. That's cool, but we still need to return individual values from the different arrays. Let's define a custom schema and see what happens:

SELECT *
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Synopses')
WITH (
   Locale varchar(3) '$.Locale'
  ,SynopsesText nvarchar(2000) '$.Text')
WHERE JSON_VALUE(JSONDocument, '$.FilmDetails.Title') LIKE '%Brave%'

We added a WHERE clause to simplify the result.

We now have results for Locale and SynopsesText, but we are also getting the entire JSONDocument and ID columns back, and we want to return a single filtered result. Notice in the query below that we can use the column names defined in the WITH clause of the OPENJSON table valued function in our SELECT list. We also add values from the FilmDetails object using JSON_VALUE. The output is starting to look more meaningful.

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Type') AS ContentType
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Runtime') AS Runtime
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.ReleaseYear') AS ReleaseYear
  ,Locale
  ,SynopsesText
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Synopses')
WITH (
   Locale varchar(3) '$.Locale'
  ,SynopsesText nvarchar(2000) '$.Text')
WHERE JSON_VALUE(JSONDocument, '$.FilmDetails.Title') LIKE '%Brave%'
   AND Locale = 'en'

We can assume that our JSON document will have multiple arrays, and we'll need to extract the details from those arrays, so how do we do that? Let's try to add another CROSS APPLY to the "Assets" array and add the values to our main SELECT to see what we get:

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Type') AS ContentType
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Runtime') AS Runtime
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.ReleaseYear') AS ReleaseYear
  ,Locale
  ,SynopsesText
  ,Name AS AssetName
  ,FileType AS AssetFileType
  ,[Location] AS AssetLocation
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Synopses')
WITH (
   Locale varchar(3) '$.Locale'
  ,SynopsesText nvarchar(2000) '$.Text')
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Assets')
WITH (
   Name varchar(25) '$.Name'
  ,FileType varchar(25) '$.FileType'
  ,[Location] nvarchar(500) '$.Location' )
WHERE JSON_VALUE(JSONDocument, '$.FilmDetails.Title') LIKE '%Brave%'
   AND Locale = 'en'
   AND FileType = 'video'

At this point you should be returning a single row of filtered data that pulls values from multiple arrays and objects in your JSON document. You can also add indexes to your JSON data to improve performance (see the sketch below). Microsoft has introduced an interesting new feature set with JSON support in 2016; it is worth checking out in its entirety.
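
Since the document is stored as nvarchar(max), the usual indexing pattern is to expose a JSON property as a computed column and index that. Here is a minimal sketch against our JSONTable; the column and index names are just illustrative:

-- Expose a JSON property as a computed column, then index it
ALTER TABLE JSONTable
ADD FilmTitle AS JSON_VALUE(JSONDocument, '$.FilmDetails.Title');

CREATE NONCLUSTERED INDEX IX_JSONTable_FilmTitle ON JSONTable (FilmTitle);

Queries that filter on JSON_VALUE(JSONDocument, '$.FilmDetails.Title') can then seek on that index instead of parsing every document.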

In part 2 of this series we are going to explore outputting JSON documents from normalized tables. We will look at outputting arrays and assembling complex JSON documents from SQL.

Guide to Using Temporal Tables in SQL 2016

Temporal tables give us the ability to add version history to our data. If you want to track data changes to a table, see what a value was at a specific point in time, find when a value was last changed, or union historic records with current records, temporal tables can do all of that. Versioning can be enabled on new or existing tables. The history table is an exact replica of the table it tracks; you can think of it as a paired versioning table with the same definition as the table being tracked. This sounds more complicated than it is, so let's get to the scripts.

First we're going to create a new table that also includes our temporal table. 

CREATE TABLE dbo.Orders (
   Id int IDENTITY (1,1) PRIMARY KEY CLUSTERED
  ,OrderNumber int NOT NULL
  ,ProductID int NOT NULL
  ,SalesPersonID int NOT NULL
  ,Amount decimal (7,2) NOT NULL
  ,SalesDate datetime2(0) DEFAULT GETDATE()
  ,ValidFrom datetime2(2) GENERATED ALWAYS AS ROW START -- Required field, can be named whatever you like.
  ,ValidTo datetime2(2) GENERATED ALWAYS AS ROW END -- Required field, can be named whatever you like.
  ,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo) )
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Orders_Archive));

Let's look at what's new in our table definition. The ValidFrom and ValidTo columns are referred to as period columns and track the period during which a row was valid; the data type must be datetime2, and these columns are required. The GENERATED ALWAYS AS ROW START | END arguments specify which column records the start and which records the end of that period. The PERIOD FOR SYSTEM_TIME argument specifies the names of the columns the system will use to record the valid period of a row of data, in this case ValidFrom and ValidTo. Finally we have the table option WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Orders_Archive)). SYSTEM_VERSIONING enables versioning on the table you are creating, and HISTORY_TABLE lets us target a specific history table, either an existing one or a new one. The script above will create a new history table named dbo.Orders_Archive.

Now that we have our temporal table setup, let's see how it works. First we need to insert some data, notice we do not specify values for the ValidFrom and ValidTo columns. 

INSERT INTO dbo.Orders (OrderNumber, ProductID, SalesPersonID, Amount)
SELECT 85462, 989, 11, 10.99
INSERT INTO dbo.Orders (OrderNumber, ProductID, SalesPersonID, Amount)
SELECT 85486, 325, 12, 14.95
INSERT INTO dbo.Orders (OrderNumber, ProductID, SalesPersonID, Amount)
SELECT 85501, 193, 11, 21.99 

Now let's see what's in the tables we just created. We won't have anything in the history table Orders_Archive, not until we actually modify data in the Orders table. Query the tables, then run the update, and then query both tables again to see the results. 

SELECT * FROM dbo.Orders;
SELECT * FROM dbo.Orders_Archive;

 --Now we need to update the Orders table.

UPDATE dbo.Orders
   SET Amount = 50.00
WHERE Id = 2; 

After the update we should now have a single row in the Orders_Archive history table. This row represents the data before it was modified and, via the ValidFrom and ValidTo values, the period of time during which it was valid.

Now what if we want to query the Orders table and this time include the history values? We can do this using the FOR SYSTEM_TIME clause in our FROM clause. Here we want to return all rows that were valid between yesterday and tomorrow from both the parent and history table. Notice we don't have to reference the Orders_Archive table directly; using FOR SYSTEM_TIME lets us query both tables by referencing only the parent Orders table.

DECLARE 
   @Start datetime2(2) = DATEADD(DD, -1, GETDATE())
  ,@End datetime2(2) = DATEADD(DD, 1, GETDATE())
SELECT * 
FROM dbo.Orders
FOR SYSTEM_TIME BETWEEN @Start AND @End

There are several temporal-specific sub-clauses we can use with FOR SYSTEM_TIME to perform different kinds of time-based analysis:

AS OF <date_time>
FROM <start_date_time> TO <end_date_time>
BETWEEN <start_date_time> AND <end_date_time>
CONTAINED IN (<start_date_time> , <end_date_time>)

This gives a lot of flexibility to how we query the data. We can query just the Orders data, just the Orders_Archive data, or both.
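
For example, here is a minimal sketch using the AS OF sub-clause to see the Orders rows exactly as they were at a specific point in time. Pick a time after your inserts; note the period columns are recorded in UTC, so the value is based on SYSUTCDATETIME:

-- Rows as they existed at a specific point in time (UTC)
DECLARE @PointInTime datetime2(2) = DATEADD(MINUTE, -5, SYSUTCDATETIME());

SELECT *
FROM dbo.Orders
FOR SYSTEM_TIME AS OF @PointInTime;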

Some notes about table management:

  • You can create indexes on either the Orders or Orders_Archive table; creating an index on one does not create it on the other. The same goes for statistics.
  • Modifying a data type on the parent table Orders, will automatically modify the data type in the history table, Orders_Archive.
  • Adding a new column to the Orders table will automatically add that column to the history table.
  • Before you can drop a table that has versioning enabled, you must first disable versioning. Once versioning is disabled, the parent table and the history table are independent of each other, so you can drop the parent table without dropping the history table.
-- Adding an index

CREATE NONCLUSTERED INDEX NCIX_SalesPersonID ON dbo.Orders (
SalesPersonID )
WITH (FILLFACTOR = 90);

-- Must be added to both

CREATE NONCLUSTERED INDEX NCIX_SalesPersonID ON dbo.Orders_Archive (
SalesPersonID )
WITH (FILLFACTOR = 90);

-- Altering the data type of an existing column or adding a new column automatically adds it to the history table

ALTER TABLE dbo.Orders ALTER COLUMN Amount decimal(12,2);
ALTER TABLE dbo.Orders ADD Region varchar(3);

-- Disable versioning and drop the table

ALTER TABLE dbo.Orders SET (SYSTEM_VERSIONING = OFF);
DROP TABLE dbo.Orders;
DROP TABLE dbo.Orders_Archive;

Temporal tables are a nice addition to SQL 2016, and believe it or not I think they are available in Standard Edition. Have fun!

Migrating SSRS to 2016 - Subscriptions and Eliminating Errors in the Error Log

Migrating from SQL Reporting Services 2012 to 2016 is a fairly straightforward process that doesn't require a lot of attention here. What I do want to talk about is SSRS subscriptions. When you migrate an SSRS instance from one server to another, unless you also restore msdb, the SQL Agent jobs behind your subscriptions won't be coming with you. The subscriptions themselves still exist in your ReportServer database, and because the subscriptions are in ReportServer but their jobs are missing from msdb, SSRS starts writing errors to its error logs to the tune of around 32MB a minute. SSRS looks for the Agent job behind each subscription, doesn't find it, logs an error, and keeps checking over and over (Microsoft will likely change this behavior, we can hope). So how do we keep our logs in check and fix our subscriptions? Just run the following script in the ReportServer database:

DECLARE 
   @ownerLogin VARCHAR(100)
  ,@scheduleId VARCHAR(36)

SET @ownerLogin=N'MyDomain\ssrsAdmin' --SSRS Service Account which is also an admin in Report Manager

DECLARE JobCreateCursor CURSOR STATIC FORWARD_ONLY
FOR
SELECT CAST(ScheduleID AS VARCHAR(36)) AS ScheduleId 
FROM schedule

OPEN JobCreateCursor
FETCH NEXT FROM JobCreateCursor INTO @scheduleId

WHILE @@FETCH_STATUS = 0
BEGIN

EXEC msdb.dbo.sp_add_job @job_name = @scheduleId,@owner_login_name = @ownerLogin

FETCH NEXT FROM JobCreateCursor INTO @scheduleId

END

CLOSE JobCreateCursor
DEALLOCATE JobCreateCursor

Notice we aren't creating any job steps here; that's because SSRS will do that for us. Once you create the Agent jobs, SSRS validates them and adds the steps as part of the subscriptions. If you try to open the properties of one of the newly created jobs you will get an error until SSRS runs its linking process; after that you will be able to view the jobs, and the subscriptions will run as they did on the old server. And you won't be filling up the drive with SSRS error logs.
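
If you want to confirm which subscription schedules are missing their Agent jobs before (or after) running the script, a quick check along these lines works. It assumes the default ReportServer database name and relies on the jobs being named after the schedule GUID (which is how the script above creates them), so adjust to your environment:

-- Schedules in ReportServer that have no matching Agent job in msdb
SELECT CAST(s.ScheduleID AS varchar(36)) AS ScheduleId
FROM ReportServer.dbo.Schedule s
WHERE NOT EXISTS (
   SELECT 1
   FROM msdb.dbo.sysjobs j
   WHERE j.name = CAST(s.ScheduleID AS varchar(36)) );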

Troubleshooting Custom dll’s (ionic) in Script Tasks in SSIS

Script tasks are a great way to enrich your ETL, especially when you have complex requirements that can't easily be accomplished in a simple task or expression. The other day while working on a project I ran across an issue involving a script task and a custom dll, specifically ionic.zip.dll. The script task was failing so I started looking into the dll. As I was troubleshooting the issue I put together this guide so we could replicate our steps in production. Here is a simple checklist you can follow when working with script tasks and custom dll's in SSIS.

  • Copy the dll's to your target assemblies directory (depending on your environment, your directory may be different): C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.0
  • Once you copy the dll's, right click and go to properties of each dll and make sure the file is not marked as "blocked". Windows will mark a copied file it thinks is dangerous to run on the machine, like a dll or executable, as blocked. Click Unblock and apply/ok.
  • Open the Script task in SSIS. Click on the references folder in Solution Explorer. Highlight the dll you are calling. Under reference properties, set the Copy Local option to True.
  • When working with a custom dll you may need to register it to the GAC (Global Assembly Cache), unless you have a proper installer for it. Open a CMD prompt and navigate to the folder where gacutil.exe is. Mine could be found here: C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools, though yours might be different. Copy your dll to the folder and register it; an example is shown after this list. Gacutil usage can be found here: https://msdn.microsoft.com/en-us/library/ex0ss12c(v=vs.110).aspx
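
For example, registering the ionic dll from that tools directory looks something like this (paths vary by SDK version; run the prompt as administrator):

cd "C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools"
gacutil /i ionic.zip.dll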

Capture DTU Performance Stats in SQL Azure Database

One of the challenges of working with SQL Azure Database is ensuring your database is in the correct service tier so that you are not overpaying or underperforming. SQL Azure Database introduced a new DMV and a system view that track performance related to Database Throughput Units (DTU). They won't answer a whole lot of questions by themselves, but they are helpful for viewing over-time stats for baselines. Used with other monitoring tools, query execution stats, etc., you should be able to dial in performance and find a good match for your service tier.

The sys.dm_db_resource_stats DMV is only available in the user database and shows roughly the last hour of performance data; to get anything beyond that you need to query the sys.resource_stats catalog view in the master database.

-- Basic view, last hour. Make sure you are in the target user database.

SELECT * FROM sys.dm_db_resource_stats

-- Last hour of performance using the sys.dm_db_resource_stats DMV.

SELECT DISTINCT
   MIN(end_time) AS StartTime
  ,MAX(end_time) AS EndTime
  ,CAST(AVG(avg_cpu_percent) AS decimal(4,2)) AS Avg_CPU
  ,MAX(avg_cpu_percent) AS Max_CPU
  ,CAST(AVG(avg_data_io_percent) AS decimal(4,2)) AS Avg_IO
  ,MAX(avg_data_io_percent) AS Max_IO
  ,CAST(AVG(avg_log_write_percent) AS decimal(4,2)) AS Avg_LogWrite
  ,MAX(avg_log_write_percent) AS Max_LogWrite
  ,CAST(AVG(avg_memory_usage_percent) AS decimal(4,2)) AS Avg_Memory
  ,MAX(avg_memory_usage_percent) AS Max_Memory    
FROM sys.dm_db_resource_stats

-----------------------------------------------------------------------------------
-- Performance over period of time. Uses the sys.resource_stats table in master db

DECLARE @StartDate date = DATEADD(day, -7, GETDATE()) -- 7 Days

SELECT DISTINCT
   MAX(database_name) AS DatabaseName
  ,MAX(sku) AS PlatformTier
  ,MAX(storage_in_megabytes) AS StorageMB
  ,MIN(end_time) AS StartTime
  ,MAX(end_time) AS EndTime
  ,CAST(AVG(avg_cpu_percent) AS decimal(4,2)) AS Avg_CPU
  ,MAX(avg_cpu_percent) AS Max_CPU
  ,CAST(AVG(avg_data_io_percent) AS decimal(4,2)) AS Avg_IO
  ,MAX(avg_data_io_percent) AS Max_IO
  ,CAST(AVG(avg_log_write_percent) AS decimal(4,2)) AS Avg_LogWrite
  ,MAX(avg_log_write_percent) AS Max_LogWrite
FROM sys.resource_stats
WHERE database_name = 'YourDatabaseName'
   AND start_time > @StartDate;

----------------------------------------------------------------------------
-- Find how well the database workload fits in to its current service tier.

SELECT
   (COUNT(database_name) - SUM(CASE WHEN avg_cpu_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [CPU Fit %]
  ,(COUNT(database_name) - SUM(CASE WHEN avg_log_write_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Log Write Fit %]
  ,(COUNT(database_name) - SUM(CASE WHEN avg_data_io_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Data IO Fit %]
FROM sys.resource_stats
WHERE database_name = 'YourDatabaseName' 
   AND start_time > DATEADD(day, -7, GETDATE()); 

If the query above returns a value of less than 99.9 for any of the three resources, then you probably have some kind of bottleneck that needs addressing. Look for poorly performing queries in the sys.dm_exec_query_stats DMV (a sketch follows). If you can't solve your problems with tuning, you might have to make the jump to a higher tier.
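
A starting point for that is something like the sketch below, which ranks cached statements by total CPU; swap the ORDER BY to total_logical_reads or total_elapsed_time depending on which resource the DTU stats show as the bottleneck:

-- Top CPU consumers from the plan cache
SELECT TOP 10
   qs.execution_count
  ,qs.total_worker_time AS TotalCPU
  ,qs.total_logical_reads
  ,qs.total_elapsed_time
  ,st.text AS QueryText
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;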

This should help you get started dialing in your service tier for a SQL Azure Database.

Dynamically Build a MERGE Statement - Auto Generated MERGE

Once I worked on a project that involved moving data from DB2 to MSSQL. We set up jobs in SSIS to copy the data over to staging tables in SQL Server and then ran MERGE procedures to merge the data from our staging tables to production. We had more than 100 jobs that followed this pattern, so imagine all the work writing those merge statements. After writing a handful of them I decided to come up with a solution that would write them for me, so here you go: the auto-generated MERGE.

In the script below, we have a table definition that will hold details for your merge. I usually create this once and then truncate it each time I want to write a new merge statement. Following that we have our variables which define your target and source data. All you need to do is make sure the DynamicMerge table is created and plug in the database name, schema name, and table name of your target table and source table in your variables. Follow the naming convention in the example below. The merge statement assumes you are merging two tables that are identical, but of course you could modify it to meet your needs.

This script can handle null values in the WHEN MATCHED comparison because we use EXCEPT to find differences between source and target.

I like to output the results to text so I can review the statement before running it or placing it in a stored procedure. When you run the query, output to text rather than grid (the default). To do this you need to change a couple of settings in SSMS so that more characters can be output to text. From the menu: Query > Query Options > Results > Text (uncheck 'Include Column Headers'), and Tools > Options > Query Results > SQL Server > Results To Text (increase the maximum number of characters displayed in each column to something like 8000). If you don't make these two changes the output text will be truncated and/or you will have a bunch of dashes at the top of the script.

--Create the DynamicMerge table

CREATE TABLE Staging.dbo.DynamicMerge (
   TableCatalog varchar(500)
  ,TableSchema varchar(100)
  ,TableName varchar(500)
  ,ColumnName varchar(500)
  ,MergeOn varchar(1000)
  ,UpdateOn varchar(1000)
  ,InsertOn varchar(1000)
  ,ExceptSource varchar(500)
  ,ExceptTarget varchar(500)
  ,IsPK bit DEFAULT 0)

USE <DatabaseName>; -- This is the DB of your Target in the merge, make sure you set this.
GO

SET NOCOUNT ON;
GO

-- Truncate the DynamicMerge table every time you use it
TRUNCATE TABLE Staging.dbo.DynamicMerge;

---------------------------------------------------------

-- Insert details to the DynamicMerge table

DECLARE
   @TableCatalog varchar(500) = 'TargetDBName' -- This is the database name
  ,@TableSchema varchar(500) = 'TargetSchemaName' -- Schema for your target table
  ,@TableName varchar(500) = 'TargetTableName' -- This is the table name
  ,@TargetName varchar(1000) = 'DBName.SchemaName.TargetTableName' -- Three part name
  ,@SourceName varchar(1000) = 'DBName.SchemaName.SourceTableName' -- Three part name

INSERT INTO Staging.dbo.DynamicMerge (
   TableCatalog
  ,TableSchema
  ,TableName
  ,ColumnName
  ,MergeOn
  ,UpdateOn
  ,InsertOn
  ,ExceptSource
  ,ExceptTarget )

SELECT
   TABLE_CATALOG
  ,TABLE_SCHEMA
  ,TABLE_NAME
  ,COLUMN_NAME
  ,'Target1.' + COLUMN_NAME + ' = Source1.' + COLUMN_NAME -- MergeOn
  ,COLUMN_NAME + ' = Source1.' + COLUMN_NAME -- UpdateOn
  ,'Source1.' + COLUMN_NAME -- InsertOn
  ,'Source1.' + COLUMN_NAME -- ExceptSource
  ,'Target1.' + COLUMN_NAME -- ExceptTarget
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = @TableCatalog
   AND TABLE_SCHEMA = @TableSchema
   AND TABLE_NAME = @TableName; --target table name

--------------------------------------------------------

-- Update the Primary Key flag

UPDATE dt
   SET IsPK = 1
FROM Staging.dbo.DynamicMerge dt
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE ku ON dt.ColumnName = ku.COLUMN_NAME
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc ON tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
   AND tc.CONSTRAINT_NAME = ku.CONSTRAINT_NAME
   AND ku.TABLE_NAME = dt.TableName;

------------------------------------------------------------------

-- Check the Results (optional)

--SELECT * FROM Staging.dbo.DynamicMerge;

-----------------------------------------------------------------

-- Create the Column list variable

DECLARE @ColumnList varchar(max)
   SET @ColumnList = (
      SELECT STUFF((SELECT ',' + ColumnName + CHAR(10)
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName
      FOR XML PATH ('')), 1, 1,'') AS MergeOnColumns
      FROM Staging.dbo.DynamicMerge x2
GROUP BY TableName);

--------------------------------------------------------

-- Create the MergeOnSource variable - matching the unique key

DECLARE @MergeOn varchar(max)
   SET @MergeOn = (
      SELECT STUFF((SELECT 'AND ' + MergeOn + CHAR(10)
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName AND x1.IsPK = 1
      FOR XML PATH ('')), 1, 4,'') -- AS MergeOnColumns
      FROM Staging.dbo.DynamicMerge x2
      WHERE IsPK = 1
GROUP BY TableName);

-----------------------------------------

-- Create the Merge EXCEPT Target variable

DECLARE @MExceptTarget varchar(max)
   SET @MExceptTarget = (
      SELECT STUFF((SELECT ', ' + ExceptTarget
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName AND x1.IsPK = 0
      FOR XML PATH ('')), 1, 1,'') -- AS UpdateOnColumns
      FROM Staging.dbo.DynamicMerge x2
      WHERE IsPK = 0
GROUP BY TableName );

-----------------------------------------

-- Create the Merge EXCEPT Source variable

DECLARE @MExceptSource varchar(max)
   SET @MExceptSource = (
      SELECT STUFF((SELECT ', ' + ExceptSource
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName AND x1.IsPK = 0
      FOR XML PATH ('')), 1, 1,'') -- AS UpdateOnColumns
      FROM Staging.dbo.DynamicMerge x2
      WHERE IsPK = 0
GROUP BY TableName );

-----------------------------------

-- Create the Merge UPDATE variable

DECLARE @MUpdate varchar(max)
   SET @MUpdate = (
      SELECT STUFF((SELECT ',' + UpdateOn + CHAR(10)
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName AND x1.IsPK = 0
      FOR XML PATH ('')), 1, 1,'') -- AS UpdateOnColumns
      FROM Staging.dbo.DynamicMerge x2
      WHERE IsPK = 0
GROUP BY TableName);

-----------------------------------

-- Create the Merge INSERT variable

DECLARE @MInsert varchar(max)
   SET @MInsert = (
      SELECT STUFF((SELECT ',' + InsertOn + CHAR(10)
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName
      FOR XML PATH ('')), 1, 1,'') -- AS UpdateOnColumns
      FROM Staging.dbo.DynamicMerge x2
GROUP BY TableName);

--------------------------------------------------------------------------

/* Now build the MERGE statement. In the query results window, make sure to output to text and not grid. You will need to increase the character output from the defaults. This is explained at the beginning of this document. */

-- Note* CHAR(10) is a line feed control character. CHAR(13) is a carriage return.

SELECT 'MERGE ' + @TargetName + ' AS Target1' + CHAR(10)
   + 'USING (SELECT ' + @ColumnList + 'FROM ' + @SourceName + ') AS Source1' + CHAR(10)
   + CHAR(10)
   + 'ON (' + CHAR(10)
   + @MergeOn + ')' + CHAR(10)
   + CHAR(10)
   + 'WHEN MATCHED AND EXISTS' + CHAR(10)
   + '(' + CHAR(10)
   + ' SELECT ' + @MExceptSource + CHAR(10)
   + ' EXCEPT' + CHAR(10)
   + ' SELECT ' + @MExceptTarget + CHAR(10)
   + ')' + CHAR(10)
   + CHAR(10)
   + 'THEN UPDATE' + CHAR(10)
   + 'SET ' + @MUpdate
   + CHAR(10)
   + 'WHEN NOT MATCHED BY TARGET THEN INSERT (' + @ColumnList + ')' + CHAR(10)
   + CHAR(10)
   + 'VALUES (' + @MInsert + ')' + CHAR(10)
   + 'WHEN NOT MATCHED BY SOURCE THEN DELETE;' -- Optional DELETE if that is desired

This script has been tested from SQL 2008 through 2016. It was last used against a SQL 2016 server with 2017 SSMS. It works!

Calculate Year Over Year Variance - Subtract From Previous Row Using LAG

Say you want to summarize a value by year and find the variance between each year (row). Imagine you have a table that tracks company sales: you need to summarize sales totals by year, compare the values year over year, and find the variance. Here is a simple way to do this using the LAG function.

* Note * If your data does not have contiguous dates you will need to write additional queries to fill in the missing dates. This is known as the missing dates problem, and solutions to it can be found HERE

IF OBJECT_ID('tempdb..#SalesData') IS NOT NULL
DROP TABLE #SalesData
DECLARE @StartDate date = '2013-01-01';

WITH rCTE AS (
   SELECT
      CAST(2679 AS decimal(7,2)) AS SalesAmount
     ,@StartDate AS SalesDate
     ,CAST('nadia' as varchar) AS SalesPerson
     ,0 AS TransID

   UNION ALL

   SELECT
      CAST(SalesAmount + (SELECT CAST(CRYPT_GEN_RANDOM(1) AS INT) + 3 - CAST(CRYPT_GEN_RANDOM(1) AS INT)) AS decimal(7,2))
    ,DATEADD(dd, + 3, SalesDate)
    ,REVERSE(SalesPerson)
    ,TransID + 1
   FROM rCTE
   WHERE TransID < 500 )

SELECT
   SalesAmount
  ,SalesDate
  ,SalesPerson
INTO #SalesData
FROM rCTE
WHERE SalesAmount > 1
OPTION (MAXRECURSION 600)

--Let's take a look at the data

SELECT * FROM #SalesData

Now that we have our sales data table, we are going to summarize the data by year and load it into a temp table. Notice the ORDER BY; this is important since we will be using the LAG function next.

SELECT DISTINCT
   YEAR(SalesDate) SalesYear
  ,SUM(SalesAmount) OVER (PARTITION BY YEAR(SalesDate)) AS TotalSalesByYear
INTO #SalesTotals
FROM #SalesData
ORDER BY SalesYear ASC;

After we load our temp table with the summarized annual sales, we use the LAG function to get the previous row's TotalSalesByYear; this effectively creates a new column holding the previous year's sales. In the LAG function below, we pass the TotalSalesByYear value followed by a comma and the number 1. That number is the offset, or how many rows back to look from the current row. You can also specify a default value after the offset, which is the value to return if the offset row is NULL. By excluding the default, NULL is returned, which is what we want so we don't get a divide by zero error when calculating the variance. Here is a link for LAG usage.

WITH CTEazyE AS (
   SELECT
      SalesYear
     ,TotalSalesByYear
     ,LAG(TotalSalesByYear, 1) OVER(ORDER BY SalesYear ASC) AS PreviousYearSales
   FROM #SalesTotals )    

-- Now we calculate the variance year over year.

SELECT
     SalesYear
    ,TotalSalesByYear
    ,PreviousYearSales
    ,(TotalSalesByYear - PreviousYearSales) AS VarianceAmount
    ,100 * (TotalSalesByYear - PreviousYearSales) / PreviousYearSales AS VariancePercent
FROM CTEazyE

There are other ways to calculate this, but the intent of this post was to introduce you to the LAG (and LEAD) functions. Mind you, LAG is a SQL 2012+ feature.
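
For completeness, LEAD is the mirror image of LAG and looks ahead instead of behind; here is a quick sketch against the same #SalesTotals temp table:

SELECT
   SalesYear
  ,TotalSalesByYear
  ,LEAD(TotalSalesByYear, 1) OVER (ORDER BY SalesYear ASC) AS NextYearSales
FROM #SalesTotals;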