A Cloud Guru Hands-on Lab
Bulk Load Data into Cosmos DB for NoSQL
Bulk load refers to scenarios where you need to move a large volume of data, and you need to do it with as much throughput as possible. Workloads can be based on batch processes, such as nightly data loads, or on streaming processes where you receive hundreds of thousands of documents that need updating. In this hands-on lab, you will use the Cosmos DB SDK along with vanilla C# code to enable bulk execution on a CosmosClient class. Then you will generate synthetic data to test a bulk load of 1,000 JSON documents into Cosmos DB for NoSQL. Students with solid experience coding in .NET C# — and/or experience with the Cosmos DB for NoSQL SDK for any language — will be the most prepared to complete this lab without assistance. However, tips are provided for developers with less experience; visit the solution videos and the lab guide for full solutions.
Challenge: Housekeeping
- Open an incognito or in-private window and log in to the Azure portal using the user name and password provided in the lab environment.
- From within the portal, initiate the Cloud Shell, select Bash (versus PowerShell), and set it up with new backing storage.
- From the Bash command prompt, execute the `git clone` command using the URL provided in the Additional Information and Resources section of the lab, followed by `DP420Labs` to alias the downloaded folder to a friendly name.
- Once the project is downloaded, use the Cloud Editor to open the `Program.cs` file.
- From the Bash command prompt, change to the working directory: `cd DP420Labs/DP420/BulkLoad`.
NOTE: You are free to write the code for this lab in Visual Studio Code or another IDE, if you have experience in that environment. Just make sure you download the GitHub project to ensure you have the right library references and using directives. Be aware that the lab guide and the solution video are based on working in the Cloud Shell editor, but working elsewhere won't substantially change the code you write.
Challenge: Instantiating the CosmosClient Object
- Navigate to the Cosmos DB account that is already set up for you and copy the primary connection string to connect to Cosmos DB in your code.
- Navigate to Data Explorer and note the name of the database and container already deployed to your account. The partition key for the container is `itemId`. You will need this information later.
- Run a quick SQL query to confirm that the container is empty.
- In the main method of the `Program.cs` file, author the code required to connect to your Cosmos DB account. Operate on the database and container already set up in that account. When you instantiate the `CosmosClient`, you will also need to enable bulk execution.
Tips:
- You will need to instantiate a `CosmosClient`, a `Database`, and a `Container` using the connection string, database name, and container name you retrieved from the portal.
- There may be ambiguity when instantiating the `Database` object because the `Bogus` library also has a `Database` class, so you can use the fully qualified path: `Microsoft.Azure.Cosmos.Database`.
- You will need to use a `CosmosClientOptions` class in order to set `AllowBulkExecution` to `true`, or you can optionally use a `CosmosClientBuilder` fluent class.
- If you still need help after considering these tips, you can copy/paste the code from the lab guide and/or watch the solution video.

NOTE: If you do copy/paste the code from the lab guide, be sure to save the connection string you copied from the portal first, so that you do not have to go retrieve it again.
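Pulling the tips together, a minimal sketch of the connection code might look like the following. The connection string, database name, and container name shown here are placeholders — substitute the values you copied from the portal:

```csharp
using Microsoft.Azure.Cosmos;

// Placeholder — paste the primary connection string from the portal
string connectionString = "<primary-connection-string>";

// Enable bulk execution via CosmosClientOptions
CosmosClient client = new CosmosClient(
    connectionString,
    new CosmosClientOptions { AllowBulkExecution = true });

// The fully qualified type name avoids the clash with the Bogus library
Microsoft.Azure.Cosmos.Database database =
    client.GetDatabase("<database-name>");
Container container = database.GetContainer("<container-name>");
```

Alternatively, the `CosmosClientBuilder` fluent class offers a `WithBulkExecution(true)` method that accomplishes the same thing.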
Challenge: Loading Synthetic Data
You are not expected to write the data generation code from scratch. You can simply copy/paste the following code. However, do take a few minutes to study it, taking particular note of the call that generates 1,000 records, which is about right for our bulk load test; if you set it much higher, you are likely to receive a 429 (throttling) error.
- Inside the main method, following the Cosmos DB connection code, paste this code:

```csharp
var fruit = new[] { "apple", "peach", "lemon", "strawberry", "pear" };

// Get items from a source; we're using a fake data generator, here
List<GenericItem> itemsToInsert = new Faker<GenericItem>()
    .RuleFor(i => i.id, f => Guid.NewGuid())
    .RuleFor(i => i.itemId, f => f.Random.Number(1, 10)) // itemId is the partition key
    .RuleFor(i => i.itemName, f => f.PickRandom(fruit))
    .Generate(1000);
```
- Outside of the main method, paste this code that creates an item class for the data generator:

```csharp
public class GenericItem
{
    public Guid id { get; set; }
    public string? itemName { get; set; }
    public int itemId { get; set; }
}
```
Challenge: Executing the Code
The benefit of using the SDK to batch up data for bulk load is that you do not have to write the batching and caching logic. The SDK takes care of that under the covers. You just need to write vanilla C# code to add the items to the container.
- In the previous objective, the code populates a `List<GenericItem>` object, called `itemsToInsert`, with synthetic JSON documents. In this objective, you need to write code that iterates over that list and asynchronously inserts the items into the Cosmos DB container.

NOTE: Better yet, you can create another `List`, but this time a `List<Task>` object. Iterate over `itemsToInsert` and load up the `List<Task>` object with the tasks that perform the container insert. Then return a `Task` with what is expected by the `Main` method.

- After you have written the code, save the changes to the `Program.cs` file. Then, build the code. Assuming it builds without error, run the code.
- Assuming the code runs successfully, go back to the Data Explorer to run the SQL query again. You should now see the documents in the container.
Tips:
- Create a new `List<Task>` object and use a `foreach` construct to loop over the `itemsToInsert` list in order to build a list of tasks that insert items into the container.
- Use the `CreateItemAsync<GenericItem>` member on the `Container` object, which you instantiated in the first code block, to add items to the container.
- When inserting an item, you need a reference to the item and, optionally, the container partition key, which isn't required but is more efficient for the database engine. If you decide to include it, the partition key for the container is `itemId`.
- Building a list of tasks does not actually execute the inserts to the container. To do the work defined in the tasks, use this syntax to return a `Task`, which is the data type expected by the main method: `await Task.WhenAll([whatever you named your batch of tasks]);`

Remember: You don't have to worry about collecting up the documents into batches before inserting. The SDK code takes care of that for you.

- If you still need help after considering these tips, you can copy/paste the code from the lab guide or watch the solution video.
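Putting these tips together, one possible shape for the insert loop — a sketch that assumes the `container` and `itemsToInsert` objects from the earlier steps — is:

```csharp
// Build up the batch of insert tasks
List<Task> tasks = new List<Task>();
foreach (GenericItem item in itemsToInsert)
{
    // Passing the partition key is optional but more efficient
    tasks.Add(container.CreateItemAsync<GenericItem>(
        item, new PartitionKey(item.itemId)));
}

// Await completion of all the inserts; because AllowBulkExecution is
// enabled on the client, the SDK groups the concurrent operations into
// batched requests for you
await Task.WhenAll(tasks);
```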