Store streaming data to a data lake in Azure
In my previous blog post I showed how you can stream Twitter data to an Event Hub and on to a live Power BI dashboard. In this post, I am going to show you how to store this data for the long term. An Event Hub only retains your events temporarily, so it cannot serve as a store for later analysis. Say you want to analyze whether negative or positive tweets have an impact on your sales: you would need to store the tweets to build a historical view.
The question is where to store this data: directly in the data warehouse, or in a data lake? This really depends on the architecture that you want to have. A data lake is often used to keep the raw data historically. It is especially interesting because it allows you to store any kind of data, structured or unstructured, and it is quite cheap compared to Azure SQL Database or Azure SQL Data Warehouse. For that reason, we are going to store it in a data lake.
To persist data for long term storage, we need to create a data lake. What exactly is a data lake in Azure? It used to be the Data Lake Gen1 resource. Now it is a bit more confusing, as the current data lake (or Data Lake Gen2 if you will) is a StorageV2 storage account with hierarchical namespace enabled. Let's create one right now, so you understand what it is.
Create a new resource and choose "Storage account - blob, file, table, queue". In the next screen, choose a subscription and a resource group. The name of the storage account needs to be unique across Azure. To make this storage account a data lake, be sure to select StorageV2 as the account type. The second thing you need to do is go to Next -> Advanced, and make sure hierarchical namespace is enabled.
Click create and wait for the deployment to complete.
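If you would rather see the two portal choices written down, here is a minimal sketch of what they boil down to. It builds an ARM-style resource definition as a plain Python dict and does not call Azure at all. The function name data_lake_account, the region, the SKU and the API version are my own placeholders and not something from the portal steps above; the name check only covers the format (3 to 24 lowercase letters and digits), since uniqueness can only be verified by Azure itself.

```python
import re

# Storage account names: 3-24 characters, lowercase letters and digits only.
_NAME_RULE = re.compile(r"[a-z0-9]{3,24}")


def data_lake_account(name: str, location: str = "westeurope") -> dict:
    """Return an ARM-style definition for a Data Lake Gen2 storage account.

    Hypothetical helper for illustration; it only builds a dict.
    """
    if not _NAME_RULE.fullmatch(name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return {
        "type": "Microsoft.Storage/storageAccounts",
        "apiVersion": "2019-06-01",        # assumption: one known API version
        "name": name,
        "location": location,              # assumption: placeholder region
        "kind": "StorageV2",               # the account type picked in the portal
        "sku": {"name": "Standard_LRS"},   # assumption: locally redundant storage
        # Hierarchical namespace is what turns the account into a data lake.
        "properties": {"isHnsEnabled": True},
    }


account = data_lake_account("twitterlake01")
print(account["kind"], account["properties"])
```

The two lines that matter are kind and isHnsEnabled; everything else is what any storage account would need.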
Store data to the data lake
So how do we persist data from our Event Hubs? Go to the Event Hub you created last time. On the page there is a button: capture events. Click on it.
If you do not have this option, your Event Hubs namespace is probably on the Basic tier. Upgrade the namespace to Standard to enable this feature.
On the next page choose Azure Storage account as the capture destination and not Azure Data Lake Store Gen1, as that is the old data lake. Choose the storage account you created earlier and create a new container named twitter inside it. Click save changes, and the next time you send data to this Event Hub, it will capture your data!
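For reference, the capture settings you just saved can be sketched as data. The dict below follows the shape of the captureDescription section of an Event Hubs ARM template as I understand it; treat it as an illustration under that assumption rather than the exact payload the portal sends. The function name, the resource ID placeholder and the defaults (a 300 second time window and a 300 MB size window) are assumptions on my side.

```python
# Default blob naming used by capture; the portal lets you change it.
DEFAULT_NAME_FORMAT = (
    "{Namespace}/{EventHub}/{PartitionId}/"
    "{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"
)


def capture_description(storage_account_id: str,
                        container: str = "twitter",
                        interval_seconds: int = 300,
                        size_limit_bytes: int = 314572800) -> dict:
    """Build capture settings for an Event Hub (hypothetical helper)."""
    # Capture writes a file when either window is reached, whichever is first.
    if not 60 <= interval_seconds <= 900:
        raise ValueError("time window must be between 60 and 900 seconds")
    if not 10485760 <= size_limit_bytes <= 524288000:
        raise ValueError("size window must be between 10 MB and 500 MB")
    return {
        "enabled": True,
        "encoding": "Avro",
        "intervalInSeconds": interval_seconds,
        "sizeLimitInBytes": size_limit_bytes,
        "destination": {
            # Storage account destination, as opposed to Data Lake Store Gen1.
            "name": "EventHubArchive.AzureBlockBlob",
            "properties": {
                "storageAccountResourceId": storage_account_id,
                "blobContainer": container,
                "archiveNameFormat": DEFAULT_NAME_FORMAT,
            },
        },
    }


# Placeholder resource ID; fill in your own subscription, group and account.
settings = capture_description(
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<account-name>"
)
print(settings["destination"]["properties"]["blobContainer"])
```

The time and size windows explain why files do not show up instantly: a file is only written once one of the two limits is hit.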
So how does it store your data and how can we verify it? You will need Azure Storage Explorer for this. Go download it from https://azure.microsoft.com/nl-nl/features/storage-explorer/
Open it, log in to your Azure account, and you will find this:
The storage account shows it is configured as a Data Lake Gen2 account. Because you configured your Event Hub to capture to this storage account, the twitter container now has data in it. Opening the container, you will find a folder structure:
You can see it made a folder for each year, month, day and hour. Inside is an Avro file, which stores the data in a compact binary format. Every time your Event Hub receives data, it will store it automatically in your data lake, where it is ready for further processing or analysis.
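The folder structure can also be read by code, which is handy when you later want to pick up only the files for a certain day. Below is a small sketch that splits a capture blob name into its parts and checks whether a file starts with the Avro container header. It assumes the default naming format ({Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}) and that the timestamp is in UTC; the function names and the example blob name are made up for illustration.

```python
from datetime import datetime, timezone

# Avro object container files start with these four bytes: "Obj" followed by 1.
AVRO_MAGIC = b"Obj\x01"


def parse_capture_blob_name(blob_name: str) -> dict:
    """Split a capture blob name (default naming format) into its parts."""
    if blob_name.endswith(".avro"):
        blob_name = blob_name[: -len(".avro")]
    parts = blob_name.split("/")
    if len(parts) != 9:
        raise ValueError(f"unexpected capture blob name: {blob_name!r}")
    namespace, event_hub, partition_id = parts[:3]
    year, month, day, hour, minute, second = (int(p) for p in parts[3:])
    return {
        "namespace": namespace,
        "event_hub": event_hub,
        "partition_id": partition_id,
        # Assumption: capture timestamps are in UTC.
        "timestamp": datetime(year, month, day, hour, minute, second,
                              tzinfo=timezone.utc),
    }


def looks_like_avro(first_bytes: bytes) -> bool:
    """Return True if the given bytes begin with the Avro container header."""
    return first_bytes[:4] == AVRO_MAGIC


# Hypothetical blob name, as it would appear inside the twitter container.
info = parse_capture_blob_name("mynamespace/twitterhub/0/2020/05/17/13/45/10.avro")
print(info["event_hub"], info["timestamp"].isoformat())
```

To actually read the tweets out of the file you would need an Avro library; the header check is only a quick way to confirm that the file you downloaded from Storage Explorer is what you expect.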
In the next blog post, I will show how to use this data in your data warehouse and combine it with other data.