Project Overview
This project builds a real-time sales data analytics pipeline on Azure, leveraging an Azure free account with $200 in credits for 30 days. It processes retail sales data, ingesting transactions, storing raw data, transforming it, analyzing trends, predicting sales, and visualizing results.
Tools Used:
- Azure Data Factory
- Azure Synapse Analytics
- Azure Databricks
- Azure Data Lake Storage
- Azure Stream Analytics
- Azure Event Hubs
- Azure Analysis Services
- Azure Machine Learning
- Power BI
Conceptual Overview
Objective: Create an end-to-end pipeline for processing sales data (product ID, quantity, price, timestamp).
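For concreteness, a single transaction with this schema could be serialized as a small JSON payload. A minimal Python sketch (field names such as `ProductID` are illustrative choices, not mandated by any Azure service):

```python
import json
from datetime import datetime, timezone

def make_sale_event(product_id: str, quantity: int, price: float) -> str:
    """Serialize one sales transaction as the JSON payload carried through the pipeline."""
    event = {
        "ProductID": product_id,
        "Quantity": quantity,
        "Price": price,
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

payload = make_sale_event("SKU-1001", 3, 19.99)
print(payload)
```

Every stage below (Event Hubs, Stream Analytics, the Bronze layer) operates on records of this shape.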
Data Flow:
- Real-Time Ingestion: Azure Event Hubs captures transactions.
- Real-Time Processing: Azure Stream Analytics aggregates data.
- Storage: Azure Data Lake Storage (Bronze: raw, Silver: cleaned, Gold: aggregated).
- Batch Processing: Azure Data Factory with Databricks transformations.
- Analytics: Azure Synapse Analytics and Analysis Services.
- Machine Learning: Azure Machine Learning for predictions.
- Visualization: Power BI dashboards.
Role of Each Tool:
- Azure Data Factory: Orchestrates data movement.
- Azure Synapse Analytics: Queries structured data.
- Azure Databricks: Spark-based transformations.
- Azure Data Lake Storage: Centralized storage.
- Azure Stream Analytics: Real-time processing.
- Azure Event Hubs: High-throughput ingestion.
- Azure Analysis Services: Semantic models.
- Azure Machine Learning: Predictive models.
- Power BI: Visualizes insights.
Initial Setup
Set up Azure resources using the `pipeline_setup.sh` script. Ensure the Azure CLI is installed and you are logged in (`az login`). Save the script, make it executable (`chmod +x pipeline_setup.sh`), and run it (`./pipeline_setup.sh`).
#!/bin/bash
# Note: the datafactory, stream-analytics, and ml commands live in Azure CLI
# extensions; recent CLI versions offer to install missing extensions automatically.
# Variables
resourceGroup="sales-pipeline-rg"
location="eastus"
dataFactoryName="salesDataFactory"
storageAccountName="salesdatalake$(date +%s)"
dataLakeName="salesdatalake"
databricksWorkspaceName="sales-databricks"
synapseWorkspaceName="sales-synapse"
eventHubNamespace="sales-eventhub-ns"
eventHubName="sales-events"
streamAnalyticsJobName="sales-stream-job"
analysisServicesName="sales-analysis"
machineLearningWorkspaceName="sales-ml"
powerBIName="sales-powerbi"
# Step 1: Create Resource Group
echo "Creating resource group: $resourceGroup"
az group create --name $resourceGroup --location $location
# Step 2: Create Azure Data Lake Storage Gen2
echo "Creating storage account: $storageAccountName"
az storage account create \
--name $storageAccountName \
--resource-group $resourceGroup \
--location $location \
--sku Standard_LRS \
--kind StorageV2 \
--enable-hierarchical-namespace true
# Create containers (Bronze, Silver, Gold)
# --auth-mode login authenticates with your Azure AD identity instead of account keys
az storage container create --name bronze --account-name $storageAccountName --auth-mode login
az storage container create --name silver --account-name $storageAccountName --auth-mode login
az storage container create --name gold --account-name $storageAccountName --auth-mode login
# Step 3: Create Azure Data Factory
echo "Creating Azure Data Factory: $dataFactoryName"
az datafactory create \
--name $dataFactoryName \
--resource-group $resourceGroup \
--location $location
# Step 4: Create Azure Databricks Workspace
echo "Creating Azure Databricks workspace: $databricksWorkspaceName"
az databricks workspace create \
--name $databricksWorkspaceName \
--resource-group $resourceGroup \
--location $location \
--sku standard
# Step 5: Create Azure Synapse Analytics Workspace
echo "Creating Azure Synapse Analytics workspace: $synapseWorkspaceName"
# --file-system is required: it names the workspace's default ADLS Gen2 filesystem
# ("synapsefs" here is an arbitrary choice; it is created if it does not exist)
az synapse workspace create \
--name $synapseWorkspaceName \
--resource-group $resourceGroup \
--storage-account $storageAccountName \
--file-system synapsefs \
--location $location \
--sql-admin-login-user user_id \
--sql-admin-login-password "your_password"
# Step 6: Create Azure Event Hubs Namespace and Event Hub
echo "Creating Event Hubs namespace: $eventHubNamespace"
az eventhubs namespace create \
--name $eventHubNamespace \
--resource-group $resourceGroup \
--location $location \
--sku Basic
echo "Creating Event Hub: $eventHubName"
az eventhubs eventhub create \
--name $eventHubName \
--namespace-name $eventHubNamespace \
--resource-group $resourceGroup \
--partition-count 2
# Step 7: Create Azure Stream Analytics Job
echo "Creating Stream Analytics job: $streamAnalyticsJobName"
az stream-analytics job create \
--name $streamAnalyticsJobName \
--resource-group $resourceGroup \
--location $location \
--output-error-policy Drop \
--events-outoforder-policy Drop \
--events-late-arrival-max-delay 5 \
--events-outoforder-max-delay 5
# Step 8: Create Azure Analysis Services
echo "Creating Azure Analysis Services: $analysisServicesName"
az analysis-services server create \
--name $analysisServicesName \
--resource-group $resourceGroup \
--location $location \
--sku B1 \
--admin-users "user_id@example.com"
# Step 9: Create Azure Machine Learning Workspace
echo "Creating Azure Machine Learning workspace: $machineLearningWorkspaceName"
az ml workspace create \
--name $machineLearningWorkspaceName \
--resource-group $resourceGroup \
--location $location
# Step 10: Output instructions for Power BI
echo "Power BI setup: Use Power BI Desktop to connect to Azure Synapse Analytics and Azure Analysis Services. Create a workspace in Power BI Service after signing in with your Azure account."
echo "Setup complete! Check the Azure Portal to configure pipelines and services."
Post-Setup Steps
1. Configure Azure Data Factory Pipeline
Explanation: Creates a pipeline in Azure Data Factory (ADF) to orchestrate data movement and transformation. It includes a Copy Activity to ingest real-time sales data from Azure Event Hubs into the Bronze layer, Databricks Notebook Activities to transform data from Bronze to Silver (cleaned) and Silver to Gold (aggregated), and a Copy Activity to load Gold data into Azure Synapse Analytics.
How to Do It: Use the Azure Portal’s visual interface.
- Navigate to the Azure Portal (portal.azure.com) and search for your Data Factory instance (`salesDataFactory`).
- Click the Data Factory resource, then select Launch Studio.
- In the left pane, go to Author (pencil icon).
- Create Linked Services:
  - Click Manage > Linked Services > New.
  - Azure Event Hubs: Name: `EventHubLinkedService`. Select the namespace (`sales-eventhub-ns`) and event hub (`sales-events`).
  - Azure Data Lake Storage Gen2: Name: `DataLakeLinkedService`. Select the storage account (`salesdatalake<timestamp>`).
  - Azure Databricks: Name: `DatabricksLinkedService`. Select the workspace (`sales-databricks`) and provide an access token from Databricks.
  - Azure Synapse Analytics: Name: `SynapseLinkedService`. Select the workspace (`sales-synapse`) and SQL pool (`SalesSQLPool`). Authenticate with the SQL admin credentials (username: `user_id`, password: `your_password`).
- Create Datasets:
  - In Author, click Datasets > New Dataset.
  - Event Hubs: Name: `EventHubDataset`. Link to `EventHubLinkedService`. Specify the event hub (`sales-events`).
  - Data Lake (Bronze): Name: `BronzeDataset`. Link to `DataLakeLinkedService`. Path: `bronze/sales/`. Format: JSON.
  - Data Lake (Silver): Name: `SilverDataset`. Link to `DataLakeLinkedService`. Path: `silver/sales/`. Format: Parquet.
  - Data Lake (Gold): Name: `GoldDataset`. Link to `DataLakeLinkedService`. Path: `gold/sales_aggregated/`. Format: Parquet.
  - Synapse: Name: `SynapseDataset`. Link to `SynapseLinkedService`. Select the table (`GoldSales`).
- Create Pipeline:
  - In Author, click Pipelines > New Pipeline. Name: `SalesPipeline`.
  - Add activities:
    - Copy Activity: Name: `CopyFromEventHubToBronze`. Source: `EventHubDataset`. Sink: `BronzeDataset`.
    - Databricks Notebook: Name: `TransformBronzeToSilver`. Linked Service: `DatabricksLinkedService`. Notebook Path: `/Sales/TransformBronzeToSilver`.
    - Databricks Notebook: Name: `TransformSilverToGold`. Linked Service: `DatabricksLinkedService`. Notebook Path: `/Sales/TransformSilverToGold`.
    - Copy Activity: Name: `LoadToSynapse`. Source: `GoldDataset`. Sink: `SynapseDataset`.
  - Connect the activities sequentially (drag the arrows between them).
- Click Validate, then Publish All. Trigger with Add Trigger > Trigger Now, or set a schedule.
2. Configure Azure Databricks Notebook and Mount Storage
Explanation: Creates PySpark notebooks in Azure Databricks to transform data from Bronze to Silver (cleaning) and Silver to Gold (aggregating). Mounts Data Lake Storage containers.
How to Do It: Use the Azure Portal.
- Navigate to the Azure Portal and search for your Databricks workspace (`sales-databricks`).
- Launch the workspace.
- In Workspace, select Create > Notebook.
- Create notebook `TransformBronzeToSilver` (PySpark).
- Create notebook `TransformSilverToGold` (PySpark).
- Run the notebooks to test.
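The Bronze-to-Silver notebook itself is PySpark, but the cleaning rules it would apply can be sketched on plain Python dicts so the logic is visible without a Spark cluster. The specific rules below (drop malformed JSON, require all schema fields, cast types, reject non-positive values, de-duplicate) are illustrative assumptions, not the notebook's exact contents:

```python
import json

def clean_bronze_records(raw_lines):
    """Parse raw JSON lines, drop malformed/invalid rows, cast types, de-duplicate."""
    cleaned, seen = [], set()
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed payloads
        # Require every field of the sales schema
        if not all(k in rec for k in ("ProductID", "Quantity", "Price", "Timestamp")):
            continue
        try:
            qty, price = int(rec["Quantity"]), float(rec["Price"])
        except (TypeError, ValueError):
            continue  # skip rows whose values cannot be cast
        if qty <= 0 or price <= 0:
            continue  # reject impossible sales
        key = (rec["ProductID"], rec["Timestamp"])  # de-duplicate replayed events
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"ProductID": rec["ProductID"], "Quantity": qty,
                        "Price": price, "Timestamp": rec["Timestamp"]})
    return cleaned

rows = clean_bronze_records([
    '{"ProductID": "SKU-1", "Quantity": "2", "Price": "9.5", "Timestamp": "2024-01-01T10:00:00Z"}',
    'not json',
    '{"ProductID": "SKU-1", "Quantity": "2", "Price": "9.5", "Timestamp": "2024-01-01T10:00:00Z"}',
])
print(rows)
```

In the real notebook the same steps map onto DataFrame operations (schema enforcement on read, filters, `dropDuplicates`), writing Parquet to the Silver container.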
3. Configure Azure Stream Analytics Query, Inputs, and Outputs
Explanation: Defines a query in Azure Stream Analytics to aggregate real-time sales data from Event Hubs. Sets inputs from Event Hubs and outputs to Data Lake Storage and Power BI.
How to Do It: Use Azure CLI.
# Create input from Event Hubs
az stream-analytics input create --job-name $streamAnalyticsJobName --name "SalesInput" --resource-group $resourceGroup --properties "{\"type\":\"Stream\",\"datasource\":{\"type\":\"Microsoft.ServiceBus/EventHub\",\"properties\":{\"eventHubName\":\"$eventHubName\",\"serviceBusNamespace\":\"$eventHubNamespace\"}}}"
# Create output to Data Lake Storage (Bronze)
az stream-analytics output create --job-name $streamAnalyticsJobName --name "BronzeOutput" --resource-group $resourceGroup --datasource "{\"type\":\"Microsoft.Storage/Blob\",\"properties\":{\"storageAccounts\":[{\"accountName\":\"$storageAccountName\"}],\"container\":\"bronze\",\"pathPattern\":\"sales/{date}/{time}\"}}"
# Create output to Power BI
az stream-analytics output create --job-name $streamAnalyticsJobName --name "PowerBIOutput" --resource-group $resourceGroup --datasource "{\"type\":\"PowerBI\",\"properties\":{\"dataset\":\"SalesRealTime\",\"table\":\"HourlySales\",\"groupId\":\"your-powerbi-group-id\",\"groupName\":\"your-powerbi-group-name\"}}"
# Create the transformation (query); note each output needs its own SELECT ... INTO statement
az stream-analytics transformation create --job-name $streamAnalyticsJobName --resource-group $resourceGroup --name "Transformation" --streaming-units 1 --saql "SELECT ProductID, SUM(Quantity) AS TotalQuantity INTO BronzeOutput FROM SalesInput GROUP BY ProductID, TumblingWindow(hour, 1); SELECT ProductID, SUM(Quantity) AS TotalQuantity INTO PowerBIOutput FROM SalesInput GROUP BY ProductID, TumblingWindow(hour, 1)"
# Start the job
az stream-analytics job start --job-name $streamAnalyticsJobName --resource-group $resourceGroup --output-start-mode "JobStartTime"
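The query above sums `Quantity` per `ProductID` over one-hour tumbling (non-overlapping) windows. For intuition, here is the same aggregation expressed in plain Python over in-memory events; this is an illustrative model, not how Stream Analytics executes internally:

```python
from collections import defaultdict
from datetime import datetime

def hourly_totals(events):
    """Sum Quantity per (ProductID, 1-hour window), mirroring TumblingWindow(hour, 1)."""
    totals = defaultdict(int)
    for e in events:
        ts = datetime.fromisoformat(e["Timestamp"])
        # Truncate the timestamp to the top of the hour: the window start
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        totals[(e["ProductID"], window_start.isoformat())] += e["Quantity"]
    return dict(totals)

events = [
    {"ProductID": "SKU-1", "Quantity": 2, "Timestamp": "2024-01-01T10:05:00"},
    {"ProductID": "SKU-1", "Quantity": 3, "Timestamp": "2024-01-01T10:45:00"},
    {"ProductID": "SKU-1", "Quantity": 1, "Timestamp": "2024-01-01T11:10:00"},
]
print(hourly_totals(events))
```

Each window emits one row per product, which is what lands in the Bronze output and the Power BI dataset every hour.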
4. Configure Azure Event Hubs Producer for Mock Data
Explanation: Sets up a Python script to send mock sales data to Event Hubs.
How to Do It: Use a Python script with the Azure Event Hubs SDK. Install the SDK (`pip install azure-eventhub`), then run the script: `python send_mock_data.py`.
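The payload-generation half of such a script might look like the following. The catalog, quantity, and price ranges are made-up values; the actual network send (omitted so this stays self-contained) would wrap each string in `EventData` and call `EventHubProducerClient.send_batch` from the `azure-eventhub` package installed above:

```python
import json
import random
from datetime import datetime, timezone

PRODUCT_IDS = ["SKU-1001", "SKU-1002", "SKU-1003"]  # illustrative catalog

def generate_mock_sales(n: int, seed: int = 42):
    """Return n mock sales events as JSON strings ready to send to Event Hubs."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    events = []
    for _ in range(n):
        events.append(json.dumps({
            "ProductID": rng.choice(PRODUCT_IDS),
            "Quantity": rng.randint(1, 10),
            "Price": round(rng.uniform(5.0, 100.0), 2),
            "Timestamp": datetime.now(timezone.utc).isoformat(),
        }))
    return events

batch = generate_mock_sales(5)
print(batch[0])
```

Running this in a loop with a short sleep gives the pipeline a steady trickle of "live" transactions to ingest.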
5. Configure Azure Synapse Analytics SQL Pool and Queries
Explanation: Creates a SQL pool in Synapse to store and query aggregated data, defining KPIs.
How to Do It: Use Azure CLI for the SQL pool, and the Azure Portal for queries.
az synapse sql pool create --name "SalesSQLPool" --performance-level "DW100c" --resource-group $resourceGroup --workspace-name $synapseWorkspaceName --storage-account-type LRS
- Go to your Synapse workspace > Develop hub.
- Create a new SQL script and run it.
6. Configure Azure Analysis Services Tabular Model
Explanation: Builds a semantic tabular model from Synapse data.
How to Do It: Author the model with Visual Studio (Analysis Services projects) or Tabular Editor, and manage the deployed server with SQL Server Management Studio (SSMS).
- In the Azure Portal, go to your Analysis Services server (`sales-analysis`).
- Create a tabular project, import data from the Synapse SQL pool, add measures (e.g., Total Sales = SUM(GoldSales[TotalSales])), and deploy to the server. Connect with SSMS to verify the deployed model.
7. Configure Azure Machine Learning Model
Explanation: Trains a regression model using Synapse data to predict sales.
How to Do It: Use Azure CLI.
# Create model (after training)
az ml model create --name "SalesPredictionModel" --version 1 --path "./model_path" --resource-group $resourceGroup --workspace-name $machineLearningWorkspaceName
# For training
az ml job create --file job.yaml --resource-group $resourceGroup --workspace-name $machineLearningWorkspaceName
8. Configure Power BI Dashboards
Explanation: Connects Power BI to Synapse and Analysis Services for visualizations.
How to Do It: Use Power BI Desktop and Power BI Service.
- In Power BI Desktop, select Get Data > Azure > Azure Synapse Analytics (SQL DW). Enter the server (`sales-synapse.sql.azuresynapse.net`), database (`SalesSQLPool`), and credentials (username: `user_id`, password: `your_password`).
- Connect to Analysis Services similarly.
- Create visuals (e.g., a bar chart of total sales).
- Publish to Power BI Service and share dashboards. Ensure the Stream Analytics output is linked to a Power BI dataset.
Notes
- Free Account: Use minimal resource tiers (e.g., DW100c for Synapse, B1 for Analysis Services) to stay within the $200 credit limit.
- Monitoring: Check costs in the Azure Portal under Cost Management.
- Industry Alignment: Uses a medallion architecture, mirroring enterprise practices.