Project Overview

This project builds a real-time sales analytics pipeline on Azure, using an Azure free account ($200 in credits for 30 days). The pipeline ingests retail sales transactions, stores the raw data, transforms it through Bronze/Silver/Gold layers, analyzes trends, predicts future sales, and visualizes the results.

Tools Used: Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Event Hubs, Azure Stream Analytics, Azure Data Lake Storage Gen2, Azure Analysis Services, Azure Machine Learning, and Power BI.

Conceptual Overview

Objective: Create an end-to-end pipeline for processing sales data (product ID, quantity, price, timestamp).

Data Flow: Event Hubs → Stream Analytics → Data Lake Storage (Bronze → Silver → Gold, via Databricks) → Synapse Analytics → Analysis Services / Machine Learning → Power BI.

Role of Each Tool: Event Hubs ingests streaming transactions; Stream Analytics aggregates them in real time; Data Lake Storage holds the Bronze/Silver/Gold layers; Data Factory orchestrates data movement and transformation; Databricks cleans and aggregates the data; Synapse Analytics stores and queries the Gold layer; Analysis Services provides a semantic model; Machine Learning predicts sales; Power BI visualizes the results.

Initial Setup

Set up Azure resources using the pipeline_setup.sh script. Ensure Azure CLI is installed and you are logged in (az login). Save the script, make it executable (chmod +x pipeline_setup.sh), and run it (./pipeline_setup.sh).

Download pipeline_setup.sh

#!/bin/bash

# Variables
resourceGroup="sales-pipeline-rg"
location="eastus"
dataFactoryName="salesDataFactory"
storageAccountName="salesdatalake$(date +%s)"
dataLakeName="salesdatalake"
databricksWorkspaceName="sales-databricks"
synapseWorkspaceName="sales-synapse"
eventHubNamespace="sales-eventhub-ns"
eventHubName="sales-events"
streamAnalyticsJobName="sales-stream-job"
analysisServicesName="sales-analysis"
machineLearningWorkspaceName="sales-ml"
powerBIName="sales-powerbi"

# Step 1: Create Resource Group
echo "Creating resource group: $resourceGroup"
az group create --name $resourceGroup --location $location

# Step 2: Create Azure Data Lake Storage Gen2
echo "Creating storage account: $storageAccountName"
az storage account create \
  --name $storageAccountName \
  --resource-group $resourceGroup \
  --location $location \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true

# Create containers (Bronze, Silver, Gold)
az storage container create --name bronze --account-name $storageAccountName --auth-mode login
az storage container create --name silver --account-name $storageAccountName --auth-mode login
az storage container create --name gold --account-name $storageAccountName --auth-mode login

# Step 3: Create Azure Data Factory
echo "Creating Azure Data Factory: $dataFactoryName"
# az datafactory requires the datafactory CLI extension:
az extension add --name datafactory
az datafactory create \
  --name $dataFactoryName \
  --resource-group $resourceGroup \
  --location $location

# Step 4: Create Azure Databricks Workspace
echo "Creating Azure Databricks workspace: $databricksWorkspaceName"
# az databricks requires the databricks CLI extension:
az extension add --name databricks
az databricks workspace create \
  --name $databricksWorkspaceName \
  --resource-group $resourceGroup \
  --location $location \
  --sku standard

# Step 5: Create Azure Synapse Analytics Workspace
echo "Creating Azure Synapse Analytics workspace: $synapseWorkspaceName"
# Note: replace user_id and "your_password" with your own SQL admin credentials.
az synapse workspace create \
  --name $synapseWorkspaceName \
  --resource-group $resourceGroup \
  --storage-account $storageAccountName \
  --file-system bronze \
  --location $location \
  --sql-admin-login-user user_id \
  --sql-admin-login-password "your_password"

# Step 6: Create Azure Event Hubs Namespace and Event Hub
echo "Creating Event Hubs namespace: $eventHubNamespace"
az eventhubs namespace create \
  --name $eventHubNamespace \
  --resource-group $resourceGroup \
  --location $location \
  --sku Basic

echo "Creating Event Hub: $eventHubName"
az eventhubs eventhub create \
  --name $eventHubName \
  --namespace-name $eventHubNamespace \
  --resource-group $resourceGroup \
  --partition-count 2

# Step 7: Create Azure Stream Analytics Job
echo "Creating Stream Analytics job: $streamAnalyticsJobName"
# az stream-analytics requires the stream-analytics CLI extension:
az extension add --name stream-analytics
az stream-analytics job create \
  --name $streamAnalyticsJobName \
  --resource-group $resourceGroup \
  --location $location \
  --output-error-policy Drop \
  --events-outoforder-policy Drop \
  --events-late-arrival-max-delay 5 \
  --events-outoforder-max-delay 5

# Step 8: Create Azure Analysis Services
echo "Creating Azure Analysis Services: $analysisServicesName"
az analysis-services server create \
  --name $analysisServicesName \
  --resource-group $resourceGroup \
  --location $location \
  --sku B1 \
  --admin-users "user_id@example.com"

# Step 9: Create Azure Machine Learning Workspace
echo "Creating Azure Machine Learning workspace: $machineLearningWorkspaceName"
# az ml requires the ml CLI extension (Azure ML v2):
az extension add --name ml
az ml workspace create \
  --name $machineLearningWorkspaceName \
  --resource-group $resourceGroup \
  --location $location

# Step 10: Output instructions for Power BI
echo "Power BI setup: Use Power BI Desktop to connect to Azure Synapse Analytics and Azure Analysis Services. Create a workspace in Power BI Service after signing in with your Azure account."

echo "Setup complete! Check the Azure Portal to configure pipelines and services."

Post-Setup Steps

1. Configure Azure Data Factory Pipeline

Explanation: Creates a pipeline in Azure Data Factory (ADF) to orchestrate data movement and transformation. It includes a Copy Activity to ingest real-time sales data from Azure Event Hubs into the Bronze layer, Databricks Notebook Activities to transform data from Bronze to Silver (cleaned) and Silver to Gold (aggregated), and a Copy Activity to load Gold data into Azure Synapse Analytics.

How to Do It: Use the Azure Portal’s visual interface.
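Behind the Portal UI, ADF stores the pipeline as a JSON definition of activities and their dependencies. As a rough illustration only (activity names are hypothetical, and the real ADF schema has many more fields such as typeProperties and linkedServiceName), the chain described above looks like this:

```python
# Simplified illustration of the ADF pipeline's activity chain; not the
# exact ADF JSON schema.
pipeline = {
    "name": "SalesPipeline",
    "activities": [
        {"name": "IngestFromEventHubs", "type": "Copy", "dependsOn": []},
        {"name": "BronzeToSilver", "type": "DatabricksNotebook",
         "dependsOn": ["IngestFromEventHubs"]},
        {"name": "SilverToGold", "type": "DatabricksNotebook",
         "dependsOn": ["BronzeToSilver"]},
        {"name": "LoadGoldToSynapse", "type": "Copy",
         "dependsOn": ["SilverToGold"]},
    ],
}

def execution_order(p):
    """Order activities so each runs only after its dependsOn list is done."""
    done, order = set(), []
    acts = {a["name"]: a for a in p["activities"]}
    while len(order) < len(acts):
        for name, a in acts.items():
            if name not in done and all(d in done for d in a["dependsOn"]):
                done.add(name)
                order.append(name)
    return order
```

The dependency chain is strictly linear here, so the pipeline runs as ingest, then the two transformation notebooks, then the load into Synapse.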

2. Configure Azure Databricks Notebook and Mount Storage

Explanation: Creates PySpark notebooks in Azure Databricks to transform data from Bronze to Silver (cleaning) and Silver to Gold (aggregating). Mounts Data Lake Storage containers.

How to Do It: Use the Azure Portal.

Download TransformBronzeToSilver.py

Download TransformSilverToGold.py
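The Bronze-to-Silver cleaning step can be sketched without Spark as plain-Python logic over a list of transaction dicts. Field names here are illustrative; the actual TransformBronzeToSilver.py notebook expresses the same steps with PySpark DataFrames:

```python
# A minimal, Spark-free sketch of the Bronze -> Silver cleaning logic.
def clean_transactions(rows):
    """Drop malformed records and normalize types (Bronze -> Silver)."""
    silver = []
    for r in rows:
        try:
            qty = int(r["quantity"])
            price = float(r["price"])
        except (KeyError, TypeError, ValueError):
            continue  # drop records with missing or non-numeric fields
        if qty <= 0 or price <= 0:
            continue  # drop non-positive quantities/prices in this sketch
        silver.append({
            "product_id": str(r["product_id"]),
            "quantity": qty,
            "price": price,
            "amount": round(qty * price, 2),
            "timestamp": r["timestamp"],
        })
    return silver

bronze = [
    {"product_id": "P1", "quantity": "2", "price": "9.99",
     "timestamp": "2024-01-01T10:00:00Z"},
    {"product_id": "P2", "quantity": "x", "price": "5.00",
     "timestamp": "2024-01-01T10:01:00Z"},  # bad quantity, gets dropped
]
silver = clean_transactions(bronze)
```

The Silver-to-Gold notebook would then aggregate these cleaned rows (for example, total quantity and revenue per product per hour).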

3. Configure Azure Stream Analytics Query, Inputs, and Outputs

Explanation: Defines a query in Azure Stream Analytics to aggregate real-time sales data from Event Hubs. Sets inputs from Event Hubs and outputs to Data Lake Storage and Power BI.

How to Do It: Use Azure CLI. The shell variables below ($streamAnalyticsJobName, $resourceGroup, and so on) are the ones defined in pipeline_setup.sh.

# Create input from Event Hubs
az stream-analytics input create \
  --job-name $streamAnalyticsJobName \
  --name "SalesInput" \
  --resource-group $resourceGroup \
  --properties "{\"type\":\"Stream\",\"datasource\":{\"type\":\"Microsoft.ServiceBus/EventHub\",\"properties\":{\"eventHubName\":\"$eventHubName\",\"serviceBusNamespace\":\"$eventHubNamespace\"}}}"

# Create output to Data Lake Storage (Bronze)
az stream-analytics output create --job-name $streamAnalyticsJobName --name "BronzeOutput" --resource-group $resourceGroup --datasource "{\"type\":\"Microsoft.Storage/Blob\",\"properties\":{\"storageAccounts\":[{\"accountName\":\"$storageAccountName\"}],\"container\":\"bronze\",\"pathPattern\":\"sales/{date}/{time}\"}}"

# Create output to Power BI
az stream-analytics output create --job-name $streamAnalyticsJobName --name "PowerBIOutput" --resource-group $resourceGroup --datasource "{\"type\":\"PowerBI\",\"properties\":{\"dataset\":\"SalesRealTime\",\"table\":\"HourlySales\",\"groupId\":\"your-powerbi-group-id\",\"groupName\":\"your-powerbi-group-name\"}}"

# Update job with query. Stream Analytics allows only one INTO clause per
# SELECT statement, so each output gets its own statement.
az stream-analytics job update \
  --job-name $streamAnalyticsJobName \
  --resource-group $resourceGroup \
  --transformation "{\"name\":\"Transformation\",\"properties\":{\"streamingUnits\":1,\"query\":\"SELECT ProductID, SUM(Quantity) AS TotalQuantity INTO BronzeOutput FROM SalesInput GROUP BY ProductID, TumblingWindow(hour, 1); SELECT ProductID, SUM(Quantity) AS TotalQuantity INTO PowerBIOutput FROM SalesInput GROUP BY ProductID, TumblingWindow(hour, 1)\"}}"

# Start the job
az stream-analytics job start \
  --job-name $streamAnalyticsJobName \
  --resource-group $resourceGroup \
  --output-start-mode "JobStartTime"
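The query's tumbling-window aggregation can be simulated in plain Python to check the expected output shape. This is only a local model of what Stream Analytics computes, using the same field names as the query:

```python
from collections import defaultdict
from datetime import datetime

def tumbling_hourly_totals(events):
    """Sum Quantity per ProductID over 1-hour tumbling windows."""
    totals = defaultdict(int)
    for e in events:
        ts = datetime.fromisoformat(e["EventTime"])
        # A tumbling window assigns each event to exactly one fixed-size,
        # non-overlapping window; here, the hour containing the event.
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        totals[(e["ProductID"], window_start.isoformat())] += e["Quantity"]
    return dict(totals)

events = [
    {"ProductID": "P1", "Quantity": 2, "EventTime": "2024-01-01T10:15:00"},
    {"ProductID": "P1", "Quantity": 3, "EventTime": "2024-01-01T10:45:00"},
    {"ProductID": "P1", "Quantity": 1, "EventTime": "2024-01-01T11:05:00"},
]
totals = tumbling_hourly_totals(events)
```

The first two events land in the 10:00 window and the third in the 11:00 window, matching TumblingWindow(hour, 1) semantics.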

4. Configure Azure Event Hubs Producer for Mock Data

Explanation: Sets up a Python script to send mock sales data to Event Hubs.

How to Do It: Use a Python script with the Azure Event Hubs SDK. Install it first (pip install azure-eventhub).

Download send_mock_data.py

Run: python send_mock_data.py
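The event-generation side of send_mock_data.py can be sketched as below. The actual send uses the azure-eventhub SDK (EventHubProducerClient), shown only in comments here because it needs a live namespace and connection string; field names and product IDs are placeholders:

```python
import json
import random
from datetime import datetime, timezone

# Sending would look roughly like (requires `pip install azure-eventhub`
# and a real connection string):
#   from azure.eventhub import EventHubProducerClient, EventData
#   producer = EventHubProducerClient.from_connection_string(
#       conn_str, eventhub_name="sales-events")
#   batch = producer.create_batch()
#   batch.add(EventData(payload))
#   producer.send_batch(batch)

def make_sale_event(rng):
    """Build one mock sales transaction matching the pipeline's schema."""
    return {
        "ProductID": rng.choice(["P1", "P2", "P3"]),
        "Quantity": rng.randint(1, 5),
        "Price": round(rng.uniform(1.0, 100.0), 2),
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }

rng = random.Random(42)  # seeded for reproducibility
payloads = [json.dumps(make_sale_event(rng)) for _ in range(3)]
```

Each payload is a JSON string, which is the form Event Hubs consumers (Stream Analytics, here) expect to parse downstream.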

5. Configure Azure Synapse Analytics SQL Pool and Queries

Explanation: Creates a SQL pool in Synapse to store and query aggregated data, defining KPIs.

How to Do It: Use Azure CLI for the SQL pool, and the Azure Portal for queries.

Download synapse_queries.sql

az synapse sql pool create \
  --name "SalesSQLPool" \
  --performance-level "DW100c" \
  --resource-group $resourceGroup \
  --workspace-name $synapseWorkspaceName \
  --storage-type LRS
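The KPI queries themselves live in synapse_queries.sql; as an illustration of the kind of KPI they define, here is a typical one (total revenue per product, top N) computed in Python over hypothetical Gold-layer rows:

```python
# Illustrative KPI over Gold-layer rows: total revenue per product, top-N.
# synapse_queries.sql would express KPIs like this in T-SQL against the pool.
def top_products_by_revenue(rows, n=2):
    revenue = {}
    for r in rows:
        amount = r["TotalQuantity"] * r["AvgPrice"]
        revenue[r["ProductID"]] = revenue.get(r["ProductID"], 0.0) + amount
    # Sort descending by revenue and keep the top n products.
    return sorted(revenue.items(), key=lambda kv: kv[1], reverse=True)[:n]

gold = [
    {"ProductID": "P1", "TotalQuantity": 10, "AvgPrice": 2.0},
    {"ProductID": "P2", "TotalQuantity": 3, "AvgPrice": 10.0},
    {"ProductID": "P3", "TotalQuantity": 1, "AvgPrice": 5.0},
]
top = top_products_by_revenue(gold)
```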

6. Configure Azure Analysis Services Tabular Model

Explanation: Builds a semantic tabular model from Synapse data.

How to Do It: Use SQL Server Management Studio (SSMS).

7. Configure Azure Machine Learning Model

Explanation: Trains a regression model using Synapse data to predict sales.

How to Do It: Use Azure CLI.

Download job.yaml

# Train the model
az ml job create \
  --file job.yaml \
  --resource-group $resourceGroup \
  --workspace-name $machineLearningWorkspaceName

# Register the trained model
az ml model create \
  --name "SalesPredictionModel" \
  --version 1 \
  --path "./model_path" \
  --resource-group $resourceGroup \
  --workspace-name $machineLearningWorkspaceName
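The regression idea behind the training job can be sketched as a simple least-squares fit of sales against time. The real job defined in job.yaml would run in Azure ML with proper features and a framework; the data below is made up purely for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical weekly sales history: (week index, units sold).
weeks = [1, 2, 3, 4]
sales = [2.0, 4.0, 6.0, 8.0]

slope, intercept = fit_line(weeks, sales)
predicted_week5 = slope * 5 + intercept
```

With this perfectly linear toy data the fit recovers slope 2 and intercept 0, so the week-5 prediction is 10 units.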

8. Configure Power BI Dashboards

Explanation: Connects Power BI to Synapse and Analysis Services for visualizations.

How to Do It: Use Power BI Desktop and Power BI Service.

Notes