Machine Learning Algorithms for Industry

A comprehensive guide to commonly used ML algorithms, their applications, and Python implementations on Azure.

Overview

Machine Learning algorithms are categorized by their learning approach. Below, we list algorithms commonly used in industry, grouped into Supervised Learning, Unsupervised Learning, Ensemble Learning, Deep Learning, and Reinforcement Learning. Each entry includes a description, use cases, a recommended Python library, and a template script that trains on a large dataset stored in Azure Blob Storage as 10 CSV parts.
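The template scripts in this guide read their input parts by combining a shared prefix with a 1-based index. As a small sketch of that naming convention (the helper name part_blob_names is hypothetical and not part of any Azure SDK), the expected blob names could be generated like this:

```python
def part_blob_names(prefix="train_data_part_", num_parts=10, ext=".csv"):
    """Return the blob names the template scripts expect, numbered from 1."""
    return [f"{prefix}{part}{ext}" for part in range(1, num_parts + 1)]

names = part_blob_names()
print(names[0], names[-1])
```

Listing the names up front makes it easy to check that every part exists in the container before a long training run starts.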

Supervised Learning

Algorithms that learn from labeled data to predict outcomes: regression for continuous targets, classification for discrete ones.

Linear Regression

Description: Models the linear relationship between input features and a continuous target by fitting a line to minimize prediction errors.
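To make "fitting a line to minimize prediction errors" concrete, here is a minimal pure-Python sketch of ordinary least squares for a single feature. The function name fit_line and the sample numbers are illustrative assumptions; this is the closed-form solution, not the stochastic gradient descent approach used in the template script.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on price = 2 * size + 1
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(sizes, prices)
print(slope, intercept)
```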

Use Cases:

  • Real Estate: Predicting house prices based on size and location.
  • Finance: Forecasting stock trends from historical data.

Best Library: scikit-learn

Python Script:
import os
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from azure.storage.blob import BlobServiceClient

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Initialize Model (Incremental with SGD)
model = SGDRegressor()
scaler = StandardScaler()

# Train Sequentially
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.drop('target', axis=1).values
    y = df['target'].values
    
    # Scale and Partial Fit
    scaler.partial_fit(X)
    X_scaled = scaler.transform(X)
    model.partial_fit(X_scaled, y)
    
    os.remove(temp_file)

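# Save Model and Scaler (inputs must be scaled the same way at prediction time)
import joblib
joblib.dump(model, 'linear_regression_model.pkl')
joblib.dump(scaler, 'linear_regression_scaler.pkl')
print("Training complete. Model and scaler saved.")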
                    

Logistic Regression

Description: Models class probabilities by passing a linear combination of the features through a sigmoid function (binary case) or a softmax function (multi-class case).
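As a minimal sketch of the binary case, the snippet below turns a weighted sum of features into a probability with the sigmoid function. The names predict_proba, weights, and bias are illustrative assumptions, and the weights are made up rather than learned.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights; in practice they are learned from labeled data
p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(p)
```

A probability above a chosen threshold (0.5 is a common default) is then mapped to the positive class.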

Use Cases:

  • Healthcare: Predicting disease risk (e.g., diabetic or not).
  • Marketing: Classifying customer leads as likely to convert.

Best Library: scikit-learn

Python Script:
import os
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from azure.storage.blob import BlobServiceClient

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Initialize Model (Incremental)
model = SGDClassifier(loss='log_loss')
scaler = StandardScaler()

# Train Sequentially
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.drop('target', axis=1).values
    y = df['target'].values
    
    scaler.partial_fit(X)
    X_scaled = scaler.transform(X)
    model.partial_fit(X_scaled, y, classes=[0, 1])  # Assume binary; adjust classes
    
    os.remove(temp_file)

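# Save Model and Scaler (inputs must be scaled the same way at prediction time)
import joblib
joblib.dump(model, 'logistic_regression_model.pkl')
joblib.dump(scaler, 'logistic_regression_scaler.pkl')
print("Training complete. Model and scaler saved.")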
                    

Decision Tree

Description: Builds a tree where nodes represent feature-based decisions, splitting data to minimize impurity.
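One common impurity measure is Gini impurity. The sketch below, with illustrative function names, shows how a candidate split could be scored: the lower the weighted impurity of the two child nodes, the better the split. It is a simplified illustration, not scikit-learn's internal implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split, weighting each child by its share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini([0, 0, 1, 1]))                # evenly mixed node
print(split_impurity([0, 0], [1, 1]))    # perfect split
print(split_impurity([0, 1], [0, 1]))    # split that separates nothing
```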

Use Cases:

  • Banking: Approving loans by evaluating applicants.
  • Retail: Segmenting customers for promotions.

Best Library: scikit-learn

Python Script:
import os
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from azure.storage.blob import BlobServiceClient
import numpy as np

# Note: scikit-learn decision trees do not support incremental training (no partial_fit).
# The parts are downloaded one at a time and accumulated, so the combined dataset must fit in memory.

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Accumulate Data
X_all, y_all = [], []
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.drop('target', axis=1).values
    y = df['target'].values
    X_all.append(X)
    y_all.append(y)
    
    os.remove(temp_file)

X = np.vstack(X_all)
y = np.hstack(y_all)

# Train Model
model = DecisionTreeClassifier()
model.fit(X, y)

import joblib
joblib.dump(model, 'decision_tree_model.pkl')
print("Training complete. Model saved.")
                    

Support Vector Machine (SVM)

Description: Finds a hyperplane that best separates classes, maximizing the margin between support vectors.
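For a linear SVM, the hyperplane is defined by a weight vector w and a bias b. The following sketch, using hypothetical values for w and b, shows the decision value, the hinge loss that penalizes points inside or on the wrong side of the margin, and the margin width 2 / ||w||. It illustrates the quantities involved and does not train anything.

```python
import math

def decision_value(x, w, b):
    """Signed score; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(x, y, w, b):
    """Hinge loss for one example; y must be +1 or -1."""
    return max(0.0, 1.0 - y * decision_value(x, w, b))

def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [1.0, 0.0], 0.0                   # hypothetical hyperplane x1 = 0
print(hinge_loss([2.0, 0.0], +1, w, b))  # outside the margin, correct side
print(hinge_loss([0.5, 0.0], +1, w, b))  # inside the margin
print(margin_width(w))
```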

Use Cases:

  • Image Recognition: Classifying handwritten digits.
  • Bioinformatics: Protein classification.

Best Library: scikit-learn

Python Script:
import os
import pandas as pd
from sklearn.svm import LinearSVC  # Linear SVM; scales better to large datasets than kernel SVC
from sklearn.preprocessing import StandardScaler
from azure.storage.blob import BlobServiceClient
import numpy as np

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Accumulate Data (LinearSVC has no partial_fit, so all parts are loaded before fitting)
X_all, y_all = [], []
scaler = StandardScaler()
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.drop('target', axis=1).values
    y = df['target'].values
    X_all.append(X)
    y_all.append(y)
    
    os.remove(temp_file)

X = scaler.fit_transform(np.vstack(X_all))
y = np.hstack(y_all)

model = LinearSVC()
model.fit(X, y)

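# Save Model and Scaler (inputs must be scaled the same way at prediction time)
import joblib
joblib.dump(model, 'svm_model.pkl')
joblib.dump(scaler, 'svm_scaler.pkl')
print("Training complete. Model and scaler saved.")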
                    

K-Nearest Neighbors (KNN)

Description: Classifies data points based on the majority vote of the 'k' closest training examples.
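The majority vote can be sketched in a few lines of pure Python. The function name knn_predict and the toy points are illustrative assumptions; this brute-force version computes the distance to every training point, which is the cost that approximate nearest-neighbor libraries try to avoid on large datasets.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training set with two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```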

Use Cases:

  • Recommendation Systems: Suggesting movies based on similar users.
  • Anomaly Detection: Identifying fraudulent transactions.

Best Library: scikit-learn

Python Script:
import os
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from azure.storage.blob import BlobServiceClient
import numpy as np

# Note: KNN stores the full training set, so all parts are accumulated in memory here.
# For very large datasets, consider an approximate nearest-neighbor library such as faiss.

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

X_all, y_all = [], []
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.drop('target', axis=1).values
    y = df['target'].values
    X_all.append(X)
    y_all.append(y)
    
    os.remove(temp_file)

X = np.vstack(X_all)
y = np.hstack(y_all)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

import joblib
joblib.dump(model, 'knn_model.pkl')
print("Training complete. Model saved.")
                    

Unsupervised Learning

Algorithms that find patterns in unlabeled data, such as clustering or dimensionality reduction.

K-Means Clustering

Description: Partitions data into 'k' clusters by minimizing variance within each cluster.
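A single iteration of the standard algorithm assigns each point to its nearest centroid and then moves each centroid to the mean of its assigned points. The sketch below shows this on one-dimensional data; kmeans_step and the sample values are illustrative assumptions, and the template script uses the mini-batch variant instead.

```python
def kmeans_step(points, centroids):
    """One assignment + update step of k-means on 1-D data."""
    clusters = [[] for _ in centroids]
    for p in points:
        # Assign the point to the centroid with the smallest squared distance
        nearest = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
        clusters[nearest].append(p)
    # Move each centroid to its cluster mean; keep it in place if the cluster is empty
    return [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [0.0, 5.0]          # arbitrary starting positions
for _ in range(5):
    centroids = kmeans_step(points, centroids)
print(centroids)
```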

Use Cases:

  • Marketing: Grouping customers by purchasing patterns.
  • Image Compression: Reducing colors in images.

Best Library: scikit-learn

Python Script:
import os
import pandas as pd
from sklearn.cluster import MiniBatchKMeans  # Incremental version for large data
from azure.storage.blob import BlobServiceClient

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Initialize Model (MiniBatchKMeans for incremental)
model = MiniBatchKMeans(n_clusters=3)

# Train Sequentially
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.values  # Assume no target for unsupervised
    
    model.partial_fit(X)
    
    os.remove(temp_file)

import joblib
joblib.dump(model, 'kmeans_model.pkl')
print("Training complete. Model saved.")
                    

Principal Component Analysis (PCA)

Description: Reduces dataset dimensions by projecting onto principal components that capture maximum variance.
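The first principal component is the direction of maximum variance, which is the top eigenvector of the covariance matrix. The sketch below estimates it with power iteration in pure Python. The function names and the sample rows are illustrative assumptions, and this is a teaching aid rather than the method IncrementalPCA uses internally.

```python
import math

def first_principal_component(rows, iterations=100):
    """Estimate the top principal component by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary non-zero start vector; it must not be orthogonal to the answer
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(row, means, component):
    """Coordinate of one row along the given component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))

# Hypothetical data lying on the line y = x, so all variance is along (1, 1)
rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
means, pc = first_principal_component(rows)
print(pc, project(rows[3], means, pc))
```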

Use Cases:

  • Genomics: Reducing gene expression data for analysis.
  • Finance: Simplifying stock market data for portfolio optimization.

Best Library: scikit-learn

Python Script:
import os
import pandas as pd
from sklearn.decomposition import IncrementalPCA
from azure.storage.blob import BlobServiceClient

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Initialize Model (IncrementalPCA for large data)
model = IncrementalPCA(n_components=2)

# Train Sequentially
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.values
    
    model.partial_fit(X)
    
    os.remove(temp_file)

import joblib
joblib.dump(model, 'pca_model.pkl')
print("Training complete. Model saved.")
                    

Ensemble Learning

Combines multiple models for better performance, often using weak learners to build a strong model.

Random Forest

Description: Ensemble of decision trees trained on random data subsets, aggregating predictions for robustness.
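Two ingredients of this approach can be sketched without any library: drawing a bootstrap sample (sampling rows with replacement) for each tree, and aggregating the trees' class predictions by majority vote. The function names below are illustrative assumptions, and the per-tree predictions are made up rather than produced by real trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Sample len(rows) rows with replacement, as done for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Aggregate class predictions from several trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                      # fixed seed for repeatability
rows = list(range(10))                      # stand-in for training rows
sample = bootstrap_sample(rows, rng)
print(sample)

tree_predictions = ["spam", "ham", "spam"]  # hypothetical outputs of three trees
print(majority_vote(tree_predictions))
```

Because each tree sees a different sample, individual errors tend to differ, and the vote smooths them out.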

Use Cases:

  • E-commerce: Recommending products based on user behavior.
  • Medicine: Diagnosing diseases from genomic data.

Best Library: scikit-learn

Python Script:
import os
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from azure.storage.blob import BlobServiceClient
import numpy as np

# Note: Random Forest is not incremental; all parts are accumulated in memory before fitting,
# as in the Decision Tree script.

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

X_all, y_all = [], []
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.drop('target', axis=1).values
    y = df['target'].values
    X_all.append(X)
    y_all.append(y)
    
    os.remove(temp_file)

X = np.vstack(X_all)
y = np.hstack(y_all)

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

import joblib
joblib.dump(model, 'random_forest_model.pkl')
print("Training complete. Model saved.")
                    

XGBoost

Description: Optimized gradient boosting that builds trees sequentially, focusing on errors from previous trees.
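The idea of "focusing on errors from previous trees" can be illustrated with a tiny gradient boosting loop for squared error, where each new one-split tree (a stump) is fit to the current residuals. This is a simplified pure-Python sketch with illustrative names and data; XGBoost itself adds regularization, second-order gradients, and many optimizations that are not shown.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # needs at least two distinct x values
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each trained on the errors of the ensemble so far."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 5.0, 5.0]       # hypothetical step-shaped target
model = boost(xs, ys)
print(model(1.0), model(4.0))
```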

Use Cases:

  • Competitions: Winning Kaggle challenges for tabular data.
  • Finance: Credit risk assessment.

Best Library: xgboost

Python Script:
import os
import pandas as pd
import xgboost as xgb
from azure.storage.blob import BlobServiceClient

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Train incrementally: each part adds 10 new trees on top of the model built so far
# (training continuation via xgb_model; earlier trees are kept, not refit)
params = {'objective': 'binary:logistic'}
model = None
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    dtrain = xgb.DMatrix(df.drop('target', axis=1), label=df['target'])
    
    # xgb_model=None on the first part starts a fresh model
    model = xgb.train(params, dtrain, num_boost_round=10, xgb_model=model)
    
    os.remove(temp_file)

model.save_model('xgboost_model.json')
print("Training complete. Model saved.")
                    

Deep Learning

Uses neural networks with multiple layers to learn complex patterns, ideal for large datasets like images or text.

Multi-Layer Perceptron (MLP)

Description: Feedforward neural network with hidden layers that learns non-linear patterns via backpropagation.
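The forward pass of a one-hidden-layer MLP is just two linear maps with a non-linearity in between. The sketch below spells that out in pure Python with made-up weights; the function names are illustrative assumptions, and backpropagation (the training step) is deliberately left out.

```python
def relu(values):
    """Element-wise ReLU: negative activations become zero."""
    return [max(0.0, v) for v in values]

def linear(inputs, weights, biases):
    """Fully connected layer; weights[i] holds the incoming weights of output unit i."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def mlp_forward(x, w1, b1, w2, b2):
    """Input -> hidden layer with ReLU -> output layer."""
    hidden = relu(linear(x, w1, b1))
    return linear(hidden, w2, b2)

# Hypothetical weights: 2 inputs, 2 hidden units, 1 output
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 3.0]], [0.5]
output = mlp_forward([1.0, -2.0], w1, b1, w2, b2)
print(output)
```

Without the ReLU, the two linear layers would collapse into a single linear map, which is why the non-linearity is what lets the network learn non-linear patterns.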

Use Cases:

  • Fraud Detection: Identifying suspicious transactions.
  • Voice Recognition: Classifying audio commands.

Best Library: PyTorch

Python Script:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from azure.storage.blob import BlobServiceClient
import pandas as pd

# Model Definition
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Initialize Model
input_size = 10  # Example; adjust
hidden_size = 64
num_classes = 2
model = MLP(input_size, hidden_size, num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train Sequentially
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = torch.tensor(df.drop('target', axis=1).values, dtype=torch.float32)
    y = torch.tensor(df['target'].values, dtype=torch.long)
    
    dataset = TensorDataset(X, y)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    model.train()
    for epoch in range(10):
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
    
    os.remove(temp_file)

torch.save(model.state_dict(), 'mlp_model.pth')
print("Training complete. Model saved.")
                    

Convolutional Neural Network (CNN)

Description: Uses convolutional layers to extract spatial features from grid-like data (e.g., images).

Use Cases:

  • Autonomous Vehicles: Detecting objects in road images.
  • Healthcare: Analyzing X-rays for pneumonia.

Best Library: PyTorch

Python Script:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from azure.storage.blob import BlobServiceClient
import pandas as pd
import numpy as np

# Model Definition (assumes each CSV row is a flattened 28x28 single-channel image)
class CNN(nn.Module):
    def __init__(self, num_classes):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(32 * 13 * 13, num_classes)  # Adjust for shape

    def forward(self, x):
        out = self.conv1(x)
        out = self.relu(out)
        out = self.pool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

num_classes = 2
model = CNN(num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train Sequentially (Assume X reshaped to [batch, 1, 28, 28] for example)
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    X = df.drop('target', axis=1).values.reshape(-1, 1, 28, 28)  # Example reshape
    y = df['target'].values
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.long)
    
    dataset = TensorDataset(X, y)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    model.train()
    for epoch in range(10):
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
    
    os.remove(temp_file)

torch.save(model.state_dict(), 'cnn_model.pth')
print("Training complete. Model saved.")
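The value 32 * 13 * 13 passed to nn.Linear in the script depends on the input size and the layer settings, so it has to be recomputed whenever the image shape changes. A small sketch of the standard output-size arithmetic, assuming square inputs, no dilation, and hypothetical helper names, could look like this:

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, kernel, stride=None):
    """Spatial size after max pooling; stride defaults to the kernel size."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1

side = conv_output_size(28, kernel=3)     # 28x28 input, 3x3 convolution
side = pool_output_size(side, kernel=2)   # 2x2 max pooling
flattened = 32 * side * side              # 32 output channels from conv1
print(side, flattened)
```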
                    

Long Short-Term Memory (LSTM)

Description: Recurrent neural network variant that handles long-term dependencies in sequential data.

Use Cases:

  • Natural Language Processing: Machine translation.
  • Time-Series Forecasting: Predicting energy consumption.
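For time-series forecasting, a raw series is usually cut into fixed-length input windows, each paired with the value that follows it. The sketch below shows one simple way to do that; make_windows and the sample readings are illustrative assumptions about how the data might be prepared.

```python
def make_windows(series, seq_len):
    """Turn a 1-D series into (input window, next value) training pairs."""
    inputs, targets = [], []
    for start in range(len(series) - seq_len):
        inputs.append(series[start:start + seq_len])
        targets.append(series[start + seq_len])
    return inputs, targets

readings = [1, 2, 3, 4, 5]        # hypothetical hourly consumption values
inputs, targets = make_windows(readings, seq_len=3)
print(inputs, targets)
```

Each window then becomes one sequence of shape [seq_len, features] for the LSTM, with features equal to 1 for a single series.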

Best Library: PyTorch

Python Script:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from azure.storage.blob import BlobServiceClient
import pandas as pd
import numpy as np

# Model Definition (Assume sequence data; X shaped [batch, seq_len, features])
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(LSTM, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])
        return out

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

input_size = 1
hidden_size = 64
num_classes = 2
model = LSTM(input_size, hidden_size, num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train Sequentially
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    # Assume features are sequences; example reshape
    X = df.drop('target', axis=1).values.reshape(-1, 10, input_size)  # seq_len=10
    y = df['target'].values
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.long)
    
    dataset = TensorDataset(X, y)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    model.train()
    for epoch in range(10):
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
    
    os.remove(temp_file)

torch.save(model.state_dict(), 'lstm_model.pth')
print("Training complete. Model saved.")
                    

Transformer

Description: Attention-based model that processes sequences in parallel, excelling in long-range dependencies.
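The core operation is scaled dot-product attention: each query is compared with every key, the scores are normalized with softmax, and the result is a weighted average of the values. Here is a minimal single-head sketch in pure Python with made-up vectors; real Transformer layers add learned projections, multiple heads, and masking.

```python
import math

def softmax(scores):
    """Convert scores to weights that sum to 1 (shifted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Two identical keys receive equal weight, so the output is the mean of the values
out = attention(queries=[[1.0, 0.0]],
                keys=[[1.0, 0.0], [1.0, 0.0]],
                values=[[0.0, 2.0], [4.0, 6.0]])
print(out)
```

Every query attends to every key independently, which is what allows the whole sequence to be processed in parallel.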

Use Cases:

  • NLP: Generating text with models like GPT.
  • Search Engines: Improving query understanding.

Best Library: transformers (Hugging Face)

Python Script:
import os
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
from datasets import Dataset
from azure.storage.blob import BlobServiceClient
import pandas as pd

# Note: For text data; assume CSV with 'text' and 'target' columns.

# Azure Variables (Modify for actual use)
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'
blob_prefix = 'train_data_part_'
num_parts = 10

# Connect to Azure Blob Storage
connect_str = f"DefaultEndpointsProtocol=https;AccountName={azure_account_name};AccountKey={azure_account_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client(container_name)

# Accumulate Data (Hugging Face Trainer handles batches internally)
df_all = []
for part in range(1, num_parts + 1):
    blob_name = f"{blob_prefix}{part}.csv"
    blob_client = container_client.get_blob_client(blob_name)
    temp_file = f"temp_data_part_{part}.csv"
    
    with open(temp_file, "wb") as f:
        download_stream = blob_client.download_blob()
        f.write(download_stream.readall())
    
    df = pd.read_csv(temp_file)
    df_all.append(df)
    
    os.remove(temp_file)

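df = pd.concat(df_all, ignore_index=True)
dataset = Dataset.from_pandas(df.rename(columns={'target': 'labels'}))

# Tokenize (pad to a fixed length so every example has the same shape when batched;
# max_length=128 is an example value, adjust for your texts)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)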

dataset = dataset.map(tokenize, batched=True)

# Model and Trainer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
training_args = TrainingArguments(output_dir='./results', num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)

trainer.train()
trainer.save_model('transformer_model')
print("Training complete. Model saved.")
                    

Reinforcement Learning

Algorithms that learn by interacting with an environment, optimizing actions based on rewards.

Deep Q-Network (DQN)

Description: Combines Q-learning with deep neural networks to approximate action values.
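The Q-learning part boils down to one update rule: move the current action-value estimate toward the reward plus the discounted best value of the next state. The sketch below shows that rule for a single transition with made-up numbers; the function names are illustrative assumptions. In DQN, a neural network produces the Q-values and the same target is used as the regression label.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Target value: reward plus discounted best next-state value."""
    if done:
        return reward          # no future value after a terminal state
    return reward + gamma * max(next_q_values)

def q_update(q_value, target, learning_rate=0.1):
    """Move the current estimate a fraction of the way toward the target."""
    return q_value + learning_rate * (target - q_value)

target = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9)
new_q = q_update(q_value=0.0, target=target, learning_rate=0.5)
print(target, new_q)
```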

Use Cases:

  • Gaming: Training agents to play Atari games.
  • Robotics: Optimizing robot control policies.

Best Library: stable-baselines3

Python Script:
import os
import gymnasium as gym  # stable-baselines3 2.x expects gymnasium; CartPole is an example env, load custom envs accordingly
from stable_baselines3 import DQN
from stable_baselines3.common.vec_env import DummyVecEnv
from azure.storage.blob import BlobServiceClient

# Note: RL typically uses environments, not static data. Here, assume standard env; for large replay, adjust buffer.

# Azure Variables (Modify for actual use) - Not directly used for data, but placeholder for custom setups
azure_account_name = 'your_storage_account_name'
azure_account_key = 'your_storage_account_key'
container_name = 'your_container_name'

# Connect if needed (e.g., for saving; skipped for simplicity)

# Environment
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])  # Example env

# Model
model = DQN('MlpPolicy', env, verbose=1)

# Train (RL trains via episodes, not splits; for large, increase timesteps)
model.learn(total_timesteps=10000)

model.save('dqn_model')
print("Training complete. Model saved.")