
⚡ Online Learning with Bandits and Streaming Data

🎯 What is Online Learning?

Online Learning is a machine learning setting where the model learns sequentially, processing data instances one at a time (or in small batches) as they arrive.

  • Unlike batch learning (where the model sees all data at once), online learning adapts in real time.
  • Goal: Make predictions or decisions with partial or no knowledge of future data.
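
Below is a minimal sketch of this sequential loop, assuming a plain online-SGD linear regressor; the class and variable names are invented for illustration and not taken from any library.

```python
# Minimal online learning sketch: a linear model updated one instance at a
# time with stochastic gradient descent on squared loss (illustrative only).
import numpy as np

class OnlineLinearRegressor:
    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features)
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def learn_one(self, x, y):
        # Gradient of 0.5 * (y_hat - y)^2 w.r.t. w is (y_hat - y) * x
        error = self.predict(x) - y
        self.w -= self.lr * error * x

# Simulated stream: y = 2*x0 - x1 + noise, revealed one instance at a time.
rng = np.random.default_rng(0)
model = OnlineLinearRegressor(n_features=2)
for t in range(1000):
    x = rng.normal(size=2)
    y = 2.0 * x[0] - 1.0 * x[1] + rng.normal(scale=0.1)
    _ = model.predict(x)   # predict first ("test-then-train"), then update
    model.learn_one(x, y)
print(model.w)             # should end up close to [2, -1]
```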

🌊 Streaming Data

🔍 Characteristics of Streaming Data:

  • Arrives continuously
  • Often unbounded
  • Must be processed incrementally
  • Memory and computation constraints (can't store everything)

🛠️ Techniques for Learning from Streams:

  • Incremental Learning Algorithms: Update model parameters on the fly (e.g., online SGD)
  • Windowing: Use sliding or fixed-size windows of recent data (see the sketch after this list)
  • Sketching & Sampling: Approximate data distributions for fast stats
  • Concept Drift Detection: Adapt to changing data distributions
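
The sketch below illustrates two of these techniques in plain Python: a fixed-size sliding window plus an incremental running mean, with a deliberately naive drift check. The window size and threshold are arbitrary illustrative values.

```python
# Naive streaming sketch: keep only a sliding window in memory, maintain an
# incremental running mean, and flag drift when the two disagree strongly.
from collections import deque
import random

window = deque(maxlen=200)      # sliding window of recent observations
running_mean, n_seen = 0.0, 0

def process(value, drift_threshold=0.5):
    """Consume one stream item; return True if drift is suspected."""
    global running_mean, n_seen
    window.append(value)
    n_seen += 1
    running_mean += (value - running_mean) / n_seen     # incremental mean
    window_mean = sum(window) / len(window)
    return (len(window) == window.maxlen
            and abs(window_mean - running_mean) > drift_threshold)

# Simulated stream whose mean shifts from 0.0 to 1.0 halfway through.
random.seed(0)
detections = [i for i in range(2000)
              if process(random.gauss(0.0 if i < 1000 else 1.0, 0.3))]
print(detections[0] if detections else "no drift detected")   # expect ~1100
```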

🎰 Bandit Algorithms (Multi-Armed Bandits - MAB)

Bandit problems are the classic formulation of the exploration vs. exploitation trade-off. At each step, a learner must decide between:

  • Exploring: trying new actions to gather more information
  • Exploiting: choosing known high-reward actions

Imagine repeatedly choosing which of several slot machines (each a “one-armed bandit”) to play in order to maximize total reward.

🔢 Formal Setup:

At each time step $t$, the learner:

  • Chooses an action $a_t \in A$
  • Receives a reward $r_t$
  • Learns only the reward of the chosen action (partial feedback)

Goal: Minimize regret:

$$R(T) = \sum_{t=1}^{T} \left( r^*_t - r_{a_t} \right)$$

where $r^*_t$ is the reward of the best possible action at step $t$ and $r_{a_t}$ is the reward of the action actually chosen.
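
As a tiny worked example of this definition (only possible in simulation, where the true expected reward of every arm is known):

```python
# Worked regret example with made-up arm means and a made-up action sequence.
arm_means = {"A": 0.9, "B": 0.6, "C": 0.3}
best = max(arm_means.values())             # r* = 0.9 at every step
chosen = ["B", "A", "C", "A", "A"]         # actions the learner played
regret = sum(best - arm_means[a] for a in chosen)
print(regret)   # 0.3 + 0.0 + 0.6 + 0.0 + 0.0 = 0.9
```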

🧠 Bandit Algorithms

1. Epsilon-Greedy

  • With probability $\epsilon$: explore by picking a random action
  • Otherwise: exploit the action with the highest estimated reward (see the sketch below)
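
A minimal epsilon-greedy simulation on a Bernoulli bandit; the arm probabilities and the value of $\epsilon$ are illustrative.

```python
# Epsilon-greedy sketch: explore with probability epsilon, otherwise play the
# arm with the highest running-mean reward estimate.
import random
random.seed(42)

true_probs = [0.2, 0.5, 0.8]          # unknown to the learner
counts = [0] * len(true_probs)        # pulls per arm
values = [0.0] * len(true_probs)      # running mean reward per arm
epsilon = 0.1

for t in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_probs))                      # explore
    else:
        arm = max(range(len(true_probs)), key=lambda a: values[a])   # exploit
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print(values)   # estimates should approach [0.2, 0.5, 0.8]
print(counts)   # most pulls should go to the best arm
```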

2. Upper Confidence Bound (UCB)

  • Choose actions with best trade-off between reward estimate and uncertainty
  • Optimism in the face of uncertainty

$$a_t = \arg\max_a \left( \hat{r}_a + \sqrt{\frac{2 \ln t}{n_a}} \right)$$

where $\hat{r}_a$ is the empirical mean reward of arm $a$ and $n_a$ is the number of times arm $a$ has been pulled so far.
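
A simulation sketch of this UCB1 rule; the arm probabilities are again illustrative.

```python
# UCB1 sketch: play each arm once, then pick the arm maximizing
# (empirical mean + sqrt(2 * ln t / n_a)).
import math, random
random.seed(0)

true_probs = [0.2, 0.5, 0.8]
counts = [0] * len(true_probs)
values = [0.0] * len(true_probs)

for t in range(1, 5001):
    if t <= len(true_probs):
        arm = t - 1                       # initialization: play each arm once
    else:
        arm = max(range(len(true_probs)),
                  key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(counts)   # the highest-mean arm should receive most of the pulls
```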

3. Thompson Sampling

  • Probabilistic: sample a reward estimate from each arm's posterior and play the arm with the best sample
  • Exploration and exploitation are balanced automatically via posterior uncertainty (see the sketch below)
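
A Beta-Bernoulli Thompson Sampling sketch, assuming binary rewards so that each arm's posterior is a Beta distribution.

```python
# Thompson Sampling sketch: sample one value from each arm's Beta posterior
# and play the arm with the largest sample; then update that arm's posterior.
import random
random.seed(0)

true_probs = [0.2, 0.5, 0.8]
successes = [0] * len(true_probs)
failures = [0] * len(true_probs)

for t in range(5000):
    samples = [random.betavariate(successes[a] + 1, failures[a] + 1)
               for a in range(len(true_probs))]
    arm = max(range(len(true_probs)), key=lambda a: samples[a])
    if random.random() < true_probs[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

print([s / (s + f) if s + f else 0.0 for s, f in zip(successes, failures)])
# posterior means; the best arm should have been played most often
```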

4. Contextual Bandits

  • Takes features (context) into account
  • Chooses the action based on the observed context $x_t$ (see the sketch below)
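
One widely used contextual bandit algorithm is LinUCB; the sketch below implements its disjoint variant under a simulated linear reward model (the weights `true_theta`, the context dimension, and `alpha` are invented for the demo).

```python
# LinUCB (disjoint) sketch: each arm keeps a ridge-regression estimate of
# reward given the context, plus an upper-confidence bonus for uncertainty.
import numpy as np
rng = np.random.default_rng(0)

n_arms, d, alpha = 3, 4, 1.0
A = [np.eye(d) for _ in range(n_arms)]       # per-arm  X^T X + I
b = [np.zeros(d) for _ in range(n_arms)]     # per-arm  X^T y
true_theta = rng.normal(size=(n_arms, d))    # hidden reward weights (demo only)

for t in range(3000):
    x = rng.normal(size=d)                   # context for this round
    scores = []
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]
        bonus = alpha * np.sqrt(x @ A_inv @ x)   # uncertainty along x
        scores.append(x @ theta_hat + bonus)
    arm = int(np.argmax(scores))
    reward = float(true_theta[arm] @ x + rng.normal(scale=0.1))
    A[arm] += np.outer(x, x)                 # update only the chosen arm
    b[arm] += reward * x

print(np.linalg.inv(A[0]) @ b[0])   # estimate for arm 0 ...
print(true_theta[0])                # ... should roughly match its true weights
```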

📦 Use cases:

  • Personalized recommendations
  • News article selection
  • Ad targeting

🔄 Bandits + Streaming Data = 💥

Online learning with bandits shines when:

  • You have limited feedback (e.g., only know if the user clicked, not what they would've clicked)
  • You need to adapt in real time (e.g., recommenders, finance, A/B testing)

🛠️ Libraries & Tools

  • Vowpal Wabbit – blazing-fast, scalable online learning (contextual bandits included)
  • River – modern Python lib for streaming ML
  • scikit-multiflow – good for concept drift & streaming evaluation
  • MABWiser – Python library for bandit algorithms
  • Ray RLlib – supports contextual and reinforcement learning bandits

🚀 Real-World Applications

| Domain | Use Case |
| --- | --- |
| E-commerce | Personalized offers, pricing, A/B testing |
| Social Media | Feed ranking, content curation |
| Healthcare | Adaptive clinical trials |
| Finance | Portfolio selection |
| Online Ads | Real-time bidding, ad selection |

🧠 Key Challenges

  • Handling concept drift in streaming data
  • Balancing regret minimization with fast decision-making
  • Scalability to high-dimensional, high-frequency data
  • Limited feedback (partial observability)

🔮 Advanced Topics

  • Non-stationary Bandits: Adapt to changing reward distributions
  • Bandits with Knapsacks: Add budget/resource constraints
  • Reinforcement Learning (RL): Generalizes bandits to delayed rewards
  • Federated Bandits: Decentralized, privacy-preserving decision-making
