A Basic Guide to Performing Computer Vision Tasks with ESP32-CAM Module

Introduction

The ESP32-CAM is a powerful yet compact module combining a microcontroller, Wi-Fi, and a camera, making it ideal for IoT and computer vision tasks. In this tutorial, we will stream video from the ESP32-CAM to a host computer and run object detection using a YOLO model.

Learning Objectives

Configure the ESP32-CAM module
Stream video over Wi-Fi to a host machine
Use OpenCV in Python to capture and display frames
Run object detection with a YOLO model using Roboflow Inference
Trigger audio alerts with text-to-speech (TTS)

Background Information

Computer vision enables machines to understand images and video. The ESP32-CAM cannot run large models directly, but it can stream footage to a host device, which processes the frames with OpenCV and a YOLO object detection model.

Getting Started

Required Downloads and Installations

Software	Description	Installation
Arduino IDE	Upload firmware to ESP32-CAM	Download
ESP32 Board Support	Adds ESP32 support to Arduino IDE	Guide
Python 3.x	Required for running detection script	Download
OpenCV	Image capture & processing	`pip install opencv-python`
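inference	Roboflow’s inference SDK	`pip install inference`
supervision	Drawing boxes and labels on frames	`pip install supervision`
pyttsx3	Offline text-to-speech for audio alerts	`pip install pyttsx3`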

Required Components

Component Name	Quantity
ESP32-S3 CAM	1
USB Cable (for flashing)	1
Power Supply	1

Required Tools and Equipment

Host PC (Linux/Windows/Mac)
Wi-Fi network
Breadboard (optional for peripherals)

Part 01: Streaming Video from ESP32-CAM

Objective

Flash ESP32-CAM and stream live video via Wi-Fi.

Instructional Steps

Open Arduino IDE.
Install the ESP32 board support and select XIAO_ESP32S3.
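Flash the board with the firmware below. The sketch depends on camera_pins.h and on the startCameraServer() implementation from the Arduino ESP32 CameraWebServer example, so keep those files in the same sketch folder.
Open the Serial Monitor at 115200 baud, then open the printed IP address in a browser and confirm the video stream.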

ESP32-CAM Firmware Code

#include "esp_camera.h"
#include <WiFi.h>

#define CAMERA_MODEL_XIAO_ESP32S3
#include "camera_pins.h"

const char *ssid = "RESNET-GUEST-DEVICE";   // Replace with your Wi-Fi network name
const char *password = "ResnetConnect";     // Replace with your Wi-Fi password

void startCameraServer();
void setupLedFlash(int pin);

void setup() {
  Serial.begin(115200);
  Serial.setDebugOutput(false); // Disable debug output for better performance
  
  // Optimize CPU frequency
  setCpuFrequencyMhz(240); // Max frequency for ESP32S3
  
  Serial.println("Starting optimized camera setup...");

  camera_config_t config;
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sccb_sda = SIOD_GPIO_NUM;
  config.pin_sccb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  
  // Optimized camera settings for performance
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;
  config.grab_mode = CAMERA_GRAB_LATEST; // Always get latest frame
  config.fb_location = CAMERA_FB_IN_PSRAM;
  
  // Performance optimized settings
  if (psramFound()) {
    Serial.println("PSRAM found - using optimized settings");
    config.frame_size = FRAMESIZE_QQVGA;    // 160x120 - smallest frames, fastest streaming
    config.jpeg_quality = 12;              // 0-63; lower number = higher quality, larger frames
    config.fb_count = 2;                   // Double buffering
  } else {
    Serial.println("No PSRAM - using conservative settings");
    config.frame_size = FRAMESIZE_QQVGA;     // 160x120
    config.jpeg_quality = 15;
    config.fb_count = 1;
    config.fb_location = CAMERA_FB_IN_DRAM;
  }

  // Initialize camera
  esp_err_t err = esp_camera_init(&config);
  if (err != ESP_OK) {
    Serial.printf("Camera init failed with error 0x%x\n", err);
    return;
  }

  // Get camera sensor for optimization
  sensor_t *s = esp_camera_sensor_get();
  
  // Optimize sensor settings for speed
  s->set_framesize(s, FRAMESIZE_QQVGA);     // Keep QQVGA (160x120) for speed
  s->set_quality(s, 12);                  // JPEG quality, 0-63 (lower number = higher quality)
  
  // Image enhancement settings
  s->set_brightness(s, 0);     // -2 to 2
  s->set_contrast(s, 0);       // -2 to 2
  s->set_saturation(s, 0);     // -2 to 2
  s->set_special_effect(s, 0); // 0 to 6 (0=No Effect)
  s->set_whitebal(s, 1);       // 0 = disable , 1 = enable
  s->set_awb_gain(s, 1);       // 0 = disable , 1 = enable
  s->set_wb_mode(s, 0);        // 0 to 4 - if awb_gain enabled
  s->set_exposure_ctrl(s, 1);  // 0 = disable , 1 = enable
  s->set_aec2(s, 0);           // 0 = disable , 1 = enable
  s->set_ae_level(s, 0);       // -2 to 2
  s->set_aec_value(s, 300);    // 0 to 1200
  s->set_gain_ctrl(s, 1);      // 0 = disable , 1 = enable
  s->set_agc_gain(s, 0);       // 0 to 30
  s->set_gainceiling(s, (gainceiling_t)0); // 0 to 6
  s->set_bpc(s, 0);            // 0 = disable , 1 = enable
  s->set_wpc(s, 1);            // 0 = disable , 1 = enable
  s->set_raw_gma(s, 1);        // 0 = disable , 1 = enable
  s->set_lenc(s, 1);           // 0 = disable , 1 = enable
  s->set_hmirror(s, 0);        // 0 = disable , 1 = enable
  s->set_vflip(s, 0);          // 0 = disable , 1 = enable
  s->set_dcw(s, 1);            // 0 = disable , 1 = enable
  s->set_colorbar(s, 0);       // 0 = disable , 1 = enable

  // Camera model specific optimizations
#if defined(CAMERA_MODEL_XIAO_ESP32S3)
  // No specific flips needed for XIAO ESP32S3
#endif

#if defined(LED_GPIO_NUM)
  setupLedFlash(LED_GPIO_NUM);
#endif

  // WiFi setup with optimizations
  WiFi.mode(WIFI_STA);
  WiFi.setSleep(false); // Disable WiFi sleep for consistent performance
  WiFi.setTxPower(WIFI_POWER_19_5dBm); // Max WiFi power
  
  Serial.printf("Connecting to %s", ssid);
  WiFi.begin(ssid, password);
  
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("");
  Serial.println("WiFi connected");

  startCameraServer();

  Serial.print("Camera Ready! Use 'http://");
  Serial.print(WiFi.localIP());
  Serial.println("' to connect");
  
  // Print optimization info
  Serial.println("\nOptimization Settings Applied:");
  Serial.printf("CPU Frequency: %d MHz\n", getCpuFrequencyMhz());
  Serial.printf("PSRAM Available: %s\n", psramFound() ? "Yes" : "No");
  Serial.println("Frame Size: QQVGA (160x120)"); // Same size with or without PSRAM
  Serial.printf("JPEG Quality: %d\n", psramFound() ? 12 : 15);
  Serial.printf("Frame Buffers: %d\n", psramFound() ? 2 : 1);
}

void loop() {
  // Keep loop minimal for best performance
  delay(1);
}

Part 02: Capturing and Displaying Frames on Host

Objective

Connect to ESP32 stream and show real-time video using OpenCV.

Instructional Steps

Use the ESP32 IP (e.g., http://<IP>:81/stream) in your Python code.
Use cv2.VideoCapture() to connect and read frames.
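
The two steps above can be sketched as follows. This is a minimal example rather than the tutorial's official script: the helper names build_stream_url and view_stream are hypothetical, and port 81 with the /stream path is taken from the example URL above. OpenCV is imported inside view_stream so that the URL helper can be used on its own.

```python
def build_stream_url(ip, port=81, path="/stream"):
    """Return the stream URL for the given ESP32-CAM IP address."""
    return f"http://{ip}:{port}{path}"


def view_stream(url, window_name="ESP32-CAM"):
    """Read frames from the stream and display them until ESC is pressed."""
    import cv2  # imported here so build_stream_url works without OpenCV

    cap = cv2.VideoCapture(url)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open stream at {url}")
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                print("Failed to read frame; stopping.")
                break
            cv2.imshow(window_name, frame)
            if cv2.waitKey(1) & 0xFF == 27:  # ESC key
                break
    finally:
        # Always release the stream and close the window, even on errors
        cap.release()
        cv2.destroyAllWindows()
```

To try it, call view_stream(build_stream_url("192.168.1.50")), replacing the placeholder address with the IP printed by the board on the Serial Monitor.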

Part 03: Choosing and Training Your Model with Roboflow

Objective

Learn how to choose the best model type and train your own dataset using Roboflow.

Choosing a Model Type

Roboflow supports many model architectures. For use with an ESP32-CAM that streams to a host machine (which does the heavy computation), you’ll want to select a lightweight model optimized for speed and acceptable accuracy. Good options include:

YOLOv5n – “Nano” version of YOLOv5, very fast but less accurate.
YOLOv8n – Nano version of YOLOv8, generally a better speed/accuracy trade-off than YOLOv5n.
YOLOv8s – Slightly larger, better accuracy, still usable on laptops.
MobileNet-SSD – Great for low-latency mobile applications.

✅ Tip: Start with YOLOv8n and scale up if needed.

Training a Model in Roboflow

Go to https://roboflow.com and create a free account.
Click “Create Project” and set your object detection parameters.
Upload your images and annotate them using Roboflow’s labeling interface.
After labeling, click “Generate Dataset” to resize and augment your images.
Click “Train Model” and choose a suitable model type (e.g. YOLOv8n).
When training is done, you’ll receive a model_id for use with the Inference SDK.

Using Your Trained Model

To use your trained model with the Roboflow Inference SDK:

from inference import get_model

model = get_model(model_id="your_model_id")

If you’re deploying with a .pt file locally (instead of the Roboflow-hosted model):

Download the YOLO weights (weights.pt) from Roboflow.
Use Ultralytics YOLOv8 locally:

pip install ultralytics

Then you can run:

from ultralytics import YOLO

model = YOLO("your_model.pt")
results = model.predict("image.jpg")
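
To make the prediction output easier to work with, the boxes can be converted into plain Python dictionaries. The sketch below is one possible way to do this; summarize_boxes is a hypothetical helper, and the 0.5 confidence threshold is an arbitrary assumption. It takes plain lists so it can be tried without any model loaded.

```python
def summarize_boxes(xyxy, confidences, class_ids, names, min_conf=0.5):
    """Turn raw box data into a list of readable detection dicts.

    names maps a class id to a label (a dict or a list both work).
    Detections below min_conf are dropped.
    """
    detections = []
    for box, conf, cls_id in zip(xyxy, confidences, class_ids):
        if conf < min_conf:
            continue
        detections.append({
            "label": names[int(cls_id)],
            "confidence": round(float(conf), 2),
            "box": [int(v) for v in box],  # x1, y1, x2, y2 in pixels
        })
    return detections


# With an Ultralytics result, the inputs can be obtained like this:
#   result = model.predict("image.jpg")[0]
#   summarize_boxes(result.boxes.xyxy.tolist(),
#                   result.boxes.conf.tolist(),
#                   result.boxes.cls.tolist(),
#                   result.names)
```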

Summary

Scenario	Recommended Model
Streaming from ESP32-CAM to PC	YOLOv8n or YOLOv5n
Deployment on Jetson Nano or similar	YOLOv8s
Mobile deployment (e.g., Android app)	MobileNet-SSD
Need best accuracy	YOLOv8m or YOLOv5m

💡 Consider uploading diverse real-world images for best results during training.

Part 04: Performing Object Detection

Objective

Use Roboflow’s YOLO model to detect hazards in real-time.

Code Walkthrough

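Model Setup: get_model() from the Roboflow Inference SDK loads the model by its model_id
Detection: model.infer(frame) returns a list of results; the first entry holds the bounding boxes, classes, and confidences
Annotation: The supervision library draws boxes and labels on the frame
TTS: pyttsx3 can announce hazards, with cooldown logic to avoid repeating the same alert (not part of the script below)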
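
As a small illustration of the annotation step, labels can show the class name together with its confidence. The helper below is a hypothetical sketch, not part of the supervision library. It assumes that sv.Detections.from_inference() stores class names under detections.data["class_name"], which holds in recent supervision releases but is worth checking against your installed version.

```python
def build_labels(class_names, confidences):
    """Build one 'name confidence' label string per detection."""
    return [
        f"{name} {conf:.2f}"
        for name, conf in zip(class_names, confidences)
    ]


# Possible use inside the detection loop (assumption, see note above):
#   labels = build_labels(detections.data["class_name"], detections.confidence)
#   annotated = label_annotator.annotate(
#       scene=annotated, detections=detections, labels=labels)
```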

Full Detection Script

import cv2
from inference import get_model
import supervision as sv

# Replace with your ESP32-CAM stream URL
STREAM_URL = "http://<IP>:81/stream" #change this

# Open video stream
cap = cv2.VideoCapture(STREAM_URL)
if not cap.isOpened():
    raise RuntimeError("Could not open ESP32 stream.")

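# Load Roboflow model (IDs have the form "project-name/version").
# A Roboflow API key is required, e.g. via the ROBOFLOW_API_KEY environment variable.
model = get_model(model_id="YOUR MODEL ID")  # change this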
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()
class_names = model.class_names

# Main loop
while True:
    ret, frame = cap.read()
    if not ret:
        print("Failed to read frame.")
        break

    # Run inference
    results = model.infer(frame)[0]
    detections = sv.Detections.from_inference(results)

    # Annotate a copy so the original frame stays unmodified
    annotated = box_annotator.annotate(scene=frame.copy(), detections=detections)
    annotated = label_annotator.annotate(scene=annotated, detections=detections)

    # Show result
    cv2.imshow("YOLO Detection", annotated)

    # Press ESC to quit
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()
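
The script above draws detections but does not speak them. The following is a minimal sketch of how the spoken alerts with cooldown logic could be added. The AlertCooldown class, the speak function, and the 5-second default are illustrative assumptions, not the author's implementation. pyttsx3 is imported inside speak so the cooldown logic can be tested without it installed.

```python
import time


class AlertCooldown:
    """Allow an alert for a given label at most once per cooldown window."""

    def __init__(self, cooldown_seconds=5.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._last_spoken = {}  # label -> time of the last alert

    def should_alert(self, label):
        """Return True and record the time if the label may be announced now."""
        now = self.clock()
        last = self._last_spoken.get(label)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_spoken[label] = now
        return True


def speak(text):
    """Speak text aloud with pyttsx3; blocks until speech has finished."""
    import pyttsx3  # imported here so AlertCooldown works without pyttsx3

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


# Possible use inside the detection loop (class_name key is an assumption):
#   cooldown = AlertCooldown(cooldown_seconds=5.0)   # create once, before the loop
#   for label in set(detections.data["class_name"]):
#       if cooldown.should_alert(label):
#           speak(f"{label} detected")
```

Because runAndWait() blocks, the video window pauses while the alert is spoken. If that is a problem, one option is to call speak from a background thread.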

Part 05: Putting It All Together

Objective

Create a low-cost, vision-based alert system using ESP32 and YOLO.

Final Integration Steps

Power up ESP32-CAM
Run Python detection script on host
Observe detection overlay + spoken alerts

Possible Use Cases

Indoor navigation aid for visually impaired
Smart home obstacle detection
Security monitoring

Additional Resources

Created by Purab Balani
