A Basic Guide to Performing Computer Vision Tasks with ESP32-CAM Module

Introduction

The ESP32-CAM is a powerful yet compact module combining a microcontroller, Wi-Fi, and a camera, making it ideal for IoT and computer vision tasks. In this tutorial, we will stream video from the ESP32-CAM to a host computer and run object detection using a YOLO model.

Learning Objectives

  • Configure the ESP32-CAM module
  • Stream video over Wi-Fi to a host machine
  • Use OpenCV in Python to capture and display frames
  • Run object detection with a YOLO model using Roboflow Inference
  • Trigger audio alerts with text-to-speech (TTS)

Background Information

Computer vision enables machines to understand images and video. The ESP32-CAM cannot run large models directly, but it can stream footage to a host device, which processes it with libraries such as OpenCV and detection models such as YOLO.

Getting Started

Required Downloads and Installations

Software | Description | Installation
Arduino IDE | Upload firmware to ESP32-CAM | Download
ESP32 Board Support | Adds ESP32 support to Arduino IDE | Guide
Python 3.x | Required for running detection script | Download
OpenCV | Image capture & processing | pip install opencv-python
inference | Roboflow’s inference SDK | pip install inference

Required Components

Component Name | Quantity
ESP32-S3 CAM | 1
USB Cable (for flashing) | 1
Power Supply | 1

Required Tools and Equipment

  • Host PC (Linux/Windows/Mac)
  • Wi-Fi network
  • Breadboard (optional for peripherals)

Part 01: Streaming Video from ESP32-CAM

Objective

Flash the ESP32-CAM and stream live video over Wi-Fi.

Instructional Steps

  1. Open Arduino IDE.
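  2. Install the ESP32 board support and select XIAO_ESP32S3 as the board.
  3. Set ssid and password in the ESP32-CAM code below to match your Wi-Fi network, then flash the board. The sketch includes camera_pins.h and calls startCameraServer(), which it declares but does not define; both come from the CameraWebServer example in the ESP32 board package, so keep those example files in the sketch folder.
  4. Open the Serial Monitor at 115200 baud, note the printed IP address, and open it in a browser to confirm the video stream.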
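
Step 4 can also be checked from the host with a short script. The sketch below is a minimal example, not part of the original firmware or scripts: it assumes the stream is served at port 81 under /stream (the URL used later in this guide) and that the server answers with a multipart/x-mixed-replace content type, which is the usual way MJPEG streams are delivered. The function names looks_like_mjpeg and check_stream are hypothetical.

```python
from urllib.request import urlopen


def looks_like_mjpeg(content_type):
    """Return True if a Content-Type header value describes an MJPEG stream."""
    if not content_type:
        return False
    # Drop parameters such as ";boundary=..." and compare the media type only.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "multipart/x-mixed-replace"


def check_stream(url, timeout=5.0):
    """Open the URL, look only at the response headers, and report the result."""
    with urlopen(url, timeout=timeout) as response:
        return looks_like_mjpeg(response.headers.get("Content-Type"))


# Example (replace <IP> with the address printed on the Serial Monitor):
# print(check_stream("http://<IP>:81/stream"))
```

If check_stream raises a timeout or connection error, the board is most likely not on the same network as the host or has not finished connecting to Wi-Fi.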

ESP32-CAM Firmware Code

#include "esp_camera.h"
#include <WiFi.h>

#define CAMERA_MODEL_XIAO_ESP32S3
#include "camera_pins.h"

const char *ssid = "YOUR_WIFI_SSID";         // change this
const char *password = "YOUR_WIFI_PASSWORD"; // change this

void startCameraServer();
void setupLedFlash(int pin);

void setup() {
  Serial.begin(115200);
  Serial.setDebugOutput(false); // Disable debug output for better performance
  
  // Optimize CPU frequency
  setCpuFrequencyMhz(240); // Max frequency for ESP32S3
  
  Serial.println("Starting optimized camera setup...");

  camera_config_t config;
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sccb_sda = SIOD_GPIO_NUM;
  config.pin_sccb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  
  // Optimized camera settings for performance
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;
  config.grab_mode = CAMERA_GRAB_LATEST; // Always get latest frame
  config.fb_location = CAMERA_FB_IN_PSRAM;
  
  // Performance optimized settings
  if (psramFound()) {
    Serial.println("PSRAM found - using optimized settings");
    config.frame_size = FRAMESIZE_QQVGA;    // 160x120 - smallest frames, fastest streaming
    config.jpeg_quality = 12;              // 0-63; lower number = higher quality, larger frames
    config.fb_count = 2;                   // Double buffering
  } else {
    Serial.println("No PSRAM - using conservative settings");
    config.frame_size = FRAMESIZE_QQVGA;     // 160x120
    config.jpeg_quality = 15;
    config.fb_count = 1;
    config.fb_location = CAMERA_FB_IN_DRAM;
  }

  // Initialize camera
  esp_err_t err = esp_camera_init(&config);
  if (err != ESP_OK) {
    Serial.printf("Camera init failed with error 0x%x\n", err);
    return;
  }

  // Get camera sensor for optimization
  sensor_t *s = esp_camera_sensor_get();
  
  // Optimize sensor settings for speed
  s->set_framesize(s, FRAMESIZE_QQVGA);     // QQVGA (160x120) for speed
  s->set_quality(s, 12);                  // JPEG quality (0-63; lower number = higher quality)
  
  // Image enhancement settings
  s->set_brightness(s, 0);     // -2 to 2
  s->set_contrast(s, 0);       // -2 to 2
  s->set_saturation(s, 0);     // -2 to 2
  s->set_special_effect(s, 0); // 0 to 6 (0=No Effect)
  s->set_whitebal(s, 1);       // 0 = disable , 1 = enable
  s->set_awb_gain(s, 1);       // 0 = disable , 1 = enable
  s->set_wb_mode(s, 0);        // 0 to 4 - if awb_gain enabled
  s->set_exposure_ctrl(s, 1);  // 0 = disable , 1 = enable
  s->set_aec2(s, 0);           // 0 = disable , 1 = enable
  s->set_ae_level(s, 0);       // -2 to 2
  s->set_aec_value(s, 300);    // 0 to 1200
  s->set_gain_ctrl(s, 1);      // 0 = disable , 1 = enable
  s->set_agc_gain(s, 0);       // 0 to 30
  s->set_gainceiling(s, (gainceiling_t)0); // 0 to 6
  s->set_bpc(s, 0);            // 0 = disable , 1 = enable
  s->set_wpc(s, 1);            // 0 = disable , 1 = enable
  s->set_raw_gma(s, 1);        // 0 = disable , 1 = enable
  s->set_lenc(s, 1);           // 0 = disable , 1 = enable
  s->set_hmirror(s, 0);        // 0 = disable , 1 = enable
  s->set_vflip(s, 0);          // 0 = disable , 1 = enable
  s->set_dcw(s, 1);            // 0 = disable , 1 = enable
  s->set_colorbar(s, 0);       // 0 = disable , 1 = enable

  // Camera model specific optimizations
#if defined(CAMERA_MODEL_XIAO_ESP32S3)
  // No specific flips needed for XIAO ESP32S3
#endif

#if defined(LED_GPIO_NUM)
  setupLedFlash(LED_GPIO_NUM);
#endif

  // WiFi setup with optimizations
  WiFi.mode(WIFI_STA);
  WiFi.setSleep(false); // Disable WiFi sleep for consistent performance
  WiFi.setTxPower(WIFI_POWER_19_5dBm); // Max WiFi power
  
  Serial.printf("Connecting to %s", ssid);
  WiFi.begin(ssid, password);
  
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("");
  Serial.println("WiFi connected");

  startCameraServer();

  Serial.print("Camera Ready! Use 'http://");
  Serial.print(WiFi.localIP());
  Serial.println("' to connect");
  
  // Print optimization info
  Serial.println("\nOptimization Settings Applied:");
  Serial.printf("CPU Frequency: %d MHz\n", getCpuFrequencyMhz());
  Serial.printf("PSRAM Available: %s\n", psramFound() ? "Yes" : "No");
  Serial.printf("Frame Size: %s\n", "QQVGA"); // same size with or without PSRAM
  Serial.printf("JPEG Quality: %d\n", psramFound() ? 12 : 15);
  Serial.printf("Frame Buffers: %d\n", psramFound() ? 2 : 1);
}

void loop() {
  // Keep loop minimal for best performance
  delay(1);
}

Part 02: Capturing and Displaying Frames on Host

Objective

Connect to the ESP32 stream and show real-time video using OpenCV.

Instructional Steps

  1. Use the ESP32 IP (e.g., http://<IP>:81/stream) in your Python code.
  2. Use cv2.VideoCapture() to connect and read frames.
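
The two steps above can be sketched as follows. This is a minimal example that assumes the firmware from Part 01 is running and serving the stream on port 81 under /stream; the helper names stream_url and show_stream and the window title are hypothetical, and 192.168.1.50 is only a placeholder address.

```python
def stream_url(host, port=81, path="/stream"):
    """Build the stream URL from the address printed on the Serial Monitor."""
    return f"http://{host}:{port}{path}"


def show_stream(url, window_name="ESP32-CAM"):
    """Read frames from the stream and display them until ESC is pressed."""
    import cv2  # imported here so the URL helper works without OpenCV installed

    cap = cv2.VideoCapture(url)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open stream at {url}")
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                print("Failed to read frame.")
                break
            cv2.imshow(window_name, frame)
            # waitKey also gives the window time to redraw; 27 is the ESC key.
            if cv2.waitKey(1) & 0xFF == 27:
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()


# Example (placeholder address):
# show_stream(stream_url("192.168.1.50"))
```

Releasing the capture in a finally block keeps the connection to the board from lingering if the loop exits with an error.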

Part 03: Choosing and Training Your Model with Roboflow

Objective

Learn how to choose the best model type and train your own dataset using Roboflow.

Choosing a Model Type

Roboflow supports many model architectures. For use with an ESP32-CAM that streams to a host machine (which does the heavy computation), you’ll want to select a lightweight model optimized for speed and acceptable accuracy. Good options include:

  • YOLOv5n – “Nano” version of YOLOv5, very fast but less accurate.
  • YOLOv8n – Nano version of YOLOv8; generally a better speed/accuracy trade-off than YOLOv5n.
  • YOLOv8s – Slightly larger, better accuracy, still usable on laptops.
  • MobileNet-SSD – Great for low-latency mobile applications.

Tip: Start with YOLOv8n and scale up if needed.


Training a Model in Roboflow

  1. Go to https://roboflow.com and create a free account.
  2. Click “Create Project” and set your object detection parameters.
  3. Upload your images and annotate them using Roboflow’s labeling interface.
  4. After labeling, click “Generate Dataset” to resize and augment your images.
  5. Click “Train Model” and choose a suitable model type (e.g. YOLOv8n).
  6. When training is done, you’ll receive a model_id for use with the Inference SDK.

Using Your Trained Model

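To use your trained model with the Roboflow Inference SDK, pass its model_id to get_model(). Models in a private workspace also need your Roboflow API key, which the SDK can read from the ROBOFLOW_API_KEY environment variable: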

from inference import get_model

model = get_model(model_id="your_model_id")

If you’re deploying with a .pt file locally (instead of the Roboflow-hosted model):

pip install ultralytics

Then you can run:

from ultralytics import YOLO

model = YOLO("your_model.pt")
results = model.predict("image.jpg")

Summary

Scenario | Recommended Model
Streaming from ESP32-CAM to PC | YOLOv8n or YOLOv5n
Deployment on Jetson Nano or similar | YOLOv8s
Mobile deployment (e.g., Android app) | MobileNet-SSD
Need best accuracy | YOLOv8m or YOLOv5m

💡 Consider uploading diverse real-world images for best results during training.

Part 04: Performing Object Detection

Objective

Use your Roboflow-trained YOLO model to detect hazards in real time.

Code Walkthrough

  • Model Setup: Using get_model() from Roboflow SDK
  • Detection: model.infer(frame) returns bounding boxes and classes
  • Annotation: Use supervision library to draw boxes and labels
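  • TTS: pyttsx3 can announce hazards, with cooldown logic so the same alert is not repeated on every frame (this step is not included in the script below)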
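
The TTS item in the walkthrough can be sketched as follows. This is one possible shape for the cooldown logic under stated assumptions, not the author's implementation: AlertCooldown, announce, make_pyttsx3_speaker, the 5-second window, and the spoken phrase are all hypothetical. The pyttsx3 calls used (init, say, runAndWait) are part of that library; it is a third-party package installed with pip install pyttsx3.

```python
import time


class AlertCooldown:
    """Allow each label to be announced at most once per cooldown window."""

    def __init__(self, cooldown_s=5.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._last_spoken = {}  # label -> time of the last announcement

    def ready(self, label):
        """Return True (and start a new window) if the label may be spoken now."""
        now = self.clock()
        last = self._last_spoken.get(label)
        if last is not None and now - last < self.cooldown_s:
            return False
        self._last_spoken[label] = now
        return True


def announce(labels, cooldown, speak):
    """Speak each distinct label that is outside its cooldown window."""
    spoken = []
    for label in sorted(set(labels)):
        if cooldown.ready(label):
            speak(f"Warning: {label} ahead")
            spoken.append(label)
    return spoken


def make_pyttsx3_speaker():
    """Return a blocking speak(text) function backed by pyttsx3."""
    import pyttsx3  # third-party: pip install pyttsx3

    engine = pyttsx3.init()

    def speak(text):
        engine.say(text)
        engine.runAndWait()  # blocks until the phrase has been spoken

    return speak
```

Inside a detection loop, the labels would be the class names detected in the current frame; with supervision these are assumed to be available as detections.data["class_name"] after sv.Detections.from_inference(). Because runAndWait() blocks, the video pauses while a phrase is spoken, so a longer cooldown or a background thread may be worth considering.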

Full Detection Script

import cv2
from inference import get_model
import supervision as sv

# Replace with your ESP32-CAM stream URL
STREAM_URL = "http://<IP>:81/stream" #change this

# Open video stream
cap = cv2.VideoCapture(STREAM_URL)
if not cap.isOpened():
    raise RuntimeError("Could not open ESP32 stream.")

# Load Roboflow model
model = get_model(model_id="YOUR MODEL ID") #change this
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()
class_names = model.class_names

# Main loop
while True:
    ret, frame = cap.read()
    if not ret:
        print("Failed to read frame.")
        break

    # Run inference
    results = model.infer(frame)[0]
    detections = sv.Detections.from_inference(results)

    # Annotate frame
    annotated = box_annotator.annotate(scene=frame, detections=detections)
    annotated = label_annotator.annotate(scene=annotated, detections=detections)

    # Show result
    cv2.imshow("YOLO Detection", annotated)

    # Press ESC to quit
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()
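
The script above stops at the first failed read, which can happen when the Wi-Fi link drops for a moment. A minimal sketch of a retry wrapper is shown below, assuming that reopening the stream is enough to recover. The name read_with_reconnect and its parameters are hypothetical; open_capture is any zero-argument callable that returns an object with read() and release(), such as lambda: cv2.VideoCapture(STREAM_URL).

```python
import time


def read_with_reconnect(open_capture, max_retries=3, retry_delay_s=1.0,
                        sleep=time.sleep):
    """Yield frames from a capture, reopening the stream when a read fails.

    Stops once more than max_retries reads in a row have failed.
    """
    cap = open_capture()
    failures = 0
    try:
        while True:
            ok, frame = cap.read()
            if ok:
                failures = 0  # a good frame resets the failure count
                yield frame
                continue
            failures += 1
            if failures > max_retries:
                break
            # Drop the stale connection, wait, and open a fresh one.
            cap.release()
            sleep(retry_delay_s)
            cap = open_capture()
    finally:
        cap.release()
```

With this helper, the main loop could become for frame in read_with_reconnect(lambda: cv2.VideoCapture(STREAM_URL)): followed by the same inference, annotation, and display steps.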

Part 05: Putting It All Together

Objective

Create a low-cost, vision-based alert system using ESP32 and YOLO.

Final Integration Steps

  • Power up ESP32-CAM
  • Run Python detection script on host
  • Observe the detection overlay and, once the pyttsx3 TTS step is added to the script, spoken alerts

Possible Use Cases

  • Indoor navigation aid for visually impaired
  • Smart home obstacle detection
  • Security monitoring


Created by Purab Balani