A Basic Guide to Performing Computer Vision Tasks with ESP32-CAM Module
Introduction
The ESP32-CAM is a powerful yet compact module combining a microcontroller, Wi-Fi, and a camera, making it ideal for IoT and computer vision tasks. In this tutorial, we will stream video from the ESP32-CAM to a host computer and run object detection using a YOLO model.
Learning Objectives
- Configure the ESP32-CAM module
- Stream video over Wi-Fi to a host machine
- Use OpenCV in Python to capture and display frames
- Run object detection with a YOLO model using Roboflow Inference
- Trigger audio alerts with text-to-speech (TTS)
Background Information
Computer vision enables machines to understand images and video. The ESP32-CAM cannot run large models directly, but it can stream footage to a host device, which processes it with libraries such as OpenCV and detection models such as YOLO.
Getting Started
Required Downloads and Installations
Software | Description | Installation |
---|---|---|
Arduino IDE | Upload firmware to ESP32-CAM | Download |
ESP32 Board Support | Adds ESP32 support to Arduino IDE | Guide |
Python 3.x | Required for running detection script | Download |
OpenCV | Image capture & processing | pip install opencv-python |
inference | Roboflow’s inference SDK | pip install inference |
supervision | Drawing boxes and labels on frames | pip install supervision |
pyttsx3 | Offline text-to-speech for audio alerts | pip install pyttsx3 |
Required Components
Component Name | Quantity |
---|---|
ESP32-S3 CAM | 1 |
USB Cable (for flashing) | 1 |
Power Supply | 1 |
Required Tools and Equipment
- Host PC (Linux/Windows/Mac)
- Wi-Fi network
- Breadboard (optional for peripherals)
Part 01: Streaming Video from ESP32-CAM
Objective
Flash ESP32-CAM and stream live video via Wi-Fi.
Instructional Steps
- Open Arduino IDE.
- Install the ESP32 board support and select `XIAO_ESP32S3` as the board.
- Use the provided ESP32-CAM code to flash the board.
- Open the IP address printed in the Serial Monitor in a browser and confirm the video stream.
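The stream can also be checked from the host with a few lines of Python before any OpenCV code is involved. The sketch below is a minimal example, not part of the original project: the function names are hypothetical, and it assumes that `startCameraServer()` comes from the standard CameraWebServer example, which serves the stream as `multipart/x-mixed-replace` at `http://<IP>:81/stream`.

```python
import urllib.request


def is_mjpeg_content_type(content_type: str) -> bool:
    """Return True if a Content-Type header describes a multipart MJPEG stream."""
    return content_type.strip().lower().startswith("multipart/x-mixed-replace")


def check_stream(url: str, timeout: float = 5.0) -> bool:
    """Open the stream URL and report whether it answers like an MJPEG stream.

    Assumption: the firmware serves the stream as multipart/x-mixed-replace,
    as the stock CameraWebServer example does.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_mjpeg_content_type(response.headers.get("Content-Type", ""))
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return False
```

Calling `check_stream("http://<IP>:81/stream")` with the board's address filled in should return `True` once the board is connected. A `False` result usually points to a wrong IP address or to the host being on a different network.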
ESP32-CAM Firmware Code
#include "esp_camera.h"
#include <WiFi.h>
#define CAMERA_MODEL_XIAO_ESP32S3
#include "camera_pins.h"
const char *ssid = "RESNET-GUEST-DEVICE"; // Replace with your Wi-Fi network name
const char *password = "ResnetConnect"; // Replace with your Wi-Fi password
void startCameraServer();
void setupLedFlash(int pin);
void setup() {
Serial.begin(115200);
Serial.setDebugOutput(false); // Disable debug output for better performance
// Optimize CPU frequency
setCpuFrequencyMhz(240); // Max frequency for ESP32S3
Serial.println("Starting optimized camera setup...");
camera_config_t config;
config.ledc_channel = LEDC_CHANNEL_0;
config.ledc_timer = LEDC_TIMER_0;
config.pin_d0 = Y2_GPIO_NUM;
config.pin_d1 = Y3_GPIO_NUM;
config.pin_d2 = Y4_GPIO_NUM;
config.pin_d3 = Y5_GPIO_NUM;
config.pin_d4 = Y6_GPIO_NUM;
config.pin_d5 = Y7_GPIO_NUM;
config.pin_d6 = Y8_GPIO_NUM;
config.pin_d7 = Y9_GPIO_NUM;
config.pin_xclk = XCLK_GPIO_NUM;
config.pin_pclk = PCLK_GPIO_NUM;
config.pin_vsync = VSYNC_GPIO_NUM;
config.pin_href = HREF_GPIO_NUM;
config.pin_sccb_sda = SIOD_GPIO_NUM;
config.pin_sccb_scl = SIOC_GPIO_NUM;
config.pin_pwdn = PWDN_GPIO_NUM;
config.pin_reset = RESET_GPIO_NUM;
// Optimized camera settings for performance
config.xclk_freq_hz = 20000000;
config.pixel_format = PIXFORMAT_JPEG;
config.grab_mode = CAMERA_GRAB_LATEST; // Always get latest frame
config.fb_location = CAMERA_FB_IN_PSRAM;
// Performance optimized settings
if (psramFound()) {
Serial.println("PSRAM found - using optimized settings");
config.frame_size = FRAMESIZE_QQVGA; // 160x120 - smallest frame, fastest streaming
config.jpeg_quality = 12; // 0-63; a lower number means higher quality and larger frames
config.fb_count = 2; // Double buffering
} else {
Serial.println("No PSRAM - using conservative settings");
config.frame_size = FRAMESIZE_QQVGA; // 160x120
config.jpeg_quality = 15;
config.fb_count = 1;
config.fb_location = CAMERA_FB_IN_DRAM;
}
// Initialize camera
esp_err_t err = esp_camera_init(&config);
if (err != ESP_OK) {
Serial.printf("Camera init failed with error 0x%x", err);
return;
}
// Get camera sensor for optimization
sensor_t *s = esp_camera_sensor_get();
// Optimize sensor settings for speed
s->set_framesize(s, FRAMESIZE_QQVGA); // Keep QQVGA (160x120) for speed
s->set_quality(s, 12); // JPEG quality (0-63; lower number = higher quality)
// Image enhancement settings
s->set_brightness(s, 0); // -2 to 2
s->set_contrast(s, 0); // -2 to 2
s->set_saturation(s, 0); // -2 to 2
s->set_special_effect(s, 0); // 0 to 6 (0=No Effect)
s->set_whitebal(s, 1); // 0 = disable , 1 = enable
s->set_awb_gain(s, 1); // 0 = disable , 1 = enable
s->set_wb_mode(s, 0); // 0 to 4 - if awb_gain enabled
s->set_exposure_ctrl(s, 1); // 0 = disable , 1 = enable
s->set_aec2(s, 0); // 0 = disable , 1 = enable
s->set_ae_level(s, 0); // -2 to 2
s->set_aec_value(s, 300); // 0 to 1200
s->set_gain_ctrl(s, 1); // 0 = disable , 1 = enable
s->set_agc_gain(s, 0); // 0 to 30
s->set_gainceiling(s, (gainceiling_t)0); // 0 to 6
s->set_bpc(s, 0); // 0 = disable , 1 = enable
s->set_wpc(s, 1); // 0 = disable , 1 = enable
s->set_raw_gma(s, 1); // 0 = disable , 1 = enable
s->set_lenc(s, 1); // 0 = disable , 1 = enable
s->set_hmirror(s, 0); // 0 = disable , 1 = enable
s->set_vflip(s, 0); // 0 = disable , 1 = enable
s->set_dcw(s, 1); // 0 = disable , 1 = enable
s->set_colorbar(s, 0); // 0 = disable , 1 = enable
// Camera model specific optimizations
#if defined(CAMERA_MODEL_XIAO_ESP32S3)
// No specific flips needed for XIAO ESP32S3
#endif
#if defined(LED_GPIO_NUM)
setupLedFlash(LED_GPIO_NUM);
#endif
// WiFi setup with optimizations
WiFi.mode(WIFI_STA);
WiFi.setSleep(false); // Disable WiFi sleep for consistent performance
WiFi.setTxPower(WIFI_POWER_19_5dBm); // Max WiFi power
Serial.printf("Connecting to %s", ssid);
WiFi.begin(ssid, password);
while (WiFi.status() != WL_CONNECTED) {
delay(500);
Serial.print(".");
}
Serial.println("");
Serial.println("WiFi connected");
startCameraServer();
Serial.print("Camera Ready! Use 'http://");
Serial.print(WiFi.localIP());
Serial.println("' to connect");
// Print optimization info
Serial.println("\nOptimization Settings Applied:");
Serial.printf("CPU Frequency: %d MHz\n", getCpuFrequencyMhz());
Serial.printf("PSRAM Available: %s\n", psramFound() ? "Yes" : "No");
Serial.println("Frame Size: QQVGA (160x120)"); // Same frame size with or without PSRAM
Serial.printf("JPEG Quality: %d\n", psramFound() ? 12 : 15);
Serial.printf("Frame Buffers: %d\n", psramFound() ? 2 : 1);
}
void loop() {
// Keep loop minimal for best performance
delay(1);
}
Part 02: Capturing and Displaying Frames on Host
Objective
Connect to ESP32 stream and show real-time video using OpenCV.
Instructional Steps
- Use the ESP32 IP (e.g., `http://<IP>:81/stream`) in your Python code.
- Use `cv2.VideoCapture()` to connect and read frames.
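The two steps above can be sketched as follows. This is a minimal example rather than the author's exact script: `build_stream_url` and `show_stream` are hypothetical helper names, and the port and path come from the example URL above. OpenCV is imported inside `show_stream` so that the URL helper can be used even where OpenCV is not installed.

```python
def build_stream_url(ip: str, port: int = 81) -> str:
    """Build the MJPEG stream URL from the board's IP address."""
    return f"http://{ip}:{port}/stream"


def show_stream(url: str, window_name: str = "ESP32-CAM") -> None:
    """Read frames from the stream and display them until ESC is pressed."""
    import cv2  # imported here so build_stream_url works without OpenCV

    cap = cv2.VideoCapture(url)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open stream at {url}")
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                print("Failed to read frame.")
                break
            cv2.imshow(window_name, frame)
            # waitKey also lets OpenCV refresh the window; 27 is the ESC key.
            if cv2.waitKey(1) & 0xFF == 27:
                break
    finally:
        # Release the stream and close the window even if an error occurs.
        cap.release()
        cv2.destroyAllWindows()
```

To try it, pass the address printed by the board, for example `show_stream(build_stream_url("<IP>"))`.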
Part 03: Choosing and Training Your Model with Roboflow
Objective
Learn how to choose the best model type and train your own dataset using Roboflow.
Choosing a Model Type
Roboflow supports many model architectures. For use with an ESP32-CAM that streams to a host machine (which does the heavy computation), you’ll want to select a lightweight model optimized for speed and acceptable accuracy. Good options include:
- YOLOv5n – “Nano” version of YOLOv5, very fast but less accurate.
- YOLOv8n – Nano version of YOLOv8, generally a better speed/accuracy trade-off than YOLOv5n.
- YOLOv8s – Slightly larger, better accuracy, still usable on laptops.
- MobileNet-SSD – Great for low-latency mobile applications.
✅ Tip: Start with YOLOv8n and scale up if needed.
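One way to judge whether there is room to scale up is to time how many inferences per second the host manages with the current model. The helper below is a hypothetical sketch: it accepts any callable, so in practice you might pass something like `model.infer` together with a frame captured from the stream.

```python
import time
from typing import Any, Callable


def measure_fps(infer: Callable[[Any], Any], frame: Any, runs: int = 20) -> float:
    """Call infer(frame) `runs` times and return the average calls per second."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    start = time.perf_counter()
    for _ in range(runs):
        infer(frame)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast callables or coarse timers.
    return runs / elapsed if elapsed > 0 else float("inf")
```

If the measured rate is comfortably above the frame rate you need, a larger model such as YOLOv8s may be worth trying. If it is below, stay with a nano model.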
Training a Model in Roboflow
- Go to https://roboflow.com and create a free account.
- Click “Create Project” and set your object detection parameters.
- Upload your images and annotate them using Roboflow’s labeling interface.
- After labeling, click “Generate Dataset” to resize and augment your images.
- Click “Train Model” and choose a suitable model type (e.g. YOLOv8n).
- When training is done, you’ll receive a `model_id` for use with the Inference SDK.
Using Your Trained Model
To use your trained model with the Roboflow Inference SDK:
from inference import get_model
model = get_model(model_id="your_model_id")
If you’re deploying with a `.pt` file locally (instead of the Roboflow-hosted model):
- Download the YOLO weights (`weights.pt`) from Roboflow.
- Use Ultralytics YOLOv8 locally:
pip install ultralytics
Then you can run:
from ultralytics import YOLO
model = YOLO("your_model.pt")
results = model.predict("image.jpg")
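`model.predict()` returns a list with one result per image. The sketch below shows one way to turn a single result into readable (class name, confidence) pairs. `summarize_boxes` is a hypothetical helper; it relies on the `boxes.cls`, `boxes.conf`, and `names` attributes of an Ultralytics result.

```python
from typing import Any, List, Tuple


def summarize_boxes(result: Any) -> List[Tuple[str, float]]:
    """Return (class name, confidence) pairs for one Ultralytics result."""
    pairs = []
    # boxes.cls holds class indices and boxes.conf holds confidence scores;
    # result.names maps each class index to its label.
    for cls_id, conf in zip(result.boxes.cls, result.boxes.conf):
        pairs.append((result.names[int(cls_id)], float(conf)))
    return pairs
```

With the snippet above, `summarize_boxes(results[0])` would list what was detected in `image.jpg`.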
Summary
Scenario | Recommended Model |
---|---|
Streaming from ESP32-CAM to PC | YOLOv8n or YOLOv5n |
Deployment on Jetson Nano or similar | YOLOv8s |
Mobile deployment (e.g., Android app) | MobileNet-SSD |
Need best accuracy | YOLOv8m or YOLOv5m |
💡 Consider uploading diverse real-world images for best results during training.
Part 04: Performing Object Detection
Objective
Use Roboflow’s YOLO model to detect hazards in real-time.
Code Walkthrough
- Model Setup: Using `get_model()` from Roboflow SDK
- Detection: `model.infer(frame)` returns bounding boxes and classes
- Annotation: Use `supervision` library to draw boxes and labels
- TTS: pyttsx3 announces hazards with cooldown logic
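The TTS step with cooldown logic could look something like the sketch below. It is an illustration, not the author's implementation: `AlertCooldown` and `speak` are hypothetical names, and the 5-second default is an arbitrary choice. The cooldown keeps the same hazard from being announced on every frame.

```python
import time
from typing import Callable, Dict


class AlertCooldown:
    """Allow each label to trigger an alert at most once per cooldown window."""

    def __init__(self, cooldown_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown_s = cooldown_s
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_alert: Dict[str, float] = {}

    def should_alert(self, label: str) -> bool:
        """Return True and start a new window if the label is not cooling down."""
        now = self._clock()
        last = self._last_alert.get(label)
        if last is not None and now - last < self.cooldown_s:
            return False
        self._last_alert[label] = now
        return True


def speak(text: str) -> None:
    """Speak text with pyttsx3 (blocks until the phrase finishes)."""
    import pyttsx3  # imported lazily so the cooldown logic has no dependency

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```

Inside the detection loop, a call such as `if cooldown.should_alert(label): speak(f"{label} ahead")` would announce each class at most once per window. Because `runAndWait()` blocks, the video pauses briefly while the phrase is spoken; running speech in a separate thread is one way around that.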
Full Detection Script
import cv2
from inference import get_model
import supervision as sv
# Replace with your ESP32-CAM stream URL
STREAM_URL = "http://<IP>:81/stream" #change this
# Open video stream
cap = cv2.VideoCapture(STREAM_URL)
if not cap.isOpened():
    raise RuntimeError("Could not open ESP32 stream.")
# Load Roboflow model
model = get_model(model_id="YOUR MODEL ID") #change this
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()
class_names = model.class_names
# Main loop
while True:
    ret, frame = cap.read()
    if not ret:
        print("Failed to read frame.")
        break
    # Run inference
    results = model.infer(frame)[0]
    detections = sv.Detections.from_inference(results)
    # Annotate frame
    annotated = box_annotator.annotate(scene=frame, detections=detections)
    annotated = label_annotator.annotate(scene=annotated, detections=detections)
    # Show result
    cv2.imshow("YOLO Detection", annotated)
    # Press ESC to quit
    if cv2.waitKey(1) & 0xFF == 27:
        break
cap.release()
cv2.destroyAllWindows()
Part 05: Putting It All Together
Objective
Create a low-cost, vision-based alert system using ESP32 and YOLO.
Final Integration Steps
- Power up ESP32-CAM
- Run Python detection script on host
- Observe detection overlay + spoken alerts
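To connect the detection overlay to spoken alerts, the script needs to decide which detections count as hazards. The helper below is a hypothetical sketch of that step. It takes plain sequences of class names and confidences; in the detection script these would likely come from `detections.data.get("class_name", [])` and `detections.confidence`, which is an assumption about how `supervision` stores Roboflow results. The hazard classes and the 0.5 threshold are placeholders for your own model.

```python
from typing import Iterable, List, Set


def hazards_in_frame(class_names: Iterable[str],
                     confidences: Iterable[float],
                     hazard_classes: Set[str],
                     min_confidence: float = 0.5) -> List[str]:
    """Return the sorted, de-duplicated hazard labels detected confidently."""
    found = {
        name
        for name, conf in zip(class_names, confidences)
        if name in hazard_classes and conf >= min_confidence
    }
    return sorted(found)
```

Each label returned for a frame can then be passed to the text-to-speech step, so that only confident detections of the classes you care about are announced.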
Possible Use Cases
- Indoor navigation aid for visually impaired
- Smart home obstacle detection
- Security monitoring
Additional Resources
Created by Purab Balani