Intermediate
30 min
Full Guide

Computer Vision

Enabling machines to see and interpret visual information from images and videos

What is Computer Vision?

Computer Vision is a field of AI that enables machines to derive meaningful information from digital images, videos, and other visual inputs. It aims to replicate and exceed human vision capabilities, allowing computers to "see" and understand the visual world.

👁️ From Pixels to Understanding:

A computer sees an image as a matrix of numbers (pixels). Computer vision transforms these numbers into high-level understanding: "This is a cat sitting on a couch."
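
For example, a tiny 2x2 RGBA image is nothing more than a flat array of channel values between 0 and 255. This small sketch mirrors the {width, height, data} layout used by the code later in this guide:

// A 2x2 image as the machine sees it: a flat array of RGBA values (0-255)
const image = {
  width: 2,
  height: 2,
  data: [
    255, 0, 0, 255,    // pixel (0,0): red
    0, 255, 0, 255,    // pixel (0,1): green
    0, 0, 255, 255,    // pixel (1,0): blue
    255, 255, 255, 255 // pixel (1,1): white
  ]
};

// Reading the red channel of the pixel at row y, column x
const x = 1, y = 0;
console.log("R:", image.data[(y * image.width + x) * 4]); // 0 (the green pixel)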

Core Computer Vision Tasks

🏷️ Image Classification

Assign a label to an entire image from a set of categories.

Example:

Input: Photo → Output: "Cat", "Dog", or "Bird"

📍 Object Detection

Identify and locate multiple objects with bounding boxes.

Example:

Input: Street scene → Output: 3 cars, 2 people, 1 traffic light (with locations)

🎨 Image Segmentation

Assign a class label to every pixel in the image.

Example:

Input: Photo → Output: Pixel masks for sky, road, buildings, trees

👤 Face Recognition

Identify or verify individuals from facial features.

Example:

Input: Face image → Output: Person ID or verification result
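
These tasks differ mainly in the shape of their output. Here is a rough sketch of what each might return; the field names are illustrative, not any specific library's API:

// Illustrative output shapes for the four core tasks
const classification = { label: "cat", confidence: 0.94 };

const detection = [
  { label: "car",    box: { x: 40,  y: 120, width: 200, height: 90 },  confidence: 0.91 },
  { label: "person", box: { x: 310, y: 80,  width: 60,  height: 170 }, confidence: 0.88 }
];

// Segmentation: one class index per pixel, same resolution as the input
const segmentation = { width: 640, height: 480, mask: new Uint8Array(640 * 480) };

const faceRecognition = { personId: "person_042", verified: true, similarity: 0.97 };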

Image Representation & Processing

// Basic Image Processing Operations
class ImageProcessor {
  // Grayscale conversion
  toGrayscale(imageData) {
    // imageData: {width, height, data: [r,g,b,a, r,g,b,a, ...]}
    const { width, height, data } = imageData;
    const grayscale = new Uint8ClampedArray(width * height);

    for (let i = 0; i < data.length; i += 4) {
      // Luminance formula
      const gray = 0.299 * data[i] +       // R
                   0.587 * data[i + 1] +   // G
                   0.114 * data[i + 2];    // B
      grayscale[i / 4] = gray;
    }

    return { width, height, data: grayscale };
  }

  // Edge detection (Sobel operator)
  detectEdges(grayscaleImg) {
    const { width, height, data } = grayscaleImg;
    const edges = new Uint8ClampedArray(width * height);

    // Sobel kernels
    const sobelX = [-1, 0, 1, -2, 0, 2, -1, 0, 1];
    const sobelY = [-1, -2, -1, 0, 0, 0, 1, 2, 1];

    for (let y = 1; y < height - 1; y++) {
      for (let x = 1; x < width - 1; x++) {
        let gx = 0, gy = 0;

        // Apply 3x3 convolution
        for (let ky = -1; ky <= 1; ky++) {
          for (let kx = -1; kx <= 1; kx++) {
            const idx = (y + ky) * width + (x + kx);
            const kidx = (ky + 1) * 3 + (kx + 1);
            
            gx += data[idx] * sobelX[kidx];
            gy += data[idx] * sobelY[kidx];
          }
        }

        // Gradient magnitude
        const magnitude = Math.sqrt(gx * gx + gy * gy);
        edges[y * width + x] = Math.min(255, magnitude);
      }
    }

    return { width, height, data: edges };
  }

  // Image blurring (box filter); border pixels are left at 0 for simplicity
  blur(grayscaleImg, kernelSize = 5) {
    const { width, height, data } = grayscaleImg;
    const blurred = new Uint8ClampedArray(width * height);
    const half = Math.floor(kernelSize / 2);

    for (let y = half; y < height - half; y++) {
      for (let x = half; x < width - half; x++) {
        let sum = 0;
        let count = 0;

        for (let ky = -half; ky <= half; ky++) {
          for (let kx = -half; kx <= half; kx++) {
            sum += data[(y + ky) * width + (x + kx)];
            count++;
          }
        }

        blurred[y * width + x] = sum / count;
      }
    }

    return { width, height, data: blurred };
  }
}

// Example usage on a tiny 2x2 RGBA image
const processor = new ImageProcessor();
const tinyImage = {
  width: 2,
  height: 2,
  data: [255, 0, 0, 255,  0, 255, 0, 255,  0, 0, 255, 255,  255, 255, 255, 255]
};
const gray = processor.toGrayscale(tinyImage);
console.log("Grayscale values:", Array.from(gray.data)); // [76, 150, 29, 255]

Convolutional Neural Networks for Vision

CNNs are the backbone of modern computer vision:

// Image Classification with CNN
class SimpleCNN {
  constructor() {
    // Typical CNN architecture layers
    this.layers = [
      { type: 'conv', filters: 32, kernelSize: 3, activation: 'relu' },
      { type: 'pool', poolSize: 2, poolType: 'max' },
      { type: 'conv', filters: 64, kernelSize: 3, activation: 'relu' },
      { type: 'pool', poolSize: 2, poolType: 'max' },
      { type: 'conv', filters: 128, kernelSize: 3, activation: 'relu' },
      { type: 'pool', poolSize: 2, poolType: 'max' },
      { type: 'flatten' },
      { type: 'dense', units: 256, activation: 'relu' },
      { type: 'dropout', rate: 0.5 },
      { type: 'dense', units: 10, activation: 'softmax' }
    ];
  }

  // Feature extraction visualization
  describeArchitecture() {
    console.log("CNN Architecture for Image Classification:
");
    console.log("INPUT: 224x224x3 RGB Image");
    console.log("│");
    
    let currentSize = { h: 224, w: 224, c: 3 };
    
    this.layers.forEach((layer, i) => {
      console.log("├─ Layer " + (i + 1) + ": " + layer.type.toUpperCase());
      
      switch(layer.type) {
        case 'conv':
          // Depth becomes the filter count; spatial size kept (assumes 'same' padding)
          currentSize.c = layer.filters;
          console.log("│  └─ Filters: " + layer.filters + ", Size: " + currentSize.h + "x" + currentSize.w + "x" + currentSize.c);
          break;
        case 'pool':
          currentSize.h = Math.floor(currentSize.h / layer.poolSize);
          currentSize.w = Math.floor(currentSize.w / layer.poolSize);
          console.log("│  └─ Pooling: " + layer.poolType + ", Output: " + currentSize.h + "x" + currentSize.w + "x" + currentSize.c);
          break;
        case 'flatten':
          const flatSize = currentSize.h * currentSize.w * currentSize.c;
          console.log("│  └─ Flatten to: " + flatSize + " features");
          currentSize = flatSize;
          break;
        case 'dense':
          console.log("│  └─ Fully connected: " + layer.units + " neurons");
          break;
        case 'dropout':
          console.log("│  └─ Dropout rate: " + layer.rate);
          break;
      }
    });
    
    console.log("│");
    console.log("OUTPUT: 10 class probabilities");
  }

  // Feature visualization concept
  visualizeFeatureMaps(layerNum) {
    console.log("\n=== Feature Maps at Layer " + layerNum + " ===");
    console.log("Early layers detect:");
    console.log("  - Edges (horizontal, vertical, diagonal)");
    console.log("  - Colors and textures");
    console.log("  - Simple patterns");
    console.log("
Middle layers detect:");
    console.log("  - Shapes (circles, rectangles)");
    console.log("  - Object parts (wheels, eyes, corners)");
    console.log("  - Patterns and textures");
    console.log("
Deep layers detect:");
    console.log("  - Complete objects (faces, cars, animals)");
    console.log("  - Complex scenes");
    console.log("  - High-level concepts");
  }
}

const cnn = new SimpleCNN();
cnn.describeArchitecture();
cnn.visualizeFeatureMaps(5);
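
The pool layers in the architecture above are what halve the spatial resolution at each stage. A minimal 2x2 max-pooling sketch, using the same {width, height, data} layout as the ImageProcessor earlier, makes the operation concrete:

// 2x2 max pooling: keep the largest value in each 2x2 block,
// halving width and height (assumes even dimensions for simplicity)
function maxPool2x2(img) {
  const { width, height, data } = img;
  const outW = Math.floor(width / 2);
  const outH = Math.floor(height / 2);
  const out = new Uint8ClampedArray(outW * outH);

  for (let y = 0; y < outH; y++) {
    for (let x = 0; x < outW; x++) {
      const i = (2 * y) * width + (2 * x);
      out[y * outW + x] = Math.max(
        data[i], data[i + 1],
        data[i + width], data[i + width + 1]
      );
    }
  }
  return { width: outW, height: outH, data: out };
}

// 4x4 input -> 2x2 output
const pooled = maxPool2x2({
  width: 4, height: 4,
  data: [1, 3, 2, 1,  4, 2, 0, 5,  7, 1, 3, 2,  0, 6, 1, 8]
});
console.log(Array.from(pooled.data)); // [4, 5, 7, 8]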

Object Detection: YOLO Concept

// Simplified YOLO (You Only Look Once) Concept
class SimpleYOLO {
  constructor(gridSize = 7, numClasses = 20) {
    this.gridSize = gridSize;
    this.numClasses = numClasses;
    this.confidenceThreshold = 0.5;
    this.iouThreshold = 0.5;
  }

  // Divide image into grid
  createGrid(imageWidth, imageHeight) {
    const cellWidth = imageWidth / this.gridSize;
    const cellHeight = imageHeight / this.gridSize;
    
    const grid = [];
    for (let y = 0; y < this.gridSize; y++) {
      for (let x = 0; x < this.gridSize; x++) {
        grid.push({
          x: x * cellWidth,
          y: y * cellHeight,
          width: cellWidth,
          height: cellHeight,
          gridX: x,
          gridY: y
        });
      }
    }
    return grid;
  }

  // Intersection over Union (IoU)
  calculateIoU(box1, box2) {
    const x1 = Math.max(box1.x, box2.x);
    const y1 = Math.max(box1.y, box2.y);
    const x2 = Math.min(box1.x + box1.width, box2.x + box2.width);
    const y2 = Math.min(box1.y + box1.height, box2.y + box2.height);

    const intersection = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
    const area1 = box1.width * box1.height;
    const area2 = box2.width * box2.height;
    const union = area1 + area2 - intersection;

    return union > 0 ? intersection / union : 0; // guard against degenerate boxes
  }

  // Non-Maximum Suppression
  nonMaxSuppression(boxes) {
    // Sort by confidence
    boxes.sort((a, b) => b.confidence - a.confidence);
    
    const selectedBoxes = [];
    
    while (boxes.length > 0) {
      const bestBox = boxes.shift();
      selectedBoxes.push(bestBox);
      
      // Remove overlapping boxes
      boxes = boxes.filter(box => {
        const iou = this.calculateIoU(bestBox, box);
        return iou < this.iouThreshold;
      });
    }
    
    return selectedBoxes;
  }

  // Predict (simplified)
  detect(image, predictions) {
    console.log("YOLO Object Detection Process:
");
    
    // 1. Divide into grid
    console.log("Step 1: Divide " + image.width + "x" + image.height + " image into " + this.gridSize + "x" + this.gridSize + " grid");
    const grid = this.createGrid(image.width, image.height);
    
    // 2. Predictions per cell
    console.log("Step 2: Each cell predicts bounding boxes + class probabilities");
    console.log("  - Each cell: 2 bounding boxes");
    console.log("  - Each box: [x, y, width, height, confidence]");
    console.log("  - Class probabilities: " + this.numClasses + " classes");
    
    // 3. Filter by confidence
    console.log("
Step 3: Filter boxes with confidence > " + this.confidenceThreshold);
    const confidentBoxes = predictions.filter(box => 
      box.confidence > this.confidenceThreshold
    );
    console.log("  Kept " + confidentBoxes.length + " boxes");
    
    // 4. Non-maximum suppression
    console.log("
Step 4: Apply Non-Maximum Suppression (IoU threshold: " + this.iouThreshold + ")");
    const finalBoxes = this.nonMaxSuppression(confidentBoxes);
    console.log("  Final detections: " + finalBoxes.length + " objects");
    
    return finalBoxes;
  }
}

// Example
const yolo = new SimpleYOLO(7, 20);
const image = { width: 640, height: 480 };

// Simulated predictions
const predictions = [
  { x: 100, y: 100, width: 150, height: 200, confidence: 0.95, class: 'person' },
  { x: 105, y: 102, width: 145, height: 198, confidence: 0.87, class: 'person' }, // Duplicate
  { x: 400, y: 200, width: 100, height: 80, confidence: 0.78, class: 'car' },
  { x: 150, y: 350, width: 60, height: 60, confidence: 0.45, class: 'dog' } // Low confidence
];

const detections = yolo.detect(image, predictions);
console.log("
Final Detections:");
detections.forEach((det, i) => {
  console.log("" + i + 1 + ". " + det.class + ": " + (det.confidence * 100).toFixed(1) + "% at [" + det.x + ", " + det.y + "]");
});

🎯 YOLO Key Concepts:

  • Single pass: Detects all objects in one forward pass (very fast)
  • Grid-based: Divides image into SxS grid cells
  • Bounding boxes: Each cell predicts multiple boxes
  • NMS: Removes duplicate detections of same object

Real-World Applications

🚗 Autonomous Vehicles

  • Lane detection and tracking
  • Pedestrian and vehicle detection
  • Traffic sign recognition
  • Obstacle avoidance
  • Path planning

🏥 Medical Imaging

  • Disease detection in X-rays/MRIs
  • Tumor segmentation
  • Cell counting and analysis
  • Retinal disease diagnosis
  • Surgical assistance

🔒 Security & Surveillance

  • Face recognition systems
  • Anomaly detection
  • Crowd counting and analysis
  • Activity recognition
  • Access control

📱 Consumer Applications

  • Photo organization and search
  • AR filters and effects
  • Document scanning (OCR)
  • Product visual search
  • Quality inspection

💡 Key Takeaways

  • Computer vision enables machines to understand visual data
  • CNNs are the standard architecture for vision tasks
  • Key tasks: Classification, detection, segmentation
  • YOLO and similar models enable real-time object detection
  • Applications span healthcare, automotive, security, and more
  • Transfer learning lets you build on pre-trained models (see the sketch below)
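
As a concrete example of that last point, classifying an image with a pre-trained network takes only a few lines in the browser. This is a minimal sketch assuming the TensorFlow.js @tensorflow-models/mobilenet package is installed and the page contains an <img> element with id "photo":

// Minimal sketch: classify an image with a pre-trained MobileNet (TensorFlow.js)
// Assumes @tensorflow/tfjs and @tensorflow-models/mobilenet are installed
import * as mobilenet from '@tensorflow-models/mobilenet';

async function classifyImage(imgElement) {
  const model = await mobilenet.load();                  // downloads pre-trained weights
  const predictions = await model.classify(imgElement);  // [{className, probability}, ...]
  predictions.forEach(p =>
    console.log(p.className + ": " + (p.probability * 100).toFixed(1) + "%")
  );
}

classifyImage(document.getElementById('photo'));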