Computer Vision
Enabling machines to see and interpret visual information from images and videos
What is Computer Vision?
Computer Vision is a field of AI that enables machines to derive meaningful information from digital images, videos, and other visual inputs. It aims to replicate and exceed human vision capabilities, allowing computers to "see" and understand the visual world.
👁️ From Pixels to Understanding:
A computer sees an image as a matrix of numbers (pixels). Computer vision transforms these numbers into high-level understanding: "This is a cat sitting on a couch."
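To make that concrete, here is a tiny, made-up example (using the same {width, height, data} layout as the JavaScript snippets later on this page) of what a 2x2 image actually looks like to a machine: nothing but numbers.

// A made-up 2x2 RGBA image: four numbers (red, green, blue, alpha) per pixel
const tinyImage = {
  width: 2,
  height: 2,
  data: [
    255, 0,   0,   255,  // pixel (0,0): red
    0,   255, 0,   255,  // pixel (1,0): green
    0,   0,   255, 255,  // pixel (0,1): blue
    128, 128, 128, 255   // pixel (1,1): gray
  ]
};
console.log("To the computer this image is just " + tinyImage.data.length + " numbers.");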
Core Computer Vision Tasks
🏷️ Image Classification
Assign a label to an entire image from a set of categories.
Example:
Input: Photo → Output: "Cat", "Dog", or "Bird"
📍 Object Detection
Identify and locate multiple objects with bounding boxes.
Example:
Input: Street scene → Output: 3 cars, 2 people, 1 traffic light (with locations)
🎨 Image Segmentation
Assign a class label to every pixel in the image (pixel-level classification).
Example:
Input: Photo → Output: Pixel masks for sky, road, buildings, trees
👤 Face Recognition
Identify or verify individuals from facial features.
Example:
Input: Face image → Output: Person ID or verification result
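The four tasks differ mainly in the shape of their output. The sketch below contrasts hypothetical results for the same photo; every value (labels, boxes, scores) is invented purely for illustration.

// Hypothetical outputs for the same input photo, one per task
const classification = { label: "cat", confidence: 0.94 };              // one label per image
const detection = [                                                     // one box per object
  { label: "cat",   box: { x: 40, y: 60,  width: 200, height: 180 }, confidence: 0.91 },
  { label: "couch", box: { x: 0,  y: 120, width: 640, height: 240 }, confidence: 0.88 }
];
const segmentation = {                                                  // one class per pixel
  maskShape: [480, 640],
  classes: ["cat", "couch", "background"]
};
const faceRecognition = { identity: "person_042", verified: true, similarity: 0.97 }; // ID or match
console.log("Same pixels, four different kinds of answer.");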
Image Representation & Processing
// Basic Image Processing Operations
class ImageProcessor {
  // Grayscale conversion
  toGrayscale(imageData) {
    // imageData: {width, height, data: [r,g,b,a, r,g,b,a, ...]}
    const { width, height, data } = imageData;
    const grayscale = new Uint8ClampedArray(width * height);

    for (let i = 0; i < data.length; i += 4) {
      // Luminance formula
      const gray = 0.299 * data[i] +     // R
                   0.587 * data[i + 1] + // G
                   0.114 * data[i + 2];  // B
      grayscale[i / 4] = gray;
    }

    return { width, height, data: grayscale };
  }

  // Edge detection (Sobel operator)
  detectEdges(grayscaleImg) {
    const { width, height, data } = grayscaleImg;
    const edges = new Uint8ClampedArray(width * height);

    // Sobel kernels
    const sobelX = [-1, 0, 1, -2, 0, 2, -1, 0, 1];
    const sobelY = [-1, -2, -1, 0, 0, 0, 1, 2, 1];

    for (let y = 1; y < height - 1; y++) {
      for (let x = 1; x < width - 1; x++) {
        let gx = 0, gy = 0;

        // Apply 3x3 convolution
        for (let ky = -1; ky <= 1; ky++) {
          for (let kx = -1; kx <= 1; kx++) {
            const idx = (y + ky) * width + (x + kx);
            const kidx = (ky + 1) * 3 + (kx + 1);
            gx += data[idx] * sobelX[kidx];
            gy += data[idx] * sobelY[kidx];
          }
        }

        // Gradient magnitude
        const magnitude = Math.sqrt(gx * gx + gy * gy);
        edges[y * width + x] = Math.min(255, magnitude);
      }
    }

    return { width, height, data: edges };
  }

  // Image blurring (box filter)
  blur(grayscaleImg, kernelSize = 5) {
    const { width, height, data } = grayscaleImg;
    const blurred = new Uint8ClampedArray(width * height);
    const half = Math.floor(kernelSize / 2);

    for (let y = half; y < height - half; y++) {
      for (let x = half; x < width - half; x++) {
        let sum = 0;
        let count = 0;

        for (let ky = -half; ky <= half; ky++) {
          for (let kx = -half; kx <= half; kx++) {
            sum += data[(y + ky) * width + (x + kx)];
            count++;
          }
        }

        blurred[y * width + x] = sum / count;
      }
    }

    return { width, height, data: blurred };
  }
}

// Example usage
const processor = new ImageProcessor();
console.log("Image processing operations available:");
console.log("- Grayscale conversion");
console.log("- Edge detection (Sobel)");
console.log("- Blur (averaging filter)");
Convolutional Neural Networks for Vision
CNNs are the backbone of modern computer vision:
// Image Classification with CNN
class SimpleCNN {
  constructor() {
    // Typical CNN architecture layers
    this.layers = [
      { type: 'conv', filters: 32, kernelSize: 3, activation: 'relu' },
      { type: 'pool', poolSize: 2, poolType: 'max' },
      { type: 'conv', filters: 64, kernelSize: 3, activation: 'relu' },
      { type: 'pool', poolSize: 2, poolType: 'max' },
      { type: 'conv', filters: 128, kernelSize: 3, activation: 'relu' },
      { type: 'pool', poolSize: 2, poolType: 'max' },
      { type: 'flatten' },
      { type: 'dense', units: 256, activation: 'relu' },
      { type: 'dropout', rate: 0.5 },
      { type: 'dense', units: 10, activation: 'softmax' }
    ];
  }

  // Feature extraction visualization
  describeArchitecture() {
    console.log("CNN Architecture for Image Classification:\n");
    console.log("INPUT: 224x224x3 RGB Image");
    console.log("│");

    let currentSize = { h: 224, w: 224, c: 3 };

    this.layers.forEach((layer, i) => {
      console.log("├─ Layer " + (i + 1) + ": " + layer.type.toUpperCase());

      switch (layer.type) {
        case 'conv':
          // Assumes 'same' padding, so the spatial size is unchanged
          currentSize.c = layer.filters;
          console.log("│   └─ Filters: " + layer.filters + ", Size: " + currentSize.h + "x" + currentSize.w + "x" + currentSize.c);
          break;
        case 'pool':
          currentSize.h = Math.floor(currentSize.h / layer.poolSize);
          currentSize.w = Math.floor(currentSize.w / layer.poolSize);
          console.log("│   └─ Pooling: " + layer.poolType + ", Output: " + currentSize.h + "x" + currentSize.w + "x" + currentSize.c);
          break;
        case 'flatten': {
          const flatSize = currentSize.h * currentSize.w * currentSize.c;
          console.log("│   └─ Flatten to: " + flatSize + " features");
          break;
        }
        case 'dense':
          console.log("│   └─ Fully connected: " + layer.units + " neurons");
          break;
        case 'dropout':
          console.log("│   └─ Dropout rate: " + layer.rate);
          break;
      }
    });

    console.log("│");
    console.log("OUTPUT: 10 class probabilities");
  }

  // Feature visualization concept
  visualizeFeatureMaps(layerNum) {
    console.log("\n=== Feature Maps at Layer " + layerNum + " ===");
    console.log("Early layers detect:");
    console.log("  - Edges (horizontal, vertical, diagonal)");
    console.log("  - Colors and textures");
    console.log("  - Simple patterns");
    console.log("\nMiddle layers detect:");
    console.log("  - Shapes (circles, rectangles)");
    console.log("  - Object parts (wheels, eyes, corners)");
    console.log("  - Patterns and textures");
    console.log("\nDeep layers detect:");
    console.log("  - Complete objects (faces, cars, animals)");
    console.log("  - Complex scenes");
    console.log("  - High-level concepts");
  }
}

const cnn = new SimpleCNN();
cnn.describeArchitecture();
cnn.visualizeFeatureMaps(5);
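The spatial sizes printed above follow the standard convolution output formula, assuming 'same' padding for the 3x3 convolutions and a stride equal to the pool size for pooling (the same simplification SimpleCNN makes). A quick check:

// Output size of a convolution/pooling layer along one spatial dimension:
// out = floor((in - kernel + 2 * padding) / stride) + 1
function outputSize(inSize, kernel, stride, padding) {
  return Math.floor((inSize - kernel + 2 * padding) / stride) + 1;
}

console.log(outputSize(224, 3, 1, 1)); // 3x3 conv, 'same' padding: 224 -> 224
console.log(outputSize(224, 2, 2, 0)); // 2x2 max pool, stride 2:   224 -> 112
// Three conv+pool stages: 224 -> 112 -> 56 -> 28, so flatten yields 28 * 28 * 128 = 100352 features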
Object Detection: YOLO Concept
// Simplified YOLO (You Only Look Once) Concept
class SimpleYOLO {
  constructor(gridSize = 7, numClasses = 20) {
    this.gridSize = gridSize;
    this.numClasses = numClasses;
    this.confidenceThreshold = 0.5;
    this.iouThreshold = 0.5;
  }

  // Divide image into grid
  createGrid(imageWidth, imageHeight) {
    const cellWidth = imageWidth / this.gridSize;
    const cellHeight = imageHeight / this.gridSize;
    const grid = [];

    for (let y = 0; y < this.gridSize; y++) {
      for (let x = 0; x < this.gridSize; x++) {
        grid.push({
          x: x * cellWidth,
          y: y * cellHeight,
          width: cellWidth,
          height: cellHeight,
          gridX: x,
          gridY: y
        });
      }
    }

    return grid;
  }

  // Intersection over Union (IoU)
  calculateIoU(box1, box2) {
    const x1 = Math.max(box1.x, box2.x);
    const y1 = Math.max(box1.y, box2.y);
    const x2 = Math.min(box1.x + box1.width, box2.x + box2.width);
    const y2 = Math.min(box1.y + box1.height, box2.y + box2.height);

    const intersection = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
    const area1 = box1.width * box1.height;
    const area2 = box2.width * box2.height;
    const union = area1 + area2 - intersection;

    return intersection / union;
  }

  // Non-Maximum Suppression
  nonMaxSuppression(boxes) {
    // Sort by confidence (highest first)
    boxes.sort((a, b) => b.confidence - a.confidence);
    const selectedBoxes = [];

    while (boxes.length > 0) {
      const bestBox = boxes.shift();
      selectedBoxes.push(bestBox);

      // Remove boxes that overlap too much with the selected box
      boxes = boxes.filter(box => {
        const iou = this.calculateIoU(bestBox, box);
        return iou < this.iouThreshold;
      });
    }

    return selectedBoxes;
  }

  // Predict (simplified)
  detect(image, predictions) {
    console.log("YOLO Object Detection Process:\n");

    // 1. Divide into grid
    console.log("Step 1: Divide " + image.width + "x" + image.height + " image into " + this.gridSize + "x" + this.gridSize + " grid");
    const grid = this.createGrid(image.width, image.height);

    // 2. Predictions per cell
    console.log("Step 2: Each cell predicts bounding boxes + class probabilities");
    console.log("  - Each cell: 2 bounding boxes");
    console.log("  - Each box: [x, y, width, height, confidence]");
    console.log("  - Class probabilities: " + this.numClasses + " classes");

    // 3. Filter by confidence
    console.log("\nStep 3: Filter boxes with confidence > " + this.confidenceThreshold);
    const confidentBoxes = predictions.filter(box =>
      box.confidence > this.confidenceThreshold
    );
    console.log("  Kept " + confidentBoxes.length + " boxes");

    // 4. Non-maximum suppression
    console.log("\nStep 4: Apply Non-Maximum Suppression (IoU threshold: " + this.iouThreshold + ")");
    const finalBoxes = this.nonMaxSuppression(confidentBoxes);
    console.log("  Final detections: " + finalBoxes.length + " objects");

    return finalBoxes;
  }
}

// Example
const yolo = new SimpleYOLO(7, 20);
const image = { width: 640, height: 480 };

// Simulated predictions
const predictions = [
  { x: 100, y: 100, width: 150, height: 200, confidence: 0.95, class: 'person' },
  { x: 105, y: 102, width: 145, height: 198, confidence: 0.87, class: 'person' }, // Duplicate
  { x: 400, y: 200, width: 100, height: 80, confidence: 0.78, class: 'car' },
  { x: 150, y: 350, width: 60, height: 60, confidence: 0.45, class: 'dog' } // Low confidence
];

const detections = yolo.detect(image, predictions);

console.log("\nFinal Detections:");
detections.forEach((det, i) => {
  console.log((i + 1) + ". " + det.class + ": " + (det.confidence * 100).toFixed(1) + "% at [" + det.x + ", " + det.y + "]");
});
🎯 YOLO Key Concepts:
- Single pass: Detects all objects in one forward pass (very fast)
- Grid-based: Divides the image into an S×S grid of cells
- Bounding boxes: Each cell predicts multiple candidate boxes
- NMS: Removes duplicate detections of the same object (worked IoU example below)
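To see why NMS keeps only one of the two overlapping "person" boxes in the example above, here is the IoU computation for that pair, reusing the calculateIoU method and the yolo instance from the SimpleYOLO code:

// IoU of the two near-duplicate 'person' boxes from the simulated predictions
const personA = { x: 100, y: 100, width: 150, height: 200 };
const personB = { x: 105, y: 102, width: 145, height: 198 };

// Intersection: 145 x 198 = 28,710; union: 30,000 + 28,710 - 28,710 = 30,000
const overlap = yolo.calculateIoU(personA, personB);
console.log("IoU = " + overlap.toFixed(3)); // ≈ 0.957, well above the 0.5 threshold, so the weaker box is suppressed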
Real-World Applications
🚗 Autonomous Vehicles
- Lane detection and tracking
- Pedestrian and vehicle detection
- Traffic sign recognition
- Obstacle avoidance
- Path planning
🏥 Medical Imaging
- Disease detection in X-rays/MRIs
- Tumor segmentation
- Cell counting and analysis
- Retinal disease diagnosis
- Surgical assistance
🔒 Security & Surveillance
- Face recognition systems
- Anomaly detection
- Crowd counting and analysis
- Activity recognition
- Access control
📱 Consumer Applications
- Photo organization and search
- AR filters and effects
- Document scanning (OCR)
- Product visual search
- Quality inspection
💡 Key Takeaways
- ✓ Computer vision enables machines to understand visual data
- ✓ CNNs are the standard architecture for vision tasks
- ✓ Key tasks: Classification, detection, segmentation
- ✓ YOLO and similar models enable real-time object detection
- ✓ Applications span healthcare, automotive, security, and more
- ✓ Transfer learning allows using pre-trained models (see the sketch below)
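On that last point, here is a conceptual sketch of transfer learning. It is framework-agnostic and reuses the layer-description style from SimpleCNN above; the backbone name, layer list, and class count are placeholders, not a specific library API.

// Transfer learning, conceptually: freeze a pretrained feature extractor,
// then train only a small new classification head on your own data.
const pretrainedBase = {
  name: 'generic-pretrained-cnn',   // placeholder for any pretrained backbone
  layers: ['conv', 'pool', 'conv', 'pool', 'conv', 'pool'],
  trainable: false                  // frozen: these weights are not updated
};
const newHead = [
  { type: 'flatten' },
  { type: 'dense', units: 128, activation: 'relu', trainable: true },
  { type: 'dense', units: 5, activation: 'softmax', trainable: true }  // 5 classes in the new task
];
const transferModel = { base: pretrainedBase, head: newHead };
console.log("Only the " + transferModel.head.length + "-layer head is trained; the base is reused as-is.");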