Intermediate
35 min

Natural Language Processing

Teaching machines to understand and generate human language

What is Natural Language Processing?

Natural Language Processing (NLP) is a branch of AI that enables computers to understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding, powering applications from chatbots to translation services.

🗣️ The Challenge:

Human language is ambiguous, context-dependent, and constantly evolving. Teaching machines to understand it requires handling grammar, semantics, pragmatics, and cultural nuances!

Key NLP Tasks

📝 Text Classification

Categorizing text into predefined classes.

Examples: Spam detection, sentiment analysis, topic labeling

🏷️ Named Entity Recognition (NER)

Identifying and classifying entities in text.

Examples: Person names, locations, organizations, dates

🌍 Machine Translation

Automatically translating text between languages.

Examples: Google Translate, DeepL, multilingual chat

❓ Question Answering

Understanding questions and providing answers.

Examples: Chatbots, virtual assistants, search engines

📄 Text Summarization

Generating concise summaries of longer texts.

Examples: News summaries, document abstracts, meeting notes

✍️ Text Generation

Creating human-like text from prompts.

Examples: ChatGPT, content creation, code generation

Text Preprocessing Pipeline

// NLP Text Preprocessing
class TextPreprocessor {
  constructor() {
    this.stopWords = new Set([
      'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at',
      'to', 'for', 'of', 'with', 'by', 'from', 'as', 'is', 'was'
    ]);
  }

  // 1. Tokenization: lowercase, strip punctuation, split on whitespace
  tokenize(text) {
    return text
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, '') // Remove punctuation (keep letters, digits, whitespace)
      .split(/\s+/)
      .filter(word => word.length > 0);
  }

  // 2. Remove stop words
  removeStopWords(tokens) {
    return tokens.filter(token => !this.stopWords.has(token));
  }

  // 3. Stemming: strip one common suffix
  stem(word) {
    // Naive suffix stripping; a real stemmer such as Porter's applies many more rules
    const suffixes = ['ing', 'ed', 'es', 's', 'ly'];
    for (const suffix of suffixes) {
      if (word.endsWith(suffix)) {
        return word.slice(0, -suffix.length);
      }
    }
    return word;
  }

  // 4. Process entire text
  process(text) {
    const tokens = this.tokenize(text);
    const filtered = this.removeStopWords(tokens);
    const stemmed = filtered.map(token => this.stem(token));
    return stemmed;
  }
}

// Example usage
const preprocessor = new TextPreprocessor();
const text = "Natural Language Processing is revolutionizing how machines understand human language!";

console.log("Original text:", text);
const processed = preprocessor.process(text);
console.log("Processed tokens:", processed);
// Output: ['natural', 'language', 'process', 'revolutioniz', 'how', 'machin', 'understand', 'human', 'language']
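
Blind suffix stripping also mangles short words: a stemmer that removes 'ed' or 's' unconditionally turns "red" into "r" and "bus" into "bu". The sketch below shows one possible mitigation, a minimum stem length. The function name stemWithGuard and the threshold of 3 are assumptions made for illustration and are not part of the class above. It is only a partial fix ("this" would still become "thi").

```javascript
// Hypothetical helper: suffix stripping that refuses to shorten a word
// below minStemLength characters.
function stemWithGuard(word, minStemLength = 3) {
  const suffixes = ['ing', 'ed', 'es', 'ly', 's'];
  for (const suffix of suffixes) {
    const stemLength = word.length - suffix.length;
    if (word.endsWith(suffix) && stemLength >= minStemLength) {
      return word.slice(0, -suffix.length);
    }
  }
  return word; // no suffix matched, or the stem would be too short
}

const stemSamples = ['processing', 'machines', 'red', 'is', 'bus'];
const guardedStems = stemSamples.map(w => stemWithGuard(w));
console.log(guardedStems);
// → ['process', 'machin', 'red', 'is', 'bus']
```

Production pipelines usually rely on an established stemmer or a lemmatizer rather than a hand-written suffix list.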

Sentiment Analysis Implementation

Let's build a simple sentiment analyzer:

// Sentiment Analysis Classifier
class SentimentAnalyzer {
  constructor() {
    // Sentiment lexicon (simplified)
    this.positiveWords = new Set([
      'good', 'great', 'excellent', 'amazing', 'wonderful',
      'fantastic', 'love', 'best', 'perfect', 'awesome',
      'happy', 'beautiful', 'brilliant', 'outstanding'
    ]);
    
    this.negativeWords = new Set([
      'bad', 'terrible', 'awful', 'horrible', 'worst',
      'hate', 'poor', 'disappointing', 'useless', 'sad',
      'angry', 'disgusting', 'pathetic', 'boring'
    ]);

    this.intensifiers = new Map([
      ['very', 1.5],
      ['really', 1.5],
      ['extremely', 2.0],
      ['absolutely', 2.0]
    ]);

    this.negations = new Set(['not', 'no', 'never', 'nothing', 'nobody']);
  }

  tokenize(text) {
    return text.toLowerCase()
      .replace(/[^a-z\s]/g, '') // Keep only letters and whitespace
      .split(/\s+/)
      .filter(word => word.length > 0);
  }

  analyzeSentiment(text) {
    const tokens = this.tokenize(text);
    let score = 0;
    let multiplier = 1;
    let negated = false;

    for (let i = 0; i < tokens.length; i++) {
      const token = tokens[i];

      // Check for intensifiers
      if (this.intensifiers.has(token)) {
        multiplier = this.intensifiers.get(token);
        continue;
      }

      // Check for negations
      if (this.negations.has(token)) {
        negated = true;
        continue;
      }

      // Calculate sentiment
      if (this.positiveWords.has(token)) {
        score += negated ? -1 * multiplier : 1 * multiplier;
      } else if (this.negativeWords.has(token)) {
        score += negated ? 1 * multiplier : -1 * multiplier;
      }

      // Reset modifiers
      multiplier = 1;
      negated = false;
    }

    // Normalize by token count and clamp to [-1, 1]; empty input scores 0
    const normalized = tokens.length === 0
      ? 0
      : Math.max(-1, Math.min(1, (score / tokens.length) * 2));

    return {
      score: normalized,
      sentiment: normalized > 0.2 ? 'positive' : 
                 normalized < -0.2 ? 'negative' : 'neutral',
      confidence: Math.abs(normalized)
    };
  }
}

// Example usage
const analyzer = new SentimentAnalyzer();

const reviews = [
  "This product is absolutely amazing! I love it!",
  "Terrible experience. Very disappointed.",
  "It's okay, nothing special.",
  "Not bad, but could be better."
];

console.log("Sentiment Analysis Results:\n");
reviews.forEach((review, i) => {
  const result = analyzer.analyzeSentiment(review);
  console.log("Review " + (i + 1) + ": \"" + review + "\"");
  console.log("Sentiment: " + result.sentiment + " (score: " + result.score.toFixed(2) + ", confidence: " + result.confidence.toFixed(2) + ")");
  console.log('');
});

🎯 Key Techniques:

  • Lexicon-based: Uses predefined word sentiment scores
  • Intensifiers: Words like "very" amplify sentiment
  • Negation handling: "not good" reverses polarity
  • Score normalization: Divides the raw score by the token count, doubles it, and clamps the result to the -1 to +1 range
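
To make the arithmetic behind these techniques concrete, here is a small trace of the same scoring rules applied to the first example review. It is a sketch under stated assumptions: the function traceSentiment and the three tiny lookup tables are illustrative stand-ins, not the analyzer class itself, and they contain only the words needed for the demonstration.

```javascript
// Illustrative lookup tables (deliberately tiny)
const traceLexicon = new Map([['amazing', 1], ['love', 1], ['bad', -1]]);
const traceIntensifiers = new Map([['absolutely', 2.0], ['very', 1.5]]);
const traceNegations = new Set(['not', 'never']);

// Returns each sentiment word's contribution plus the normalized score
function traceSentiment(sentence) {
  const tokens = sentence
    .toLowerCase()
    .replace(/[^a-z\s]/g, '')
    .split(/\s+/)
    .filter(word => word.length > 0);

  let multiplier = 1;
  let negated = false;
  let raw = 0;
  const steps = [];

  for (const token of tokens) {
    if (traceIntensifiers.has(token)) { multiplier = traceIntensifiers.get(token); continue; }
    if (traceNegations.has(token)) { negated = true; continue; }
    if (traceLexicon.has(token)) {
      const contribution = traceLexicon.get(token) * multiplier * (negated ? -1 : 1);
      raw += contribution;
      steps.push(token + ': ' + contribution);
    }
    // Modifiers only apply to the token that directly follows them
    multiplier = 1;
    negated = false;
  }

  const normalized = tokens.length === 0
    ? 0
    : Math.max(-1, Math.min(1, (raw / tokens.length) * 2));
  return { steps, raw, tokenCount: tokens.length, normalized };
}

const traced = traceSentiment("This product is absolutely amazing! I love it!");
console.log(traced.steps);      // → ['amazing: 2', 'love: 1']
console.log(traced.normalized); // → 0.75  (3 / 8 tokens * 2)
```

"absolutely" doubles the weight of "amazing" (+2), "love" adds +1, and the raw total of 3 over 8 tokens, doubled, gives 0.75. For "Not bad", the negation flips "bad" from -1 to +1.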

Word Embeddings: Representing Words as Vectors

Word embeddings capture semantic relationships:

// Toy embedding table (random, untrained vectors; a real Word2Vec model learns them from a corpus)
class WordEmbedding {
  constructor(embeddingDim = 50) {
    this.embeddingDim = embeddingDim;
    this.vocabulary = new Map();
    this.embeddings = new Map();
  }

  // Initialize random embeddings
  initializeEmbeddings(words) {
    words.forEach(word => {
      if (!this.embeddings.has(word)) {
        const embedding = Array(this.embeddingDim)
          .fill(0)
          .map(() => Math.random() * 2 - 1);
        this.embeddings.set(word, embedding);
        this.vocabulary.set(word, this.vocabulary.size);
      }
    });
  }

  // Get word embedding
  getEmbedding(word) {
    return this.embeddings.get(word) || Array(this.embeddingDim).fill(0);
  }

  // Calculate cosine similarity (returns 0 if either vector has zero length)
  cosineSimilarity(vec1, vec2) {
    const dotProduct = vec1.reduce((sum, val, i) => sum + val * vec2[i], 0);
    const mag1 = Math.sqrt(vec1.reduce((sum, val) => sum + val * val, 0));
    const mag2 = Math.sqrt(vec2.reduce((sum, val) => sum + val * val, 0));
    if (mag1 === 0 || mag2 === 0) return 0; // e.g. unknown words map to the zero vector
    return dotProduct / (mag1 * mag2);
  }

  // Find similar words
  findSimilar(word, topK = 5) {
    const wordEmb = this.getEmbedding(word);
    const similarities = [];

    for (const [otherWord, otherEmb] of this.embeddings) {
      if (otherWord !== word) {
        const sim = this.cosineSimilarity(wordEmb, otherEmb);
        similarities.push({ word: otherWord, similarity: sim });
      }
    }

    return similarities
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, topK);
  }
}

// Example
const embedding = new WordEmbedding(100);
const words = ['king', 'queen', 'man', 'woman', 'royal', 'prince', 'princess'];

embedding.initializeEmbeddings(words);
console.log("Word embeddings initialized for:", words);
console.log("\nFinding similar words to 'king':");
const similar = embedding.findSimilar('king', 3);
similar.forEach(({ word, similarity }) => {
  console.log("  " + word + ": " + similarity.toFixed(4));
});

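// Famous word2vec analogy: king - man + woman ≈ queen
// Note: the vectors here are random and untrained, so the similarity printed
// below is arbitrary. The analogy only emerges in embeddings learned from a large corpus.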
const kingEmb = embedding.getEmbedding('king');
const manEmb = embedding.getEmbedding('man');
const womanEmb = embedding.getEmbedding('woman');

const result = kingEmb.map((v, i) => v - manEmb[i] + womanEmb[i]);
const queenEmb = embedding.getEmbedding('queen');
const similarity = embedding.cosineSimilarity(result, queenEmb);

console.log("\nWord analogy test:");
console.log("king - man + woman ≈ queen");
console.log("Similarity to 'queen': " + similarity.toFixed(4));

💡 Word Embeddings Capture:

  • Semantic similarity ("king" close to "queen")
  • Analogical relationships (king - man + woman ≈ queen)
  • Distributional meaning: words used in similar contexts end up with similar vectors (static embeddings such as Word2Vec still assign one vector per word)
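
Because randomly initialized vectors cannot show these properties, the following sketch uses four hand-picked two-dimensional vectors so the analogy arithmetic can be followed by eye. The vectors and the axis meanings ([royalty, gender]) are assumptions chosen for illustration; trained embeddings have many more dimensions and no such clean axes.

```javascript
// Hand-picked toy vectors: [royalty, gender]. Not trained embeddings.
const toyVectors = {
  king:  [1,  1],
  queen: [1, -1],
  man:   [0,  1],
  woman: [0, -1]
};

function toyCosine(a, b) {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const normB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return normA === 0 || normB === 0 ? 0 : dot / (normA * normB);
}

// king - man + woman = [1 - 0 + 0, 1 - 1 + (-1)] = [1, -1]
const analogyVector = toyVectors.king.map(
  (v, i) => v - toyVectors.man[i] + toyVectors.woman[i]
);

// Rank every word by similarity to the analogy vector
const analogyRanking = Object.entries(toyVectors)
  .map(([word, vec]) => ({ word, similarity: toyCosine(analogyVector, vec) }))
  .sort((a, b) => b.similarity - a.similarity);

analogyRanking.forEach(({ word, similarity }) => {
  console.log(word + ": " + similarity.toFixed(4));
});
// queen ranks first (1.0000), followed by woman (0.7071), king (0.0000), man (-0.7071)
```

Subtracting "man" removes the gender component of "king" and adding "woman" replaces it, which lands exactly on the vector chosen for "queen".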

Sequence-to-Sequence Models

Seq2Seq Architecture for Translation:

  • Encoder RNN: Processes the input and creates the context vector
  • Context Vector: Fixed-size representation of the input's meaning
  • Decoder RNN: Generates the output word by word
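
The "fixed-size" property can be seen in a toy recurrent encoder. This is a minimal sketch, assuming the common update rule h_t = tanh(Wx·x_t + Wh·h_{t-1}); the function names and the weight values are arbitrary illustrations, not trained parameters or a full Seq2Seq model.

```javascript
// Multiply a matrix (array of rows) by a vector
function matVec(matrix, vector) {
  return matrix.map(row => row.reduce((sum, w, i) => sum + w * vector[i], 0));
}

// Fold a sequence of input vectors into one hidden state
function encodeSequence(inputs, Wx, Wh) {
  let hidden = Array(Wh.length).fill(0);
  for (const x of inputs) {
    const fromInput = matVec(Wx, x);
    const fromHidden = matVec(Wh, hidden);
    hidden = fromInput.map((v, i) => Math.tanh(v + fromHidden[i]));
  }
  return hidden; // final hidden state = context vector
}

// Arbitrary fixed weights: 2 hidden units, 3 input features
const toyWx = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]];
const toyWh = [[0.1, 0.4], [-0.3, 0.2]];

const shortInput = [[1, 0, 0], [0, 1, 0]];
const longInput = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]];

const shortContext = encodeSequence(shortInput, toyWx, toyWh);
const longContext = encodeSequence(longInput, toyWx, toyWh);

console.log("Context size for 2 tokens:", shortContext.length); // 2
console.log("Context size for 4 tokens:", longContext.length);  // 2
```

Both inputs are compressed into a vector of the same size, so a longer sentence must fit into the same amount of space. A decoder would start from this vector and emit one word per step.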

Modern NLP: Attention Mechanism

// Simplified Attention Mechanism
class Attention {
  constructor() {
    this.weights = null;
  }

  // Calculate attention scores
  calculateScores(query, keys) {
    // Dot product attention
    return keys.map(key => {
      return query.reduce((sum, val, i) => sum + val * key[i], 0);
    });
  }

  // Softmax to get attention weights
  softmax(scores) {
    const maxScore = Math.max(...scores);
    const exps = scores.map(s => Math.exp(s - maxScore));
    const sumExps = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sumExps);
  }

  // Apply attention
  forward(query, keys, values) {
    // Calculate attention scores
    const scores = this.calculateScores(query, keys);
    
    // Convert to weights (probabilities)
    this.weights = this.softmax(scores);
    
    // Weighted sum of values
    const output = Array(values[0].length).fill(0);
    for (let i = 0; i < values.length; i++) {
      for (let j = 0; j < values[i].length; j++) {
        output[j] += this.weights[i] * values[i][j];
      }
    }
    
    return output;
  }
}

// Example: Translating "I love AI"
const attention = new Attention();

// Encoder outputs (hard-coded illustrative vectors, not real model outputs)
const encoderOutputs = [
  [0.5, 0.3, 0.2],  // "I"
  [0.8, 0.6, 0.4],  // "love"
  [0.3, 0.9, 0.7]   // "AI"
];

// Decoder query (what word we're generating)
const decoderQuery = [0.7, 0.5, 0.3];

// Calculate attention
const contextVector = attention.forward(
  decoderQuery,
  encoderOutputs,  // keys
  encoderOutputs   // values
);

console.log("Attention weights:", attention.weights.map(w => w.toFixed(3)));
console.log("Context vector:", contextVector.map(v => v.toFixed(3)));

  
console.log("\nInterpretation: The model is paying most attention to:");
attention.weights.forEach((weight, i) => {
  const words = ['I', 'love', 'AI'];
  console.log("  " + words[i] + ": " + (weight * 100).toFixed(1) + "%");
});
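
Transformer-style attention usually divides each dot product by the square root of the vector dimension before the softmax, which keeps scores from growing with vector size. The sketch below applies that scaling to the same example vectors. The helper name scaledAttentionWeights is an assumption for illustration and is not part of the class above.

```javascript
// Scaled dot-product attention weights for a single query (hypothetical helper)
function scaledAttentionWeights(query, keys) {
  const scale = Math.sqrt(query.length);
  const scores = keys.map(key =>
    key.reduce((sum, k, i) => sum + k * query[i], 0) / scale
  );
  // Numerically stable softmax
  const maxScore = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total);
}

const demoKeys = [
  [0.5, 0.3, 0.2],  // "I"
  [0.8, 0.6, 0.4],  // "love"
  [0.3, 0.9, 0.7]   // "AI"
];
const demoQuery = [0.7, 0.5, 0.3];

const scaledWeights = scaledAttentionWeights(demoQuery, demoKeys);
console.log("Scaled attention weights:", scaledWeights.map(w => w.toFixed(3)));
```

The ranking of the words is unchanged ("love" still receives the largest weight), but the distribution is flatter than with unscaled dot products because the scores are smaller.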

💡 Key Takeaways

  • NLP bridges human language and computers
  • Text preprocessing is crucial (tokenization, stemming, stop words)
  • Word embeddings represent words as dense vectors
  • Sentiment analysis extracts opinions from text
  • Attention mechanisms help models focus on relevant parts
  • Modern NLP uses transformers (covered in next topic)