Natural Language Processing
Teaching machines to understand and generate human language
What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of AI that enables computers to understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding, powering applications from chatbots to translation services.
🗣️ The Challenge:
Human language is ambiguous, context-dependent, and constantly evolving. Teaching machines to understand it requires handling grammar, semantics, pragmatics, and cultural nuances!
Key NLP Tasks
📝 Text Classification
Categorizing text into predefined classes.
Examples: Spam detection, sentiment analysis, topic labeling (see the sketch after this task list)
🏷️ Named Entity Recognition (NER)
Identifying and classifying entities in text.
Examples: Person names, locations, organizations, dates
🌍 Machine Translation
Automatically translating text between languages.
Examples: Google Translate, DeepL, multilingual chat
❓ Question Answering
Understanding questions and providing answers.
Examples: Chatbots, virtual assistants, search engines
📄 Text Summarization
Generating concise summaries of longer texts.
Examples: News summaries, document abstracts, meeting notes
✍️ Text Generation
Creating human-like text from prompts.
Examples: ChatGPT, content creation, code generation
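To make the first of these tasks concrete, here is a minimal keyword-based text classifier. This is a toy sketch, not a production approach: the topics and keyword lists below are invented for illustration, and real classifiers learn these associations from labeled training data.
// Toy keyword-based text classifier (illustration only).
// The topics and keyword lists are made up for this sketch.
const topicKeywords = {
  sports: ['game', 'team', 'score', 'player', 'match'],
  technology: ['software', 'computer', 'ai', 'data', 'app']
};

function classifyText(text) {
  const tokens = text.toLowerCase().split(/\W+/);
  let best = { topic: 'unknown', hits: 0 };
  for (const [topic, keywords] of Object.entries(topicKeywords)) {
    const hits = tokens.filter(t => keywords.includes(t)).length;
    if (hits > best.hits) best = { topic, hits };
  }
  return best.topic;
}

console.log(classifyText("The team scored late in the match")); // sports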
Text Preprocessing Pipeline
// NLP Text Preprocessing
class TextPreprocessor {
  constructor() {
    this.stopWords = new Set([
      'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at',
      'to', 'for', 'of', 'with', 'by', 'from', 'as', 'is', 'was'
    ]);
  }

  // 1. Tokenization: split text into words
  tokenize(text) {
    return text
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, '') // Remove punctuation
      .split(/\s+/)
      .filter(word => word.length > 0);
  }

  // 2. Remove stop words
  removeStopWords(tokens) {
    return tokens.filter(token => !this.stopWords.has(token));
  }

  // 3. Stemming: reduce words to a root form
  stem(word) {
    // Simple suffix removal (a crude simplification of the Porter stemmer)
    const suffixes = ['ing', 'ed', 'es', 's', 'ly'];
    for (const suffix of suffixes) {
      if (word.endsWith(suffix)) {
        return word.slice(0, -suffix.length);
      }
    }
    return word;
  }

  // 4. Process entire text
  process(text) {
    const tokens = this.tokenize(text);
    const filtered = this.removeStopWords(tokens);
    return filtered.map(token => this.stem(token));
  }
}
// Example usage
const preprocessor = new TextPreprocessor();
const text = "Natural Language Processing is revolutionizing how machines understand human language!";
console.log("Original text:", text);
const processed = preprocessor.process(text);
console.log("Processed tokens:", processed);
// Output: ['natural', 'language', 'process', 'revolutioniz', 'how', 'machin', 'understand', 'human', 'language']
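Once preprocessed, tokens are usually converted into numbers before any model sees them. As a minimal sketch of the classic bag-of-words representation (the helper buildBagOfWords below is our own invention), each document becomes a vector of token counts over a shared vocabulary:
// Bag-of-words: represent each document as a vector of token counts
// over a shared vocabulary. A hypothetical helper for illustration;
// it reuses the TextPreprocessor defined above.
function buildBagOfWords(documents, preprocessor) {
  const docTokens = documents.map(doc => preprocessor.process(doc));
  const vocabulary = [...new Set(docTokens.flat())];
  const vectors = docTokens.map(tokens =>
    vocabulary.map(word => tokens.filter(t => t === word).length)
  );
  return { vocabulary, vectors };
}

const bow = buildBagOfWords(
  ["Machines understand language", "Humans love language"],
  preprocessor
);
console.log(bow.vocabulary); // shared vocabulary across both documents
console.log(bow.vectors);    // one count vector per document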
Sentiment Analysis Implementation
Let's build a simple sentiment analyzer:
// Sentiment Analysis Classifier
class SentimentAnalyzer {
  constructor() {
    // Sentiment lexicon (simplified)
    this.positiveWords = new Set([
      'good', 'great', 'excellent', 'amazing', 'wonderful',
      'fantastic', 'love', 'best', 'perfect', 'awesome',
      'happy', 'beautiful', 'brilliant', 'outstanding'
    ]);
    this.negativeWords = new Set([
      'bad', 'terrible', 'awful', 'horrible', 'worst',
      'hate', 'poor', 'disappointing', 'useless', 'sad',
      'angry', 'disgusting', 'pathetic', 'boring'
    ]);
    this.intensifiers = new Map([
      ['very', 1.5],
      ['really', 1.5],
      ['extremely', 2.0],
      ['absolutely', 2.0]
    ]);
    this.negations = new Set(['not', 'no', 'never', 'nothing', 'nobody']);
  }

  tokenize(text) {
    return text.toLowerCase()
      .replace(/[^a-z\s]/g, '')
      .split(/\s+/)
      .filter(word => word.length > 0);
  }

  analyzeSentiment(text) {
    const tokens = this.tokenize(text);
    let score = 0;
    let multiplier = 1;
    let negated = false;
    for (const token of tokens) {
      // Intensifiers and negations modify the next sentiment word
      if (this.intensifiers.has(token)) {
        multiplier = this.intensifiers.get(token);
        continue;
      }
      if (this.negations.has(token)) {
        negated = true;
        continue;
      }
      // Score sentiment words, applying any pending modifiers
      if (this.positiveWords.has(token)) {
        score += negated ? -multiplier : multiplier;
      } else if (this.negativeWords.has(token)) {
        score += negated ? multiplier : -multiplier;
      }
      // Reset modifiers after each non-modifier token
      multiplier = 1;
      negated = false;
    }
    // Normalize the score to the range [-1, 1]
    const normalized = Math.max(-1, Math.min(1, (score / tokens.length) * 2));
    return {
      score: normalized,
      sentiment: normalized > 0.2 ? 'positive' :
                 normalized < -0.2 ? 'negative' : 'neutral',
      confidence: Math.abs(normalized)
    };
  }
}
// Example usage
const analyzer = new SentimentAnalyzer();
const reviews = [
  "This product is absolutely amazing! I love it!",
  "Terrible experience. Very disappointed.",
  "It's okay, nothing special.",
  "Not bad, but could be better."
];
console.log("Sentiment Analysis Results:\n");
reviews.forEach((review, i) => {
  const result = analyzer.analyzeSentiment(review);
  console.log('Review ' + (i + 1) + ': "' + review + '"');
  console.log('Sentiment: ' + result.sentiment +
    ' (score: ' + result.score.toFixed(2) +
    ', confidence: ' + result.confidence.toFixed(2) + ')');
  console.log('');
});
🎯 Key Techniques:
- Lexicon-based: Uses predefined word sentiment scores
- Intensifiers: Words like "very" amplify sentiment
- Negation handling: "not good" reverses polarity
- Score normalization: Converts to -1 to +1 scale
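The effect of the modifiers is easy to verify directly with the analyzer above:
// Negation flips polarity; intensifiers scale the contribution
console.log(analyzer.analyzeSentiment("good").sentiment);     // positive
console.log(analyzer.analyzeSentiment("not good").sentiment); // negative
console.log(analyzer.analyzeSentiment("this movie was extremely boring").score); // -0.80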
Word Embeddings: Representing Words as Vectors
Word embeddings capture semantic relationships:
// Simple Word2Vec-style embedding (simplified)
class WordEmbedding {
  constructor(embeddingDim = 50) {
    this.embeddingDim = embeddingDim;
    this.vocabulary = new Map();
    this.embeddings = new Map();
  }

  // Initialize random embeddings in [-1, 1]
  initializeEmbeddings(words) {
    words.forEach(word => {
      if (!this.embeddings.has(word)) {
        const embedding = Array(this.embeddingDim)
          .fill(0)
          .map(() => Math.random() * 2 - 1);
        this.embeddings.set(word, embedding);
        this.vocabulary.set(word, this.vocabulary.size);
      }
    });
  }

  // Get a word's embedding (zero vector for unknown words)
  getEmbedding(word) {
    return this.embeddings.get(word) || Array(this.embeddingDim).fill(0);
  }

  // Cosine similarity between two vectors
  cosineSimilarity(vec1, vec2) {
    const dotProduct = vec1.reduce((sum, val, i) => sum + val * vec2[i], 0);
    const mag1 = Math.sqrt(vec1.reduce((sum, val) => sum + val * val, 0));
    const mag2 = Math.sqrt(vec2.reduce((sum, val) => sum + val * val, 0));
    if (mag1 === 0 || mag2 === 0) return 0; // Guard against zero vectors
    return dotProduct / (mag1 * mag2);
  }

  // Find the topK most similar words by cosine similarity
  findSimilar(word, topK = 5) {
    const wordEmb = this.getEmbedding(word);
    const similarities = [];
    for (const [otherWord, otherEmb] of this.embeddings) {
      if (otherWord !== word) {
        similarities.push({
          word: otherWord,
          similarity: this.cosineSimilarity(wordEmb, otherEmb)
        });
      }
    }
    return similarities
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, topK);
  }
}
// Example
const embedding = new WordEmbedding(100);
const words = ['king', 'queen', 'man', 'woman', 'royal', 'prince', 'princess'];
embedding.initializeEmbeddings(words);
console.log("Word embeddings initialized for:", words);
console.log("\nFinding similar words to 'king':");
const similar = embedding.findSimilar('king', 3);
similar.forEach(({ word, similarity }) => {
  console.log("  " + word + ": " + similarity.toFixed(4));
});
// Famous word2vec analogy: king - man + woman ≈ queen
const kingEmb = embedding.getEmbedding('king');
const manEmb = embedding.getEmbedding('man');
const womanEmb = embedding.getEmbedding('woman');
const result = kingEmb.map((v, i) => v - manEmb[i] + womanEmb[i]);
const queenEmb = embedding.getEmbedding('queen');
const similarity = embedding.cosineSimilarity(result, queenEmb);
console.log("\nWord analogy test:");
console.log("king - man + woman ≈ queen");
console.log("Similarity to 'queen': " + similarity.toFixed(4));
// Note: with the random embeddings above, this similarity is arbitrary.
// The analogy only emerges from trained embeddings such as word2vec or GloVe.
💡 Word Embeddings Capture:
- Semantic similarity ("king" is close to "queen")
- Analogical relationships (king - man + woman ≈ queen)
- Distributional meaning (words that appear in similar contexts get similar vectors)
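The embeddings in the example above are random, so the similarities and the analogy are meaningless; real embeddings are learned from context. A minimal sketch of the count-based idea behind them, where each word's vector is its co-occurrence counts within a small window (word2vec learns a dense, compressed version of this signal; the function cooccurrenceEmbeddings is our own):
// Count-based embeddings: each word's vector records how often it
// co-occurs with every vocabulary word within a small window.
// Words that appear in similar contexts end up with similar vectors.
function cooccurrenceEmbeddings(sentences, windowSize = 2) {
  const tokenized = sentences.map(s => s.toLowerCase().split(/\s+/));
  const vocab = [...new Set(tokenized.flat())];
  const index = new Map(vocab.map((w, i) => [w, i]));
  const vectors = new Map(vocab.map(w => [w, Array(vocab.length).fill(0)]));
  for (const sent of tokenized) {
    sent.forEach((word, i) => {
      const start = Math.max(0, i - windowSize);
      const end = Math.min(sent.length - 1, i + windowSize);
      for (let j = start; j <= end; j++) {
        if (j !== i) vectors.get(word)[index.get(sent[j])] += 1;
      }
    });
  }
  return vectors;
}

const counts = cooccurrenceEmbeddings([
  "the king rules the kingdom",
  "the queen rules the kingdom"
]);
// "king" and "queen" share contexts, so their count vectors look alike
console.log(counts.get('king'));
console.log(counts.get('queen'));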
Sequence-to-Sequence Models
Seq2Seq Architecture for Translation:
Encoder → Context Vector → Decoder: the encoder processes the input sequence and compresses it into a context vector, a fixed-size representation of the input's meaning; the decoder then reads that context vector and generates the output translation word by word (a minimal sketch follows).
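Here the "encoder" just averages the input word vectors into one context vector, and the "decoder" picks the output word whose vector best matches it. All vectors and the two-word French vocabulary are hand-made for illustration; real seq2seq models use recurrent networks or transformers for both halves.
// Toy encoder-decoder. Every vector here is invented for illustration.
const sourceVectors = [
  [0.9, 0.1, 0.0], // "I"
  [0.2, 0.8, 0.1], // "love"
  [0.1, 0.2, 0.9]  // "AI"
];

// Encoder: compress the whole input into one fixed-size context vector
const context = sourceVectors
  .reduce((acc, v) => acc.map((x, i) => x + v[i]), [0, 0, 0])
  .map(x => x / sourceVectors.length);

// Decoder step: score candidate output words against the context
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const candidates = new Map([
  ["j'aime", [0.4, 0.5, 0.3]],
  ["bonjour", [0.9, 0.0, 0.1]]
]);
let bestCandidate = null;
for (const [word, vec] of candidates) {
  const score = dot(context, vec);
  if (!bestCandidate || score > bestCandidate.score) {
    bestCandidate = { word, score };
  }
}
console.log("Context vector:", context.map(x => x.toFixed(2)));
console.log("First output word:", bestCandidate.word); // "j'aime"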
Modern NLP: Attention Mechanism
// Simplified Attention Mechanism
class Attention {
  constructor() {
    this.weights = null;
  }

  // Dot-product attention scores between the query and each key
  calculateScores(query, keys) {
    return keys.map(key =>
      query.reduce((sum, val, i) => sum + val * key[i], 0)
    );
  }

  // Softmax turns raw scores into weights that sum to 1
  softmax(scores) {
    const maxScore = Math.max(...scores); // Subtract max for numerical stability
    const exps = scores.map(s => Math.exp(s - maxScore));
    const sumExps = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sumExps);
  }

  // Apply attention: weighted sum of the values
  forward(query, keys, values) {
    const scores = this.calculateScores(query, keys);
    this.weights = this.softmax(scores);
    const output = Array(values[0].length).fill(0);
    for (let i = 0; i < values.length; i++) {
      for (let j = 0; j < values[i].length; j++) {
        output[j] += this.weights[i] * values[i][j];
      }
    }
    return output;
  }
}
// Example: Translating "I love AI"
const attention = new Attention();
// Encoder outputs (simplified as random vectors)
const encoderOutputs = [
  [0.5, 0.3, 0.2], // "I"
  [0.8, 0.6, 0.4], // "love"
  [0.3, 0.9, 0.7]  // "AI"
];
// Decoder query (what word we're generating)
const decoderQuery = [0.7, 0.5, 0.3];
// Calculate attention
const contextVector = attention.forward(
  decoderQuery,
  encoderOutputs, // keys
  encoderOutputs  // values
);
console.log("Attention weights:", attention.weights.map(w => w.toFixed(3)));
console.log("Context vector:", contextVector.map(v => v.toFixed(3)));
console.log("\nInterpretation: The model is paying most attention to:");
const inputWords = ['I', 'love', 'AI'];
attention.weights.forEach((weight, i) => {
  console.log("  " + inputWords[i] + ": " + (weight * 100).toFixed(1) + "%");
});
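One refinement worth knowing: transformers use scaled dot-product attention, dividing each score by the square root of the vector dimension before the softmax so the weights don't saturate as dimensions grow. In the sketch above, that is a one-line change, shown here as a subclass:
// Scaled dot-product attention: divide raw scores by sqrt(dimension)
class ScaledAttention extends Attention {
  calculateScores(query, keys) {
    const scale = Math.sqrt(query.length);
    return keys.map(key =>
      query.reduce((sum, val, i) => sum + val * key[i], 0) / scale
    );
  }
}

const scaled = new ScaledAttention();
console.log(scaled.forward(decoderQuery, encoderOutputs, encoderOutputs)
  .map(v => v.toFixed(3)));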
💡 Key Takeaways
- ✓ NLP bridges human language and computers
- ✓ Text preprocessing is crucial (tokenization, stemming, stop words)
- ✓ Word embeddings represent words as dense vectors
- ✓ Sentiment analysis extracts opinions from text
- ✓ Attention mechanisms help models focus on relevant parts
- ✓ Modern NLP uses transformers (covered in next topic)