AI Ethics & Responsible AI
Navigate AI bias, fairness metrics, model explainability with SHAP and LIME, AI regulations like the EU AI Act, and safety guardrails
Why AI Ethics Matters Now
As AI systems make decisions about hiring, lending, healthcare, and criminal justice, the stakes of getting it wrong are enormous. Biased models perpetuate discrimination. Opaque models undermine trust. Irresponsible deployment causes real harm to real people.
Real-World AI Failures:
- Amazon scrapped an AI hiring tool that discriminated against women
- Healthcare algorithms gave less care to Black patients despite equal needs
- Facial recognition has 10-100x higher error rates for darker skin tones
- Predictive policing reinforces existing racial biases in arrest data
Types of Bias in AI
Data Bias
Training data reflects historical biases. If past hiring data favored men, the model will too. Underrepresented groups in training data get worse predictions.
Algorithmic Bias
Model design choices can amplify biases. Optimizing for overall accuracy may sacrifice minority group performance. Feature selection can introduce proxies for protected attributes.
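A quick way to surface candidate proxies is to check how strongly each feature tracks the protected attribute. A minimal sketch, assuming numerically encoded features and a binary protected attribute (the 0.4 threshold is an arbitrary illustration, not a standard):
import numpy as np

def find_proxy_features(X, protected_attr, feature_names, threshold=0.4):
    # Flag features that correlate strongly with the protected
    # attribute: candidate proxies worth auditing before training.
    proxies = []
    for j, name in enumerate(feature_names):
        r = abs(np.corrcoef(X[:, j], protected_attr)[0, 1])
        if r > threshold:
            proxies.append((name, round(r, 3)))
    return sorted(proxies, key=lambda t: t[1], reverse=True)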
Measurement Bias
What we measure as the "target" may be biased. Using arrests as a proxy for crime rate bakes in policing biases. Using grades as a proxy for ability reflects systemic inequities.
Deployment Bias
Models are deployed in contexts they were not designed for: a model trained on one demographic gets applied universally. Automation bias compounds the problem, as humans over-trust model outputs.
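One simple deployment check is to compare the demographic mix of the training data against live traffic. A minimal sketch, assuming group labels as arrays (the 10-percentage-point flag is an arbitrary illustration):
import numpy as np

def population_shift_report(train_groups, deploy_groups):
    # Compare each group's share of the training population
    # against its share of the deployed population.
    for g in np.union1d(train_groups, deploy_groups):
        train_share = float(np.mean(train_groups == g))
        deploy_share = float(np.mean(deploy_groups == g))
        flag = "  <-- shift" if abs(train_share - deploy_share) > 0.10 else ""
        print(f"group {g}: train {train_share:.0%} vs. deployed {deploy_share:.0%}{flag}")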
Fairness Metrics
import numpy as np
from sklearn.metrics import confusion_matrix

def compute_fairness_metrics(y_true, y_pred, protected_attr):
    """
    Compute key fairness metrics across demographic groups.

    Common fairness definitions:
    - Demographic Parity: P(Y_hat=1 | A=0) = P(Y_hat=1 | A=1)
    - Equal Opportunity: P(Y_hat=1 | Y=1, A=0) = P(Y_hat=1 | Y=1, A=1)
    - Equalized Odds: equal TPR and FPR across groups
    """
    results = {}
    for group_val in np.unique(protected_attr):
        mask = protected_attr == group_val
        y_t = y_true[mask]
        y_p = y_pred[mask]
        tn, fp, fn, tp = confusion_matrix(y_t, y_p, labels=[0, 1]).ravel()
        results[f"group_{group_val}"] = {
            "selection_rate": y_p.mean(),  # demographic parity
            "true_positive_rate": tp / (tp + fn) if (tp + fn) > 0 else 0,  # equal opportunity
            "false_positive_rate": fp / (fp + tn) if (fp + tn) > 0 else 0,
            "accuracy": (tp + tn) / len(y_t),
            "count": len(y_t),
        }

    # Calculate disparities (assumes exactly two groups)
    groups = list(results.values())
    results["disparity"] = {
        "selection_rate_ratio": groups[0]["selection_rate"] / max(groups[1]["selection_rate"], 1e-8),
        "tpr_difference": abs(groups[0]["true_positive_rate"] - groups[1]["true_positive_rate"]),
        "fpr_difference": abs(groups[0]["false_positive_rate"] - groups[1]["false_positive_rate"]),
    }
    return results

# Example: loan approval model (random predictions, for illustration only)
np.random.seed(42)
n = 1000
y_true = np.random.randint(0, 2, n)
y_pred = np.random.randint(0, 2, n)
gender = np.random.choice([0, 1], n)  # 0=female, 1=male

metrics = compute_fairness_metrics(y_true, y_pred, gender)
for group, vals in metrics.items():
    print(f"\n{group}:")
    for k, v in vals.items():
        print(f"  {k}: {v:.4f}" if isinstance(v, float) else f"  {k}: {v}")
# The 4/5ths (80%) rule: selection rate of protected group
# should be at least 80% of the majority group's rate
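# A small helper (illustrative, not from any library) applying the rule
# to the disparity computed above; the check is made symmetric so it
# does not matter which group the ratio was computed over:
def passes_four_fifths_rule(metrics, threshold=0.8):
    r = metrics["disparity"]["selection_rate_ratio"]
    return min(r, 1 / max(r, 1e-8)) >= threshold

print(f"\nPasses 4/5ths rule: {passes_four_fifths_rule(metrics)}")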
Model Explainability: SHAP and LIME
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# SHAP: SHapley Additive exPlanations
# Based on game theory: each feature's contribution to the prediction

# Train a model
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=42)
feature_names = [f"feature_{i}" for i in range(10)]
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])

# Global feature importance (uncomment to plot)
# shap.summary_plot(shap_values, X[:100], feature_names=feature_names)

# Explain a single prediction
idx = 0
print(f"Prediction for sample {idx}: {model.predict(X[idx:idx+1])[0]}")
print(f"Prediction probability: {model.predict_proba(X[idx:idx+1])[0]}")
print("\nTop feature contributions (SHAP values):")

# Older SHAP versions return a list of per-class arrays; newer versions
# return a single array with a trailing class dimension.
sv = shap_values[1][idx] if isinstance(shap_values, list) else shap_values[idx]
if sv.ndim == 2:
    sv = sv[:, 1]  # keep the positive-class column
for name, val in sorted(zip(feature_names, sv), key=lambda x: abs(x[1]), reverse=True)[:5]:
    direction = "increases" if val > 0 else "decreases"
    print(f"  {name}: {val:+.4f} ({direction} prediction)")

# LIME: Local Interpretable Model-agnostic Explanations
# Approximates the model locally with a simple, interpretable model
import lime.lime_tabular

lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    X, feature_names=feature_names, class_names=['No', 'Yes']
)
exp = lime_explainer.explain_instance(X[0], model.predict_proba)
print("\nLIME explanation:")
for feature, weight in exp.as_list()[:5]:
    print(f"  {feature}: {weight:+.4f}")
AI Regulations: The Landscape
EU AI Act (2024)
The world's first comprehensive AI law. Risk-based framework: unacceptable-risk uses (social scoring) are banned; high-risk uses (hiring, credit) face strict requirements; limited-risk uses (chatbots) carry transparency obligations; minimal-risk uses (spam filters) are largely unregulated. Fines reach up to 7% of global annual turnover.
US Executive Order on AI (2023)
Requires developers of powerful AI models to share safety test results with the government. Directs the development of standards for watermarking and authenticating AI-generated content. Tasks agencies with developing sector-specific guidance.
Industry Self-Regulation
Model cards (Hugging Face), responsible AI principles (Google, Microsoft, Anthropic), voluntary commitments on safety testing, red-teaming, and transparency.
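For illustration, a model card can start as a structured record kept alongside the model. The fields below loosely follow the Model Cards paper (Mitchell et al., 2019) and the values are hypothetical; real Hugging Face model cards are markdown files with YAML metadata:
model_card = {
    "model_details": "Loan-approval random forest, v1.0",
    "intended_use": "Decision support for loan officers, not fully automated decisions",
    "evaluation_factors": ["gender", "age group"],
    "metrics": ["accuracy", "selection rate, TPR, and FPR per group"],
    "training_data": "Historical loan applications (hypothetical)",
    "ethical_considerations": "Historical approvals may encode past discrimination",
    "caveats": "Not validated outside the original deployment market",
}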
Hallucination Mitigation and Safety Guardrails
import re

# Strategies for reducing LLM hallucinations and unsafe outputs

# 1. Retrieval-Augmented Generation (RAG)
#    Ground responses in factual documents
def rag_pipeline(query, knowledge_base):
    """
    Instead of relying on model memory, retrieve relevant
    documents and include them in the prompt.
    """
    relevant_docs = knowledge_base.search(query, top_k=3)
    context = "\n".join(doc.text for doc in relevant_docs)
    prompt = f"""Answer based ONLY on the following context.
If the answer is not in the context, say "I don't know."

Context: {context}

Question: {query}

Answer:"""
    return prompt
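# `knowledge_base` above is a hypothetical interface. A minimal in-memory
# stand-in (naive keyword overlap instead of embeddings and a vector
# store) so the sketch runs end to end:
from dataclasses import dataclass

@dataclass
class Doc:
    text: str

class KeywordKnowledgeBase:
    def __init__(self, texts):
        self.docs = [Doc(t) for t in texts]

    def search(self, query, top_k=3):
        # Rank documents by how many query terms they contain
        terms = set(query.lower().split())
        ranked = sorted(self.docs,
                        key=lambda d: len(terms & set(d.text.lower().split())),
                        reverse=True)
        return ranked[:top_k]

kb = KeywordKnowledgeBase([
    "The EU AI Act takes a risk-based approach to regulating AI.",
    "SHAP values attribute a prediction to individual features.",
])
print(rag_pipeline("What approach does the EU AI Act take?", kb))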
# 2. Output validation and guardrails
class SafetyGuardrails:
    def __init__(self):
        # Naive keyword blocklist for illustration; production systems
        # typically use trained safety classifiers
        self.blocked_topics = ["harmful", "illegal", "dangerous"]
        self.pii_patterns = [
            r'\b\d{3}-\d{2}-\d{4}\b',  # SSN
            r'\b\d{16}\b',             # credit card
        ]

    def check_input(self, text):
        """Screen inputs for safety."""
        text_lower = text.lower()
        for topic in self.blocked_topics:
            if topic in text_lower:
                return False, f"Blocked: contains '{topic}'"
        return True, "OK"

    def check_output(self, text):
        """Screen outputs for PII leakage and safety."""
        for pattern in self.pii_patterns:
            if re.search(pattern, text):
                return False, "Output contains potential PII"
        return True, "OK"

    def add_confidence(self, response, sources):
        """Add confidence indicators to responses."""
        if not sources:
            return response + "\n\n(Note: This response is not grounded in verified sources.)"
        return response + f"\n\n(Sources: {', '.join(sources)})"
# 3. Constitutional AI approach
# Train the model to self-critique and revise its responses
# based on a set of principles (Anthropic's approach for Claude)
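# A minimal sketch of that critique-and-revise loop; `llm` is a
# hypothetical callable (prompt in, text out) standing in for a real
# model API, not Anthropic's actual training procedure:
def constitutional_revision(llm, response, principles):
    for principle in principles:
        critique = llm(f"Critique this response against the principle "
                       f"'{principle}':\n\n{response}")
        response = llm(f"Revise the response to address the critique.\n\n"
                       f"Critique: {critique}\n\nResponse: {response}")
    return response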
guardrails = SafetyGuardrails()
safe, msg = guardrails.check_input("How do I bake a cake?")
print(f"Input check: {msg}")
print("\nKey strategies: RAG for grounding, guardrails for safety,")
print("Constitutional AI for self-correction")
Key Takeaways
- AI bias comes from data, algorithms, measurement, and deployment -- address all four
- Use fairness metrics (demographic parity, equal opportunity) alongside accuracy
- SHAP and LIME provide post-hoc explanations for any model's predictions
- The EU AI Act is the first comprehensive AI regulation; expect more worldwide
- Combine RAG, guardrails, and Constitutional AI to mitigate hallucinations and unsafe outputs