The Python Data Science Ecosystem
Python has become the dominant language for data science, machine learning, and scientific computing. Its ecosystem includes powerful libraries for data manipulation (pandas), numerical computing (NumPy), visualization (Matplotlib, Seaborn), and machine learning (scikit-learn, PyTorch, TensorFlow). This lesson covers the foundations.
Core Data Science Libraries
- NumPy: Fast numerical arrays and mathematical operations
- pandas: DataFrames for data manipulation and analysis
- Matplotlib: Comprehensive plotting and visualization
- Seaborn: Statistical visualization built on Matplotlib
- scikit-learn: Machine learning algorithms and utilities
- Jupyter: Interactive notebooks for exploration and presentation
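NumPy and pandas appear throughout the rest of this lesson, so it helps to see their core idioms up front. A minimal sketch (the sample values here are illustrative, not from any real dataset): NumPy gives vectorized math on homogeneous arrays, while pandas adds labeled rows and columns on top.

```python
import numpy as np
import pandas as pd

# NumPy: vectorized operations apply element-wise, no explicit loops
values = np.array([1.0, 2.0, 3.0, 4.0])
print(values.mean())   # 2.5
print(values * 10)     # [10. 20. 30. 40.]

# pandas: a DataFrame is a table with named columns
df = pd.DataFrame({
    "language": ["Python", "Rust", "Go"],
    "year": [1991, 2010, 2009],
})
print(df[df["year"] > 2000])  # boolean filtering keeps matching rows
print(df["year"].max())       # column-wise aggregation: 2010
```

Most pandas operations return new objects rather than mutating in place, which makes chained transformations easy to reason about.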
Jupyter Notebooks
# Install Jupyter
# pip install jupyterlab
# Launch Jupyter Lab
# jupyter lab
# Jupyter notebooks (.ipynb) combine:
# - Code cells (Python, R, Julia, etc.)
# - Markdown cells (text, equations, images)
# - Output (tables, charts, images)
# Common Jupyter magic commands
# %timeit sum(range(1000)) # Time a single line
# %%timeit # Time an entire cell (cell magics use %% and must be on the cell's first line)
# %matplotlib inline # Show plots inline
# %who # List variables
# !pip install package # Run shell commands
# %load_ext autoreload # Auto-reload modules
# %autoreload 2
Data Visualization with Matplotlib
import matplotlib.pyplot as plt
import numpy as np
# Line plot
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y1, label="sin(x)", color="blue", linewidth=2)
ax.plot(x, y2, label="cos(x)", color="red", linestyle="--")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Trigonometric Functions")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("trig.png", dpi=150)
plt.show()
# Bar chart
categories = ["Python", "JavaScript", "Rust", "Go", "TypeScript"]
popularity = [85, 78, 42, 55, 72]
fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(categories, popularity, color=["#3776AB", "#F7DF1E", "#DEA584", "#00ADD8", "#3178C6"])
ax.set_ylabel("Popularity Score")
ax.set_title("Programming Language Popularity")
# Add value labels on bars
for bar, val in zip(bars, popularity):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            str(val), ha="center", va="bottom", fontweight="bold")
plt.tight_layout()
plt.show()
# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("Sine")
axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title("Cosine")
axes[1, 0].plot(x, np.exp(-x/5))
axes[1, 0].set_title("Exponential Decay")
axes[1, 1].plot(x, np.log(x + 1))
axes[1, 1].set_title("Logarithm")
plt.tight_layout()
plt.show()
Seaborn for Statistical Visualization
import seaborn as sns
import pandas as pd
import numpy as np
# Load a built-in dataset
tips = sns.load_dataset("tips")
# Distribution plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(data=tips, x="total_bill", kde=True, ax=axes[0])
axes[0].set_title("Distribution of Total Bill")
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[1])
axes[1].set_title("Total Bill by Day")
plt.tight_layout()
plt.show()
# Scatter plot with regression line
sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker",
           height=6, aspect=1.2)
plt.title("Tip vs Total Bill")
plt.show()
# Heatmap for correlation matrix
numeric_tips = tips.select_dtypes(include=[np.number])
correlation = numeric_tips.corr()
sns.heatmap(correlation, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()
Basic Machine Learning with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Key Takeaways
- Jupyter for exploration: Use notebooks for interactive data analysis
- Matplotlib for control: Full control over every aspect of your plots
- Seaborn for statistics: Beautiful statistical visualizations with minimal code
- scikit-learn pipeline: Split, scale, train, evaluate is the standard workflow
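The split/scale/train/evaluate steps can also be bundled into a single scikit-learn Pipeline, which keeps preprocessing and the model together so the scaler is only ever fit on training data. A minimal sketch using the same iris dataset and logistic regression as above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Chaining the scaler and classifier: fit() runs fit_transform on the
# scaler with training data only, avoiding leakage into the test set
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2%}")
```

A fitted Pipeline behaves like a single estimator, so it can be passed directly to utilities such as cross_val_score without repeating the preprocessing steps.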