The Python data science ecosystem
Python has become the dominant language for data science, machine learning, and scientific computing. Its ecosystem includes powerful libraries for data manipulation (pandas), numerical computing (NumPy), visualization (Matplotlib, Seaborn), and machine learning (scikit-learn, PyTorch, TensorFlow). This lesson covers the fundamentals.
Core data science libraries
- NumPy: Fast numeric arrays and mathematical operations
- pandas: DataFrames for data manipulation and analysis
- Matplotlib: Comprehensive plotting and visualization
- Seaborn: Statistical visualization built on top of Matplotlib
- scikit-learn: Machine learning algorithms and utilities
- Jupyter: Interactive notebooks for exploration and presentation
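NumPy and pandas are the foundation the rest of the stack builds on. A minimal sketch of both (the sample values and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# NumPy: vectorized math over whole arrays, no explicit loops
values = np.array([1.0, 2.0, 3.0, 4.0])
print(values.mean())   # arithmetic mean -> 2.5
print(values * 10)     # element-wise multiplication

# pandas: labeled, tabular data built on top of NumPy arrays
df = pd.DataFrame({
    "language": ["Python", "Rust", "Go"],
    "year": [1991, 2010, 2009],
})
print(df[df["year"] > 2000])  # boolean filtering by column
```

Most pandas operations return new objects rather than modifying in place, which keeps exploratory analysis chains easy to reason about.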
Jupyter Notebooks
# Install Jupyter
# pip install jupyterlab
# Launch Jupyter Lab
# jupyter lab
# Jupyter notebooks (.ipynb) combine:
# - Code cells (Python, R, Julia, etc.)
# - Markdown cells (text, equations, images)
# - Output (tables, charts, images)
# Common Jupyter magic commands
# %timeit sum(range(1000)) # Time a single line
# %%timeit # Time an entire cell
# %matplotlib inline # Show plots inline
# %who # List variables
# !pip install package # Run shell commands
# %load_ext autoreload # Auto-reload modules
# %autoreload 2
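Outside a notebook, the `%timeit` magic can be approximated with the standard-library `timeit` module; a minimal sketch:

```python
import timeit

# Time the same expression %timeit would measure, over 1000 runs
elapsed = timeit.timeit("sum(range(1000))", number=1000)
print(f"1000 runs took {elapsed:.4f} s")
```

Unlike `%timeit`, this reports total elapsed time rather than a per-loop average with automatic repeat selection, but it works in any plain Python script.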
Data visualization with Matplotlib
import matplotlib.pyplot as plt
import numpy as np
# Line plot
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y1, label="sin(x)", color="blue", linewidth=2)
ax.plot(x, y2, label="cos(x)", color="red", linestyle="--")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Trigonometric Functions")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("trig.png", dpi=150)
plt.show()
# Bar chart
categories = ["Python", "JavaScript", "Rust", "Go", "TypeScript"]
popularity = [85, 78, 42, 55, 72]
fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(categories, popularity, color=["#3776AB", "#F7DF1E", "#DEA584", "#00ADD8", "#3178C6"])
ax.set_ylabel("Popularity Score")
ax.set_title("Programming Language Popularity")
# Add value labels on bars
for bar, val in zip(bars, popularity):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
            str(val), ha="center", va="bottom", fontweight="bold")
plt.tight_layout()
plt.show()
# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("Sine")
axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title("Cosine")
axes[1, 0].plot(x, np.exp(-x/5))
axes[1, 0].set_title("Exponential Decay")
axes[1, 1].plot(x, np.log(x + 1))
axes[1, 1].set_title("Logarithm")
plt.tight_layout()
plt.show()
Seaborn for statistical visualization
import seaborn as sns
import pandas as pd
import numpy as np
# Load a built-in dataset
tips = sns.load_dataset("tips")
# Distribution plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(data=tips, x="total_bill", kde=True, ax=axes[0])
axes[0].set_title("Distribution of Total Bill")
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[1])
axes[1].set_title("Total Bill by Day")
plt.tight_layout()
plt.show()
# Scatter plot with regression line
sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker",
           height=6, aspect=1.2)
plt.title("Tip vs Total Bill")
plt.show()
# Heatmap for correlation matrix
numeric_tips = tips.select_dtypes(include=[np.number])
correlation = numeric_tips.corr()
sns.heatmap(correlation, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()
Basic machine learning with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
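For a per-class view beyond the aggregate report, `confusion_matrix` shows exactly which classes get confused with which. A self-contained sketch on the same iris split (here without scaling, which this dataset tolerates):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes;
# off-diagonal entries count misclassifications
print(confusion_matrix(y_test, model.predict(X_test)))
```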
Key takeaways
- Jupyter for exploration: Use notebooks for interactive data analysis
- Matplotlib for control: Full control over every aspect of your plots
- Seaborn for statistics: Beautiful statistical visualizations with minimal code
- scikit-learn pipeline: Split, scale, train, evaluate is the standard workflow
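The split/scale/train/evaluate workflow can be collapsed into a single estimator with scikit-learn's `Pipeline`, which guarantees the scaler is fit on training data only. A sketch using the same iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Chain scaling and the classifier: fit() learns scaling parameters
# from the training data only; predict()/score() reuse them on test data
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2%}")
```

Because the pipeline is a single estimator, it also drops directly into `cross_val_score` or `GridSearchCV` without risk of leaking test statistics into preprocessing.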