10.2 Building Pipelines - AI Application Development

🎯 Learning Objectives

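Understand what a Pipeline does and why it helps
Learn to use Pipeline and FeatureUnion
Learn to handle heterogeneous data with ColumnTransformer
Understand how Pipeline combines with grid search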

Why Use a Pipeline

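A Pipeline chains multiple data-processing steps and a final model into a single estimator. Because every step is fitted only on the data passed to fit, it prevents data leakage during cross-validation. It also shortens the code and lets cross-validation and hyperparameter tuning treat the whole workflow as one object, which makes it a key tool for building reproducible machine learning workflows.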

🔄 Pipeline Workflow

📥 Raw data → 🔄 Preprocessing → 🧠 Model training → 📤 Prediction output

💻 Basic Pipeline Usage

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

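# X_train, X_test, y_train, y_test are assumed to come from an earlier
# train/test split; the data needs at least 10 features for PCA(n_components=10)

# Create the Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),             # Step 1: standardize features
    ('pca', PCA(n_components=10)),            # Step 2: reduce dimensionality
    ('classifier', RandomForestClassifier())  # Step 3: classify
])

# Train: each transformer is fitted and applied in order, then the classifier is fitted
pipeline.fit(X_train, y_train)

# Predict: the fitted transformers are applied to X_test before the classifier
y_pred = pipeline.predict(X_test)

# Evaluate (mean accuracy for a classifier)
score = pipeline.score(X_test, y_test)

# Access individual steps by name
pipeline.named_steps['scaler']
pipeline['classifier']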
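
The learning objectives mention FeatureUnion, so here is a minimal sketch of how it might be combined with a Pipeline. Where a Pipeline runs steps one after another, a FeatureUnion runs several transformers on the same input and concatenates their outputs column-wise. The Iris dataset, the choice of SelectKBest and LogisticRegression, and the name union_pipeline are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption, not part of the original section)
X, y = load_iris(return_X_y=True)

# Run two feature extractors on the same input and concatenate their outputs
combined_features = FeatureUnion([
    ('pca', PCA(n_components=2)),   # 2 principal components
    ('select', SelectKBest(k=1)),   # 1 best original feature (ANOVA F-score by default)
])

union_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('features', combined_features),
    ('classifier', LogisticRegression(max_iter=1000)),
])
union_pipeline.fit(X, y)

# 2 PCA columns + 1 selected column = 3 features reach the classifier
n_combined = union_pipeline[:-1].transform(X).shape[1]
print(n_combined)
```

Slicing with union_pipeline[:-1] returns the fitted preprocessing steps without the classifier, which is a convenient way to inspect what the model actually receives.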
      

📊 Handling Heterogeneous Data with ColumnTransformer

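from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define the numeric and categorical feature columns
numeric_features = ['age', 'income']
categorical_features = ['gender', 'city']

# Build the preprocessor: each transformer is applied only to its own columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

# Combine preprocessing and the model into a complete Pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])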
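
To see the column routing in action, the following sketch fits the same kind of pipeline on a tiny DataFrame. The data values, the names df, toy_pipeline, and new_row, and the use of handle_unknown='ignore' are assumptions added for illustration; only the column names come from the listing above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names match the lists above, values are made up
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29],
    'income': [3000, 4500, 8000, 7200, 5100, 3900],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shenzhen', 'Shanghai', 'Beijing'],
})
y = [0, 0, 1, 1, 1, 0]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        # handle_unknown='ignore' encodes categories unseen during fit as all zeros
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'city']),
    ]
)

toy_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(df, y)

# 2 scaled numeric columns + 2 gender columns + 3 city columns = 7
n_columns = toy_pipeline.named_steps['preprocessor'].transform(df).shape[1]
print(n_columns)

# A new row with a city not seen during fit can still be scored
new_row = pd.DataFrame({'age': [40], 'income': [6000], 'gender': ['F'], 'city': ['Chengdu']})
print(toy_pipeline.predict(new_row))
```

With the default OneHotEncoder() settings, a category that appears only at prediction time raises an error, which is why handle_unknown='ignore' is often worth considering when the pipeline is cross-validated or deployed.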
      

🔧 Pipeline with Grid Search

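from sklearn.model_selection import GridSearchCV

# Note: `pipeline` here must be the scaler -> pca -> classifier pipeline from the
# basic example; the ColumnTransformer pipeline has no 'pca' step, so the
# 'pca__n_components' key would raise an error on it.

# Define the parameter grid (keys use the format step_name__parameter_name)
param_grid = {
    'pca__n_components': [5, 10, 20],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, None]
}

# Grid search: 3 x 3 x 3 = 27 combinations, each evaluated with 5-fold CV
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f'Best parameters: {grid_search.best_params_}')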
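
The step_name__parameter_name keys are not special to GridSearchCV: they are the pipeline's own parameter names. The sketch below, which uses a hypothetical demo_pipeline built like the basic example, lists those names with get_params and changes two of them with set_params.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline mirroring the basic example
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', RandomForestClassifier()),
])

# Every tunable name has the form '<step name>__<parameter name>'
pca_param_names = sorted(k for k in demo_pipeline.get_params() if k.startswith('pca__'))
print(pca_param_names)

# The same names work with set_params, so a value can be changed without rebuilding
demo_pipeline.set_params(pca__n_components=5, classifier__max_depth=10)
print(demo_pipeline.named_steps['pca'].n_components)
```

Printing the keys of get_params() before writing a param_grid is a quick way to catch a misspelled step or parameter name.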
      

💡

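Avoiding Data Leakage

When a Pipeline is cross-validated, the preprocessing steps are refitted within each fold on that fold's training portion only and then applied to the held-out portion, so no information from the held-out data leaks into training.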
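
The effect can be made visible with a small experiment. This is a sketch under stated assumptions: the data is pure random noise with random labels, and supervised feature selection (SelectKBest) stands in for the preprocessing step because its leakage is much easier to see than a scaler's. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5000))   # pure noise features
y = rng.randint(0, 2, size=100)    # labels unrelated to X

# Leaky: the selector sees every label, including those of the held-out folds
X_selected = SelectKBest(k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Correct: the selector is refitted inside each fold, on that fold's training rows only
honest_pipeline = Pipeline([
    ('select', SelectKBest(k=20)),
    ('classifier', LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(honest_pipeline, X, y, cv=5).mean()

print(f'leaky: {leaky_score:.2f}, pipeline: {honest_score:.2f}')
```

Because the labels are random, no model can genuinely beat chance here. Any accuracy well above 0.5 in the leaky variant is an artifact of selecting features with knowledge of the held-out labels, while the pipeline variant should stay near chance.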

Figure: A Pipeline chains multiple steps into a complete workflow

📝 Section Summary

✅

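• A Pipeline chains multiple processing steps into one complete workflow
• It prevents data leakage and keeps cross-validation results valid
• ColumnTransformer applies different preprocessing to different column types
• Combined with GridSearchCV, it tunes preprocessing and model parameters together
• Parameter naming format: step_name__parameter_name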