MarkTechPost@AI, September 25
Building an End-to-End Data Science Workflow by Combining Machine Learning with Gemini

 

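This tutorial shows how to combine traditional machine learning with Gemini to build an advanced end-to-end data science workflow. We first prepare and model the diabetes dataset, then move on to evaluation, feature importance analysis, and partial dependence analysis. Throughout, Gemini acts as an AI data scientist that helps explain results, answer exploratory questions, and point out potential risks. In this way, we build a predictive model while using natural-language interaction to strengthen insight and decision-making.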

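🔬 **Data preparation and model building**: The tutorial walks through loading the diabetes dataset, preprocessing the features (standardization and quantile transformation), and building a machine learning pipeline around HistGradientBoostingRegressor. A train/test split and K-fold cross-validation give an initial estimate of model performance and a check on how well the model generalizes.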

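📊 **Model evaluation and feature insights**: The model is evaluated with key metrics: RMSE on the training set, the test set, and cross-validation, plus MAE and R² on the test set. A residual plot helps inspect the distribution of prediction errors. Permutation importance then ranks the top 10 features by their contribution to the model's predictions, shown as a bar chart.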

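💡 **Partial dependence analysis and AI assistance**: For the three most important features, partial dependence plots (PDPs) are computed and visualized to show how changes in each feature's value affect the predicted target. Finally, Gemini generates an executive summary, a risk assessment, suggested next experiments, and quick feature engineering ideas, which helps speed up and deepen the analysis.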

In this tutorial, we walk through an advanced end-to-end data science workflow where we combine traditional machine learning with the power of Gemini. We begin by preparing and modeling the diabetes dataset, then we dive into evaluation, feature importance, and partial dependence. Along the way, we bring in Gemini as our AI data scientist to explain results, answer exploratory questions, and highlight risks. By doing this, we build a predictive model while also enhancing our insights and decision-making through natural language interaction. Check out the FULL CODES here.

!pip install -qU google-generativeai scikit-learn matplotlib pandas numpy

from getpass import getpass
import os, json, numpy as np, pandas as pd, matplotlib.pyplot as plt

# Prompt for the API key only if it is not already set in the environment.
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass(" Enter your Gemini API key (hidden): ")

import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
LLM = genai.GenerativeModel("gemini-1.5-flash")

def ask_llm(prompt, sys=None):
    # Prepend the optional system instruction to the user prompt.
    p = prompt if sys is None else f"System:\n{sys}\n\nUser:\n{prompt}"
    r = LLM.generate_content(p)
    return (getattr(r, "text", "") or "").strip()

from sklearn.datasets import load_diabetes
raw = load_diabetes(as_frame=True)
df  = raw.frame.rename(columns={"target":"disease_progression"})
print("Shape:", df.shape); display(df.head())

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline

X = df.drop(columns=["disease_progression"]); y = df["disease_progression"]
num_cols = X.columns.tolist()

# Every numeric column goes through both transformers,
# so the model receives a scaled copy and a rank-transformed copy of each feature.
pre = ColumnTransformer(
    [("scale", StandardScaler(), num_cols),
     ("rank",  QuantileTransformer(n_quantiles=min(200, len(X)), output_distribution="normal"), num_cols)],
    remainder="drop", verbose_feature_names_out=False)

model = HistGradientBoostingRegressor(max_depth=3, learning_rate=0.07,
                                      l2_regularization=0.0, max_iter=500,
                                      early_stopping=True, validation_fraction=0.15)
pipe  = Pipeline([("prep", pre), ("hgbt", model)])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.20, random_state=42)

# 5-fold CV on the training split; RMSE is the square root of the mean fold MSE.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse = -cross_val_score(pipe, Xtr, ytr, scoring="neg_mean_squared_error", cv=cv).mean()
cv_rmse = float(cv_mse ** 0.5)
pipe.fit(Xtr, ytr)

We load the diabetes dataset, preprocess the features, and build a pipeline that combines scaling, quantile transformation, and gradient boosting. We split the data, estimate RMSE with 5-fold cross-validation on the training split, and then fit the pipeline on the full training split so we can check how well it generalizes to the held-out test set.
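One detail of the cross-validation step is easy to miss: the code averages the per-fold MSE first and takes the square root afterwards, which is not the same as averaging per-fold RMSE values. The sketch below uses made-up fold MSEs (hypothetical numbers, not results from this notebook) to show the difference.

```python
import math

# Hypothetical per-fold MSE values, for illustration only
# (these are NOT outputs of the tutorial's pipeline).
fold_mse = [2500.0, 3600.0, 3025.0, 4900.0, 2025.0]

# What the tutorial computes: square root of the mean MSE across folds.
rmse_of_mean_mse = math.sqrt(sum(fold_mse) / len(fold_mse))

# A common alternative: mean of the per-fold RMSE values.
mean_of_fold_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)

print(f"sqrt(mean MSE)  = {rmse_of_mean_mse:.2f}")
print(f"mean(fold RMSE) = {mean_of_fold_rmse:.2f}")
```

The first value is never smaller than the second, so the two should not be compared against each other across reports. In scikit-learn, the second variant corresponds to averaging scores obtained with `scoring="neg_root_mean_squared_error"`.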

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

pred_tr = pipe.predict(Xtr); pred_te = pipe.predict(Xte)
rmse_tr = mean_squared_error(ytr, pred_tr) ** 0.5
rmse_te = mean_squared_error(yte, pred_te) ** 0.5
mae_te  = mean_absolute_error(yte, pred_te)
r2_te   = r2_score(yte, pred_te)
print(f"CV RMSE={cv_rmse:.2f} | Train RMSE={rmse_tr:.2f} | Test RMSE={rmse_te:.2f} | Test MAE={mae_te:.2f} | R²={r2_te:.3f}")

# Residuals vs. predictions on the test set.
plt.figure(figsize=(5,4))
plt.scatter(pred_te, yte - pred_te, s=12)
plt.axhline(0, lw=1); plt.xlabel("Predicted"); plt.ylabel("Residual"); plt.title("Residuals (Test)")
plt.show()

# Permutation importance on the test set: increase in MSE when a feature is shuffled.
from sklearn.inspection import permutation_importance
imp = permutation_importance(pipe, Xte, yte, scoring="neg_mean_squared_error", n_repeats=10, random_state=0)
imp_df = pd.DataFrame({"feature": X.columns, "importance": imp.importances_mean}).sort_values("importance", ascending=False)
display(imp_df.head(10))

plt.figure(figsize=(6,4))
top10 = imp_df.head(10).iloc[::-1]
plt.barh(top10["feature"], top10["importance"])
plt.title("Permutation Importance (Top 10)"); plt.xlabel("Δ(MSE)"); plt.tight_layout(); plt.show()

We evaluate the model by computing cross-validation, train, and test RMSE along with test MAE and R², and plot the test residuals to inspect prediction errors. We then compute permutation importance on the test set to identify which features the model relies on most, and show the top contributors in a horizontal bar plot.
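To make the permutation importance idea concrete, here is a minimal pure-Python sketch: shuffle one feature column at a time and measure how much the MSE increases. The `toy_model` and the synthetic data are assumptions made up for illustration; this is not how scikit-learn implements `permutation_importance` internally, only the same idea in a few lines.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_model(row):
    # Hypothetical fitted model: strong dependence on x0, weak on x1, none on x2.
    return 3.0 * row[0] + 0.5 * row[1]

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)  # break the link between feature j and the target
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic data: three standard-normal features, target generated by toy_model.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [toy_model(r) for r in X]

imp = permutation_importance_sketch(toy_model, X, y)
for j, value in enumerate(imp):
    print(f"x{j}: ΔMSE = {value:.3f}")
```

The unused feature gets an importance of exactly zero, and the ranking follows the size of each coefficient. On correlated real-world features the picture is usually less clean, because shuffling one column can be partly compensated by another.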

def compute_pdp(pipe, Xref: pd.DataFrame, feat: str, grid=40):
    # Sweep the feature over its 5th-95th percentile range and
    # average the predictions with all other columns left unchanged.
    xs = np.linspace(np.percentile(Xref[feat], 5), np.percentile(Xref[feat], 95), grid)
    Xtmp = Xref.copy()
    ys = []
    for v in xs:
        Xtmp[feat] = v
        ys.append(pipe.predict(Xtmp).mean())
    return xs, np.array(ys)

top_feats = imp_df["feature"].head(3).tolist()
plt.figure(figsize=(6,4))
for f in top_feats:
    xs, ys = compute_pdp(pipe, Xte.copy(), f, grid=40)
    plt.plot(xs, ys, label=f)
plt.legend(); plt.xlabel("Feature value"); plt.ylabel("Predicted target"); plt.title("Manual PDP (Top 3)")
plt.tight_layout(); plt.show()

# Compact, JSON-serializable summary that is passed to Gemini.
report_obj = {
    "dataset": {"rows": int(df.shape[0]), "cols": int(df.shape[1]-1), "target": "disease_progression"},
    "metrics": {"cv_rmse": float(cv_rmse), "train_rmse": float(rmse_tr),
                "test_rmse": float(rmse_te), "test_mae": float(mae_te), "r2": float(r2_te)},
    "top_importances": imp_df.head(10).to_dict(orient="records")
}
print(json.dumps(report_obj, indent=2))

sys_msg = ("You are a senior data scientist. Return: (1) ≤120-word executive summary, "
           "(2) key risks/assumptions bullets, (3) 5 prioritized next experiments w/ rationale, "
           "(4) quick-win feature engineering ideas as Python pseudocode.")
summary = ask_llm(f"Dataset + metrics + importances:\n{json.dumps(report_obj)}", sys=sys_msg)
print("\n Gemini Executive Brief\n" + "-"*80 + f"\n{summary}\n")

We compute partial dependence manually for the top three features and plot how varying each one, with the other features left at their observed values, changes the average prediction. We then assemble a compact JSON report of dataset statistics, metrics, and importances, and ask Gemini to generate an executive brief that includes risks, next experiments, and quick-win feature engineering ideas.
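The manual partial dependence computation can be checked by hand on a tiny example. The sketch below is a pure-Python version of the same idea; the `toy_model`, the four-row dataset, and the fixed grid are assumptions for illustration (the tutorial's `compute_pdp` uses a 5th to 95th percentile grid and the fitted pipeline instead).

```python
def toy_model(row):
    # Hypothetical model: linear in x0, quadratic in x1.
    return 2.0 * row[0] + row[1] ** 2

def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi, inclusive."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def manual_pdp(predict, X, j, grid):
    """For each grid value, overwrite feature j in every row and average the predictions."""
    curve = []
    for v in grid:
        preds = [predict(r[:j] + [v] + r[j + 1:]) for r in X]
        curve.append(sum(preds) / len(preds))
    return curve

X = [[-1.0, 0.0], [0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = linspace(-1.0, 1.0, 5)
pdp_x0 = manual_pdp(toy_model, X, 0, grid)

for v, p in zip(grid, pdp_x0):
    print(f"x0 = {v:+.1f} -> mean prediction = {p:.2f}")
```

Because the toy model is additive, the curve for `x0` is a straight line with slope 2, offset by the average of `x1 ** 2` over the rows (3.5 here). scikit-learn also ships `sklearn.inspection.partial_dependence` and `PartialDependenceDisplay.from_estimator`, which may be preferable to a hand-rolled loop in practice.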

SAFE_GLOBALS = {"pd": pd, "np": np}

def run_generated_pandas(code: str, df_local: pd.DataFrame):
    # Coarse substring blocklist; rejects the snippet if any banned token appears anywhere.
    banned = ["__", "import", "open(", "exec(", "eval(", "os.", "sys.", "pd.read", "to_csv", "to_pickle", "to_sql"]
    if any(b in code for b in banned): raise ValueError("Unsafe code rejected.")
    loc = {"df": df_local.copy()}   # run against a copy so the original frame is untouched
    exec(code, SAFE_GLOBALS, loc)
    return {k:v for k,v in loc.items() if k not in ("df",)}

def eda_qa(question: str):
    prompt = f"""You are a Python+Pandas analyst. DataFrame `df` columns:
{list(df.columns)}. Write a SHORT pandas snippet (no comments/prints) that computes the answer to:
"{question}". Use only pd/np/df; assign the final result to a variable named `answer`."""
    code = ask_llm(prompt, sys="Return only code. No prose.")
    try:
        out = run_generated_pandas(code, df)
        return code, out.get("answer", None)
    except Exception as e:
        return code, f"[Execution error: {e}]"

questions = [
    "What is the Pearson correlation between BMI and disease_progression?",
    "Show mean target by tertiles of BMI (low/med/high).",
    "Which single feature correlates most with the target (absolute value)?"
]
for q in questions:
    code, ans = eda_qa(q)
    print("\nQ:", q, "\nCode:\n", code, "\nAnswer:\n", ans)

We build a restricted executor for the pandas code that Gemini generates during exploratory data analysis. It rejects any snippet containing a blocklisted substring, which is a coarse guard rather than a true security sandbox. We then ask natural-language questions about correlations and feature relationships, let Gemini write the pandas snippets, and run them automatically to get answers directly from the dataset.
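A substring blocklist can reject harmless snippets (for example, any code that mentions a string such as "important", which contains "import"), and model output that arrives wrapped in Markdown code fences will fail to execute. As a hedged alternative, and not part of the original notebook, the sketch below strips fences and inspects the parsed syntax tree with Python's standard `ast` module. The helper names `strip_code_fences` and `looks_safe` are hypothetical, and this check is still not a security boundary.

```python
import ast

FENCE = "`" * 3  # Markdown code-fence marker
FORBIDDEN_NAMES = {"exec", "eval", "open", "compile", "__import__",
                   "getattr", "setattr", "globals", "locals"}

def strip_code_fences(text: str) -> str:
    """Drop a leading and trailing Markdown fence line, if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].lstrip().startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines).strip()

def looks_safe(code: str) -> bool:
    """Reject imports, underscore-prefixed attributes, and a few risky builtins."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            return False
        if isinstance(node, ast.Name) and (
            node.id.startswith("__") or node.id in FORBIDDEN_NAMES
        ):
            return False
    return True

raw_reply = FENCE + "python\nanswer = df['bmi'].corr(df['disease_progression'])\n" + FENCE
snippet = strip_code_fences(raw_reply)
print(snippet, "->", looks_safe(snippet))
print("import os ->", looks_safe("import os"))
```

A check like this could run before the existing `exec` call. For untrusted input, process-level isolation (a separate interpreter or container with no credentials) remains the more dependable option.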

critique = ask_llm(
    f"""Metrics: {report_obj['metrics']}
Top importances: {report_obj['top_importances']}
Identify risks around leakage, overfitting, calibration, OOD robustness, and fairness (even proxy-only).
Propose quick checks (concise Python sketches)."""
)
print("\n Gemini Risk & Robustness Review\n" + "-"*80 + f"\n{critique}\n")

def what_if(pipe, Xref: pd.DataFrame, feat: str, delta: float = 0.05):
    # Compare predictions at the median row with and without a shift in one feature.
    x0 = Xref.median(numeric_only=True).to_dict()
    x1, x2 = x0.copy(), x0.copy()
    if feat not in x1: return np.nan
    x2[feat] = x1[feat] + delta
    X1 = pd.DataFrame([x1], columns=X.columns)
    X2 = pd.DataFrame([x2], columns=X.columns)
    return float(pipe.predict(X2)[0] - pipe.predict(X1)[0])

for f in top_feats:
    print(f"Estimated Δtarget if {f} increases by +0.05 ≈ {what_if(pipe, Xte, f, 0.05):.2f}")

print("\n Done: Train → Explain → Query with Gemini → Review risks → What-if analysis. "
      "Swap the dataset or tweak model params to extend this notebook.")

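We ask Gemini to review our model for risks like leakage, overfitting, and fairness, and get quick Python checks as suggestions. We then run simple “what-if” analyses: starting from a row of feature medians, we shift one top feature by +0.05 and record the change in the prediction. Because the features in this dataset are pre-scaled, +0.05 is roughly one standard deviation, so the result shows how the model responds to a moderate shift at a typical point.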
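The what-if calculation is a finite difference taken at a single reference point, so its result depends on where that point sits. The pure-Python sketch below illustrates this with a hypothetical `toy_model` that contains an interaction term; the data and function names are assumptions for illustration only.

```python
import statistics

def toy_model(row):
    # Hypothetical model with an interaction between x0 and x1.
    return row[0] * row[1] + 10.0

def what_if_sketch(predict, X, j, delta):
    """Change in prediction when feature j of the median row is shifted by delta."""
    median_row = [statistics.median(col) for col in zip(*X)]
    shifted = list(median_row)
    shifted[j] += delta
    return predict(shifted) - predict(median_row)

X = [[0.0, 1.0], [1.0, 2.0], [2.0, 9.0]]
effect = what_if_sketch(toy_model, X, 0, delta=0.05)
print(f"Δprediction for x0 + 0.05 at the median row: {effect:.3f}")
```

Here the effect equals `delta` times the median of `x1` (2.0), so it would change if the reference row changed. This is why a what-if at the median complements, rather than replaces, the averaged view that a partial dependence curve gives.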

In conclusion, we see how seamlessly we can blend machine learning pipelines with Gemini’s reasoning to make data science more interactive and insightful. We train, evaluate, and interpret a model, then ask Gemini to summarize findings, suggest improvements, and critique risks. Through this journey, we establish a workflow that enables us to achieve both predictive performance and interpretability, while also benefiting from having an AI collaborator in our data analysis process.



The post How to Build an End-to-End Data Science Workflow with Machine Learning, Interpretability, and Gemini AI Assistance? appeared first on MarkTechPost.
