Predicting with scikit-learn

In this notebook, you’ll train your first machine learning model using scikit-learn, one of the most widely used Python libraries for machine learning.


What You’ll Learn

  • What scikit-learn is and what it’s used for
  • The three-step sklearn workflow: prepare → fit → predict
  • How to train a linear regression model on a simple dataset
  • How to visualize your model’s predictions

Part 1: What is scikit-learn?

scikit-learn (imported as sklearn) is a Python library that gives you ready-to-use machine learning models.

It works alongside the tools you already know: Pandas organizes your data, Matplotlib visualizes it, and sklearn learns from it.

Every sklearn model follows the same three steps:

  1. Prepare: split your data into features (X) and labels (y)
  2. Fit: model.fit(X, y), where the model finds the pattern in your data
  3. Predict: model.predict(...), which uses the learned pattern on new data
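To see that the three steps really are the same from model to model, here is a minimal sketch. It uses toy data invented for this example (y is always 2 times x) and a second model type, DecisionTreeRegressor, which this notebook does not otherwise cover. The variable names are illustrative, not part of the notebook.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Step 1: Prepare. Toy data invented for this sketch (y is always 2 * x).
X = [[1], [2], [3], [4]]   # features: a table with one column
y = [2, 4, 6, 8]           # labels: one value per row of X

results = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X, y)                        # Step 2: Fit
    prediction = model.predict([[5]])[0]   # Step 3: Predict for x = 5
    results[type(model).__name__] = prediction
    print(f"{type(model).__name__}: {prediction:.1f}")
```

The two models may disagree about x = 5, because each one learns a different kind of pattern. The calls to fit and predict are identical, though, and that is the point of the shared workflow.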

Part 2: Our Dataset

We’ll start with a simple question: can a machine learn the relationship between hours spent studying and test scores?

Here’s data from a group of students:

import pandas as pd

hours  = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [45, 52, 58, 63, 70, 74, 80, 85, 88, 94]

df = pd.DataFrame({'hours_studied': hours, 'test_score': scores})
df

Features and Labels

Before training, we split our data into two parts:

  • X: the feature (what we know), hours studied
  • y: the label (what we want to predict), test score

We can pull these directly from our DataFrame. Notice we use double brackets df[['hours_studied']] for X — that keeps it as a table (one column), which is what sklearn expects:

X = df[['hours_studied']]   # double brackets → keeps it as a table
y = df['test_score']         # single brackets → one column of values (a pandas Series)

print("X — our feature (hours studied):")
print(X.head())
print("\ny — our label (test scores):")
print(y.values)
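If the difference between single and double brackets feels abstract, checking the type and shape of each result can help. The sketch below rebuilds a small DataFrame from the first three rows of the data above so that it runs on its own.

```python
import pandas as pd

# First three rows of the notebook's data, repeated so this cell is self-contained
df = pd.DataFrame({'hours_studied': [1, 2, 3], 'test_score': [45, 52, 58]})

X = df[['hours_studied']]   # double brackets: a DataFrame with rows and columns
y = df['test_score']        # single brackets: a Series with rows only

print(type(X).__name__, X.shape)   # DataFrame (3, 1)
print(type(y).__name__, y.shape)   # Series (3,)
```

The shape (3, 1) means 3 rows and 1 column. sklearn expects features in this two-dimensional layout, even when there is only one feature.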

Part 3: Training a Model

We’ll use Linear Regression — a model that finds the best straight line through your data points.

model.fit(X, y) is where the learning happens. The model figures out the slope and position of the line that best matches your data:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

print("Model trained!")
print(f"  Slope:     {model.coef_[0]:.2f}  (score increases by this much per hour)")
print(f"  Intercept: {model.intercept_:.2f}  (predicted score at 0 hours)")
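The slope and intercept are all a linear regression model needs to make a prediction: predicted score = slope × hours + intercept. As a quick check, the sketch below retrains the same model on the same data and compares a prediction computed by hand with the one from model.predict(). The names hours_to_check, by_hand, and by_model are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

hours  = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [45, 52, 58, 63, 70, 74, 80, 85, 88, 94]

X = pd.DataFrame({'hours_studied': hours})
model = LinearRegression().fit(X, scores)

slope = model.coef_[0]
intercept = model.intercept_

hours_to_check = 6
by_hand = slope * hours_to_check + intercept   # the line's equation
by_model = model.predict(pd.DataFrame({'hours_studied': [hours_to_check]}))[0]

print(f"By hand:  {by_hand:.2f}")
print(f"By model: {by_model:.2f}")
```

Both lines should print the same number, which shows that predict() is simply applying the line the model found during fit().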

Part 4: Making Predictions

Now that the model has learned the pattern, we can ask it to predict a score for any number of hours, including values it has never seen (such as 11 hours).

model.predict() takes a table of feature values (in the same one-column format as X) and returns one predicted score per row:

test_values = [3, 6, 11]   # 3 and 6 appear in the training data; 11 does not
test_hours = pd.DataFrame({'hours_studied': test_values})
predictions = model.predict(test_hours)

for h, p in zip(test_values, predictions):
    print(f"  {h:2d} hours studied  →  predicted score: {p:.1f}")

Part 5: Seeing the Model’s Line

The best way to understand a linear regression model is to see the line it drew.

The blue dots are the real data points. The red line is what the model learned:

import matplotlib.pyplot as plt

x_vals = list(range(0, 13))
x_line = pd.DataFrame({'hours_studied': x_vals})
y_line = model.predict(x_line)

plt.figure(figsize=(9, 5))
plt.scatter(df['hours_studied'], df['test_score'], color='steelblue', s=80, zorder=5, label='Actual scores')
plt.plot(x_vals, y_line, color='crimson', linewidth=2, label='Model prediction')
plt.title('Hours Studied vs Test Score')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

The model predicted a score above 100 for 11 hours. Does that seem realistic? What are the limits of using a straight line to model real-world data like this?
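One way to explore this question is to look at what the line predicts as the hours keep growing. The sketch below assumes test scores are limited to the range 0 to 100. The notebook does not state the scale, so treat that as an assumption. It compares the raw predictions with versions capped using NumPy's clip function.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

hours  = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [45, 52, 58, 63, 70, 74, 80, 85, 88, 94]
model = LinearRegression().fit(pd.DataFrame({'hours_studied': hours}), scores)

long_hours = [11, 12, 13, 14, 15]   # all beyond the range of the training data
raw = model.predict(pd.DataFrame({'hours_studied': long_hours}))
capped = np.clip(raw, 0, 100)       # assumption: scores cannot leave 0 to 100

for h, r, c in zip(long_hours, raw, capped):
    print(f"{h:2d} hours: raw prediction {r:.1f}, capped {c:.1f}")
```

Capping keeps the numbers valid, but it does not make the model smarter. A straight line keeps rising forever, while real scores level off, so predictions far outside the training data deserve extra caution.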

Predict Your Score

Use the slider to choose how many hours you studied, and see what the model predicts:

HOURS_STUDIED = 5 #@param {type:"slider", min:1, max:15, step:1}

prediction = model.predict(pd.DataFrame({'hours_studied': [HOURS_STUDIED]}))[0]
print(f"Hours studied: {HOURS_STUDIED}")
print(f"Predicted score: {prediction:.1f}")

Check Your Understanding

1. Multiple choice: In sklearn, what does model.fit(X, y) do?

  a. It creates a chart of the data
  b. It trains the model by learning patterns from the data
  c. It makes a prediction for new data
  d. It loads data from a CSV file

  Answer: b

2. Multiple choice: In machine learning, what do we call the value we are trying to predict?

  a. A feature
  b. A model
  c. A label
  d. A slope

  Answer: c

3. True or false: scikit-learn can only be used for linear regression.

  Answer: False