import pandas as pd
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [45, 52, 58, 63, 70, 74, 80, 85, 88, 94]
df = pd.DataFrame({'hours_studied': hours, 'test_score': scores})
df

Predicting with scikit-learn
In this notebook, you’ll train your first machine learning model using scikit-learn, one of the most widely used Python libraries for machine learning.
What You’ll Learn
- What scikit-learn is and what it’s used for
- The three-step sklearn workflow: prepare → fit → predict
- How to train a linear regression model on a simple dataset
- How to visualize your model’s predictions
Part 1: What is scikit-learn?
scikit-learn (imported as sklearn) is a Python library that gives you ready-to-use machine learning models.
It works alongside the tools you already know: Pandas organizes your data, Matplotlib visualizes it, and sklearn learns from it.
Every sklearn model follows the same three steps:
1. Prepare: split your data into features (X) and labels (y)
2. Fit: model.fit(X, y) finds the pattern in your data
3. Predict: model.predict(...) applies the learned pattern to new data
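The three steps above can be sketched on a tiny made-up dataset. The numbers here are hypothetical (each label is exactly twice its feature) and are not the notebook's student data; the names X_toy, y_toy, and toy_model are chosen only for this illustration.

```python
from sklearn.linear_model import LinearRegression

# Step 1: Prepare. Features go in a 2D structure (one inner list per row),
# labels in a flat list. These values are made up for illustration.
X_toy = [[1], [2], [3]]
y_toy = [2, 4, 6]

# Step 2: Fit. The model learns the pattern in the data (here, y = 2 * x).
toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# Step 3: Predict. Ask the model about a feature value it has not seen.
toy_prediction = toy_model.predict([[4]])[0]
print(f"Predicted value for x = 4: {toy_prediction:.1f}")
```

Because the toy labels lie exactly on a straight line, the prediction for 4 should land very close to 8. The rest of the notebook repeats the same three calls on the student data.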
Part 2: Our Dataset
We’ll start with a simple question: can a machine learn the relationship between hours spent studying and test scores?
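Here’s data from ten students: the hours each one studied and the test score each one earned. It lives in the DataFrame df created at the top of this notebook.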
Features and Labels
Before training, we split our data into two parts:
- X: the feature (what we know), hours studied
- y: the label (what we want to predict), test score
We can pull these directly from our DataFrame. Notice the double brackets in df[['hours_studied']] for X: they keep it as a table with one column (a DataFrame), which is the two-dimensional shape sklearn expects for features:
X = df[['hours_studied']] # double brackets → keeps it as a table (DataFrame)
y = df['test_score'] # single brackets → a single column of values (Series)
print("X — our feature (hours studied):")
print(X.head())
print("\ny — our label (test scores):")
print(y.values)

Part 3: Training a Model
We’ll use Linear Regression — a model that finds the best straight line through your data points.
model.fit(X, y) is where the learning happens. The model figures out the slope and the intercept (the height of the line at 0 hours) of the line that best matches your data:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
print("Model trained!")
print(f" Slope: {model.coef_[0]:.2f} (score increases by this much per hour)")
print(f" Intercept: {model.intercept_:.2f} (predicted score at 0 hours)")

Part 4: Making Predictions
Now that the model has learned the pattern, we can ask it to predict scores for any number of hours, including values that were not in the training data (such as 11).
model.predict() takes a table of feature values and returns one predicted score per row:
test_hours = pd.DataFrame({'hours_studied': [3, 6, 11]})
predictions = model.predict(test_hours)
for h, p in zip([3, 6, 11], predictions):
    print(f" {h:2d} hours studied → predicted score: {p:.1f}")

Part 5: Seeing the Model’s Line
The best way to understand a linear regression model is to see the line it drew.
The blue dots are the real data points. The red line is what the model learned:
import matplotlib.pyplot as plt
x_vals = list(range(0, 13))
x_line = pd.DataFrame({'hours_studied': x_vals})
y_line = model.predict(x_line)
plt.figure(figsize=(9, 5))
plt.scatter(df['hours_studied'], df['test_score'], color='steelblue', s=80, zorder=5, label='Actual scores')
plt.plot(x_vals, y_line, color='crimson', linewidth=2, label='Model prediction')
plt.title('Hours Studied vs Test Score')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

The model predicted a score above 100 for 11 hours. Does that seem realistic? What are the limits of using a straight line to model real-world data like this?
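One way to explore this question is to recompute the line by hand and cap its predictions at the highest possible score. The sketch below uses the standard least-squares formulas for a single feature in plain Python instead of sklearn. It assumes the test is scored from 0 to 100, which the notebook does not state, and predict_capped is a hypothetical helper written for this illustration.

```python
# Same data as the notebook's DataFrame.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [45, 52, 58, 63, 70, 74, 80, 85, 88, 94]

# Least-squares slope and intercept for one feature.
mean_h = sum(hours) / len(hours)
mean_s = sum(scores) / len(scores)
slope = sum((h - mean_h) * (s - mean_s) for h, s in zip(hours, scores)) / sum(
    (h - mean_h) ** 2 for h in hours
)
intercept = mean_s - slope * mean_h


def predict_capped(h, low=0.0, high=100.0):
    """Straight-line prediction, clamped to an assumed 0-100 score range."""
    raw = slope * h + intercept
    return min(max(raw, low), high)


# Compare raw and capped predictions inside and outside the training range.
for h in [10, 11, 15]:
    raw = slope * h + intercept
    print(f"{h:2d} hours: raw {raw:.1f}, capped {predict_capped(h):.1f}")
```

Capping hides the symptom rather than fixing the model. A straight line has no notion of a maximum score, so predictions far outside the 1 to 10 hour range of the training data deserve skepticism.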
Predict Your Score
Use the slider to choose how many hours you studied, and see what the model predicts:
HOURS_STUDIED = 5 #@param {type:"slider", min:1, max:15, step:1}
prediction = model.predict(pd.DataFrame({'hours_studied': [HOURS_STUDIED]}))[0]
print(f"Hours studied: {HOURS_STUDIED}")
print(f"Predicted score: {prediction:.1f}")

Check Your Understanding
{ "question_type": "multiple_choice", "question": "In sklearn, what does model.fit(X, y) do?", "options": [ { "key": "a", "text": "It creates a chart of the data" }, { "key": "b", "text": "It trains the model by learning patterns from the data" }, { "key": "c", "text": "It makes a prediction for new data" }, { "key": "d", "text": "It loads data from a CSV file" } ], "answer": "b", "submitted_answer": "" }
{ "question_type": "multiple_choice", "question": "In machine learning, what do we call the value we are trying to predict?", "options": [ { "key": "a", "text": "A feature" }, { "key": "b", "text": "A model" }, { "key": "c", "text": "A label" }, { "key": "d", "text": "A slope" } ], "answer": "c", "submitted_answer": "" }
{ "question_type": "true_false", "question": "scikit-learn can only be used for linear regression.", "answer": "False", "submitted_answer": "" }