Introducing Correlation

Does colder weather mean worse air?

In the previous notebook you found that Ulaanbaatar’s air quality follows a striking seasonal pattern: dangerous in winter, relatively clean in summer.

That raises an obvious question: is temperature the cause?

In this notebook you will:

- Build a scatter plot to look for a relationship between temperature and pollution
- Colour the data points by season to see whether the pattern holds year-round
- Calculate a correlation coefficient to put a number on the relationship
- Test whether other variables in the dataset also play a role

Step 1: Load the data

We start the same way as before — loading the dataset and importing our libraries.

import pandas as pd
import io
import httpx

url = "https://raw.githubusercontent.com/simonguest/codercub/main/labs/02/notebooks/ulaanbaatar_aqi_2019_2021.csv"

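response = httpx.get(url)
response.raise_for_status()  # stop here with a clear error if the download failed
df = pd.read_csv(io.StringIO(response.text))  # parse the downloaded CSV text into a DataFrame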

print(f"Loaded {len(df)} rows and {len(df.columns)} columns")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
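Before plotting, it can help to confirm that the columns you need are complete. The sketch below uses a tiny made-up table (the values are invented for illustration, not taken from the real CSV) with the same column names this notebook uses. On the real data you would apply the same checks to df.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: invented values, real column names.
sample = pd.DataFrame({
  "date": ["2019-01-01", "2019-01-02", "2019-07-01"],
  "temp_c": [-24.0, -21.5, 22.0],
  "pm25": [310.0, None, 18.0],
  "season": ["Winter", "Winter", "Summer"],
})

# Parse the date strings so min/max compare real dates rather than text.
sample["date"] = pd.to_datetime(sample["date"])

# Count missing values per column; .corr() ignores rows where either value is missing.
missing = sample[["temp_c", "pm25"]].isna().sum()
print(missing)
print(f"Date range: {sample['date'].min():%Y-%m-%d} to {sample['date'].max():%Y-%m-%d}")
```

If a column you plan to plot has many missing values, note that before you interpret any correlation calculated from it.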

Step 2: Your first scatter plot

A scatter plot puts one variable on the x-axis and another on the y-axis, with one dot per row of data. It is the fastest way to see whether two variables move together. We will put temperature on the x-axis and PM2.5 on the y-axis.

Before you run the cell, make a prediction: what shape do you expect to see?

Before you run the scatter plot: what do you predict it will look like? Do you think the dots will slope upward, downward, or be flat? Will they be tightly grouped or spread out?

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(9, 6))

ax.scatter(
  df['temp_c'],
  df['pm25'],
  alpha=0.3,       # transparency so overlapping points are visible
  s=12,            # dot size
  color='steelblue'
)

ax.set_xlabel('Temperature (°C)', fontsize=12)
ax.set_ylabel('PM2.5 (µg/m³)', fontsize=12)
ax.set_title('Temperature vs. PM2.5 — Ulaanbaatar 2019–2021', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

Look at the overall shape of the cloud of points.

Does it slope upward, downward, or is it flat?
Are the points tightly clustered, or spread out loosely?
Are there any days that don’t fit the pattern?

Describe what you see in the scatter plot. Does it slope up or down? Are the points tightly clustered or spread out? Were there any days that didn’t fit the pattern?

Step 3: Colour the points by season

The scatter plot shows a clear slope, but it mixes together data from all four seasons. Let’s colour each point by season so we can see whether the pattern is consistent throughout the year, or whether one season is driving the whole relationship.

season_colors = {
  'Winter': '#4477AA',
  'Spring': '#66BB6A',
  'Summer': '#FFA726',
  'Autumn': '#AB47BC'
}

fig, ax = plt.subplots(figsize=(9, 6))

for season, color in season_colors.items():
  subset = df[df['season'] == season]
  ax.scatter(
    subset['temp_c'],
    subset['pm25'],
    alpha=0.4,
    s=12,
    color=color,
    label=season
  )

ax.set_xlabel('Temperature (°C)', fontsize=12)
ax.set_ylabel('PM2.5 (µg/m³)', fontsize=12)
ax.set_title('Temperature vs. PM2.5 by Season', fontsize=13, fontweight='bold')
ax.legend(title='Season', fontsize=10)

plt.tight_layout()
plt.show()

Each season should now appear as its own coloured group of points.

Where do the Winter points sit on the chart?
Where do the Summer points sit?
Do Spring and Autumn overlap, or are they clearly separated?

Notice that the downward slope is not caused by one unusual season pulling the average down. Each season sits in its own region of the chart, and the overall slope emerges from the progression across all four.

Where do the Winter points appear on the chart compared to the Summer points? Do Spring and Autumn overlap, or are they clearly separated? What does the position of each season’s cluster tell you?
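One way to back up what you see in the clusters is to summarise each season by its average temperature and average PM2.5. This is a minimal sketch on invented numbers; only the column names temp_c, pm25 and season come from the dataset.

```python
import pandas as pd

# Invented values for illustration only; in the notebook you would use df.
sample = pd.DataFrame({
  "season": ["Winter", "Winter", "Spring", "Spring",
             "Summer", "Summer", "Autumn", "Autumn"],
  "temp_c": [-25.0, -18.0, 2.0, 8.0, 20.0, 24.0, 1.0, -5.0],
  "pm25": [280.0, 220.0, 60.0, 45.0, 15.0, 20.0, 70.0, 110.0],
})

# Average position of each season's cluster, ordered from coldest to warmest.
summary = (
  sample.groupby("season")[["temp_c", "pm25"]]
  .mean()
  .sort_values("temp_c")
)
print(summary.round(1))
```

Each row of the summary is roughly the centre of one coloured cluster on the chart.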

Step 4: Putting a number on the relationship

The scatter plot shows that colder temperatures tend to go with higher pollution. But how strong is that relationship exactly?

A correlation coefficient measures the strength and direction of the straight-line (linear) relationship between two variables. It always falls between -1 and +1:

-1 means a perfect inverse relationship: as one variable goes up, the other always goes down
0 means no linear relationship (the variables could still be related in a curved way)
+1 means a perfect direct relationship: both variables always move together
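To make the three reference values concrete, here is a small sketch with made-up numbers. It uses NumPy's np.corrcoef, which returns a table of correlations; the entry at [0, 1] is the coefficient between the two inputs. The third case is a U-shaped curve: the variables are clearly related, yet the coefficient comes out at essentially zero because there is no straight-line trend.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

perfect_direct = np.corrcoef(x, 2 * x + 1)[0, 1]    # y rises in lockstep with x
perfect_inverse = np.corrcoef(x, -3 * x + 5)[0, 1]  # y falls in lockstep with x
curved = np.corrcoef(x, x ** 2)[0, 1]               # U-shape: related, but not linearly

print(f"direct: {perfect_direct:.2f}")
print(f"inverse: {perfect_inverse:.2f}")
print(f"curved: {abs(curved):.2f}")
```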

In pandas, you can calculate it with a single line.

correlation = df['pm25'].corr(df['temp_c'])
print(f"Correlation between temperature and PM2.5: {correlation:.2f}")
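By default, .corr computes the Pearson correlation coefficient. If you are curious what that single line is doing, the sketch below writes the calculation out step by step on six invented readings (not values from the real dataset) and checks the result against pandas.

```python
import numpy as np
import pandas as pd

# Invented readings for illustration; not taken from the real dataset.
temp_c = np.array([-25.0, -15.0, -5.0, 5.0, 15.0, 25.0])
pm25 = np.array([260.0, 210.0, 120.0, 80.0, 30.0, 20.0])

# Pearson's r: how much the two variables vary together,
# scaled by how much each one varies on its own.
dx = temp_c - temp_c.mean()
dy = pm25 - pm25.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same calculation using the pandas one-liner.
r_pandas = pd.Series(pm25).corr(pd.Series(temp_c))
print(f"manual: {r_manual:.3f}, pandas: {r_pandas:.3f}")
```

The two numbers should agree, which also shows that the order of the two variables does not matter: correlating temperature with PM2.5 gives the same value as correlating PM2.5 with temperature.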

A value close to -0.85 is a strong negative correlation. In practical terms it means:

The relationship is reliable — cold days are almost always more polluted
But it is not perfect — temperature is not the only factor at play
Something else must explain the remaining variation

That “something else” is worth investigating. The dataset contains other variables that might also play a role, for example wind_speed_ms and pm10.
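A rough way to size the remaining variation: for a straight-line fit with a single predictor, the share of variance explained equals the square of the correlation. The arithmetic below uses the approximate value of -0.85 quoted above; the exact figure your notebook prints may differ slightly.

```python
r = -0.85  # approximate correlation quoted in this notebook

explained = r ** 2           # share of PM2.5 variation a straight line on temperature accounts for
unexplained = 1 - explained  # share left over for other factors and random noise

print(f"Explained: {explained:.0%}, unexplained: {unexplained:.0%}")
```

So even with a strong correlation, a noticeable share of the day-to-day variation in PM2.5 is not accounted for by temperature alone.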

Step 5: Going deeper

You’ve confirmed a strong overall correlation between temperature and PM2.5. Let’s run three targeted investigations to test the limits of that relationship.

Investigation 1 — Does wind speed matter?

The dataset also contains wind_speed_ms. You might expect that on windier days, pollution would be dispersed and PM2.5 would drop.

Before you run: predict whether wind speed will show a stronger or weaker correlation than temperature.

Write your prediction: will wind speed show a stronger, weaker, or similar correlation with PM2.5 compared to temperature? Why?

import matplotlib.pyplot as plt  # already imported in Step 2; repeated so this cell runs on its own

fig, ax = plt.subplots(figsize=(9, 6))

ax.scatter(df['wind_speed_ms'], df['pm25'], alpha=0.3, s=12, color='steelblue')
ax.set_xlabel('Wind Speed (m/s)', fontsize=12)
ax.set_ylabel('PM2.5 (µg/m³)', fontsize=12)
ax.set_title('Wind Speed vs. PM2.5 — Ulaanbaatar 2019–2021', fontsize=13, fontweight='bold')

corr = df['wind_speed_ms'].corr(df['pm25'])
ax.text(0.03, 0.93, f'Correlation: {corr:.2f}', transform=ax.transAxes,
    fontsize=11, bbox=dict(boxstyle='round,pad=0.4', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

The correlation is close to 0.00 — wind speed barely predicts PM2.5 at all.

This makes sense when you think about the source: coal burning for heating is a fixed daily need. Whether it is windy or calm outside, families still burn the same amount of coal to stay warm. The pollution is produced regardless, and wind does not reduce it enough to show up clearly in the data.

Was the wind speed result what you predicted? What does a near-zero correlation tell you about wind as a factor in Ulaanbaatar’s pollution?

Investigation 2 — Explore other variable pairs

Use the dropdowns to try different combinations of variables. A good starting point: how closely do PM2.5 and PM10 track each other?

X_COLUMN = "pm10"  #@param ["temp_c", "wind_speed_ms", "pm10"]
Y_COLUMN = "pm25"  #@param ["pm25", "pm10", "temp_c", "wind_speed_ms"]

fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(df[X_COLUMN], df[Y_COLUMN], alpha=0.3, s=12, color='steelblue')
ax.set_xlabel(X_COLUMN, fontsize=12)
ax.set_ylabel(Y_COLUMN, fontsize=12)
ax.set_title(f'{X_COLUMN} vs. {Y_COLUMN}', fontsize=13, fontweight='bold')

corr = df[X_COLUMN].corr(df[Y_COLUMN])
ax.text(0.03, 0.93, f'Correlation: {corr:.2f}', transform=ax.transAxes,
    fontsize=11, bbox=dict(boxstyle='round,pad=0.4', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

Which variable pair had the strongest correlation? Which had the weakest? Why do you think PM2.5 and PM10 are so closely linked?
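Instead of checking pairs one at a time, you can ask pandas for every pairwise correlation at once with DataFrame.corr(). The sketch below runs on a handful of invented rows (only the column names match the dataset), then ranks the other variables by how strongly they correlate with pm25.

```python
import pandas as pd

# Invented rows for illustration; in the notebook you would select these columns from df.
sample = pd.DataFrame({
  "temp_c": [-25.0, -12.0, 3.0, 18.0, 24.0, -20.0],
  "wind_speed_ms": [2.1, 4.0, 3.2, 2.5, 3.8, 3.0],
  "pm10": [330.0, 190.0, 95.0, 40.0, 35.0, 300.0],
  "pm25": [270.0, 150.0, 70.0, 22.0, 18.0, 240.0],
})

# Pairwise Pearson correlations between every numeric column.
matrix = sample.corr()

# Take the pm25 column, drop its correlation with itself,
# and sort by absolute size so strong negatives rank alongside strong positives.
ranked = matrix["pm25"].drop("pm25").abs().sort_values(ascending=False)

print(matrix.round(2))
print(ranked.round(2))
```

Sorting by absolute value matters here: a correlation of -0.85 is just as strong as one of +0.85, only in the opposite direction.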

Investigation 3 — Does the temperature correlation hold within a single season?

The overall correlation between temperature and PM2.5 is −0.85. But that is calculated across all four seasons at once — winter days are cold and polluted, summer days are warm and clean.

What happens if you only look at Winter days? Use the dropdown to filter to one season and watch what happens to the correlation.

SEASON = "All"  #@param ["All", "Winter", "Spring", "Summer", "Autumn"]

df_filtered = df if SEASON == "All" else df[df['season'] == SEASON]

fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(df_filtered['temp_c'], df_filtered['pm25'],
       alpha=0.3, s=12, color='steelblue')
ax.set_xlabel('Temperature (°C)', fontsize=12)
ax.set_ylabel('PM2.5 (µg/m³)', fontsize=12)
ax.set_title(f'Temperature vs. PM2.5 — {SEASON}', fontsize=13, fontweight='bold')

corr = df_filtered['temp_c'].corr(df_filtered['pm25'])
ax.text(0.03, 0.93, f'Correlation: {corr:.2f}  (n={len(df_filtered)})',
    transform=ax.transAxes, fontsize=11,
    bbox=dict(boxstyle='round,pad=0.4', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

What happened to the correlation when you switched to Winter only? Were you surprised by the result?

Within Winter, the correlation drops to nearly 0.00 — temperature stops being a useful predictor of PM2.5 on any given day.

This reveals something important: temperature is not causing the pollution directly. It is a marker for winter — and winter is when families burn coal for heating. Whether it is −10°C or −25°C on a particular day, the coal is burning either way. The real mechanism is the heating behaviour that winter brings, not the temperature itself.

This is a classic example of the difference between correlation and causation: two variables can move together strongly overall, while one is not actually causing the other.
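You can reproduce this effect with entirely synthetic data. In the sketch below, PM2.5 is generated independently of temperature within each season; the only link between them is that the "Winter" group is both colder and more polluted on average. All the numbers (means, spreads, sample size) are assumptions chosen for illustration, not estimates from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 500

# Within each season, pm25 is drawn without any reference to temp_c.
winter = pd.DataFrame({
  "season": "Winter",
  "temp_c": rng.normal(-20, 5, n),
  "pm25": rng.normal(200, 30, n),
})
summer = pd.DataFrame({
  "season": "Summer",
  "temp_c": rng.normal(20, 5, n),
  "pm25": rng.normal(30, 10, n),
})
fake = pd.concat([winter, summer], ignore_index=True)

# Correlation across both seasons combined.
overall = fake["temp_c"].corr(fake["pm25"])

# Correlation inside each season separately.
within = {
  season: group["temp_c"].corr(group["pm25"])
  for season, group in fake.groupby("season")
}

print(f"Overall: {overall:.2f}")
for season, r in within.items():
  print(f"{season} only: {r:.2f}")
```

The overall correlation comes out strongly negative while each within-season value sits near zero, even though temperature never influenced PM2.5 when the data was generated. The season did all the work.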

Step 6: What have you found?

You now have a specific, evidence-backed answer to today’s question.

Take a moment to write your conclusion:

What is the relationship between temperature and PM2.5 in Ulaanbaatar?
How confident are you in this conclusion, based on the correlation value?
What other variable in the dataset had the strongest relationship with PM2.5?
Is temperature the only cause of high winter pollution, or just a reliable indicator of it? What is the actual mechanism? (Hint: think about how people heat their homes in -30°C winters.)

Write your answers in the cell below — double-click to edit it.

Write your conclusions here.

Where this is heading

You have now done something that data scientists do every day: used a scatter plot and a correlation coefficient to identify which variable best predicts an outcome.

The next step — which you will explore in tomorrow’s session — is to let a machine learn that relationship automatically from the data, and then use it to make predictions on days it has never seen before.

The question will change from “does temperature predict pollution?” to “given today’s temperature, what PM2.5 level would a model predict for tomorrow?” That is the core idea behind linear regression, and it is also the core idea behind what the micro:bit will do with your movement data later this week.
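As a small preview, here is a sketch of what "learning the relationship" can look like: fitting a straight line to data and using it to predict. It uses NumPy's np.polyfit on synthetic data; the slope, intercept and noise level used to generate the data are invented for illustration, and tomorrow's session may use a different tool.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Synthetic data: PM2.5 falls as temperature rises, plus random noise.
temp_c = rng.uniform(-30, 25, 200)
pm25 = 120 - 4 * temp_c + rng.normal(0, 25, 200)

# Fit a straight line pm25 ≈ slope * temp_c + intercept by least squares.
slope, intercept = np.polyfit(temp_c, pm25, deg=1)

# Use the fitted line to predict PM2.5 at a temperature of -10 °C.
predicted = slope * -10 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.1f}, prediction at -10°C: {predicted:.0f}")
```

The fitted slope and intercept should land close to the values used to generate the data, which is the sense in which the line has "learned" the relationship.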

Check Your Understanding

{ "question_type": "multiple_choice", "question": "What does a correlation coefficient of -0.85 tell you?", "options": [ { "key": "a", "text": "No relationship between the two variables" }, { "key": "b", "text": "A weak positive relationship" }, { "key": "c", "text": "A strong negative relationship" }, { "key": "d", "text": "A perfect negative relationship" } ], "answer": "c", "submitted_answer": "" }

{ "question_type": "true_false", "question": "If two variables are strongly correlated, it proves that one causes the other.", "answer": "False", "submitted_answer": "" }

{ "question_type": "multiple_choice", "question": "In a scatter plot, what does each dot represent?", "options": [ { "key": "a", "text": "A monthly average" }, { "key": "b", "text": "A single day’s reading" }, { "key": "c", "text": "A full year of data" }, { "key": "d", "text": "A column in the dataset" } ], "answer": "b", "submitted_answer": "" }