In this notebook, we use a small transformer (Helsinki-NLP/opus-mt-fr-en) to translate from French to English.
Load model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
/Users/simon/Dev/CS-394/.venv/lib/python3.13/site-packages/transformers/models/marian/tokenization_marian.py:175: UserWarning: Recommended: pip install sacremoses.
warnings.warn("Recommended: pip install sacremoses.")
Tokenize
french_text = "Bonjour, comment allez-vous?"
input_ids = tokenizer.encode(french_text, return_tensors="pt")
print(input_ids[0])
print("Tokens:", tokenizer.convert_ids_to_tokens(input_ids[0]))
tensor([8703, 2, 1027, 5682, 21, 682, 54, 0])
Tokens: ['▁Bonjour', ',', '▁comment', '▁allez', '-', 'vous', '?', '</s>']
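The `▁` marker (U+2581) in the token list is the SentencePiece convention for "this token starts a new word". A minimal sketch of reassembling the surface text from the printed tokens, just to make the convention concrete (the real detokenizer is `tokenizer.decode`):

```python
# Tokens as printed above; '</s>' is the end-of-sequence marker.
tokens = ['▁Bonjour', ',', '▁comment', '▁allez', '-', 'vous', '?', '</s>']

# Drop the special token, glue the pieces, and turn '▁' back into spaces.
text = "".join(t for t in tokens if t != '</s>').replace('▁', ' ').strip()
print(text)  # Bonjour, comment allez-vous?
```

Note that `-` and `vous` carry no `▁`, so they attach directly to the preceding piece, which is how `allez-vous` is recovered without a space.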
# @title Demonstrate contextual vectors using the encoder
# French: "Bonjour , comment allez - vous ? </s>"
#             ↓    ↓     ↓      ↓    ↓   ↓   ↓    ↓
# Encoder:  [v1] [v2]  [v3]   [v4][v5][v6][v7] [v8]  ← 8 vectors, each 512-dim
#           └──────────────────────────────────────┘
encoder = model.get_encoder()
encoder_output = encoder(input_ids)
print("Encoder output shape:", encoder_output.last_hidden_state.shape)
print("Encoder output:", encoder_output)
Encoder output shape: torch.Size([1, 8, 512])
Encoder output: BaseModelOutput(last_hidden_state=tensor([[[-0.3943, 0.4660, 0.0190, ..., -0.5069, 0.2120, -0.3190],
[ 0.0957, 0.0780, 0.1918, ..., -0.0854, 0.2138, 0.1528],
[-0.6160, 0.0295, 0.1918, ..., -0.3886, 0.0770, 0.2311],
...,
[-0.1839, -0.3798, 0.1832, ..., -0.0041, -0.3633, -0.5455],
[ 0.0153, 0.0264, 0.1122, ..., 0.1966, -0.3027, -0.3659],
[-0.0484, 0.0147, 0.0078, ..., -0.1359, -0.0295, -0.0799]]],
grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)
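Each of the 8 positions in the encoder output is a 512-dimensional contextual vector. A common way to compare two such vectors is cosine similarity; here is a minimal sketch of the computation on toy 4-dim vectors (the values are made up, not taken from the real encoder output):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dim stand-ins for two contextual vectors.
v1 = [0.1, -0.4, 0.5, 0.2]
v2 = [0.1, -0.3, 0.4, 0.3]
print(round(cosine(v1, v2), 3))  # 0.972
```

On the real output you would compare rows of `encoder_output.last_hidden_state[0]`; because the vectors are contextual, the same word in different sentences yields different vectors.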
Generate the translation, then decode the output ids back to text
output_ids = model.generate(input_ids)  # run the full encoder-decoder to produce English token ids
english_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Translation:", english_text)
Translation: Hello, how are you?
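Under the hood, `model.generate` (in its default greedy mode) repeatedly asks the decoder for the most likely next token and stops at end-of-sequence. A minimal sketch of that loop, using a toy next-token function with hypothetical token ids in place of the real decoder:

```python
# Toy stand-in for the decoder: returns a canned "next token" for each
# prefix length, then the EOS id. The ids below are made up for illustration.
def toy_next_token(prefix):
    canned = [5034, 2, 199, 44, 37, 60, 0]
    return canned[len(prefix)] if len(prefix) < len(canned) else 0

EOS = 0
output_ids = []
while True:
    nxt = toy_next_token(output_ids)  # real code: argmax over decoder logits
    output_ids.append(nxt)
    if nxt == EOS:
        break

print(output_ids)  # [5034, 2, 199, 44, 37, 60, 0]
```

The real `generate` also supports beam search and sampling (e.g. `num_beams`, `do_sample`), which replace the single argmax step with richer search strategies.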