Fine-Tuning Transformer Models for Abstractive Text Summarization

Explored abstractive text summarization by fine-tuning the pre-trained transformer models T5-Small and BART-Large-XSum on the XSUM dataset, evaluating the results with ROUGE and BERTScore metrics

This project explored abstractive text summarization on the XSUM dataset, a challenging benchmark of roughly 204k news articles, each paired with a single-sentence summary. The objective was to evaluate and compare the performance of two pre-trained transformer-based models: T5-Small and BART-Large-XSum.
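To make the setup concrete, here is a minimal sketch of loading XSUM with the Hugging Face `datasets` library. This is illustrative rather than the project's exact code; depending on your `datasets` version, the loader may require `trust_remote_code=True`.

```python
from datasets import load_dataset

# XSUM ships with train/validation/test splits; each example pairs a
# "document" (a full BBC article) with a one-sentence "summary".
xsum = load_dataset("xsum")  # may need trust_remote_code=True on newer versions

print(xsum)                          # split sizes (~204k training examples)
print(xsum["train"][0]["summary"])   # peek at one reference summary
```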

First, I conducted exploratory data analysis on the dataset to understand its structure, characteristics, and challenges. Key insights included highly variable document lengths (mean: 2,202 characters) and more consistent summary lengths (mean: 125 characters).

The image on the left shows the distribution of document lengths, while the image on the right shows the distribution of summary lengths.
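The length statistics above can be reproduced with a short script along these lines (a sketch, assuming the `xsum` object from the loading snippet; bin counts and figure sizes are arbitrary):

```python
import matplotlib.pyplot as plt

# Character lengths of source documents and reference summaries.
doc_lens = [len(d) for d in xsum["train"]["document"]]
sum_lens = [len(s) for s in xsum["train"]["summary"]]

print(f"mean document length: {sum(doc_lens) / len(doc_lens):.0f} chars")
print(f"mean summary length:  {sum(sum_lens) / len(sum_lens):.0f} chars")

# Side-by-side histograms, mirroring the two distribution plots above.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(doc_lens, bins=50)
ax1.set(title="Document lengths", xlabel="characters")
ax2.hist(sum_lens, bins=50)
ax2.set(title="Summary lengths", xlabel="characters")
plt.tight_layout()
plt.show()
```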

I obtained a baseline using the T5-Small model trained on a 10k-document subset, then fine-tuned it further on 20k documents for 10 epochs (a sketch of the setup follows the results below). The T5-Small baseline achieved:

  • ROUGE-1: 0.2702 (10k subset) → 0.3200 (20k subset, 10 epochs)
  • BERTScore F1: 0.8795
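The fine-tuning setup looked roughly like the sketch below. The subset sizes and epoch count match the experiments described above; everything else (batch size, sequence lengths, validation subset) is illustrative rather than the exact configuration used.

```python
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(batch):
    # T5 is a text-to-text model, so the task is signalled with a prefix.
    inputs = tokenizer(["summarize: " + d for d in batch["document"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"],
                       max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# 20k-document training subset, as in the second experiment above.
train_ds = xsum["train"].select(range(20_000)).map(preprocess, batched=True)
val_ds = xsum["validation"].select(range(2_000)).map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",   # "eval_strategy" in newer transformers
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```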

To assess model performance and monitor for overfitting, I tracked training and validation loss across epochs for the T5-Small model at both dataset sizes (10k and 20k examples). The graph below shows the loss decreasing consistently over epochs, indicating effective learning. Notably, the validation loss closely tracks the training loss, suggesting good generalization without significant overfitting. This evaluation was instrumental in tuning the training parameters and understanding the model's learning dynamics.

Comparison of Training and Validation Loss across different dataset sizes and epochs for the T5-Small model
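The loss curves themselves can be pulled straight out of the trainer's log history; a sketch, assuming the `trainer` object from the fine-tuning snippet above:

```python
import matplotlib.pyplot as plt

# Training steps log "loss"; evaluation passes log "eval_loss".
history = trainer.state.log_history
train_pts = [(h["epoch"], h["loss"]) for h in history if "loss" in h]
val_pts = [(h["epoch"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train_pts), label="training loss")
plt.plot(*zip(*val_pts), label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("T5-Small training vs. validation loss")
plt.show()
```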

Following the baseline experiments, I conducted the final training run on the full XSUM dataset with the BART-Large-XSum model, leveraging dual NVIDIA A100 GPUs. This run was designed to fully utilize the model's capacity on large-scale data, targeting optimal abstractive summarization performance. It delivered notable improvements over the earlier experiments, reaching a ROUGE-1 score of 0.4281 and a ROUGE-L score of 0.3475.
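Scoring followed the usual recipe: generate summaries, then compare them against the references with ROUGE and BERTScore. A hedged sketch using the `evaluate` library and the public `facebook/bart-large-xsum` checkpoint (the sample size, generation length, and device are illustrative):

```python
import evaluate
from transformers import pipeline

summarizer = pipeline("summarization",
                      model="facebook/bart-large-xsum", device=0)

# Score a small test sample; the numbers reported above used the full run.
docs = xsum["test"]["document"][:100]
refs = xsum["test"]["summary"][:100]
preds = [out["summary_text"]
         for out in summarizer(docs, max_length=64, truncation=True)]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
print(rouge.compute(predictions=preds, references=refs))
print(bertscore.compute(predictions=preds, references=refs, lang="en"))
```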

The figure below compares ROUGE scores across the different models and training configurations, showing how performance improved as dataset size and model capacity increased.

Comparison of ROUGE Scores for Different Models and Training Configurations

In addition to the quantitative evaluation, I performed a qualitative analysis comparing summaries generated by T5-Small and BART-Large-XSum against the reference summaries. This revealed that while BART-Large-XSum produced more accurate and contextually aligned summaries, occasional inaccuracies and over-generalizations persisted in challenging cases.
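The qualitative pass was essentially a side-by-side reading of generated and reference summaries, along these lines (assuming `preds` and `refs` from the evaluation sketch above):

```python
# Print a handful of generated/reference pairs for manual inspection.
for ref, pred in zip(refs[:5], preds[:5]):
    print("REFERENCE:", ref)
    print("GENERATED:", pred)
    print("-" * 80)
```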

The code, notebooks, and final project report with a more detailed explanation can be found in this GitHub repository.