Breaking the Agglutinative Barrier: Custom NLP Pipelines for Turkish Summarization

How custom tokenization, Parquet data streams, and hardware-specific training optimizations unlocked Google Pegasus fine-tuning on consumer-grade GPUs.

In the landscape of Natural Language Processing (NLP), English-centric tokenizers (like WordPiece or Byte Pair Encoding) default to space-splitting and character frequency merges. While this works well for analytical languages, it struggles with agglutinative languages.

Agglutinative languages build words by attaching multiple suffixes to a root (e.g. kodlaştırdıklarımızdandır, meaning "from those whom we are making code"). Greedy space-based tokenizers fragment these words into meaningless characters or trigger vocabulary expansion.

During my AI engineering internship at Giresun University's IT Department, I built a high-performance Turkish text summarization system. By curating a 10,000-article custom dataset, configuring a SentencePiece tokenizer, and implementing Parquet pipelines, we achieved a 50% training speedup and a 5% accuracy increase on consumer-grade hardware.

Here is the engineering blueprint of the pipeline.

1. Custom Tokenization: BPE vs. Unigram

Standard tokenizers merge character sequences based on raw frequency. For morphologically rich languages, this destroys the semantic boundary between roots and suffixes.

To resolve this, we trained a custom tokenizer using Google’s SentencePiece library. SentencePiece treats input text as a raw character stream, eliminating the space-splitting constraint:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="turkish_corpus.txt",
    model_prefix="tr_tokenizer",
    vocab_size=32000,
    model_type='unigram', # Tested against 'bpe'
    character_coverage=0.9995,
    max_sentence_length=2048,
    normalization_rule_name='nmt_nfkc_cf'
)

The Mathematics of Unigram Tokenization

Unlike Byte Pair Encoding (BPE), which builds the vocabulary bottom-up by merging characters, the Unigram model works top-down by pruning. It starts with a massive vocabulary $\mathcal{V}$ containing all characters and millions of words/phrases from the corpus, and iteratively shrinks it.

The algorithm defines a probabilistic language model where the probability of a token sequence $x_1, x_2, \dots, x_n$ is:

$P(X) = \prod_{i=1}^n P(x_i)$

For a sentence $s$ , the set of all valid token segmentations is $S(s)$ . The marginal likelihood of the corpus $\mathcal{D}$ is:

$\mathcal{L} = \sum_{s \in \mathcal{D}} \log \left( \sum_{X \in S(s)} P(X) \right)$

During training, we run an Expectation-Maximization (EM) loop:

Expectation: Compute the most likely segmentations (using the Viterbi algorithm) for all sentences given the current token probabilities $P(x)$ .
Maximization: Re-estimate $P(x)$ based on the frequency of occurrences in the parsed corpus.
Pruning: For each token $x \in \mathcal{V}$ , calculate the loss in corpus likelihood $\mathcal{L} - \mathcal{L}_{\setminus x}$ if $x$ were removed. Prune the bottom 10–20% of tokens with the lowest loss.
Repeat until $|\mathcal{V}|$ reaches the target size (e.g., 32,000).

The Verdict: BPE vs. Unigram

Byte Pair Encoding (BPE): Showed faster initial convergence on short training runs (10-50 iterations). Because it keeps merged sub-words long, it is mathematically simpler to map on small datasets.
Unigram: Outperformed BPE on longer runs (100+ epochs). Unigram evaluates multiple tokenization splits probabilistically. This allows it to separate roots from their suffixes, helping the model learn the grammatical meanings of Turkish suffixes as training progresses.

2. Zero-Bottleneck Data Pipelines: CSV vs. Parquet

Text-based CSV files require CPU cycles for string deserialization during iteration, causing GPU starvation.

To keep the GPU operating at maximum capacity, we migrated our dataset to Apache Parquet. Parquet's columnar binary storage permits near-zero database I/O latency.

Furthermore, we loaded this data into Hugging Face's Rust-backed PegasusTokenizerFast to execute batch tokenization concurrently across CPU cores:

from transformers import PegasusTokenizerFast
from datasets import Dataset

# Load binary Parquet data
dataset = Dataset.from_parquet("articles.parquet")
tokenizer = PegasusTokenizerFast.from_pretrained("./tr_tokenizer")

def tokenize_fn(batch):
    return tokenizer(batch["article"], truncation=True, max_length=512)

# Tokenize in parallel using Rust backend
tokenized = dataset.map(tokenize_fn, batched=True, num_proc=4)

Columnar Performance Comparison

In a row-oriented format like CSV, if our model only needs the article and summary columns, the OS must still load the metadata columns (author, date, URL) into RAM, wasting I/O bandwidth.

Parquet organizes data by column blocks. The binary parser skips metadata columns entirely, resulting in sub-millisecond data load speeds.

3. GPU Memory Optimization: Fused Optimizers

Fine-tuning Google's Pegasus model on a single 16GB GPU (NVIDIA RTX A4000 or Apple Silicon M2 MPS) easily triggers Out-of-Memory (OOM) crashes. To scale training, we implemented three key optimizations:

Dynamic Batching: Used Hugging Face's DataCollatorForSeq2Seq to pad sequences to the maximum length within each specific batch rather than a fixed global limit, saving memory.
Mixed Precision (bf16): Converted training parameters to Brain Floating Point 16-bit to preserve dynamic range while cutting VRAM consumption in half.
Fused AdamW: Used optim="adamw_torch_fused" on CUDA, grouping model updates into a single GPU kernel call, reducing execution bottlenecks.

training_args = Seq2SeqTrainingArguments(
    output_dir="./pegasus_tr",
    per_device_train_batch_size=128,
    bf16=True,
    optim="adamw_torch_fused",
    learning_rate=5e-4,
    num_train_epochs=1000,
    save_strategy="best"
)

4. Inference Generation Tuning: Beam Search

At inference time, decoding token probabilities determines the quality of the generated summary. If we use greedy decoding (selecting the highest probability token at each step), the summary becomes repetitive and morphologically incorrect.

We implemented Beam Search with length penalties to balance accuracy and generation flow:

# Inference decoding configuration
summary_ids = model.generate(
    input_ids,
    num_beams=8,               # Keeps the top 8 sequence hypotheses active
    length_penalty=0.8,        # Penalizes excessively long sentences (alpha parameter)
    no_repeat_ngram_size=3,    # Prevents 3-word loops
    min_length=30,
    max_length=128,
    early_stopping=True
)

Beam Search maintains $B$ (beam size) parallel hypotheses at each generation step. The length penalty scales the sequence score by:

$s(Y) = \frac{\log P(Y)}{(5 + |Y|)^\alpha / (5 + 1)^\alpha}$

Setting $\alpha = 0.8$ encourages the model to generate concise, grammatically correct Turkish summaries.

Key Performance Metrics

By combining these optimizations, our pipeline achieved:

File Size Reduction: Parquet compressed the text dataset from 42MB to 11MB Snappy binaries.
Pipeline Acceleration: Cut overall model training time by 50%, keeping GPU utilization at 98%.
Model Quality: Reached a ROUGE-1 score of 22.2 on a compact, resource-optimized Pegasus architecture.