Breaking the Agglutinative Barrier: Custom NLP Pipelines for Turkish Summarization
How custom tokenization, Parquet data streams, and training optimizations improve Turkish summarization pipelines.
How custom tokenization, Parquet data streams, and hardware-specific training optimizations unlocked Google Pegasus fine-tuning on consumer-grade GPUs.
In the landscape of Natural Language Processing (NLP), English-centric tokenizers (like WordPiece or Byte Pair Encoding) default to space-splitting and character frequency merges. While this works well for analytical languages, it struggles with agglutinative languages.
Agglutinative languages build words by attaching multiple suffixes to a root (e.g. kodlaştırdıklarımızdandır, meaning "from those whom we are making code"). Greedy space-based tokenizers fragment these words into meaningless characters or trigger vocabulary expansion.
During my AI engineering internship at Giresun University's IT Department, I built a high-performance Turkish text summarization system. By curating a 10,000-article custom dataset, configuring a SentencePiece tokenizer, and implementing Parquet pipelines, we achieved a 50% training speedup and a 5% accuracy increase on consumer-grade hardware.
Here is the engineering blueprint of the pipeline.
1. Custom Tokenization: BPE vs. Unigram
Standard tokenizers merge character sequences based on raw frequency. For morphologically rich languages, this destroys the semantic boundary between roots and suffixes.
To resolve this, we trained a custom tokenizer using Google’s SentencePiece library. SentencePiece treats input text as a raw character stream, eliminating the space-splitting constraint:
import sentencepiece as spm
spm.SentencePieceTrainer.train(
input="turkish_corpus.txt",
model_prefix="tr_tokenizer",
vocab_size=32000,
model_type='unigram', # Tested against 'bpe'
character_coverage=0.9995,
max_sentence_length=2048,
normalization_rule_name='nmt_nfkc_cf'
)
The Mathematics of Unigram Tokenization
Unlike Byte Pair Encoding (BPE), which builds the vocabulary bottom-up by merging characters, the Unigram model works top-down by pruning. It starts with a massive vocabulary containing all characters and millions of words/phrases from the corpus, and iteratively shrinks it.
The algorithm defines a probabilistic language model where the probability of a token sequence is:
For a sentence , the set of all valid token segmentations is . The marginal likelihood of the corpus is:
During training, we run an Expectation-Maximization (EM) loop:
- Expectation: Compute the most likely segmentations (using the Viterbi algorithm) for all sentences given the current token probabilities .
- Maximization: Re-estimate based on the frequency of occurrences in the parsed corpus.
- Pruning: For each token , calculate the loss in corpus likelihood if were removed. Prune the bottom 10–20% of tokens with the lowest loss.
- Repeat until reaches the target size (e.g., 32,000).
The Verdict: BPE vs. Unigram
- Byte Pair Encoding (BPE): Showed faster initial convergence on short training runs (10-50 iterations). Because it keeps merged sub-words long, it is mathematically simpler to map on small datasets.
- Unigram: Outperformed BPE on longer runs (100+ epochs). Unigram evaluates multiple tokenization splits probabilistically. This allows it to separate roots from their suffixes, helping the model learn the grammatical meanings of Turkish suffixes as training progresses.
2. Zero-Bottleneck Data Pipelines: CSV vs. Parquet
Text-based CSV files require CPU cycles for string deserialization during iteration, causing GPU starvation.
To keep the GPU operating at maximum capacity, we migrated our dataset to Apache Parquet. Parquet's columnar binary storage permits near-zero database I/O latency.
Furthermore, we loaded this data into Hugging Face's Rust-backed PegasusTokenizerFast to execute batch tokenization concurrently across CPU cores:
from transformers import PegasusTokenizerFast
from datasets import Dataset
# Load binary Parquet data
dataset = Dataset.from_parquet("articles.parquet")
tokenizer = PegasusTokenizerFast.from_pretrained("./tr_tokenizer")
def tokenize_fn(batch):
return tokenizer(batch["article"], truncation=True, max_length=512)
# Tokenize in parallel using Rust backend
tokenized = dataset.map(tokenize_fn, batched=True, num_proc=4)
Columnar Performance Comparison
In a row-oriented format like CSV, if our model only needs the article and summary columns, the OS must still load the metadata columns (author, date, URL) into RAM, wasting I/O bandwidth.
Parquet organizes data by column blocks. The binary parser skips metadata columns entirely, resulting in sub-millisecond data load speeds.
3. GPU Memory Optimization: Fused Optimizers
Fine-tuning Google's Pegasus model on a single 16GB GPU (NVIDIA RTX A4000 or Apple Silicon M2 MPS) easily triggers Out-of-Memory (OOM) crashes. To scale training, we implemented three key optimizations:
- Dynamic Batching: Used Hugging Face's
DataCollatorForSeq2Seqto pad sequences to the maximum length within each specific batch rather than a fixed global limit, saving memory. - Mixed Precision (
bf16): Converted training parameters to Brain Floating Point 16-bit to preserve dynamic range while cutting VRAM consumption in half. - Fused AdamW: Used
optim="adamw_torch_fused"on CUDA, grouping model updates into a single GPU kernel call, reducing execution bottlenecks.
training_args = Seq2SeqTrainingArguments(
output_dir="./pegasus_tr",
per_device_train_batch_size=128,
bf16=True,
optim="adamw_torch_fused",
learning_rate=5e-4,
num_train_epochs=1000,
save_strategy="best"
)
4. Inference Generation Tuning: Beam Search
At inference time, decoding token probabilities determines the quality of the generated summary. If we use greedy decoding (selecting the highest probability token at each step), the summary becomes repetitive and morphologically incorrect.
We implemented Beam Search with length penalties to balance accuracy and generation flow:
# Inference decoding configuration
summary_ids = model.generate(
input_ids,
num_beams=8, # Keeps the top 8 sequence hypotheses active
length_penalty=0.8, # Penalizes excessively long sentences (alpha parameter)
no_repeat_ngram_size=3, # Prevents 3-word loops
min_length=30,
max_length=128,
early_stopping=True
)
Beam Search maintains (beam size) parallel hypotheses at each generation step. The length penalty scales the sequence score by:
Setting encourages the model to generate concise, grammatically correct Turkish summaries.
Key Performance Metrics
By combining these optimizations, our pipeline achieved:
- File Size Reduction: Parquet compressed the text dataset from 42MB to 11MB Snappy binaries.
- Pipeline Acceleration: Cut overall model training time by 50%, keeping GPU utilization at 98%.
- Model Quality: Reached a ROUGE-1 score of 22.2 on a compact, resource-optimized Pegasus architecture.