Ch 5 — Sequence Labeling

POS tagging, named entity recognition, and assigning labels to every token
High-level pipeline: Tokens → Embed → Encode → CRF → Tags → Evaluate
What Is Sequence Labeling?
One label per token — a fundamentally different task from classification
The Task
While text classification assigns one label per document, sequence labeling assigns one label per token: given a sentence, predict a tag for every word. The three core sequence labeling tasks are:

- Part-of-Speech (POS) tagging — is each word a noun, verb, adjective, etc.?
- Named Entity Recognition (NER) — which words are person names, organizations, locations, dates?
- Chunking — which words form noun phrases, verb phrases, etc.?

Sequence labeling is the backbone of information extraction. Every time a search engine highlights a person's name, a chatbot extracts a date from your message, or a medical system identifies drug names in clinical notes, sequence labeling is at work. The key challenge: labels depend on neighboring labels. In "New York City", "York" is part of a location only because "New" precedes it.
Sequence Labeling Tasks
POS Tagging:
  "The cat sat on the mat"
   DET NOUN VERB ADP DET NOUN

Named Entity Recognition:
  "Barack Obama visited New York"
   B-PER I-PER O B-LOC I-LOC

Chunking:
  "[The cat] [sat] [on the mat]"
   NP VP PP

Key difference from classification:
- Classification: 1 label per document
- Sequence labeling: 1 label per token
- Labels are interdependent
Key insight: The critical difference from classification is label dependencies. A B-PER tag is very likely followed by I-PER, never by I-LOC. Models that ignore these dependencies (like independent per-token classifiers) produce incoherent tag sequences.
The IOB Tagging Format
How to encode entity boundaries at the token level
IOB Encoding
Multi-word entities like "New York City" need a way to mark where entities begin and continue. The IOB format (Inside-Outside-Beginning) solves this: B-TYPE marks the first token of an entity, I-TYPE marks continuation tokens, and O marks tokens outside any entity. So "New York City" is tagged B-LOC I-LOC I-LOC. Without IOB, you can't distinguish "New York" (one entity) from "New" and "York" (two separate entities). The extended BIOES format adds S (single-token entity) and E (end of entity) for finer boundaries. IOB-2 (the most common variant) requires every entity to start with B, even if it's not adjacent to another entity of the same type. This format is the universal standard for NER annotation and evaluation.
IOB Tagging Example
IOB-2 format:
  "Barack Obama visited New York City"
   B-PER I-PER O B-LOC I-LOC I-LOC

- B = Beginning of entity
- I = Inside (continuation)
- O = Outside (not an entity)

Why B matters:
  "John Smith met Jane Doe"
   B-PER I-PER O B-PER I-PER
  Each entity starts with B; when two same-type entities are adjacent, the B tag is the only marker of the boundary between them.

BIOES variant:
  B = Begin, I = Inside, O = Outside, E = End, S = Single-token entity
  "in London" → O S-LOC
  "in New York" → O B-LOC E-LOC
Key insight: IOB tagging converts a span-level problem (find entity boundaries) into a token-level problem (classify each token). This is what makes sequence labeling models applicable to entity recognition.
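The span-to-token conversion the key insight describes can be sketched in a few lines. This is a minimal illustration (the function name and span format are my own, not from the chapter):

```python
def spans_to_iob2(tokens, spans):
    """Convert word-index entity spans into IOB-2 tags.

    `spans` is a list of (start, end, type) tuples over token indices,
    with `end` exclusive -- e.g. (3, 6, "LOC") covers tokens 3..5.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"      # continuation tokens
    return tags

tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
spans = [(0, 2, "PER"), (3, 6, "LOC")]
print(spans_to_iob2(tokens, spans))
# → ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
```

Once spans are in this form, any token classifier can be trained on them, which is exactly the reduction the insight describes.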
Hidden Markov Models (HMMs)
The first probabilistic sequence labeler — still conceptually important
How HMMs Work
Hidden Markov Models were the dominant approach to POS tagging and NER before neural methods. An HMM models two probability distributions: transition probabilities — the probability of one tag following another (P(NOUN | DET) is high, P(DET | DET) is low), and emission probabilities — the probability of a word given a tag (P("cat" | NOUN) is high, P("cat" | VERB) is low). The "hidden" states are the tags (unobserved), and the "observations" are the words. The Viterbi algorithm efficiently finds the most likely tag sequence by dynamic programming, avoiding the exponential cost of checking all possible sequences. HMMs dominated POS tagging for decades, achieving ~96% accuracy. Their limitation: they only look at the current word and previous tag, missing longer-range context.
HMM Components
Transition probabilities:
  P(NOUN | DET) = 0.45   (high)
  P(VERB | NOUN) = 0.25
  P(DET | DET) = 0.01    (low)

Emission probabilities:
  P("cat" | NOUN) = 0.003
  P("cat" | VERB) = 0.0001
  P("the" | DET) = 0.15

Viterbi algorithm:
  Find argmax P(tags | words) by dynamic programming in O(T × N²),
  where T = sentence length and N = tag count.

POS tagging accuracy: ~96%
Limitation: only sees the current word and the previous tag (the Markov assumption).
Key insight: HMMs introduced the core idea that tag sequences have structure — some tag transitions are likely, others are nearly impossible. Every modern sequence labeling model builds on this insight, even if they use different mechanisms.
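The Viterbi recursion described above fits in a short function. This is a minimal sketch assuming log-probabilities stored in nested dicts (the data layout and toy numbers are my own, not from the chapter):

```python
import math

def viterbi(words, tags, log_trans, log_emit, log_init):
    """Most likely tag sequence under an HMM, in O(T * N^2) time.

    log_trans[s][t] = log P(tag t | previous tag s)
    log_emit[t][w]  = log P(word w | tag t)   (missing words -> -inf)
    log_init[t]     = log P(tag t at position 0)
    """
    best = {t: log_init[t] + log_emit[t].get(words[0], -math.inf) for t in tags}
    back = []  # back[i][t] = best previous tag for tag t at position i+1
    for w in words[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            # best previous tag for each current tag (emission is constant in s)
            prev = max(tags, key=lambda s: best[s] + log_trans[s][t])
            new_best[t] = best[prev] + log_trans[prev][t] + log_emit[t].get(w, -math.inf)
            ptr[t] = prev
        best = new_best
        back.append(ptr)
    # backtrace from the best final tag
    t = max(tags, key=lambda s: best[s])
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return path[::-1]

# Toy model for "the cat sat" (made-up probabilities for illustration)
lg = math.log
tags = ["DET", "NOUN", "VERB"]
log_init = {"DET": lg(0.8), "NOUN": lg(0.15), "VERB": lg(0.05)}
log_trans = {
    "DET":  {"DET": lg(0.01), "NOUN": lg(0.9),  "VERB": lg(0.09)},
    "NOUN": {"DET": lg(0.1),  "NOUN": lg(0.3),  "VERB": lg(0.6)},
    "VERB": {"DET": lg(0.5),  "NOUN": lg(0.3),  "VERB": lg(0.2)},
}
log_emit = {
    "DET":  {"the": lg(0.5)},
    "NOUN": {"cat": lg(0.05), "mat": lg(0.04)},
    "VERB": {"sat": lg(0.03)},
}
print(viterbi(["the", "cat", "sat"], tags, log_trans, log_emit, log_init))
# → ['DET', 'NOUN', 'VERB']
```

Note the dynamic-programming structure: at each position we keep only the best score per tag, which is what avoids enumerating all N^T possible sequences.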
Conditional Random Fields (CRFs)
The discriminative upgrade — richer features, better accuracy
CRFs vs HMMs
Conditional Random Fields (Lafferty et al., 2001) improved on HMMs by modeling P(tags | words) directly (discriminative) rather than the joint P(tags, words) (generative). This seemingly small change has a huge practical impact: CRFs can use arbitrary overlapping features without worrying about independence assumptions. You can simultaneously include features like "is the word capitalized?", "does it end in -ing?", "is the previous word 'the'?", and "does it appear in a gazetteer?". CRFs are also globally normalized over the entire tag sequence, so decoding optimizes the whole sequence jointly rather than making independent per-token decisions. They became the standard for NER, achieving F1 scores of 85–90% on the CoNLL-2003 benchmark, and they remain important today as the output layer in neural sequence labeling models (BiLSTM-CRF).
CRF Features
Feature functions for NER:
  f1: word is capitalized       & tag = B-PER
  f2: previous tag = B-PER      & tag = I-PER
  f3: word in person gazetteer  & tag = B-PER
  f4: word ends in "-tion"      & tag = O
  f5: word is all digits        & tag = B-DATE
  f6: previous word = "Mr."     & tag = B-PER

CRF advantages over HMM:
- Discriminative: models P(tags | words)
- Arbitrary overlapping features
- No independence assumptions
- Global sequence optimization

CoNLL-2003 NER F1 scores:
  HMM: ~85%
  CRF: ~89%
  CRF + gazetteers: ~91%
Key insight: CRFs showed that feature engineering + global sequence modeling is a powerful combination. The CRF's ability to enforce valid tag sequences (no I-PER after B-LOC) is so valuable that it survives as a layer in neural models.
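Feature functions like f1–f6 above are typically produced by a per-token feature extractor. The sketch below (names and feature choices are my own; libraries such as sklearn-crfsuite consume dicts in roughly this shape) shows the kind of overlapping features a CRF can use freely:

```python
def token_features(tokens, i):
    """Overlapping features for token i -- usable by a CRF precisely
    because it makes no independence assumptions between features."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),   # capitalization cue (B-PER, B-LOC, ...)
        "word.isdigit": w.isdigit(),   # digit cue (dates, amounts)
        "suffix3": w[-3:],             # morphological cue, e.g. "-ing", "-tion"
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Mr.", "Smith", "visited", "Paris"]
print(token_features(tokens, 1))
```

Here "Smith" gets both the capitalization feature and the "previous word is 'Mr.'" feature at once, exactly the kind of overlap an HMM's independence assumptions forbid.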
BiLSTM-CRF
The neural architecture that dominated NER for five years
Architecture
The BiLSTM-CRF (Huang et al., 2015) combines the best of neural and structured prediction. A bidirectional LSTM reads the sentence in both directions, producing context-rich representations for each token that capture both left and right context. These representations are fed into a CRF layer that models tag dependencies and produces the globally optimal tag sequence. The BiLSTM replaces hand-crafted features with learned representations, while the CRF ensures valid tag sequences. Adding character-level CNNs (BiLSTM-CNN-CRF) captures morphological patterns like capitalization and suffixes, achieving 91.18% F1 on CoNLL-2003 NER without any hand-crafted features. This architecture was the state of the art from 2015 to 2018, when BERT-based models surpassed it.
BiLSTM-CRF Architecture
Input: "Barack Obama visited London"

Layer 1: Embeddings
  Word embeddings (GloVe/Word2Vec) + character-CNN embeddings

Layer 2: BiLSTM
  Forward LSTM:  → → → →
  Backward LSTM: ← ← ← ←
  Concatenate: [forward; backward]

Layer 3: CRF
  Score all possible tag sequences, pick the best one (Viterbi)

Output: B-PER I-PER O B-LOC
CoNLL-2003 F1: 91.18% (no hand-crafted features!)
Key insight: BiLSTM-CRF demonstrated that neural feature learning + structured prediction is strictly better than either alone. The BiLSTM learns what features matter; the CRF ensures the output is coherent.
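One concrete way to see what the CRF layer contributes on top of the BiLSTM: its transition scores can make structurally invalid tag pairs (like I-PER after B-LOC) effectively impossible. A minimal IOB-2 validity check, sketched here as plain Python rather than learned transition weights (function name is my own):

```python
def is_valid_iob2_transition(prev_tag, tag):
    """IOB-2 constraint a CRF transition matrix effectively enforces:
    I-X may only follow B-X or I-X of the same entity type X."""
    if not tag.startswith("I-"):
        return True                     # O and B-* may follow anything
    etype = tag[2:]
    return prev_tag in (f"B-{etype}", f"I-{etype}")

print(is_valid_iob2_transition("B-PER", "I-PER"))  # True
print(is_valid_iob2_transition("B-LOC", "I-PER"))  # False
print(is_valid_iob2_transition("O", "I-LOC"))      # False
```

An independent per-token classifier has no mechanism for this check; a CRF bakes it into decoding, which is why the output sequence is always coherent.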
Transformer-Based Sequence Labeling
BERT for NER — pre-trained representations meet token classification
BERT for NER
BERT-based NER fine-tunes a pre-trained transformer for token classification. Each token's contextual representation from BERT is passed through a linear classification layer that predicts the IOB tag. Some implementations add a CRF layer on top for sequence coherence. BERT's advantage is its deep bidirectional context — it considers the entire sentence when representing each token, capturing long-range dependencies that BiLSTMs struggle with. On CoNLL-2003, BERT-based models achieve 92–93% F1, surpassing BiLSTM-CRF. For domain-specific NER (biomedical, legal, financial), domain-adapted models like BioBERT, LegalBERT, and FinBERT further improve performance by pre-training on domain text. The trade-off: BERT models are 10–100x more expensive to train and run than BiLSTM-CRF.
BERT NER Pipeline
Architecture:
  Input tokens → BERT encoder → token representations (768-dim)
  → linear layer (768 → num_tags) → (optional) CRF layer → IOB tag predictions

CoNLL-2003 F1 progression:
  CRF (2003): ~89%
  BiLSTM-CRF (2015): ~91%
  BERT (2018): ~92.8%
  RoBERTa (2019): ~93.2%

Domain-specific models:
  BioBERT: biomedical NER
  SciBERT: scientific text
  LegalBERT: legal documents
  FinBERT: financial text

Cost: 10–100x more than BiLSTM-CRF
Key insight: For NER, the jump from BiLSTM-CRF to BERT is smaller than the jump from CRF to BiLSTM-CRF. Pre-trained transformers help most when labeled data is scarce — their general language knowledge compensates for limited task-specific examples.
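A practical detail when fine-tuning BERT for token classification: labels are defined per word, but BERT tokenizes into subwords. A common convention is to label only the first subword and mask the rest from the loss (e.g. index -100 for PyTorch's cross-entropy). A sketch with a stand-in tokenizer (the helper names and toy splitter are mine, not a real WordPiece tokenizer):

```python
IGNORE = -100  # conventional loss-masking index in PyTorch cross-entropy

def align_labels(words, word_labels, subword_tokenize):
    """Assign each word's label to its first subword; mask the rest.

    `subword_tokenize` is any function word -> list of subword strings.
    """
    subwords, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = subword_tokenize(word)
        subwords.extend(pieces)
        labels.extend([label] + [IGNORE] * (len(pieces) - 1))
    return subwords, labels

# Toy splitter: pretend "Obama" breaks into two WordPiece-style pieces.
toy = lambda w: ["Ob", "##ama"] if w == "Obama" else [w]
subs, labs = align_labels(["Barack", "Obama", "visited"], ["B-PER", "I-PER", "O"], toy)
print(subs)  # ['Barack', 'Ob', '##ama', 'visited']
print(labs)  # ['B-PER', 'I-PER', -100, 'O']
```

Without this alignment step, word-level IOB tags and subword-level model outputs have different lengths and cannot be compared.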
Evaluation: Entity-Level F1
Why token-level accuracy is misleading for NER
Entity-Level vs Token-Level
Evaluating sequence labeling requires care. Token-level accuracy is misleading because most tokens are O (outside any entity) — a model that tags everything as O gets 80%+ accuracy on typical NER data. The standard metric is entity-level F1: an entity is correct only if both the span boundaries and the entity type match exactly. If the gold standard says "New York City" is B-LOC I-LOC I-LOC, predicting B-LOC I-LOC O is completely wrong at the entity level, even though 2 of 3 tokens are correct. Precision measures what fraction of predicted entities are correct. Recall measures what fraction of gold entities are found. F1 is their harmonic mean. The seqeval library is the standard tool for computing these metrics.
Evaluation Example
Gold: B-PER I-PER O B-LOC I-LOC I-LOC
Pred: B-PER I-PER O B-LOC I-LOC O

Token accuracy: 5/6 = 83% (misleading)

Entity-level evaluation:
  Gold entities: [Barack Obama/PER], [New York City/LOC]
  Pred entities: [Barack Obama/PER], [New York/LOC]
  "Barack Obama" → correct (exact match)
  "New York City" → wrong (boundary mismatch)

Precision: 1/2 = 50%
Recall: 1/2 = 50%
F1: 50% (much lower than 83%!)
Key insight: Entity-level F1 is much stricter than token-level accuracy. A model with 95% token accuracy might only achieve 85% entity-level F1. Always report entity-level metrics for NER — they reflect real-world usefulness.
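Entity-level scoring reduces to extracting spans from the IOB sequences and comparing them as sets. In practice you would use the seqeval library; the minimal sketch below (helper names are mine) reproduces the example's numbers:

```python
def extract_entities(tags):
    """Read (start, end, type) spans out of an IOB-2 tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last span
        if start is not None and tag != f"I-{etype}":
            entities.add((start, i, etype))       # span ended just before i
            start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(gold, pred):
    """Exact-match entity-level F1: boundaries AND type must agree."""
    g, p = extract_entities(gold), extract_entities(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
print(entity_f1(gold, pred))  # → 0.5
```

Note that the clipped "New York" prediction counts as fully wrong, which is what drives the score down to 50% despite 83% token accuracy.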
Real-World NER Applications
From search engines to medical records — where sequence labeling powers production systems
Applications
Sequence labeling powers critical production systems across industries. Search engines use NER to identify entities in queries ("flights to Paris" → LOC:Paris) for structured search. Healthcare extracts drug names, dosages, symptoms, and conditions from clinical notes for electronic health records. Finance identifies company names, monetary amounts, and dates in SEC filings and earnings calls. Legal extracts parties, dates, clauses, and obligations from contracts. Customer support identifies product names, issue types, and account numbers from tickets. The challenge in production is domain adaptation: a model trained on news text (CoNLL-2003) performs poorly on medical text because the entity types and vocabulary are completely different. Domain-specific training data and pre-trained models are essential for production NER.
Production NER
Search:
  "flights to Paris" → LOC: Paris
  "Tim Cook Apple news" → PER + ORG
Healthcare:
  "Patient takes 500mg Metformin daily" → DRUG: Metformin, DOSE: 500mg
Finance:
  "Apple reported $94.8B revenue in Q1" → ORG: Apple, MONEY: $94.8B, DATE: Q1
Legal:
  "Party A shall deliver by March 15" → PARTY: Party A, DATE: March 15

Domain gap:
  News NER model on medical text: ~60% F1
  Domain-adapted model: ~85% F1
  Domain data is essential.
Key insight: In production, the entity types you need are rarely the standard ones (PER, ORG, LOC). Real applications need custom entities (drug names, product SKUs, legal clauses), which means custom training data and domain-specific models.