Ch 5 — Sequence Labeling

POS tagging, named entity recognition, and assigning labels to every token
High-level pipeline: Tokens → Embed → Encode → CRF → Tags → Evaluate
What Is Sequence Labeling?
One label per token — a fundamentally different task from classification
The Task
While text classification assigns one label per document, sequence labeling assigns one label per token: given a sentence, predict a tag for every word. The three core sequence labeling tasks are:

- Part-of-Speech (POS) tagging — is each word a noun, verb, adjective, etc.?
- Named Entity Recognition (NER) — which words are person names, organizations, locations, dates?
- Chunking — which words form noun phrases, verb phrases, etc.?

Sequence labeling is the backbone of information extraction. Every time a search engine highlights a person's name, a chatbot extracts a date from your message, or a medical system identifies drug names in clinical notes, sequence labeling is at work. The key challenge: labels depend on neighboring labels. In "New York City", "York" is part of a location only because "New" precedes it.
Sequence Labeling Tasks
POS Tagging:
  "The cat sat on the mat"
   DET NOUN VERB ADP DET NOUN

Named Entity Recognition:
  "Barack Obama visited New York"
   B-PER I-PER O B-LOC I-LOC

Chunking:
  "[The cat] [sat] [on the mat]"
   NP VP PP

Key difference from classification:
- Classification: 1 label per document
- Sequence labeling: 1 label per token
- Labels are interdependent
Key insight: The critical difference from classification is label dependencies. A B-PER tag is very likely followed by I-PER, never by I-LOC. Models that ignore these dependencies (like independent per-token classifiers) produce incoherent tag sequences.
The IOB Tagging Format
How to encode entity boundaries at the token level
IOB Encoding
Multi-word entities like "New York City" need a way to mark where entities begin and continue. The IOB format (Inside-Outside-Beginning) solves this: B-TYPE marks the first token of an entity, I-TYPE marks continuation tokens, and O marks tokens outside any entity. So "New York City" is tagged B-LOC I-LOC I-LOC. Without IOB, you can't distinguish "New York" (one entity) from "New" and "York" (two separate entities). The extended BIOES format adds S (single-token entity) and E (end of entity) for finer boundaries. IOB-2 (the most common variant) requires every entity to start with B, even if it's not adjacent to another entity of the same type. This format is the universal standard for NER annotation and evaluation.
IOB Tagging Example
IOB-2 format:
  "Barack Obama visited New York City"
   B-PER I-PER O B-LOC I-LOC I-LOC

- B = Beginning of entity
- I = Inside (continuation)
- O = Outside (not an entity)

Why B matters:
  "John Smith met Jane Doe"
   B-PER I-PER O B-PER I-PER
  Each entity starts with B; when two same-type entities are adjacent, the B tag is the only marker of the boundary between them.

BIOES variant:
  B = Begin, I = Inside, O = Outside, E = End, S = Single-token entity
  "in London" → O S-LOC
  "in New York" → O B-LOC E-LOC
Key insight: IOB tagging converts a span-level problem (find entity boundaries) into a token-level problem (classify each token). This is what makes sequence labeling models applicable to entity recognition.
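The span-to-token conversion the key insight describes can be sketched in a few lines. This is a minimal illustration (the function name and span format are my own, not from the chapter):

```python
def spans_to_iob2(tokens, spans):
    """Convert word-index entity spans into IOB-2 tags.

    `spans` is a list of (start, end, type) tuples over token indices,
    with `end` exclusive -- e.g. (3, 6, "LOC") covers tokens 3..5.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"      # continuation tokens
    return tags

tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
spans = [(0, 2, "PER"), (3, 6, "LOC")]
print(spans_to_iob2(tokens, spans))
# → ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
```

Once spans are in this form, any token classifier can be trained on them, which is exactly the reduction the insight describes.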
Hidden Markov Models (HMMs)
The first probabilistic sequence labeler — still conceptually important
How HMMs Work
Hidden Markov Models were the dominant approach to POS tagging and NER before neural methods. An HMM models two probability distributions: transition probabilities — the probability of one tag following another (P(NOUN | DET) is high, P(DET | DET) is low), and emission probabilities — the probability of a word given a tag (P("cat" | NOUN) is high, P("cat" | VERB) is low). The "hidden" states are the tags (unobserved), and the "observations" are the words. The Viterbi algorithm efficiently finds the most likely tag sequence by dynamic programming, avoiding the exponential cost of checking all possible sequences. HMMs dominated POS tagging for decades, achieving ~96% accuracy. Their limitation: they only look at the current word and previous tag, missing longer-range context.
HMM Components
Transition probabilities:
  P(NOUN | DET) = 0.45   (high)
  P(VERB | NOUN) = 0.25
  P(DET | DET) = 0.01    (low)

Emission probabilities:
  P("cat" | NOUN) = 0.003
  P("cat" | VERB) = 0.0001
  P("the" | DET) = 0.15

Viterbi algorithm:
  Find argmax P(tags | words) by dynamic programming in O(T × N²),
  where T = sentence length and N = tag count.

POS tagging accuracy: ~96%
Limitation: only sees the current word and the previous tag (the Markov assumption).
Key insight: HMMs introduced the core idea that tag sequences have structure — some tag transitions are likely, others are nearly impossible. Every modern sequence labeling model builds on this insight, even if they use different mechanisms.
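The Viterbi recursion described above fits in a short function. This is a minimal sketch assuming log-probabilities stored in nested dicts (the data layout and toy numbers are my own, not from the chapter):

```python
import math

def viterbi(words, tags, log_trans, log_emit, log_init):
    """Most likely tag sequence under an HMM, in O(T * N^2) time.

    log_trans[s][t] = log P(tag t | previous tag s)
    log_emit[t][w]  = log P(word w | tag t)   (missing words -> -inf)
    log_init[t]     = log P(tag t at position 0)
    """
    best = {t: log_init[t] + log_emit[t].get(words[0], -math.inf) for t in tags}
    back = []  # back[i][t] = best previous tag for tag t at position i+1
    for w in words[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            # best previous tag for each current tag (emission is constant in s)
            prev = max(tags, key=lambda s: best[s] + log_trans[s][t])
            new_best[t] = best[prev] + log_trans[prev][t] + log_emit[t].get(w, -math.inf)
            ptr[t] = prev
        best = new_best
        back.append(ptr)
    # backtrace from the best final tag
    t = max(tags, key=lambda s: best[s])
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return path[::-1]

# Toy model for "the cat sat" (made-up probabilities for illustration)
lg = math.log
tags = ["DET", "NOUN", "VERB"]
log_init = {"DET": lg(0.8), "NOUN": lg(0.15), "VERB": lg(0.05)}
log_trans = {
    "DET":  {"DET": lg(0.01), "NOUN": lg(0.9),  "VERB": lg(0.09)},
    "NOUN": {"DET": lg(0.1),  "NOUN": lg(0.3),  "VERB": lg(0.6)},
    "VERB": {"DET": lg(0.5),  "NOUN": lg(0.3),  "VERB": lg(0.2)},
}
log_emit = {
    "DET":  {"the": lg(0.5)},
    "NOUN": {"cat": lg(0.05), "mat": lg(0.04)},
    "VERB": {"sat": lg(0.03)},
}
print(viterbi(["the", "cat", "sat"], tags, log_trans, log_emit, log_init))
# → ['DET', 'NOUN', 'VERB']
```

Note the dynamic-programming structure: at each position we keep only the best score per tag, which is what avoids enumerating all N^T possible sequences.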
Conditional Random Fields (CRFs)
The discriminative upgrade — richer features, better accuracy
CRFs vs HMMs
Conditional Random Fields (Lafferty et al., 2001) improved on HMMs by modeling P(tags | words) directly (discriminative) rather than the joint P(tags, words) (generative). This seemingly small change has a huge practical impact: CRFs can use arbitrary overlapping features without worrying about independence assumptions. You can simultaneously include features like "is the word capitalized?", "does it end in -ing?", "is the previous word 'the'?", and "does it appear in a gazetteer?". CRFs are also globally normalized over the entire tag sequence, so decoding optimizes the whole sequence jointly rather than making independent per-token decisions. They became the standard for NER, achieving F1 scores of 85–90% on the CoNLL-2003 benchmark, and they remain important today as the output layer in neural sequence labeling models (BiLSTM-CRF).
CRF Features
Feature functions for NER:
  f1: word is capitalized       & tag = B-PER
  f2: previous tag = B-PER      & tag = I-PER
  f3: word in person gazetteer  & tag = B-PER
  f4: word ends in "-tion"      & tag = O
  f5: word is all digits        & tag = B-DATE
  f6: previous word = "Mr."     & tag = B-PER

CRF advantages over HMM:
- Discriminative: models P(tags | words)
- Arbitrary overlapping features
- No independence assumptions
- Global sequence optimization

CoNLL-2003 NER F1 scores:
  HMM: ~85%
  CRF: ~89%
  CRF + gazetteers: ~91%
Key insight: CRFs showed that feature engineering + global sequence modeling is a powerful combination. The CRF's ability to enforce valid tag sequences (no I-PER after B-LOC) is so valuable that it survives as a layer in neural models.
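Feature functions like f1–f6 above are typically produced by a per-token feature extractor. The sketch below (names and feature choices are my own; libraries such as sklearn-crfsuite consume dicts in roughly this shape) shows the kind of overlapping features a CRF can use freely:

```python
def token_features(tokens, i):
    """Overlapping features for token i -- usable by a CRF precisely
    because it makes no independence assumptions between features."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),   # capitalization cue (B-PER, B-LOC, ...)
        "word.isdigit": w.isdigit(),   # digit cue (dates, amounts)
        "suffix3": w[-3:],             # morphological cue, e.g. "-ing", "-tion"
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Mr.", "Smith", "visited", "Paris"]
print(token_features(tokens, 1))
```

Here "Smith" gets both the capitalization feature and the "previous word is 'Mr.'" feature at once, exactly the kind of overlap an HMM's independence assumptions forbid.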
BiLSTM-CRF
The neural architecture that dominated NER for five years
Architecture
The BiLSTM-CRF (Huang et al., 2015) combines the best of neural and structured prediction. A bidirectional LSTM reads the sentence in both directions, producing context-rich representations for each token that capture both left and right context. These representations are fed into a CRF layer that models tag dependencies and produces the globally optimal tag sequence. The BiLSTM replaces hand-crafted features with learned representations, while the CRF ensures valid tag sequences. Adding character-level CNNs (BiLSTM-CNN-CRF) captures morphological patterns like capitalization and suffixes, achieving 91.18% F1 on CoNLL-2003 NER without any hand-crafted features. This architecture was the state of the art from 2015 to 2018, when BERT-based models surpassed it.
BiLSTM-CRF Architecture
Input: "Barack Obama visited London"

Layer 1: Embeddings
  Word embeddings (GloVe/Word2Vec) + character-CNN embeddings

Layer 2: BiLSTM
  Forward LSTM:  → → → →
  Backward LSTM: ← ← ← ←
  Concatenate: [forward; backward]

Layer 3: CRF
  Score all possible tag sequences, pick the best one (Viterbi)

Output: B-PER I-PER O B-LOC
CoNLL-2003 F1: 91.18% (no hand-crafted features!)
Key insight: BiLSTM-CRF demonstrated that neural feature learning + structured prediction is strictly better than either alone. The BiLSTM learns what features matter; the CRF ensures the output is coherent.
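One concrete way to see what the CRF layer contributes on top of the BiLSTM: its transition scores can make structurally invalid tag pairs (like I-PER after B-LOC) effectively impossible. A minimal IOB-2 validity check, sketched here as plain Python rather than learned transition weights (function name is my own):

```python
def is_valid_iob2_transition(prev_tag, tag):
    """IOB-2 constraint a CRF transition matrix effectively enforces:
    I-X may only follow B-X or I-X of the same entity type X."""
    if not tag.startswith("I-"):
        return True                     # O and B-* may follow anything
    etype = tag[2:]
    return prev_tag in (f"B-{etype}", f"I-{etype}")

print(is_valid_iob2_transition("B-PER", "I-PER"))  # True
print(is_valid_iob2_transition("B-LOC", "I-PER"))  # False
print(is_valid_iob2_transition("O", "I-LOC"))      # False
```

An independent per-token classifier has no mechanism for this check; a CRF bakes it into decoding, which is why the output sequence is always coherent.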
Transformer-Based Sequence Labeling
BERT for NER — pre-trained representations meet token classification
BERT for NER
BERT-based NER fine-tunes a pre-trained transformer for token classification. Each token's contextual representation from BERT is passed through a linear classification layer that predicts the IOB tag. Some implementations add a CRF layer on top for sequence coherence. BERT's advantage is its deep bidirectional context — it considers the entire sentence when representing each token, capturing long-range dependencies that BiLSTMs struggle with. On CoNLL-2003, BERT-based models achieve 92–93% F1, surpassing BiLSTM-CRF. For domain-specific NER (biomedical, legal, financial), domain-adapted models like BioBERT, LegalBERT, and FinBERT further improve performance by pre-training on domain text. The trade-off: BERT models are 10–100x more expensive to train and run than BiLSTM-CRF.
BERT NER Pipeline
Architecture:
  Input tokens → BERT encoder → token representations (768-dim)
  → linear layer (768 → num_tags) → (optional) CRF layer → IOB tag predictions

CoNLL-2003 F1 progression:
  CRF (2003): ~89%
  BiLSTM-CRF (2015): ~91%
  BERT (2018): ~92.8%
  RoBERTa (2019): ~93.2%

Domain-specific models:
  BioBERT: biomedical NER
  SciBERT: scientific text
  LegalBERT: legal documents
  FinBERT: financial text

Cost: 10–100x more than BiLSTM-CRF
Key insight: For NER, the jump from BiLSTM-CRF to BERT is smaller than the jump from CRF to BiLSTM-CRF. Pre-trained transformers help most when labeled data is scarce — their general language knowledge compensates for limited task-specific examples.
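A practical detail when fine-tuning BERT for token classification: labels are defined per word, but BERT tokenizes into subwords. A common convention is to label only the first subword and mask the rest from the loss (e.g. index -100 for PyTorch's cross-entropy). A sketch with a stand-in tokenizer (the helper names and toy splitter are mine, not a real WordPiece tokenizer):

```python
IGNORE = -100  # conventional loss-masking index in PyTorch cross-entropy

def align_labels(words, word_labels, subword_tokenize):
    """Assign each word's label to its first subword; mask the rest.

    `subword_tokenize` is any function word -> list of subword strings.
    """
    subwords, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = subword_tokenize(word)
        subwords.extend(pieces)
        labels.extend([label] + [IGNORE] * (len(pieces) - 1))
    return subwords, labels

# Toy splitter: pretend "Obama" breaks into two WordPiece-style pieces.
toy = lambda w: ["Ob", "##ama"] if w == "Obama" else [w]
subs, labs = align_labels(["Barack", "Obama", "visited"], ["B-PER", "I-PER", "O"], toy)
print(subs)  # ['Barack', 'Ob', '##ama', 'visited']
print(labs)  # ['B-PER', 'I-PER', -100, 'O']
```

Without this alignment step, word-level IOB tags and subword-level model outputs have different lengths and cannot be compared.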
Evaluation: Entity-Level F1
Why token-level accuracy is misleading for NER
Entity-Level vs Token-Level
Evaluating sequence labeling requires care. Token-level accuracy is misleading because most tokens are O (outside any entity) — a model that tags everything as O gets 80%+ accuracy on typical NER data. The standard metric is entity-level F1: an entity is correct only if both the span boundaries and the entity type match exactly. If the gold standard says "New York City" is B-LOC I-LOC I-LOC, predicting B-LOC I-LOC O is completely wrong at the entity level, even though 2 of 3 tokens are correct. Precision measures what fraction of predicted entities are correct. Recall measures what fraction of gold entities are found. F1 is their harmonic mean. The seqeval library is the standard tool for computing these metrics.
Evaluation Example
Gold: B-PER I-PER O B-LOC I-LOC I-LOC
Pred: B-PER I-PER O B-LOC I-LOC O

Token accuracy: 5/6 = 83% (misleading)

Entity-level evaluation:
  Gold entities: [Barack Obama/PER], [New York City/LOC]
  Pred entities: [Barack Obama/PER], [New York/LOC]
  "Barack Obama" → correct (exact match)
  "New York City" → wrong (boundary mismatch)

Precision: 1/2 = 50%
Recall: 1/2 = 50%
F1: 50% (much lower than 83%!)
Key insight: Entity-level F1 is much stricter than token-level accuracy. A model with 95% token accuracy might only achieve 85% entity-level F1. Always report entity-level metrics for NER — they reflect real-world usefulness.
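Entity-level scoring reduces to extracting spans from the IOB sequences and comparing them as sets. In practice you would use the seqeval library; the minimal sketch below (helper names are mine) reproduces the example's numbers:

```python
def extract_entities(tags):
    """Read (start, end, type) spans out of an IOB-2 tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last span
        if start is not None and tag != f"I-{etype}":
            entities.add((start, i, etype))       # span ended just before i
            start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(gold, pred):
    """Exact-match entity-level F1: boundaries AND type must agree."""
    g, p = extract_entities(gold), extract_entities(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
print(entity_f1(gold, pred))  # → 0.5
```

Note that the clipped "New York" prediction counts as fully wrong, which is what drives the score down to 50% despite 83% token accuracy.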
Real-World NER Applications
From search engines to medical records — where sequence labeling powers production systems
Applications
Sequence labeling powers critical production systems across industries. Search engines use NER to identify entities in queries ("flights to Paris" → LOC:Paris) for structured search. Healthcare extracts drug names, dosages, symptoms, and conditions from clinical notes for electronic health records. Finance identifies company names, monetary amounts, and dates in SEC filings and earnings calls. Legal extracts parties, dates, clauses, and obligations from contracts. Customer support identifies product names, issue types, and account numbers from tickets. The challenge in production is domain adaptation: a model trained on news text (CoNLL-2003) performs poorly on medical text because the entity types and vocabulary are completely different. Domain-specific training data and pre-trained models are essential for production NER.
Production NER
Search:
  "flights to Paris" → LOC: Paris
  "Tim Cook Apple news" → PER + ORG
Healthcare:
  "Patient takes 500mg Metformin daily" → DRUG: Metformin, DOSE: 500mg
Finance:
  "Apple reported $94.8B revenue in Q1" → ORG: Apple, MONEY: $94.8B, DATE: Q1
Legal:
  "Party A shall deliver by March 15" → PARTY: Party A, DATE: March 15

Domain gap:
  News NER model on medical text: ~60% F1
  Domain-adapted model: ~85% F1
  Domain data is essential.
Key insight: In production, the entity types you need are rarely the standard ones (PER, ORG, LOC). Real applications need custom entities (drug names, product SKUs, legal clauses), which means custom training data and domain-specific models.