
Question Answering

Key Concepts

Overlapping Tokens

Overlapping tokens are segments of text that are repeated between consecutive chunks when a long text must be split into smaller pieces because it exceeds the model's maximum token limit.

Here's a detailed explanation:

  1. Why we need overlapping:

    • When a text is too long for the model's context window (max_length)
    • To maintain continuity and context between chunks
    • To avoid losing information that might be split between chunks
  2. Key parameters in the code:

    • max_length: Maximum number of tokens allowed
    • stride: Number of overlapping tokens between chunks
    • return_overflowing_tokens: Tells tokenizer to return multiple chunks
    • truncation="only_second": only truncates the second sequence (the context), never the question

Let's illustrate with an example:

Suppose we have a text: "The quick brown fox jumps over the lazy sleeping dog". The tokenization might look like this:

Chunk 1: [The quick brown fox jumps over]
                    ↓ overlap ↓
Chunk 2:            [brown fox jumps over the lazy]
                                ↓ overlap ↓
Chunk 3:                        [jumps over the lazy sleeping dog]

Real-world example with actual tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What did the fox do?"
context = "The quick brown fox jumps over the lazy sleeping dog. It was a beautiful sunny day."

tokenized = tokenizer(
    question,
    context,
    max_length=16,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=4
)

# Print the decoded tokens for each chunk
for encoding in tokenized["input_ids"]:
    print(tokenizer.decode(encoding))
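
With bert-base-uncased this prints one chunk per line, each repeating the full question and sharing about stride context tokens with the previous chunk; the output looks something like (exact splits depend on the wordpiece vocabulary):

[CLS] what did the fox do? [SEP] the quick brown fox jumps over the [SEP]
[CLS] what did the fox do? [SEP] fox jumps over the lazy sleeping dog [SEP]
...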

Offset Mapping

Offset mapping provides a character-level mapping between the original text and the tokenized output. When you pass return_offsets_mapping=True to a fast tokenizer (the Python-only tokenizers do not support it), the result includes a list of (start, end) tuples where:

  • start: starting character position in the original text
  • end: ending character position in the original text (exclusive, so text[start:end] recovers the span)

Here's a detailed breakdown:

  1. Structure of offset_mapping:

    [(0, 0),   # [CLS] token - special token, maps to nothing
     (0, 3),   # "how" - maps to characters 0-3 in the original text
     (4, 8),   # "many" - maps to characters 4-8
     ...]
    
  2. Special tokens mapping:

    • [CLS], [SEP], [PAD]: represented as (0, 0)
    • These tokens don't correspond to any actual text in the input
  3. Usage example:

    # Example showing how to use offset_mapping
    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    text = "How many cats?"
    tokenized = tokenizer(text, return_offsets_mapping=True)
    
    for token_id, offset in zip(tokenized["input_ids"], tokenized["offset_mapping"]):
        token = tokenizer.decode([token_id])
        start, end = offset
        original_text = text[start:end] if start != end else "[SPECIAL]"
        print(f"Token: {token}, Offset: {offset}, Original text: {original_text}")
    

Main purposes of offset_mapping:

  1. Answer span location:

    • Helps locate exact position of answers in QA tasks
    • Maps token positions back to original text positions
  2. Token-text alignment:

    • Enables precise tracking of which parts of original text correspond to which tokens
    • Useful for tasks requiring character-level precision
  3. Handling overlapping chunks:

    • Helps maintain correct position information when text is split into chunks
    • Essential for combining predictions from multiple chunks

Common operations with offset_mapping:

# Finding original text for a token
def get_original_text(text, offset):
    start, end = offset
    return text[start:end] if start != end else "[SPECIAL]"

# Finding token position for a text span
def find_token_position(offset_mapping, char_start, char_end):
    for idx, (start, end) in enumerate(offset_mapping):
        if start == char_start and end == char_end:
            return idx
    return None
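
A minimal usage sketch of the two helpers, reusing the tokenizer and text from the earlier example (the character span 9-13 assumes "cats" sits at those positions in "How many cats?"):

text = "How many cats?"
tokenized = tokenizer(text, return_offsets_mapping=True)
offsets = tokenized["offset_mapping"]

print(get_original_text(text, offsets[2]))  # prints "many"
print(find_token_position(offsets, 9, 13))  # prints 3, the index of the "cats" token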

This feature is particularly important in Question Answering tasks where you need to:

  • Map predicted token positions back to original text
  • Handle answer spans across multiple chunks
  • Maintain precise position information for answer extraction

overflow_to_sample_mapping

overflow_to_sample_mapping is an index list that maps each feature in the overflowing tokens back to its original sample. It's particularly useful when processing multiple examples with overflow.

Here's a detailed explanation:

  • When a text is split into multiple chunks due to length
  • Each chunk needs to be traced back to its original example
  • overflow_to_sample_mapping provides this tracking mechanism

Here's a comprehensive example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Multiple examples
examples = {
    "question": [
        "What is the capital?",
        "Who won the game?"
    ],
    "context": [
        "Paris is the capital of France. It is known for the Eiffel Tower. The city has many historic monuments." * 5,  # Made longer by repeating
        "The Lakers won the game against the Bulls. It was a close match." * 2
    ]
}

# Tokenize with overflow
tokenized_examples = []
for q, c in zip(examples["question"], examples["context"]):
    tokenized = tokenizer(
        q,
        c,
        max_length=50,  # Small max_length for demonstration
        stride=10,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
        truncation="only_second"
    )
    tokenized_examples.append(tokenized)

# Let's see how many chunks each example was split into
for i, tokenized in enumerate(tokenized_examples):
    print(f"\nExample {i}:")
    print(f"Number of chunks: {len(tokenized['input_ids'])}")
    print(f"Overflow to sample mapping: {tokenized.overflow_to_sample_mapping}")

This might output something like:

Example 0:
Number of chunks: 4
Overflow to sample mapping: [0, 0, 0, 0]  # all four chunks map to sample 0 of this call

Example 1:
Number of chunks: 2
Overflow to sample mapping: [0, 0]  # also all zeros, since this call tokenized a single example
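
The mapping is all zeros here only because each example was tokenized in its own call, so every chunk points at sample 0 of that call. Passing the whole batch to the tokenizer at once is where the mapping becomes informative; a sketch under the same settings (exact chunk counts may vary):

batch = tokenizer(
    examples["question"],
    examples["context"],
    max_length=50,
    stride=10,
    return_overflowing_tokens=True,
    truncation="only_second",
    padding="max_length"
)
print(batch["overflow_to_sample_mapping"])
# e.g. [0, 0, 0, 0, 1, 1]: the first four chunks come from example 0, the last two from example 1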

Practical Use Case:

def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context,
    # this maps each feature back to the sample it came from
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    tokenized_examples["example_id"] = []

    for i, sample_idx in enumerate(sample_mapping):
        # sequence_ids distinguishes question tokens (0) from context tokens (1)
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Record which original example this feature came from
        tokenized_examples["example_id"].append(examples["id"][sample_idx])

        # Keep offsets only for context tokens, so predicted spans can be
        # mapped back to the context (question offsets are set to None)
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == 1 else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
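
In practice this function is typically applied over a whole dataset with 🤗 Datasets' map; a sketch, assuming a datasets.Dataset with "question", "context", and "id" columns (SQuAD is used purely as an illustration):

from datasets import load_dataset

dataset = load_dataset("squad", split="validation")
features = dataset.map(
    prepare_train_features,
    batched=True,  # the function expects lists of questions/contexts
    remove_columns=dataset.column_names,  # keep only the tokenized features
)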

Key Benefits:

  1. Tracking Features:

    • Maps each feature back to its source example
    • Maintains relationship between chunks and original data
  2. Data Processing:

    • Helps in maintaining example-level information
    • Essential for combining predictions from multiple chunks
  3. Batch Processing:

    • Enables proper batching of features
    • Maintains data integrity during training

Common Use Pattern:

# Example of using overflow_to_sample_mapping in a training loop
for i, sample_idx in enumerate(tokenized_examples.overflow_to_sample_mapping):
    # Get original example ID
    original_example_id = examples["id"][sample_idx]

    # Get original answer
    original_answer = examples["answers"][sample_idx]

    # process_feature is a placeholder for your own per-chunk logic,
    # keeping each feature connected to its original example
    process_feature(tokenized_examples[i], original_example_id, original_answer)

This feature is particularly important in Question Answering tasks where:

  • Long contexts need to be split into multiple chunks
  • Each chunk needs to be processed separately
  • Results need to be combined while maintaining reference to original examples
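
As a minimal sketch of that last point, the mapping lets you group per-chunk predictions back by source example before combining them (chunk_scores is a hypothetical list holding one model score per feature):

from collections import defaultdict

# Group feature indices by the example they came from
features_per_example = defaultdict(list)
for feature_idx, sample_idx in enumerate(tokenized_examples["overflow_to_sample_mapping"]):
    features_per_example[sample_idx].append(feature_idx)

# For each original example, keep the chunk whose prediction scored highest
for sample_idx, feature_indices in features_per_example.items():
    best_feature = max(feature_indices, key=lambda fi: chunk_scores[fi])  # chunk_scores: hypothetical
    print(f"Example {sample_idx}: best answer comes from chunk {best_feature}")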