Question Answering¶
Key Concepts¶
Overlapping Tokens¶
Overlapping tokens are segments of text that are repeated between consecutive chunks when a long text has to be split into smaller pieces because of the model's maximum token limit.
Here's a detailed explanation:
Why we need overlapping:
- When a text is too long for the model's context window (max_length)
- To maintain continuity and context between chunks
- To avoid losing information that might be split between chunks
Key parameters in the code:
- max_length: Maximum number of tokens allowed
- stride: Number of overlapping tokens between chunks
- return_overflowing_tokens: Tells the tokenizer to return every chunk instead of discarding the overflow
- truncation="only_second": Only truncates the context (the second sequence), never the question
Let's illustrate with an example:
Suppose we have a text: "The quick brown fox jumps over the lazy sleeping dog". The tokenization might look like this:
Chunk 1: [The quick brown fox jumps over]
↓ overlap ↓
Chunk 2: [brown fox jumps over the lazy]
↓ overlap ↓
Chunk 3: [jumps over the lazy sleeping dog]
Real-world example with actual tokens:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
question = "What did the fox do?"
context = "The quick brown fox jumps over the lazy sleeping dog. It was a beautiful sunny day."
tokenized = tokenizer(
    question,
    context,
    max_length=16,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=4
)

# Print the decoded tokens for each chunk
for encoding in tokenized["input_ids"]:
    print(tokenizer.decode(encoding))
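Because max_length is only 16 tokens here, the context is split across several chunks: each decoded chunk contains the full question followed by a window of the context, and consecutive windows share roughly stride (= 4) tokens. The number of chunks can be checked directly:

print(f"Number of chunks: {len(tokenized['input_ids'])}")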
Offset Mapping¶
Offset mapping is a feature that provides the character-level mapping between the original text and the tokenized output. It returns a list of tuples (start, end) where:
- start: starting character position in the original text
- end: ending character position in the original text
Here's a detailed breakdown:
Structure of offset_mapping:
- One (start, end) character tuple for every token, in the same order as input_ids
Special tokens mapping:
- [CLS], [SEP], [PAD]: represented as (0, 0)
- These tokens don't correspond to any actual text in the input
Usage example:
# Example showing how to use offset_mapping
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "How many cats?"
tokenized = tokenizer(text, return_offsets_mapping=True)

for token_id, offset in zip(tokenized["input_ids"], tokenized["offset_mapping"]):
    token = tokenizer.decode([token_id])
    start, end = offset
    original_text = text[start:end] if start != end else "[SPECIAL]"
    print(f"Token: {token}, Offset: {offset}, Original text: {original_text}")
Main purposes of offset_mapping:
Answer span location:
- Helps locate exact position of answers in QA tasks
- Maps token positions back to original text positions
Token-text alignment:
- Enables precise tracking of which parts of original text correspond to which tokens
- Useful for tasks requiring character-level precision
Handling overlapping chunks:
- Helps maintain correct position information when text is split into chunks
- Essential for combining predictions from multiple chunks
Common operations with offset_mapping:
# Finding original text for a token
def get_original_text(text, offset):
    start, end = offset
    return text[start:end] if start != end else "[SPECIAL]"

# Finding token position for a text span
def find_token_position(offset_mapping, char_start, char_end):
    for idx, (start, end) in enumerate(offset_mapping):
        if start == char_start and end == char_end:
            return idx
    return None
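For instance, with the tokenizer and text from the usage example above (exact indices depend on the tokenizer, but with bert-base-uncased token 0 is [CLS] and token 1 covers "How"):

text = "How many cats?"
enc = tokenizer(text, return_offsets_mapping=True)
print(get_original_text(text, enc["offset_mapping"][1]))   # expected: How
print(find_token_position(enc["offset_mapping"], 0, 3))    # expected: 1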
This feature is particularly important in Question Answering tasks where you need to:
- Map predicted token positions back to original text
- Handle answer spans across multiple chunks
- Maintain precise position information for answer extraction
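As a minimal sketch of the first point, assuming start_token and end_token are the predicted token indices for one chunk and offsets is that chunk's offset_mapping, the character offsets lead straight back to the answer text:

def extract_answer(context, offsets, start_token, end_token):
    # Character span covered by the predicted token span
    char_start = offsets[start_token][0]
    char_end = offsets[end_token][1]
    return context[char_start:char_end]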
overflow_to_sample_mapping¶
overflow_to_sample_mapping is an index list that maps each feature in the overflowing tokens back to its original sample. It's particularly useful when processing multiple examples with overflow.
Here's a detailed explanation:
- When a text is split into multiple chunks due to length
- Each chunk needs to be traced back to its original example
- overflow_to_sample_mapping provides this tracking mechanism
Here's a comprehensive example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Multiple examples
examples = {
    "question": [
        "What is the capital?",
        "Who won the game?"
    ],
    "context": [
        "Paris is the capital of France. It is known for the Eiffel Tower. The city has many historic monuments." * 5,  # Made longer by repeating
        "The Lakers won the game against the Bulls. It was a close match." * 2
    ]
}

# Tokenize with overflow
tokenized_examples = []
for q, c in zip(examples["question"], examples["context"]):
    tokenized = tokenizer(
        q,
        c,
        max_length=50,  # Small max_length for demonstration
        stride=10,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
        truncation="only_second"
    )
    tokenized_examples.append(tokenized)
# Let's see how many chunks each example was split into
for i, tokenized in enumerate(tokenized_examples):
    print(f"\nExample {i}:")
    print(f"Number of chunks: {len(tokenized['input_ids'])}")
    print(f"Overflow to sample mapping: {tokenized.overflow_to_sample_mapping}")
This might output something like:
Example 0:
Number of chunks: 4
Overflow to sample mapping: [0, 0, 0, 0]  # All four chunks come from this example (index 0 within its own tokenizer call)
Example 1:
Number of chunks: 2
Overflow to sample mapping: [0, 0]  # Index 0 again, because each example was tokenized in a separate call
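Note that because the loop above tokenizes each example in its own call, the mapping is always all zeros. When several examples are passed to the tokenizer in a single call (as the training-features function below does), the indices refer to positions in that batch. A sketch reusing the examples dict defined above:

tokenized_batch = tokenizer(
    examples["question"],
    examples["context"],
    max_length=50,
    stride=10,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding="max_length",
    truncation="only_second"
)
# Prints something like [0, 0, 0, 0, 1, 1] if the first example yields
# four chunks and the second yields two
print(tokenized_batch["overflow_to_sample_mapping"])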
Practical Use Case:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context,
    # we need a map from each feature back to the example it came from
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    tokenized_examples["example_id"] = []

    for i, sample_idx in enumerate(sample_mapping):
        # sequence_ids is None for special tokens, 0 for the question, 1 for the context
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Set example ID for this feature
        tokenized_examples["example_id"].append(examples["id"][sample_idx])

        # Keep offsets only for context tokens so answer spans can be mapped back to the context
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == 1 else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
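In practice this function is usually applied over a whole dataset with the datasets library. A sketch, assuming a SQuAD-style dataset with "id", "question", "context", and "answers" columns:

from datasets import load_dataset

raw_dataset = load_dataset("squad", split="train")  # assumed SQuAD-style columns
train_features = raw_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=raw_dataset.column_names,
)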
Key Benefits:
Tracking Features:
- Maps each feature back to its source example
- Maintains relationship between chunks and original data
Data Processing:
- Helps in maintaining example-level information
- Essential for combining predictions from multiple chunks
Batch Processing:
- Enables proper batching of features
- Maintains data integrity during training
Common Use Pattern:
# Example of using overflow_to_sample_mapping in a training loop
for i, sample_idx in enumerate(tokenized_examples.overflow_to_sample_mapping):
    # Get original example ID
    original_example_id = examples["id"][sample_idx]
    # Get original answer
    original_answer = examples["answers"][sample_idx]
    # Process feature while maintaining connection to original example
    process_feature(tokenized_examples[i], original_example_id, original_answer)
This feature is particularly important in Question Answering tasks where:
- Long contexts need to be split into multiple chunks
- Each chunk needs to be processed separately
- Results need to be combined while maintaining reference to original examples
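A common post-processing sketch, assuming the train_features produced above (each feature carries the example_id set in prepare_train_features): group feature indices by their source example so per-chunk predictions can be compared and the best answer span kept for each example.

import collections

# Map each original example ID to the indices of the features (chunks) built from it
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(train_features):
    features_per_example[feature["example_id"]].append(i)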