12. Properties of Context-free Languages (CFL)#
Introduction#
This section explores the fundamental concepts and properties of Context-free Languages (CFLs), a class of formal languages widely studied in theoretical computer science. We’ll cover their definition, closure properties (union, concatenation, Kleene closure), and properties related to intersection and complement.
1. Introduction to Context-free Languages#
A context-free language (CFL) is a language that can be generated by a context-free grammar (CFG). The class of CFLs is more expressive than regular languages and can be recognized by pushdown automata. For example, the following languages are all CFLs: Balanced Parentheses Language, Palindromes over {a, b}, \(L = \{ a^nb^n | n ≥ 0 \}\).
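As a quick, informal aid (not part of the formal development), each of the three example languages can be recognized directly with a short Python predicate; the function names below are our own illustrative choices:

```python
def is_balanced_parens(s):
    """True iff s is a balanced string over '(' and ')'."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # closing paren with no open match
                return False
        else:
            return False           # symbol outside the alphabet
    return depth == 0

def is_palindrome_ab(s):
    """True iff s is a palindrome over {a, b}."""
    return set(s) <= {'a', 'b'} and s == s[::-1]

def is_anbn(s):
    """True iff s = a^n b^n for some n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == 'a' * n + 'b' * n

print(is_balanced_parens('(()())'))  # True
print(is_palindrome_ab('abba'))      # True
print(is_anbn('aabb'))               # True
```

These predicates check membership directly; recognizing such languages with a grammar or pushdown automaton is the subject of the sections that follow.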
2. Code Walkthrough: Representing a Context-Free Grammar in Python#
Concept#
A context-free grammar (CFG) is the formal device that generates every context-free language (CFL). Formally, a CFG is a 4-tuple
\(G = (V, \Sigma, R, S)\)
where \(V\) is the set of variables (non-terminals), \(\Sigma\) is the set of terminals, \(R\) is the set of production rules, and \(S \in V\) is the start symbol.
The CFG class below provides a direct, one-to-one mapping of this 4-tuple into
Python. It is intentionally minimal — its only purpose is to hold the four components
and print them readably. A richer Grammar class (with subscript-renaming support)
is defined later in Section 3.2 for the union construction.
| Formal notation | Python field | Notes |
|---|---|---|
| \(V\) | `variables` | set of non-terminal symbols |
| \(\Sigma\) | `terminals` | set of terminal symbols |
| \(R\) | `rules` | dict: lhs symbol → list of alternatives |
| \(S\) | `start` | single string from `variables` |
What to look for#
The rules dict maps one left-hand side symbol to a list of alternatives.
For example, {'S': ['aSb', '']} encodes the two productions \(S \to aSb\) and
\(S \to \varepsilon\) (the empty string is represented as '').
Running outside Jupyter#
This cell defines a class and runs a small example. Save the class definition to
cfg_intro.py and run:
```shell
python cfg_intro.py
```
No substitutions are needed — there are no Jupyter-specific calls in this cell.
```python
# ── FORMAL GROUNDING ────────────────────────────────────────────────────────
# A CFG is the 4-tuple G = (V, Σ, R, S) [Sipser §2.1, Definition 2.2]
# This class stores each component directly, with no parsing or simulation.
# ────────────────────────────────────────────────────────────────────────────
class CFG:
    """Lightweight container for a Context-Free Grammar G = (V, Σ, R, S)."""

    def __init__(self, variables, terminals, rules, start):
        # V — set of non-terminal symbols (variables)
        self.variables = variables
        # Σ — set of terminal symbols
        self.terminals = terminals
        # R — production rules stored as dict: lhs_symbol → [rhs_string, ...]
        #     Using a list of alternatives lets one dict entry encode
        #     A → α₁ | α₂ | ... | αₙ without a separate entry per alternative.
        self.rules = rules
        # S ∈ V — the designated start symbol
        self.start = start

    def __str__(self):
        # ── Phase: build human-readable summary of G ──
        result = f"Variables: {', '.join(self.variables)}\n"
        result += f"Terminals: {', '.join(self.terminals)}\n"
        result += f"Start: {self.start}\n"
        result += "Rules:\n"
        # Expand each A → [α₁, α₂, ...] into individual production lines
        for lhs, rhs_list in self.rules.items():
            for rhs in rhs_list:
                # Display empty string as ∧ (lambda) to match textbook notation
                display_rhs = rhs if rhs else '∧'
                result += f"  {lhs} → {display_rhs}\n"
        return result


# ── EXAMPLE: CFG for L = {aⁿbⁿ | n ≥ 0} ──────────────────────────────────
# Production rules: S → aSb  (adds one 'a' on the left, one 'b' on the right)
#                   S → ε    (base case: generates the empty string ∧)
# This grammar is the canonical example showing CFLs go beyond regular languages.
cfg_anbn = CFG(
    variables={'S'},
    terminals={'a', 'b'},
    rules={'S': ['aSb', '']},  # '' encodes ε (the empty string)
    start='S'
)

print("CFG for {a^n b^n | n ≥ 0}:")
print(cfg_anbn)
```
```text
CFG for {a^n b^n | n ≥ 0}:
Variables: S
Terminals: a, b
Start: S
Rules:
  S → aSb
  S → ∧
```
3. Closure Properties of Context-free Languages#
CFLs exhibit certain closure properties under specific operations, meaning that if we apply these operations to context-free languages, the result is also a context-free language. We’ll explore union, concatenation (product), and Kleene closure.
3.1 Union#
Definition: If \(L_1\) and \(L_2\) are CFLs, their union is defined as \(L_1 \cup L_2 = \{ x \mid x \in L_1 \text{ or } x \in L_2 \}\). This means any string \(x\) that belongs to either \(L_1\) or \(L_2\) (or both) is in \(L_1 \cup L_2\). Given grammars \(G_1 = (V_1, T_1, P_1, S_1)\) for \(L_1\) and \(G_2 = (V_2, T_2, P_2, S_2)\) for \(L_2\), we construct a CFG for \(L_1 \cup L_2\):
New start symbol: \(S\)
Rules: \(S \to S_1 \mid S_2\), plus all rules from \(G_1\) and \(G_2\).
Theorem: If \(L_1\) and \(L_2\) are context-free languages, then \(L_1 \cup L_2\) is also a context-free language.
Proof: Suppose we have CFGs \(G_1 = (V_1, \Sigma_1, R_1, S_1)\) for \(L_1\) and \(G_2 = (V_2, \Sigma_2, R_2, S_2)\) for \(L_2\), with subscripts added to their non-terminals so that \(V_1 \cap V_2 = \emptyset\). We can construct a new grammar \(G_3 = (V_1 \cup V_2 \cup \{S\}, \Sigma_1 \cup \Sigma_2, R_1 \cup R_2 \cup \{S \rightarrow S_1 \mid S_2\}, S)\), where \(S_1\) and \(S_2\) are the start symbols of \(G_1\) and \(G_2\), respectively. This new grammar generates \(L_1 \cup L_2\). Note that in order to create the new grammar, we need to add subscripts to the non-terminals of the original grammars, which serves the following purposes:
Avoid Naming Conflicts: if both grammars \(G_1\) and \(G_2\) have non-terminals with the same name (e.g., S, A, B), this could cause ambiguity when combining them. Using subscripts (e.g., \(S_1\), \(S_2\), \(A_1\), \(A_2\), \(B_1\), \(B_2\)) ensures that symbols from different grammars remain distinct.
Maintain Structural Integrity: by keeping the original non-terminals unique, the derivation rules of each grammar remain unchanged, preventing unintended interactions between the two grammars.
Enable Proper Union Construction: when defining a new start symbol \(S\) with production rules \(S \rightarrow S_1 \mid S_2\), the subscripts make it clear which original grammar each non-terminal belongs to, facilitating a correct and structured combination of the two languages.
Example 1: Let \(L_1 = \{ a^n b^n \mid n \geq 0 \}\) (e.g., “”, “ab”, “aabb”) and \(L_2 = \{ a^m b^{2m} \mid m \geq 0 \}\) (e.g., “”, “abb”, “aabbbb”). Their union \(L_1 \cup L_2\) includes strings like “”, “ab”, “abb”, “aabb”, “aabbbb”, etc.
The CFG \(G_1\) for \(L_1\) has the productions \(S \rightarrow aSb \mid \wedge\). The CFG \(G_2\) for \(L_2\) has the productions \(S \rightarrow aSbb \mid \wedge\). To obtain the CFG for their union:
First, add subscripts to non-terminals in \(G_1\): \(S_1 \rightarrow aS_1b \mid \wedge\)
Second, add subscripts to non-terminals in \(G_2\): \(S_2 \rightarrow aS_2bb \mid \wedge\)
Third, add a new production rule: \(S \rightarrow S_1 \mid S_2\)
Finally, the CFG for their union is \(G = (V, \Sigma, R, S)\) with \(V=\{S, S_1, S_2\}\), \(\Sigma=\{a, b\}\), \(R=\{S_1 \rightarrow aS_1b \mid \wedge;\ S_2 \rightarrow aS_2bb \mid \wedge;\ S \rightarrow S_1 \mid S_2\}\), and start symbol \(S\)
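As a sketch (the helper below is our own, not code from this chapter), we can sanity-check this union grammar by enumerating every string it derives up to a length bound, via breadth-first expansion of sentential forms, and comparing against the direct definitions of \(L_1\) and \(L_2\):

```python
from collections import deque

# Rules of the union grammar from Example 1, with space-separated tokens
# and '' encoding the empty string ∧.
rules = {
    'S': ['S1', 'S2'],
    'S1': ['a S1 b', ''],
    'S2': ['a S2 b b', ''],
}
MAX_LEN = 8  # only enumerate strings up to this length

def generate(rules, start, max_len):
    """BFS over sentential forms (token lists); collect terminal strings."""
    results = set()
    queue = deque([[start]])
    seen = {(start,)}
    while queue:
        form = queue.popleft()
        # index of the leftmost non-terminal, if any
        idx = next((i for i, t in enumerate(form) if t in rules), None)
        if idx is None:
            results.add(''.join(form))
            continue
        for rhs in rules[form[idx]]:
            new_form = form[:idx] + rhs.split() + form[idx + 1:]
            # prune: the number of terminals already present can only grow
            if sum(1 for t in new_form if t not in rules) <= max_len:
                key = tuple(new_form)
                if key not in seen:
                    seen.add(key)
                    queue.append(new_form)
    return results

generated = generate(rules, 'S', MAX_LEN)
expected = {'a' * n + 'b' * n for n in range(MAX_LEN // 2 + 1)} | \
           {'a' * m + 'b' * (2 * m) for m in range(MAX_LEN // 3 + 1)}
print(generated == expected)  # True
```

Every short string derived from the new start symbol \(S\) belongs to \(L_1 \cup L_2\), and vice versa, exactly as the construction promises.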
Example 2: The CFG \(G_1\) for \(L_1\) has the productions \(S \rightarrow aS \mid SS \mid SA \mid \wedge; A \rightarrow AA \mid b\). The CFG \(G_2\) for \(L_2\) has the productions \(S \rightarrow AS \mid BS \mid \wedge; A \rightarrow aA \mid a; B \rightarrow bB \mid b\). To obtain the CFG for their union:
First, add subscripts to non-terminals in \(G_1\): \(S_1 \rightarrow aS_1 \mid S_1S_1 \mid S_1A_1 \mid \wedge; A_1 \rightarrow A_1A_1 \mid b\)
Second, add subscripts to non-terminals in \(G_2\): \(S_2 \rightarrow A_2S_2 \mid B_2S_2 \mid \wedge; A_2 \rightarrow aA_2 \mid a; B_2 \rightarrow bB_2 \mid b\)
Third, add a new production rule: \(S \rightarrow S_1 \mid S_2\)
Finally, the CFG for their union is \(G = (V, \Sigma, R, S)\) with \(V=\{S, S_1, A_1, S_2, A_2, B_2\}\), \(\Sigma=\{a, b\}\), \(R=\{S_1 \rightarrow aS_1 \mid S_1S_1 \mid S_1A_1 \mid \wedge;\ A_1 \rightarrow A_1A_1 \mid b;\ S_2 \rightarrow A_2S_2 \mid B_2S_2 \mid \wedge;\ A_2 \rightarrow aA_2 \mid a;\ B_2 \rightarrow bB_2 \mid b;\ S \rightarrow S_1 \mid S_2\}\), and start symbol \(S\).
Note that even though \(G_1\) does not contain the non-terminal \(B\), we still add the subscript 2 to \(B\) in \(G_2\), keeping productions unique and clearly indicating which original grammar each non-terminal belongs to.
3.2 Code Walkthrough: Constructing the Union of Two CFLs#
Concept#
Theorem (Union Closure, §3.1). If \(L_1\) and \(L_2\) are context-free languages, then \(L_1 \cup L_2\) is also a context-free language.
Construction. Given \(G_1 = (V_1, \Sigma_1, R_1, S_1)\) for \(L_1\) and \(G_2 = (V_2, \Sigma_2, R_2, S_2)\) for \(L_2\), build
\(G_3 = (V_1' \cup V_2' \cup \{S\},\ \Sigma_1 \cup \Sigma_2,\ R_1' \cup R_2' \cup \{S \to S_1' \mid S_2'\},\ S)\)
where primes denote subscript-renamed copies of each grammar’s non-terminals. Renaming is required to prevent symbol collisions when \(V_1 \cap V_2 \neq \emptyset\) (both grammars commonly use \(S\), \(A\), \(B\), etc.).
New class: Grammar vs. the earlier CFG#
⚠️ This cell introduces a second CFG class named `Grammar`. It is not the same as the `CFG` class from Section 2. Key differences:

| Feature | `CFG` (Section 2) | `Grammar` (this cell) |
|---|---|---|
| Field name for variables | `variables` | `non_terminals` |
| Field name for start | `start` | `start_symbol` |
| Purpose | Illustrative container | Input/output for union algorithm |

Do not pass a `CFG` object into `union_cfl` — it expects `Grammar` objects.
Dependency and pipeline#
The three functions defined in this cell form a pipeline. The example() function
at the bottom exercises the full pipeline:
```text
Grammar(G1) ──┐
              ├─→ union_cfl(G1, G2) ──→ union_grammar
Grammar(G2) ──┘         │
                        ├─ internally: add_subscripts(G1, "1") → G₁' (renamed copy)
                        └─ internally: add_subscripts(G2, "2") → G₂' (renamed copy)
```
| Function | Formal step it implements |
|---|---|
| `add_subscripts` | Renames every \(X \in V\) to \(X_i\), producing \(G_i'\) |
| `union_cfl` | Builds \(G_3\): merges \(G_1'\) and \(G_2'\), adds \(S \to S_1' \mid S_2'\) |
| `example` | Instantiates concrete grammars and prints the union result |
What to look for#
Focus on add_subscripts: it splits each production string on whitespace before
replacing non-terminal names. This avoids the substring-replacement bug where a
non-terminal S would incorrectly match inside a longer token like SS or SA.
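A minimal illustration of that design choice, using a hypothetical pair of variables `S` and `SA` whose names overlap (these names are chosen only to trigger the bug):

```python
nt_map = {"S": "S_1", "SA": "SA_1"}   # hypothetical non-terminals with a shared prefix

prod = "S SA b"                        # two non-terminals followed by a terminal

# Naive substring replacement corrupts the longer name:
naive = prod
for old, new in nt_map.items():
    naive = naive.replace(old, new)
print(naive)   # 'S_1 S_1A b' — the 'S' inside 'SA' was wrongly rewritten

# Whole-token mapping (the approach add_subscripts takes) is safe:
safe = " ".join(nt_map.get(tok, tok) for tok in prod.split())
print(safe)    # 'S_1 SA_1 b'
```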
Running outside Jupyter#
Save the cell (including example()) to union_cfl.py. Replace the last block with:
```python
if __name__ == "__main__":
    example()
```

Then run:

```shell
python union_cfl.py
```
No Jupyter-specific calls are present in this cell.
```python
# ── FORMAL GROUNDING ────────────────────────────────────────────────────────
# Implements Theorem 3.1: CFLs are closed under union.
# Construction: G₃ = (V₁'∪V₂'∪{S}, Σ₁∪Σ₂, R₁'∪R₂'∪{S→S₁'|S₂'}, S)
# Reference: Sipser §2.1 Theorem 2.12; Cohen Chapter 16
# ────────────────────────────────────────────────────────────────────────────
class Grammar:
    """
    CFG container for the union construction pipeline.
    Stores G = (non_terminals, terminals, rules, start_symbol).
    Note: field names differ from the lightweight CFG class in Section 2.
    """

    def __init__(self, non_terminals, terminals, rules, start_symbol):
        self.non_terminals = non_terminals  # V — set of non-terminal symbols
        self.terminals = terminals          # Σ — set of terminal symbols
        self.rules = rules                  # R — dict: lhs → [rhs, ...]
        self.start_symbol = start_symbol    # S ∈ V

    def __str__(self):
        grammar_str = f"Start Symbol: {self.start_symbol}\n"
        grammar_str += "Production Rules:\n"
        for nt, productions in self.rules.items():
            rhs = " | ".join(productions)
            grammar_str += f"{nt} -> {rhs}\n"
        return grammar_str


# ── STEP 1 OF CONSTRUCTION: rename non-terminals to avoid collisions ─────────
def add_subscripts(grammar, subscript):
    """
    Produce G_i' by appending `subscript` to every non-terminal in `grammar`.
    Formal role: ensures V₁ ∩ V₂ = ∅ before merging into G₃.
    Example: non-terminal 'S' with subscript '1' → 'S_1'.
    Design note: productions are split on whitespace before replacement.
    This prevents the substring bug where 'S' would match inside 'SS' or 'SA'.
    """
    # ── Phase 1a: build the old-name → new-name mapping for every variable ──
    nt_map = {nt: f"{nt}_{subscript}" for nt in grammar.non_terminals}
    # ── Phase 1b: rename the variable set itself ──
    new_non_terminals = {nt_map[nt] for nt in grammar.non_terminals}
    # ── Phase 1c: rename every occurrence of each variable inside productions ──
    new_rules = {}
    for nt, productions in grammar.rules.items():
        new_nt = nt_map[nt]  # rename the left-hand side
        new_productions = []
        for prod in productions:
            new_prod = prod
            # Split on whitespace so tokens like 'A B' become ['A', 'B'];
            # replace whole tokens only — avoids matching 'S' inside 'SS'.
            parts = new_prod.split()
            for i, part in enumerate(parts):
                if part in nt_map:  # part is a known non-terminal
                    parts[i] = nt_map[part]
            new_prod = " ".join(parts)
            new_productions.append(new_prod)
        new_rules[new_nt] = new_productions
    return Grammar(
        non_terminals=new_non_terminals,
        terminals=grammar.terminals,  # Σ is unchanged — terminals never renamed
        rules=new_rules,
        start_symbol=nt_map[grammar.start_symbol]
    )


# ── STEP 2 OF CONSTRUCTION: merge two renamed grammars under a new start ─────
def union_cfl(grammar1, grammar2):
    """
    Build G₃ = (V₁'∪V₂', Σ₁∪Σ₂, R₁'∪R₂'∪{S→S₁'|S₂'}, S).
    Implements the proof of Theorem 3.1 (union closure for CFLs).
    Calls add_subscripts first so V₁' and V₂' are guaranteed disjoint.
    """
    # ── Phase 1: rename both grammars to make their variable sets disjoint ──
    g1 = add_subscripts(grammar1, "1")  # produces G₁' with V₁' = {X_1 | X ∈ V₁}
    g2 = add_subscripts(grammar2, "2")  # produces G₂' with V₂' = {X_2 | X ∈ V₂}
    # ── Phase 2: choose a fresh start symbol S ∉ V₁' ∪ V₂' ──────────────────
    # The new production will be S → S₁' | S₂' (non-deterministic choice).
    new_start = "S"
    while new_start in g1.non_terminals or new_start in g2.non_terminals:
        new_start += "'"  # keep appending ' until the name is genuinely fresh
    # ── Phase 3: merge the two grammars ──────────────────────────────────────
    # V₃ = V₁' ∪ V₂' ∪ {S}
    union_non_terminals = g1.non_terminals | g2.non_terminals | {new_start}
    # Σ₃ = Σ₁ ∪ Σ₂ (use originals, not subscripted copies — terminals unchanged)
    union_terminals = grammar1.terminals | grammar2.terminals
    # R₃ = R₁' ∪ R₂' ∪ {S → S₁' | S₂'}
    union_rules = {**g1.rules, **g2.rules}
    # The new start rule: S → S₁' (derives from L₁) or S → S₂' (derives from L₂)
    union_rules[new_start] = [g1.start_symbol, g2.start_symbol]
    return Grammar(
        non_terminals=union_non_terminals,
        terminals=union_terminals,
        rules=union_rules,
        start_symbol=new_start
    )


# ── EXAMPLE: union of {aⁿbⁿ | n ≥ 1} and {aⁿbᵐ | n,m ≥ 1} ─────────────────
def example():
    # Grammar 1: generates L₁ = {aⁿbⁿ | n ≥ 1}
    # Productions: S → A, A → aAb | ab
    grammar1 = Grammar(
        non_terminals={"S", "A"},
        terminals={"a", "b"},
        rules={
            "S": ["A"],
            "A": ["a A b", "a b"]  # recursive | base case
        },
        start_symbol="S"
    )
    # Grammar 2: generates L₂ = {aⁿbᵐ | n,m ≥ 1}
    # Separates the a-block (A) from the b-block (B) with independent recursion
    grammar2 = Grammar(
        non_terminals={"S", "A", "B"},
        terminals={"a", "b"},
        rules={
            "S": ["A B"],
            "A": ["a A", "a"],  # one or more a's
            "B": ["b B", "b"]   # one or more b's
        },
        start_symbol="S"
    )
    union_grammar = union_cfl(grammar1, grammar2)
    print("Grammar 1 (a^n b^n for n ≥ 1):")
    print(grammar1)
    print("\nGrammar 2 (a^n b^m for n ≥ 1, m ≥ 1):")
    print(grammar2)
    print("\nUnion Grammar:")
    print(union_grammar)

if __name__ == "__main__":
    example()
```
```text
Grammar 1 (a^n b^n for n ≥ 1):
Start Symbol: S
Production Rules:
S -> A
A -> a A b | a b

Grammar 2 (a^n b^m for n ≥ 1, m ≥ 1):
Start Symbol: S
Production Rules:
S -> A B
A -> a A | a
B -> b B | b

Union Grammar:
Start Symbol: S
Production Rules:
S_1 -> A_1
A_1 -> a A_1 b | a b
S_2 -> A_2 B_2
A_2 -> a A_2 | a
B_2 -> b B_2 | b
S -> S_1 | S_2
```
3.3 Concatenation (Product)#
Definition: The concatenation operation, also known as the product, is a fundamental operation in formal language theory. Given two languages \(L_1\) and \(L_2\), their concatenation \(L_1L_2\) is the set of all strings formed by taking a string from \(L_1\) and appending a string from \(L_2\) to it:
\(L_1L_2 = \{ xy \mid x \in L_1 \text{ and } y \in L_2 \}\)
Theorem: If \(L_1\) and \(L_2\) are context-free languages, then their concatenation \(L_1L_2\) is also a context-free language.
Proof: Suppose we have CFGs \(G_1 = (V_1, \Sigma_1, R_1, S_1)\) for \(L_1\) and \(G_2 = (V_2, \Sigma_2, R_2, S_2)\) for \(L_2\), with subscripts added to their non-terminals so that \(V_1 \cap V_2 = \emptyset\). We can construct a new grammar \(G_3 = (V_1 \cup V_2 \cup \{S\}, \Sigma_1 \cup \Sigma_2, R_1 \cup R_2 \cup \{S \rightarrow S_1S_2\}, S)\), where \(S_1\) and \(S_2\) are the start symbols of \(G_1\) and \(G_2\), respectively. This new grammar generates \(L_1L_2\). Let’s verify that \(L(G_3) = L(G_1)L(G_2)\):
For any string \(w \in L(G_3)\), the derivation must start with \(S \rightarrow S_1S_2\).
Then, \(S_1\) derives some string \(x \in L(G_1)\) using rules from \(R_1\).
And \(S_2\) derives some string \(y \in L(G_2)\) using rules from \(R_2\).
So, \(w = xy \text{ where } x \in L(G_1) \text{ and } y \in L(G_2)\), which means \(w \in L(G_1)L(G_2)\).
Conversely, for any string \(w \in L(G_1)L(G_2)\), we have \(w = xy \text{ where } x \in L(G_1) \text{ and } y \in L(G_2)\):
Since \(x \in L(G_1)\), there exists a derivation \(S_1 \overset{*}\Rightarrow x\) using rules from \(R_1\).
Since \(y \in L(G_2)\), there exists a derivation \(S_2 \overset{*}\Rightarrow y\) using rules from \(R_2\).
Using the rule \(S \rightarrow S_1S_2\), we get \(S \Rightarrow S_1S_2 \overset{*}\Rightarrow xy = w\). So, \(w \in L(G_3)\).
Note: the asterisk (*) above the arrow notation \(\overset{*}\Rightarrow\) means zero or more steps of derivation. So, \(S_1 \overset{*}\Rightarrow x\) means that the symbol \(S_1\) can derive \(x\) through a series of production rule applications.
Example: Let \(L_1 = \{a^nb^n \mid n \geq 1\}\) and \(L_2 = \{c^md^m \mid m \geq 1\}\). Their product (concatenation) is \(L = L_1L_2 = \{a^nb^nc^md^m \mid n, m \geq 1 \}\). For example, if we concatenate \(aabb\) from \(L_1\) with \(ccdd\) from \(L_2\), we get \(aabbccdd \in L_1L_2\). Let us construct a CFG for \(L_1L_2\):
The production rules for \(L_1\) are \(S \rightarrow aSb \mid ab\) and the production rules for \(L_2\) are \(S \rightarrow cSd \mid cd\). To obtain the CFG for their concatenation:
First, add subscripts to non-terminals in \(L_1\): \(S_1 \rightarrow aS_1b \mid ab\)
Second, add subscripts to non-terminals in \(L_2\): \(S_2 \rightarrow cS_2d \mid cd\)
Third, add a new production rule: \(S \rightarrow S_1S_2\)
Finally, the CFG for their concatenation is \(G = (V, \Sigma, R, S)\) with \(V=\{S, S_1, S_2\}\), \(\Sigma=\{a, b, c, d\}\), \(R=\{S_1 \rightarrow aS_1b \mid ab;\ S_2 \rightarrow cS_2d \mid cd;\ S \rightarrow S_1S_2\}\), and start symbol \(S\)
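As a hedged sketch (the helper names are our own), this concatenation grammar can be spot-checked by generating the strings each subscripted start symbol derives within a few rule applications and concatenating them, then comparing against the definition of \(L_1L_2\):

```python
from itertools import product

def derive_s1(depth):
    """Strings S_1 derives within `depth` applications of S_1 -> a S_1 b | ab."""
    out = {'ab'}
    for _ in range(depth):
        out |= {'a' + w + 'b' for w in out}
    return out

def derive_s2(depth):
    """Strings S_2 derives within `depth` applications of S_2 -> c S_2 d | cd."""
    out = {'cd'}
    for _ in range(depth):
        out |= {'c' + w + 'd' for w in out}
    return out

# S -> S_1 S_2: every string of the concatenation grammar is x + y
concat = {x + y for x, y in product(derive_s1(3), derive_s2(3))}

expected = {'a' * n + 'b' * n + 'c' * m + 'd' * m
            for n in range(1, 5) for m in range(1, 5)}
print(concat == expected)  # True
```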
3.4 Kleene Closure (Kleene Star)#
Definition: The Kleene closure (or Kleene star) is a unary operation on a language that produces a new language consisting of all possible concatenations of zero or more strings from the original language. Formally, for a language \(L\), the Kleene closure \(L^*\) is defined as:
\(L^* = \{ w_1 w_2 \cdots w_n \mid n \geq 0 \text{ and each } w_i \in L \}\)
where \(n = 0\) contributes the empty string (\(\wedge\)) to \(L^*\); \(n = 1\) implies \(L \subseteq L^*\); \(n = 2\) implies \(LL \subseteq L^*\), and so on. Equivalently, \(L^*\) can be written as the union \(L^* = \{\wedge \} \cup L \cup LL \cup LLL \cup \ldots\). This operation allows us to express “zero or more repetitions” of strings from a language, a concept widely used in pattern matching, lexical analysis, and parsing.
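A small illustrative sketch of this definition (the function name is our own): approximating \(L^*\) for a finite sample language by taking all concatenations of up to \(k\) factors, i.e. \(L^0 \cup L^1 \cup \cdots \cup L^k\):

```python
from itertools import product

def kleene_upto(L, k):
    """All concatenations of up to k strings from L (a finite slice of L*)."""
    result = {''}                      # n = 0 contributes the empty string
    for n in range(1, k + 1):
        result |= {''.join(ws) for ws in product(L, repeat=n)}
    return result

L = {'ab', 'c'}
print(sorted(kleene_upto(L, 2)))
# ['', 'ab', 'abab', 'abc', 'c', 'cab', 'cc']
```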
Theorem: If \(L\) is a context-free language, then its Kleene closure \(L^*\) is also a context-free language.
Proof: Given a CFL \(L\) generated by a CFG \(G = (V, \Sigma, R, S)\), we construct a CFG \(G_1 = (V_1, \Sigma_1, R_1, S_1)\) that generates \(L^*\), as follows:
\(V_1 = V \cup \{S_1\}\), where \(S_1\) is a new variable not in \(V\)
\(\Sigma_1 = \Sigma\)
\(R_1 = R \cup \{S_1 \rightarrow SS_1 \mid \wedge\}\), and \(S_1\) is the start symbol of the newly constructed CFG.
Let’s verify that \(L(G_1) = L^*\):
For any string \(w \in L(G_1)\), there are two cases for the initial step in its derivation:
Case 1: \(S_1 \Rightarrow \wedge\), so \(w = \wedge\), which is in \(L^*\) by definition.
Case 2: \(S_1 \Rightarrow SS_1 \overset{*}\Rightarrow xy \text{ where } S \overset{*}\Rightarrow x \text{ and } S_1 \overset{*}\Rightarrow y\).
Since \(S\) is the start symbol of \(G\), by definition, \(x \in L\).
Inductively, \(y \in L(G_1)\) as it is generated by \(S_1\) in \(G_1\).
So \(w = xy, \text{ where } x \in L \text{ and } y \in L(G_1)\), which means \(w \in L^*\).
Conversely, for any string \(w \in L^*\), we have \(w = w_1w_2...w_n, \text{ where each } w_i \in L, n \geq 0\).
If n = 0, then \(w = \wedge\), and \(S_1 \Rightarrow \wedge\) directly by the rule \(S_1 \rightarrow \wedge\).
If n ≥ 1, then \(w = w_1v, \text{ where } w_1 \in L \text{ and } v = w_2w_3...w_n \in L^*\).
Since \(w_1 \in L\), there’s a derivation \(S \overset{*}\Rightarrow w_1\).
Inductively, since \(v \in L^*\), there’s a derivation \(S_1 \overset{*}\Rightarrow v\).
Using the rule \(S_1 \rightarrow SS_1\), we get \(S_1 \Rightarrow SS_1 \overset{*}\Rightarrow w_1S_1 \overset{*}\Rightarrow w_1v = w\)
So, \(w \in L(G_1)\).
Therefore, \(L(G_1) = L^*\), proving that the Kleene closure of a context-free language is context-free.
Note that in the constructed CFG we usually swap the names \(S\) and \(S_1\) in all productions, so that \(S\) remains the start symbol of the new grammar while \(S_1\) refers to the start symbol of the original grammar. With that convention, the procedure for obtaining a CFG for \(L^*\) becomes:
First, rename the start symbol \(S\) to \(S_1\) throughout the grammar for \(L\)
Second, add new production rules: \(S \rightarrow S_1S \mid \wedge\)
Example 1: Language \(L\) has production rules \(S \rightarrow AS \mid BS \mid a, A \rightarrow aA \mid a, B \rightarrow bB \mid b\). The production rules for \(L^*\) can be defined as follows:
First, rename the start symbol \(S\) to \(S_1\): \(S_1 \rightarrow AS_1 \mid BS_1 \mid a,\ A \rightarrow aA \mid a,\ B \rightarrow bB \mid b\)
Second, add new production rules: \(S \rightarrow S_1S \mid \wedge\)
Example 2: Let \(L = \{a^nb^m \mid n, m \geq 1\}\). The CFG for the given language has production rules \(S \rightarrow AB, A \rightarrow aA \mid a, B \rightarrow bB \mid b\). The production rules for the Kleene star of \(L\) can be defined as follows:
First, rename the start symbol \(S\) of the grammar for \(L\) to \(S_1\): \(S_1 \rightarrow AB,\ A \rightarrow aA \mid a,\ B \rightarrow bB \mid b\)
Second, add new production rules: \(S \rightarrow S_1S \mid \wedge\)
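The two-step procedure above can be sketched as a function over a rules dict, using the same space-separated-token convention as the Grammar class in Section 3.2 ('' encodes \(\wedge\); the function name is our own):

```python
def kleene_star_grammar(rules, start):
    """Rename `start` to `start`_1 everywhere, then add S -> S_1 S | ''."""
    renamed_start = f"{start}_1"
    new_rules = {}
    for lhs, alts in rules.items():
        new_lhs = renamed_start if lhs == start else lhs
        new_alts = []
        for alt in alts:
            # whole-token rename of the old start symbol on each right-hand side
            toks = [renamed_start if t == start else t for t in alt.split()]
            new_alts.append(' '.join(toks))
        new_rules[new_lhs] = new_alts
    new_rules[start] = [f"{renamed_start} {start}", '']   # S -> S_1 S | ∧
    return new_rules

# Example 2 from the text: L = {a^n b^m | n, m >= 1}
rules = {'S': ['A B'], 'A': ['a A', 'a'], 'B': ['b B', 'b']}
star = kleene_star_grammar(rules, 'S')
for lhs, alts in star.items():
    print(lhs, '->', ' | '.join(a if a else '∧' for a in alts))
# S_1 -> A B
# A -> a A | a
# B -> b B | b
# S -> S_1 S | ∧
```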
4. Non-Closure Properties: Intersection and Complement#
Unlike regular languages, context-free languages (CFLs) do not remain within the same class under all operations. Two significant operations that do not always preserve context-free properties are intersection and complement.
4.1 Intersection#
CFLs are NOT closed under intersection. This means that there exist context-free languages whose intersection results in a language that is not context-free.
4.1.1 CFLs are closed under Intersection if They are Regular Languages#
Regular languages are a subset of context-free languages. This means that every regular language is also a context-free language.
Since regular languages are closed under intersection, if \(L_1, L_2\) are both regular, their intersection \(L_1 \cap L_2\) is also regular.
Because all regular languages are also context-free, their intersection remains both regular and context-free.
4.1.2 Intersection between a CFL and a regular language#
The intersection of a regular language with a context-free language is always context-free. To prove this, the idea is that CFLs are recognized by PDAs, while regular languages are recognized by finite automata (FAs). We construct a new machine that simulates both. Due to space constraints, interested readers are encouraged to consult the further reading for the complete proof details.
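A brute-force illustration of this fact (not the product construction itself): intersecting the CFL \(\{a^nb^n\}\) with the regular language \((aa)^*(bb)^*\) over short strings yields exactly \(\{a^{2k}b^{2k}\}\), which is again context-free:

```python
import re
from itertools import product as cartesian

def in_cfl(s):
    """Membership in the CFL {a^n b^n | n >= 0}."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == 'a' * n + 'b' * n

regular = re.compile(r'(aa)*(bb)*')    # a regular language over {a, b}

# Enumerate all strings over {a, b} up to length 8 and keep those in both.
strings = [''.join(p) for k in range(0, 9) for p in cartesian('ab', repeat=k)]
both = [s for s in strings if in_cfl(s) and regular.fullmatch(s)]
print(both)  # ['', 'aabb', 'aaaabbbb']  — the strings a^{2k} b^{2k}
```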
4.1.3 Example of CFLs Non-Closure under Intersection#
Consider the following two context-free languages:
\(L_1 = \{a^nb^na^m \mid n,m \geq 1\}\)
\(L_2 = \{a^mb^na^n \mid n,m \geq 1\}\)
Language \(L_1\) is context-free because we can use the following CFG production rules to generate it: \(S \rightarrow XA;\ X \rightarrow aXb \mid ab;\ A \rightarrow aA \mid a\)
Language \(L_2\) is also context-free because we can use the following CFG production rules to generate it: \(S \rightarrow AX;\ X \rightarrow bXa \mid ba;\ A \rightarrow aA \mid a\)
Now consider their intersection: \(L_1 \cap L_2 = \{a^nb^na^n \mid n \geq 1\}\). This language requires an equal number of a’s, b’s, and then a’s, which is known to be NOT context-free. The proof of its non-context-freeness relies on the pumping lemma for CFLs and can be found in the further reading. Since the intersection of two given CFLs resulted in a non-CFL, this proves that CFLs are not closed under intersection.
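This example can be checked by brute force over short strings; the membership predicates below are our own direct encodings of \(L_1\) and \(L_2\):

```python
import re
from itertools import product

def in_L1(s):
    """L1 = {a^n b^n a^m | n, m >= 1}."""
    m = re.fullmatch(r'(a+)(b+)(a+)', s)
    return bool(m) and len(m.group(1)) == len(m.group(2))

def in_L2(s):
    """L2 = {a^m b^n a^n | n, m >= 1}."""
    m = re.fullmatch(r'(a+)(b+)(a+)', s)
    return bool(m) and len(m.group(2)) == len(m.group(3))

# Every short string in both languages has the form a^n b^n a^n.
strings = (''.join(p) for k in range(1, 10) for p in product('ab', repeat=k))
both = [s for s in strings if in_L1(s) and in_L2(s)]
print(both)  # ['aba', 'aabbaa', 'aaabbbaaa']
```

The regex decomposition \((a^+)(b^+)(a^+)\) is unambiguous on such strings, so the length comparisons are well defined.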
4.1.4 Summary of Closure Properties#
| Language Class | Closed Under Intersection? |
|---|---|
| Regular | ✅ Yes (result is regular and also context-free) |
| Context-Free | ❌ No (in some cases, intersection may not be context-free) |
| CFL ∩ Regular | ✅ Yes (result is context-free) |
4.2 Complement#
CFLs are not closed under complement. This means that if a language \(L\) is context-free, its complement \(\overline{L}\) (i.e., all strings not in \(L\)) may not necessarily be context-free.
4.2.1 Proof#
Let us assume that CFLs were closed under complement. That is, for any context-free language \(L\), \(\overline{L}\) is also context-free.
If \(L_1\) and \(L_2\) are two CFLs, then under our assumption, their complements, \(\overline{L_1}\) and \(\overline{L_2}\), must also be CFLs.
Since the union of two CFLs is known to be a CFL, it follows that \(\overline{L_1} \cup \overline{L_2}\) is a CFL. Then under our assumption, the complement of \(\overline{L_1} \cup \overline{L_2}\), which is \(\overline{\overline{L_1} \cup \overline{L_2}}\), must also be a CFL.
De Morgan’s Law establishes the following relationship: \(L_1 \cap L_2 = \overline{\overline{L_1} \cup \overline{L_2}}\). This implies that the language of \(\overline{\overline{L_1} \cup \overline{L_2}}\) is equivalent to \(L_1 \cap L_2\), representing the intersection of \(L_1\) and \(L_2\).
However, we have already proven that CFLs are not closed under intersection. Since closure under complement would imply closure under intersection (which we know is false), the assumption that CFLs are closed under complement must be incorrect.
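The De Morgan step can be illustrated over a finite universe, where the identity is easy to verify mechanically (the sample languages here are arbitrary):

```python
from itertools import product

# Finite universe: all strings over {a, b} up to length 4.
U = {''.join(p) for k in range(0, 5) for p in product('ab', repeat=k)}
L1 = {s for s in U if s.startswith('a')}   # arbitrary sample language
L2 = {s for s in U if s.endswith('b')}     # arbitrary sample language

lhs = L1 & L2                              # L1 ∩ L2
rhs = U - ((U - L1) | (U - L2))            # complement of (co-L1 ∪ co-L2)
print(lhs == rhs)  # True — De Morgan's law
```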
4.2.2 CFLs are Closed Under Complement if They are Regular Languages#
Since every regular language is a context-free language, and regular languages are closed under complement, their complements remain regular and thus context-free. Therefore, some context-free languages have complements that are also context-free. This does not contradict the general result that CFLs are not closed under complement. It simply identifies a special case where complementation preserves context-freeness.
5. The Pumping Lemma for CFLs#
5.1 Introduction to the Pumping Lemma#
The Pumping Lemma for CFLs is a tool used to prove that certain languages are not context-free. It works by showing that any sufficiently long word in a CFL can be “pumped”, meaning part of it can be repeated while still remaining in the language. If a language does not have this property, then it is not context-free.
5.2 Statement of the Pumping Lemma#
If \(L\) is a context-free language, then there exists a constant \(p > 0\) (called the pumping length) such that any string \(s\) in \(L\) with \(|s| \geq p\) can be split into five parts, \(s = uvxyz\), such that:
\(|vxy| \leq p\) (the middle section is not too long, ensuring the repetition happens “soon enough” in a sufficiently long string)
\(|vy| \geq 1\) (at least one of \(v\) and \(y\) is non-empty, since we need something to pump)
\(uv^nxy^nz \in L \text{ for all } n \geq 0\) (the string remains in the language when \(v\) and \(y\) are repeated or removed simultaneously).
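A small helper (our own naming) makes condition 3 concrete: for the CFL \(\{a^kb^k\}\), one valid split of \(aabb\) pumps forever within the language, exactly as the lemma guarantees for some split:

```python
def pump(u, v, x, y, z, n):
    """Return u v^n x y^n z, the pumped version of s = uvxyz."""
    return u + v * n + x + y * n + z

def in_anbn(s):
    """Membership in {a^k b^k | k >= 0}."""
    k = len(s) // 2
    return len(s) % 2 == 0 and s == 'a' * k + 'b' * k

# Split s = 'aabb' as u='a', v='a', x='', y='b', z='b' (|vxy| small, |vy| >= 1).
print(all(in_anbn(pump('a', 'a', '', 'b', 'b', n)) for n in range(5)))  # True
```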
5.3 Intuitive Explanation of the Pumping Lemma#
The pumping lemma is based on the structure of parse trees for context-free grammars. For any sufficiently long string, the parse tree must be sufficiently deep, causing some non-terminal symbol to appear multiple times on a path from the root to a leaf. This repetition creates a “pumpable” section in the parse tree, allowing certain parts of the string to be repeated while maintaining grammatical correctness. The key insight is that these repetitions must occur in balanced pairs: we can’t pump one part without pumping its corresponding part elsewhere in the string.
The following figure (Figure 1) illustrates a parse tree for a string that has been divided into five parts: \(u, v, x, y, z\). This specific structure reveals why we can “pump” certain parts of strings in CFLs. Notice how in the tree, there are two occurrences of the non-terminal \(A\), one at the top and another in the middle of the tree. When a parse tree gets tall enough (for long strings), some non-terminal symbol must repeat along a path from root to leaf. This creates a “loop” in the derivation process. In the diagram:
The section \(v\) is generated under the first occurrence of \(A\) via the \(B\) branch
The section \(y\) is generated under the first occurrence of \(A\) via the \(C\) branch
Because of this structure, when the non-terminal \(A\) appears again in the derivation, we can repeat this entire subtree any number of times. This causes both \(v\) and \(y\) to repeat in a synchronized way.
5.4 Visualization Setup: Parse Tree for the Pumping Lemma#
Note: The next two code cells are setup and rendering only — they do not implement a formal-language algorithm. Their purpose is to display Figure 1, a parse-tree diagram illustrating why the Pumping Lemma works. Read the output figure, not the code, for the theoretical content.
The parse tree shows a string \(s = uvxyz\) split into five parts. The key structural observation is that a non-terminal \(A\) appears twice on the same root-to-leaf path. This repetition is what makes pumping possible: the subtree rooted at the lower \(A\) can be substituted into the subtree rooted at the upper \(A\) any number of times, producing \(uv^nxy^nz\) for any \(n \geq 0\).
What to look for in the figure:
| Figure element | Formal meaning |
|---|---|
| Node labelled \(S\) at root | Derivation starts here |
| Node labelled \(A\) (upper) | First occurrence of the repeated non-terminal |
| Node labelled \(A\) (lower) | Second occurrence — creates the “pump loop” |
| Leaves \(u, v, x, y, z\) | The five-part decomposition of \(s\) |
| Subtree \(B\) under upper \(A\) | Generates the portion \(v\) |
| Subtree \(C\) under upper \(A\) | Generates the portion \(y\) |
Condition \(|vxy| \leq p\) corresponds to choosing the repeated non-terminal within the lowest \(|V| + 1\) levels of a longest path, so the subtree rooted at the upper \(A\) yields at most \(p = b^{|V|+1}\) symbols, where \(b\) is the maximum number of symbols on the right-hand side of any rule (cf. Sipser §2.3).
The cell below simply checks that matplotlib is available. If you are running in
a fresh virtual environment, it will install the package automatically.
```python
# Install required Python packages using pip
import sys
import importlib.util
import subprocess

def install_if_missing(package):
    if importlib.util.find_spec(package) is None:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"{package} has been installed")
    else:
        print(f"{package} is already installed")

required_packages = ['matplotlib']
for package in required_packages:
    install_if_missing(package)
```
```text
matplotlib is already installed
```
5.5 Figure 1: Parse Tree Diagram#
The cell below loads and renders the parse-tree SVG file
(parse_tree_for_pumping_lemma.svg) that must be present in the same directory as
this notebook. This is a visualization-only cell — it does not compute anything.
Reading the diagram in terms of the Pumping Lemma statement:
The Pumping Lemma asserts: for any CFL \(L\) there exists \(p > 0\) such that any \(s \in L\) with \(|s| \geq p\) can be written as \(s = uvxyz\) satisfying:
\(|vxy| \leq p\) — the pumpable section is not too far from the root
\(|vy| \geq 1\) — at least one of \(v\), \(y\) is non-empty (something actually pumps)
\(uv^nxy^nz \in L\) for all \(n \geq 0\) — pumping preserves membership
The figure makes condition 3 visually clear: because the same non-terminal \(A\) appears at two levels of the tree, you can replace the upper \(A\)-subtree with the lower one (removing one copy of \(v\) and \(y\), giving \(n=0\)) or insert the upper \(A\)-subtree inside itself repeatedly (adding more copies, giving \(n \geq 2\)).
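The pumping operation itself is easy to express in code. The sketch below (helper names `pump` and `in_anbn` are illustrative, not from the notebook's earlier cells) builds \(uv^nxy^nz\) for a decomposition and checks, for the CFL \(\{a^kb^k \mid k \geq 0\}\), that every pumped string stays in the language:

```python
def pump(u, v, x, y, z, n):
    """Return the pumped string u v^n x y^n z."""
    return u + v * n + x + y * n + z

def in_anbn(s):
    """Membership test for { a^k b^k : k >= 0 } (illustrative helper)."""
    k = len(s) // 2
    return len(s) % 2 == 0 and s == 'a' * k + 'b' * k

# The string aabb decomposes as u='a', v='a', x='', y='b', z='b';
# pumping v and y together keeps the string in the language.
for n in range(4):
    t = pump('a', 'a', '', 'b', 'b', n)
    print(n, t, in_anbn(t))
```

Every line prints `True`, illustrating condition 3: pumping this decomposition never leaves the language.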
Running outside Jupyter. The display(HTML(...)) call is Jupyter-specific. To
view the figure outside Jupyter, open parse_tree_for_pumping_lemma.svg in any
browser, or convert it to PNG with:
# Requires Inkscape (1.x syntax; older 0.92 releases used --export-png= instead)
inkscape parse_tree_for_pumping_lemma.svg --export-filename=parse_tree.png
# In a code cell:
from IPython.display import HTML, display

# Load SVG from file
def load_svg_from_file(filename):
    """Load SVG content from a file."""
    with open(filename, 'r', encoding='utf-8') as f:
        svg_content = f.read()
    print(f"SVG loaded from {filename}")
    return svg_content

loaded_svg = load_svg_from_file('parse_tree_for_pumping_lemma.svg')
html_with_caption = f"""
<figure style="text-align: center;">
{loaded_svg}
<figcaption style="margin-top: 10px; font-style: italic; color: #666;">
Figure 1: Parse tree diagram showing a string that has been divided into five parts
</figcaption>
</figure>
"""
display(HTML(html_with_caption))
SVG loaded from parse_tree_for_pumping_lemma.svg
5.6 Using the Pumping Lemma to Prove a Language is Not Context-Free#
For any CFL, all sufficiently long strings must have this pumpable structure. If we can find a string in a language that cannot be pumped this way, then the language cannot be context-free. To prove that a language \(L\) is not context-free, we assume it is context-free and satisfies the pumping lemma. Then, we find a string in \(L\) that cannot be pumped while still remaining in \(L\), leading to a contradiction.
5.6.1 Example 1: Language \(L = \{a^nb^na^n \mid n \geq 0\}\)#
Prove Language \(L = \{a^nb^na^n \mid n \geq 0\}\) is not a CFL:
Step 1: Assume \(L = \{a^nb^na^n \mid n \geq 0\}\) is context-free, then the pumping lemma applies with some pumping length \(p\)
Step 2: Choose \(s = a^pb^pa^p\); clearly \(s \in L\) and \(|s| \geq p\).
Step 3: According to the pumping lemma, \(s\) can be written as \(uvxyz\) where \(|vy| \geq 1 \text{ and } |vxy| \leq p\)
Step 4: Since \(|vxy| \leq p\), the window \(vxy\) spans at most \(p\) consecutive characters of \(s\), so it cannot contain \(a\)'s from both the first and the last block (they are separated by \(p\) \(b\)'s). Pumping therefore changes at most two adjacent blocks.
Step 5: Let’s analyze the possible cases for \(v\) and \(y\):
Case 1: \(v\) and \(y\) contain only \(a\)'s (necessarily from the same block): pumping changes that block’s length while the other \(a\)-block keeps exactly \(p\) \(a\)'s, so the two \(a\)-blocks become unequal.
Case 2: \(v\) and \(y\) contain only \(b\)'s: pumping changes the number of \(b\)'s while both \(a\)-blocks keep \(p\) \(a\)'s each.
Case 3: one of \(v\), \(y\) contains only \(a\)'s and the other only \(b\)'s: pumping changes two adjacent blocks, but the third block keeps exactly \(p\) symbols, so the three blocks are no longer equal in length.
Case 4: \(v\) or \(y\) contains both \(a\)'s and \(b\)'s: pumping with \(i = 2\) creates extra alternations between \(a\)'s and \(b\)'s, so the result no longer has the three-block shape \(a^*b^*a^*\).
Step 6: In all cases, \(uv^ixy^iz \notin L\) for \(i = 2\) (or \(i = 0\)), contradicting the pumping lemma
Conclusion: \(L = \{a^nb^na^n \mid n \geq 0\}\) is not context-free.
Concrete example with \(p = 3\) and \(s = aaabbbaaa\):
One decomposition satisfying \(|vxy| \leq p\): \(u=\wedge, v=a, x=a, y=a, z=bbbaaa\).
Pumping with i=2: \(uv^2xy^2z = aaaaabbbaaa \notin L\).
Pumping with i=0: \(uv^0xy^0z = abbbaaa \notin L\).
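This case analysis can be checked mechanically for small \(p\). The sketch below (helper names are illustrative) brute-forces every split \(s = uvxyz\) with \(|vxy| \leq p\) and \(|vy| \geq 1\), and confirms that none keeps both \(uv^0xy^0z\) and \(uv^2xy^2z\) in \(L\):

```python
def in_ababa(s):
    """Membership test for L = { a^n b^n a^n : n >= 0 }."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == 'a' * n + 'b' * n + 'a' * n

def pumpable(s, p, in_L):
    """Search for a split s = uvxyz with |vxy| <= p and |vy| >= 1 that
    keeps u v^i x y^i z in L for i in {0, 2} (a necessary condition)."""
    n = len(s)
    for a in range(n + 1):                      # u = s[:a]
        for d in range(a, min(a + p, n) + 1):   # window vxy = s[a:d]
            for b in range(a, d + 1):           # v = s[a:b]
                for c in range(b, d + 1):       # x = s[b:c], y = s[c:d]
                    if (b - a) + (d - c) == 0:  # require |vy| >= 1
                        continue
                    u, v, x, y, z = s[:a], s[a:b], s[b:c], s[c:d], s[d:]
                    if all(in_L(u + v*i + x + y*i + z) for i in (0, 2)):
                        return True
    return False

print(pumpable('aaabbbaaa', 3, in_ababa))   # -> False: no split survives pumping
```

By contrast, the same search on a genuine CFL such as \(\{a^nb^n\}\) does find a pumpable split, as the lemma guarantees.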
5.6.2 Example 2: Language \(L = \{ww \mid w \in \{a, b\}^+\}\)#
The language of repeated strings consists of strings that are formed by repeating a substring at least twice. Formally, this can be represented as: \(L = \{ww \mid w \in \{a, b\}^+\}\), where \(w\) is any non-empty string over the alphabet \(\Sigma = \{a, b\}\), and \(ww\) denotes the concatenation of \(w\) with itself. Show that it is not a CFL:
Step 1: Assume it is context-free, then the pumping lemma applies with some pumping length \(p\).
Step 2: Choose \(s = a^pb^pa^pb^p\); clearly \(s \in L\), since it is the string \(w = a^pb^p\) repeated.
Step 3: According to the pumping lemma, \(s\) can be written as \(uvxyz\) where \(|vy| \geq 1 \text{ and } |vxy| \leq p\).
Step 4: Since \(|vxy| \leq p\), the window \(vxy\) spans at most \(p\) consecutive characters of \(s = a^pb^pa^pb^p\): it lies entirely in the first half, entirely in the second half, or straddles the midpoint.
Step 5: In each case, pumping breaks the \(ww\) structure:
Case 1: \(vxy\) in the first half: pumping with \(i = 2\) either makes the length odd (so the string is not in \(L\)) or shifts the midpoint so that the second half of \(uv^2xy^2z\) begins with \(b\) while the whole string begins with \(a\); the two halves differ in their first character.
Case 2: \(vxy\) in the second half: symmetrically, the first half of \(uv^2xy^2z\) ends with \(a\) while the whole string ends with \(b\).
Case 3: \(vxy\) straddles the midpoint: pumping down (\(i = 0\)) yields \(a^pb^ia^jb^p\) with \(i < p\) or \(j < p\), and such a string has the form \(ww\) only when \(i = j = p\).
Step 6: In every case some pumped string is not in \(L\), contradicting the pumping lemma
Conclusion: \(L = \{ww \mid w \in \{a, b\}^+\}\) is not context-free.
Concrete example with \(p = 3\) and \(s = aaabbbaaabbb\):
One possible decomposition: \(u=\wedge, v=aa, x=a, y=\wedge, z=bbbaaabbb\).
Pumping with i=2: \(uv^2xy^2z = aaaaabbbaaabbb \notin L\).
This is not a repeated string: its two halves, \(aaaaabb\) and \(baaabbb\), are not equal.
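The failed membership is easy to verify programmatically. A small sketch (helper name `is_ww` is illustrative) that computes the pumped string from the decomposition above and tests it:

```python
def is_ww(s):
    """True if s is some non-empty w repeated twice."""
    half = len(s) // 2
    return len(s) > 0 and len(s) % 2 == 0 and s[:half] == s[half:]

# Decomposition of s = aaabbbaaabbb from the example above
u, v, x, y, z = '', 'aa', 'a', '', 'bbbaaabbb'
assert u + v + x + y + z == 'aaabbbaaabbb'

pumped = u + v*2 + x + y*2 + z   # pumping with i = 2
print(pumped, is_ww(pumped))     # the pumped string is not of the form ww
```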
6. Practice Exercises#
6.1 Closure Properties#
Given the following context-free languages: \(L_1 = \{a^nb^n \mid n \geq 0\}\) and \(L_2 = \{b^nc^n \mid n \geq 0\}\)
Describe the following languages and determine whether they are context-free:
\(L_1 \cup L_2\)
\(L_1L_2\)
\(L_1^*\)
\((L_1 \cup L_2)^*\)
Modify the palindrome PDA to accept palindromes with a middle marker ‘#’
6.2 True or False#
Determine whether each statement is true or false. If false, provide a counterexample.
If \(L_1\) and \(L_2\) are context-free languages, then \(L_1 \cap L_2\) is a context-free language.
If \(L\) is a context-free language, then \(L^*\) is a context-free language.
If \(L\) is a context-free language, then \(L^R\) (the reverse of \(L\)) is a context-free language.
The intersection of a context-free language and a regular language is always context-free.
If \(L\) is context-free and \(R\) is regular, then \(L - R\) (set difference) is context-free.
6.3 Non-Closure Properties#
Given the following context-free languages: \(L_1 = \{a^nb^n \mid n \geq 0\}\) and \(L_2 = \{b^nc^n \mid n \geq 0\}\). Determine whether their intersection is context-free. If it is, provide a context-free grammar; if not, prove it using the pumping lemma.
6.4 CFG Union Builder#
Implement a complete system for combining two context-free grammars using the union operation. Your implementation should handle naming conflicts automatically and produce a valid CFG that generates the union of the two input languages.
Inputs: two CFGs representing:
G1: Language of balanced parentheses \(\{(^n )^n \mid n \ge 0 \}\)
G2: Language of equal brackets: \(\{[^n ]^n \mid n \ge 0 \}\)
# CFG for balanced parentheses
cfg_parens = CFG(
variables={'S'},
terminals={'(', ')'},
rules={'S': ['( S )', '∧']},
start='S'
)
# CFG for balanced brackets
cfg_brackets = CFG(
variables={'S'},
terminals={'[', ']'},
rules={'S': ['[ S ]', '∧']},
start='S'
)
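One possible shape of the renaming step, sketched over plain dict/set rules like the inputs above (the helper names are illustrative, not the Grammar class from Section 3.2). It assumes single-character variable names and uses '∧' for the empty string, matching the inputs:

```python
def union_rules(rules1, vars1, rules2, vars2, start='S'):
    """Sketch: rename the two rule sets apart with suffixes '1'/'2',
    then add a fresh start rule S -> S1 | S2."""
    def tag(rules, variables, suffix):
        tagged = {}
        for lhs, alternatives in rules.items():
            new_alts = []
            for alt in alternatives:
                for var in variables:          # assumes 1-char variable names
                    alt = alt.replace(var, var + suffix)
                new_alts.append(alt)
            tagged[lhs + suffix] = new_alts
        return tagged

    combined = {start: [start + '1', start + '2']}  # fresh start symbol
    combined.update(tag(rules1, vars1, '1'))
    combined.update(tag(rules2, vars2, '2'))
    return combined

print(union_rules({'S': ['( S )', '∧']}, {'S'},
                  {'S': ['[ S ]', '∧']}, {'S'}))
```

A full solution would also need to handle multi-character or overlapping variable names, which this naive `str.replace` renaming does not.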
6.5 Concatenation Constructor#
Implement a method to concatenate two CFLs. Given CFGs for \(L_1\) and \(L_2\), create a CFG for \(L_1L_2\).
Inputs: two CFGs representing:
\(L_1\): Language of \(\{a^nb^n \mid n \geq 0 \}\)
\(L_2\): Language of \(\{c^m \mid m \geq 0 \}\)
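As a sanity check on what the constructor should output, here is a hand-built result for these two inputs (the variable names S, S1, S2 and the '∧' convention for the empty string are assumptions carried over from earlier cells):

```python
# Hand-built target grammar for L1 L2 = { a^n b^n c^m : n, m >= 0 }
rules = {
    'S':  ['S1 S2'],         # new start: a string of L1 followed by one of L2
    'S1': ['a S1 b', '∧'],   # generates a^n b^n
    'S2': ['c S2', '∧'],     # generates c^m
}
print(rules['S'])
```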
6.6 Kleene Star Generator#
Implement a function that takes a CFG and produces a new CFG that generates the Kleene closure of the original language. Handle edge cases where the original language already contains ∧.
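The standard construction adds a fresh start symbol \(S_0\) with \(S_0 \to S_1S_0 \mid \wedge\), where \(S_1\) is the original start symbol renamed apart. A dict sketch for \(\{a^nb^n\}^*\) (names are illustrative, following the conventions of earlier cells):

```python
# Star of G1 = { a^n b^n }: fresh start S0 with S0 -> S1 S0 | ∧
star_rules = {
    'S0': ['S1 S0', '∧'],    # zero or more strings of the original language
    'S1': ['a S1 b', '∧'],   # the original grammar, renamed apart
}
print(star_rules['S0'])
```

If the original language already contains ∧, the construction still works; the fresh start merely makes ∧ derivable in more than one way.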
6.7 Pumping Lemma Checker#
Create a program that verifies whether a given string decomposition satisfies the pumping lemma conditions for CFLs. The program should check all three conditions and demonstrate pumping.
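A minimal starting point (function name and signature are illustrative): verify the two length conditions, confirm the five parts actually reassemble into \(s\), and demonstrate pumping for chosen exponents:

```python
def check_decomposition(s, u, v, x, y, z, p, in_L, exponents=(0, 1, 2)):
    """Check the pumping-lemma conditions for one decomposition of s."""
    if u + v + x + y + z != s:
        return False, 'not a decomposition of s'
    if len(v + x + y) > p:
        return False, '|vxy| > p'
    if len(v + y) < 1:
        return False, '|vy| < 1'
    for i in exponents:                       # demonstrate pumping
        if not in_L(u + v*i + x + y*i + z):
            return False, f'pumping with i={i} leaves L'
    return True, 'all conditions hold'

# Example: L = { a^k b^k }, s = aabb, split u='a', v='a', x='', y='b', z='b'
in_anbn = lambda t: len(t) % 2 == 0 and t == 'a'*(len(t)//2) + 'b'*(len(t)//2)
print(check_decomposition('aabb', 'a', 'a', '', 'b', 'b', 2, in_anbn))
```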
7. Further Reading#
“Introduction to the Theory of Computation” by Michael Sipser, Section 2.3
“Introduction to Computer Theory” by Daniel I.A. Cohen, Chapter 16, 17
“Automata Theory, Languages, and Computation” by Hopcroft, Motwani, and Ullman, Chapter 7