Decidability in Context-Free Grammars (CFGs)

13. Decidability in Context-Free Grammars (CFGs)#

Introduction#

This section explores several decidable problems related to Context-Free Grammars (CFGs). Decidability in formal language theory refers to whether an algorithm exists that can determine the answer to a specific problem for all possible inputs within a given class of languages. For context-free grammars (CFGs), several important problems are decidable, meaning there exist algorithms to solve them effectively. This chapter covers key decidable problems in the context of CFGs, including:

  • Emptiness Problem

  • Uselessness Problem

  • Membership Problem

  • Finiteness Problem

Code Walkthrough: CFG Container#

Concept#

This cell defines a CFG class that will be used throughout Sections 1 and 2 of this chapter (Emptiness and Uselessness). It represents the same mathematical 4-tuple \(G = (V, \Sigma, R, S)\) as in Chapter 12, but with a different Python interface.

⚠️ This is a third distinct CFG class. Chapter 12 introduced two CFG containers. This one differs from both:

| Feature | Ch 12 CFG | Ch 12 Grammar | This CFG |
| --- | --- | --- | --- |
| Variables field | variables | non_terminals | auto-computed |
| Terminals field | terminals | terminals | auto-computed |
| Rules field | rules (dict→list of strings) | rules (dict→list of strings) | productions (dict→list of lists) |
| Start field | start | start_symbol | start_symbol |

Do not pass this chapter’s CFG objects into Chapter 12’s functions, or vice versa — the field names and production formats are incompatible.

Notation mapping#

| Formal notation | Python field | Format |
| --- | --- | --- |
| \(V\) (variables) | self.non_terminals | set — auto-computed from productions.keys() |
| \(\Sigma\) (terminals) | self.terminals | set — auto-computed by scanning all RHS symbols |
| \(R\) (rules) | self.productions | dict: str → list[list[str]] |
| \(S\) (start) | self.start_symbol | str |

Key format difference: productions are stored as lists of lists, not lists of strings. The rule \(S \to (S)\) is stored as ['(', 'S', ')'] — each symbol is a separate list element. The empty string \(\varepsilon\) is stored as [] (empty list).

What to look for#

The __init__ method auto-computes non_terminals and terminals by scanning all production right-hand sides. Any symbol that is not a key in productions and is not "∧" is classified as a terminal. This means the caller never needs to declare terminals explicitly — but it also means a typo in a non-terminal name (e.g., writing 'a' where 'A' was intended) will silently become a terminal.
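This pitfall can be reproduced in isolation by applying the same inference rule to a standalone productions dict (a hypothetical grammar in which 'b' was intended to be the non-terminal 'B'):

```python
# Hypothetical grammar with a typo: 'b' was intended to be the non-terminal 'B'.
productions = {
    'S': [['a', 'b']],      # intended: ['a', 'B']
    'B': [['c']]
}

# The same inference rule as CFG.__init__ below: any RHS symbol that is not a
# production key (and not the lambda marker "∧") is treated as a terminal.
non_terminals = set(productions)
terminals = {symbol
             for alternatives in productions.values()
             for rhs in alternatives
             for symbol in rhs
             if symbol not in non_terminals and symbol != "∧"}

print(sorted(terminals))    # ['a', 'b', 'c']  — 'b' is silently a terminal
```

No error is raised; the typo only surfaces later, for example when 'B' turns out to be unreachable.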

Running outside Jupyter#

Save the class and the balanced_parens example to cfg_ch13.py and run:

python cfg_ch13.py

No Jupyter-specific calls are present.

# ── FORMAL GROUNDING ────────────────────────────────────────────────────────
# Represents G = (V, Σ, R, S) where:
#   V = non_terminals (auto-computed from production left-hand sides)
#   Σ = terminals (auto-computed by scanning all RHS symbols)
#   R = productions dict: non-terminal → list of alternative RHS symbol-lists
#   S = start_symbol
# Reference: Sipser §2.1 Definition 2.2; Cohen Chapter 14
# ────────────────────────────────────────────────────────────────────────────

class CFG:
    """
    CFG container for Chapter 13's decidability algorithms.
    Productions format: {'A': [['a', 'B'], ['c']], ...}
    Each alternative is a list of symbol strings; [] encodes ε.

    Note: this class is API-incompatible with Chapter 12's CFG and Grammar classes.
    """

    def __init__(self, productions, start_symbol):
        # R — stored as dict: lhs_string → list of rhs_symbol_lists
        # Using list-of-lists (not list-of-strings) lets each symbol be
        # inspected individually without string splitting.
        self.productions = productions

        # S ∈ V
        self.start_symbol = start_symbol

        # ── Auto-compute V = keys of productions dict ──────────────────────
        # V is exactly the set of symbols that appear on a left-hand side.
        self.non_terminals = set(productions.keys())

        # ── Auto-compute Σ by scanning all RHS symbols ────────────────────
        # Any RHS symbol that is not in V and is not the lambda marker "∧"
        # is classified as a terminal.  This means typos in non-terminal names
        # will silently become terminals — verify productions carefully.
        self.terminals = set()
        for lhs, rhs_list in productions.items():
            for rhs in rhs_list:           # rhs is one alternative (a list)
                for symbol in rhs:         # symbol is one token in that alternative
                    if symbol not in self.non_terminals and symbol != "∧":
                        self.terminals.add(symbol)

    def __str__(self):
        result = f"Start symbol: {self.start_symbol}\n"
        result += "Productions:\n"
        for lhs, rhs_list in self.productions.items():
            for rhs in rhs_list:
                if not rhs:               # [] encodes ε
                    result += f"  {lhs} -> ∧\n"
                else:
                    result += f"  {lhs} -> {' '.join(rhs)}\n"
        return result


# ── EXAMPLE: CFG for balanced parentheses {(ⁿ)ⁿ | n ≥ 0} ──────────────────
# Productions:
#   S → (S)    recursive case — adds one pair of parentheses
#   S → SS     concatenation of two balanced strings
#   S → ε      base case — empty string is balanced
# Note: [] encodes the empty production ε
balanced_parens = CFG(
    productions={
        'S': [['(', 'S', ')'], ['S', 'S'], []]
    },
    start_symbol='S'
)
print(balanced_parens)
Start symbol: S
Productions:
  S -> ( S )
  S -> S S
  S -> ∧

1. Emptiness Problem#

1.1 Definition#

The emptiness problem asks: “Is the language generated by a given CFG empty?” This problem is decidable by checking whether the start symbol \(S\) can derive any terminal string. A CFG generates an empty language if and only if its start symbol cannot derive any string of terminals. To determine this, we can use a marking (fixed-point) algorithm that finds all non-terminals that can derive terminal strings.

Note: A language that consists solely of the empty string \(\wedge\) is not an empty language.
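A quick standalone check, mirroring the base case of the derivable-set construction in 1.2 below, confirms this for the one-production grammar \(S \rightarrow \wedge\):

```python
# Grammar generating only the empty string: S → ε, encoded as [].
productions = {'S': [[]]}
non_terminals = set(productions)

# S directly derives a terminal string (the empty one): all() over [] is True.
derivable = {lhs for lhs, alternatives in productions.items()
             if any(all(s not in non_terminals for s in rhs)
                    for rhs in alternatives)}

print('S' in derivable)    # True: L(G) = {ε} is not empty
```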

1.2 Algorithm to Check Emptiness:#

Algorithm 1:

  • Initialize a set derivable with all non-terminals that have a production directly to terminals or \(\wedge\) (empty string).

  • Iterate: Add a non-terminal to derivable if it has a production where all symbols are either terminals or non-terminals already in derivable.

  • Repeat until no more non-terminals can be added. If the start symbol \(S\) is in derivable, the language is non-empty; otherwise, it’s empty.

Algorithm 2:

  • Identify directly derivable non-terminals:

    • Mark any nonterminal \(N\) that has a production of the form \(N \rightarrow t\), where \(t\) is a terminal or a string of terminals.

    • Replace occurrences of \(N\) in other productions with \(t\); this removes \(N\) altogether.

  • Repeat the previous step to remove derivable non-terminals until no new non-terminals can be eliminated:

    • If \(S\) is eliminated, the CFG is not empty.

    • If \(S\) is not eliminated, the CFG is empty.

Note: Algorithm 1 maintains the original CFG structure and simply tracks which non-terminals can derive terminal strings. Algorithm 2 modifies the original CFG by eliminating non-terminals, which can be useful in further simplifications but changes the grammar structure. Both approaches correctly solve the Emptiness problem, but Algorithm 2’s approach of modifying the grammar makes it less suitable for situations where you need to preserve the original grammar structure. However, the substitution method can be more intuitive for understanding why certain strings are derivable from the grammar.
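The code walkthrough in 1.4 implements Algorithm 1 only. As a complement, here is a minimal sketch of Algorithm 2, assuming the same dict-of-symbol-lists production format used throughout this chapter (the function name is illustrative):

```python
def is_language_empty_v2(productions, start_symbol):
    """Algorithm 2 sketch (destructive): repeatedly pick a non-terminal that
    directly derives a terminal string, substitute one such string for its
    occurrences, and eliminate it.  L(G) = ∅ iff S is never eliminated."""
    # Work on a copy so the caller's grammar is untouched.
    prods = {lhs: [list(rhs) for rhs in alts] for lhs, alts in productions.items()}
    non_terminals = set(prods)

    changed = True
    while changed:
        changed = False
        for lhs in list(prods):
            # Directly derivable: some alternative contains no non-terminals.
            terminal_strings = [rhs for rhs in prods[lhs]
                                if all(s not in non_terminals for s in rhs)]
            if not terminal_strings:
                continue
            if lhs == start_symbol:
                return False                    # S eliminated: language non-empty
            replacement = terminal_strings[0]
            del prods[lhs]                      # eliminate lhs ...
            non_terminals.discard(lhs)
            for other, alts in prods.items():   # ... and substitute everywhere
                prods[other] = [
                    [t for s in rhs for t in (replacement if s == lhs else [s])]
                    for rhs in alts]
            changed = True
            break
    return True                                 # S never eliminated: language empty

# The worked example from 1.3: S → AB; A → AX; B → BY | XX; X → a; Y → b
example = {'S': [['A', 'B']], 'A': [['A', 'X']],
           'B': [['B', 'Y'], ['X', 'X']], 'X': [['a']], 'Y': [['b']]}
print(is_language_empty_v2(example, 'S'))       # True: the language is empty
```

The substitution step mirrors Step 2 of the description above; unlike Algorithm 1, the working copy of the grammar is progressively destroyed.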

Comparison of Emptiness Problem Algorithms

| Feature | Algorithm 1 | Algorithm 2 |
| --- | --- | --- |
| Grammar Structure | Preserves the original CFG structure | Modifies the original CFG structure |
| Non-terminal Handling | Tracks which non-terminals can derive terminal strings | Eliminates non-terminals from the grammar |
| Effect on Grammar | Non-destructive - original grammar is unchanged | Destructive - transforms the grammar |
| Use Cases | Suitable when original grammar structure must be preserved | Useful for further grammar simplifications |
| Intuitiveness | More abstract - focuses on non-terminal properties | More intuitive for understanding string derivability |
| Output | Provides a yes/no answer about emptiness | Provides a simplified grammar (or determines emptiness) |

1.3 Examples#

Consider this given CFG with productions: \(S \rightarrow AB; A \rightarrow AX; B \rightarrow BY \mid XX; X \rightarrow a; Y \rightarrow b\)

Using Algorithm 1:

  • Step 1: initialize a set \(derivable = \{X, Y\}\) since they have productions directly to terminals.

  • Step 2: add \(B\) to the set due to the production \(B \rightarrow XX\) and \(X\) is already in the set: \(derivable = \{X, Y, B\}\)

  • Step 3: Repeat until no more non-terminals can be added: \(derivable = \{X, Y, B\}\). Since \(S\) is not in the set, the language is empty.
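The three steps above can be reproduced with a short standalone fixed-point computation (productions given as a plain dict of symbol-lists):

```python
# Grammar from the example: S → AB; A → AX; B → BY | XX; X → a; Y → b
productions = {
    'S': [['A', 'B']],
    'A': [['A', 'X']],
    'B': [['B', 'Y'], ['X', 'X']],
    'X': [['a']],
    'Y': [['b']],
}
V = set(productions)

# Fixed point: recompute the derivable set until it stops growing.
derivable = set()
while True:
    new = {lhs for lhs, alternatives in productions.items()
           if any(all(s in derivable or s not in V for s in rhs)
                  for rhs in alternatives)}
    if new == derivable:
        break
    derivable = new

print(sorted(derivable))   # ['B', 'X', 'Y']
print('S' in derivable)    # False: the language is empty
```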

Using Algorithm 2:

  • Step 1: Identify directly derivable non-terminals: \(X, Y\) since they have productions directly to terminals.

  • Step 2: Replace occurrences of \(X, Y\) in other productions with terminals and eliminate them: \(S \rightarrow AB; A \rightarrow Aa; B \rightarrow Bb \mid aa\)

  • Step 3: Repeat steps 1 and 2: we identify another directly derivable non-terminal, \(B\); replace and eliminate it: \(S \rightarrow Aaa; A \rightarrow Aa\).

  • Step 4: Since \(S\) is not eliminated, the CFG is empty.

1.4 Code Walkthrough: Emptiness Problem — Algorithm 1#

Concept#

Decision problem: Given a CFG \(G\), is \(L(G) = \emptyset\)?

Theorem (Emptiness Decidability). \(L(G) = \emptyset\) if and only if the start symbol \(S\) cannot derive any string of terminals. Algorithm 1 decides this by computing the derivable set — the set of non-terminals that can eventually produce a string of terminals — using a fixed-point iteration.

Formal algorithm ↔ Python mapping#

| Algorithm 1 step | Function phase | Python construct |
| --- | --- | --- |
| Initialise: \(\mathit{derivable} \leftarrow \{N \mid N \to t_1\ldots t_k \in R\}\) | Phase 1 | First for loop over cfg.productions |
| Iterate: add \(N\) if some production \(N \to \alpha\) has \(\alpha \subseteq \mathit{derivable} \cup \Sigma\) | Phase 2 | while changed loop |
| Accept if \(S \in \mathit{derivable}\) | Phase 3 | return cfg.start_symbol not in derivable |

Fixed-point guarantee. The while changed loop terminates because derivable only grows (symbols are never removed) and \(V\) is finite. At most \(|V|\) iterations are needed.

Dependency. This function requires a CFG object (defined in the cell above). Do not pass a raw dict — it will raise AttributeError on cfg.terminals.

Running outside Jupyter#

Append the function and the two test calls to cfg_ch13.py from the previous cell and run:

python cfg_ch13.py
# ── FORMAL GROUNDING ────────────────────────────────────────────────────────
# Implements Algorithm 1 for the Emptiness Decision Problem.
# Theorem: L(G) = ∅  iff  S ∉ derivable  after fixed-point iteration.
# Reference: Sipser §4.1; Cohen Chapter 18, Theorem 18.1
# Time complexity: O(|V| · |R|) — at most |V| iterations, each scanning all rules.
# ────────────────────────────────────────────────────────────────────────────

def is_language_empty(cfg):
    """
    Determine if L(G) = ∅ using Algorithm 1 (fixed-point / non-destructive).

    Args:
        cfg: A CFG object (from this chapter's CFG class — not Ch12's Grammar).

    Returns:
        bool: True if L(G) = ∅, False if L(G) ≠ ∅.
    """
    # ── Phase 1: Base case — non-terminals that directly derive terminals or ε ──
    # derivable = { N ∈ V | N → α ∈ R  and  α ⊆ Σ ∪ {ε} }
    derivable = set()
    for non_terminal, productions in cfg.productions.items():
        for production in productions:
            # production == []  means  N → ε  (directly derives empty string)
            # all terminals means  N → t₁t₂…tₖ  (directly derives terminal string)
            if not production or all(symbol in cfg.terminals for symbol in production):
                derivable.add(non_terminal)
                break   # one valid production is enough; stop checking this NT

    # ── Phase 2: Fixed-point iteration — extend derivable until stable ─────────
    # Invariant: derivable grows monotonically; loop terminates in ≤ |V| steps.
    changed = True
    while changed:
        changed = False
        for non_terminal, productions in cfg.productions.items():
            if non_terminal in derivable:
                continue  # already known to be derivable; skip

            for production in productions:
                # N can derive a terminal string if all its RHS symbols can
                # (either they are terminals or they are already in derivable)
                if all(symbol in derivable or symbol in cfg.terminals
                       for symbol in production):
                    derivable.add(non_terminal)
                    changed = True   # signal another pass is needed
                    break

    # ── Phase 3: Decision — is the start symbol derivable? ────────────────────
    # L(G) ≠ ∅  iff  S ∈ derivable
    return cfg.start_symbol not in derivable


# ── TEST CASES ────────────────────────────────────────────────────────────────

# balanced_parens: S → (S) | SS | ε  — S can derive ε directly, so L ≠ ∅
print(f"Is balanced_parens language empty? {is_language_empty(balanced_parens)}")

# Circular grammar: S → A, A → S — neither S nor A can reach a terminal string
empty_lang = CFG(
    productions={
        'S': [['A']],
        'A': [['S']]
    },
    start_symbol='S'
)
print(f"Is empty_lang language empty? {is_language_empty(empty_lang)}")
Is balanced_parens language empty? False
Is empty_lang language empty? True

2. Uselessness Problem#

2.1 Definition#

The uselessness problem asks: “Which non-terminals in a CFG are useless?” A non-terminal is useless if:

  • It cannot derive any string of terminals (non-generating), or

  • It cannot be reached from the start symbol.

2.2 Algorithm to Identify Useless Non-terminals#

This problem is decidable and it can be solved using the following algorithm:

  • Step 1: find all derivable non-terminals (as in the Emptiness problem) to determine the non-generating non-terminals.

  • Step 2: remove all productions that involve non-generating non-terminals.

  • Step 3: based on the revised CFG, find all reachable non-terminals starting from the start symbol.

  • A non-terminal is useful if and only if it is both derivable and reachable. Otherwise it is useless.

2.3 Examples#

Find all useless non-terminals in this given CFG with productions: \(S \rightarrow AB \mid CD; A \rightarrow aA \mid a; B \rightarrow bB \mid BD \mid b; C \rightarrow abC; D \rightarrow dD \mid d; E \rightarrow e\)

  • Step 1: find all derivable non-terminals to determine the non-generating non-terminals:

    • initialize a set \(derivable = \{A, B, D, E\}\) since they have productions directly to terminals.

    • add \(S\) to the set due to the production \(S \rightarrow AB\) and \(A, B\) are already in the set: \(derivable = \{A, B, D, E, S\}\)

    • Repeat until no more non-terminals can be added: \(derivable = \{A, B, D, E, S\}\).

    • So \(C\) is a non-generating non-terminal.

  • Step 2: remove all productions that involve \(C\), so the CFG becomes \(S \rightarrow AB; A \rightarrow aA \mid a; B \rightarrow bB \mid BD \mid b; D \rightarrow dD \mid d; E \rightarrow e\)

  • Step 3: find all reachable non-terminals:

    • Starting from \(S\), we can reach \(A, B\)

    • From \(B\) we can reach \(D\)

    • Since no productions contain \(E\) on the right-hand side, \(E\) is unreachable.

  • Conclusion: in the given CFG, \(C\) and \(E\) are useless non-terminals.
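The worked example can be verified with a standalone script combining the generating fixed point, the production cleanup, and the reachability pass (a sketch, independent of the CFG class used elsewhere in this chapter):

```python
# Grammar from 2.3: S → AB | CD; A → aA | a; B → bB | BD | b;
#                   C → abC; D → dD | d; E → e
productions = {
    'S': [['A', 'B'], ['C', 'D']],
    'A': [['a', 'A'], ['a']],
    'B': [['b', 'B'], ['B', 'D'], ['b']],
    'C': [['a', 'b', 'C']],
    'D': [['d', 'D'], ['d']],
    'E': [['e']],
}
V = set(productions)

# Step 1: generating non-terminals (same fixed point as the Emptiness problem)
generating = set()
while True:
    new = {lhs for lhs, alts in productions.items()
           if any(all(s in generating or s not in V for s in rhs) for rhs in alts)}
    if new == generating:
        break
    generating = new

# Step 2: drop every production mentioning a non-generating non-terminal
cleaned = {lhs: [rhs for rhs in alts
                 if all(s in generating or s not in V for s in rhs)]
           for lhs, alts in productions.items() if lhs in generating}

# Step 3: reachable non-terminals on the cleaned grammar, starting from S
reachable = {'S'}
while True:
    new = reachable | {s for lhs in reachable for rhs in cleaned.get(lhs, [])
                       for s in rhs if s in V}
    if new == reachable:
        break
    reachable = new

useless = V - (generating & reachable)
print(sorted(useless))     # ['C', 'E']
```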

2.4 Code Walkthrough: Uselessness Problem#

Concept#

Decision problem: Which non-terminals in a CFG are useless?

A non-terminal \(N \in V\) is useless if it is non-generating (cannot derive any terminal string) OR unreachable (cannot be reached from the start symbol \(S\)).

Algorithm (two-pass):

| Pass | Computes | How |
| --- | --- | --- |
| Pass 1 | generating set | Same fixed-point as is_language_empty |
| Pass 2 | reachable set | BFS/fixed-point from \(S\) through production RHS symbols |
| Decision | useless | \(N\) is useless iff \(N \notin \mathit{generating}\) OR \(N \notin \mathit{reachable}\) |

Order matters. The standard procedure computes generating first, removes every production that mentions a non-generating non-terminal, and only then computes reachable on the cleaned grammar. Computing reachability on the original grammar can mark a non-terminal reachable even though every path to it passes through a non-generating symbol. The code below computes both sets on the original grammar and takes the union of the failures; this gives the same answer except in that corner case, where a non-terminal reachable only through non-generating symbols may escape the useless set (the dead production that reaches it is still dropped by eliminate_useless_symbols, but the symbol's own rules survive).
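A small hypothetical grammar makes the ordering issue concrete: \(S \rightarrow AB \mid a;\ A \rightarrow a;\ B \rightarrow B\). Here \(B\) is non-generating, and \(A\) occurs only in the dead production \(S \rightarrow AB\), so the correct useless set is \(\{A, B\}\):

```python
productions = {'S': [['A', 'B'], ['a']], 'A': [['a']], 'B': [['B']]}
V = set(productions)

generating = {'S', 'A'}          # B → B never reaches a terminal string
# On the ORIGINAL grammar, A looks reachable via S → AB.  After removing
# productions that mention the non-generating B, the cleaned grammar is
# S → a; A → a, and A is no longer reachable from S.
cleaned = {lhs: [rhs for rhs in alts
                 if all(s in generating or s not in V for s in rhs)]
           for lhs, alts in productions.items() if lhs in generating}
reachable = {'S'}
while True:
    new = reachable | {s for lhs in reachable for rhs in cleaned.get(lhs, [])
                       for s in rhs if s in V}
    if new == reachable:
        break
    reachable = new

useless = V - (generating & reachable)
print(sorted(useless))           # ['A', 'B']
```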

Two functions in one cell#

This cell defines two functions:

  1. find_useless_symbols(cfg) — identifies useless non-terminals; returns a set.

  2. eliminate_useless_symbols(cfg) — calls find_useless_symbols, builds and returns a new CFG with those symbols removed.

eliminate_useless_symbols depends on find_useless_symbols being defined first in the same cell.

Running outside Jupyter#

Append both functions and the test block to cfg_ch13.py and run:

python cfg_ch13.py
# ── FORMAL GROUNDING ────────────────────────────────────────────────────────
# Implements the Uselessness Decision Procedure.
# A non-terminal N is useless iff N ∉ generating  OR  N ∉ reachable.
# Order of passes: generating first, then reachable on the original grammar.
# Reference: Sipser §2.1; Cohen Chapter 18; Hopcroft §7.1
# ────────────────────────────────────────────────────────────────────────────

def find_useless_symbols(cfg):
    """
    Identify all useless non-terminals in a CFG.

    Returns:
        set: non-terminals that are non-generating OR unreachable (or both).
    """
    # ── Pass 1: find GENERATING non-terminals ─────────────────────────────
    # generating = { N | N can derive some string in Σ* }
    # Same fixed-point algorithm as is_language_empty (Algorithm 1).
    generating = set()

    # Base: N → ε or N → t₁…tₖ  (all terminals or empty)
    for non_terminal, productions in cfg.productions.items():
        for production in productions:
            if not production or all(symbol in cfg.terminals for symbol in production):
                generating.add(non_terminal)
                break

    # Iterate: N is generating if it has a production whose RHS is all generating/terminal
    changed = True
    while changed:
        changed = False
        for non_terminal, productions in cfg.productions.items():
            if non_terminal in generating:
                continue
            for production in productions:
                if all(symbol in generating or symbol in cfg.terminals
                       for symbol in production):
                    generating.add(non_terminal)
                    changed = True
                    break

    # ── Pass 2: find REACHABLE non-terminals ──────────────────────────────
    # reachable = { N | S ⟹* αNβ for some α, β }
    # BFS/fixed-point starting from S: expand each reachable NT's productions.
    reachable = {cfg.start_symbol}
    changed = True
    while changed:
        changed = False
        new_reachable = set(reachable)          # copy so we can extend safely
        for non_terminal in reachable:
            if non_terminal not in cfg.productions:
                continue                         # defensive: only non-terminals (production keys) are added
            for production in cfg.productions[non_terminal]:
                for symbol in production:
                    # Only track non-terminal symbols that haven't been seen yet
                    if symbol in cfg.non_terminals and symbol not in new_reachable:
                        new_reachable.add(symbol)
                        changed = True
        reachable = new_reachable

    # ── Pass 3: classify as useless ───────────────────────────────────────
    # N is useless iff it fails at least one criterion.
    useless = set()
    for non_terminal in cfg.non_terminals:
        if non_terminal not in generating or non_terminal not in reachable:
            useless.add(non_terminal)

    return useless


def eliminate_useless_symbols(cfg):
    """
    Return a new CFG equivalent to `cfg` with all useless symbols removed.

    Depends on: find_useless_symbols (defined above in this cell).
    """
    # ── Phase 1: identify what to remove ─────────────────────────────────
    useless = find_useless_symbols(cfg)

    # ── Phase 2: rebuild productions, skipping useless LHS and RHS symbols ─
    new_productions = {}
    for non_terminal, productions in cfg.productions.items():
        if non_terminal in useless:
            continue                             # drop useless LHS entirely
        valid_prods = []
        for production in productions:
            # Drop any production that mentions a useless symbol on the RHS;
            # keeping it would preserve an unreachable or non-generating path.
            if not any(symbol in useless for symbol in production):
                valid_prods.append(production)
        if valid_prods:                          # only add NT if it still has rules
            new_productions[non_terminal] = valid_prods

    return CFG(new_productions, cfg.start_symbol)


# ── TEST ──────────────────────────────────────────────────────────────────────
# Grammar: S → A | B; A → aA | a; B → C; C → B; D → a
# D is unreachable (no production reaches D from S).
# B and C form a cycle with no terminal derivation → non-generating.
grammar_with_useless = CFG(
    productions={
        'S': [['A'], ['B']],
        'A': [['a', 'A'], ['a']],
        'B': [['C']],          # B → C → B → … never reaches a terminal
        'C': [['B']],
        'D': [['a']]           # D is never referenced in any production from S
    },
    start_symbol='S'
)

useless = find_useless_symbols(grammar_with_useless)
print(f"Useless non-terminals: {useless}")

simplified_grammar = eliminate_useless_symbols(grammar_with_useless)
print("Grammar after eliminating useless symbols:")
print(simplified_grammar)
Useless non-terminals: {'B', 'D', 'C'}
Grammar after eliminating useless symbols:
Start symbol: S
Productions:
  S -> A
  A -> a A
  A -> a

3. Membership Problem#

3.1 Definition#

The membership problem asks: “Does a given string belong to the language generated by a CFG?” This problem is decidable using two primary algorithms: the Cocke-Younger-Kasami (CYK) algorithm and Earley’s algorithm.

3.2 CYK Algorithm (Cocke-Younger-Kasami) for Membership Problem#

The CYK algorithm is a dynamic programming approach that determines whether a given string is generated by a CFG.

  • Works only for grammars in Chomsky Normal Form (CNF).

  • Uses bottom-up parsing to check derivations efficiently.

  • Runs in \(O(n^3)\) time, where \(n\) is the length of the input string.

Algorithm Description: The CYK algorithm uses a bottom-up parsing approach with a dynamic programming table: we find the answer step by step by looking at shorter parts of the string first. For a string of length \(n\), we create an \(n \times n\) table where each cell \([i,j]\) stores the non-terminals that can generate the substring of length \(j\) starting at position \(i\). We fill the table from the smallest substrings (length 1) up to the full string. For each substring, we check if it can be split into two smaller parts, where one part is generated by a non-terminal \(B\) and the other by a non-terminal \(C\), and if there is a rule \(A \rightarrow BC\). The algorithm can be described as follows:

  • Step 1: Initialize the Table

    • Let \(n\) be the length of the input string \(w = w_1w_2...w_n\).

    • Construct a table \(T[i,j]\) to store the set of non-terminals that can derive the substring \(w[i:j]\) (the substring of length \(j\) starting at position \(i\)).

    • \(i\) represents the starting position of a substring.

    • \(j\) represents the length of the substring.

    • Rows (\(i\)) correspond to different starting positions in the input string.

    • Columns (\(j\)) correspond to substring lengths from 1 to \(n\) (the length of \(w\)).

    • The table is triangular: \(T[i,j]\) is filled only when \(i + j - 1 \leq n\), i.e., only while the substring fits inside \(w\).

  • Step 2: Filling single characters as the base cases

    • For each terminal \(w_k\) in the input string, find all non-terminals \(A\) such that \(A \rightarrow w_k\) is a production.

    • Store these non-terminals in \(T[k,1]\).

  • Step 3: Fill the Table Using Recursion

    • For each substring length \(L\) from \(2\) to \(n\):

      • For each starting position \(i\) from \(1\) to \(n-L+1\):

        • Check every split of the length-\(L\) substring \(w[i:L]\) into two parts, i.e., every split length \(s\) from \(1\) to \(L-1\):

          • Let \(A \rightarrow BC\) be a rule in the CFG.

          • If \(B \in T[i,s]\) and \(C \in T[i+s,L-s]\), add \(A\) to \(T[i,L]\).

  • Step 4: Check Membership

    • If the start symbol \(S\) is in \(T[1,n]\), \(w \in L(G)\).

    • Otherwise, \(w \notin L(G)\).
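The four steps can be sketched as a self-contained function (the function name and the plain dict grammar format are illustrative and differ from the format used in Section 3.4; indices are 0-based internally):

```python
def cyk_member(grammar, start, w):
    """CYK sketch.  grammar maps each non-terminal to a list of RHS
    symbol-lists, already in CNF: A → [B, C] or A → [a] (or S → [])."""
    n = len(w)
    if n == 0:
        return any(rhs == [] for rhs in grammar.get(start, []))
    # T[i][j] holds the non-terminals deriving the substring of length j+1
    # starting at 0-based position i.
    T = [[set() for _ in range(n)] for _ in range(n)]
    # Step 2: base case, substrings of length 1
    for i, ch in enumerate(w):
        for lhs, rhs_list in grammar.items():
            if [ch] in rhs_list:
                T[i][0].add(lhs)
    # Step 3: substrings of length L = 2 .. n, every split length s
    for L in range(2, n + 1):
        for i in range(n - L + 1):
            for s in range(1, L):
                for lhs, rhs_list in grammar.items():
                    for rhs in rhs_list:
                        if (len(rhs) == 2 and rhs[0] in T[i][s - 1]
                                and rhs[1] in T[i + s][L - s - 1]):
                            T[i][L - 1].add(lhs)
    # Step 4: membership test
    return start in T[0][n - 1]

# Grammar of the example in 3.3: S → AB | BC; A → BA | a; B → CC | b; C → AB | a
cnf = {'S': [['A', 'B'], ['B', 'C']],
       'A': [['B', 'A'], ['a']],
       'B': [['C', 'C'], ['b']],
       'C': [['A', 'B'], ['a']]}
print(cyk_member(cnf, 'S', 'baaba'))    # True
```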

3.3 CYK Algorithm Examples#

Given a CFG in CNF: \(S \rightarrow AB \mid BC; A \rightarrow BA \mid a; B \rightarrow CC \mid b; C \rightarrow AB \mid a\), decide whether \(w = baaba\) can be generated by the given CFG.

  • Step 1 initialize the table: Given \(w = baaba\), a string of length \(n = 5\), we construct a \(5 \times 5\) table (at the start, all cells are empty):

| i/j | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 1 |  |  |  |  |  |
| 2 |  |  |  |  |  |
| 3 |  |  |  |  |  |
| 4 |  |  |  |  |  |
| 5 |  |  |  |  |  |

  • Step 2 filling the table for substrings of length 1: for each single character in \(w\), we find which non-terminals produce that terminal:

    • Since \(A \rightarrow a\) and \(C \rightarrow a\), \(A, C\) can generate \(a\);

    • \(B \rightarrow b\), \(B\) can generate \(b\):

| i/j | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 1 | b: {B} |  |  |  |  |
| 2 | a: {A,C} |  |  |  |  |
| 3 | a: {A,C} |  |  |  |  |
| 4 | b: {B} |  |  |  |  |
| 5 | a: {A,C} |  |  |  |  |

  • Step 3 filling the table for substrings of length 2: look at pairs of adjacent terminals (substrings of length 2), then apply productions of the form \(A \rightarrow BC\) to check if any non-terminals derive them.

    • \(w[1:2] = ba\): based on non-terminals producing length-1 substrings, we know that \(BA\) and \(BC\) can produce length-2 substring \(ba\), since \(S \rightarrow BC; A \rightarrow BA\), the set of non-terminals to produce length-2 substring \(ba\) is \(\{S,A\}\).

    • \(w[2:2] = aa\): based on non-terminals producing length-1 substrings, we know that \(AA, AC, CA, CC\) can produce length-2 substring \(aa\), since \(B \rightarrow CC\), the set of non-terminals to produce length-2 substring \(aa\) is \(\{B\}\).

    • \(w[3:2] = ab\): based on non-terminals producing length-1 substrings, we know that \(AB, CB\) can produce length-2 substring \(ab\); since \(S \rightarrow AB; C \rightarrow AB\), the set of non-terminals that produce length-2 substring \(ab\) is \(\{S,C\}\).

    • \(w[4:2] = ba\): this substring has been analyzed before, the set of non-terminals to produce length-2 substring \(ba\) is \(\{S,A\}\).

    • \(w[5:2]\): this substring does not exist, we use “-” to represent an invalid entry in the table.

| i/j | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 1 | b: {B} | ba: {S,A} |  |  |  |
| 2 | a: {A,C} | aa: {B} |  |  |  |
| 3 | a: {A,C} | ab: {S,C} |  |  |  |
| 4 | b: {B} | ba: {S,A} |  |  |  |
| 5 | a: {A,C} | - |  |  |  |

  • Step 4 filling the table for substrings of length 3:

    • \(w[1:3] = baa\): based on non-terminals producing length-1 and length-2 substrings, we know that \(baa\) can be split into either \(\{b, aa\}\) or \(\{ba, a\}\). Note that we split a substring into only two parts, not three or more, since a CFG in CNF has exactly two non-terminals on the right-hand side of any branching production rule.

      • for \(\{b, aa\}\), based on previous analysis, \(BB\) can produce it. However, since no non-terminal can produce \(BB\), no non-terminal can generate the substring \(baa\) in this split.

      • for \(\{ba, a\}\), based on previous analysis, \(SA, SC, AA, AC\) can produce it. However, since no non-terminal can produce any of \(SA, SC, AA, AC\), no non-terminal can generate the substring \(baa\) in this split.

      • So no non-terminal can produce \(baa\) in any split. We use an empty set \(\{\}\) to indicate that in the table.

    • \(w[2:3] = aab\): based on non-terminals producing length-1 and length-2 substrings, we know that \(aab\) can be split into either \(\{a, ab\}\) or \(\{aa, b\}\).

      • for \(\{a, ab\}\), based on previous analysis, \(AS, AC, CS, CC\) can produce it. Since \(B \rightarrow CC\), we add \(B\) to the set of non-terminals for \(aab\).

      • for \(\{aa, b\}\), based on previous analysis, \(BB\) can produce it. However, since no non-terminal can produce \(BB\), no non-terminal can generate the substring \(aab\) in this split.

      • So the set of non-terminals that produce \(aab\) is \(\{B\}\).

    • \(w[3:3] = aba\): based on non-terminals producing length-1 and length-2 substrings, we know that \(aba\) can be split into either \(\{a, ba\}\) or \(\{ab, a\}\).

      • for \(\{ab, a\}\), based on previous analysis, \(SA, SC, CA, CC\) can produce it. Since \(B \rightarrow CC\), we add \(B\) to the set of non-terminals for \(aba\).

      • for \(\{a, ba\}\), based on previous analysis, \(AS, AA, CS, CA\) can produce it. However, since no non-terminal can produce any of \(AS, AA, CS, CA\), no non-terminal can generate the substring \(aba\) in this split.

      • So the set of non-terminals that produce \(aba\) is \(\{B\}\).

    • \(w[4:3], w[5:3]\): they are not valid substrings; we use “-” in the table for these entries.

| i/j | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 1 | b: {B} | ba: {S,A} | baa: {} |  |  |
| 2 | a: {A,C} | aa: {B} | aab: {B} |  |  |
| 3 | a: {A,C} | ab: {S,C} | aba: {B} |  |  |
| 4 | b: {B} | ba: {S,A} | - |  |  |
| 5 | a: {A,C} | - | - |  |  |

  • Step 5 filling the table for substrings of length 4:

    • \(w[1:4] = baab\): we know that \(baab\) can be split into either \(\{b, aab\}\), \(\{ba, ab\}\), or \(\{baa, b\}\).

      • for \(\{b, aab\}\), based on previous analysis, \(BB\) can produce it. However, since no non-terminal can produce \(BB\), no non-terminal can generate the substring \(baab\) in this split.

      • for \(\{ba, ab\}\), based on previous analysis, \(SS, SC, AS, AC\) can produce it. However, since no non-terminal can produce any of \(SS, SC, AS, AC\), no non-terminal can generate the substring \(baab\) in this split.

      • for \(\{baa, b\}\), based on previous analysis, no non-terminal can produce \(baa\), so no non-terminal can generate the substring \(baab\) in this split.

      • So no non-terminal can produce \(baab\) in any split. We use an empty set \(\{\}\) to indicate that in the table.

    • \(w[2:4] = aaba\): we know that \(aaba\) can be split into either \(\{a, aba\}\), \(\{aa, ba\}\), or \(\{aab, a\}\).

      • for \(\{a, aba\}\), based on previous analysis, \(AB, CB\) can produce it. Since \(S, C\) can produce \(AB\), we add \(S, C\) to the set of non-terminals for \(aaba\).

      • for \(\{aa, ba\}\), based on previous analysis, \(BS, BA\) can produce it. Since \(A\) can produce \(BA\), we add \(A\) to the set of non-terminals for \(aaba\).

      • for \(\{aab, a\}\), based on previous analysis, \(BA, BC\) can produce it. Since \(A\) can produce \(BA\) and \(S\) can produce \(BC\), we add \(A, S\) to the set of non-terminals for \(aaba\).

      • So the set of non-terminals that produce \(aaba\) is \(\{S, C, A\}\).

    • \(w[3:4], w[4:4], w[5:4]\): they are not valid substrings; we use “-” in the table for these entries.

| i/j | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 1 | b: {B} | ba: {S,A} | baa: {} | baab: {} |  |
| 2 | a: {A,C} | aa: {B} | aab: {B} | aaba: {S,C,A} |  |
| 3 | a: {A,C} | ab: {S,C} | aba: {B} | - |  |
| 4 | b: {B} | ba: {S,A} | - | - |  |
| 5 | a: {A,C} | - | - | - |  |

  • Step 6 filling the table for substrings of length 5:

    • \(w[1:5] = baaba\): we know that it can be split into either \(\{b, aaba\}\), \(\{ba, aba\}\), \(\{baa, ba\}\), or \(\{baab, a\}\).

      • for \(\{b, aaba\}\), based on previous analysis, \(SB,CB,AB\) can product it. Since \(S,C\) can produce \(AB\), we add \(S,C\) to the set of non-terminals for \(baaba\).

      • for \(\{ba, aba\}\): based on the previous analysis, the candidate pairs are \(SB, AB\). Since \(S\) and \(C\) can produce \(AB\), we add \(S, C\) to the set of non-terminals for \(baaba\).

      • for \(\{baa, ba\}\), based on previous analysis, no non-terminal can produce \(baa\), so no non-terminal can generate the substring \(baaba\) in this split.

      • for \(\{baab, a\}\), based on previous analysis, no non-terminal can produce \(baab\), so no non-terminal can generate the substring \(baaba\) in this split.

      • So, taking the union over all splits, the set of non-terminals for \(baaba\) is \(\{S, A, C\}\).

    • \(w[2:5], w[3:5], w[4:5], w[5:5]\): these do not correspond to valid substrings, so we use “-” in the table for those entries.

| i\j | 1 | 2 | 3 | 4 | 5 |
|-----|-----|-----|-----|-----|-----|
| 1 | b: {B} | ba: {S,A} | baa: {} | baab: {} | baaba: {S,A,C} |
| 2 | a: {A,C} | aa: {B} | aab: {B} | aaba: {S,C,A} | - |
| 3 | a: {A,C} | ab: {S,C} | aba: {B} | - | - |
| 4 | b: {B} | ba: {S,A} | - | - | - |
| 5 | a: {A,C} | - | - | - | - |

  • Step 7: Since \(S\) appears in \(T[1:5]\), we conclude that \(baaba \in L(G)\).
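The hand-built table can be cross-checked mechanically. The sketch below is our own compact helper (not one of the chapter's cells); it stores the same grammar with binary rules as 2-tuples and terminal rules as bare strings, recomputes every cell bottom-up, and confirms that \(S\) lands in the cell for the full string:

```python
# Cross-check of the hand-built CYK table for the grammar
# S→AB|BC, A→BA|a, B→CC|b, C→AB|a (already in CNF).
rules = {
    'S': {('A', 'B'), ('B', 'C')},
    'A': {('B', 'A'), 'a'},
    'B': {('C', 'C'), 'b'},
    'C': {('A', 'B'), 'a'},
}

def cyk_table(word):
    """Return T where T[i][j] = set of non-terminals deriving word[i..j]."""
    n = len(word)
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):                 # length-1 substrings
        T[i][i] = {A for A, rhs in rules.items() if ch in rhs}
    for length in range(2, n + 1):                # longer substrings
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                 # every split point
                for A, rhs in rules.items():
                    for r in rhs:
                        if (isinstance(r, tuple)
                                and r[0] in T[i][k]
                                and r[1] in T[k + 1][j]):
                            T[i][j].add(A)
    return T

T = cyk_table("baaba")
print('S' in T[0][4])   # True: baaba ∈ L(G)
```

The diagonal cells `T[i][i]` reproduce the length-1 entries of the table, and `T[0][4]` (0-based indexing) is the cell for \(w[1:5]\).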

3.4 Code Walkthrough: CYK Algorithm (Membership Problem)#

Concept#

Decision problem: Given a CFG \(G\) and string \(w\), is \(w \in L(G)\)?

CYK (Cocke-Younger-Kasami) solves this in \(O(n^3)\) time using bottom-up dynamic programming, but requires the grammar to be in Chomsky Normal Form (CNF): every production is either \(A \to BC\) (two non-terminals) or \(A \to a\) (one terminal).

Table structure. Build an \(n \times n\) upper-triangular table \(T\) where \(T[i][j] = \{A \in V \mid A \xRightarrow{*} w_{i+1}\ldots w_{j+1}\}\) (using 0-based indexing; \(i\) = start position, \(j\) = end position).

| Table phase | Filled when | Rule applied |
|-----|-----|-----|
| Base (\(j = i\)) | Substring length 1 | \(A \to w_k\) |
| Recursive (\(j > i\)) | Substring length \(j-i+1\) | \(A \to BC\) with \(B \in T[i][k]\), \(C \in T[k+1][j]\) |

Accept if \(S \in T[0][n-1]\).

⚠️ Grammar format change. This function does not accept the CFG class from Sections 1–2. It takes a raw Python dict with this specific format:

grammar = {
    'S': {('A', 'B'), ('B', 'C')},  # binary rules as tuples in a set
    'A': {('B', 'A'), 'a'},          # terminal rule as a bare string in the set
    'B': {('C', 'C'), 'b'},
}

Values are sets containing either:

  • A 2-tuple ('B', 'C') of right-hand-side non-terminal names for a binary production \(A \to BC\)

  • A bare string 'a' for a terminal production \(A \to a\)

This format is different from the CFG class (list-of-lists) and from the Earley cell’s format (list-of-tuples). Do not mix them.
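To catch format mix-ups early, it can help to validate a raw dict before handing it to the CYK function. `check_cnf_dict` below is an illustrative helper of ours, not part of the chapter's cells:

```python
def check_cnf_dict(grammar):
    """Check the CYK cell's expected shape: dict mapping a non-terminal
    string to a set of 2-tuples (A→BC) and/or bare strings (A→a)."""
    for lhs, rhs_set in grammar.items():
        assert isinstance(lhs, str), f"LHS {lhs!r} must be a string"
        assert isinstance(rhs_set, set), f"rules for {lhs!r} must be a set"
        for rule in rhs_set:
            if isinstance(rule, tuple):              # binary rule A → B C
                assert len(rule) == 2, f"{lhs} -> {rule}: not binary"
                assert all(sym in grammar for sym in rule), \
                    f"{lhs} -> {rule}: binary RHS must be non-terminals"
            else:                                    # terminal rule A → a
                assert isinstance(rule, str) and rule not in grammar, \
                    f"{lhs} -> {rule!r}: terminal rule must be a bare terminal"
    return True

check_cnf_dict({
    'S': {('A', 'B'), ('B', 'C')},
    'A': {('B', 'A'), 'a'},
    'B': {('C', 'C'), 'b'},
    'C': {('A', 'B'), 'a'},
})   # passes silently and returns True
```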

Running outside Jupyter#

Save the function and the two test calls to cyk_ch13.py and run:

python cyk_ch13.py
# Note: defaultdict is imported here but not used by cyk_algorithm.
# It is retained for compatibility with any downstream cells that may use it.
from collections import defaultdict

# ── FORMAL GROUNDING ────────────────────────────────────────────────────────
# Implements the CYK membership algorithm for CNF grammars.
# Requires: grammar in Chomsky Normal Form (every rule is A→BC or A→a).
# Time complexity: O(n³ · |R|) where n = |w|, |R| = number of productions.
# Table: T[i][j] = { A | A ⟹* w[i..j] }  (0-indexed, i ≤ j)
# Reference: Sipser §4.1; Hopcroft §7.4; Cohen Chapter 18
#
# Grammar format for this function:
#   dict: non-terminal str → set of (tuple-for-binary or str-for-terminal rules)
#   e.g. {'S': {('A','B'), 'a'}, 'A': {('B','C'), 'b'}}
# ────────────────────────────────────────────────────────────────────────────

def cyk_algorithm(grammar, start_symbol, word):
    """
    Determine if `word` ∈ L(grammar) using the CYK algorithm.

    Args:
        grammar:      dict — CNF grammar (see format note above).
        start_symbol: str  — start non-terminal.
        word:         str  — input string to test.

    Returns:
        bool: True if word ∈ L(grammar), False otherwise.
    """
    n = len(word)

    # ── Special case: empty string ────────────────────────────────────────
    # If the grammar has a rule start_symbol → ε, the empty string is accepted.
    if n == 0:
        return "" in grammar.get(start_symbol, set())

    # ── Phase 1: Initialise the DP table ─────────────────────────────────
    # table[i][j] = set of non-terminals that derive w[i..j]
    # Allocate n×n; only upper triangle (j ≥ i) will be used.
    table = [[set() for _ in range(n)] for _ in range(n)]

    # ── Phase 2: Base case — fill diagonal (substrings of length 1) ──────
    # For each character w[i], find all A such that A → w[i] ∈ R.
    for i, symbol in enumerate(word):
        for lhs, rhs_set in grammar.items():
            if symbol in rhs_set:      # A → w[i] is a terminal rule (bare string)
                table[i][i].add(lhs)

    # ── Phase 3: Fill upper triangle (substrings of length 2 … n) ────────
    # length = current substring length being filled
    for length in range(2, n + 1):
        # i = start position of the substring (0-indexed)
        for i in range(n - length + 1):
            j = i + length - 1        # end position (inclusive)
            # k = split point: left part is w[i..k], right part is w[k+1..j]
            for k in range(i, j):
                # Check all binary productions A → B C
                for lhs, rhs_set in grammar.items():
                    for rule in rhs_set:
                        # Binary rule: rule is a 2-tuple (B, C)
                        if (isinstance(rule, tuple) and len(rule) == 2
                                and rule[0] in table[i][k]    # B derives w[i..k]
                                and rule[1] in table[k+1][j]):  # C derives w[k+1..j]
                            table[i][j].add(lhs)              # so A derives w[i..j]

    # ── Phase 4: Accept if start symbol derives the entire string ─────────
    return start_symbol in table[0][n - 1]


# ── EXAMPLE: grammar from Section 3.3 ────────────────────────────────────────
# S → AB | BC; A → BA | a; B → CC | b; C → AB | a
# (already in CNF — all rules are A→BC or A→a)
grammar = {
    'S': {('A', 'B'), ('B', 'C')},
    'A': {('B', 'A'), 'a'},
    'B': {('C', 'C'), 'b'},
    'C': {('A', 'B'), 'a'}
}
start_symbol = 'S'

word = "baab"
print(word + ": " + str(cyk_algorithm(grammar, start_symbol, word)))

word = "baaba"
print(word + ": " + str(cyk_algorithm(grammar, start_symbol, word)))
baab: False
baaba: True

3.5 Earley’s Algorithm for Membership Problem#

Earley’s algorithm is a chart parsing algorithm used for recognizing whether a given string belongs to a language defined by a CFG.

  • Works for any CFG, not just those in Chomsky Normal Form (CNF).

  • Uses a top-down, dynamic programming approach that processes the input string one symbol at a time while maintaining a chart of parsing states.

  • Time Complexity: worst case \(O(n^3)\); best case \(O(n)\) for simple grammars, where \(n\) is the length of the input string.

Algorithm Description: The fundamental unit in Earley’s algorithm is the Earley item, which has the form: \([A \rightarrow \alpha \cdot \beta, i]\), where

  • \(A \rightarrow \alpha\beta\) is a production rule in the grammar

  • \(\cdot\) represents the current parsing position in the production rule

  • \(\alpha\) is the part of the production that has been recognized

  • \(\beta\) is the part of the production that still needs to be recognized

  • \(i\) is the starting position in the input string where recognition of this rule began

The algorithm maintains \(n+1\) sets of Earley items for a string of length \(n\), denoted \(S_0, S_1, \cdots, S_n\). Each set \(S_j\) contains items that represent possible parsing states after processing \(j\) symbols of the input string. An Earley item \([A \rightarrow \alpha \cdot \beta, i]\) in \(S_j\) means that we have recognized \(\alpha\) starting from position \(i\) and ending at position \(j\). Earley’s algorithm uses three main operations to build these sets:

  • Predicting: for each item \([A \rightarrow \alpha \cdot B\beta, i]\) in \(S_j\) where \(B\) is a non-terminal, add \([B \rightarrow \cdot \gamma, j]\) to \(S_j\) for every production rule \(B \rightarrow \gamma\) in the grammar. This operation predicts what rules might be used next based on what we’re expecting to see.

  • Scanning: for each item \([A \rightarrow \alpha \cdot a\beta, i]\) in \(S_j\) where \(a\) is a terminal that matches the input at position \(j\), add \([A \rightarrow \alpha a\cdot \beta, i]\) to \(S_{j+1}\). This operation consumes an input symbol when it matches what we’re expecting.

  • Completing: for each item \([A \rightarrow \gamma \cdot, i]\) in \(S_j\) which is a completed rule, for each item \([B \rightarrow \alpha \cdot A\beta, k]\) in \(S_i\): add \([B \rightarrow \alpha A \cdot \beta, k]\) to \(S_j\). This operation updates all rules that were waiting for the rule we just completed.

The algorithm follows these steps:

  • Initialize \(S_0\) with the item \([S'\rightarrow \cdot S, 0]\), where \(S\) is the original start symbol and \(S'\) is a new start symbol.

  • For each position \(j\) from \(0\) to \(n\):

    • Apply prediction, scanning, and completion until no new items can be added to \(S_j\)

    • Move to position \(j+1\)

  • If the item \([S'\rightarrow S\cdot, 0]\) is in \(S_n\), the input string is accepted; otherwise, it’s rejected.

Example: Consider the CFG for balanced parentheses: \(S \rightarrow SS \mid (S) \mid ()\). Decide if the string \((())\) can be derived by the given CFG.

  • Step 1: Initialize \(S_0\): start with \([S' \rightarrow \cdot S, 0]\), add all \(S\) productions by the predicting operation:

    • \([S \rightarrow \cdot SS, 0]\)

    • \([S \rightarrow \cdot (S), 0]\)

    • \([S \rightarrow \cdot (), 0]\)

    • Finally \(S_0\) contains the following items: \([S' \rightarrow \cdot S, 0]\), \([S \rightarrow \cdot SS, 0]\), \([S \rightarrow \cdot (S), 0]\), \([S \rightarrow \cdot (), 0]\)

  • Step 2: Process \((\) at position \(0\):

    • Applying the scanning operation:

      • For item \([S \rightarrow \cdot(S), 0]\), add to \(S_1\): \([S \rightarrow (\cdot S), 0]\)

      • For item \([S \rightarrow \cdot(), 0]\), add to \(S_1\): \([S \rightarrow (\cdot), 0]\)

    • Applying the predicting operation: For item \([S \rightarrow (\cdot S), 0]\), add to \(S_1\):

      • \([S \rightarrow \cdot SS, 1]\)

      • \([S \rightarrow \cdot (S), 1]\)

      • \([S \rightarrow \cdot (), 1]\)

    • Finally \(S_1\) contains the following items: \([S \rightarrow (\cdot S), 0]\), \([S \rightarrow (\cdot), 0]\), \([S \rightarrow \cdot SS, 1]\), \([S \rightarrow \cdot (S), 1]\), \([S \rightarrow \cdot (), 1]\)

  • Step 3: Process \((\) at position \(1\)

    • Applying the scanning operation:

      • For item \([S \rightarrow \cdot (S), 1]\), add to \(S_2\): \([S \rightarrow (\cdot S), 1]\)

      • For item \([S \rightarrow \cdot(), 1]\), add to \(S_2\): \([S \rightarrow (\cdot), 1]\)

    • Applying the predicting operation: for item \([S \rightarrow (\cdot S), 1]\), add to \(S_2\):

      • \([S \rightarrow \cdot SS, 2]\)

      • \([S \rightarrow \cdot (S), 2]\)

      • \([S \rightarrow \cdot (), 2]\)

    • Finally \(S_2\) contains the following items: \([S \rightarrow (\cdot S), 1]\), \([S \rightarrow (\cdot), 1]\), \([S \rightarrow \cdot SS, 2]\), \([S \rightarrow \cdot (S), 2]\), \([S \rightarrow \cdot (), 2]\)

  • Step 4: Process \()\) at position \(2\)

    • Applying the scanning operation:

      • For item \([S \rightarrow (\cdot), 1]\), add to \(S_3\): \([S \rightarrow ()\cdot, 1]\)

    • Applying the completing operation: for the completed item \([S \rightarrow ()\cdot, 1]\), find items in \(S_1\) expecting \(S\):

      • for \([S \rightarrow (\cdot S), 0]\), add \([S \rightarrow (S \cdot), 0]\) to \(S_3\).

      • for \([S \rightarrow \cdot SS, 1]\), add \([S \rightarrow S \cdot S, 1]\) to \(S_3\).

    • Applying the predicting operation: \([S \rightarrow S \cdot S, 1]\) has its dot before \(S\), so add \([S \rightarrow \cdot SS, 3]\), \([S \rightarrow \cdot (S), 3]\), and \([S \rightarrow \cdot (), 3]\) to \(S_3\).

    • Finally \(S_3\) contains the following items: \([S \rightarrow ()\cdot, 1]\), \([S \rightarrow (S\cdot), 0]\), \([S \rightarrow S \cdot S, 1]\), \([S \rightarrow \cdot SS, 3]\), \([S \rightarrow \cdot (S), 3]\), \([S \rightarrow \cdot (), 3]\)

  • Step 5: Process \()\) at position \(3\)

    • Applying the scanning operation:

      • For item \([S \rightarrow (S\cdot), 0]\), add to \(S_4\): \([S \rightarrow (S)\cdot, 0]\)

    • Applying the completing operation: for the completed item \([S \rightarrow (S)\cdot, 0]\), find items in \(S_0\) expecting \(S\):

      • for \([S' \rightarrow \cdot S, 0]\), add \([S' \rightarrow S \cdot, 0]\) to \(S_4\).

      • for \([S \rightarrow \cdot SS, 0]\), add \([S \rightarrow S \cdot S, 0]\) to \(S_4\).

    • Applying the predicting operation: \([S \rightarrow S \cdot S, 0]\) has its dot before \(S\), so add \([S \rightarrow \cdot SS, 4]\), \([S \rightarrow \cdot (S), 4]\), and \([S \rightarrow \cdot (), 4]\) to \(S_4\).

    • Finally \(S_4\) contains the following items: \([S \rightarrow (S)\cdot, 0]\), \([S' \rightarrow S \cdot, 0]\), \([S \rightarrow S \cdot S, 0]\), \([S \rightarrow \cdot SS, 4]\), \([S \rightarrow \cdot (S), 4]\), \([S \rightarrow \cdot (), 4]\)

  • Step 6: since \(S_4\) contains \([S' \rightarrow S \cdot, 0]\), the string \((())\) is accepted.

3.6 Code Walkthrough: Earley’s Algorithm (Membership Problem)#

Concept#

Earley’s algorithm decides membership for any CFG (CNF not required). It runs in \(O(n^3)\) worst-case, \(O(n^2)\) for unambiguous grammars, and \(O(n)\) for simple deterministic grammars (e.g., LR(k)).

Earley item: \([A \to \alpha \bullet \beta,\; i]\)

| Component | Meaning |
|-----|-----|
| \(A \to \alpha\beta\) | A production rule |
| \(\bullet\) (dot) | How far parsing of this rule has progressed |
| \(\alpha\) | The part of the RHS already matched |
| \(\beta\) | The part still to be matched |
| \(i\) | Position in input where this rule’s match began |

Chart structure. Sets \(S_0, S_1, \ldots, S_n\) where \(S_j\) contains all Earley items that are consistent after consuming input positions \(0\) through \(j-1\).

Three operations fill the chart:

| Operation | Trigger | Action |
|-----|-----|-----|
| Predictor | Dot before non-terminal \(B\) | Add \([B \to \bullet\gamma, j]\) for every \(B\)-production |
| Scanner | Dot before terminal \(a = w_j\) | Advance dot → \(S_{j+1}\) |
| Completer | Dot at end of rule | Find all items waiting for this non-terminal; advance their dots |

Accept if \([S \to \alpha\bullet,\; 0] \in S_n\) for any complete \(S\)-rule.

⚠️ Grammar format change (again). This function uses a different raw dict format from the CYK cell:

grammar = {
    'S': [('S', 'S'), ('(', 'S', ')'), ('(', ')')],
}

Values are lists of tuples (every production is a tuple, including terminal-only ones). The CYK cell used sets containing bare strings for terminal rules — that format will not work here. This inconsistency is an artifact of the two cells being independently authored; a production helper function could unify them.
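As that note suggests, a small adapter can bridge the two formats. `cyk_to_earley` below is our own illustrative helper (not part of the chapter's cells); it turns the CYK cell's set-based grammar into this cell's list-of-tuples format:

```python
def cyk_to_earley(cyk_grammar):
    """Convert {'A': {('B', 'C'), 'a'}}   (CYK cell: set of tuples/strings)
    into       {'A': [('B', 'C'), ('a',)]} (Earley cell: list of tuples)."""
    earley = {}
    for lhs, rhs_set in cyk_grammar.items():
        # Binary rules are already tuples; wrap bare terminal strings in a
        # 1-tuple so that every production is a tuple, as this cell expects.
        prods = [rule if isinstance(rule, tuple) else (rule,)
                 for rule in rhs_set]
        earley[lhs] = sorted(prods)   # sort for a deterministic ordering
    return earley

print(cyk_to_earley({'B': {('C', 'C'), 'b'}}))
# {'B': [('C', 'C'), ('b',)]}
```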

Earley item representation in code#

The function represents each item as a 4-tuple: (lhs, production_tuple, dot_position, origin)

| Code field | Earley item component |
|-----|-----|
| state[0] = lhs | \(A\) |
| state[1] = production_tuple | \(\alpha\beta\) (full RHS as tuple) |
| state[2] = dot_position | index of \(\bullet\) in production_tuple |
| state[3] = origin | \(i\) (input start position) |

Running outside Jupyter#

Save the function and both test blocks to earley_ch13.py and run:

python earley_ch13.py
from collections import defaultdict  # imported for compatibility; not used internally

# ── FORMAL GROUNDING ────────────────────────────────────────────────────────
# Implements Earley's parsing algorithm for any CFG (CNF not required).
# Earley item: (lhs, production_tuple, dot_position, origin)
#   = [lhs → prod[:dot]•prod[dot:], origin]  in standard notation.
# Chart: chart[j] = set of items consistent after processing input[0..j-1].
# Time complexity: O(n³) general; O(n²) unambiguous; O(n) LR(k).
# Reference: Earley (1970); Sipser §4.1; Hopcroft §7.2
#
# Grammar format: dict: non-terminal → list of tuples
#   e.g. {'S': [('S','S'), ('(','S',')'), ('(',')')]}
# ────────────────────────────────────────────────────────────────────────────

def earley_parse(tokens, grammar, start_symbol):
    """
    Determine if `tokens` ∈ L(grammar) using Earley's algorithm.

    Args:
        tokens:       list of str — input string as individual token list.
        grammar:      dict — CFG in list-of-tuples format (see format note above).
        start_symbol: str — start non-terminal.

    Returns:
        bool: True if tokens ∈ L(grammar), False otherwise.
    """

    # ── Operation 1: PREDICTOR ────────────────────────────────────────────
    # When dot is before non-terminal B in chart[k], add all B-productions
    # as new items starting at position k.
    def predictor(state, k):
        non_terminal = state[1][state[2]]   # symbol immediately after dot
        if non_terminal in grammar:
            for production in grammar[non_terminal]:
                new_state = (non_terminal, tuple(production), 0, k)
                if new_state not in chart[k]:
                    chart[k].add(new_state)
                    current_agenda.append(new_state)

    # ── Operation 2: SCANNER ─────────────────────────────────────────────
    # When dot is before terminal a and a matches tokens[k], advance the dot
    # and add the new item to chart[k+1].
    def scanner(state, k):
        if k < len(tokens):
            expected = state[1][state[2]]   # terminal expected next
            current_token = tokens[k]
            # Allow 'num' to match any single digit token (for arithmetic grammar)
            matches = (expected == current_token or
                       (expected == 'num' and current_token.isdigit()))
            if matches:
                new_state = (state[0], state[1], state[2] + 1, state[3])
                if new_state not in chart[k + 1]:
                    chart[k + 1].add(new_state)
                    # Note: new_state goes to chart[k+1], processed in the next
                    # outer loop iteration — not added to current_agenda.

    # ── Operation 3: COMPLETER ────────────────────────────────────────────
    # When dot is at the end of a rule (state is complete), find all items in
    # chart[origin] that were waiting for this non-terminal; advance their dots.
    def completer(state, k):
        origin = state[3]               # where this rule's match began
        completed_nt = state[0]         # the non-terminal that was just completed
        for prev_state in list(chart[origin]):
            # prev_state is waiting for completed_nt if dot points to it
            if (prev_state[2] < len(prev_state[1]) and
                    prev_state[1][prev_state[2]] == completed_nt):
                new_state = (prev_state[0], prev_state[1],
                             prev_state[2] + 1, prev_state[3])
                if new_state not in chart[k]:
                    chart[k].add(new_state)
                    current_agenda.append(new_state)

    # ── Initialise chart ─────────────────────────────────────────────────
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]   # n+1 sets: S₀ … Sₙ

    # S₀: seed with all start-symbol productions, dot at beginning
    for production in grammar[start_symbol]:
        initial_state = (start_symbol, tuple(production), 0, 0)
        chart[0].add(initial_state)

    # ── Main loop: process each chart position ────────────────────────────
    for k in range(n + 1):
        # current_agenda: items in chart[k] that still need to be processed
        # Using a list (not the set) so we can append newly discovered items
        # during iteration — sets do not support safe concurrent modification.
        current_agenda = list(chart[k])
        i = 0
        while i < len(current_agenda):
            state = current_agenda[i]
            i += 1

            if state[2] < len(state[1]):    # dot NOT at end
                next_symbol = state[1][state[2]]
                if next_symbol in grammar:   # dot before non-terminal → predict
                    predictor(state, k)
                else:                        # dot before terminal → scan
                    scanner(state, k)
            else:                            # dot AT end → complete
                completer(state, k)

    # ── Accept condition ──────────────────────────────────────────────────
    # Accept if any complete start-symbol item spans the entire input [0..n]
    return any(
        state[0] == start_symbol and
        state[2] == len(state[1]) and   # dot at end (rule fully matched)
        state[3] == 0                    # match began at position 0
        for state in chart[n]
    )


# ── EXAMPLE 1: balanced parentheses ──────────────────────────────────────────
# Grammar does NOT need to be in CNF; this grammar has ternary rules (3 symbols RHS)
grammar_parens = {
    'S': [('S', 'S'), ('(', 'S', ')'), ('(', ')')]
}
tokens = list("(()())")
print(earley_parse(tokens, grammar_parens, 'S'))    # Expected: True

# ── EXAMPLE 2: arithmetic expressions ────────────────────────────────────────
# E → E+T | E-T | T;  T → T*F | T/F | F;  F → (E) | num
# 'num' matches any single digit via the special case in scanner().
grammar_arith = {
    'E': [('E', '+', 'T'), ('E', '-', 'T'), ('T',)],
    'T': [('T', '*', 'F'), ('T', '/', 'F'), ('F',)],
    'F': [('(', 'E', ')'), ('num',)]
}
tokens = list("2+3*4")
print(earley_parse(tokens, grammar_arith, 'E'))     # Expected: True
True
True

3.7 Comparison between Earley’s and CYK Algorithm#

| Feature | Earley’s Algorithm | CYK Algorithm |
|-----|-----|-----|
| Grammar Requirements | Any context-free grammar | Chomsky Normal Form only |
| Time Complexity | O(n³) worst case; O(n²) unambiguous; O(n) for simple CFGs such as LR(k) | O(n³) always |
| Space Complexity | O(n²) | O(n²) |
| Implementation Complexity | Difficult | Simple |
| Early Recognition | Can recognize prefixes early | Must process entire input |

4. Finiteness Problem#

4.1 Definition#

The finiteness problem asks: “Is the language generated by a CFG finite or infinite?” For context-free grammars, determining whether the generated language is finite or infinite is decidable, meaning there exists an algorithm that always terminates with the correct answer.

4.2 Algorithm to Check Finiteness#

Intuitively, a language is infinite exactly when some derivation can loop: a non-terminal derives a string that contains that same non-terminal again, producing at least one extra symbol each time around the cycle, so the loop can be repeated to build ever-longer words. Formally, a CFG generates an infinite language if and only if there exists at least one non-terminal \(A\) such that:

  • \(A\) is useful, meaning it is reachable from the start symbol and can derive a terminal string, and

  • \(A\) can derive a string that contains itself: \(A \stackrel{*}\Rightarrow xAy\) where \(\stackrel{*}\Rightarrow\) means deriving in zero or more steps, and

  • Either \(x\) or \(y\) (or both) is non-empty.

If such a non-terminal exists, we call it a self-embedded non-terminal, and its recursive derivation path can be used to “pump” the derivation, enabling the generation of arbitrarily long words by repeatedly applying the recursive rule. Based on this observation, we can design an algorithm as follows:

Algorithm Steps:#

  • Step 1: Identify useful non-terminals (as previously discussed in Section 2 of this chapter)

  • Step 2: Revise the CFG by eliminating all productions containing useless non-terminals

  • Step 3: Detect self-embedded nonterminals: for each remaining non-terminal \(A\):

    • Create a temporary marker symbol \(A_M\)

    • Replace all occurrences of \(A\) on the left side of productions with \(A_M\)

    • Initialize a set \(Reachable = \{A\}\)

    • Repeat until no changes to \(Reachable\):

      • For each non-terminal \(B\) in \(Reachable\), examine each of \(B\)’s productions (for \(B = A\), these are the marked productions \(A_M \rightarrow \gamma\)): add every non-terminal on the right side to \(Reachable\), recording any occurrence of \(A\) itself as \(A_M\)

    • If \(A_M\) is in \(Reachable\), then \(A\) is self-embedded

    • If a self-embedded nonterminal is found, exit the loop

  • Step 4: if any self-embedded non-terminal was found, the language is infinite, otherwise it is finite.
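The steps above can be sketched compactly in code. This is our own condensed version (the chapter’s full cell appears in Section 4.4); it assumes the grammar dict has already been pruned to useful non-terminals and, like the marker procedure above, it flags any recursion through \(A\):

```python
def self_embedded(A, grammar):
    """Marker test: A is self-embedded iff A reappears on the right side
    of a production reachable from A's own productions.
    grammar: dict non-terminal -> list of right-hand sides (lists of str)."""
    marker = (A, 'M')          # stands in for the renamed A_M
    reachable = {A}
    changed = True
    while changed:
        changed = False
        # Follow the productions of every non-terminal reached so far;
        # A's own productions play the role of the marked A_M → γ rules.
        for lhs in [nt for nt in reachable if nt in grammar]:
            for prod in grammar[lhs]:
                for sym in prod:
                    if sym in grammar:
                        item = marker if sym == A else sym
                        if item not in reachable:
                            reachable.add(item)
                            changed = True
    return marker in reachable  # A_M reached ⇒ A derives …A…

def is_infinite(grammar):
    # Steps 3–4: infinite iff some non-terminal is self-embedded.
    return any(self_embedded(A, grammar) for A in grammar)

g1 = {'S': [['a', 'A'], ['b', 'B']], 'A': [['a'], ['b']], 'B': [['a', 'b']]}
g2 = {'S': [['A', 'B'], ['a']], 'A': [['B', 'C']],
      'B': [['C', 'A']], 'C': [['b']]}
print(is_infinite(g1), is_infinite(g2))   # False True
```

These two dicts are the grammars of Examples 1 and 2 below, encoded in the same list-of-lists production format as this chapter’s CFG class.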

4.3 Examples#

Example 1: Consider the given CFG: \(\{S \rightarrow aA \mid bB; A \rightarrow a \mid b; B \rightarrow ab\}\). Decide if its language is finite or infinite.

  • Step 1: Identify useful non-terminals: all non-terminals \(\{S, A, B\}\) are useful. Each can be reached from S and can derive terminal strings.

  • Step 2: Remove useless non-terminals: no useless non-terminals to remove, the given CFG remains unchanged.

  • Step 3: Detect self-embedded non-terminals:

    • For non-terminal \(S\):

      • Replace \(S\) with \(S_M\) in left sides: \(S_M \rightarrow aA \mid bB\),

      • Initialize \(Reachable = \{S\}\)

      • Check productions:

        • \(S_M \rightarrow aA\): add \(A\): \(Reachable = \{S, A\}\)

        • \(S_M \rightarrow bB\): add \(B\): \(Reachable = \{S, A, B\}\)

      • No more changes to \(Reachable\)

      • \(S_M\) is not in \(Reachable\), so \(S\) is not self-embedded

    • For non-terminal \(A\):

      • Replace \(A\) with \(A_M\) in left sides: \(A_M \rightarrow a \mid b\)

      • Initialize \(Reachable = \{A\}\)

      • Check productions: no non-terminals on right sides

      • No changes: \(Reachable = \{A\}\)

      • \(A_M\) is not in \(Reachable\), so \(A\) is not self-embedded

    • For non-terminal \(B\):

      • Replace \(B\) with \(B_M\) in left sides: \(B_M \rightarrow ab\)

      • Initialize \(Reachable = \{B\}\)

      • Check productions: no non-terminals on right sides

      • No changes: \(Reachable = \{B\}\)

      • \(B_M\) is not in \(Reachable\), so \(B\) is not self-embedded

  • Step 4: Decision: No self-embedded non-terminals found, so the language is finite.
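With no self-embedded non-terminal, every derivation terminates after a few steps, so the whole language can even be listed by brute force. The enumerator below is our own sketch (with a safety depth bound); it confirms \(L(G) = \{aa, ab, bab\}\):

```python
grammar = {'S': [['a', 'A'], ['b', 'B']], 'A': [['a'], ['b']], 'B': [['a', 'b']]}

def enumerate_language(grammar, start, max_depth=10):
    """Expand the leftmost non-terminal exhaustively (depth-bounded)."""
    words = set()
    def expand(sent, depth):
        if depth > max_depth:
            return                      # safety bound; never hit here
        for idx, sym in enumerate(sent):
            if sym in grammar:          # leftmost non-terminal found
                for prod in grammar[sym]:
                    expand(sent[:idx] + prod + sent[idx + 1:], depth + 1)
                return
        words.add(''.join(sent))        # all symbols terminal: a word of L(G)
    expand([start], 0)
    return words

print(sorted(enumerate_language(grammar, 'S')))   # ['aa', 'ab', 'bab']
```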

Example 2: Consider the given CFG: \(\{S \rightarrow AB \mid a; A \rightarrow BC; B \rightarrow CA; C \rightarrow b\}\). Decide if its language is finite or infinite.

  • Step 1: Identify useful non-terminals: all non-terminals \(\{S, A, B, C\}\) are useful.

  • Step 2: Remove useless non-terminals: no useless non-terminals to remove, the given CFG remains unchanged.

  • Step 3: Detect self-embedded non-terminals:

    • For non-terminal \(S\):

      • Replace \(S\) with \(S_M\) in left sides: \(S_M \rightarrow AB \mid a\)

      • Initialize \(Reachable = \{S\}\)

      • Check productions:

        • \(S_M \rightarrow AB\): add \(A\) and \(B\) to \(Reachable = \{S, A, B\}\)

        • \(S_M \rightarrow a\): No non-terminals

        • \(A \rightarrow BC\): add \(C\) to \(Reachable = \{S, A, B, C\}\)

        • \(B \rightarrow CA\): Nothing new to add

      • No more changes to \(Reachable\)

      • \(S_M\) is not in \(Reachable\), so \(S\) is not self-embedded

    • For non-terminal \(A\):

      • Replace \(A\) with \(A_M\) in left sides: \(A_M \rightarrow BC\)

      • Initialize \(Reachable = \{A\}\)

      • Check productions:

        • \(A_M \rightarrow BC\): add \(B\) and \(C\) to \(Reachable = \{A, B, C\}\)

        • \(B \rightarrow CA\): \(B\) is in \(Reachable\) and the marked non-terminal \(A\) appears on its right side, so add \(A_M\) to \(Reachable = \{A, B, C, A_M\}\)

      • \(A_M\) is in \(Reachable\), so \(A\) is self-embedded

    • Terminate further processing upon detecting a self-embedded non-terminal.

  • Step 4: Decision: found self-embedded non-terminal \(A\), so the language is infinite.

4.4 Code Walkthrough: Finiteness Problem#

Concept#

Decision problem: Does a CFG \(G\) generate a finite or infinite language?

Theorem (Finiteness Decidability). \(L(G)\) is infinite if and only if there exists a self-embedded useful non-terminal — one that can derive a string containing itself with at least one non-empty string on at least one side: \(A \xRightarrow{*} xAy\) where \(x \neq \varepsilon\) or \(y \neq \varepsilon\).

Algorithm (three steps):

| Step | Action |
|-----|-----|
| 1 | Find useful non-terminals (both generating and reachable from \(S\)) |
| 2 | Prune the grammar to remove useless non-terminals |
| 3 | For each remaining non-terminal \(A\), test if \(A\) is self-embedded |

Four functions in one cell — dependency diagram#

find_useful_symbols(grammar, start_symbol)
        │
        ▼
is_self_embedded(non_terminal, filtered_grammar, useful_symbols)
        │
        ▼
is_language_finite(grammar, start_symbol)   ← top-level entry point
        │ (calls find_useful_symbols and is_self_embedded internally)
        ▼
[test suite in __main__ block]

⚠️ Definition order in the cell. is_language_finite is defined before find_useful_symbols and is_self_embedded in the source. This works in Python because function bodies are not executed at definition time — is_language_finite only calls the other two when invoked, by which point they are defined. When reading top-to-bottom, however, it helps to mentally re-order: read find_useful_symbols first, then is_self_embedded, then is_language_finite.

find_useful_symbols vs. find_useless_symbols (Cell 3)#

Both functions compute generating and reachable sets with identical fixed-point loops. They differ in what they return: Cell 3’s find_useless_symbols returns the useless set; this cell’s find_useful_symbols returns the useful (generating ∩ reachable) set. The duplication exists because Cell 6 operates on raw dicts, not CFG objects — the two functions are not interchangeable due to the different grammar input types.

Grammar format for this cell#

Raw dict: non-terminal str → list of list of str (same format as CFG.productions, but passed directly without a CFG wrapper). The empty list [] encodes \(\varepsilon\).

Running outside Jupyter#

Save all four functions and the test suite to finiteness_ch13.py. Remove the if __name__ == "__main__": guard if you want the tests to run automatically:

python finiteness_ch13.py
# ── FORMAL GROUNDING ────────────────────────────────────────────────────────
# Implements the Finiteness Decision Procedure for CFGs.
# Theorem: L(G) is infinite  iff  some useful non-terminal is self-embedded.
# A non-terminal A is self-embedded if A ⟹* xAy where x≠ε or y≠ε.
# Reference: Cohen Chapter 18; Hopcroft §7.4
#
# Grammar format: dict: non-terminal → list of list of str  (same as CFG.productions)
# ────────────────────────────────────────────────────────────────────────────


def is_language_finite(grammar, start_symbol):
    """
    Determine if L(grammar) is finite or infinite.

    Note: defined before find_useful_symbols and is_self_embedded in source order,
    but only calls them at runtime (by which point they are defined).

    Returns:
        str: "Finite" or "Infinite".
    """
    # ── Step 1: identify useful non-terminals (generating ∩ reachable) ───
    useful_symbols = find_useful_symbols(grammar, start_symbol)
    print(f"Useful symbols: {useful_symbols}")

    # ── Step 2: prune grammar — remove useless non-terminals ─────────────
    # Keep only useful LHS; keep only productions whose RHS is all useful or terminal.
    filtered_grammar = {}
    for nt, productions in grammar.items():
        if nt in useful_symbols:
            valid_prods = []
            for prod in productions:
                # A production is kept if every symbol in it is either a terminal
                # (not a key in grammar) or a useful non-terminal.
                if all(symbol in useful_symbols or symbol not in grammar
                       for symbol in prod):
                    valid_prods.append(prod)
            if valid_prods:
                filtered_grammar[nt] = valid_prods

    # ── Step 3: test each useful non-terminal for self-embedding ─────────
    for non_terminal in filtered_grammar:
        if is_self_embedded(non_terminal, filtered_grammar, useful_symbols):
            # Step 4a: self-embedded non-terminal found → L(G) is infinite
            return "Infinite"

    # Step 4b: no self-embedded non-terminal → L(G) is finite
    return "Finite"


def find_useful_symbols(grammar, start_symbol):
    """
    Return generating ∩ reachable — the set of useful non-terminals.

    Uses the same fixed-point algorithm as find_useless_symbols (Cell 3),
    but operates on a raw dict rather than a CFG object, and returns the
    USEFUL set instead of the USELESS set.
    """
    # ── Pass 1: GENERATING — which NTs can derive some terminal string? ──
    generating = set()

    # Base: NT with a production of all terminals (or empty production)
    for nt, productions in grammar.items():
        for prod in productions:
            # every symbol absent from grammar keys → all terminals
            # (an empty production, i.e. ε, passes vacuously)
            if all(symbol not in grammar for symbol in prod):
                generating.add(nt)
                break

    # Fixed-point: NT is generating if some production uses only generating/terminals
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt in generating:
                continue
            for prod in productions:
                if all(symbol in generating or symbol not in grammar
                       for symbol in prod):
                    generating.add(nt)
                    changed = True
                    break

    # ── Pass 2: REACHABLE — which NTs are reachable from start_symbol? ──
    reachable = {start_symbol}
    changed = True
    while changed:
        changed = False
        new_reachable = set(reachable)
        for nt in reachable:
            if nt not in grammar:
                continue
            for prod in grammar[nt]:
                for symbol in prod:
                    if symbol in grammar and symbol not in new_reachable:
                        new_reachable.add(symbol)
                        changed = True
        reachable = new_reachable

    # Useful = generating ∩ reachable
    return generating.intersection(reachable)


def is_self_embedded(non_terminal, grammar, useful_symbols):
    """
    Test whether `non_terminal` is self-embedded in `grammar`.

    Algorithm: grow the set of non-terminals whose derivations can produce
    a sentential form containing A (seeded with A itself). If one of A's own
    productions leads back into that set, a marker M_A is added; A is
    self-embedded iff M_A is present at the fixed point, meaning A appears
    in a derivation of itself.

    Args:
        non_terminal:  str — the NT to test (must be a key in grammar).
        grammar:       dict — pruned grammar (useless symbols already removed).
        useful_symbols: set — used to skip non-useful NTs during propagation.

    Returns:
        bool: True if non_terminal is self-embedded.
    """
    # Create a marker name: M_A distinguishes "A appears via its own rule"
    # from "A appears via another NT's rule".
    marker = f"M_{non_terminal}"

    # "reachable" here = NTs whose derivations can yield a form containing A.
    # Seeding with {A}: A trivially contains A after 0 derivation steps.
    reachable = {non_terminal}

    changed = True
    while changed:
        changed = False
        new_reachable = set(reachable)

        for left_side, productions in grammar.items():
            if left_side not in useful_symbols:
                continue    # skip useless NTs (already pruned, but be safe)

            # Does this NT's production contain any symbol currently in reachable?
            contains_reachable = any(
                symbol in reachable
                for prod in productions
                for symbol in prod
                if symbol in useful_symbols    # only count useful NTs
            )

            if contains_reachable:
                if left_side == non_terminal:
                    # A's own production leads back to something reachable from A:
                    # this is the self-embedding signal.
                    if marker not in new_reachable:
                        new_reachable.add(marker)
                        changed = True
                elif left_side not in new_reachable:
                    # Another NT leads to something reachable from A — add it
                    # so its productions can be explored in the next iteration.
                    new_reachable.add(left_side)
                    changed = True

        reachable = new_reachable

    # A is self-embedded iff the marker M_A was added to reachable
    return marker in reachable


# ── TEST SUITE ────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Example 1: {ac, ad, be} — finite; no recursive rules
    grammar1 = {
        'S': [['a', 'T'], ['b', 'U']],
        'T': [['c'], ['d']],
        'U': [['e']]
    }
    # Example 2: {b, ab, aab, …} — infinite; S → aS is self-embedded
    grammar2 = {
        'S': [['a', 'S'], ['b']]
    }
    # Example 3: A and B are mutually recursive (A→BC, B→CA) but non-generating,
    # so they are pruned; only S → a survives — finite
    grammar3 = {
        'S': [['A', 'B'], ['a']],
        'A': [['B', 'C']],
        'B': [['C', 'A']],
        'C': [['b']]
    }
    # Example 4: fully expanded tree; no cycles; finite
    grammar4 = {
        'S': [['A', 'B'], ['a']],
        'A': [['C', 'D']],
        'B': [['E', 'F']],
        'C': [['g']], 'D': [['h']], 'E': [['i']], 'F': [['j']]
    }
    # Example 5: {aⁿbⁿ | n ≥ 0} — infinite; S → aSb is self-embedded
    grammar5 = {
        'S': [['a', 'S', 'b'], []]   # [] = ε
    }
    # Example 6: Z is non-generating (Z → ZAbA never terminates) and is pruned;
    # A and X remain mutually recursive (A → Xb, X → bA) → infinite
    grammar6 = {
        'S': [['A', 'B', 'a'], ['b', 'A', 'Z'], ['b']],
        'A': [['X', 'b'], ['b', 'Z', 'A']],
        'B': [['b', 'A', 'A']],
        'X': [['a', 'Z', 'a'], ['b', 'A'], ['a', 'a', 'a']],
        'Z': [['Z', 'A', 'b', 'A']]
    }

    print("Results of finiteness checking:")
    print(f"Example 1: {is_language_finite(grammar1, 'S')} (should be Finite)")
    print(f"Example 2: {is_language_finite(grammar2, 'S')} (should be Infinite)")
    print(f"Example 3: {is_language_finite(grammar3, 'S')} (should be Finite)")
    print(f"Example 4: {is_language_finite(grammar4, 'S')} (should be Finite)")
    print(f"Example 5: {is_language_finite(grammar5, 'S')} (should be Infinite)")
    print(f"Example 6: {is_language_finite(grammar6, 'S')} (should be Infinite)")

    # Show which specific non-terminals are self-embedded in each grammar
    for i, grammar in enumerate([grammar1, grammar2, grammar3,
                                  grammar4, grammar5, grammar6], start=1):
        useful = find_useful_symbols(grammar, 'S')
        print(f"\nExample {i} self-embedded NTs (useful: {useful}):")
        found_any = False
        for nt in grammar:
            if nt in useful and is_self_embedded(nt, grammar, useful):
                print(f"  {nt} is self-embedded")
                found_any = True
        if not found_any:
            print("  (none)")
Results of finiteness checking:
Useful symbols: {'T', 'U', 'S'}
Example 1: Finite (should be Finite)
Useful symbols: {'S'}
Example 2: Infinite (should be Infinite)
Useful symbols: {'C', 'S'}
Example 3: Finite (should be Finite)
Useful symbols: {'A', 'D', 'C', 'E', 'F', 'S', 'B'}
Example 4: Finite (should be Finite)
Useful symbols: {'S'}
Example 5: Infinite (should be Infinite)
Useful symbols: {'X', 'B', 'A', 'S'}
Example 6: Infinite (should be Infinite)

Example 1 self-embedded NTs (useful: {'T', 'U', 'S'}):
  (none)

Example 2 self-embedded NTs (useful: {'S'}):
  S is self-embedded

Example 3 self-embedded NTs (useful: {'C', 'S'}):
  (none)

Example 4 self-embedded NTs (useful: {'A', 'D', 'C', 'E', 'F', 'S', 'B'}):
  (none)

Example 5 self-embedded NTs (useful: {'S'}):
  S is self-embedded

Example 6 self-embedded NTs (useful: {'X', 'B', 'A', 'S'}):
  A is self-embedded
  X is self-embedded

5. Practice Exercises#

5.1 Short Answer Questions#

  • Explain the difference between a non-generating non-terminal and an unreachable non-terminal.

  • Why does the CYK algorithm require the grammar to be in Chomsky Normal Form?

  • What defines a self-embedded non-terminal and why is it important for language finiteness?

  • How does Earley’s algorithm differ from the CYK algorithm in terms of grammar requirements?

5.2 Emptiness Problem#

Determine whether the language generated by each of the following grammars is empty:

  1. \(\{S \rightarrow AB, A \rightarrow BC, B \rightarrow CA, C \rightarrow DA, D \rightarrow AD\}\)

  2. \(\{S \rightarrow AB, A \rightarrow aA, B \rightarrow bB, C \rightarrow c\}\)

  3. \(\{S \rightarrow aS \mid bS, A \rightarrow SbS\}\)
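
To check your answers mechanically, recall from Section 1 that emptiness reduces to the generating-symbol fixed point: \(L(G)\) is empty iff the start symbol is not generating. Below is a minimal sketch in this chapter's dict-of-lists format (the helper name `is_empty` is ours, not from the chapter's code):

```python
def is_empty(grammar, start_symbol):
    """L(G) is empty iff start_symbol cannot derive any terminal string."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt in generating:
                continue
            # nt is generating if some production uses only terminals
            # (symbols that are not grammar keys) or known-generating NTs
            if any(all(s in generating or s not in grammar for s in prod)
                   for prod in productions):
                generating.add(nt)
                changed = True
    return start_symbol not in generating

# Exercise grammar 1, encoded in the chapter's list-of-lists format:
g1 = {'S': [['A', 'B']], 'A': [['B', 'C']], 'B': [['C', 'A']],
      'C': [['D', 'A']], 'D': [['A', 'D']]}
print(is_empty(g1, 'S'))
```

The same function answers grammars 2 and 3 once they are encoded the same way.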

5.3 Uselessness Problem#

Identify all useless non-terminals in each of the following grammars:

  1. \(\{S \rightarrow AB \mid CD, A \rightarrow aA \mid a, B \rightarrow bB \mid b, C \rightarrow cD, D \rightarrow d\}\)

  2. \(\{S \rightarrow AB \mid aC, A \rightarrow BC, B \rightarrow bB, C \rightarrow c, D \rightarrow dD\}\)

  3. \(\{S \rightarrow XY \mid a, X \rightarrow YZ, Y \rightarrow XZ, Z \rightarrow b\}\)
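
These can be checked with the usual two-pass sweep (generating, then reachable), reporting the two categories separately. A minimal sketch in the chapter's grammar format; the helper name `find_useless` and the dictionary layout are ours, chosen to match the sample output shown later in Exercise 5.8:

```python
def find_useless(grammar, start_symbol):
    """Categorize useless symbols: non-generating NTs, then generating NTs
    unreachable from the start symbol once non-generating ones are removed."""
    nts = set(grammar)
    # Pass 1: generating fixed point
    generating = set()
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt not in generating and any(
                    all(s in generating or s not in nts for s in prod)
                    for prod in productions):
                generating.add(nt)
                changed = True
    # Pass 2: reachability, using only productions whose symbols survive pass 1
    reachable = {start_symbol}
    stack = [start_symbol]
    while stack:
        nt = stack.pop()
        for prod in grammar.get(nt, []):
            if all(s in generating or s not in nts for s in prod):
                for s in prod:
                    if s in nts and s not in reachable:
                        reachable.add(s)
                        stack.append(s)
    non_generating = sorted(nts - generating)
    unreachable = sorted((nts & generating) - reachable)
    return {'non_generating': non_generating,
            'unreachable': unreachable,
            'useless': sorted(set(non_generating) | set(unreachable))}

# Exercise grammar 2, encoded in the chapter's format:
g2 = {'S': [['A', 'B'], ['a', 'C']], 'A': [['B', 'C']],
      'B': [['b', 'B']], 'C': [['c']], 'D': [['d', 'D']]}
print(find_useless(g2, 'S'))
```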

5.4 Membership Problem#

  1. Using the CYK algorithm, determine whether the string \(abbab\) is in the language of the following grammar: \(\{S \rightarrow AB \mid BB, A \rightarrow AB \mid a, B \rightarrow Ba \mid b\}\)

  2. Using Earley's algorithm, determine whether the string \(aab\) is in the language of the following grammar: \(\{S \rightarrow AA \mid B, A \rightarrow aA \mid a, B \rightarrow b\}\)
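
A machine check for both answers: the sketch below generalizes CYK slightly so a right-hand side may mix a terminal with a non-terminal (as in \(B \rightarrow Ba\)) and may be a unit rule (as in \(S \rightarrow B\)); ε-productions are not supported. The helper name `cyk_accepts` is ours. Note the second exercise asks you to trace Earley's algorithm by hand; this recognizer only confirms the membership verdict.

```python
def cyk_accepts(grammar, start_symbol, w):
    """CYK over spans w[i:j]. Supported RHS forms: [terminal], [NT],
    or [X, Y] where X and Y are each a terminal or a non-terminal."""
    nts = set(grammar)
    table = {}   # (i, j) -> set of NTs deriving w[i:j]

    def derives(sym, i, j):
        if sym in nts:
            return sym in table.get((i, j), set())
        return j - i == 1 and w[i] == sym    # terminal must match one character

    n = len(w)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for a, productions in grammar.items():
                for prod in productions:
                    if len(prod) == 1 and prod[0] not in nts:
                        if length == 1 and w[i] == prod[0]:
                            cell.add(a)          # A -> terminal
                    elif len(prod) == 2:
                        if any(derives(prod[0], i, k) and derives(prod[1], k, j)
                               for k in range(i + 1, j)):
                            cell.add(a)          # A -> X Y, split at k
            table[(i, j)] = cell
            # close the cell under unit rules A -> [B]
            changed = True
            while changed:
                changed = False
                for a, productions in grammar.items():
                    if a not in cell and any(
                            len(p) == 1 and p[0] in cell for p in productions):
                        cell.add(a)
                        changed = True
    return start_symbol in table.get((0, n), set())

g1 = {'S': [['A', 'B'], ['B', 'B']], 'A': [['A', 'B'], ['a']],
      'B': [['B', 'a'], ['b']]}
g2 = {'S': [['A', 'A'], ['B']], 'A': [['a', 'A'], ['a']], 'B': [['b']]}
print(cyk_accepts(g1, 'S', 'abbab'))
print(cyk_accepts(g2, 'S', 'aab'))
```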

5.5 Finiteness Problem#

Determine whether each of the following grammars generates a finite or an infinite language:

  1. \(\{S \rightarrow aS \mid A, A \rightarrow bA \mid b\}\)

  2. \(\{S \rightarrow aSb \mid \varepsilon, A \rightarrow aAb \mid ab\}\)

  3. \(\{S \rightarrow AB, A \rightarrow aA \mid a, B \rightarrow b\}\)
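
Once encoded in the list-of-lists format (with \(\varepsilon\) as `[]`), these can be fed to `is_language_finite` from the cell above. For a fully self-contained check, the compact variant below (the helper name `is_finite` is ours) prunes useless symbols and then reports infinite exactly when the remaining non-terminal dependency graph has a cycle, which is the same criterion the chapter's `is_self_embedded` test implements:

```python
def is_finite(grammar, start_symbol):
    """True iff L(G) is finite: prune useless symbols, then test whether
    any surviving non-terminal can derive a form containing itself."""
    nts = set(grammar)
    generating = set()
    changed = True
    while changed:                       # generating fixed point
        changed = False
        for nt, productions in grammar.items():
            if nt not in generating and any(
                    all(s in generating or s not in nts for s in prod)
                    for prod in productions):
                generating.add(nt)
                changed = True
    if start_symbol not in generating:
        return True                      # empty language is finite
    reachable = {start_symbol}
    stack = [start_symbol]
    while stack:                         # reachability over surviving productions
        nt = stack.pop()
        for prod in grammar.get(nt, []):
            if all(s in generating or s not in nts for s in prod):
                for s in prod:
                    if s in nts and s not in reachable:
                        reachable.add(s)
                        stack.append(s)
    useful = generating & reachable
    edges = {a: set() for a in useful}   # A -> B if B occurs in a kept production of A
    for a in useful:
        for prod in grammar[a]:
            if all(s in useful or s not in nts for s in prod):
                edges[a].update(s for s in prod if s in useful)
    for a in useful:                     # cycle test: a reachable from its successors?
        seen, stack = set(), list(edges[a])
        while stack:
            b = stack.pop()
            if b == a:
                return False             # a derives a form containing itself
            if b not in seen:
                seen.add(b)
                stack.extend(edges[b])
    return True

# Exercise grammar 1, encoded in the chapter's format:
ex1 = {'S': [['a', 'S'], ['A']], 'A': [['b', 'A'], ['b']]}
print(is_finite(ex1, 'S'))
```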

5.6 Integrated Problem#

For the following grammar:

\(\{S \rightarrow AB \mid aS, A \rightarrow aA \mid a, B \rightarrow bB \mid BA \mid b\}\)

determine:

  1. Is the language empty?

  2. Which non-terminals are useless?

  3. Is the string \(abba\) in the language?

  4. Is the language finite?

5.7 Algorithm Implementation#

Write pseudocode or Python code for an algorithm that takes a CFG and returns a simplified equivalent grammar with no useless symbols.

5.8 CFG Language Analyzer (Emptiness & Uselessness)#

Create a comprehensive CFG analyzer that can determine if a grammar generates an empty language and identify all useless symbols. Your program should also provide detailed analysis of why certain symbols are useless.

Requirements

  • Implement a CFGAnalyzer class with methods:

    • is_empty(): Returns True if language is empty

    • find_useless_symbols(): Returns a dictionary categorizing useless symbols

    • simplify_grammar(): Returns a new CFG with useless symbols removed

    • analysis_report(): Returns a detailed text report

  • Handle edge cases like grammars with no productions, self-referencing symbols, etc.

Sample Input/Output:

grammar = {
    'S': [['A', 'B'], ['C']],
    'A': [['a']],
    'B': [['b']],
    'C': [['D']],
    'D': [['C']],
    'E': [['e']]
}

analyzer = CFGAnalyzer(grammar, 'S')
print(analyzer.is_empty())  # False
print(analyzer.find_useless_symbols())
# {'non_generating': ['C', 'D'], 'unreachable': ['E'], 'useless': ['C', 'D', 'E']}

5.9 CYK Parser with Visualization (Membership Problem)#

Implement a CYK parser that not only determines membership but also provides a visual representation of the parsing table and can extract parse trees for successful parses.

Requirements

  • Convert grammar to CNF automatically

  • Implement CYK algorithm with detailed table tracking

  • Provide ASCII visualization of the parsing table

  • Extract and display parse trees for successful parses

  • Handle multiple parse trees (ambiguous grammars)

Sample Input/Output:

grammar = {
    'S': [['A', 'B'], ['B', 'C']],
    'A': [['B', 'A'], ['a']],
    'B': [['C', 'C'], ['b']],
    'C': [['A', 'B'], ['a']]
}

parser = CYKParser(grammar, 'S')
result = parser.parse("baaba")
print(f"String accepted: {result['accepted']}")
print(result['table_visualization'])
if result['accepted']:
    print("Parse Trees:")
    for tree in result['parse_trees']:
        print(tree)

5.10 Earley Parser with Error Recovery (Membership Problem)#

Implement an Earley parser that can handle any CFG and provides detailed parsing information including partial parses for invalid strings and suggestions for error correction.

Requirements:

  • Implement full Earley algorithm (predictor, scanner, completer)

  • Provide detailed step-by-step parsing trace

  • For invalid strings, show where parsing failed and suggest corrections

  • Handle ambiguous grammars by showing multiple parse paths

  • Implement prefix recognition (determine if string is a valid prefix)

5.11 Language Finiteness Detector with Cycle Analysis (Finiteness Problem)#

Create a sophisticated tool that determines if a CFG generates a finite or infinite language by detecting self-embedded non-terminals and analyzing dependency cycles. The tool should also estimate the maximum string length for finite languages.

Requirements

  • Implement the self-embedding detection algorithm

  • Visualize the dependency graph between non-terminals

  • For finite languages, estimate the maximum possible string length

  • Identify all cycles in the grammar and classify them as productive or non-productive

  • Provide recommendations for converting infinite languages to finite ones

6. Further Reading#

  • “Introduction to the Theory of Computation” by Michael Sipser, Section 4.1

  • “Introduction to Computer Theory” by Daniel I.A. Cohen, Chapter 18

  • “Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, and Ullman, Chapter 7