9. Fundamentals of Context-Free Grammar#
Context-Free Grammars (CFGs) are important in computer science and linguistics for defining language structure. Introduced by Noam Chomsky in the 1950s, they play a key role in programming languages, helping compilers check syntax and convert code into machine instructions. CFGs are also used in natural language processing (NLP) for grammar checking and language translation. Beyond that, they assist in bioinformatics by identifying patterns in DNA sequences and help structure documents in markup languages like XML and HTML. Their “context-free” nature means that rules apply independently of surrounding elements, making them both flexible and efficient for various applications.
This section covers the fundamental concepts of Context-Free Grammar (CFG), including formal definitions, examples, derivations, and parse trees.
1. Formal Definition of Context-Free Grammar#
A Context-Free Grammar G is formally defined as a 3-tuple \(G = (V, \Sigma, R)\), together with a designated start symbol \(S \in V\), where:
\(V\): a finite set of symbols called non-terminals (or variables), including the special start symbol S
\(\Sigma\): a finite set of symbols called terminals, disjoint from \(V\)
\(R\): a finite set of production rules
1.1 Key Properties:#
Non-terminals are typically denoted by uppercase letters (A, B, C, …). These are placeholder symbols that will be replaced using productions. You may think of them as “categories” or “types” of phrases in a language. For example, in a programming language, these might represent concepts like “expression”, “statement”, or “function”.
Terminals (\(\Sigma\)) are typically denoted by lowercase letters (a, b, c, …). These are the actual symbols that appear in the final strings. In a programming language, these would be keywords, operators, and literals. For example: numbers, parentheses, operators like ‘+’ and ‘/’, keywords like ‘if’ and ‘while’, etc.
Production rules are of the form A \(\rightarrow\) \(\alpha\) where
A is a single non-terminal
\(\alpha\) is a string of non-terminals and terminals, possibly empty (the empty string is written \(\wedge\)).
These productions tell us how to replace non-terminals with combinations of non-terminals and terminals.
They define the valid ways to generate strings in the language.
Multiple production rules for the same variable are often written as A → \(\alpha_1\) | \(\alpha_2\) | \(\alpha_3\). The vertical bar | means “or”: any of these productions can be used
The start symbol S must appear on the left side of at least one production rule. This is the non-terminal where we begin applying productions. Every valid string in the language must be derivable from this start symbol. You may think of it as the “root” or “entry point” of the grammar.
The “context-free” property means that a rule can be applied based only on the variable being replaced, without considering what surrounds it (its context). This makes CFGs powerful enough to define most programming language structures while still being easy to parse efficiently.
1.2 Code Example:#
The formal definition of a CFG, (V, Σ, R) with start symbol S, maps directly to code. The class below represents each component as a Python data structure and then validates that the grammar is well-formed according to the key properties from Section 1.1:
| Formal Component | Python Representation |
|---|---|
| V (non-terminals) | a Python `set` of symbol strings |
| Σ (terminals) | a Python `set` of symbol strings |
| R (production rules) | a `dict` mapping each non-terminal to a list of right-hand-side strings |
| S (start symbol) | a single string that must appear in V |
After reading the code, check:
Does the `validate()` method enforce all three key properties from Section 1.1?
What case does it not check? (Hint: does it verify that S appears on the left-hand side of at least one production?)
The example at the bottom instantiates the balanced-parentheses grammar G2 from Section 4.1 and validates it.
# ─────────────────────────────────────────────────────────────────────────────
# CFG Validator
# Represents a Context-Free Grammar G = (V, Σ, R) with start symbol S.
#
# Running outside Jupyter:
#   python cfg_validator.py
# ─────────────────────────────────────────────────────────────────────────────
class CFG:
    def __init__(self, variables, terminals, productions, start_symbol):
        self.V = set(variables)   # V: finite set of non-terminals
        self.Σ = set(terminals)   # Σ: finite set of terminal symbols
        self.R = productions      # R: dict {non-terminal: [list of RHS strings]}
        self.S = start_symbol     # S: the designated start symbol (must be in V)

    def validate(self):
        """
        Checks that this grammar satisfies the three requirements of a CFG:
          1. The start symbol S must be in V.
          2. Every left-hand side of a production must be a single non-terminal.
          3. Every symbol on any right-hand side must be in V ∪ Σ ∪ {∧}.
        Returns (True, message) if valid, (False, reason) otherwise.
        """
        # Rule 1: S ∈ V (start symbol must be declared as a non-terminal)
        if self.S not in self.V:
            return False, "Start symbol must be in variables"

        # Rules 2 & 3: check every production rule A → α
        for left, rights in self.R.items():
            # Rule 2: left-hand side must be a single non-terminal
            if left not in self.V:
                return False, f"Left-hand side '{left}' must be a variable (non-terminal)"
            # Rule 3: every symbol in each RHS string must be in V ∪ Σ ∪ {∧}
            for right in rights:
                for symbol in right:
                    if symbol not in self.V and symbol not in self.Σ and symbol != '∧':
                        return False, f"Invalid symbol '{symbol}' in production {left} → {right}"
        return True, "Grammar is valid"


# ── Example: validate the balanced-parentheses grammar G2 ─────────────────────
# G2 = ({S}, {(, )}, {S → (S) | SS | ∧})
# This is the same grammar defined formally in Section 4.1.
balanced_parens = CFG(
    variables={'S'},
    terminals={'(', ')'},
    productions={'S': ['(S)', 'SS', '∧']},   # S → (S) | SS | ∧
    start_symbol='S'
)

is_valid, message = balanced_parens.validate()
print(f"Grammar validation: {message}")
Grammar validation: Grammar is valid
2. An Example of Context-Free Grammars#
2.1 Example: Grammar for the language a*b*#
A Context-Free Grammar G1 is defined as follows:
\(V\) (non-terminals): \(\{S, A, B\}\)
\(\Sigma\) (terminals): \(\{a, b\}\)
\(R\) (production rules): \(\{S \rightarrow AB,\; A \rightarrow aA \mid \wedge,\; B \rightarrow bB \mid \wedge\}\)
To understand the production rules:
The start symbol S expands into A followed by B. This enforces that all ‘a’ symbols, if any, must come before any ‘b’ symbols.
A can either produce an ‘a’ followed by another ‘A’ (recursion) or terminate with \(\wedge\)(empty string). This means A generates zero or more ‘a’ characters (a*).
B can either produce a ‘b’ followed by another ‘B’ (recursion) or terminate with \(\wedge\)(empty string). This means B generates zero or more ‘b’ characters (b*).
The language defined by this grammar is: \(L(G1)=\{a^nb^m∣n,m≥0\}\). This represents any number of ‘a’s followed by any number of ‘b’s, including the empty string. This grammar generates strings like: \(\wedge\), a, b, aa, bb, ab, aab, abbb, aaabb, …
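Since every string of L(G1) is just a block of ‘a’s followed by a block of ‘b’s, membership can be checked with a single left-to-right scan. The sketch below is illustrative (the function name `in_L_G1` is our own, not part of the formal material):

```python
# Membership test for L(G1) = { a^n b^m | n, m >= 0 }.
# A single left-to-right scan suffices: once we have seen a 'b',
# no further 'a' may appear.
def in_L_G1(s):
    seen_b = False
    for ch in s:
        if ch == 'a':
            if seen_b:          # an 'a' after a 'b' breaks the a*b* shape
                return False
        elif ch == 'b':
            seen_b = True
        else:                   # symbol outside Σ = {a, b}
            return False
    return True

print(in_L_G1('aabb'))   # True
print(in_L_G1('aba'))    # False
print(in_L_G1(''))       # True (the empty string ∧ is in the language)
```

The fact that such a scan exists reflects that L(G1) is in fact a regular language; CFGs can describe it, but do not need their full power here.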
3. Derivations#
A derivation is a sequence of strings obtained by applying production rules, starting from the start symbol. There are two main types of derivations:
3.1 Leftmost Derivation#
In a leftmost derivation, at each step we replace the leftmost non-terminal with one of its productions.
Example using G1 to derive ‘aabb’:
S ⇒ AB [using S → AB]
⇒ aAB [using A → aA]
⇒ aaAB [using A → aA]
⇒ aaB [using A → ∧]
⇒ aabB [using B → bB]
⇒ aabbB [using B → bB]
⇒ aabb [using B → ∧]
3.2 Rightmost Derivation#
In a rightmost derivation, at each step we replace the rightmost non-terminal with one of its productions.
Example using G1 to derive ‘aabb’ (different steps from the leftmost derivation, but the same result):
S ⇒ AB [using S → AB]
⇒ AbB [using B → bB]
⇒ AbbB [using B → bB]
⇒ Abb [using B → ∧]
⇒ aAbb [using A → aA]
⇒ aaAbb [using A → aA]
⇒ aabb [using A → ∧]
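Both derivations above can be replayed mechanically by rewriting either the leftmost or the rightmost non-terminal at each step. The following sketch hard-codes G1 as a Python dict; the helper `derive` and its argument format are our own illustrative choices:

```python
# Sketch: replay a derivation by repeatedly rewriting either the leftmost or
# the rightmost non-terminal. The grammar G1 is hard-coded; the caller supplies
# which production to apply at each step.
G1 = {'S': ['AB'], 'A': ['aA', ''], 'B': ['bB', '']}   # '' plays the role of ∧
VARIABLES = set(G1)

def derive(choices, leftmost=True):
    """choices: list of (variable, rhs) pairs, applied in order."""
    current = 'S'
    trace = [current]
    for var, rhs in choices:
        # find the leftmost (or rightmost) occurrence of a non-terminal
        positions = [i for i, ch in enumerate(current) if ch in VARIABLES]
        i = positions[0] if leftmost else positions[-1]
        assert current[i] == var and rhs in G1[var]
        current = current[:i] + rhs + current[i + 1:]
        trace.append(current if current else '∧')
    return trace

# Leftmost derivation of 'aabb' (Section 3.1):
steps = [('S', 'AB'), ('A', 'aA'), ('A', 'aA'), ('A', ''),
         ('B', 'bB'), ('B', 'bB'), ('B', '')]
print(' ⇒ '.join(derive(steps, leftmost=True)))
```

Passing `leftmost=False` with an appropriately reordered list of choices replays the rightmost derivation from Section 3.2 instead.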
Different types of derivation help maintain consistency and structure when parsing and analyzing languages. Leftmost derivation is commonly used in top-down parsing, such as in recursive descent parsers. Rightmost derivation (applied in reverse) underlies bottom-up parsing, such as in LR parsers, which are widely used in compilers for common programming languages.
4. More Examples of Context-Free Grammars#
4.1 Example: Grammar for balanced parentheses#
A Context-Free Grammar G2 is defined as follows:
\(V\): {\(S\)}
\(\Sigma\): {(, )}
\(R\): \(\{S \rightarrow (S) \mid SS \mid \wedge\}\)
Components breakdown:
| Components | Explanation |
|---|---|
| V = {S} | Single non-terminal S |
| Σ = {(, )} | Two terminals: open and close parentheses |
| R = {S → (S) \| SS \| ∧} | Three production rules |
This grammar generates strings like: \(\wedge\), (), (()), ()(), ((()))
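Membership in L(G2) can also be tested directly: a parenthesis string is balanced exactly when a running counter of ‘(’ minus ‘)’ never drops below zero and ends at zero. A minimal sketch, assuming only ‘(’ and ‘)’ appear in the input (the function name is our own):

```python
# Direct membership test for L(G2), the balanced-parentheses language.
# Equivalent to derivability from S → (S) | SS | ∧, but implemented with a
# counter rather than with the grammar itself.
def is_balanced(s):
    depth = 0
    for ch in s:
        depth += 1 if ch == '(' else -1
        if depth < 0:           # a ')' with no matching '(' so far
            return False
    return depth == 0           # every '(' must have been closed

for s in ['', '()', '(())', '()()', '((()))', '(()', ')(']:
    print(repr(s), is_balanced(s))
```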
4.2 Example: Grammar for arithmetic expressions#
A Context-Free Grammar G3 is defined as follows:
V: {\(S, E, T, F, N, D\)}
\(\Sigma\): {\(+, -, *, /, (, ), 0, 1, 2, 3, 4, 5, 6, 7, 8, 9\)}
R =
{
\(S → E\)
\(E → E + T | E - T | T \)
\(T → T * F | T / F | F \)
\(F → (E) | N | -F\)
\(N → ND | D\)
\(D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9\) }
This grammar defines arithmetic expressions with numbers, the basic operators (+, -, *, /), and parentheses.
Components breakdown:
| Components | Explanation |
|---|---|
| V = {S, E, T, F, N, D} | Variables for Start, Expression, Term, Factor, Number, Digit |
| Σ = {0,1,2,3,4,5,6,7,8,9, +, -, *, /, (, )} | Terminals: digits, operators, parentheses |
| R: Production Rules | |
| S → E | Start the production |
| E → E + T \| E - T \| T | Addition/Subtraction/Term |
| T → T * F \| T / F \| F | Multiplication/Division/Factor |
| F → (E) \| N \| -F | Number, Parentheses, Negation |
| N → ND \| D | Multi-digit Numbers |
| D → 0 \| 1 \| 2 \| 3 \| 4 \| 5 \| 6 \| 7 \| 8 \| 9 | Single Digits |
This grammar generates expressions like:
2
3 + 7
4 * 5
(8 + 2) / (3 - 1)
(-2 + 5) * 3
1000 - 245
Sample leftmost derivation for ‘5 + 7 * 2’ (each ⇒ applies exactly one production to the leftmost non-terminal):
S ⇒ E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ N + T ⇒ D + T ⇒ 5 + T ⇒ 5 + T * F ⇒ 5 + F * F ⇒ 5 + N * F ⇒ 5 + D * F ⇒ 5 + 7 * F ⇒ 5 + 7 * N ⇒ 5 + 7 * D ⇒ 5 + 7 * 2
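The layering of G3 (E above T above F) is precisely what encodes operator precedence, and it translates naturally into a recursive-descent evaluator with one function per non-terminal. The sketch below simplifies N to single-token numbers and rewrites the left-recursive rules as loops; all names are our own illustrative choices:

```python
# Sketch: a recursive-descent evaluator whose functions mirror the levels of
# G3. Precedence falls out of the grammar's shape: '*' and '/' live at the
# T level, below '+' and '-' at the E level.
def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def E():                       # E → E + T | E - T | T (left recursion as a loop)
        value = T()
        while peek() in ('+', '-'):
            op = eat()
            value = value + T() if op == '+' else value - T()
        return value

    def T():                       # T → T * F | T / F | F (left recursion as a loop)
        value = F()
        while peek() in ('*', '/'):
            op = eat()
            value = value * F() if op == '*' else value / F()
        return value

    def F():                       # F → (E) | N | -F
        if peek() == '(':
            eat()                  # consume '('
            value = E()
            eat()                  # consume ')'
            return value
        if peek() == '-':
            eat()
            return -F()
        return int(eat())          # N simplified: a number is one token

    return E()

print(evaluate(['5', '+', '7', '*', '2']))            # * binds tighter than +
print(evaluate(['(', '8', '+', '2', ')', '/', '2']))  # parentheses override precedence
```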
4.3 Example: Grammar for simple English sentences#
A Context-Free Grammar G4 is defined as follows:
V: {S, NP, VP, Det, N, V}
Σ: {“a”, “cat”, “dog”, “boy”, “girl”, “candy”, “sees”, “eats”, “runs”}
R:
\(S → NP VP\) # A sentence consists of a noun phrase and a verb phrase
\(NP → Det N\) # A noun phrase consists of a determiner and a noun
\(VP → V NP\) # A verb phrase consists of a verb followed by a noun phrase
\(VP → V\) # A verb phrase can also be just a verb
\(Det → a\) # Determiners
\(N → cat | dog | boy | girl | candy\) # Nouns
\(V → sees | eats | runs\) # Verbs
Components breakdown:
| Components | Explanation |
|---|---|
| V = {S, NP, VP, Det, N, V} | Non-terminals for Sentence, Noun Phrase, Verb Phrase, Determiner, Noun, and Verb |
| Σ = {“a”, “cat”, “dog”, “boy”, “girl”, “candy”, “sees”, “eats”, “runs”} | Terminals: determiners, nouns, and verbs |
| R: Production Rules | |
| S → NP VP | A sentence consists of a noun phrase followed by a verb phrase |
| NP → Det N | A noun phrase consists of a determiner followed by a noun |
| VP → V NP | A verb phrase consists of a verb followed by a noun phrase |
| VP → V | A verb phrase can be just a verb |
| Det → “a” | Determiners |
| N → “cat” \| “dog” \| “boy” \| “girl” \| “candy” | Nouns |
| V → “sees” \| “eats” \| “runs” | Verbs |
This grammar generates sentences like:
a cat runs
a boy eats a candy
Example Derivations:
“a cat sees a dog”
S ⇒ NP VP ⇒ Det N VP ⇒ a N VP ⇒ a cat VP ⇒ a cat V NP ⇒ a cat sees NP ⇒ a cat sees Det N ⇒ a cat sees a N ⇒ a cat sees a dog
Note that this grammar may derive sentences that are valid but nonsensical, such as “a candy sees a girl”. In formal language theory, a Context-Free Grammar defines the structure or format of valid strings in a language, regardless of their meaning. A string is considered valid if it can be derived from the start symbol S by applying the production rules of the grammar. A sentence such as “a candy runs” follows the correct structure (syntax) but doesn’t make sense (semantics). The grammar only cares about how the sentence is put together, not whether it has a logical meaning.
4.4 Example: Grammar for palindromes#
A palindrome is a string that reads the same forward and backward. Here’s a context-free grammar for generating palindromes:
V: {\(S\)}
Σ: {\(a, b\)}
R: \(S → aSa | bSb | a | b | \wedge\)
Components breakdown:
| Components | Explanation |
|---|---|
| V = {S} | Non-terminal for the start symbol, representing the palindrome. |
| Σ = {“a”, “b”} | Terminals: the characters used to form the palindrome are “a” and “b”. |
| R: Production Rules | |
| S → aSa | Generates palindromes where the first and last characters are “a”, with a smaller palindrome in between. |
| S → bSb | Generates palindromes where the first and last characters are “b”, with a smaller palindrome in between. |
| S → a | A base case where the palindrome consists of just the letter “a”. |
| S → b | A base case where the palindrome consists of just the letter “b”. |
| S → ∧ | The empty string, which is also a valid palindrome. |
This grammar generates palindromes like:
∧ (empty string)
a
b
abba
baab
aabbaa
Example Derivations for ‘abba’:
S ⇒ aSa ⇒ abSba ⇒ abba
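A membership check can mirror the productions directly: S → aSa and S → bSb strip matching first and last characters and recurse inward, while S → a, S → b, and S → ∧ are the base cases. A sketch (the function name is illustrative):

```python
# Membership test for the palindrome grammar over Σ = {a, b}.
# Each case of the function corresponds to one group of productions.
def is_palindrome_cfg(s):
    if len(s) <= 1:                      # S → a | b | ∧
        return all(ch in 'ab' for ch in s)
    if s[0] == s[-1] and s[0] in 'ab':   # S → aSa or S → bSb
        return is_palindrome_cfg(s[1:-1])
    return False

for s in ['', 'a', 'abba', 'aabbaa', 'abab']:
    print(repr(s), is_palindrome_cfg(s))
```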
4.5 Example: Grammar for {aⁿbⁿ | n ≥ 0}#
The language {aⁿbⁿ | n ≥ 0} consists of strings where there are an equal number of a’s followed by b’s, with n representing the number of a’s and b’s.
V: {\(S\)}
Σ: {\(a, b\)}
R: \(S → aSb | \wedge\)
Components breakdown:
| Components | Explanation |
|---|---|
| V = {S} | Non-terminal for the start symbol, representing the balanced structure of aⁿbⁿ. |
| Σ = {“a”, “b”} | Terminals: the characters “a” and “b” used to form the strings in the language. |
| R: Production Rules | Rules that define how strings in the language are generated |
| S → aSb | Generates a string with an “a” at the start, a “b” at the end, and recursively generates the middle part (another “a” and “b” pair). |
| S → ∧ | The empty string ∧ is also a valid string in the language, corresponding to n = 0 (no “a” or “b”). |
This grammar generates strings like:
∧ (empty string)
ab
aabb
aaabbb
Sample derivation for ‘aabb’: S ⇒ aSb ⇒ aaSbb ⇒ aabb
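Because S is the only non-terminal and S → aSb is the only recursive rule, every string in the language is obtained by applying S → aSb exactly n times and finishing with S → ∧. A sketch of that process (the function name is our own):

```python
# Generate a^n b^n by literally applying the productions of the grammar:
# n applications of S → aSb, then one application of S → ∧.
def a_n_b_n(n):
    s = 'S'
    for _ in range(n):
        s = s.replace('S', 'aSb')   # apply S → aSb (wraps one more a…b pair)
    return s.replace('S', '')       # apply S → ∧

for n in range(4):
    print(repr(a_n_b_n(n)))
```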
5. Parse Trees#
A parse tree (or derivation tree) is a graphical representation of a derivation. It shows the hierarchical structure of the derived string.
5.1 Properties of Parse Trees:#
Root is labeled with the start symbol
Interior nodes are labeled with non-terminals
Leaf nodes are labeled with terminals or ∧ (empty string)
If A → X₁X₂…Xₙ is a production used in the derivation, then in the parse tree:
A is the label of an interior node
X₁, X₂, …, Xₙ are labels of its children from left to right
5.2 Example Parse Tree#
For the string “2 + 3” using grammar G3 (every production used is shown, including the F → N → D chains):

          S
          |
          E
        / | \
       E  +  T
       |     |
       T     F
       |     |
       F     N
       |     |
       N     D
       |     |
       D     3
       |
       2
5.3 Example Python Implementation#
We use three libraries to build and visualize parse trees:
| Package | Purpose |
|---|---|
| nltk | Natural Language Toolkit — provides CFG parsing via `nltk.CFG` and `nltk.ChartParser` |
| svgling | Renders parse trees as scalable SVG inside Jupyter |
| graphviz | General graph visualization |
The cell below checks whether each package is already installed and installs it automatically if not. This is safe to re-run; it only installs what is missing.
Outside Jupyter, install once from the terminal:
pip install nltk svgling graphviz
# Install required Python packages using pip
import sys
import importlib.util
import subprocess

def install_if_missing(package):
    # Check whether the package can be imported; install it via pip if not.
    if importlib.util.find_spec(package) is None:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"{package} has been installed")
    else:
        print(f"{package} is already installed")

required_packages = ['nltk', 'svgling', 'graphviz']
for package in required_packages:
    install_if_missing(package)
nltk is already installed
svgling is already installed
graphviz is already installed
Parse Tree Example: Arithmetic Expressions (Grammar G3)#
This example demonstrates two things simultaneously:
Parsing — using NLTK’s `ChartParser` to automatically find all parse trees for a given token list.
Derivation tracing — manually printing each step of a leftmost derivation, mirroring the formal notation from Section 3.1.
The grammar below is a code translation of G3 from Section 4.2:
| Rule in G3 | NLTK string |
|---|---|
| E → E + T \| E − T \| T | `E -> E '+' T \| E '-' T \| T` |
| T → T * F \| T / F \| F | `T -> T '*' F \| T '/' F \| F` |
| F → (E) \| N \| −F | `F -> '(' E ')' \| N \| '-' F` |
| N → ND \| D | `N -> N D \| D` |
| D → 0 \| 1 \| … \| 9 | `D -> '0' \| '1' \| … \| '9'` |
Notice that NLTK requires each terminal to be a quoted string (e.g., '+'),
while non-terminals are unquoted identifiers. This mirrors the formal
distinction between Σ (terminals) and V (non-terminals).
Note on display(): The display(draw_tree(tree)) call renders an SVG
inside Jupyter. Outside a notebook, replace it with tree.pretty_print() to
get an ASCII rendering in the terminal.
# ─────────────────────────────────────────────────────────────────────────────
# Parse Tree Visualization — Arithmetic Expressions (Grammar G3)
#
# Running outside Jupyter:
#   python parse_tree_arithmetic.py
#   Replace:  display(draw_tree(tree))
#   With:     tree.pretty_print()
# ─────────────────────────────────────────────────────────────────────────────
import nltk
from nltk import CFG
from IPython.display import display  # Jupyter-specific; see note above

# Try importing svgling for SVG tree rendering in Jupyter.
# If unavailable, we fall back to NLTK's ASCII pretty_print().
try:
    from svgling import draw_tree
    SVG_AVAILABLE = True
except ImportError:
    SVG_AVAILABLE = False
    print("Warning: 'svgling' not found. Run: pip install svgling")
    print("Falling back to ASCII tree output.\n")

# ── Define Grammar G3 ─────────────────────────────────────────────────────────
# This is the NLTK encoding of the arithmetic expression grammar from Section 4.2.
# Key: non-terminals (S, E, T, F, N, D) are unquoted; terminals are quoted strings.
grammar = CFG.fromstring("""
    S -> E
    E -> E '+' T | E '-' T | T
    T -> T '*' F | T '/' F | F
    F -> '(' E ')' | N | '-' F
    N -> N D | D
    D -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
""")

# ChartParser is a chart-based (dynamic programming) parser that finds all
# valid parse trees. For unambiguous grammars there is exactly one tree per string.
parser = nltk.ChartParser(grammar)

# ── Parse and display trees for several expressions ───────────────────────────
expressions = [
    ['2', '+', '3'],
    ['2', '+', '3', '*', '4'],            # operator precedence: * binds tighter than +
    ['(', '5', '+', '2', ')', '*', '3'],  # parentheses override precedence
    ['-', '2', '*', '3']                  # unary negation handled by F → -F
]

for expr in expressions:
    print(f"\nParse Tree for: {' '.join(expr)}")
    for tree in parser.parse(expr):
        if SVG_AVAILABLE:
            display(draw_tree(tree))   # Jupyter: renders as SVG
        else:
            tree.pretty_print()        # Outside Jupyter: ASCII tree in terminal

# ── Manual leftmost derivation trace ──────────────────────────────────────────
# This function prints each step of a leftmost derivation in the style of
# Section 3.1: always expanding the leftmost non-terminal first.
# The steps are hard-coded for '2 + 3 * 4'; chains such as F → N → D → '2'
# are compressed into a single displayed step.
def show_leftmost_derivation(expression):
    print(f"\nLeftmost Derivation for: {' '.join(expression)}")
    steps = [
        "S",
        "⇒ E",          # S → E
        "⇒ E + T",      # E → E + T (leftmost E chosen; '+' fixes the operator)
        "⇒ T + T",      # E → T (leftmost E, no more addition)
        "⇒ F + T",      # T → F (leftmost T)
        "⇒ 2 + T",      # F → N → D → '2'
        "⇒ 2 + T * F",  # T → T * F (leftmost T in right subtree)
        "⇒ 2 + F * F",  # T → F
        "⇒ 2 + 3 * F",  # F → N → D → '3'
        "⇒ 2 + 3 * 4"   # F → N → D → '4'
    ]
    for i, step in enumerate(steps):
        print(f"  Step {i}: {step}")

show_leftmost_derivation(['2', '+', '3', '*', '4'])
Parse Tree for: 2 + 3
Parse Tree for: 2 + 3 * 4
Parse Tree for: ( 5 + 2 ) * 3
Parse Tree for: - 2 * 3
Leftmost Derivation for: 2 + 3 * 4
Step 0: S
Step 1: ⇒ E
Step 2: ⇒ E + T
Step 3: ⇒ T + T
Step 4: ⇒ F + T
Step 5: ⇒ 2 + T
Step 6: ⇒ 2 + T * F
Step 7: ⇒ 2 + F * F
Step 8: ⇒ 2 + 3 * F
Step 9: ⇒ 2 + 3 * 4
Parse Tree Example: Simple English Sentences (Grammar G4)#
This example applies the same NLTK parsing pipeline to the natural language grammar G4 from Section 4.3. The key difference from the arithmetic case: natural language grammars become ambiguous very easily, and if this grammar were extended, the same sentence could have multiple parse trees.
Two derivation functions are defined manually to contrast:
Leftmost derivation — always expand the leftmost non-terminal next (used in top-down / recursive descent parsing).
Rightmost derivation — always expand the rightmost non-terminal next (the reversed order used internally by LR parsers working bottom-up).
Both produce the same final string "the cat sees a dog", but through different intermediate steps. Compare them side-by-side to see which non-terminals are expanded early vs. late.
Outside Jupyter: Replace display(draw_tree(tree)) with tree.pretty_print().
# ─────────────────────────────────────────────────────────────────────────────
# Parse Tree Visualization — Simple English Sentences (Grammar G4)
#
# Running outside Jupyter:
#   python parse_tree_english.py
#   Replace:  display(draw_tree(tree))
#   With:     tree.pretty_print()
# ─────────────────────────────────────────────────────────────────────────────
import nltk
from nltk import CFG
from IPython.display import display  # Jupyter-specific; see note above

try:
    from svgling import draw_tree
    SVG_AVAILABLE = True
except ImportError:
    SVG_AVAILABLE = False
    print("Warning: 'svgling' not found. Run: pip install svgling")

# ── Define Grammar G4 ─────────────────────────────────────────────────────────
# NLTK encoding of the English sentence grammar from Section 4.3
# (a slight variant: it adds 'the' and 'chases' and uses a smaller word list).
# S = sentence, NP = noun phrase, VP = verb phrase
# Det = determiner (e.g. "a", "the"), N = noun, V = verb
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'a' | 'the'
    N -> 'cat' | 'dog'
    V -> 'chases' | 'sees'
""")

print("Defined Context-Free Grammar G4:")
print(grammar)

# ── Leftmost derivation (Section 3.1) ─────────────────────────────────────────
# At each step, the leftmost non-terminal is replaced.
# Used by top-down parsers (e.g., Recursive Descent Parser).
def leftmost_derivation():
    print("\nLeftmost Derivation for 'the cat sees a dog':")
    steps = [
        "S",
        "⇒ NP VP",               # S → NP VP (leftmost = S)
        "⇒ Det N VP",            # NP → Det N (leftmost = NP)
        "⇒ the N VP",            # Det → 'the'
        "⇒ the cat VP",          # N → 'cat'
        "⇒ the cat V NP",        # VP → V NP (expand VP)
        "⇒ the cat sees NP",     # V → 'sees'
        "⇒ the cat sees Det N",  # NP → Det N
        "⇒ the cat sees a N",    # Det → 'a'
        "⇒ the cat sees a dog",  # N → 'dog'
    ]
    for step in steps:
        print(f"  {step}")

# ── Rightmost derivation (Section 3.2) ────────────────────────────────────────
# At each step, the rightmost non-terminal is replaced.
# Used internally (in reverse) by bottom-up parsers such as LR parsers.
def rightmost_derivation():
    print("\nRightmost Derivation for 'the cat sees a dog':")
    steps = [
        "S",
        "⇒ NP VP",               # S → NP VP (rightmost = S)
        "⇒ NP V NP",             # VP → V NP (rightmost = VP)
        "⇒ NP V Det N",          # NP → Det N (rightmost NP)
        "⇒ NP V Det dog",        # N → 'dog'
        "⇒ NP V a dog",          # Det → 'a'
        "⇒ NP sees a dog",       # V → 'sees'
        "⇒ Det N sees a dog",    # NP → Det N (now the only non-terminal)
        "⇒ Det cat sees a dog",  # N → 'cat'
        "⇒ the cat sees a dog",  # Det → 'the'
    ]
    for step in steps:
        print(f"  {step}")

leftmost_derivation()
rightmost_derivation()

# ── Parse and display tree ────────────────────────────────────────────────────
# ChartParser finds all valid parse trees for the given sentence.
parser = nltk.ChartParser(grammar)
sentence = ['a', 'dog', 'chases', 'the', 'dog']
print(f"\nParse Tree for: {' '.join(sentence)}")
for tree in parser.parse(sentence):
    if SVG_AVAILABLE:
        display(draw_tree(tree))   # Jupyter: SVG render
    else:
        tree.pretty_print()        # Outside Jupyter: ASCII tree
Defined Context-Free Grammar G4:
Grammar with 9 productions (start state = S)
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'a'
Det -> 'the'
N -> 'cat'
N -> 'dog'
V -> 'chases'
V -> 'sees'
Leftmost Derivation for 'the cat sees a dog':
S
⇒ NP VP
⇒ Det N VP
⇒ the N VP
⇒ the cat VP
⇒ the cat V NP
⇒ the cat sees NP
⇒ the cat sees Det N
⇒ the cat sees a N
⇒ the cat sees a dog
Rightmost Derivation for 'the cat sees a dog':
S
⇒ NP VP
⇒ NP V NP
⇒ NP V Det N
⇒ NP V Det dog
⇒ NP V a dog
⇒ NP sees a dog
⇒ Det N sees a dog
⇒ Det cat sees a dog
⇒ the cat sees a dog
Parse Tree for: a dog chases the dog
6. Practice Exercises#
6.1 Definition of CFG#
For each of the following languages, identify V, Σ, R. Then give three strings that belong to the language generated by the grammar.
Language 1: a language that contains strings with one or more x’s followed by one or more y’s.
Language 2: a language that contains strings of the form 0ⁿ1ⁿ where n ≥ 1.
6.2 Derivations#
Using the English sentence grammar G4 from Section 4.3:
give a complete derivation for “a girl sees a cat”
print the parse tree for this derivation
provide two more valid sentences that can be generated by this grammar.
For the grammar G = ({S}, {a, b}, {S → aSb | SS | ∧}):
Give a leftmost derivation for “aabb”
Give a rightmost derivation for “aabb”
Are these derivations unique? If not, find another derivation for the same string
Draw the parse tree(s) for your derivation(s)
6.3 Expanding a CFG Class#
Implement the CFG class from the notebook with additional methods:
`def is_valid_string(self, input_string)`: a method that checks whether a given string can be generated by this grammar
`def generate_strings(self, max_length)`: a method that generates all strings of the language up to length max_length
6.4 CFG Validator#
Create a class that validates whether a given Context-Free Grammar is properly formed. The validator should check:
Start symbol exists in the set of non-terminals
All symbols in productions are either terminals, non-terminals, or the null string ∧
All left-hand sides of productions are single non-terminals
The grammar has at least one production from the start symbol
6.5 Balanced Parentheses Checker#
Implement a parser that checks if a string of parentheses (including (), [], {}) is balanced using a CFG approach. The parser should:
Define the grammar for balanced parentheses
Implement a recursive descent parser
Return True if balanced, False otherwise
6.6 Palindrome Generator and Validator#
Implement a system that:
Generates palindromes using a CFG
Validates if a string is a palindrome using CFG parsing
Finds the shortest derivation for a given palindrome
6.7 Simple Expression Evaluator#
Build an expression evaluator that:
Parses arithmetic expressions using a CFG
Builds a parse tree
Evaluates the expression from the parse tree
Handles operator precedence correctly
7. Further Reading#
“Introduction to the Theory of Computation” by Michael Sipser, Section 2.1 - Context-Free Grammars
“Introduction to Computer Theory” by Daniel I.A. Cohen, Chapter 12 - Context-Free Grammars
“Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, and Ullman, Chapter 5 - Context-Free Grammars