2. Languages#

Learning Objectives#

After completing this chapter, you will be able to:

  • Define an alphabet, a word, a language, and a grammar.

  • Understand some different ways in which a language can be defined, in particular recursively.

  • Understand the concept of Kleene closure.

1. Formal Languages#

“Colorless green ideas sleep furiously.”

That sentence makes no sense. It’s contradictory, and it applies verbs to objects that can’t perform them. However, right now you’re probably parsing that sentence and trying to understand what it means, because it seems like it should mean something. But why? Why is this fundamentally different than, say, the five words:

“Person woman man camera TV”

Well, it’s because the grammar - the form - of the first sequence aligns with how we structure sentences in the English language. The second does not. This is the fundamental idea behind a formal language. It’s a language where our primary concern is the structure of the letters or the words within it, and not their deeper conceptual meaning.

This idea has major applications in computer programming. Specifically, when you write a computer program, there’s a certain form it has to have in order for it to make sense to the computer. This is what a compiler does - it takes the program you’ve written, and translates it into instructions the computer can execute. If the program compiles, that means its form is correct. It makes sense to the computer. Now, the deeper meaning behind your program - specifically whether the program you’ve written actually does what you want it to do - is a different question altogether.

In this section we’ll discuss the building blocks of formal languages and some of the basic ideas.

2. Alphabets and Words#

“Words, words, words…”

  • William Shakespeare, Hamlet, Act II, Scene ii

A language begins with a finite set of symbols called an alphabet. I trust you’re familiar, perhaps from kindergarten, with the idea. Typically we denote an alphabet with the symbol \(\Sigma\). We can form finite sequences of symbols from an alphabet, and a language is a specified subset of these sequences. We call the sequences that are in a language the words of the language. Please note that we allow the concept of a word with no letters, called the empty string, and denote it by \(\lambda\). We also allow the concept of a language with no words, and denote it by \(\emptyset\), the symbol used to represent the empty set.

Example#

We could use the standard English alphabet, along with the apostrophe and hyphen

\[\displaystyle \Sigma = \{a, b, c, d, e, \ldots z, ', -\}\]

as an alphabet, and the language could be the set of words from a standard English dictionary.

This is a perfectly acceptable way to define an alphabet and language, and this can work if the set of words is finite. However, let’s suppose we wanted to expand our scope a bit, and instead define our alphabet as the entire set of words in the English language, and our “words” as the set of English sentences. Well, in this case the method wouldn’t work. We can imagine a dictionary that covers every English word, but that won’t be possible for every English sentence. There would be an infitine number of sentences, and the book would never end! Take, for example:

  1. “I saw one ship a sailing in.”,

  2. “I saw two ships a sailing in.”,

  3. “I saw three ships a sailing in.”,

etc…

These are all sentences in the English language, and the list goes on forever.

3. Grammars#

So what do we do if our language has an infinite number of words? Instead of listing out every one, which would be impossible, we need some way to check whether a sequence of letters matches an acceptable form. This form is what we call a grammar. If we were dealing with a natural language, we’d also need to make sure that a sentence with proper grammar actually made sense. We probably wouldn’t say the example we saw at the beginning of these notes, “Colorless green ideas sleep furiously.”, makes sense, but this concern takes us out of the realm of formal languages. So, we’ll only care if a word adheres to the rules of our grammar, and not whether the word “makes sense” from a broader perspective.

Example#

We could define an alphabet with only one letter, say \(\Sigma = \{x\}\), and one language over that alphabet could be any finite, nonempty string of characters. So, \(L = \{x, xx, xxx, \ldots\}\).

We define the length of a word in a language as the number of characters (symbols from the alphabet) it has. So, for example, in our language right above \(length(xxx) = 3\). The concatenation of two words \(w_{1}, w_{2}\), typically written as \(w_{1}*w_{2}\) or just \(w_{1}w_{2}\), combines them into one sequence of symbols. So, if \(w_{1}\) = “Weber” and \(w_{2}\) = “State”, then \(w_{1}w_{2}\) = “WeberState”. The notation \(w^{n}\) means the word \(w\) concatenated with itself \(n\) times. So, if \(w\) = “great” then \(w^{3}\) = “greatgreatgreat”.

Example#

We could define an alphabet consisting of the ten digit characters \(\Sigma = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}\), and a language over this alphabet as any finite sequence of these characters that does not start with \(0\) if its length is greater than \(1\). This would correspond with the set of natural (counting) numbers \(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, \ldots\).

We define the reverse of a word to be its sequence of symbols in the opposite order. For example, \(reverse(abc) = cba\). A word that’s equal to its reverse is a palindrome. For example, the word “racecar” would be a palindrome.

4. Kleene Closure#

Given an alphabet \(\Sigma\), the closure of the alphabet is the language in which any string of letters from \(\Sigma\) is a word, including the empty string \(\lambda\). We call this the closure or Kleene closure of the alphabet, and denote it with the Kleene star \(\Sigma^{*}\).

Example#

If \(\Sigma = \{a, b\}\) then \(\Sigma^{*} = \{\lambda, a, b, aa, ab, bb, aaa, aab, \ldots\}\).

Note that If \(\Sigma = \emptyset\) then \(\Sigma^{*} = \{\lambda\}\). For any nonempty alphabet, the closure is an infinite set of words.

The set \(\Sigma^{+}\) is defined as everything in the closure of the alphabet \(\Sigma\) except for the empty string \(\lambda\).

If \(S\) is any set of words, then by \(S^{*}\) we mean the set of all finite strings formed by concatenating words from \(S\), where any word can be used as often as we like, and where the null string is also included. The set \(S^{+}\) is all finite positive concatenations of strings from \(S\). It does not contain \(\lambda\) unless \(S\) does as well.

The notation \(S^{**}\) means the closure of the closure of \(S\), so \((S^{*})^{*}\). Similarly, \(S^{***}\) means the closure of the closure of the closure of \(S\), and so on.

Example#

For any set of strings \(S\) we have \(S^{*} = S^{**}\). We can prove this by noting that for any string \(s \in S^{*}\), we must have \(s \in S^{**}\), because \(S^{**}\) is the set of strings formed by concatenating any finite number of strings from \(S^{*}\), and as \(1\) is a finite number, this must include \(s\). On the other hand, if we have a string \(s \in S^{**}\) it must be a finite concatenation of strings from \(S^{*}\), so:

\[s = s_{1}s_{2} \ldots s_{k}\]

for some finite number \(k\), where each \(s_{i} \in S^{*}\). Now, each \(s_{i} \in S^{*}\), which means each \(s_{i}\) is a concatenation of a finite number of strings from \(S\):

\[s_{i} = s_{i,1}s_{i,2} \ldots s_{i,l_{i}}\]

for some finite number \(l_{i}\), where each \(s_{i,j} \in S\). Combining these, we get:

\[s = s_{1,1}s_{1,2} \ldots s_{1,l_{1}} s_{2,1} s_{2,2} \ldots s_{2,l_{2}} \ldots s_{k,1}s_{k,2} \ldots s_{k,l_{k}}\]

This whole thing is just a finite concatenation of strings from \(S\), which means \(s \in S^{*}\). Done!

An alphabet is a set of symbols, and so it would make sense to represent an alphabet using a set data structure in Python. For example, we could create an alphabet consisting of only the vowel characters:

vowels = {'a','e','i','o','u'}

The Kleene closure of this alphabet would then be all the strings consisting of just vowels.

We could write a Python function that checks whether a given string is in the closure of a set of characters. First, we’ll want a helper function that will check whether an object is an alphabet (a set of characters).

def is_alphabet(alpha):
  if not isinstance(alpha, set):
    return False
  for item in alpha:
    if not isinstance(item, str):
      return False
    elif len(item) > 1:
      return False
  return True
is_alphabet(vowels)
True

Then, we can use this to check whether a string is in the closure of an alphabet:

def in_closure(s, alpha):
  if not isinstance(s, str):
    return False
  if not is_alphabet(alpha):
    return False
  for i in range(len(s)):
    if s[i] not in alpha:
      return False
  return True

We could use this function to check whether the string ‘eau’ is in the alphabet of vowels:

in_closure('eau', vowels)
True

The function says it is, which is correct. However, the word ‘slartybartfast’ has more than just vowels, and so would not be:

in_closure('slartybartfast', vowels)
False

Note - We’ve been very careful to put checks into this code to make sure, for example, that an alphabet is a set of characters or that the object given to the first argument of the in_closure function is indeed a string. We won’t necessarily be so concerned with guardrails in the future.

5. Recursive Definitions#

A common and powerful way of providing the grammar for a language is recursively. A recursive grammar begins by defining a basic set of words within the language, and then a set of rules for generating additional words. The language is the set of all words this process can generate. It’s called recursive because determining whether a word is in the language, if it’s not part of the basic set, involves referencing other words that we’ve already determined to be in the language.

Example#

We could define a language as:

  • Rule 1 - \(\lambda\) is in \(L\).

  • Rule 2 - If \(w\) is any word in \(L\), then \(xw\) is also in \(L\).

This recursive definition would define the language:

\[L = \{\lambda, x, x^{2}, x^{3}, \ldots\}\]

We could write a Python function for determining if a string is in \(L\) using recursion:

def in_L(word):
  if word == '':
    return True
  if word[0] == 'x':
    return in_L(word[1:])
  return False
in_L('xxxx')
True
in_L('xxyx')
False

We could have actually defined the Kleene closure of a language recursively as:

  • Rule 1 - If \(S\) is a language, then all the words of \(S\) are is \(S^{*}\).

  • Rule 2 - \(\lambda\) is in \(S^{*}\).

  • Rule 3 - If \(x\) and \(y\) are in \(S^{*}\), then so is their concatenation \(xy\).

Example#

We could define the positive even numbers recursively as:

  • Rule 1 - The number \(2\) is in \(EVEN\).

  • Rule 2 - If \(n\) is in \(EVEN\), so is \(n+2\).

We can prove the number \(8\) is in \(EVEN\) using this recursive definition. Specifically, we know that \(2\) is in \(EVEN\) by Rule 1, and using this and Rule 2 we get that \(2 + 2 = 4\) is in \(EVEN\). Using this and Rule 2 again, we know that \(4 + 2 = 6\) is in \(EVEN\). Using this and Rule 2 one more time, we know that \(6 + 2 = 8\) is in \(EVEN\).

6. Practice Exercises#

Exercise 1 - Consider the language \(S^{*}\), when \(S = \{a,b\}\). How many words does this language have of length 2? Of length 3? Of length \(n\)?

Exercise 2 - A student suggested the following algorithm to test a string of \(a\)’s and \(b\)’s to see if it is a word is \(S^{*}\), where \(S = \{aa, ba, aba, abaab\}\). Step 1, cross off the longest set of characters from the front of the string that is a word in \(S\). Step 2, repeat step 1 until it is no longer possible. If what remains is the string \(\lambda\) (the empty string), the original string was a word in \(S^{*}\). If what remains is not \(\lambda\) (this means some letters are left, but we cannot find a word in \(S\) at the beginning), the original string was not a word in \(S^{*}\). Find a string that disproves this algorithm.

Exercise 3 - For a set of words \(S\), is the set \(S^{***}\) different than the set \(S^{**}\)?

Exercise 4 - Provide another recursive definition of the set EVEN.

Exercise 5 - Write a recursive Python function that determines whether a positive integer is even.

7. References and Further Reading#