Documentation

Atlas.TheoryOfComputation.code.ContextFree

@[reducible, inline]
abbrev ContextFree.CFG (T : Type u_1) :
Type (max 1 u_1)

A Context Free Grammar (CFG) over the terminal alphabet T.

In Sipser's notation, a CFG is a 4-tuple G = (V, Σ, R, S) where V is a finite set of variables (nonterminals), Σ = T is the set of terminal symbols, R is a finite set of production rules of the form V → (V ∪ Σ)*, and S ∈ V is the start variable.

Instances For
    @[reducible, inline]
    abbrev ContextFree.Rule (T : Type u_1) (N : Type u_2) :
    Type (max u_1 u_2)

    A production rule of a CFG: an element of the form A → α where A ∈ V is a nonterminal and α ∈ (V ∪ Σ)* is a sentential form.

    Instances For
      @[reducible, inline]
      abbrev ContextFree.Sym (T : Type u_2) (N : Type u_3) :
      Type (max u_2 u_3)

      A grammar symbol: either a terminal t ∈ T or a nonterminal A ∈ N.

      Instances For
        @[reducible, inline]
        abbrev ContextFree.SententialForm (T : Type u_2) (N : Type u_3) :
        Type (max u_3 u_2)

        A sentential form is a finite string over the combined alphabet of terminals and nonterminals, i.e. an element of (V ∪ Σ)*.

        Instances For
          @[reducible, inline]
          abbrev ContextFree.Produces {T : Type u_1} (g : CFG T) (u v : List (Symbol T g.NT)) :

          One-step derivation u ⇒ v: v is obtained from u by applying a single production rule of g at one position.

          Instances For
            @[reducible, inline]
            abbrev ContextFree.Derives {T : Type u_1} (g : CFG T) :
            List (Symbol T g.NT)List (Symbol T g.NT)Prop

            Multi-step derivation u ⇒* v: the reflexive transitive closure of Produces g, i.e. v is obtained from u by zero or more rule applications.

            Instances For
              @[reducible, inline]
              abbrev ContextFree.Generates {T : Type u_1} (g : CFG T) (s : List (Symbol T g.NT)) :

              g.Generates w means the start symbol S derives the sentential form w, i.e. S ⇒* w (where w is a list of symbols, not yet restricted to terminals).

              Instances For
                @[reducible, inline]
                abbrev ContextFree.language {T : Type u_1} (g : CFG T) :

                The language generated by the grammar g: L(G) = { w ∈ Σ* | S ⇒* w }.

                Instances For
                  @[reducible, inline]
                  abbrev ContextFree.IsContextFree {T : Type u_1} (L : Language T) :

                  A language L ⊆ T* is a context free language (CFL) if there exists a CFG G with L = L(G).

                  Instances For
                    def ChomskyNormalForm.IsCNFRule {T : Type u_1} {N : Type u_2} (r : ContextFreeRule T N) (startVar : N) :

                    A rule r : A → α is in Chomsky Normal Form if its right-hand side has one of the following forms:

                    • A → BC where B, C are nonterminals distinct from the start variable, or
                    • A → a where a is a single terminal symbol, or
                    • S → ε where S is the start variable (the only allowed ε-production).
                    Instances For

                      A grammar g is in Chomsky Normal Form iff every one of its rules is.

                      Instances For

                        Lemma 1 (Sipser, Lecture 7): Every context free grammar can be converted into an equivalent grammar in Chomsky Normal Form (i.e. one generating the same language).

                        Refinement of cfg_to_cnf producing an equivalent CNF grammar whose nonterminal type is additionally a Fintype with decidable equality. This finite witness is needed for the pumping-length argument, which counts nonterminals.

                        inductive ChomskyNormalForm.DerivesIn {T : Type u_1} (g : ContextFreeGrammar T) :
                        List (Symbol T g.NT)List (Symbol T g.NT)Prop

                        DerivesIn g u w n means u ⇒* w in exactly n derivation steps in g. This is a counted refinement of g.Derives used to formalise step-counting arguments such as Lemma 2 below.

                        Instances For
                          def ChomskyNormalForm.ntCount {T : Type u_1} {N : Type u_2} :
                          List (Symbol T N)

                          Counts the number of nonterminal symbols occurring in a sentential form.

                          Instances For
                            def ChomskyNormalForm.tCount {T : Type u_1} {N : Type u_2} :
                            List (Symbol T N)

                            Counts the number of terminal symbols occurring in a sentential form.

                            Instances For
                              theorem ChomskyNormalForm.ntCount_append {T : Type u_1} {N : Type u_2} (l₁ l₂ : List (Symbol T N)) :
                              ntCount (l₁ ++ l₂) = ntCount l₁ + ntCount l₂

                              The nonterminal count is additive over list concatenation.

                              theorem ChomskyNormalForm.tCount_append {T : Type u_1} {N : Type u_2} (l₁ l₂ : List (Symbol T N)) :
                              tCount (l₁ ++ l₂) = tCount l₁ + tCount l₂

                              The terminal count is additive over list concatenation.

                              A string consisting only of terminals (as in w.map Symbol.terminal) contains no nonterminal symbols.

                              A string of terminals contributes exactly w.length terminals to the tCount measure.

                              In a CNF grammar, the start variable S never appears on the right-hand side of any rule. This is forced by the definition of IsCNFRule, which requires the two nonterminals in A → BC to differ from the start variable.

                              theorem ChomskyNormalForm.cnf_start_not_in_produces {T : Type u_1} (g : ContextFreeGrammar T) (hcnf : IsCNF g) {sf v : List (Symbol T g.NT)} (hp : g.Produces sf v) (hsf : Symbol.nonterminal g.initialsf) :

                              "Absence of the start variable" is preserved by one-step derivations in a CNF grammar: if S ∉ sf and sf ⇒ v, then S ∉ v.

                              theorem ChomskyNormalForm.cnf_produces_cost_step {T : Type u_1} (g : ContextFreeGrammar T) (hcnf : IsCNF g) {sf v : List (Symbol T g.NT)} (hp : g.Produces sf v) (hsf : Symbol.nonterminal g.initialsf) :
                              2 * tCount v + ntCount v = 2 * tCount sf + ntCount sf + 1

                              Potential function for a single derivation step in a CNF grammar. Defining Φ(sf) := 2·tCount(sf) + ntCount(sf), one CNF production step (other than the start ε-rule, which is forbidden once S ∉ sf) increases Φ by exactly one. This is the key invariant behind Lemma 2 (cnf_derivation_length).

                              theorem ChomskyNormalForm.cnf_derivesIn_cost {T : Type u_1} (g : ContextFreeGrammar T) (hcnf : IsCNF g) {sf w : List (Symbol T g.NT)} {n : } (hd : DerivesIn g sf w n) (hsf : Symbol.nonterminal g.initialsf) :
                              2 * tCount w + ntCount w = 2 * tCount sf + ntCount sf + n

                              Iterating the per-step potential identity: after n CNF derivation steps starting from a sentential form not containing S, the potential Φ has increased by exactly n.

                              theorem ChomskyNormalForm.cnf_derivation_length {T : Type u_1} (g : ContextFreeGrammar T) (hcnf : IsCNF g) (w : List T) (hw : w []) (hw_mem : w g.language) (n : ) (hderiv : DerivesIn g [Symbol.nonterminal g.initial] (List.map Symbol.terminal w) n) :
                              n = 2 * w.length - 1

                              Lemma 2 (Sipser, Lecture 7): If H is in Chomsky Normal Form and w ∈ L(H) is nonempty, then every derivation of w from the start symbol has exactly 2|w| − 1 steps. Proved by combining the potential identity cnf_derivesIn_cost with the fact that the first step must produce either BC (cost +2) or a single terminal (cost +2 toward 2|w|).

                              structure PDA (α : Type u) (σ : Type v) (γ : Type w) :
                              Type (max (max u v) w)

                              A Pushdown Automaton (PDA), Sipser, Lecture 4.

                              Formally, a PDA is a 6-tuple (Q, Σ, Γ, δ, q₀, F) where Σ = α is the input alphabet, Γ = γ is the stack alphabet, Q = σ is the set of states, q₀ ∈ Q is the start state, F ⊆ Q is the set of accept states, and δ : Q × Σ_ε × Γ_ε → 𝒫(Q × Γ_ε) is the transition relation (where means an optional symbol, encoded here by Option). A move (q', push) ∈ step q a pop means: in state q, optionally reading input a and optionally popping pop from the stack, go to q' and optionally push push.

                              Instances For
                                @[reducible, inline]
                                abbrev PDA.Config {α : Type u} {σ : Type v} {γ : Type w} :
                                Type (max v w u)

                                An instantaneous configuration of a PDA: a triple (q, input, stack) recording the current state, the remaining input to be read, and the current stack contents (top at the head of the list).

                                Instances For
                                  inductive PDA.Step {α : Type u} {σ : Type v} {γ : Type w} (M : PDA α σ γ) :

                                  One step of the PDA: read (optionally) an input symbol a, pop (optionally) pop from the top of the stack, transition to q', and push (optionally) push on top of the stack.

                                  Instances For
                                    def PDA.Reaches {α : Type u} {σ : Type v} {γ : Type w} (M : PDA α σ γ) :

                                    The reflexive transitive closure of Step: c ⊢* c' means c can reach c' in zero or more PDA moves.

                                    Instances For
                                      def PDA.Accepts {α : Type u} {σ : Type v} {γ : Type w} (M : PDA α σ γ) (w : List α) :

                                      M accepts the input string w if, starting from (q₀, w, ε), some sequence of moves consumes all of w and ends in an accept state (the final stack contents are arbitrary).

                                      Instances For
                                        def PDA.language {α : Type u} {σ : Type v} {γ : Type w} (M : PDA α σ γ) :

                                        The language recognised by the PDA: L(M) = { w ∈ Σ* | M accepts w }.

                                        Instances For
                                          def listPow {α : Type u_1} (l : List α) :
                                          List α

                                          listPow l n is the list l repeated n times (i.e. l^n in the free monoid of lists). Used to express the pumped strings v^i and y^i in the CFL pumping lemma.

                                          Instances For
                                            @[simp]
                                            theorem listPow_succ {α : Type u_1} (l : List α) (n : ) :
                                            listPow l (n + 1) = l ++ listPow l n

                                            l^(n+1) = l ++ l^n.

                                            theorem listPow_comm {α : Type u_1} (l : List α) (n : ) :
                                            l ++ listPow l n = listPow l n ++ l

                                            The list power commutes with prepending vs. appending: l ++ l^n = l^n ++ l.

                                            theorem listPow_succ_right {α : Type u_1} (l : List α) (n : ) :
                                            listPow l (n + 1) = listPow l n ++ l

                                            Right-recursive form: l^(n+1) = l^n ++ l.

                                            theorem CFLPumping.derives_pump_once {T : Type u_1} {g : ContextFreeGrammar T} {A : g.NT} {u v target : List (Symbol T g.NT)} (hloop : g.Derives [Symbol.nonterminal A] (u ++ [Symbol.nonterminal A] ++ v)) (htarget : g.Derives [Symbol.nonterminal A] target) :

                                            Pumping one level. Given a self-loop derivation A ⇒* uAv together with any derivation A ⇒* target, we can derive A ⇒* u·target·v by first taking the loop and then substituting target for the internal A.

                                            theorem CFLPumping.derives_pump_iter {T : Type u_1} {g : ContextFreeGrammar T} {A : g.NT} {u v target : List (Symbol T g.NT)} (hloop : g.Derives [Symbol.nonterminal A] (u ++ [Symbol.nonterminal A] ++ v)) (htarget : g.Derives [Symbol.nonterminal A] target) (i : ) :

                                            Iterated pumping. Iterating derives_pump_once i times yields A ⇒* u^i · target · v^i. This is the source of the v^i x y^i structure in the CFL pumping lemma.

                                            theorem CFLPumping.produces_append_split {T : Type u_1} {g : ContextFreeGrammar T} {u v w : List (Symbol T g.NT)} (h : g.Produces (u ++ v) w) :
                                            (∃ (u' : List (Symbol T g.NT)), w = u' ++ v g.Produces u u') ∃ (v' : List (Symbol T g.NT)), w = u ++ v' g.Produces v v'

                                            Splitting a one-step derivation across a concatenation. If u ++ v ⇒ w, then the single rule application happened entirely inside u or entirely inside v; correspondingly w decomposes as u' ++ v (with u ⇒ u') or as u ++ v' (with v ⇒ v').

                                            theorem CFLPumping.derives_append_split {T : Type u_1} {g : ContextFreeGrammar T} {u v w : List (Symbol T g.NT)} (h : g.Derives (u ++ v) w) :
                                            ∃ (w1 : List (Symbol T g.NT)) (w2 : List (Symbol T g.NT)), w = w1 ++ w2 g.Derives u w1 g.Derives v w2

                                            Splitting a multi-step derivation across a concatenation. If u ++ v ⇒* w, then w = w₁ ++ w₂ with u ⇒* w₁ and v ⇒* w₂.

                                            A string consisting only of terminals cannot derive anything in one step: a CFG rule requires a nonterminal on its left-hand side, so no rewrite position exists inside w.map Symbol.terminal.

                                            theorem CFLPumping.map_terminal_split {T : Type u_1} {N : Type u_2} (w : List T) (w1 w2 : List (Symbol T N)) (h : List.map Symbol.terminal w = w1 ++ w2) :
                                            ∃ (a : List T) (b : List T), w1 = List.map Symbol.terminal a w2 = List.map Symbol.terminal b w = a ++ b

                                            Any decomposition of w.map Symbol.terminal as w₁ ++ w₂ lifts to a decomposition of w itself into a ++ b (with w₁ = a.map Symbol.terminal and w₂ = b.map Symbol.terminal).

                                            inductive CFLPumping.CNFParseTree {T : Type u_1} (g : ContextFreeGrammar T) :
                                            g.NTType u_1

                                            A parse tree of a CNF grammar g, rooted at the nonterminal A.

                                            In CNF, each internal node is generated by a rule A → BC and has exactly two children rooted at B and C; each leaf is generated by a rule A → a for some terminal a. The CNF ε-rule S → ε is excluded here, since we only build parse trees for nonempty strings.

                                            Instances For
                                              def CFLPumping.CNFParseTree.yield {T : Type u_1} {g : ContextFreeGrammar T} {A : g.NT} :
                                              CNFParseTree g AList T

                                              The yield of a parse tree: the list of terminals at its leaves, read left-to-right.

                                              Instances For

                                                The height of a parse tree: leaves have height 0, and an internal node has height 1 + max(height(left), height(right)).

                                                Instances For

                                                  Yield-height bound. In a binary parse tree, the number of leaves is at most 2^h where h is the height. This is the pigeonhole bound that gives the pumping length 2^(k+1) in the CFL pumping lemma.

                                                  Soundness of parse trees. Every parse tree rooted at A witnesses a derivation A ⇒* yield(t) in the grammar.

                                                  A parse tree (which has no ε-leaves by construction) always yields a nonempty string of terminals.

                                                  def CFLPumping.CNFParseTree.ntAtDepth {T : Type u_1} {g : ContextFreeGrammar T} {A : g.NT} (t : CNFParseTree g A) (d : ) (hd : d t.height) :
                                                  g.NT

                                                  The nonterminal label appearing at depth d along the tallest path of t (breaking ties by descending into the left subtree). The pumping argument inspects this sequence of labels to find a repeat by pigeonhole.

                                                  Instances For
                                                    theorem CFLPumping.CNFParseTree.exists_decomp_nonempty_ctx {T : Type u_1} {g : ContextFreeGrammar T} {A : g.NT} (t : CNFParseTree g A) (d : ) (hd : d t.height) (hd_pos : 0 < d) :
                                                    ∃ (R : g.NT) (sub : CNFParseTree g R) (lc : List T) (rc : List T), R = t.ntAtDepth d hd sub.height = t.height - d t.yield = lc ++ sub.yield ++ rc g.Derives [Symbol.nonterminal A] (List.map Symbol.terminal lc ++ [Symbol.nonterminal R] ++ List.map Symbol.terminal rc) (lc ++ rc).length > 0

                                                    Decomposition with nonempty context. For any positive depth d along the tallest path of t, the tree splits as

                                                    yield(t) = lc ++ yield(sub) ++ rc

                                                    where sub is the subtree rooted at the nonterminal R = ntAtDepth t d, the derivation A ⇒* lc · R · rc holds, the subtree has height height(t) − d, and the surrounding context lc ++ rc is nonempty (because we went down at least one internal node and so picked off a sibling subtree, which yields a nonempty string). This is what produces the vy ≠ ε clause of the CFL pumping lemma.

                                                    theorem CFLPumping.CNFParseTree.ntAtDepth_node_eq {T : Type u_1} {g : ContextFreeGrammar T} {A B C : g.NT} (hr : { input := A, output := [Symbol.nonterminal B, Symbol.nonterminal C] } g.rules) (l : CNFParseTree g B) (r : CNFParseTree g C) (n : ) (hn : 0 < n) (hd : n (node A B C hr l r).height) :
                                                    (node A B C hr l r).ntAtDepth n hd = if h : l.height r.height then l.ntAtDepth (n - 1) else r.ntAtDepth (n - 1)

                                                    Computation rule for ntAtDepth at a node: at depth n > 0, descend into the taller child and ask for depth n − 1.

                                                    theorem CFLPumping.CNFParseTree.exists_decomp_with_coherence {T : Type u_1} {g : ContextFreeGrammar T} {A : g.NT} (t : CNFParseTree g A) (d1 d2 : ) (hd1 : d1 t.height) (hd12 : d1 + d2 t.height) :
                                                    ∃ (R : g.NT) (sub : CNFParseTree g R) (lc : List T) (rc : List T), R = t.ntAtDepth d1 hd1 sub.height = t.height - d1 t.yield = lc ++ sub.yield ++ rc g.Derives [Symbol.nonterminal A] (List.map Symbol.terminal lc ++ [Symbol.nonterminal R] ++ List.map Symbol.terminal rc) ∀ (hd2 : d2 sub.height), sub.ntAtDepth d2 hd2 = t.ntAtDepth (d1 + d2) hd12

                                                    Decomposition that is coherent with deeper sampling. Like exists_decomp_nonempty_ctx, but also guarantees that the nonterminal at depth d₂ of the extracted subtree sub coincides with the nonterminal at depth d₁ + d₂ of the original tree t. This coherence lets one apply two layers of exists_decomp_… in tree_pump and ensure that the inner pumped nonterminal matches the outer one (so the loop A ⇒* uAv is well-formed).

                                                    theorem CFLPumping.produces_from_singleton_nonterminal' {T : Type u_1} {g : ContextFreeGrammar T} {A : g.NT} {v : List (Symbol T g.NT)} (h : g.Produces [Symbol.nonterminal A] v) :
                                                    rg.rules, r.input = A v = r.output

                                                    Inversion principle: a one-step derivation [A] ⇒ v from a singleton sentential form must use a rule whose left-hand side is A, and v is that rule's right-hand side.

                                                    A derivation [A] ⇒* w to a nonempty terminal string must contain at least one step, and so factors as a first one-step [A] ⇒ v followed by v ⇒* w.

                                                    theorem CFLPumping.derives_nil_of_nil {T : Type u_1} {g : ContextFreeGrammar T} {w : List (Symbol T g.NT)} (hd : g.Derives [] w) :
                                                    w = []

                                                    The empty sentential form derives only itself: [] ⇒* w implies w = [], since there is no nonterminal in [] to rewrite.

                                                    In a CNF grammar, only the start variable can erase a string: if B ≠ S and B ⇒* w, then w is nonempty. (The only ε-rule allowed in CNF has S on its left-hand side, and the start variable never appears on a right-hand side.)

                                                    theorem CFLPumping.cnf_tree_of_derives_aux {T : Type u_1} (g : ContextFreeGrammar T) (hcnf : ChomskyNormalForm.IsCNF g) (n : ) (A : g.NT) (w : List T) :
                                                    w.length nw []g.Derives [Symbol.nonterminal A] (List.map Symbol.terminal w)∃ (t : CNFParseTree g A), t.yield = w

                                                    Existence of a parse tree, by strong induction on the length of the yield. If g is in CNF and [A] ⇒* w with w nonempty and |w| ≤ n, then there is a CNFParseTree g A whose yield is w. The proof inverts the first derivation step against the three CNF rule forms and recurses on the two halves produced by an A → BC rule.

                                                    theorem CFLPumping.cnf_tree_of_mem_language {T : Type u_1} (g : ContextFreeGrammar T) (hcnf : ChomskyNormalForm.IsCNF g) (s : List T) (hs : s g.language) (hne : s []) :
                                                    ∃ (t : CNFParseTree g g.initial), t.yield = s

                                                    Completeness of CNF parse trees: every nonempty string in L(g) (for a CNF grammar g) is the yield of some parse tree rooted at the start variable.

                                                    theorem CFLPumping.listPow_map {α : Type u_2} {β : Type u_3} (f : αβ) (l : List α) (n : ) :

                                                    List.map distributes over listPow: map f (l^n) = (map f l)^n.

                                                    theorem CFLPumping.tree_pump {T : Type u_1} (g : ContextFreeGrammar T) [Fintype g.NT] [DecidableEq g.NT] {A : g.NT} (t : CNFParseTree g A) (hh : t.height > Fintype.card g.NT) :
                                                    ∃ (u : List T) (v : List T) (x : List T) (y : List T) (z : List T), t.yield = u ++ v ++ x ++ y ++ z (∀ (i : ), g.Derives [Symbol.nonterminal A] (List.map Symbol.terminal (u ++ listPow v i ++ x ++ listPow y i ++ z))) (v ++ y).length > 0 (v ++ x ++ y).length 2 ^ (Fintype.card g.NT + 1)

                                                    Tree-level pumping lemma. If a parse tree t (rooted at A) has height exceeding the number of nonterminals k = |V|, then by pigeonhole some nonterminal R repeats along the tallest path. The two occurrences of R decompose yield(t) = u v x y z with:

                                                    theorem CFLPumping.cnf_pumping_core {T : Type u_1} {g : ContextFreeGrammar T} [Fintype g.NT] [DecidableEq g.NT] (hcnf : ChomskyNormalForm.IsCNF g) (s : List T) (hs : s g.language) (hlen : s.length 2 ^ (Fintype.card g.NT + 1)) :
                                                    ∃ (u : List T) (v : List T) (x : List T) (y : List T) (z : List T), s = u ++ v ++ x ++ y ++ z (∀ (i : ), u ++ listPow v i ++ x ++ listPow y i ++ z g.language) (v ++ y).length > 0 (v ++ x ++ y).length 2 ^ (Fintype.card g.NT + 1)

                                                    Pumping lemma for CNF grammars. If g is a CNF grammar with k = |V| nonterminals and s ∈ L(g) has length |s| ≥ 2^(k+1), then s = u v x y z with u v^i x y^i z ∈ L(g) for all i ≥ 0, vy ≠ ε, and |vxy| ≤ 2^(k+1). The argument: the height bound |s| ≤ 2^h forces h > k, so tree_pump applies to a parse tree for s.

                                                    theorem CFLPumping.pumping_lemma_cfl {T : Type u_1} {A : Language T} (hA : A.IsContextFree) :
                                                    ∃ (p : ), sA, s.length p∃ (u : List T) (v : List T) (x : List T) (y : List T) (z : List T), s = u ++ v ++ x ++ y ++ z (∀ (i : ), u ++ listPow v i ++ x ++ listPow y i ++ z A) (v ++ y).length > 0 (v ++ x ++ y).length p

                                                    Pumping Lemma for Context Free Languages (Sipser, Lecture 5).

                                                    For every CFL A there exists a pumping length p ∈ ℕ such that every string s ∈ A with |s| ≥ p admits a decomposition s = uvxyz satisfying:

                                                    1. u v^i x y^i z ∈ A for all i ≥ 0,
                                                    2. vy ≠ ε, and
                                                    3. |vxy| ≤ p.

                                                    Proof: pick a CFG for A, convert it to a CNF grammar g with a finite nonterminal type via ChomskyNormalForm.cfg_to_cnf_fintype, and apply cnf_pumping_core with p = 2^(|V|+1).