Postgres Professional
Postgres Pro DBMS developer

The future of AI: formal grammars

Original author: https://habr.com/ru/users/Safreliy/

Human language is a mechanism that narrows the infinite variability of possible sounds and their combinations into a strict communication system.

  1. Phonemes limit the combinations of sounds. In Russian, for example, there are only 42 of them.

  2. Words constrain combinations of phonemes and map our world into a discrete set of concepts — this gives rise to semantics.

  3. Sentences, in turn, constrain combinations of words, forming structures for describing phenomena in the world we perceive.

All of these constraints form the essence of language: its syntax and semantics.

However, our language still remains overly free, and large language models (LLMs) easily adopt this freedom. LLMs demonstrate impressive flexibility in text generation, but their fundamental weakness lies in uncontrolled variability. Theoretically, the space of possible generations grows exponentially: if the length is M and the vocabulary has N tokens, the number of options is N^M. In practice, probability distributions and methods like top-p/top-k sampling filter out unlikely variants, but even then LLMs remain prone to chaotic deviations — hallucinations, contradictions, and unstable formats.

One way to tame this variability is by introducing formal grammars that impose strict generation rules. Grammars drastically reduce the space of possible tokens, but in return provide predictability and structure. This isn’t just restriction — it’s a shift from chaotic generation to controlled synthesis, where the model operates within strict syntactic rules. As a result, more attention is given to semantics — crucial for solving applied problems.

Formal Grammars

A formal grammar is a system of rules that defines which sequences of symbols belong to a language and which do not. It is specified in terms of:

  • Terminals (basic symbols from which strings are built):

    • Example for arithmetic expressions: "0", "1", ..., "+", "-", "*", "(", ")", "/".

    • Example for DNA grammar: "ATG", "TTT", "TAA" and other nucleotide triplets (64 in total).

  • Nonterminals (abstract syntactic categories):

    • Arithmetic: Expression, Number.

    • DNA: Gene, Codon, Promoter, START, STOP.

  • Production rules (define how nonterminals expand into terminals and other nonterminals):

    Expression ::= Term (("+" | "-") Term)*
    Term       ::= Factor (("*" | "/") Factor)*
    Factor     ::= NUMBER | "(" Expression ")"
    NUMBER     ::= [0-9]+
    

This rule format is called GBNF — a regex-enhanced extension of EBNF, which itself extends BNF (Backus–Naur Form). While there’s a zoo of notations, they all boil down to simple substitution rules of the form A → B.

For example, the above rules could be flattened into one using substitutions:

Expression ::= ([0-9]+ | "(" Expression ")") (("*" | "/") ([0-9]+ | "(" Expression ")"))* (("+" | "-") ([0-9]+ | "(" Expression ")") (("*" | "/") ([0-9]+ | "(" Expression ")"))*)*

It’s harder to read, but you start to see the practical role of nonterminals in grammar design.
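To see these production rules in action, here is a minimal recursive-descent parser for the arithmetic grammar above, where each function mirrors one nonterminal (an illustrative sketch, not tied to any particular library):

```python
import re

def tokenize(s):
    # The terminals of the grammar: NUMBER, operators, parentheses
    return re.findall(r"\d+|[+\-*/()]", s)

def parse_expression(tokens, i):
    # Expression ::= Term (("+" | "-") Term)*
    value, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in "+-":
        op, (rhs, i) = tokens[i], parse_term(tokens, i + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, i

def parse_term(tokens, i):
    # Term ::= Factor (("*" | "/") Factor)*
    value, i = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i] in "*/":
        op, (rhs, i) = tokens[i], parse_factor(tokens, i + 1)
        value = value * rhs if op == "*" else value / rhs
    return value, i

def parse_factor(tokens, i):
    # Factor ::= NUMBER | "(" Expression ")"
    if tokens[i] == "(":
        value, i = parse_expression(tokens, i + 1)
        return value, i + 1  # skip the closing ")"
    return int(tokens[i]), i + 1

def evaluate(s):
    result, _ = parse_expression(tokenize(s), 0)
    return result
```

Note how operator precedence falls out of the rule structure itself: Term binds tighter than Expression, so `2+3*(4-1)` evaluates to 11 without any explicit precedence table.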

In general, grammars can differ fundamentally. Some are defined using simple regular expressions — for example, the set of all valid email addresses. Others require a stack and recursive processing, such as for validating properly nested parentheses. There are grammars where production rules depend on context, and different rules apply in different contexts. In some cases, production rules may have virtually no constraints at all.

These distinct cases form fundamentally different categories and were formally classified into separate types in 1956 by Noam Chomsky in what became known as the Chomsky hierarchy.

Each type differs in terms of parsing complexity, automaton used, and rule formats.

Searls, D. B. (2002). The language of genes. Nature

Regular Grammars (Type 3)

Widely used for simple pattern matching, such as email validation:

^[a-zA-Z0-9._]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)+$

Let’s define the same grammar using the previously introduced GBNF syntax, employing terminals, non-terminals, and production rules:

email      ::= local_part "@" domain
local_part ::= (letter | digit | "." | "_")+
domain     ::= label "." domain | label
label      ::= (letter | digit)+
letter     ::= [a-zA-Z]
digit      ::= [0-9]

Here is a step-by-step derivation of the string user.123@somemail.com:

email
→ local_part "@" domain
→ "u" local_part "@" domain
→ ... → "user.123" "@" domain
→ "user.123@" label "." domain
→ "user.123@somemail." domain
→ "user.123@somemail." label
→ "user.123@somemail.com"
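Type-3 languages are exactly those recognizable by finite automata, which is why the email pattern can be checked with a plain regular-expression engine, no stack required (illustrative):

```python
import re

# The Type-3 email grammar, collapsed back into the equivalent regex
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)+$")

def is_email(s: str) -> bool:
    # One left-to-right pass of a finite automaton suffices
    return EMAIL_RE.fullmatch(s) is not None
```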

Context-Free Grammars (Type 2)

Can you define SQL using a context-free grammar? For a basic query:

query       ::= "SELECT" columns "FROM" entity_name "WHERE" condition
columns     ::= entity_name (", " entity_name)*
condition   ::= entity_name operator value
             | condition ("AND" | "OR") condition 
             | "(" condition ")"
operator    ::= "=" | ">" | "<" | "!="
value       ::= string | number
entity_name ::= [a-zA-Z0-9_]+
string      ::= "\"" [a-zA-Z0-9_ ]* "\""
number      ::= [0-9]+

Such a grammar can no longer be implemented using regular expressions, because it involves nested structures and recursion, which require the use of a stack.
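The nested condition rule allows arbitrarily deep parentheses, and checking them requires counting, which is precisely what a stack provides; a minimal illustration in Python:

```python
def balanced(s: str) -> bool:
    """Check that parentheses in a condition are properly nested.

    A finite automaton (regex) cannot track unbounded nesting depth;
    a counter acting as a stack of "(" symbols can.
    """
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing bracket with no opener
                return False
    return depth == 0              # every opener must be closed
```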

Let's break down how the derivation (expansion) proceeds for the expression SELECT name, age FROM users WHERE age > 25 AND name = "Андрей":

query
→ 'SELECT' columns 'FROM' entity_name 'WHERE' condition
→ 'SELECT' entity_name ',' entity_name 'FROM' 'users' 'WHERE' condition
→ 'SELECT' 'name' ',' 'age' 'FROM' 'users' 'WHERE' condition 'AND' condition
→ 'SELECT name, age FROM users WHERE' entity_name operator value 'AND' entity_name operator value
→ 'SELECT name, age FROM users WHERE' 'age' '>' number 'AND' 'name' '=' string
→ 'SELECT name, age FROM users WHERE age > 25 AND name = "Андрей"'

Emoji language example:

story     ::= event+
event     ::= subject (verb (object | subject)+)+
subject   ::= "🧑" | "🐕"
verb      ::= "🏃" | "🍽️" | "❤️" | "🔍" | "🫴" | "🎶"
object    ::= "🍖" | "🏠" | "🌕" | "🌳" | "🚀"

Derivation for the emoji sequence "🧑🏃 🏠 🫴 🍖 🐕 🐕 🍽️ 🍖 ❤️ 🧑" ("a person ran home and gave food to a dog, the dog ate the food and loves the person"):

story
→ event event
→ subject verb object verb object subject event
→ '🧑' verb object verb object subject event
→ '🧑 🏃' object verb object subject event
→ '🧑 🏃 🏠' verb object subject event
...
→ '🧑 🏃 🏠 🫴 🍖 🐕' event
→ '🧑 🏃 🏠 🫴 🍖 🐕' subject verb object verb subject
→ '🧑 🏃 🏠 🫴 🍖 🐕 🐕' verb object verb subject
...
→ '🧑 🏃 🏠 🫴 🍖 🐕 🐕 🍽️ 🍖 ❤️ 🧑'

Syntactically, the resulting language is extremely simple — but thanks to the semantics we embed in each emoji (a terminal in the grammar), we end up with a very compact tool for describing real-world events. However, due to the lack of contextual awareness, the grammar easily allows nonsensical sequences like:

  • "🐕🎶🚀🔍🌕❤️🚀"

  • "🧑🍽️🧑"

  • "🐕🍽️🧑"

  • "🧑❤️🍖🍽️🐕"

The last three are particularly disturbing, and the first one is just plain meaningless — even though all of them technically belong to the defined language.
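All such strings are derivable because the grammar samples productions freely; a few lines of Python make this easy to see (an illustrative sampler, with the event rule simplified here to a single action):

```python
import random

# The emoji grammar as plain Python data: nonterminals map to lists of
# alternatives; any symbol not in the table is a terminal.
GRAMMAR = {
    "event":   [["subject", "verb", "object"], ["subject", "verb", "subject"]],
    "subject": [["🧑"], ["🐕"]],
    "verb":    [["🏃"], ["🍽️"], ["❤️"], ["🔍"], ["🫴"], ["🎶"]],
    "object":  [["🍖"], ["🏠"], ["🌕"], ["🌳"], ["🚀"]],
}

def generate(symbol: str) -> str:
    if symbol not in GRAMMAR:                 # terminal: emit as-is
        return symbol
    production = random.choice(GRAMMAR[symbol])
    return "".join(generate(s) for s in production)
```

Every sampled string is syntactically valid, but nothing in the grammar stops it from producing "🐕🎶🌕"-style nonsense.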

Of course, it’s possible to explicitly specify which subjects are allowed to perform which actions on which objects or other subjects. But doing so amounts to unpacking all the contextual dependencies — and this causes the grammar to bloat rapidly, since listing every valid combination becomes exponentially more complex.

Context-Sensitive Grammars (Type 1)

In the emoji language, to account for contextual dependencies, we need to turn to the next level in the Chomsky hierarchy — context-sensitive grammars.

Let’s revisit the previous grammar, this time introducing a few restrictions:

story    ::= event+
event    ::= subject (verb (object | subject)+)+
subject  ::= "🧑" | "🐕"
verb     ::= "🏃" | "🍽️" | "❤️" | "🔍" | "🫴" | "🎶"
object   ::= "🍖" | "🏠" | "🌕" | "🌳" | "🚀"

# 1. Restrictions for dogs [a dog can only perform certain actions]
"🐕" verb ::= "🐕" dog_verb
dog_verb ::= "🏃" | "🍽️" | "❤️" | "🔍"

# 2. Eating living creatures is forbidden
verb "🧑" ::= accepted_actions "🧑"
verb "🐕" ::= accepted_actions "🐕"
accepted_actions ::= "🏃" | "❤️" | "🔍" | "🫴" | "🎶"

# 3. Restriction on consuming inedible objects
"🧑" "🍽️" object ::= "🧑" "🍽️" edible
"🐕" "🍽️" object ::= "🐕" "🍽️" edible
edible ::= "🍖"

Now 🧑❤️🐕 is allowed, while 🧑🍽️🧑 is not. The strings of our language become more meaningful and morally coherent, although this came at the cost of moving to a more complex class of grammars. But that is exactly what context sensitivity is about: a property that thoroughly pervades our natural human language.
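The three contextual restrictions can be mirrored in a post-hoc validator (an illustrative sketch, separate from the grammar formalism itself):

```python
SUBJECTS  = {"🧑", "🐕"}
VERBS     = {"🏃", "🍽️", "❤️", "🔍", "🫴", "🎶"}
DOG_VERBS = {"🏃", "🍽️", "❤️", "🔍"}          # rule 1
ACCEPTED  = {"🏃", "❤️", "🔍", "🫴", "🎶"}     # rule 2: allowed before a creature
EDIBLE    = {"🍖"}                             # rule 3

def valid(tokens: list) -> bool:
    # Inspect every adjacent pair of emoji in the string
    for left, right in zip(tokens, tokens[1:]):
        if left == "🐕" and right in VERBS and right not in DOG_VERBS:
            return False  # dogs cannot 🫴 or 🎶
        if left in VERBS and right in SUBJECTS and left not in ACCEPTED:
            return False  # eating living creatures is forbidden
        if left == "🍽️" and right not in EDIBLE:
            return False  # only 🍖 may follow the eating verb
    return True
```

All four problematic strings above are rejected by this check, while 🧑❤️🐕 passes.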

Unrestricted Grammars (Type 0)

The final type to consider is unrestricted grammars, where production rules can be practically anything. It’s hard to come up with something meaningful and directly applicable for our use cases here — so let’s just get creative:

💥⭐️ ::= 🌫️🌫️🌫️ ::= 🌍                     # A supernova explosion led to the formation of gas-dust clouds, which resulted in the creation of planet Earth (note that terminal sequences can also shorten in type 0)

person_subject ::= 👩 | 👨

person_subject💣🏦🫳💰 ::= ⛓️person_subject⛓️  # If someone robs a bank, punishment for them is inevitable

person_subject🚬 ::= ☠️                      # Clearly, smoking leads to nothing good

And so on — the principle is clear.

How to make an LLM speak the desired language

To make an LLM stay within the bounds of a grammar defined in GBNF, there is a tool called XGrammar. Its key feature is that it splits all tokens into context-independent ones (whose masks are precomputed) and context-dependent ones (checked at runtime), which eliminates around 99% of per-token checks. Conveniently, XGrammar is supported by several popular LLM engines such as vLLM and TensorRT-LLM.

In our work, we use vLLM with the ChatOpenAI interface — this allows us to easily specify the grammar in the guided_grammar field as follows:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="Qwen/Qwen3-32B-AWQ",
    max_completion_tokens=10000,
    temperature=0.6,
    base_url="base_url",
    api_key="api_key",
    extra_body={
        "guided_grammar": grammar,
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)

Creating grammars for practical tasks

Example with JSON grammar for three strings in an array

This grammar is used at the first stage of our RAG Fusion chatbot to generate three variants of reformulated user queries — this is necessary for more accurate retrieval of relevant documents and to take chat history into account.

root         ::= "[" string_value "," string_value "," string_value "\n]"
string_value ::= "\n  " "\"" [^\"\n]* "\""

For example, for the query "How to delete all rows from a table?" the following variants will be generated:

[
  "How to delete all rows from a table using SQL?",
  "How to completely clear a table by deleting all rows?",
  "What is the best way to remove all records from a table?"
]
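Conformance of model output to this grammar can be approximated with a quick Python check (the json module is used here as a close stand-in for the string_value rule):

```python
import json

def matches_grammar(text: str) -> bool:
    # root ::= "[" string_value "," string_value "," string_value "\n]"
    # i.e. exactly three strings, none containing a quote or newline
    try:
        data = json.loads(text)
    except ValueError:
        return False
    return (isinstance(data, list) and len(data) == 3
            and all(isinstance(s, str) and '"' not in s and "\n" not in s
                    for s in data))
```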

Example with grammar for a data model migration language

One of our projects is a generator of data models and application specifications based on user dialogue. This system was originally designed to be transactional — meaning any changes based on new user data are made using a predefined set of migrations: add_property, remove_property, rename_property, add_enum_value, and others. Currently, there are 18 such operations, and each migration operation has its own description format in YAML:

root            ::= (think_block "\n") "```yaml\n" operations "\n```"

# Thinking block
think_block     ::= "<think>\nOk," allowed_char "\n</think>"
allowed_char    ::= [\na-zA-Zа-яА-ЯёЁ0-9!@#$%^&*()_+{}[\]\\|;:'",./? \-]*

# Operations
operations      ::= "operations:" ("\n" operation_item)+
operation_item  ::= "  - operation: " operation
operation       ::= add_property | remove_property | rename_property | add_enum_value
                  | remove_enum_value | replace_enum | modify_ref | add_definition
                  | remove_definition | rename_definition | remove_user_stories
                  | add_user_stories | update_description | add_ui_page | remove_ui_page
                  | change_ui_page | add_endpoint | remove_endpoint

# Common rules
entity          ::= [a-zA-Z_]+
key             ::= [a-zA-Z_]+
string_value    ::= "\"" [^\"\n]* "\""
integer         ::= [0-9]+
boolean         ::= "true" | "false"

# Types
types           ::= exact_numeric | floating_point | integer_type
                  | string_type | temporal_type | geometric_type
                  | network_type | json_type | binary_type
                  | boolean_type | uuid_type | monetary_type
                  | bit_type | text_search_type | serial_type

# Numeric Types
exact_numeric   ::= "numeric" "(" integer "," integer ")"
floating_point  ::= "real" | "double precision"
monetary_type   ::= "money"

# Integer Types
integer_type    ::= "smallint" | "integer" | "bigint"
serial_type     ::= "smallserial" | "serial" | "bigserial"

# ...

# Definition Operations
add_definition       ::= "add_definition" entity_spec "\n    properties:" (property_def)+
property_def         ::= "\n      " key ":" (prop_schema_type | prop_schema_ref | prop_schema_enum | prop_schema_elements) prop_pk? prop_optional?
prop_schema_type     ::= "\n        type: " types
prop_schema_elements ::= "\n        elements: " (prop_schema_elements_ref | prop_schema_elements_type)
prop_schema_ref      ::= "\n        ref: " entity
prop_schema_enum     ::= "\n        enum:" ("\n          - " string_value)+
prop_optional        ::= "\n        optional: " boolean
prop_pk              ::= "\n        primary_key: " boolean

# ...

Here is a small part of the migrations grammar with basic nonterminals, an example of specifying data types, and the description of the operation for adding a new entity to the data model (add_definition).

This is what a migration in this language might look like for the request "Generate a data model for a horror book selling application":

Example of a migration to create a data model

<think>

...

</think>

```yaml
operations:
  - operation: add_definition
    entity: author
    properties:
      id:
        type: serial
        primary_key: true
      name:
        type: text
  - operation: add_definition
    entity: book
    properties:
      id:
        type: serial
        primary_key: true
      title:
        type: text
      price:
        type: money
      author:
        ref: author
  - operation: update_description
    text: "A horror book selling application where users can browse and purchase horror-themed books."
```

Here is how the modification would look with the clarification "A book can have multiple authors":

<think>

...

</think>

```yaml
operations:
  - operation: remove_property
    entity: book
    key: author
  - operation: add_property
    entity: book
    key: authors
    schema:
      elements:
        ref: author
  - operation: update_description
    text: "A horror book selling application where users can browse, purchase, and manage horror-themed books. Books can have multiple authors. Admins can manage the catalog and orders."
```

Exactly — instead of regenerating the entire schema from scratch, we apply any changes incrementally through a strictly defined migration grammar, ensuring the LLM cannot produce output outside those constraints.

Example of PostgreSQL grammar in GBNF, derived from Bison+Flex

To enable LLMs to generate syntactically correct SQL queries for different DBMS versions, we adopted the approach of converting PostgreSQL’s LALR grammar (originally written in Bison+Flex) into an LL grammar format (GBNF). This transformation is quite non-trivial.

root ::= parse_toplevel
ws ::= [ \t\n\r]*

parse_toplevel ::= toplevel_stmt ws ( ";" ws toplevel_stmt )*

toplevel_stmt ::= stmt
                | TransactionStmtLegacy

stmt ::= ( AlterEventTrigStmt | AlterCollationStmt | AlterDatabaseStmt | AlterDatabaseSetStmt | AlterDefaultPrivilegesStmt | AlterDomainStmt | AlterEnumStmt | AlterExtensionStmt | AlterExtensionContentsStmt | AlterFdwStmt | AlterForeignServerStmt | AlterFunctionStmt | AlterGroupStmt | AlterObjectDependsStmt | AlterObjectSchemaStmt | AlterOwnerStmt | AlterOperatorStmt | AlterTypeStmt | AlterPolicyStmt | AlterSeqStmt | AlterSystemStmt | AlterTableStmt | AlterTblSpcStmt | AlterCompositeTypeStmt | AlterPublicationStmt | AlterRoleSetStmt | AlterRoleStmt | AlterSubscriptionStmt | AlterStatsStmt | AlterTSConfigurationStmt | AlterTSDictionaryStmt | AlterUserMappingStmt | AnalyzeStmt | CallStmt | CheckPointStmt | ClosePortalStmt | ClusterStmt | CommentStmt | ConstraintsSetStmt | CopyStmt | CreateAmStmt | CreateAsStmt | CreateAssertionStmt | CreateCastStmt | CreateConversionStmt | CreateDomainStmt | CreateExtensionStmt | CreateFdwStmt | CreateForeignServerStmt | CreateForeignTableStmt | CreateFunctionStmt | CreateGroupStmt | CreateMatViewStmt | CreateOpClassStmt | CreateOpFamilyStmt | CreatePublicationStmt | AlterOpFamilyStmt | CreatePolicyStmt | CreatePLangStmt | CreateSchemaStmt | CreateSeqStmt | CreateStmt | CreateSubscriptionStmt | CreateStatsStmt | CreateTableSpaceStmt | CreateTransformStmt | CreateTrigStmt | CreateEventTrigStmt | CreateRoleStmt | CreateUserStmt | CreateUserMappingStmt | CreatedbStmt | DeallocateStmt | DeclareCursorStmt | DefineStmt | DeleteStmt | DiscardStmt | DoStmt | DropCastStmt | DropOpClassStmt | DropOpFamilyStmt | DropOwnedStmt | DropStmt | DropSubscriptionStmt | DropTableSpaceStmt | DropTransformStmt | DropRoleStmt | DropUserMappingStmt | DropdbStmt | ExecuteStmt | ExplainStmt | FetchStmt | GrantStmt | GrantRoleStmt | ImportForeignSchemaStmt | IndexStmt | InsertStmt | ListenStmt | RefreshMatViewStmt | LoadStmt | LockStmt | MergeStmt | NotifyStmt | PrepareStmt | ReassignOwnedStmt | ReindexStmt | RemoveAggrStmt | RemoveFuncStmt | RemoveOperStmt | 
RenameStmt | RevokeStmt | RevokeRoleStmt | RuleStmt | SecLabelStmt | SelectStmt | TransactionStmt | TruncateStmt | UnlistenStmt | UpdateStmt | VacuumStmt | VariableResetStmt | VariableSetStmt | VariableShowStmt | ViewStmt )?

opt_single_name ::= ColId?
opt_qualified_name ::= any_name?
opt_concurrently ::= "CONCURRENTLY"?
opt_drop_behavior ::= ( "CASCADE" | "RESTRICT" )?

CallStmt ::= "CALL" ws func_application

CreateRoleStmt ::= "CREATE" ws "ROLE" ws RoleId ws opt_with ws OptRoleList
opt_with ::= "WITH"?
OptRoleList ::= CreateOptRoleElem*

AlterOptRoleElem ::= "PASSWORD" ws ( Sconst | "NULL" )
                   | ( ( "ENCRYPTED" | "UNENCRYPTED" ) ws "PASSWORD" | "VALID" ws "UNTIL" ) ws Sconst
                   | "INHERIT"
                   | "CONNECTION" ws "LIMIT" ws SignedIconst
                   | "USER" ws role_list
                   | IDENT
CreateOptRoleElem ::= AlterOptRoleElem
                    | "SYSID" ws Iconst
                    | ( "ADMIN" | "ROLE" | "IN" ws ( "ROLE" | "GROUP" ) ) ws role_list

CreateUserStmt ::= "CREATE" ws "USER" ws RoleId ws opt_with ws OptRoleList

 

# ...

 

ICONST ::= [0-9] ([0-9]){0,8} |
           "1" ([0-9]){9} |
           "20" ([0-9]){8} |
           "21" [0-3]([0-9]){7} |
           "214" [0-6]([0-9]){6} |
           "2147" [0-3]([0-9]){5} |
           "21474" [0-7]([0-9]){4} |
           "214748" [0-2]([0-9]){3} |
           "2147483" [0-5]([0-9]){2} |
           "21474836" [0-3]([0-9]){1} |
           "214748364" [0-7]
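Each alternative pins down one more leading digit so that only values up to 2147483647 (the 32-bit signed maximum) are derivable; transcribed into a Python regex, the boundary is easy to verify (a throwaway check, not part of the conversion toolchain):

```python
import re

# Direct transcription of the ICONST alternatives above
ICONST = re.compile(
    r"[0-9][0-9]{0,8}"
    r"|1[0-9]{9}|20[0-9]{8}|21[0-3][0-9]{7}|214[0-6][0-9]{6}"
    r"|2147[0-3][0-9]{5}|21474[0-7][0-9]{4}|214748[0-2][0-9]{3}"
    r"|2147483[0-5][0-9]{2}|21474836[0-3][0-9]|214748364[0-7]"
)

def is_iconst(s: str) -> bool:
    return ICONST.fullmatch(s) is not None
```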

Features of converting Bison → GBNF

The GBNF format first appeared as part of the llama.cpp ecosystem around 2023. Being a relatively new grammar format, it lacks tools that allow direct conversion from Bison to GBNF. The closest similar format that provides an opportunity to translate Bison grammars is EBNF.

Broadly speaking, GBNF differs from the W3C EBNF dialect in that whitespace must be described manually, comment syntax follows the parser style (i.e., as in Bison), and string literals use double quotes instead of single ones.

There are also differences in the very nature of parsers and grammar formats: in Bison, code fragments are part of the grammar, whereas GBNF/EBNF are purely declarative description languages. Therefore, some specific cases in Bison either cannot be represented in GBNF or must be specified manually.

Moreover, it should be considered that the grammar is originally described not only in Bison itself but also partially in the lexer file (Flex). When converting Bison → GBNF, lexer constants must be handled manually because there is currently no tool for automatic extraction.

A particular challenge is handling left-recursive rules. Since the parser in XGrammar belongs to the LL class, all left-recursive rules cause infinite loops. To resolve left recursion, the following transformation can be applied:

Suppose there is

(1) A ::= A a1 a2 a3 ... | b | c

where a1, a2, a3, ..., b, and c are rules free of A. Then

(2) A  ::= (b | c) A'?
    A' ::= a1 a2 a3 ... A'?

Since A in the new definition necessarily starts with b or c, there is no left recursion, while right recursion (A' ::= a1 ... A') does not cause loops.

Let’s show the equivalence of (1) and (2).

In (2), every derived string starts with b or c. Let us show by contradiction that the same holds for (1).
Suppose A ::= A a1 a2 a3 ... | b | c and the first applied alternative is not b or c, i.e., A ::= A a1 a2 a3 .... Substituting A a1 a2 a3 ... for A again gives:

A ::= A a1 a2 a3 ... a1 a2 a3 ...

This substitution loops forever, so any finite derivation must eventually replace A with b or c. Then we have:

A ::= (b | c) (a1 a2 a3 ...)*

Or

A  ::= (b | c) A'?
A' ::= a1 a2 a3 ... A'?

That is, we transformed expression (1) into (2).
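The equivalence can also be sanity-checked mechanically by enumerating all short strings derivable from both forms of a toy rule A ::= A "a" | "b" (illustrative):

```python
def left_recursive(n: int) -> set:
    # A ::= A "a" | "b": start from "b" and append "a" up to n times
    strings = {"b"}
    for _ in range(n):
        strings |= {s + "a" for s in strings}
    return strings

def right_factored(n: int) -> set:
    # A ::= "b" A'?   A' ::= "a" A'?  -- "b" followed by 0..n copies of "a"
    return {"b" + "a" * k for k in range(n + 1)}
```

Both forms produce exactly the strings b, ba, baa, ..., confirming the transformation preserves the language.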

Dynamic grammars

The idea behind dynamic grammars is very simple, yet very effective. Suppose the correctness of generated code depends on the environment in which it is executed: for example, the same SQL may be valid in one database but not in another. Instead of using one grammar for all cases, we create a template in which the relevant nonterminals are redefined for each environment. This introduces a kind of context dependency while staying within context-free grammars. During generation, the LLM simply cannot reference non-existent columns or tables, because the dynamically constrained grammar never offers it the corresponding tokens.
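A minimal sketch of the idea, assuming a hypothetical SELECT-only template whose table and column nonterminals are filled in from the live schema:

```python
# Hypothetical sketch: specialize a GBNF template to one database's schema,
# so that generation can only ever use tables and columns that actually exist.
GRAMMAR_TEMPLATE = """\
query  ::= "SELECT " column " FROM " table
table  ::= {tables}
column ::= {columns}
"""

def build_grammar(schema: dict) -> str:
    # schema: {table_name: [column, ...]} taken from the target database
    tables = " | ".join(f'"{t}"' for t in schema)
    columns = " | ".join(f'"{c}"' for cols in schema.values() for c in cols)
    return GRAMMAR_TEMPLATE.format(tables=tables, columns=columns)
```

The resulting string can then be passed as guided_grammar exactly as in the vLLM example above, yielding a different effective language for each database.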
