Is explicit parsing still necessary if we use a large language model?

Often no for chat, classification, and general retrieval, where modern models handle everyday syntax well enough. It is still worth keeping for tasks that demand interpretability, forensic auditability, low-resource languages, or strict structural correctness, such as legal review, medical coding, authorship verification, and compliance extraction. Treat it as a design choice, not a default.

What is the difference between dependency parsing and constituency parsing?

Dependency parsing maps direct head-modifier relationships between individual words and dominates modern NLP pipelines because it aligns cleanly with extraction tasks. Constituency parsing produces nested phrase structures like noun phrases and verb phrases and remains useful when phrase boundaries are the working unit. Most production teams default to dependency parsing unless the downstream task says otherwise.

Why do LLMs sometimes fail on sentences that look simple?

Research from the Linzen group shows that LLM syntactic knowledge is shallower and more entangled with lexical cues than benchmarks suggest. Small prompt rephrasings, unusual punctuation, or controlled minimal pairs can shift the output even when the surface task looks easy. The model often relies on statistical templates rather than abstract structural rules.

When should we use structured outputs instead of free text?

Whenever the model's output drives a workflow, an integration, a payment, a record update, or any decision a human would be accountable for. Force a JSON schema or grammar, validate it before acting, and route ambiguous cases to a human. Free text is fine for explanation and conversation, not for control.
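As a minimal sketch of "validate it before acting", the check below parses a model reply as JSON and only lets it through when every expected field is present and well typed. The field names, the allowed actions, and the `needs_clarification` flag are hypothetical and chosen for illustration; a real deployment would more likely use a schema library or the provider's structured-output feature.

```python
import json
from typing import Optional

# Hypothetical schema for a workflow action; every name here is illustrative.
REQUIRED_FIELDS = {
    "action": str,
    "recipient": str,
    "due_date": str,
    "needs_clarification": bool,
}
ALLOWED_ACTIONS = {"send_document", "create_reminder"}


def route_model_output(raw: str) -> tuple[str, Optional[dict]]:
    """Return ("execute", payload) or ("human_review", payload_or_None)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return "human_review", None          # free text or broken JSON
    if not isinstance(payload, dict):
        return "human_review", None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected_type):
            return "human_review", payload   # missing or mistyped field
    if payload["action"] not in ALLOWED_ACTIONS:
        return "human_review", payload       # action the workflow does not know
    if payload["needs_clarification"]:
        return "human_review", payload       # the model flagged ambiguity
    return "execute", payload
```

Only the `"execute"` branch should ever reach code that changes a record or sends a message; everything else goes to a person.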

What tools are commonly used for syntactic analysis in Python?

spaCy is the most common starting point for out-of-the-box dependency parsing. Stanza, NLTK, and Stanford CoreNLP cover more advanced or research-grade work. For custom needs, teams train transition-based or graph-based parsers on top of transformer encoders, usually evaluated with Labeled and Unlabeled Attachment Score.
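For readers who have not met the two metrics, here is a small sketch of how they are computed. It assumes each token is represented as a `(head, label)` pair with heads numbered from 1 and 0 marking the root, which mirrors the CoNLL-U convention; the example parses are hand-written for illustration and are not the output of any parser.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two parses of the same sentence.

    Each parse is a list of (head_index, label) pairs, one per token.
    UAS counts tokens whose head is correct; LAS also requires the label.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("parses must be non-empty and the same length")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), labeled_hits / len(gold)


# "Maya sent the proposal", hand-annotated.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
# A made-up prediction with one wrong head ("the") and one wrong label ("proposal").
predicted = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "nmod")]

uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 0.75 0.5
```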


Syntactic Analysis in AI: A Builder’s Guide

May 27, 2026 by Asghar Mirzaie

“Remind me to put in a call to the bank and get a dentist appointment for tomorrow,” a user might type. Your app is left with one muddled task rather than two. Nothing crashed and the model did not refuse; it simply failed to parse the sentence the way you would have.

You could call this the practical side of syntactic analysis in AI. At its core, grammar is structure, and structure is where intent is found. When a product cannot discern who is doing what to whom, or where one action stops and another starts, users notice within their first week. We have written this piece for the operators and product teams who have to decide how much grammatical rigour their AI requires and how to fit it into a modern LLM stack.

### Why Grammar Still Decides Product Behavior

A prototype will look fine in a demo under controlled conditions. But then you let real users in and they start putting together long, half-edited, compound sentences. They mix requests and omit context in ways your team never saw coming.

Take an assistant and feed it: “Send the proposal to Maya once legal has had their say on it, and if she doesn’t reply, remind me on Friday.” In one go you have a person, a dependency, a conditional, timing and a fallback. Let the system treat that as a bag of keywords and it will make a confident error. You see it everywhere:

* Task tools falter at multiple actions in a single line.
* Support bots are stumped when a user lays out a problem, cause and desired outcome in one paragraph.
* Enterprise search chokes on date or location constraints in a lengthy query.
* Document tools fail when meaning hinges on the actors involved.

The rule is straightforward: if the feature cares about sentence structure, so does the product. The hard part is knowing where to put the work these days; the answer has changed twice over the past five years.
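To make the Maya example concrete, the sketch below writes out by hand the structure a system would need to recover from that sentence. The `Step` class and its field names are hypothetical; the point is only that the person, dependency, conditional, timing and fallback each get a slot that a bag of keywords cannot provide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str
    target: Optional[str] = None      # who the action is directed at
    after: Optional[str] = None       # dependency that must complete first
    condition: Optional[str] = None   # only run this step if the condition holds
    when: Optional[str] = None        # timing constraint, kept as text here
    fallback: Optional["Step"] = None


# Hand-written target structure for:
# "Send the proposal to Maya once legal has had their say on it,
#  and if she doesn't reply, remind me on Friday."
plan = Step(
    action="send_proposal",
    target="Maya",
    after="legal review complete",
    fallback=Step(
        action="remind",
        target="user",
        condition="Maya has not replied",
        when="Friday",
    ),
)


def actions(step: Optional[Step]) -> list[str]:
    """List action names in order, following the fallback chain."""
    names = []
    while step is not None:
        names.append(step.action)
        step = step.fallback
    return names


print(actions(plan))  # ['send_proposal', 'remind']
```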

### What Syntactic Analysis Actually Means

Put simply, it is the language processing that determines how a sentence is put together, not merely what words are in it. On paper it means turning raw text into a parse tree for the benefit of downstream tasks, with tokenization and part-of-speech tagging below and dependency or constituency parsing above.

Most toolchains will default to dependency parsing for its compactness and clean alignment with extraction. Constituency trees are still called for when phrase boundaries are the unit of work. It is a practical choice.
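The difference is easier to see side by side. The sketch below encodes both analyses of one short sentence by hand (they are illustrative annotations, not parser output) and shows the kind of question each answers most directly. Heads are numbered from 1 with 0 for the root, an assumption borrowed from the CoNLL-U convention.

```python
TOKENS = ["Maya", "sent", "the", "proposal"]

# Dependency analysis: one (head, label) pair per token.
DEPENDENCIES = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]

# Constituency analysis: nested (label, child, ...) tuples; leaves are words.
CONSTITUENCY = ("S", ("NP", "Maya"), ("VP", "sent", ("NP", "the", "proposal")))


def subject_verb_object(tokens, deps):
    """Read who-did-what-to-whom straight off the dependency arcs."""
    root_id = next(i for i, (head, _) in enumerate(deps, start=1) if head == 0)

    def dependent(label):
        return next(
            (tokens[i - 1] for i, (head, lab) in enumerate(deps, start=1)
             if head == root_id and lab == label),
            None,
        )

    return dependent("nsubj"), tokens[root_id - 1], dependent("obj")


def phrases(tree, label):
    """Collect the text of every constituent carrying the given label."""
    found = []

    def leaves(node):
        if isinstance(node, str):
            return [node]
        return [word for child in node[1:] for word in leaves(child)]

    def walk(node):
        if isinstance(node, str):
            return
        if node[0] == label:
            found.append(" ".join(leaves(node)))
        for child in node[1:]:
            walk(child)

    walk(tree)
    return found


print(subject_verb_object(TOKENS, DEPENDENCIES))  # ('Maya', 'sent', 'proposal')
print(phrases(CONSTITUENCY, "NP"))                # ['Maya', 'the proposal']
```

The dependency view hands back the head words an extraction task wants; the constituency view hands back whole phrases with their boundaries.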

### From Hand-Coded Grammar to Implicit Competence

To understand today’s tradeoffs you have to look at the history. When Noam Chomsky put out Syntactic Structures in 1957 – a year after the Dartmouth conference that gave us AI – he made syntax something a computer could formally analyse. See a brief history of NLP.

Engineers wrote explicit rules for early systems. That was controllable but brittle; a real user will find a dozen ways to express the same thing and your rule set will not hold up. Statistical methods came along and learned patterns from large corpora, which was more flexible if harder to reason about. And now we have transformers. OpenAI’s GPT-3 in 2020 had 175 billion parameters, and by 2024 Google’s Gemini 1.5 was showing off a million-token context window, per IBM’s account of AI history. Most of the syntactic legwork has been absorbed by the model.

The blogosphere will tell you to just prompt it since the model already knows grammar. The research says otherwise.

### What LLMs Get Right About Syntax, and What They Quietly Get Wrong

Large models capture plenty of grammatical regularity. They do not need help with subject-verb agreement or surface ambiguity in a chat feature. But they have blind spots that benchmarks don’t show.

The Linzen group at NYU has been forthright on the matter. Arehalli and Linzen (2024) have shown neural networks can pass a surface test and then flunk a minimal-pair exercise a human would think nothing of. Mueller et al. (2024) discovered that piling on in-context examples of a pattern can actually impair generalisation. And Huang and his colleagues (2024) put LLMs through a large-scale benchmark and found no sign that surprisal accounts for the difficulty humans have with garden-path sentences. In some cases an older, smaller model was closer to the human signature.

It boils down to this:

* You can probe for syntax in the embeddings but the model won’t necessarily use it.
* Few-shot prompting is an unstable way to handle grammatical patterns. A slight rephrasing of the prompt and you have broken the behaviour.
* Do not trust English benchmarks to tell you the whole story on robustness. You will find that morphologically rich, free-word-order languages break in stranger ways.
* Don’t mistake adversarial inputs, nested conditionals or odd punctuation for an edge case; they are a genuine safety surface.
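A minimal-pair test is simple to set up, which is part of why it is worth running on your own inputs. The sketch below assumes a `score` function that returns a higher number for sentences the system finds more acceptable (with a real model this would typically be a log-probability). The `toy_score` shown here is a deliberately shallow stand-in, not a language model: it only rewards word pairs it has "seen", so it falls for the nearby singular noun in the first pair.

```python
from typing import Callable

# Each pair is (grammatical, ungrammatical) and differs in a single word.
MINIMAL_PAIRS = [
    ("The keys to the cabinet are on the table.",
     "The keys to the cabinet is on the table."),
    ("The author that the critics praise writes daily.",
     "The author that the critics praise write daily."),
]


def minimal_pair_accuracy(score: Callable[[str], float], pairs) -> float:
    """Fraction of pairs where the grammatical sentence gets the higher score."""
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


# Toy stand-in scorer: counts adjacent word pairs from a made-up "familiar" list.
FAMILIAR_BIGRAMS = {"cabinet is", "praise writes"}


def toy_score(sentence: str) -> float:
    words = sentence.lower().rstrip(".").split()
    bigrams = (" ".join(pair) for pair in zip(words, words[1:]))
    return float(sum(bigram in FAMILIAR_BIGRAMS for bigram in bigrams))


print(minimal_pair_accuracy(toy_score, MINIMAL_PAIRS))  # 0.5
```

Swapping `toy_score` for a call into your own model turns this into a regression check you can rerun after every prompt or model change.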

But none of this is to say LLMs can’t handle language features. It is simply a matter of perspective: you should view them as one part of a system, not the whole thing.

### Where Explicit Syntax Still Has the Edge

There is a more understated story to be told here. In work that requires interpretability or a forensic level of auditability, explicit syntactic methods remain competitive if not outright better. Take LambdaG for example. The Manchester group of Andrea Nini put their grammar-based authorship-verification system to the test on twelve datasets of emails, reviews and forum posts. By looking at function-word patterns, punctuation habits and sentence structure, it left the neural baselines behind in most instances.

Of course, you can’t apply that to every problem. But it reveals something. For a high-stakes, narrow task that has to be defended, a grammar-aware system will often stand up better than a fine-tuned LLM. You see this profile in legal review, compliance extraction, medical coding and the like. Even Salesforce Einstein still has explicit linguistic and syntactic analysis in its pipeline. The capability didn’t go away; it was just embedded.

### The Real Production Problem Is Not the Model

Look at the AI features that have a hard time after launch and you will rarely find the model’s grammar to be the culprit. Anthropic’s 2025 guidance on context engineering, which comes from hard-won enterprise experience, points to how context is managed instead. Push a model’s context window past forty percent and instruction-following will take a nosedive. You might try to cram every policy and rule into a mega-prompt, but those tend to collapse under their own weight in a matter of months. Put in a clause to fix one thing and you regress another. Long inputs also suffer from the “lost in the middle” effect, where information buried mid-input is simply ignored.

What works is to put a harness on the model. A good deployment will have:

* A context layer to isolate and compress what is needed for the step at hand.
* Structured outputs via JSON or formal grammars so the downstream code can do its due diligence before acting.
* Deterministic tools for the job. Let the code, not your prose prompt, deal with SQL, math, regex and dates.
* Synthetic evals from old inputs and regression tests for any change to the prompt.
* Logging and anomaly alerts, with a way for the model to fail noisily when it doesn’t know.
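Dates are a good illustration of the "deterministic tools" point. In the sketch below the model would only need to extract the weekday name from a request such as "remind me on Friday"; plain code then resolves it against a reference date. The function name and the policy of always picking a date strictly after today are assumptions made for this example.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def next_weekday(name: str, today: date) -> date:
    """Return the first date strictly after `today` that falls on `name`.

    Raises ValueError for an unrecognised weekday name, so a bad
    extraction fails loudly instead of producing a wrong reminder.
    """
    target = WEEKDAYS.index(name.strip().lower())
    days_ahead = (target - today.weekday() - 1) % 7 + 1  # always 1..7
    return today + timedelta(days=days_ahead)


print(next_weekday("Friday", date(2026, 5, 27)))  # 2026-05-29
```

If the request arrives on a Friday, this policy returns the following Friday; whether that is right for your product is exactly the kind of ambiguity to settle in writing rather than leave to a prompt.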

That kind of discipline is what makes an AI feature viable as opposed to a nice demo. Our AI scalability guide is built on the premise that scaling is an operating issue, not a model one.

### How to Decide What Belongs Where

Asking “do we need a parser or an LLM?” isn’t very useful. The right question is where in the system each layer deserves to be. Most products run on three layers.

**Low-level processing.** Tokenization, normalization and basic POS tagging are still de rigueur. Libraries such as spaCy or Stanza will do it without fuss. Don’t skip it or your metrics will drift.
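To show what this layer buys you, here is a standard-library-only sketch of normalization and tokenization. It is not how spaCy or Stanza work internally; the regular expression and the choice of NFKC plus case folding are assumptions for illustration. The point is that two inputs that look identical to a user should produce identical tokens.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold visually equivalent input into one canonical form."""
    text = unicodedata.normalize("NFKC", text)                  # e.g. no-break space -> space
    text = text.replace("\u2019", "'").replace("\u2018", "'")   # curly -> straight apostrophes
    text = re.sub(r"\s+", " ", text).strip()                    # collapse whitespace runs
    return text.casefold()


def tokenize(text: str) -> list[str]:
    """Split into words (keeping contractions whole) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", normalize(text))


typed = "She doesn't reply."
pasted = "She doesn\u2019t\u00a0reply."   # curly apostrophe and no-break space
print(tokenize(pasted))                    # ['she', "doesn't", 'reply', '.']
print(tokenize(typed) == tokenize(pasted)) # True
```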

**Implicit syntax inside the LLM.** This is the home of most chat features and where chatbots and natural language processing get down to business. Use the model for summarization or paraphrase when the input is clean and the stakes are not too high.

**Explicit structure for reliability.** When the output is going to inform a payment, a policy decision or some other regulated record, you must force a structured output and validate it. Send anything ambiguous to a human. In high-assurance areas you might want to pair the LLM with a grammar-based method, as we saw with the safer copilots in our piece on AI ERP bots.

The product team’s advantage is in knowing where to put these layers and having the engineering to back it up.

### What This Looks Like in Practice

We put this to the test building an AI assistant for project managers. The initial idea was for an assistant to do “everything,” but our blueprint process reined that in. We wanted it to reason about projects by pulling data from Asana, Slack, email and meetings, not to parse arbitrary sentences. The difficulty wasn’t the model; it was the harness – the rules and schema that determined when the assistant could act and when to hold back.

We applied the same rigor to a training chatbot for El Colectivo 506. Solutions journalism has a methodology, and the bot had to tell a reporter, in two languages, where their pitch was breaking the rules. A loose understanding of language wouldn’t cut it; we needed scoped logic and a fallback for when the model was in doubt.

The lesson from both is the same: define the environment and what the system has to do first, then pick your model. Build the harness with as much care as the prompt.

### Questions to Ask Before Building

When you are scoping an AI feature of any kind, there is some ground you should cover early on:

* What sentence patterns do we put in front of the design first? Go with twenty actual user inputs. Not some clean demo, but the messy kind.
* Where is it that structure dictates meaning? You will want to flag conditionals, dependencies, multiple actions and any references that are not clear.
* In version one, which ambiguities are we letting go of? Write them down. Anything you leave undecided will turn up as a bug in production.
* Is this a case for open generation or structured extraction? For anything tied to a workflow, the latter almost without exception.
* How do we handle model uncertainty? There are only three options: ask for clarification, put it before a human or narrow the task.
* Can you show me the eval set? If there isn’t one, you are not in a position to build the project.

Put simply, if your people can’t agree on how the system ought to read twenty real sentences, then the feature is still in discovery, not development.
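An eval set does not need tooling to get started. The sketch below pairs real inputs with the reading the team agreed on (here, just the number of tasks) and reports which inputs the system under test gets wrong. The `EVAL_SET` entries, the `expected_tasks` field and the `naive_extractor` stand-in are all hypothetical; in practice `extract_tasks` would call your actual pipeline.

```python
from typing import Callable

# Real user inputs with the interpretation the team has agreed on.
EVAL_SET = [
    {"input": "Remind me to put in a call to the bank and get a dentist "
              "appointment for tomorrow",
     "expected_tasks": 2},
    {"input": "Email Sam the report", "expected_tasks": 1},
]


def run_evals(extract_tasks: Callable[[str], list], eval_set) -> list[str]:
    """Return the inputs where the extractor disagrees with the agreed reading."""
    return [
        case["input"]
        for case in eval_set
        if len(extract_tasks(case["input"])) != case["expected_tasks"]
    ]


def naive_extractor(text: str) -> list[str]:
    """Stand-in for the system under test: treats every input as one task."""
    return [text]


failures = run_evals(naive_extractor, EVAL_SET)
print(len(failures), "of", len(EVAL_SET), "cases failed")  # 1 of 2 cases failed
```

Rerunning the same set after every prompt or model change is what turns the twenty sentences into a regression test.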

### A Sensible Next Step

The advent of large models did not put an end to syntactic analysis in AI; it just moved it. Most of the surface grammar has been absorbed by the model, and the low-level processing is where it always was, but explicit structure re-emerges at a higher level of the architecture, via harnesses and grammar-aware methods for tasks that have to be defended. The good teams in 2026 are those who view language understanding as a layered system, not a matter of a single prompt.

If you want to get the structure in order before you write code for an AI feature, our AI development services are made for that sort of early call, frequently in tandem with some AI chatbot development. We cover the deeper engineering side in our guide to building an AI model and the more practical scope work in the AI chatbot development guide. Come with the twenty user sentences and the parts of the workflow that are already failing you and we can have a productive talk.

Written by

Asghar Mirzaie

Asghar Mirzaei is a backend developer at Refact, focused on the APIs, integrations, and infrastructure that power the studio’s products. His work spans data pipelines, third-party services, backend architecture, and deployment systems, helping ensure that products are stable, scalable, and ready for real-world use. Asghar works closely with the team to connect product requirements with reliable technical foundations, especially in systems where performance, automation, and integration quality matter. At Refact, he contributes to the engineering work behind the interfaces, making sure the products the studio builds can run smoothly and dependably.
