Tree-Sitter Grammars Explained: Leveraging Data for Clarity

How a Week of Jargon and 25 Languages Resulted in Creating the Parser I Needed


Clarity Engineering


TL;DR: If You're Here Because Tree-sitter's node-types.json Makes No Sense

You're not alone. Tree-sitter's terminology is confusing because it evolved from internal implementation details, not developer clarity.

The Core Problems

  • "Named" doesn't mean "has a name" (everything has a name). It means "corresponds to a named grammar rule"—an internal detail that's noise for most use cases.
  • "Fields" and "children" are both parent-child relationships but the distinction is unclear. Fields are semantic ("this node's condition"), children are positional ("this node's first child").
  • Everything is a "type": Nodes, edges, and abstract categories all use the same terminology, obscuring the differences that matter.

Solution: Call Things What They Are

| Tree-sitter | CodeWeaver | Meaning |
| --- | --- | --- |
| Abstract type (with subtypes) | Category | Grouping that never appears in code (like expression) |
| Named node | Thing | What you actually see in your parse tree |
| Field | Direct Connection | Semantic relationship with a role ("condition", "body") |
| Field name | Role | The semantic purpose ("what is this child for?") |
| Child | Positional Connection | Ordered relationship without semantic meaning |
| extra flag | can_be_anywhere | What it is: a thing that can be anywhere |
| named flag | is_explicit_rule | Says what it actually means |

The Results

After analyzing 5,000+ node types across 25 languages, I rebuilt the interface to be self-documenting. Nobody else loses a week.

Want the code? CodeWeaver on GitHub

Just need a quick answer? Jump to the Translation Guide.

Want the full story? Keep reading.


The Problem: One Week, Gone

I was building CodeWeaver, a semantic[^1] code search tool that needed to understand the structure of 25+ programming languages. Tree-sitter seemed perfect: mature, fast, widely adopted, with grammars for every major language. (Under the hood, CodeWeaver uses ast-grep as its interface with tree-sitter for better typing and abstractions.)

Then I opened my first node-types.json file.[^2]

{
  "type": "if_statement",
  "named": true,
  "fields": {
    "condition": {
      "multiple": false,
      "required": true,
      "types": [{"type": "expression", "named": true}]
    }
  },
  "children": {
    "multiple": true,
    "required": false,
    "types": [{"type": "statement", "named": true}]
  }
}

Simple enough, right? Wrong. After days of confusion, here's what I learned:

  • "Named" doesn't mean "has a name" (everything has a name). It means "has a corresponding named rule in the grammar"—an internal implementation detail with little meaning outside parsing.
  • "Fields" vs "children": Fields are named semantic relationships ("this node's condition is..."), children are positional ("this node's first child is..."). The JSON treats them identically.
  • Everything is a "type": Nodes (things in code), edges (relationships), and abstract types (categories) all use the same terminology, flattening distinctions that matter.
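To make the fields/children split concrete, here's a minimal sketch that reads the entry above and prints its two kinds of relationships side by side. The entry is inlined as a string so the snippet runs on its own, and `describe_entry` is a hypothetical helper written for this illustration, not part of tree-sitter or CodeWeaver.

```python
import json

# The node-types.json entry from above, inlined so the sketch is self-contained.
RAW_ENTRY = """
{
  "type": "if_statement",
  "named": true,
  "fields": {
    "condition": {
      "multiple": false,
      "required": true,
      "types": [{"type": "expression", "named": true}]
    }
  },
  "children": {
    "multiple": true,
    "required": false,
    "types": [{"type": "statement", "named": true}]
  }
}
"""


def describe_entry(entry):
    """Hypothetical helper: one readable line per relationship."""
    lines = []
    # "fields" maps a field name to a child spec (the named relationships).
    for field_name, spec in entry.get("fields", {}).items():
        targets = ", ".join(t["type"] for t in spec["types"])
        lines.append(f"field {field_name!r} -> {targets}")
    # "children" is a single child spec with no name attached.
    children = entry.get("children")
    if children:
        targets = ", ".join(t["type"] for t in children["types"])
        lines.append(f"children (unnamed) -> {targets}")
    return lines


summary = describe_entry(json.loads(RAW_ENTRY))
print("\n".join(summary))
```

The two specs share the same shape (`multiple`, `required`, `types`); the only structural difference is that a field carries a name and `children` does not.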

The tree-sitter documentation helped a little, but not enough. I'm not new to tree-sitter, either. Between Thread and CodeWeaver, I've spent four months working with it.

I spent a week building a complex system for identifying nodes and their significance based on wrong assumptions about what these terms meant. That system isn't completely useless—I retooled it as a fallback—but it wasn't the kind of thing you want to spend a week doing when building an MVP by yourself.

After that lost week, I decided: we're not doing this to ourselves again.

The Solution: What If We Just... Made It Clear?

My driving philosophy: things should be clear. Not simple—tree-sitter grammars are complex—but understandable. The terminology was a wall between developers and shipping, so I rebuilt it from scratch.

The core insight: Separate things that are different.

1. Category vs Thing

# Category - Abstract grouping (doesn't appear in parse trees)
expression_category = Category(
    name="expression",
    member_things=["binary_expression", "unary_expression", "call", ...]
)

# Thing - Concrete node (what you actually see)
if_statement = Thing(
    name="if_statement",
    kind=ThingKind.COMPOSITE,
    categories=["statement"]  # belongs to "statement" category
)

Tree-sitter calls both of these "nodes" and uses subtypes to indicate the difference. We make it explicit.
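A rough sketch of how that split can be recovered from raw node-types.json data: entries that carry a `subtypes` list are the abstract ones, everything else is concrete. The sample entries are hand-written for illustration (not copied from a real grammar), and `split_categories_and_things` is a hypothetical name.

```python
# Tiny hand-written sample in node-types.json shape (not a real grammar).
SAMPLE = [
    {
        "type": "expression",
        "named": True,
        "subtypes": [
            {"type": "binary_expression", "named": True},
            {"type": "call", "named": True},
        ],
    },
    {"type": "binary_expression", "named": True, "fields": {}},
    {"type": "call", "named": True, "fields": {}},
    {"type": "+", "named": False},
]


def split_categories_and_things(entries):
    """Hypothetical helper: entries with "subtypes" are treated as Categories."""
    categories = {}
    things = []
    for entry in entries:
        if "subtypes" in entry:
            # Category name -> names of its members
            categories[entry["type"]] = [s["type"] for s in entry["subtypes"]]
        else:
            things.append(entry["type"])
    return categories, things


categories, things = split_categories_and_things(SAMPLE)
print(categories)
print(things)
```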

2. Connection (Not "Field" or "Child" or... Both?)

Note: I'm still implementing the parser, so these are just illustrations. The actual implementation will differ.

# Direct Connection - has a semantic Role
DirectConnection(
    role="condition",           # ← The semantic meaning
    source="if_statement",
    target_things=["expression"],  # ← Can be Category OR concrete Thing
    requires_presence=True
)

# Positional Connection - ordered but no semantic role
PositionalConnection(
    source="argument_list",
    target_things=["expression"],
    allows_multiple=True
)

One concept, one name.

3. Role (Not "Field Name")

The role describes what purpose a child serves:

  • condition in an if-statement
  • body in a function
  • left and right in a binary expression
  • operator in an operation

It's semantic, not just structural.
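Here's a small sketch of how a raw entry could be translated into these two connection types, with the field name becoming the Role. The dataclasses mirror the illustrations above, but their exact shape is my assumption for this example, and `connections_from_entry` is a hypothetical helper rather than CodeWeaver's real parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectConnection:
    role: str  # the field name, e.g. "condition"
    source: str
    target_things: frozenset
    requires_presence: bool
    allows_multiple: bool


@dataclass(frozen=True)
class PositionalConnection:
    source: str
    target_things: frozenset
    requires_presence: bool
    allows_multiple: bool


def connections_from_entry(entry):
    """Hypothetical helper: turn one node-types.json entry into connections."""
    source = entry["type"]
    connections = []
    for role, spec in entry.get("fields", {}).items():
        connections.append(
            DirectConnection(
                role=role,
                source=source,
                target_things=frozenset(t["type"] for t in spec["types"]),
                requires_presence=spec["required"],
                allows_multiple=spec["multiple"],
            )
        )
    children = entry.get("children")
    if children:
        connections.append(
            PositionalConnection(
                source=source,
                target_things=frozenset(t["type"] for t in children["types"]),
                requires_presence=children["required"],
                allows_multiple=children["multiple"],
            )
        )
    return connections


IF_STATEMENT = {
    "type": "if_statement",
    "named": True,
    "fields": {
        "condition": {
            "multiple": False,
            "required": True,
            "types": [{"type": "expression", "named": True}],
        }
    },
    "children": {
        "multiple": True,
        "required": False,
        "types": [{"type": "statement", "named": True}],
    },
}

connections = connections_from_entry(IF_STATEMENT)
for connection in connections:
    print(connection)
```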

But Does It Actually Work? The Empirical Test

Instead of trusting intuition, I analyzed all 25 language grammars CodeWeaver supports:

  • 5,000+ unique node types
  • 15,000+ connection definitions
  • Languages from Bash to TypeScript to Haskell

Let's look at the data.

Finding #1: Polymorphic References Are Real (But Minority)

Do connections reference abstract Categories or concrete Things?

Both—and the distinction matters:

  • Direct connections: 7.9% Category (abstract), 92.1% concrete
  • Positional connections: 10.3% Category, 89.7% concrete

When a connection references a Category (like expression), it accepts any expression type. This is a polymorphic type constraint in the grammar:

# Polymorphic - accepts any expression
if_statement.condition → expression (Category)

# Specific - only these operators
binary_expression.operator → ["+", "-", "*", "/"] (Concrete)

Tree-sitter's flattened structure obscures this. My design makes it explicit.
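For a sense of how a split like this can be counted, here's a minimal sketch over a hand-written sample. It treats any target whose name matches an entry with `subtypes` as a Category reference. The sample data and the `count_references` helper are made up for illustration; the real analysis script may count differently.

```python
# Hand-written sample in node-types.json shape (not a real grammar).
SAMPLE = [
    {
        "type": "expression",
        "named": True,
        "subtypes": [{"type": "binary_expression", "named": True}],
    },
    {
        "type": "if_statement",
        "named": True,
        "fields": {
            "condition": {
                "multiple": False,
                "required": True,
                "types": [{"type": "expression", "named": True}],
            }
        },
        "children": {
            "multiple": True,
            "required": False,
            "types": [{"type": "block", "named": True}],
        },
    },
    {
        "type": "binary_expression",
        "named": True,
        "fields": {
            "left": {
                "multiple": False,
                "required": True,
                "types": [{"type": "expression", "named": True}],
            },
            "operator": {
                "multiple": False,
                "required": True,
                "types": [
                    {"type": "+", "named": False},
                    {"type": "-", "named": False},
                ],
            },
        },
    },
]


def count_references(entries):
    """Return {kind: (category_refs, total_refs)} for fields and children."""
    category_names = {e["type"] for e in entries if "subtypes" in e}
    counts = {"direct": [0, 0], "positional": [0, 0]}
    for entry in entries:
        specs = [("direct", s) for s in entry.get("fields", {}).values()]
        if "children" in entry:
            specs.append(("positional", entry["children"]))
        for kind, spec in specs:
            for target in spec["types"]:
                counts[kind][1] += 1  # every target is one reference
                if target["type"] in category_names:
                    counts[kind][0] += 1  # target is an abstract Category
    return {kind: tuple(pair) for kind, pair in counts.items()}


reference_counts = count_references(SAMPLE)
print(reference_counts)
```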

Finding #2: Multi-Category Membership Happens (But Is Uncommon)

Can a Thing belong to multiple Categories?

Yes, and it's fascinating when it does:

  • Total Things with categories: 736
  • Multi-category: 99 (13.5%)
  • Single-category: 637 (86.5%)
  • Maximum categories: 5 (yes, five!)

Real examples (from C and C++ grammars):

Mapping lines from our dataset:

# C/C++ grammar categories
identifier → ["_declarator", "expression"]

# C++ grammar categories
qualified_identifier → ["_declarator", "expression", "type_specifier"]

C example (same identifier, different roles by context):

int x;     // 'x' used as a declarator (declaring a variable)
return x;  // 'x' used as an expression (using the variable)

C++ example (qualified identifier appearing in three roles):

// As a declarator (defining a static member outside its class)
int Counter::count = 0;

// As an expression (reading the qualified name as a value)
return Counter::count;

// As a type specifier (the qualified name is the parameter's type)
void f(std::vector<int>::iterator it);

This happens mostly in statically typed languages with complex type systems (C, C++, C#, Java). Dynamic languages (Python, JavaScript, Ruby) rarely need it.

Design implication: Support multi-category with frozenset, but don't overcomplicate—86.5% of Things belong to exactly one Category.
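As a sketch of that design choice, membership can be derived by inverting each Category's `subtypes` list and storing the result as a frozenset per Thing. The sample below is hand-written around the identifier example; `categories_by_thing` is a hypothetical helper, and it deliberately ignores nested Categories (a Category listed as a subtype of another) to stay short.

```python
from collections import defaultdict

# Hand-written sample modeled on the identifier example (not a full grammar).
SAMPLE = [
    {
        "type": "_declarator",
        "named": True,
        "subtypes": [
            {"type": "identifier", "named": True},
            {"type": "pointer_declarator", "named": True},
        ],
    },
    {
        "type": "expression",
        "named": True,
        "subtypes": [
            {"type": "identifier", "named": True},
            {"type": "call_expression", "named": True},
        ],
    },
]


def categories_by_thing(entries):
    """Invert Category -> members into Thing -> frozenset of Categories."""
    membership = defaultdict(set)
    for entry in entries:
        for subtype in entry.get("subtypes", []):
            membership[subtype["type"]].add(entry["type"])
    return {thing: frozenset(cats) for thing, cats in membership.items()}


membership = categories_by_thing(SAMPLE)
multi_category = sorted(t for t, cats in membership.items() if len(cats) > 1)
print(multi_category)
```

A frozenset keeps the common single-category case cheap while still allowing the rarer multi-category Things without a special code path.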

Finding #3: The Cardinality Matrix Is Essential

Tree-sitter has required and multiple as separate booleans. What do they mean together?

| requires_presence | allows_multiple | Meaning | Example |
| --- | --- | --- | --- |
| False | False | 0 or 1 (optional single) | else clause |
| False | True | 0 or more (optional list) | attributes |
| True | False | exactly 1 (required) | function name |
| True | True | 1 or more (required list) | case statements |

This table should be in the tree-sitter docs. It's not.[^3] I figured it out empirically and built it into the design.
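The same matrix as a small function, in case that's easier to keep in your head: `requires_presence` sets the lower bound and `allows_multiple` lifts the upper bound. The `cardinality` and `describe` names are mine, chosen for this sketch.

```python
from typing import Optional, Tuple


def cardinality(requires_presence: bool, allows_multiple: bool) -> Tuple[int, Optional[int]]:
    """Return (minimum, maximum); a maximum of None means unbounded."""
    minimum = 1 if requires_presence else 0
    maximum = None if allows_multiple else 1
    return minimum, maximum


def describe(requires_presence: bool, allows_multiple: bool) -> str:
    """Render the bounds in the wording used by the table."""
    minimum, maximum = cardinality(requires_presence, allows_multiple)
    if maximum is None:
        return f"{minimum} or more"
    if minimum == maximum:
        return f"exactly {minimum}"
    return f"{minimum} or {maximum}"


for required in (False, True):
    for multiple in (False, True):
        print(required, multiple, describe(required, multiple))
```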

Translation Guide for Tree-sitter Veterans

I'm not trying to be difficult. Here's the mapping:

| Tree-sitter Term | CodeWeaver Term | Why We Changed It |
| --- | --- | --- |
| Abstract type (with subtypes) | Category | "Abstract type" is jargon; Category is universal |
| Named node | Thing | "It's a thing in your code" - simple and clear |
| Node with no fields | Token | What you literally see in source |
| Field | Direct Connection | Edges aren't nodes; connections are relationships |
| Field name | Role | Describes semantic purpose, not just presence |
| Child | Positional Connection | Ordered relationship without semantic role |
| named attribute | is_explicit_rule | Says what it actually means |
| extra | can_be_anywhere | Again, just say what it is |
| multiple | allows_multiple | Upper bound (can have multiple) |
| required | requires_presence | Lower bound (must have ≥1) |
| root | is_start | The starting or first node (just tells tree-sitter where to start) |
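The flag renames in that mapping are mechanical enough to express as a lookup. Below is a minimal sketch that rewrites the boolean flag keys of a raw entry while leaving field names alone (a field could legitimately be called `required`). `FLAG_RENAMES` and `rename_flags` are hypothetical names for this illustration, not CodeWeaver's API.

```python
# Flag renames taken from the translation guide.
FLAG_RENAMES = {
    "named": "is_explicit_rule",
    "extra": "can_be_anywhere",
    "multiple": "allows_multiple",
    "required": "requires_presence",
    "root": "is_start",
}


def rename_flags(value, inside_fields=False):
    """Recursively rename flag keys; keys directly under "fields" are field
    names (Roles), so they pass through unchanged."""
    if isinstance(value, dict):
        renamed = {}
        for key, item in value.items():
            new_key = key if inside_fields else FLAG_RENAMES.get(key, key)
            # Only the dict stored under an entry's "fields" key holds field names.
            child_is_fields = key == "fields" and not inside_fields
            renamed[new_key] = rename_flags(item, inside_fields=child_is_fields)
        return renamed
    if isinstance(value, list):
        return [rename_flags(item) for item in value]
    return value


RAW = {
    "type": "if_statement",
    "named": True,
    "fields": {
        "condition": {
            "multiple": False,
            "required": True,
            "types": [{"type": "expression", "named": True}],
        }
    },
}

translated = rename_flags(RAW)
print(translated)
```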

The Implementation: Show, Don't Tell

Here's a pencil-sketch version of what our parser will look like once fully implemented (feedback on improvements welcome!):

class ConnectionClass(BaseEnum): # BaseEnum is a utility enum with convenience methods
    """Classification for a Connection."""
    DIRECT = "direct"        # Named semantic relationship (has a Role)
    POSITIONAL = "positional"  # Ordered but no semantic meaning

class ThingKind(BaseEnum):
    """Kind of thing based on its structural use."""
    TOKEN = "token"          # Leaf node (what you see in code)
    COMPOSITE = "composite"  # Non-leaf (has children)

# Parse an if-statement
if_statement = CompositeNode(
    name="if_statement",
    kind=ThingKind.COMPOSITE,
    categories=frozenset({"statement"}),
    direct_connections=frozenset({
        DirectConnection(
            role="condition",
            source="if_statement",
            target_things=frozenset({"expression"}),  # Polymorphic!
            requires_presence=True,
            allows_multiple=False
        ),
        DirectConnection(
            role="consequence",
            source="if_statement",
            target_things=frozenset({"block", "statement"}),
            requires_presence=True,
            allows_multiple=False
        ),
        DirectConnection(
            role="alternative",
            source="if_statement",
            target_things=frozenset({"block", "if_statement"}),  # else or elif
            requires_presence=False,  # else is optional
            allows_multiple=False
        )
    })
)

Every concept is explicit:

  • What kind of thing is this? (Composite)
  • What categories does it belong to? (statement)
  • What connections does it have? (Direct connections with semantic roles)
  • What can connect? (Categories for polymorphic, concrete Things for specific)
  • Cardinality constraints? (Required presence, multiple allowed)

No guessing. No cross-referencing documentation. No lost weeks.

Results: Time Saved, Clarity Gained

After implementing this design:

  • Bugs: caught at validation time, not runtime
  • Code: self-documenting (DirectConnection(role="condition", ...))
  • Design: accommodates real patterns (multi-category, polymorphic refs)
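One way to get that kind of early failure, shown as a sketch rather than CodeWeaver's actual validation: check, while loading a grammar, that every connection target names something that exists. The `find_dangling_targets` helper and the sample data (including the deliberate typo) are assumptions made for this example.

```python
def find_dangling_targets(known_names, connections):
    """Return (source, target) pairs whose target is not a known name.

    `connections` is a list of (source, frozenset_of_targets) pairs.
    """
    return [
        (source, target)
        for source, targets in connections
        for target in sorted(targets)
        if target not in known_names
    ]


known = {"if_statement", "expression", "block"}
sample_connections = [
    ("if_statement", frozenset({"expression"})),
    ("if_statement", frozenset({"block", "statment"})),  # typo on purpose
]

problems = find_dangling_targets(known, sample_connections)
print(problems)
```

Running a check like this at load time surfaces a misspelled or missing target immediately, instead of as a failed lookup during a search.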

Most importantly: nobody else loses a week.

The Broader Lesson: You Should Question Inherited Complexity

Tree-sitter is excellent software built by brilliant people. But its node-types.json format evolved from internal implementation details, not external clarity.

When you're building developer tools (...anything, really), ask:

  • Is this complexity essential or inherited?
  • Would a newcomer understand this without reading source code?
  • Can we measure if our design works? (We analyzed 25 languages)
  • Are we preserving jargon out of tradition or necessity?

Sometimes the answer is "preserve it" (backward compatibility, ecosystem alignment). Sometimes it's "redesign it" (when clarity matters more than tradition).

For CodeWeaver, serving AI agents and developers who need to understand 25+ languages quickly, clarity won.

Try It Yourself

Tree-sitter approach:

cat node-types.json | jq '.[] | select(.type == "if_statement")'
# Output: nested JSON with fields, children, types, named flags...
# Time to understand: 15-30 minutes of documentation reading

CodeWeaver approach:

from codeweaver.semantic import NodeTypeParser

parser = NodeTypeParser()
if_stmt = parser.get_thing("if_statement", language="python")

print(f"Kind: {if_stmt.kind}")
print(f"Categories: {if_stmt.categories}")
print(f"Direct connections: {[c.role for c in if_stmt.direct_connections]}")
# Output: Clear, typed, self-documenting
# Time to understand: Immediate

(Okay, give me a couple days to get it working first!)

The code is open source on GitHub. The empirical analysis is in the repository. The design decisions are documented.

Takeaways

  1. Inherited terminology can be more confusing than the underlying concepts

    • Tree-sitter's "named" attribute confused me for days
    • My "is_explicit_rule" says exactly what it means
  2. Empirical validation beats intuition

    • I thought multi-category was rare (it is: 13.5%)
    • I didn't know polymorphic refs existed (they do: 7.9-10.3%)
  3. Make the implicit explicit

    • Category vs Thing (both called type in tree-sitter)
    • Direct vs Positional (all jumbled in tree-sitter)
    • Role as semantic purpose (just an unlabeled object key in tree-sitter)
  4. Design for clarity, optimize for understanding

    • One concept = one name
    • Similar concepts = consistent naming (ConnectionClass, ThingKind)
    • Confusing concepts = explicit separation (requires_presence vs allows_multiple)
  5. Provide bridges for experts

    • Translation guide for tree-sitter veterans
    • Preserve the power, improve the interface

The Bottom Line

Developer tools should make complex problems understandable, not add complexity on top.

I spent a week confused by tree-sitter's terminology. I spent the next week analyzing 25 languages and redesigning it. Now, every engineer who touches CodeWeaver saves that week I lost.

That's a good trade.


About CodeWeaver: I'm building intelligent code search that understands semantic meaning across 25+ programming languages (and with lesser accuracy, 170+ languages). Open source, AI-native, built to make large codebases navigable for both humans and AI agents.

About Knitli: I'm the founder and only employee (for now). I make complex systems intuitive. Whether it's tree-sitter grammars, distributed systems, or AI workflows, the right abstraction makes everything clearer. I'm building context engineering tools that are intuitive for both AI agents and humans. Oh, and they cut costs by saving a lot of tokens. Visit us at https://knitli.com and sign up for the waitlist!

Found this helpful? I'd love to hear your own "lost week to bad terminology" stories. Share them in the comments.


Appendix: The Numbers

For the data nerds (I am one too):

Languages Analyzed: 25

  • Bash, C, C++, C#, CSS, Elixir, Go, Haskell, HTML, Java, JavaScript (+JSX), JSON, Kotlin, Lua, Nix, PHP, Python, Ruby, Rust, Scala, Solidity, Swift, TypeScript, TSX, YAML

Node Types Analyzed: 5,000+

  • 'Named' nodes: ~2,900
  • 'Unnamed' nodes: ~2,100
  • Abstract types (Categories): ~20-100 (explanation below)
  • Extra nodes ('can_be_anywhere'): 16 across all languages, 6 unique. comment is marked as an extra node in all 11 languages whose grammars define it

Connection References: 15,635

  • Direct (fields): 9,606 (761 Category, 8,845 Concrete)
  • Positional (children): 6,029 (621 Category, 5,408 Concrete)

Multi-Category Distribution:

  • 1 category: 637 Things (86.5%)
  • 2 categories: 68 Things (9.2%)
  • 3 categories: 18 Things (2.4%)
  • 4 categories: 12 Things (1.6%)
  • 5 categories: 1 Thing (0.1%) - the legendary qualified_identifier in C++

Common Roles: ~90 unique across all languages

  • Most common: name (381 occurrences), body (281), type (217), condition (102)
  • Language-specific: lhs/rhs (Swift), quoted_start/quoted_end (Elixir)

Analysis Tools: Python script in the CodeWeaver repo; results are in https://github.com/knitli/codeweaver-mcp/tree/main/claudedocs/grammar_structure_analysis.md

Confidence Level: High. I parsed every node-types.json file for languages supported by ast-grep, validated against actual grammars, cross-referenced with tree-sitter documentation, and tested on real codebases.

About Categories (abstract nodes), and Inconsistent Grammars

Tree-sitter grammars have no standardized vocabulary, so whoever wrote the grammar defines the names. This makes working across languages, like with CodeWeaver, much more difficult. The grammar can also represent names with a preceding underscore (like with _expression) or without (expression) based on how they're used in the grammar.

The initial results from 25 languages gave us ~110 abstract types, but a little normalization goes a long way. Not all languages define Categories: of the 25, only 18 do. Six don't (CSS, Elixir, HTML, Solidity, Swift, YAML), and JSON and Nix define only one each (_value and _expression, respectively). Normalizing the names brings that ~110 down to:

  • 50 if you simply remove the underscore.
  • 34 of those belong to only one language/grammar.
  • 7 of those appear only in C and C++.
  • 9 are used in multiple languages:

| Category | # Grammars |
| --- | --- |
| expression | 18 |
| statement | 14 |
| type | 9 |
| declaration | 8 |
| primary_expression* | 8 |
| pattern | 8 |
| literal | 4 |

You're likely to see similar patterns across all nodes and edges based on my observations working with the data, but I haven't done the analysis yet. If you look at Categories that are unique to a language, you see that many are slight variations from these, like type_declaration, pattern_expr, and expression_statement.
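The underscore normalization described above is simple to sketch: strip the leading underscore, then count how many grammars use each resulting name. The per-grammar lists below are made up to keep the example short; they are not the real data behind the table.

```python
from collections import Counter

# Made-up sample: Category names per grammar (not the real dataset).
CATEGORIES_BY_GRAMMAR = {
    "c": ["_expression", "_declarator", "statement"],
    "python": ["expression", "_simple_statement", "pattern"],
    "rust": ["_expression", "_pattern"],
}


def normalize(name):
    """Drop leading underscores so _expression and expression compare equal."""
    return name.lstrip("_")


def grammars_per_category(categories_by_grammar):
    """Count, for each normalized name, how many grammars define it."""
    counts = Counter()
    for names in categories_by_grammar.values():
        # A set ensures a grammar is counted once per normalized name.
        counts.update({normalize(name) for name in names})
    return counts


counts = grammars_per_category(CATEGORIES_BY_GRAMMAR)
shared = {name: n for name, n in counts.items() if n > 1}
print(sorted(shared.items()))
```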


This post is part of our technical Clarity Engineering series on making developer tools more intuitive.

[^1]: "Semantic" is one of those heady academic words I try to avoid, but in this context it's hard to do that. In terms of AST-parsing like with tree-sitter, you can think of semantic as meaning "has structural meaning". Tools like tree-sitter parse text, usually code, based on defined relationships and rules called grammars, and not based on their physical syntax (like a regular expression (regex) does where it looks for specific characters in specific patterns). This difference is powerful - the resulting 'syntax tree' provides rich information about code relationships. Tree-sitter, despite its flaws in terminology and documentation, is a powerful tool for generating a syntax tree for almost any programming language and doing it very quickly (real time even on large codebases).

[^2]: Not my first in full disclosure, but the first time I'd really inspected one.

[^3]: Yes, I know I should submit a PR. I will when I can take a breath; anyone who reads this is welcome to go ahead and do it if I haven't already.
