namuol 12 hours ago [-]
> The registry grows with use. Every session is smarter than the last.
This feels a bit like one of those “now you have two problems” solutions. After a few dozen sessions I would expect the tool registry to be full of “noise” for most prompts. I would also expect most tools to be extremely specific to the task at hand, leading to redundancy and ultimately poor programmability due to inconsistencies between tool APIs.
walmsles 8 hours ago [-]
It's an open experiment; the utility of Tendril is the concept. I am more curious about how good the tool making can get. Frontier models tend to be very specific about what they build, so we don't get bloat from overly specific tools (yet).
gavinray 14 hours ago [-]
It's really cool to see that other people run into the same issues and arrive at the same conclusions/solution.
At $DAYJOB, we have an LLM-based tool and this issue of "how do we avoid burning tokens solving the same problems over again" was an early obstacle
We wound up building a very similar thing to what you call "tools" (we named them "Saved Programs").
There's a wiki the LLM searches before solving a problem, that links saved programs for past actions to their content entry.
If it finds one, it'll re-use it, otherwise it'll generate a program and offer to save it, if you think it'll be common enough.
afshinmeh 14 hours ago [-]
> how do we avoid burning tokens solving the same problems over again
Letting the LLM write half-baked tools is the recipe for burning more tokens.
> There's a wiki the LLM searches before solving a problem, that links saved programs for past actions to their content entry.
What's the criteria for marking an LLM written tool as useful/correct before publishing it?
gavinray 14 hours ago [-]
> Letting the LLM write half-baked tools is the recipe for burning more tokens.
It sure is, if the tools are half-baked and your user scale is N=1 rather than N=100 or N=1,000
> What's the criteria for marking an LLM written tool as useful/correct before publishing it?
It solves the problem the originating user asked it to
afshinmeh 13 hours ago [-]
> It solves the problem the originating user asked it to
Interesting. And is there a mechanism to go back and "fix" the tools after they are published? What happens if a tool decided to use the "id" attribute to click buttons, and now you have a new website that follows a different pattern for finding the right target?
I agree that "correctness" of a tool could have different meaning depending on the context of the problem though (e.g. would you consider OOM a correctness bug even if it addresses the user's ask?)
sifar 10 hours ago [-]
The problem here is that N different users will ask for N different variants of the same tool, so you'll end up with tools that are similar but not quite the same. Is the existing tool updated to support the new functionality, or is a new tool created, leaving you with N variants of the same tool?
Everyone is just taking a roundabout way to get there. The workflow/program-as-"tools" approach is the right one. Agent Skills are more or less a step in that same direction.
dominotw 10 hours ago [-]
There are hundreds or thousands of 'memory' things people have been inventing. I have yet to see any proof that these are actually useful or have saved any tokens.
weitendorf 13 hours ago [-]
Get outta my swamp! Just kidding, it’s cool to see other people working on this stuff.
I think right now this is still a bit too fresh out of Claude Code to be usable by anybody but the people developing it. I got to around the same point with my first attempt at building a tool registry (https://github.com/accretional/collector) and then realized I basically needed to start over with much more investment in supporting infrastructure to build the thing I really wanted.
I can go as far into the weeds as anybody would ever care to hear about this, but for the sake of brevity I’ll just say this: reflection and type systems over the network are pretty much the only way to get this stuff to work properly (I mean you could just go full MCP/Skills but then all you really have are giant blobs of markdown and unconstrained json that make integration/discovery/usability a nightmare, and require an agent in the loop to drive/integrate the tools when you really just need to give them the actual APIs and documentation). That ends up getting rather hairy, we ended up actually building a declarative meta-lexer/parser/transpiler (meta basically just meaning it’s generalized across languages and self-hosting/bootstrapped) recently (https://github.com/accretional/gluon) because it turns out building a cross-language distributed type system is rather difficult. But reflection alone gets you halfway there as far as benefits.
mtrifonov 13 hours ago [-]
I like that you approach the question of "when" in regards to tool calls. I've become frustrated that most agent frameworks don't acknowledge it in their design philosophy.
WHEN is upstream of WHAT and HOW. You can have perfect tool descriptions and perfect call signatures, but if the model can't read the situation to know whether the moment calls for any tool at all, you get either over-firing (agent burns tokens trying to "help") or under-firing (agent waits to be addressed and acts like a chatbot, not an autonomous participant).
I have had a lot of success when I refrain from codifying WHEN as rules. "If X then fire tool Y" is a dumb heuristic with extra steps. Describe the conditions of the moment. What's been tried, what's converged, what state the work is in. Then let the model decide whether to act and which tool fits.
Rules get stale. Situation-reads generalize.
Reading the Tendril README, looks like the registration mechanic is solving a slightly different problem (the "too many tools" / context-bloat problem) by giving the agent three bootstrap tools and a growing registry. The WHEN itself still seems to be codified as rules in the system prompt ("BEFORE acting, call searchCapabilities; IF found, load and execute; IF NOT found, build yourself"). That's exactly the IF-X-THEN-Y pattern your framing seems to want to move past.
Curious whether you see the registry itself as the structured WHEN, or whether the rule-based system prompt is a starting point you intend to evolve toward something more situational.
walmsles 8 hours ago [-]
The registry itself is searchable. The system prompt guides the agent to search it to find tools. Right now it's a naive implementation, as it's a local tool; I am exploring the idea of a more structured policy here. It's not net new or different from Skills or MCP; it externalises the invocation policy, which I feel is really important when looking to formalise or scale agent tools in larger organisations.
It's more an idea I decided to share because I think we need more thinking in this space as we all run towards agent networks of networks.
Will review the README.md. The article I wrote looks at the aspect of "when", which I found interesting in the original case I wrote about.
Tendril and its find-tools flow are more an experimental look at "how do we discover tools at scale" and how agents know what to choose.
More importantly, how do administrators reason about the tools: when are they used, and are they being used correctly (agent validation)?
I feel the focus of "when" is more human oriented IMO.
walmsles 16 hours ago [-]
I built this while working on a coding agent that kept starting cold every session. The deeper problem was that agent frameworks give you what a tool does and how to call it, but no structured answer to when — when should a tool fire autonomously, and when should it stay silent. That judgement is always implicit, scattered across system prompts and tool descriptions.
Tendril is a reference implementation of what I'm calling the Agent Capability pattern. It starts with three bootstrap tools and builds everything else itself. The key constraint: there's no direct code execution. The agent can only run registered capabilities, so every task forces it to write a tool, define its invocation conditions, and register it for future sessions. The registry accumulates across sessions.
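A minimal sketch of how that three-tool surface might look. The names, types, and matching logic here are illustrative guesses, not Tendril's actual API; the point is that everything routes through search, register, and execute:

```typescript
// Hypothetical capability shape -- not Tendril's real schema.
type Capability = {
  name: string;
  description: string;
  triggers: string[]; // conditions under which it should fire
  run: (input: unknown) => unknown;
};

class Registry {
  private caps = new Map<string, Capability>();

  // Bootstrap tool 1: search the registry. Naive substring match here;
  // semantic search would slot in behind the same interface.
  searchCapabilities(query: string): string[] {
    const q = query.toLowerCase();
    return [...this.caps.values()]
      .filter(c =>
        (c.description + " " + c.triggers.join(" ")).toLowerCase().includes(q))
      .map(c => c.name);
  }

  // Bootstrap tool 2: register a capability the agent has just built.
  registerCapability(cap: Capability): void {
    this.caps.set(cap.name, cap);
  }

  // Bootstrap tool 3: execute -- the ONLY way code runs. Anything not
  // registered is unreachable, which is the "no direct execution" constraint.
  executeCapability(name: string, input: unknown): unknown {
    const cap = this.caps.get(name);
    if (!cap) throw new Error(`no capability: ${name}`);
    return cap.run(input);
  }
}
```

The tool surface the model sees is fixed at those three entries; only the registry behind them grows across sessions.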
I also ran the self-extending loop against five local models — Qwen3-8B, Gemma 4, Mistral Small 3.1, Devstral Small 2, Salesforce xLAM-2. None passed. The failure modes were distinct enough to be worth writing up separately: https://serverlessdna.com/strands/ai-agents/agents-know-what...
Stack: AWS Strands TypeScript SDK, Bedrock (Claude Sonnet), Deno sandbox, Tauri + React desktop shell.
I did something that sounds similar for my home assistant.
The agent never executes anything. It has like four tools… search, request execute, request build, request update.
The tool service runs vector search against the tools catalog.
The build generalizes the requested function and runs authoring with review steps, declaring needed credentials and network access.
The adversarial reviewer can reject back to the authoring three times.
After passing, the tool is registered and embeddings are done for search. It’s live for future use.
Credentials are stored encrypted, and only get injected by the tools catalog service during tool execution. The network resources are declared so tool function execution can be better sandboxed (it’s not, yet).
The agent never has access to credentials and cannot do anything without going through vetted functions in the tool service.
Agent, author process, reviewer, embedding… all can be different models running local or remote.
Event bus, agent, tool service… all separate containers.
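The authoring/review loop described above can be sketched roughly like this. Everything here is hypothetical (the author and reviewer are stubbed callbacks; in the described setup they'd be separate models behind the event bus), but it captures the bounded-rejection shape:

```typescript
// A draft tool declares its needed credentials and network access up front.
type Draft = { code: string; credentials: string[]; networkHosts: string[] };

// Author a tool, let an adversarial reviewer reject it back to authoring
// up to maxRejections times, then give up rather than loop forever.
function buildTool(
  request: string,
  author: (req: string, feedback?: string) => Draft,
  review: (d: Draft) => { ok: boolean; feedback?: string },
  maxRejections = 3,
): Draft {
  let feedback: string | undefined;
  for (let attempt = 0; attempt <= maxRejections; attempt++) {
    const draft = author(request, feedback);
    const verdict = review(draft);
    if (verdict.ok) return draft; // passed: register + embed for search
    feedback = verdict.feedback;  // rejected: back to authoring with notes
  }
  throw new Error("rejected too many times; escalate to a human");
}
```

The cap on rejections is the important design choice: it turns a potentially unbounded model-vs-model argument into a fixed token budget per tool.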
I have a URL if you want to read a bit about what I did: https://dcd.fyi/agent
It's really just meant for me, but if you're interested in more details on anything let me know. There's nothing super special in it.
esafak 15 hours ago [-]
You can list the uses of the available tools in the AGENTS file. I keep my agents on a tight leash, and self-extension runs counter to this. I would not want my agent to spontaneously develop the ability to tap my bank account, for example.
walmsles 7 hours ago [-]
The Deno sandbox is the answer here — network access is restricted to an allowlist, and the execution environment has scoped permissions. The agent builds tools within those constraints; it can't reach anything you haven't explicitly allowed.
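For readers unfamiliar with Deno's permission model: the `--allow-net=<hosts>` flag is real Deno CLI and restricts network access to the listed hosts, with everything else (filesystem, env, subprocesses) denied by default. A small sketch of how a capability runner might build the invocation (the function name and runner shape are hypothetical):

```typescript
// Build the argv a runner would pass to the `deno` binary so a registered
// capability can only reach an explicit host allowlist.
function denoArgs(entry: string, allowedHosts: string[]): string[] {
  // Only the listed hosts are reachable via fetch(); fs/env/subprocess
  // stay denied because no corresponding --allow-* flag is passed.
  return ["run", `--allow-net=${allowedHosts.join(",")}`, entry];
}

// A runner would then do something like:
//   spawn("deno", denoArgs("tool.ts", ["api.example.com"]))
```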
Lucasoato 5 hours ago [-]
Well, I follow a similar pattern with Claude: whenever I solve a problem that happens frequently, I create a skill or a subagent definition, mentioning it with a one-liner in the global .md file.
I think this is a simple and effective solution if you have a dozen or two tools; maybe it won't scale to hundreds or thousands, but that will be a problem for tomorrow's me.
andai 12 hours ago [-]
A while back I realized OpenClaw was Claude Code in a trenchcoat, except that Claude Code is pretty good at extending itself without breaking itself. (Note: haven't used OC since February, maybe it's solid and reliable now.)
Of course, being reliable and reliably extensible is the whole point, which means Claude Code made a better OC than OC did! I found this very amusing for some reason.
Also you can put it (or your agent of choice, e.g. codex works too) in a Telegram bot in like 50 lines of code, which is a lot of fun.
https://github.com/a-n-d-a-i/ULTRON/blob/main/src/index.ts
Though this might get you banned from Anthropic; they haven't quite clarified that yet. (Ostensibly it defaults to extra usage now, but who knows.)
https://news.ycombinator.com/item?id=47852834
https://code.claude.com/docs/en/channels
It's... fine. A bit half-baked like a lot of CC features right now.
tmzt 10 hours ago [-]
Also working on something similar, using a dual-LLM architecture (small router, larger deep thinker) with offline models, as well as deterministic skills encoded as TSX.
It's evolved into a mesh-based operating system, gained its own GPU-based AI library/runtime, and even molted and extended itself to ESP nodes.
Getting closer to a full release sometime in May. For now, pieces are released on my github.
dakiol 12 hours ago [-]
Why are we still building stuff in TS/npm? Given that LLMs can code in any language, I'd expect people would use "better" languages at this point.
mikmoila 9 hours ago [-]
I don't. I use Java for small scripts/tools.
sockaddr 14 hours ago [-]
So basically you've built a mechanism for a model to de-compress itself.
walmsles 7 hours ago [-]
Not for a lack of trying! Had to enforce tool building through code as models tend to just execute arbitrary code when allowed.
nickstinemates 13 hours ago [-]
We built something similar[1], including integrated memory for debugging. It is very useful to have repeatable artifacts left behind every time you use an agent to accomplish a task.
The main design decision we took was to integrate with your existing agent instead of building a new one. Your harness, swamp, and you're off.
As an aside, building software for agents is incredibly fun.
1: https://swamp.club
You burn tokens with a smaller LLM you don't care about that exclusively does tool selection (or routing?)
aleksiy123 10 hours ago [-]
I guess the interesting part is the forced tool writing.
Which kind of solves the "when should we write a tool" question by just saying: always.
But I think the question is how will this scale. The real core issue I feel like I’ve been encountering is scaling complexity.
Reducing the number of tools without losing efficiency or capability.
Reducing duplication, abstracting, cleaning up, and maintaining knowledge and memory.
I think the issue for me has been threefold.
1. As the repo grows, how do you make the agent keep an understanding of it without excessive context pollution?
2. How do you maintain memory and knowledge over time.
3. How do you know the agent is performing better over time and not regressing as you evolve.
And what has somewhat been working for me is
A) trees or hierarchies.
Trees scale well. Folder structure but also in the form of just simple indices.
Logical structure and locality make them even more effective.
B) caching.
Having the agents “cache” their thinking in the form of summaries, skills, tools.
Recursive summarization really helped with mono repo navigation for me.
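The "trees as indices" idea can be as simple as grouping a flat list by its top-level segment, so the agent loads one branch's entries (or their summaries) instead of everything. A toy sketch, with illustrative names:

```typescript
// Group flat paths by their top-level folder so an agent can pull in a
// single branch of the index rather than the whole repository listing.
function groupByBranch(paths: string[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const p of paths) {
    const branch = p.split("/")[0]; // top-level folder as the branch key
    const entries = index.get(branch) ?? [];
    entries.push(p);
    index.set(branch, entries);
  }
  return index;
}
```

Recursing the same move (summarize each branch, then summarize the summaries) is what makes the context cost grow with tree depth rather than with repo size.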
But right now I still feel like I need to be constantly prompting them and I can’t quite close the feedback loop.
walmsles 7 hours ago [-]
The three bootstrap tools are a partial answer to (1) — the tool surface never grows, only the registry does, so context pollution is bounded by the search interface rather than the full tool list. Whether the registry search stays useful as it grows is an open question; semantic search over capability definitions is probably the next step.
(2) is where the structured capability format earns its keep over free-text memory. Triggers and suppression conditions give you inspectable, versioned invocation policy rather than prose that degrades over time. Still early though.
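For concreteness, one possible shape for such a format — the field names are my guess, not Tendril's actual schema. The point is that triggers and suppression conditions are data, so they can be inspected, diffed, and versioned rather than buried in prose:

```typescript
// Hypothetical structured capability definition: invocation policy as data.
type CapabilityDef = {
  name: string;
  version: number;
  triggers: string[]; // situations in which the tool should fire
  suppress: string[]; // situations in which it must stay silent
};

// Fire only when some trigger matches the current situation AND no
// suppression condition does.
function shouldFire(def: CapabilityDef, situation: string[]): boolean {
  const hit = (c: string) => situation.includes(c);
  return def.triggers.some(hit) && !def.suppress.some(hit);
}
```

Exact string matching is obviously a stand-in; the situational-read version upthread would replace `hit` with a model judging the described conditions.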
(3) I don't have a good answer to yet. Your point about feedback loops is the right framing — knowing whether the agent is actually getting better rather than just accumulating more tools is unsolved. The audit angle (administrators reasoning about which tools fire, when, and whether they should) is where I think this needs to go, but I haven't built that layer.
One thing that might directly address your caching point though — ADRs (Architecture Decision Records). The article that spawned Tendril started with giving an agent a record_decision capability that wrote ADRs to the filesystem. ADRs as agent cache is an interesting framing: structured, persistent, searchable records of why decisions were made at the moment they were made. That's arguably a better cache primitive than summarisation — decisions don't degrade the way summaries do, and they give you something to reason about for regression detection too.
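A minimal sketch of what a record_decision capability like that might look like. The file layout, numbering, and helper name are my assumptions, not the actual implementation from the article:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Write an ADR-style markdown file: numbered, slugged, with the context
// and decision captured at the moment the decision is made.
function recordDecision(dir: string, title: string, context: string, decision: string): string {
  fs.mkdirSync(dir, { recursive: true });
  const n = fs.readdirSync(dir).filter(f => f.endsWith(".md")).length + 1;
  const slug = title.toLowerCase().replace(/\s+/g, "-");
  const file = path.join(dir, `${String(n).padStart(4, "0")}-${slug}.md`);
  fs.writeFileSync(file, `# ${title}\n\n## Context\n${context}\n\n## Decision\n${decision}\n`);
  return file; // searchable later; the "why" survives the session
}
```

Because each record is append-only and timestamp-ordered by its number, it doubles as an audit trail for the regression question in (3).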
Your tree/hierarchy observation resonates — the registry is a flat index right now which probably doesn't scale past a few dozen capabilities without some grouping structure.
jedisct1 12 hours ago [-]
I use Swival’s /learn command at the end of a session to make it write down what it got wrong, how it fixed the issue, and what it should remember next time. Works pretty well.
It can update those notes automatically, but I’ve found that even with regular nudges, models are still somewhat reluctant to do it.
So manually running /learn every now and then, especially when I can tell it didn’t take the most direct path, helps.
walmsles 7 hours ago [-]
This is essentially ADRs — capturing what the agent learned and why. The manual trigger is the interesting constraint though; the hard part is teaching the agent to recognise the moment a decision worth recording has been made, without being asked. That's what the triggers/suppression definitions are trying to formalise — the when of capture, not just the what.