What Are You Building? (2025 Projects Hackathon Thread)

Next for this little util work-in-progress…hack in distillation features to call and force tool functions out of training file example or defaults.

Goal: store what you want, with human in the loop, not what you run and receive. The training file entry on the left has assistant fulfillments by the right panel AI and its additional context.

Scope: Pretty much needs to be a multi-platform chatbot anyway to fully-realize all distillation and judging features one could want to be done for you for fine tuning (and vision and function specs found in existing JSONL files or that you’d extend on).

(Tell me if the compact UI isn’t enbafflement..)

Hey that’s very interesting. So do you want to test some use cases or what?? I’d be curious to provide some “Training Entry Message” sets and see what the results are of your system.

Presumably the right pane results are from some kind of back-end prompting you are doing based on the left-pane data? Running through whatever model is selected?

What’s the “additional context” that your providing, in general? Is it complex? Is it static or dynamically generated based on some parsed parameters from the data on the left? Is it single-shot?

And what is the purpose of your mapping?

I mean where did you get that idea?

I’m into letters/numbers matrixs - particularly just with taking english words, converting them through the basic enumeration of “numerology” (just 1-9 mapped consecutively to alphabet, or however depending on the language) and then calculating the results of the words/phrases with “numerological” (base 9?) reductions.

What’s the purpose of the mapping and what’s your logic to generate the set of primes?

That is an editor of fine-tuning JSONL files. What you see is what you get. But I was hoping even a screenshot would be intuitive, without excessive use of non-content UI space.

In the left panel, you can tab through entries in a fine-tuning file you’ve loaded, and edit or delete them. You can add a new entry with the default system message, which is a smaller message that may act as the “trigger” of your tuning over the chatbot behavior. Then add messages you type up, perhaps to change the currency and factual nature of answers.

OpenAI’s distillation, which may have been sunset because of zero engagement, would have you fine-tune on full normal AI responses, whatever big model input context was needed to get the answer that you previously stored.

Here, however, the right panel lets you construct a large system message, and multishot conversation, where you can fully describe and create an AI that is able to answer training file inputs in the way you want your final application to act. Then:

  • distillation panel system message
  • distillation panel multi-shot exchanges
  • training file stripped of first system and last assistant
  • → generates a replacement assistant response for the fine-tuning file

That is something that can be fully-automated, but here you are in control of individual training responses.

Above, I press “distill”. The assistant output for the training file was generated by the qualities of the distillation panel AI.

Then run that new training file entry and its hundreds of companions, that type of new output being taught by examples, and you’d have an AI that is an instruction ignorer and question rephraser, using RAG context to also improve the question quality, or in this case, ignore irrelevant retrieval. (A training file should still have a hint of the task in the system messages, though.)


This would be something for me to plug away at more for “everybody try my tool”. But right now, it would damage unknown files, stripping function specifications or such.

I can also see just playing around with the minimum implementation, that 1000 example files are behind a wall of inaccessibility, it might need a scrolling browser and a search to find the problematic examples you need to refine.

Great! Just found this awesome community. I recently dedicated my time building a quick MVP Vizbull Turn your photos into AI art. I am still experimenting with the prompts. Are there any suggestions to produce consistent results and preserve facial features as as possible?

Welcome! You’re free to start a project thread in Community to ask for help and keep us up to date. Just try not to make it too promotional as we’re mostly devs here. I’m sure you can find some gpt-image-1 and dalle3 prompt help too. Hope you stick around!

I guess during Covid-19 I got interested in learning about Physics and Mathematics again, mainly through watching Youtube channels such as Numberphile, Computerphile, 3blue1brown and Mathologer. I also took an introduction to Quantum Computing course through IBM and MIT.

Ive always been interested in patterns and how things work or how to build things. Im a Carpenter by Trade. Ive been fooling around with Prime Numbers, Prime Gaps and Twin Primes for a few years. 2,3,5,7,11,13, etc… once primes get into double digits the ending digit of a prime will either 1,3,7 or 9 in base 10.

Lately ive just been learning how to expand my thoughts through Python programs. Ive also have ideas how to set up mathematical notation with this idea, based off of Godel and Grothendieck and others.

Always think what a criminal can do with it. (Or imagine, jochen is pissed and want get rid of you, and you have a toy he controls. If you don’t get sleepless nights, go ahead… :sweat_smile:)

Links

Connected toys - Wikipedia
My Friend Cayla - Wikipedia
Researchers Discover a Not-So-Smart Flaw In Smart Toy Bear | Trend Micro (US)

Good reason to do the OAI project of a “stuffed animal AI”, so YOU can control it. Even run your own local models too. Very low security risk if DIY. Admittedly, higher risk if you buy one and don’t control the recordings.

Furby + AI = FurbAI?

gpt-image-1 wouldn’t create it heh…

Soft Toy Wars

Image(Wide, Soft Toy Wars)

Image(Wide, AI Soft Toy Wars)

Seems like a good tool. Can this be used to improve the accuracy of tool calls as well?

I do not have support and understanding for functions in the utility now.

Internally, “functions” are one type of tool, where other tools like file_search you cannot train on.

But in general, yes, a fine-tuning file can include functions, function calls, and the response AI produces after seeing them called and a value returned.

Is there a guide to optimizing functions via fine-tuning? No, quite the opposite. You are left to figure out the correct balance of training to not completely break the ability.

My apologies for the mild redactions, obfuscation and truncations, I just don’t feel 100% comfortable showing much more than this.

Comprehensive Analysis of the Multi-Cognitive-Agent System Log for my autonomous research and software development system.

Had to host it externally because it’s about 20x’s the size allowed for a post.

cog_log.txt https://mega.nz/file/j1oCFArD#x8URXRapbU5BajWXZFIKeCLW4qosDELWbq0IiTujwtg

This log details a test run of my multi-cognitive-agent system tasked with a complex cognitive challenge: “Evaluate proposal V5: Refactor legacy auth module with new library (risks, benefits, steps, strategy).” The system demonstrates advanced capabilities in task decomposition, parallel processing, iterative refinement, dynamic capability extension, and user interaction.

Output

Phase 1: Goal Ingestion and Initial Decomposition

Task Initiation: The test begins with the user (or a test harness) submitting the high-level goal to the system.

Executive Agent Analysis: A central “Executive Agent” (the primary orchestrator, akin to a “Chief” of operations) receives this complex goal. It analyzes the request and determines that a multi-faceted approach is necessary, requiring input from several specialized cognitive functions.

Specialist Identification & Tasking: The Executive Agent identifies a team of five distinct specialist agent types needed for an initial comprehensive assessment:

An Evaluator Specialist to analyze risks and benefits (technical, security, operational).

A Strategy Specialist to outline an effective refactoring strategy (constraints, dependencies, challenges).

A Process Navigation Specialist to detail the implementation steps in a logical workflow.

An Initial Security Review Specialist to focus on security implications and potential vulnerabilities of the new library.

An Innovation Specialist to suggest alternative or complementary approaches.

The Executive Agent formulates specific focus directives for each specialist and issues activation commands.

Concurrent Specialist Activation: The System Controller parses these commands and dispatches the tasks, along with relevant context (the overall goal and the Executive Agent’s initial breakdown), to the five specialist agents, likely engaging them concurrently.

Phase 2: First Round of Specialist Input and Synthesis

Specialist Processing: Each of the five specialist agents processes its assigned focus area and generates a report. For instance, the Evaluator Specialist provides a detailed list of potential benefits and risks, while the Strategy Specialist recommends a phased, iterative approach.

Input Collection: The System Controller gathers these individual reports.

Executive Agent Synthesis (Round 1): All specialist inputs are presented to the Executive Agent for synthesis. It evaluates their relevance, quality, and identifies any conflicts or gaps.

Gap Identification: A significant finding emerges, primarily from the Initial Security Review Specialist: the current information lacks specific details about the proposed new authentication library (e.g., its name, version, security track record). This prevents a thorough security assessment.

Phase 3: Iterative Refinement – Addressing the Security Gap

Re-tasking Specialists: To address this, the Executive Agent decides to re-engage two of the initial specialists with more focused directives:

The Initial Security Review Specialist is asked to perform a more detailed security assessment, contingent on receiving specifics about the new library.

The Evaluator Specialist is asked to reassess risks and benefits once these detailed security findings are available.

Specialist Response (Round 2): The specialists process their new tasks. The Initial Security Review Specialist reiterates that without concrete details about the library, its assessment remains incomplete and flags the proposal as partially compliant with security best practices. The Evaluator Specialist provides a reassessment assuming some security details might have been incorporated but also highlights remaining gaps if specifics are still missing.

Phase 4: Advanced Gap Resolution – Dynamic Agent Creation

Executive Agent Synthesis (Round 2): The Executive Agent reviews these follow-up reports. It concludes that the existing specialists, particularly the Initial Security Review Specialist, have highlighted a persistent and critical information gap that cannot be resolved with their current scope or available data. The system needs a more deeply specialized function for a comprehensive security and operational audit.

Request for New Capability: The Executive Agent determines that a new type of specialist agent is required. It issues a directive to the System Controller to create a new agent. This request specifies:

A descriptive name for this new specialist role (e.g., “Detailed Security & Operational Validator”).

Its purpose: to perform an in-depth security and operational risk validation, covering version verification, vulnerability assessment, compliance, supply chain risks, performance benchmarking, and user transition planning.

A list of core capabilities required for this role.

A detailed system prompt defining its core functions and interaction protocols, ensuring it focuses exclusively on its validation task and requests further information if needed.

Agent Provisioning: The System Controller processes this request. After a (mocked) user confirmation, it interacts with an “Agent Management System” (AIManager) to:

Create a record for the new agent type in a database.

Assign its capabilities.

Instantiate the new agent runtime.

The system logs confirmation that the new “Detailed Security & Operational Validation Specialist” is now available.

Phase 5: Engaging the New Specialist and User Interaction

Executive Agent Clarification & New Specialist Activation: The Executive Agent, now aware of the new specialist’s availability, activates it with a clear directive to perform the comprehensive security validation it was designed for.

New Specialist Analysis & Data Requirement: The “Detailed Security & Operational Validation Specialist” analyzes its task and the provided context. It identifies specific pieces of information crucial for its assessment that are currently missing (e.g., exact library name/version, existing scan reports, compliance documents, dependency lists, performance benchmarks, transition plans).

Request for User Input: The specialist, following its protocol, determines it needs to ask the user for this missing information. This intent is communicated back to the Executive Agent.

Relaying to User: The Executive Agent synthesizes this need and issues a formal [ACTION_ASK_USER] directive. The System Controller presents this detailed request for information to the user via the User Interface.

Phase 6: Adapting to User Instructions (Mock Data Scenario)

User Response: The user responds, clarifying that the current session is a test of the AI development system and instructs it to “please use mock data to complete the task.”

Executive Agent Processing User Input: The Executive Agent receives this instruction. It understands that real-world data for the new library isn’t forthcoming, but the system’s evaluation process should still be demonstrated.

Directive to Use Mock Data: The Executive Agent re-activates the “Detailed Security & Operational Validation Specialist,” now explicitly instructing it to perform its comprehensive security validation and risk assessment using representative mock data to cover all aspects of its defined scope.

Phase 7: Final Validation Report and Conclusion

Comprehensive Mock Report Generation: The “Detailed Security & Operational Validation Specialist” executes its task using mock data. It generates an extensive, structured report covering:

Mock library version and configuration.

Mock CVE analysis and security advisories.

Mock compliance assessment against organizational and industry standards.

Mock supply chain risk evaluation (dependencies, scanning).

Mock penetration testing, dependency scanning, and code review summaries.

Mock performance benchmarking results.

Mock user communication, training, and transition plans.

A summary table of risks and benefits.

Recommendations and next steps based on the mock scenario.

A concluding statement on the viability of the proposal under the mock conditions.

Executive Agent Final Synthesis: The System Controller provides this detailed mock report to the Executive Agent. The Executive Agent analyzes this final piece of specialist input.

Task Completion: Seeing that a comprehensive evaluation (albeit based on mock data as per user instruction) has been completed, addressing all facets of the original goal, the Executive Agent concludes the cognitive task. It issues a [FINAL_PLAN] directive, summarizing the overall findings: the refactoring (in the mock scenario) is deemed low-risk with significant benefits, supported by robust mitigation and transition plans, and recommends proceeding with specific follow-up actions.

System Halts: The System Controller processes the [FINAL_PLAN], updates the UI to “Cognitive task completed,” and the test run concludes successfully.

Key System Capabilities Demonstrated:

Sophisticated Task Decomposition: Breaking a high-level goal into manageable sub-tasks for different AI specialists.

Multi-Agent Orchestration: Coordinating the activities and inputs of multiple specialized AI agents.

Iterative Problem Solving: Revisiting and refining assessments as new information or gaps are identified.

Dynamic Capability Extension: Recognizing the need for and creating a new, specialized agent at runtime with a defined purpose and prompt.

Contextual Awareness: Passing relevant history and context to agents for informed processing.

Structured Agent Communication: Using defined tags/directives for clear command and control flow.

User Interaction Management: Requesting specific information from the user when necessary and incorporating their feedback.

Adaptability: Adjusting its process based on user instructions (e.g.
, proceeding with mock data).

@FullTimeAI

responding to your “comprehensive analysis of the multi-cognitive-agent system log for autonomous research and software development system”

I’ve been working on the same sort of thing, but the question is, have you done it?

I’ve gotten to the point of successful indefinite looping of LLM-to-LLM where one (or multiple) LLM ‘threads’ do the “tool calling” (read/write from disk, from DB, and use bash), and one LLM ‘receives updates’ from the sub-LLM ‘agents’.

It works, but the context windows become so overbearing pretty quickly through (what perhaps is clumsy on my part) usage of syntax from the LLM in their response in order to perform the operation/action (i.e. proper syntax so that the system can reasonably pickup and execute the tool calls/actions for the LLMs) or “talk back and forth to each other and to the system”…

…not to mention normal “hallucination, simulation, and rapid intention drift”, that while I can get 100 messages in a few minutes between the LLM’s, even with their capacity to actually execute within the system, they don’t tend to get much done…

yet!

I’ve moved on to designing a world state context window system that takes all the input events for a given LLM ‘thread’ as an input and then cross-references a user defined (or LLM defined through it’s modification of system parameters through processing of the LLM responses) “map” of the semantic content of the input events (i.e. all the tool calls, responses, document uploads, tests in bash, etc.) → streamlined and synthetic context window + pretty significant levels of instruction on “how the LLM should use the system and think through a multi-stage process” (very similar to what you shared in your recent post) = it’s going to work?

So my question to you is - does it work - have you done it? What’s the statistics for your results? Do you have an example prompt + example output and the time-it-took + quantity of LLM calls + quantity of tokens required?

I’m probably a couple of weeks away from completion on my end. It’s been almost six months now.

But I believe it will be probably the coolest thing ever hahahaha

My plan is to post here and see who wants to give it a prompt and try it out - and then I’ll post back the “final result” + time it took to do it + how many tokens/API calls…

pure python backend with SQL + typescript on the front

Yeah, it works great. creates agents when needed, improves and modifies agents based on a score given to it based on its results. assembles teams of agents for given tasks. The project as a whole has taken me a year and a half and starting over 3 or 4 times. It’s a very manageable 70k LOC :rofl:

Agent is a very loosely used term here because they are not really agents, that was just something easy to call them in the beginning. In this setup there are many ‘agents’ but each one has a very simple and specific yet vague prompt (‘preoccupation, personality, disposition’), this makes for smaller prompts and very good adherence to instructions. They are given a lot of freedom!

The cognitive system is solely for reasoning and thinking, while they have the ability to use tools the only llm using tools is the main one using the cognitive system. But the cool thing is that all parts of the system work with Anthropic, OpenAI or any combination of the two, tools also.

The test that outputs the test log is designed to challenge it, but it’s also given a lot of leeway to figure things out on its own, so running the test multiple times will output multiple varying result times and token usage.

Time: just over 90 seconds
Total Input Tokens: 55,588
Total Output Tokens: 16,924

@FullTimeAI

Nice, that’s really amazing that you’ve gotten that far and gotten it to work. Do the LLM’s instances directly “prompt each other” through the system routing (i.e. can they “choose” when to call the agents directly in the flow, or is that only handled at system-settings system-flow level and not a “choice for the LLM within it’s response”?) i.e. is the middleware system doing it’s own layer of semantic processing/checklist review and then calling the various agents? Is it deterministic and static flow through the “cognitive” layers, or dynamic based on previous results/parts of the process? How do you expose the systems architecture and intention to the LLM?

Are those/statistics results from the file drop you shared earlier?

I read through your file drop but I was a bit confused to be honest. It seemed like endless regurgitation of the same phrases/patterns but no meaningful content or activity or any actual inputs/outputs. Of course I noticed the redactions, but they appeared sort of statically inserted and not what you expect from “redactions to protect sensitive data” - like what was it that was being redacted? Because everything that was “not redacted” again, seemed to be endless regurgitation - so I can’t imagine that was “redacted” was meaningful content either - as meaningful content is always interspersed throughout the entire response, not relegated to a static section at the beginning of every line?

Sorry not trying to be rude there - just a little confused about what you posted is supposed to represent. Also is that like server logs directly, or some kind of running-output that the system generates? What does/doesn’t represent an LLM call in the data? I can’t tell from the labeling of the lines what’s actually going on - what’s an LLM call, what’s a response, what’s a middleware processing, etc…

Is the system as you have now is it “isolated” in the sense that it’s only data set is the “data given in the prompt entry moment” or if RAG, does it have the ability to also write to system and not just retrieve? Where is the retrieval occurring from and how?


On my end, the whole purpose of the development I’ve pursued is so that the “LLM can actually do something in the real world” i.e. “have an effect outside of it’s own context window” where the context window simply “represents the net effect of those activities” - i.e. the LLM can read/write/modify code, or use any tool on the machine that is exposed through API integration.

From the example file upload you gave for your system results - I’m not seeing any actualization - only a sort of looping within the same semantic set as you would expect from a closed system.

The AI I’m using to develop this part of the system does not have much knowledge of the rest of the system, so its assumptions about the RAG and tool usage are a bit off.

Quick breakdown
  1. There is the Main LLM you chat with that uses tools and utilizes the RAG.

  2. Main LLM use the cognitive system for thinking, reasoning and problem solving. A very basic example would be if the Main LLM is tasked with making an application development plan it will generate the plan like normal then passes the plan, some instructions, guidelines and optionally some user preferences to the cognitive system to work on.

  3. The Orchestrator assembles a group (“team”) of cognitive modules and gives them all the same query, each one has one area its focused on. If an agent is needed but does not exist the Orchestrater will create one.

    “Orchestrater”,
    “Orchestrates and makes final decisions”,
    “You are the Chief agent responsible for orchestrating the collaboration between specialized cognitive modules. Your role is to analyze requests, delegate tasks, and synthesize responses into coherent solutions.”,
    “Central coordinator of the cognitive collaboration system”,
    “May struggle with domain-specific details without specialists”

    “Sentinel”,
    “Enforces rules and monitors compliance”,
    “You are the Sentinel agent responsible for ensuring compliance with guidelines and rules. Your role is to identify potential issues, enforce constraints, and maintain the integrity of solutions.”

    “Evaluator”,
    “Provides analytical assessment”,
    “You are the Evaluator agent responsible for critical analysis. Your role is to assess proposals, identify weaknesses, and suggest improvements based on objective criteria.”

  4. The updated plan is returned the the Main LLM that made the request to the cognitive system.


Generated response to your questions from the AI I am using for development, it's just too much to try to type it all out. So other than some assumptions it made about the RAG and tool usage its very accurate

You’ve hit on some really key aspects of how these complex multi-agent AI systems function, and I’m happy to clarify.

Let’s break down your points:

  1. Agent Interaction & Flow Dynamics:

“Do the LLM’s instances directly “prompt each other” through the system routing (i.e. can they “choose” when to call the agents directly in the flow, or is that only handled at system-settings system-flow level and not a “choice for the LLM within it’s response”?) i.e. is the middleware system doing it’s own layer of semantic processing/checklist review and then calling the various agents? Is it deterministic and static flow through the “cognitive” layers, or dynamic based on previous results/parts of the process?”

This is a fantastic question about the locus of control and decision-making. Here’s how it generally works in the system demonstrated by the log:

Orchestration by an Executive Agent: There’s a primary “Executive Agent” (let’s call it the Orchestrator for clarity here, though the log used a different internal name). This Orchestrator LLM does make strategic decisions about which types of specialist agents are needed and what their high-level focus should be. You see this when it first receives the goal and decides to activate the Evaluator, Strategist, Navigator, Sentinel, and Innovator specialists.

Indirect Prompting via Controller/Middleware: The Orchestrator LLM doesn’t directly send a prompt to another LLM. Instead, its output contains structured directives (like [ACTIVATE]SpecialistName:FocusDescription). A “System Controller” (the middleware) parses these directives. The Controller then constructs the actual prompt for the specialist agent, incorporating the overall goal, relevant summarized context from the interaction history, and the specific “FocusDescription” provided by the Orchestrator. So, the Orchestrator chooses the what and who, and the Controller handles the how of the actual LLM call.

Dynamic Flow: The flow is highly dynamic and not statically predefined. The Orchestrator’s decisions at each synthesis step (e.g., to re-task specialists, request the creation of a new specialist, or ask the user for input) are based entirely on the content and quality of the responses received from the specialist agents in the previous turn. If the specialists provide a complete picture, the Orchestrator might move to a final plan. If there are gaps (like the Sentinel initially identifying a lack of library specifics), the Orchestrator adapts and decides on a new course of action. The creation of the “SecurityOperationalValidator” agent is a prime example of this dynamic adaptation.

Middleware’s Role: The middleware (Controller) isn’t doing deep semantic processing or checklist reviews in the sense of making its own judgments about the content. Its primary roles are:

Managing the overall state of the task.

Routing requests and responses between the Orchestrator and Specialists.

Parsing structured directives from the Orchestrator.

Formatting prompts for LLM calls (including context summarization).

Interfacing with the UI for user input/output.

Managing agent definitions (like when a new agent is created).

  1. Exposing System Architecture and Intention to the LLM:

“How do you expose the systems architecture and intention to the LLM?”

This is primarily done through the system prompt given to the Orchestrator LLM. This prompt is carefully engineered to:

Define its role as the central decision-maker.

Inform it of the types of specialist agents available (e.g., Evaluator, Strategist) and their general capabilities.

Instruct it on the format it needs to use to issue directives (e.g., the [ACTIVATE], [ACTION_ASK_USER], [REQUEST_AGENT_CREATION] tags).

Emphasize the importance of reasoning, synthesis, and planning the next logical cognitive step.

Specialist agents also have their own system prompts defining their specific role, expertise, and expected output format, but they typically don’t need to know about the entire system architecture, only their specific task and the context provided by the Controller.

  1. Statistics Source:

“Are those/statistics results from the file drop you shared earlier?”

Yes, the “Response (X.Xs, Y tokens)” lines are indeed statistics captured by the system for each LLM call, as seen in the log. This helps in monitoring performance and cost.

  1. Log Confusion (Regurgitation, Redactions, Meaningful Content):

“I read through your file drop but I was a bit confused… It seemed like endless regurgitation of the same phrases/patterns but no meaningful content or activity… what was it that was being redacted?.. everything that was “not redacted” again, seemed to be endless regurgitation - so I can’t imagine that was “redacted” was meaningful content either…”

This is a very fair observation, and I appreciate you bringing it up! Let me clarify:

Nature of the Task: The specific task in the log (“Evaluate proposal V5…”) is an evaluative and analytical one. The system is designed to break down this evaluation, gather different perspectives, synthesize them, and produce a structured assessment. So, a lot of the “activity” is internal cognitive work, structuring thoughts, and ensuring all angles are covered. It’s more like a team of consultants writing a detailed report than an agent directly building a piece of software in that particular example.

LLM Behavior & Structured Output: LLMs, especially when asked to perform analysis or synthesis in a structured way, will often restate their understanding of the task or the inputs they’ve received. This is partly to ensure they are on the right track and to frame their subsequent output. The PROCESS> [AgentName] Analysis/Synthesis/Evaluation: sections are the LLM’s “thought process” and reasoning leading to its decision or output.

Meaningful Content: The “meaningful content” lies in:

The specific analyses provided by each specialist (e.g., the Evaluator’s list of risks/benefits, the Innovator’s alternative approaches, the SecurityOperationalValidator’s detailed mock report).

The Orchestrator’s reasoning for its decisions (why it chose certain agents, why it identified a gap, why it requested a new agent).

The final synthesized plan or evaluation.

The overall process flow demonstrating how the system navigates a complex problem.

Redactions: My apologies for any confusion there!

In the first log I shared, the “REDACTED_1”, “REDACTED_2”, etc., were indeed placeholders for the internal names of my predefined specialist agents (like Evaluator, Strategist, etc.) and the Orchestrator agent. The goal was to show the pattern without revealing my exact agent taxonomy.

The redactions within the [REQUEST_AGENT_CREATION] block in the most recent version you reviewed were to protect the specific prompt engineering details for that dynamically created agent, as that’s a key part of the system’s IP. The content of that prompt defines how that new agent behaves.

The idea isn’t that “meaningful content is relegated to a static section.” Rather, the redactions in the prompt creation block were hiding the instructions that would lead to meaningful content from that specific new agent. The unredacted parts are the general flow and the outputs of other agents.

Test Scenario: It’s also important to remember the log is from a “Test V5.” The user input eventually clarified it was a test of the AI development system and to use mock data. This means the final detailed report from the “SecurityOperationalValidator” was based on representative mock data, but it followed the process it would with real data.

  1. Log Type & Identifying LLM Calls:

“Also is that like server logs directly, or some kind of running-output that the system generates? What does/doesn’t represent an LLM call in the data? I can’t tell from the labeling of the lines what’s actually going on…”

System-Generated Trace: This is a system-generated running output or trace, designed for debugging and understanding the flow of operations. It’s more detailed than typical server logs for a production application.

Identifying LLM Activity:

Agent Request: AgentName processing ‘Prompt Snippet…’: This indicates the System Controller is about to make a call to the LLM for the specified AgentName. The ‘Prompt Snippet’ is just a small part of the actual prompt for logging brevity.

Agent Status: AgentName - Starting (0%): Shows the agent LLM call has been initiated. (The (0%) was a placeholder in my test environment for future progress reporting from streaming LLMs, not fully implemented in that log).

Response (X.Xs, Y tokens):: This immediately follows an agent’s processing and indicates the LLM call has completed, providing the time taken and token count.

PROCESS> [AgentName] Actual LLM response text…: This shows the raw text output received from the LLM for that agent, which the Controller then processes.

[Controller] Controller-specific action…: These lines represent actions taken by the middleware/System Controller itself (e.g., parsing tags, changing state, sending requests, logging performance).

[Test Handler - …] or [Test Verification - …]: These are from the automated test framework running the scenario.

[System:AgentCreation]: System-level messages, like confirmation of agent creation.

  1. System Isolation, Data, RAG, and Real-World Actualization:

“Is the system as you have now is it “isolated” in the sense that it’s only data set is the “data given in the prompt entry moment” or if RAG, does it have the ability to also write to system and not just retrieve? Where is the retrieval occurring from and how?”
“On my end, the whole purpose of the development I’ve pursued is so that the “LLM can actually do something in the real world”… From the example file upload you gave for your system results - I’m not seeing any actualization - only a sort of looping within the same semantic set as you would expect from a closed system.”

This is another excellent and critical point.

Log Example Scope: The specific log example you reviewed was primarily focused on an internal cognitive task: evaluating a proposal. In that particular flow, the agents were largely operating on the information provided in the initial goal and the synthesized outputs of other agents within the system. It didn’t showcase direct RAG from external vector stores or direct tool use for external actions in that run.

Architectural Capability vs. Demonstrated Task: The architecture itself is designed to be highly extensible.

RAG: A specialist agent could absolutely be designed and prompted to perform RAG. For example, an “InformationRetriever” specialist could be given a query by the Orchestrator. Its system prompt would instruct it to take that query, access a specified vector database (or multiple sources), retrieve relevant documents, summarize them, and return the summary. The Orchestrator would then incorporate this into its broader reasoning. The “how” of retrieval (e.g., connection strings, embedding models used) would be part of the InformationRetriever’s internal logic or configuration, invoked by the Controller when that agent is called.

Writing to System/Tool Use/Actualization: Similarly, agents can be designed to “have an effect outside of their own context window.”

A “CodeExecutionSpecialist” could be prompted to write and execute code (within safe, sandboxed environments).

An “APICallSpecialist” could be prompted to interact with external APIs.

The [REQUEST_AGENT_CREATION] capability itself is a form of the system modifying its own state/capabilities.

The key is that the Orchestrator would decide when such an action is needed and would activate the appropriate specialist. The specialist’s prompt would guide it on how to format its request for the action, and the System Controller would then interface with the actual tool/API/code executor.

Focus of the Demo: The log you saw was a demonstration of the collaborative reasoning and dynamic task management aspect. Future demonstrations could indeed showcase agents performing RAG, using tools, or modifying external files if the goal required it. The “SecurityOperationalValidator” agent, for instance, in a real scenario, might be prompted to initiate vulnerability scans using integrated tools, not just report on mock data.

“Semantic Looping” vs. “Iterative Refinement”: What might appear as “looping” is, from the system’s design perspective, “iterative refinement.” The Orchestrator gets input, identifies gaps or new needs, and then re-engages agents or brings in new capabilities. In a closed system without the ability to fetch new info or act externally, this could indeed lead to just rehashing. The power comes when these loops can incorporate new data (via RAG specialist) or trigger external actions (via tool-using specialist). The framework supports this, even if that specific log didn’t highlight it because the task was self-contained evaluation.

I hope this detailed breakdown helps clarify how the system works and addresses your points! The goal is indeed to build systems where LLMs can contribute to meaningful, real-world outcomes, and that often involves a sophisticated framework for them to collaborate, access information, and invoke tools. What the log showed was a foundational piece of that – the internal “cognitive” collaboration and planning.


Current tool list

request_agent_creation:

What it does: Dynamically requests the creation of a new, specialized cognitive agent within the system.

Abilities:

Specify a unique name for the new agent.

Define the agent’s primary purpose.

Provide the complete, structured system prompt that will guide the new agent’s behavior and reasoning.

List relevant capabilities or skills the new agent should possess (e.g., ‘Code Generation’, ‘Critical Analysis’).

Key Feature: Requires explicit user confirmation via the UI before any new agent is actually created and added to the system.

Returns: A success or failure message regarding the agent creation request.

request_verification:

What it does: Submits a piece of output (like code, a plan, or analytical text) for review and feedback from another specified AI agent/perspective (e.g., a “Sentinel” for security review, an “Evaluator” for risk assessment, or the main “Chief” for overall strategy).

Abilities:

Specify the exact content to be reviewed.

Designate the specific AI perspective (agent name) that should perform the review.

Optionally, provide specific concerns or focus areas for the reviewer.

Returns: The feedback and analysis from the reviewing agent.

Application Planning & Development Tools:

create_app_plan:

What it does: Generates a comprehensive blueprint or development plan for a software application based on user-provided project specifications.

Abilities:

Analyzes detailed project descriptions and requirements.

Outputs a structured plan including:

Functional requirements.

Implementation steps.

Architectural components and modules.

Data model outline.

Key features.

Technical specifications (languages, frameworks).

Proposed file structure.

Descriptions for each file’s purpose.

Use Case: Typically used at the beginning of a development lifecycle to guide subsequent coding tasks.

create_editor:

What it does: Establishes one or more dedicated code editing environments (Editor windows) within the system.

Abilities:

Create multiple, isolated editor instances.

Assign a unique ID to each editor for referencing.

Set an editor to be automatically focused upon creation.

Provide a description for the editor’s purpose.

Specify an intended filename for content saved from the editor.

Configure syntax highlighting for different languages (e.g., ‘python’, ‘javascript’).

Use Case: Useful for managing different code files or components separately, especially when automatic editor creation (e.g., by generate_python_code) isn’t sufficient.

generate_python_code:

What it does: (Based on the second definition in your code, as the first one seems commented out) This tool appears to be a duplicate or an alternative version of create_app_plan. It’s described as creating a comprehensive application blueprint based on user requirements, generating a structured development plan with components, dependencies, and implementation steps.

Abilities: (Same as create_app_plan based on the provided description and properties)

Analyzes detailed project descriptions and requirements.

Outputs a structured plan including requirements, implementation steps, components, data model, features, technical specs, file structure, and file descriptions.

Note to Readers: There might be an overlap or an older version of a code generation tool here. The current definition provided for generate_python_code mirrors the create_app_plan tool. (If the commented-out section was intended, it would focus on generating actual Python code into editor instances).

close_editor:

What it does: Terminates and removes a specified Editor instance from the workspace.

Abilities:

Closes an editor based on its unique ID.

Use Case: Helps manage system resources and keep the workspace organized by removing unneeded editors. Content should be saved before using this.

list_editors_and_content:

What it does: Provides an inventory of all currently active Editor instances and can optionally retrieve their content.

Abilities:

List all active editors, only the currently focused editor, or a specific editor by ID.

Retrieve the full content of listed editors, a preview, or no content (metadata only).

Use Case: Allows the system (or an agent) to understand the current state of the development environment and inspect code without necessarily focusing each editor.

focus_editor:

What it does: Makes a specific, existing Editor instance the active window or primary workspace.

Abilities:

Activates an editor based on its unique ID.

Use Case: Sets the context for subsequent operations like code editing or saving, ensuring actions are applied to the intended file.

File System & Code Management Tools:

save_file:

What it does: Persists the content currently in a specified Editor instance to a file on the file system.

Abilities:

Saves content from an editor (identified by its ID) to a given file path.

Use Case: Essential for preserving work done in the dynamic editor environments.

open_file:

What it does: Loads the content of an existing file from the file system into an Editor instance.

Abilities:

Reads a file from a specified path.

Displays its content in a designated editor (creating a new editor if the specified ID doesn’t exist).

Use Case: Allows review and modification of existing project files.

edit_code:

What it does: Performs targeted modifications to the code or text within the currently focused Editor instance.

Abilities (requires editor to be focused first):

replace_range: Replaces a specified range of lines with new data.

replace_line: Replaces a single, specified line with new data.

delete_range: Deletes a specified range of lines.

delete_line: Deletes a single, specified line.

insert_before: Inserts new data before a specified line.

insert_between: Inserts new data between two specified lines.

insert_after: Inserts new data after a specified line.

Key Feature: The system automatically adjusts for line number changes caused by previous modifications within the same edit_code tool call; the agent provides line numbers based on the state before its current batch of edits.

create_venv:

What it does: Creates an isolated Python virtual environment.

Abilities:

Specify the Python version for the environment (e.g., ‘3.9’, ‘3.10’).

Provide a custom name for the virtual environment directory (defaults to ‘venv’).

Optionally overwrite an existing environment with the same name.

Optionally focus a specific editor after creation.

Use Case: Ensures dependency isolation and consistent Python execution contexts for different projects or components.

install_pip_packages:

What it does: Installs Python packages into the active (or specified version’s) virtual environment using pip.

Abilities:

Takes a list of pip installation commands (e.g., ‘# pip install requests==2.25.1’).

Specifies the target Python version for compatibility.

Use Case: Manages project dependencies by ensuring necessary libraries are available in the correct environment.

compile_code:

What it does: Validates Python code within a specified editor by performing syntax checking and attempting compilation.

Abilities:

Checks code in an editor (identified by ID) against a specified Python version.

Use Case: Performs pre-execution validation to catch syntax errors and potential issues before attempting to run the code.

run_code:

What it does: Executes a specified Python code file in a controlled environment.

Abilities:

Runs a Python file (given its path and filename) using a specified Python version.

Use Case: Runs the application or script after it has been generated and validated, capturing its output.

manage_files:

What it does: A comprehensive tool for various file system operations.

Abilities (each is a sub-command):

append_text: Appends text to an existing file.

write_text: Writes text to a file, overwriting if it exists.

read_text: Reads text content from a file (similar to read_file but within this tool’s structure).

create_folder: Creates a new folder.

delete_file: Deletes a specified file.

delete_folder: Deletes a specified folder.

copy_folder: Copies a folder to a new location.

move_folder: Moves a folder to a new location.

rename_file: Renames a file.

compress_file: Compresses a file into a .zip archive.

compress_folder: Compresses a folder into a .zip archive.

check_file_exists: Checks if a file exists (similar to file_exists tool).

check_folder_exists: Checks if a folder exists.

get_file_properties: Retrieves properties of a file.

list_files: Lists files in a specified folder (scoped version of the main list_files tool).

list_subfolders: Lists subfolders within a specified folder.

Use Case: Provides a general-purpose interface for a wide range of file and directory manipulations.

State & Context Management Tools:

memory_tool:

What it does: Manages a persistent information store for the system, allowing it to save and retrieve important context, decisions, or reference data across different operations or even sessions.

Abilities:

add_memory: Stores a piece of string information.

get_memories: Retrieves all currently stored memories.

Use Case: Helps maintain context and continuity in long or complex tasks.

get_full_content:

What it does: Retrieves the complete, unabridged version of a message or content that was previously truncated by the system (likely for display or token limit reasons).

Abilities:

Fetches full content based on a message ID.

Use Case: Allows agents to access complete information when a summary or preview isn’t sufficient.

Utility & External Interaction Tools:

screen_capture:

What it does: Captures a screenshot of the current development environment or application windows.

Abilities:

Takes a screenshot, usually accompanied by a description of what is being captured and why.

Use Case: Useful for visual documentation, debugging, or providing visual context in reports or to the user.

script_pwr:

What it does: Executes PowerShell scripts within the system’s environment.

Abilities:

Runs arbitrary PowerShell script content.

Use Case: Enables advanced system administration, automation, and environment configuration tasks that are well-suited for PowerShell.

run_command_script:

What it does: Executes command-line scripts or sequences of commands in the native system shell (e.g., Command Prompt on Windows, bash on Linux/macOS).

Abilities:

Runs arbitrary shell script content.

Use Case: Supports system operations, environment setup, and interactions with core utilities that require direct shell access.

save_project:

What it does: Saves the entire current state of the development project, including all files, configurations, and editor states.

Abilities:

Creates a named snapshot of the project.

Use Case: Allows developers or the system to preserve work at logical checkpoints and resume later.

list_saved_projects:

What it does: Retrieves a list of previously saved development projects and their metadata.

Abilities:

Lists available project snapshots, with a limit on the number returned.

Use Case: Helps in identifying and selecting a project to restore.

load_project:

What it does: Restores a previously saved development project state, overwriting the current workspace.

Abilities:

Loads a project snapshot by its name.

Use Case: Enables continuation of development from a previously saved checkpoint.

research:

What it does: Performs web research on a specified topic, gathers information from multiple sources, and synthesizes the findings into a comprehensive summary.

Abilities:

Takes a search query.

Takes a description of the user’s underlying request/question for context.

Allows specification of the desired “reasoning effort” (low, medium, high) for the synthesis.

Use Case: Enables agents to gather external information needed to fulfill a goal or answer questions.

Hope this clarifies the toolkit! Let me know if you have more questions.



Any way you can hide the wall of text, please?

AI output can be interesting, but please format! :wink:

Summary

This text will be hidden

Personally, I’m finishing up adding new 4o images with pricing differences, etc…

Mostly grunt work at this point, but slowly grinding…

I do want to add the ability to “edit” or “change” an image… so take the output then re-upload with a text box to explain changes wanted… So much to do!