Are AI agents truly autonomous, or are we moving too fast?

The artificial intelligence industry is going through a phase of profound maturation, moving from polished demos to the raw problems of production deployment. The data that emerged this week draws a sharp line between theoretical enthusiasm and the operational reality of enterprise systems. On one hand, the ability of machines to analyze code has reached such levels that it undermines the very foundations of open source software; on the other hand, large language models show clear limitations when left to operate without supervision in complex IT environments.

The transition towards agent-dominated workflows requires a paradigm shift in governance and solution architecture. It is no longer about evaluating how intelligent a model is in the abstract, but about measuring its reliability within structured and repeatable processes.

Artificial intelligence becomes the biggest security auditor

The Glasswing project by Anthropic has applied the new Claude Mythos Preview model to open source code worldwide, sending shockwaves through the cybersecurity sector. In just one month of activity, the system identified over ten thousand potential flaws, of which 1,094 were confirmed as critical vulnerabilities by human analysis. The alarming figure is the response capacity: developers managed to ship patches for only 97 of these critical issues, roughly 9%.

This gap creates an unsustainable asymmetry: artificial intelligence finds bugs faster than the traditional software ecosystem can fix them. Among the findings is a critical flaw, with a CVSS score of 9.1, in WolfSSL, a TLS library widely used in automotive and IoT systems. Tools like these turn the work of security teams from manual research into a desperate triage process. The old logic of quarterly patching is obsolete, and the market is forced to adapt to the speed of machines. The real hurdle is automating the resolution, a theme that recalls the urgency to integrate offensive security and operational voice directly into development pipelines.
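The triage problem described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of how Glasswing or any security team actually works: the `Finding` type, the component names, and the patch flags are invented for the example. Only the 9.1 score and the 97 out of 1,094 counts come from the figures cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One confirmed vulnerability report (all fields are illustrative)."""
    component: str
    cvss: float            # CVSS base score, 0.0 to 10.0
    patch_available: bool  # hypothetical flag, not taken from the article


def triage(findings: list[Finding]) -> list[Finding]:
    """Return the unpatched findings, highest CVSS score first."""
    unpatched = [f for f in findings if not f.patch_available]
    return sorted(unpatched, key=lambda f: f.cvss, reverse=True)


def patch_rate(patched: int, confirmed: int) -> float:
    """Share of confirmed findings that already have a patch."""
    return patched / confirmed if confirmed else 0.0


# Hypothetical sample data used only to exercise the functions.
findings = [
    Finding("internal-json-parser", 5.3, False),
    Finding("tls-library", 9.1, False),
    Finding("image-resizer", 7.8, True),
]

queue = triage(findings)
print([f.component for f in queue])   # ['tls-library', 'internal-json-parser']
print(f"{patch_rate(97, 1094):.1%}")  # 8.9%
```

Sorting by score alone is the simplest possible policy. A real queue would probably also weigh exposure and exploitability, which this sketch leaves out.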

In response to this scenario, Anthropic is preparing to launch Claude Mythos 1 to a wider audience, integrating it natively into Claude Code and Claude Security. Having a specialized model for vulnerability research means drastically reducing the bottlenecks linked to manual code review. The new dashboard metrics in Claude Security, which track the history of scans at 7 and 30 days, finally provide the necessary tools to demonstrate the technical impact and return on investment to enterprise clients.

The real challenge is energy efficiency, not abstract ethics

The debate on artificial intelligence has reached the highest global institutions with a new Vatican encyclical, which compares the current technological transition to the industrial revolution. Philosophical questions aside, the document highlights a fundamental technical paradox linked to hardware and resources. While the industry designs data centers in space to power increasingly compute-hungry models, the human brain continues to operate consuming less energy than a light bulb.

The inference cost and energy footprint of large language models remain colossal obstacles to large-scale production deployment. The industry's focus must shift to extreme optimization, pushing the adoption of smaller models and lightweight frameworks. Building data centers in orbit is a brute-force answer to the energy problem. Real progress comes from extracting business value by moving computation to the edge, towards an agentic revolution that imitates the efficiency of biological systems.

The reality check for autonomous agents

The new Gartner report issues an unequivocal warning to enterprises: by 2027, 40% of projects involving autonomous AI agents will be scaled back or abandoned. The main cause lies in the flawed governance models adopted by IT teams, which apply rigid rules designed for legacy software to complex agentic systems. Treating artificial intelligence in a binary way, either completely blocked or totally free, produces disastrous outcomes: on one hand innovation is suffocated, on the other automation scripts are left free to cause incalculable damage. The way out isn't technical, it's organizational: designing AI governance for an agentic system means deciding up front, before any code is written, which calls the agent can make alone, which require a human in the loop, and which must be blocked by design.
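One way to make that up-front decision concrete is a policy table that the agent runtime consults before every tool call. The sketch below is an assumption about how such a table could look, not something taken from the Gartner report: the tool names and the `decide` helper are hypothetical, and defaulting unknown calls to "block" is a design choice made for this example.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"              # the agent may make this call alone
    NEEDS_HUMAN = "needs_human"  # a person must approve the call first
    BLOCK = "block"              # forbidden by design


# Hypothetical policy table, agreed before any agent code is written.
POLICY: dict[str, Decision] = {
    "read_ticket": Decision.ALLOW,
    "restart_service": Decision.NEEDS_HUMAN,
    "drop_database": Decision.BLOCK,
}


def decide(tool_name: str) -> Decision:
    """Look up a tool call; anything not listed is blocked by default."""
    return POLICY.get(tool_name, Decision.BLOCK)


print(decide("restart_service").value)  # needs_human
print(decide("unknown_tool").value)     # block
```

Keeping the table as plain data means it can be reviewed by people who do not write code, which fits the organizational nature of the decision.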

To solve the problem, Gartner proposes a classification into four levels of operational autonomy: "observe", "advise", "act with approval" and "act autonomously". An agent tasked with data entry calls for human approval through a dashboard, while fully autonomous systems need circuit breakers at the code level. Complete audit trails and measurable automations become a non-negotiable requirement.
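The four levels, the circuit breaker, and the audit trail can be combined in a small wrapper around whatever the agent wants to execute. This is a minimal sketch under stated assumptions: the `GuardedAgent` class, the outcome strings, and the failure threshold of three are all hypothetical, and a production system would persist the log rather than keep it in memory.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable


class Autonomy(IntEnum):
    OBSERVE = 1
    ADVISE = 2
    ACT_WITH_APPROVAL = 3
    ACT_AUTONOMOUSLY = 4


@dataclass
class GuardedAgent:
    level: Autonomy
    max_failures: int = 3  # assumed circuit-breaker threshold
    failures: int = 0
    audit_log: list[dict] = field(default_factory=list)

    def run(self, action: str, do: Callable[[], None], approved: bool = False) -> str:
        """Execute `do` only if the autonomy level and the breaker allow it."""
        if self.failures >= self.max_failures:
            outcome = "breaker_open"       # too many failures: refuse to act
        elif self.level == Autonomy.OBSERVE:
            outcome = "observed"           # record only, never act
        elif self.level == Autonomy.ADVISE:
            outcome = "suggested"          # propose the action, never act
        elif self.level == Autonomy.ACT_WITH_APPROVAL and not approved:
            outcome = "awaiting_approval"  # wait for a human decision
        else:
            try:
                do()
                outcome = "executed"
            except Exception:
                self.failures += 1         # count towards the breaker
                outcome = "failed"
        # Every request is logged, whether or not it was executed.
        self.audit_log.append(
            {"action": action, "level": self.level.name, "outcome": outcome}
        )
        return outcome
```

A data entry agent would run at `ACT_WITH_APPROVAL` and receive `approved=True` only after someone confirms the change in a dashboard. An agent at `ACT_AUTONOMOUSLY` stops acting once `max_failures` is reached.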

Technical Insight

Confirming this need for control, Hugging Face and IBM Research published the results of ITBench-AA, the first rigorous benchmark to evaluate the capabilities of AI agents in enterprise contexts. The data shows that every current frontier model falls short of a 50% success rate when executing real IT tasks in total autonomy. The evaluation highlights the inability of today's models to maintain long-term context without constant human supervision.

The real bottleneck is not pure code generation, but the multi-step reasoning capacity on heterogeneous architectures.

These results are extremely healthy for the sector, marking the fall of chaotic agents in favor of an engineering approach based on frameworks with limited scope and highly specialized tools.

Agent swarms and model rationalization

The evolution of multi-agent orchestration takes a leap forward with the release of Claude Opus 4.8 by Anthropic. The model introduces "dynamic workflows", a native tool to coordinate swarms of sub-agents in total autonomy. Metrics indicate a reduction in code errors and an unprecedented propensity of the model to admit its own limitations. This ability to openly declare doubts radically changes the debugging phase, allowing the insertion of the model into CI/CD pipelines with the certainty that it will ask for confirmation before corrupting the output. The native integration of these functions instantly makes many third-party orchestration frameworks obsolete.

In parallel, OpenAI has updated GPT-5.5 Instant, completely removing the "canvas" function from its most recent models. Complex writing and code generation tasks now occur natively in the chat flow. This choice confirms that solid context management makes fragmented workspaces useless, allowing focus to be maintained in the main dialog window.

The company also announced the definitive shutdown of servers for the legacy o3 and GPT-4.5 models by August 2026. This aggressive catalog cleanup recalls a fundamental rule: building applications that depend heavily on the quirks of a specific model version generates devastating technical debt. It becomes imperative to implement layered architectures that can swap the underlying model in a few minutes via simple environment variables.
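A layered architecture of this kind can be as simple as an interface plus a registry keyed by an environment variable. The sketch below is illustrative only: the `APP_MODEL` variable, the registry keys, and the `EchoModel` stand-in are hypothetical, and real entries would wrap each vendor's SDK behind the same `complete` method.

```python
import os
from typing import Callable, Protocol


class ChatModel(Protocol):
    """The only surface the application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend so the sketch runs without network access."""
    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Hypothetical registry mapping a configuration key to a backend factory.
REGISTRY: dict[str, Callable[[], ChatModel]] = {
    "primary": lambda: EchoModel("primary"),
    "fallback": lambda: EchoModel("fallback"),
}


def load_model(env_var: str = "APP_MODEL", default: str = "primary") -> ChatModel:
    """Pick the backend named by the environment variable, or the default."""
    key = os.environ.get(env_var, default)
    if key not in REGISTRY:
        raise ValueError(f"unknown model key {key!r}; expected one of {sorted(REGISTRY)}")
    return REGISTRY[key]()
```

With this layout, retiring a model means adding a new registry entry and changing one variable in the deployment configuration. The application code that calls `complete` stays untouched.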

Weekly radar: flash news and practical tools

The ecosystem continues to move at a fast pace amidst price drops, new frameworks, and market adjustments. Here is a summary of the most relevant elements that emerged in recent days.

News to watch:

Token price war: DeepSeek slashes costs, offering output tokens 34 times cheaper than GPT-5.5 and redefining margins for those building applications at scale.
Search revolution: Google transforms links into simple secondary components of its AI search, an epochal change that is pushing startups like Peec AI to reach 10 million in ARR by providing specific SEO services for LLM-generated results.
Impact on work: ClickUp cut 22% of its workforce with the goal of replacing them with autonomous agents, while McKinsey launches a free AI tool for interviews, putting strong pressure on the private coaching market.
Hardware and infrastructure: Snowflake invests 6 billion on AWS to consolidate its agentic infrastructure, and Google unveils Coral Board to run Gemma 3 directly locally. Meanwhile, semantic routers are establishing themselves as a solution to drastically reduce token consumption.
Record valuations: Anthropic surpasses OpenAI reaching a valuation of 965 billion dollars, driven by heavy contracts in the enterprise sector, including the opening of new offices in Europe.

Tools and frameworks for operations:

Ellf AI (Beta): an emerging platform to develop agentic NLP solutions. It works as a specialized virtual assistant to structure information extraction pipelines and pairs perfectly with coding assistants.
AGENTS.md: an emerging standard for documenting the limits, capabilities, and expected behaviors of AI agents within source code repositories, fundamental for corporate governance.
Amazon Bedrock AgentCore: a managed runtime designed specifically to orchestrate and track serverless multi-agent systems, ideal for those who need to scale infrastructure without managing servers.
Claude Code Organizer: a comprehensive dashboard to monitor and manage memories, local configurations, and MCP servers linked to Claude Code, essential for maintaining order in agent-dominated workspaces.
Transformers.js: brings NLP models directly into the browser, managing client-side classification and RAG, drastically lightening the load on central servers and reducing infrastructure costs.
Data Formulator 0.7: the intelligent workspace by Microsoft to explore complex enterprise data relying on AI agents dedicated to structured analysis.



Fabrizio Mazzei
