The reality check for AI agents and the automated security revolution
INSIGHT #24
SundAI Blog

5/31/2026 · 7 min read
TL;DR

"Gartner reports and IT benchmarks shatter the illusions of absolute autonomy. While Claude Mythos disrupts cybersecurity, the market must refocus on governance and reliability."

The artificial intelligence industry is going through a phase of profound maturation, moving from polished demos to the raw problems of production deployment. The data that emerged this week draws a sharp line between theoretical enthusiasm and the operational reality of enterprise systems. On one hand, the ability of machines to analyze code has reached such levels that it undermines the very foundations of open source software; on the other hand, large language models show clear limitations when left to operate without supervision in complex IT environments.

The transition towards agent-dominated workflows requires a paradigm shift in governance and solution architecture. It is no longer about evaluating how intelligent a model is in the abstract, but about measuring its reliability within structured and repeatable processes.

Artificial intelligence becomes the biggest security auditor

The Glasswing project by Anthropic has applied the new Claude Mythos Preview model to global open source code, sending a shockwave through the cybersecurity sector. In just one month of activity, the system identified over ten thousand potential flaws, with 1,094 critical vulnerabilities confirmed by human analysis. The alarming figure is the response capacity: developers managed to patch only 97 of these critical issues, roughly 9% of the confirmed total.

This gap creates an unsustainable asymmetry: artificial intelligence identifies bugs faster than the traditional software ecosystem can physically handle them. Among the problems that emerged, one stands out: a critical flaw in WolfSSL, a TLS library widely used in automotive and IoT systems, with a CVSS score of 9.1. These tools transform the work of security teams from manual research into a desperate triage process. The old logic of quarterly patching is completely outdated, and the market is forced to adapt to the speed of machines. The real hurdle is automating the fix itself, a theme that echoes the urgency of integrating offensive security and operational voice directly into development pipelines.
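To make the triage idea concrete, here is a minimal sketch of how a team might order open findings by CVSS score. It assumes the standard CVSS v3 qualitative bands (critical from 9.0 upwards); the `Finding` class, the `triage` function and every component name other than WolfSSL are hypothetical placeholders, not part of any Anthropic tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one confirmed vulnerability."""
    component: str
    cvss: float
    patched: bool = False


def severity(cvss: float) -> str:
    """Map a score to the CVSS v3 qualitative severity bands."""
    if cvss >= 9.0:
        return "critical"
    if cvss >= 7.0:
        return "high"
    if cvss >= 4.0:
        return "medium"
    if cvss > 0.0:
        return "low"
    return "none"


def triage(findings: Iterable[Finding]) -> List[Finding]:
    """Return unpatched findings, highest CVSS score first."""
    open_findings = [f for f in findings if not f.patched]
    return sorted(open_findings, key=lambda f: f.cvss, reverse=True)


if __name__ == "__main__":
    # Only the WolfSSL score comes from the article; the rest is made up.
    backlog = [
        Finding("lib-a", 5.3),
        Finding("WolfSSL", 9.1),
        Finding("lib-b", 9.8, patched=True),
    ]
    for f in triage(backlog):
        print(f.component, f.cvss, severity(f.cvss))
```

A real pipeline would likely weigh exploitability and exposure as well as the raw score, but even this simple ordering makes the backlog visible.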

In response to this scenario, Anthropic is preparing to launch Claude Mythos 1 to a wider audience, integrating it natively into Claude Code and Claude Security. Having a specialized model for vulnerability research means drastically reducing the bottlenecks linked to manual code review. The new dashboard metrics in Claude Security, which track the history of scans at 7 and 30 days, finally provide the necessary tools to demonstrate the technical impact and return on investment to enterprise clients.

The real challenge is energy efficiency, not abstract ethics

The debate on artificial intelligence has reached the highest global institutions with a new Vatican encyclical, which compares the current technological transition to the industrial revolution. Philosophical questions aside, the document highlights a fundamental technical paradox linked to hardware and resources. While the industry designs data centers in space to power increasingly compute-hungry models, the human brain continues to run on less energy than a light bulb.

The inference cost and energy impact of large language models remain colossal challenges for large-scale production deployment. The industry's focus must shift to extreme optimization, pushing the adoption of smaller models and lightweight frameworks. Building data centers in orbit is a brute-force approach to the energy problem. Real progress comes from extracting business value by moving computation to the edge, toward an agentic revolution that imitates the efficiency of biological systems.

The reality check for autonomous agents

The new Gartner report issues an unequivocal warning to enterprises: by 2027, 40% of projects involving autonomous AI agents will be scaled back or abandoned. The main cause lies in the flawed governance models adopted by IT teams, which apply rigid rules designed for legacy software to complex agentic systems. Treating artificial intelligence in a binary way, either completely blocked or totally free, leads to disastrous outcomes: in the first case innovation is suffocated, in the second automation scripts are left free to cause incalculable damage.

To solve the problem, Gartner proposes classifying agents into four levels of operational autonomy: "observe", "advise", "act with approval" and "act autonomously". An agent tasked with data entry calls for human approval through a dashboard, while fully autonomous systems need circuit breakers at the code level. Complete audit trails and measurable automations become a non-negotiable requirement.
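As an illustration, the four levels can be sketched as a small gate around an agent's actions. This is not Gartner's specification: the `GovernedAgent` class, the outcome strings and the failure threshold of 3 are assumptions made for the example, and a real approval step would sit behind a dashboard rather than a boolean flag.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Autonomy(Enum):
    OBSERVE = "observe"
    ADVISE = "advise"
    ACT_WITH_APPROVAL = "act with approval"
    ACT_AUTONOMOUSLY = "act autonomously"


@dataclass
class GovernedAgent:
    """Hypothetical wrapper that gates actions by autonomy level."""
    level: Autonomy
    max_failures: int = 3  # circuit-breaker threshold (assumed value)
    failures: int = 0
    audit_trail: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, action: Callable[[], object], approved: bool = False) -> str:
        """Decide what happens to the action and record it in the audit trail."""
        outcome = self._decide(action, approved)
        self.audit_trail.append((self.level.value, outcome))
        return outcome

    def _decide(self, action: Callable[[], object], approved: bool) -> str:
        if self.level is Autonomy.OBSERVE:
            return "logged"          # never executes anything
        if self.level is Autonomy.ADVISE:
            return "suggested"       # proposes, a human performs the action
        if self.level is Autonomy.ACT_WITH_APPROVAL and not approved:
            return "pending approval"
        if self.failures >= self.max_failures:
            return "circuit open"    # breaker tripped: stop executing
        try:
            action()
            return "executed"
        except Exception:
            self.failures += 1       # count the failure toward the breaker
            return "failed"
```

The design choice worth noting is that every call, executed or not, ends up in `audit_trail`, so the record also shows what the agent was prevented from doing.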

Technical Insight

Confirming this need for control, Hugging Face and IBM Research published the results of ITBench-AA, the first rigorous benchmark to evaluate the capabilities of AI agents in enterprise contexts. The data shows that every current frontier model falls below a 50% success rate when executing real IT tasks fully autonomously. The evaluation highlights the inability of today's models to maintain long-term context without constant human supervision.

The real bottleneck is not pure code generation, but the multi-step reasoning capacity on heterogeneous architectures.

These results are extremely healthy for the sector: they mark the decline of chaotic agents in favor of an engineering approach based on narrowly scoped frameworks and highly specialized tools.

Agent swarms and model rationalization

The evolution of multi-agent orchestration takes a leap forward with the release of Claude Opus 4.8 by Anthropic. The model introduces "dynamic workflows", a native tool to coordinate swarms of sub-agents in total autonomy. Metrics indicate a reduction in code errors and an unprecedented propensity of the model to admit its own limitations. This ability to openly declare doubts radically changes the debugging phase, allowing the insertion of the model into CI/CD pipelines with the certainty that it will ask for confirmation before corrupting the output. The native integration of these functions instantly makes many third-party orchestration frameworks obsolete.

In parallel, OpenAI has updated GPT-5.5 Instant, completely removing the "canvas" function from its most recent models. Complex writing and code generation tasks now occur natively in the chat flow. This choice confirms that solid context management makes fragmented workspaces useless, allowing focus to be maintained in the main dialog window.

The company also announced the definitive shutdown of servers for the legacy o3 and GPT-4.5 models by August 2026. This aggressive catalog cleanup is a reminder of a fundamental rule: applications that depend heavily on the quirks of a specific model version accumulate devastating technical debt. It becomes imperative to implement layered architectures that can swap the underlying model in a few minutes through simple environment variables.
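A minimal sketch of that layering, under stated assumptions: application code calls a single `complete` function, and an environment variable selects the backend. The variable name `LLM_BACKEND` and both backend functions are hypothetical stand-ins; in practice each one would wrap a real provider SDK.

```python
import os
from typing import Callable, Dict


# Hypothetical backends: each takes a prompt and returns text.
def _primary_backend(prompt: str) -> str:
    return f"[primary] {prompt}"


def _fallback_backend(prompt: str) -> str:
    return f"[fallback] {prompt}"


BACKENDS: Dict[str, Callable[[str], str]] = {
    "primary": _primary_backend,
    "fallback": _fallback_backend,
}


def get_backend() -> Callable[[str], str]:
    """Resolve the backend named by LLM_BACKEND (assumed variable name)."""
    name = os.environ.get("LLM_BACKEND", "primary")
    if name not in BACKENDS:
        raise ValueError(
            f"Unknown LLM_BACKEND {name!r}; expected one of {sorted(BACKENDS)}"
        )
    return BACKENDS[name]


def complete(prompt: str) -> str:
    """Single entry point used by application code, independent of the model."""
    return get_backend()(prompt)
```

Because the variable is read on every call, switching models means changing the deployment configuration and leaving application code untouched. Prompts tuned to one model's quirks may still need to be re-tested after the swap.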

Weekly radar: flash news and practical tools

The ecosystem continues to move at a fast pace amidst price drops, new frameworks, and market adjustments. Here is a summary of the most relevant elements that emerged in recent days.

News to watch:

  • Token price war: DeepSeek slashes costs, offering output tokens 34 times cheaper than GPT-5.5 and redefining margins for those building applications at scale.
  • Search revolution: Google transforms links into simple secondary components of its AI search, an epochal change that is pushing startups like Peec AI to reach 10 million in ARR by providing specific SEO services for LLM-generated results.
  • Impact on work: ClickUp has cut 22% of its workforce with the goal of replacing those roles with autonomous agents, while McKinsey has launched a free AI tool for interviews, putting strong pressure on the private coaching market.
  • Hardware and infrastructure: Snowflake invests 6 billion on AWS to consolidate its agentic infrastructure, and Google unveils Coral Board to run Gemma 3 locally. Meanwhile, semantic routers are establishing themselves as a way to drastically reduce token consumption.
  • Record valuations: Anthropic surpasses OpenAI, reaching a valuation of 965 billion dollars, driven by large enterprise contracts and accompanied by the opening of new offices in Europe.

Tools and frameworks for operations:

  • Ellf AI (Beta): an emerging platform to develop agentic NLP solutions. It works as a specialized virtual assistant to structure information extraction pipelines and pairs perfectly with coding assistants.
  • AGENTS.md: an emerging standard to document the limits, capabilities, and expected behaviors of AI agents within source code repositories, fundamental for corporate governance.
  • Amazon Bedrock AgentCore: a managed runtime designed specifically to orchestrate and track serverless multi-agent systems, ideal for those who need to scale infrastructure without managing servers.
  • Claude Code Organizer: a comprehensive dashboard to monitor and manage memories, local configurations, and MCP servers linked to Claude Code, essential for maintaining order in agent-dominated workspaces.
  • Transformers.js: brings NLP models directly into the browser, managing client-side classification and RAG, drastically lightening the load on central servers and reducing infrastructure costs.
  • Data Formulator 0.7: the intelligent workspace by Microsoft to explore complex enterprise data relying on AI agents dedicated to structured analysis.

Author

Fabrizio Mazzei

AI Solutions Architect

As an AI Solutions Architect I design digital ecosystems and autonomous workflows. Almost 10 years in digital marketing, today I integrate AI into business processes: from Next.js and RAG systems to GEO strategies and dedicated training. I like to talk about AI and automation, but that's not all: I've also written a book, "Work Better with AI", a practical handbook with 12 chapters and over 200 ready-to-use prompts for those who want to use ChatGPT and AI without programming. My superpower? Looking at a manual process and already seeing the automated architecture that will replace it.
