Or try one of the following: 詹姆斯.com, adult swim, Afterdawn, Ajaxian, Andy Budd, Ask a Ninja, AtomEnabled.org, BBC News, BBC Arabic, BBC China, BBC Russia, Brent Simmons, Channel Frederator, CNN, Digg, Diggnation, Flickr, Google News, Google Video, Harvard Law, Hebrew Language, InfoWorld, iTunes, Japanese Language, Korean Language, mir.aculo.us, Movie Trailers, Newspond, Nick Bradbury, OK/Cancel, OS News, Phil Ringnalda, Photoshop Videocast, reddit, Romanian Language, Russian Language, Ryan Parman, Traditional Chinese Language, Technorati, Tim Bray, TUAW, TVgasm, UNEASYsilence, Web 2.0 Show, Windows Vista Blog, XKCD, Yahoo! News, You Tube, Zeldman
33 LLM metrics to watch closely | InfoWorld
Technology insight for the enterprise33 LLM metrics to watch closely 15 Jun 2026, 4:00 am
We’ve all heard the mantra from the quants in the business community: you can’t manage what you can’t measure. And if that’s true for human intelligence, it should be true for the artificial kind too.
How do we measure agents and large language models (LLMs)? We’re just beginning to come up with statistical metrics. Here are several of the most common metrics that designers and users toss about when they’re evaluating a model.
[ See also: 27 questions to ask before choosing an LLM ]
Time to first token
How long does it take to generate the first token? For real-time applications with time constraints, faster responses can be essential. It’s well-known that people hate waiting even a few milliseconds. The teams that develop user interfaces learned decades ago that it’s important for the software to respond quickly when a human is waiting for an answer. Even a few seconds of delay mean that the human will wander off to another window to check some email or place some bet on a prediction market. Time to first token is a good measure for models that will be working directly with the fickle human intelligences and their latent attention deficit disorder.
Time per output token
Take the total time it takes to respond and divide by the total number of tokens. The time to first token measures how long it takes to start a response and this measures the average speed as the model through all of the tokens. In basic LLMs, this value is generally fairly constant. Once the prefill is done and the LLM enters the decode phase, the output tokens usually appear at a constant stream. When the output is long enough, the startup time to first token is amortized away. In some of the more complicated architectures with loops for planning or gathering data from various tools, the average speed can vary as the model shifts in and out of making agentic decisions.
Tokens per second
This is just the reciprocal of the average time per token. Sometimes it is reported separately for different stages in the pipeline.
Throughput (requests per minute)
If a system supports more than a single user, tracking the number of different requests makes sense. These throughput numbers can be quite useful for measuring the power of some of the newer pipelines that are more efficient when they’re answering multiple prompts at the same time.
Error rate
Not every request gets an answer. The error rate tracks how often rate limits, timeouts, or model “refusals” get in the way. Better accounting tracks each independently because the number of failures in each category can be very different.
Token efficiency
Not all work tokens are visible and not all tokens are part of the final outcome. This measures how much work is done to produce the final result. As models become more complex or agentic and the pipelines become more sophisticated, the efficiency tends to drop. Agentic reasoning and strategic planning typically require more tokens that don’t appear in the final answer. This is generally a measure of how expensive a model might be to run.
Tail latency
It’s all well and good to measure the average time to answer, but in some cases a few very slow responses can really color people’s judgement. Some applications require good performance all of the time. Would you want to ride in an autonomous car that gets steering instructions very quickly “on average” instead of always? What if that’s only 99% of the time? Tail latency uses a mixture of queuing theory and detailed measurements to track the worst moments in the long tail of the latency graph. It’s useful when even occasional delays are problematic.
Total cost of ownership
Projects that use an API or buy output from providers just look at the cost per 1M tokens. They’re effectively renters. The groups that are buying GPUs and paying for electricity, though, will add up these costs and other indirect costs like depreciation and maintenance to come up with a number that estimates how much the tokens really cost to produce. This value will depend upon demand and utilization rates—that is, on how many users are sending in prompts and how efficiently the model fits in a particular GPU and its RAM.
Parameters
Many models have numbers in their name followed by a B. This is meant to roughly capture the number of parameters, or the number of variables the model uses to generate outputs from inputs. The number “70B” means that there are about 70 billion parameters in the model. This is a good estimate for the complexity of the model and the size of the training set that has been stuffed into it. Generally bigger numbers mean a larger amount of information is hiding inside the model. It often means that it will take a bigger GPU with more RAM to generate an answer with it. It’s not a very precise number, though, because there are many other areas of the architecture that can influence whether the model can generate the answer you want inside your budget. There continue to be advances and it’s not uncommon for someone to claim that a new model with X parameters is better than an old model with 2X or 3X parameters.
Hallucination rate
While everyone wants LLMs to generate accurate output, measuring it can be difficult because deciding what’s accurate is sometimes complicated. One approach is to ask the LLM to summarize a document. Then another model evaluates how well the summary matches the original. While this may not catch all subtle slips, it will capture enough of the worst departures from reality. Some researchers have built complex test sets with curated answers. The LLMs that deliver the expected answers get the highest scores. Some common benchmarks are TruthfulQA, HaluEval, QAFactEval, and Vectara’s Hallucination Evaluation Model (HHEM).
Toxicity and bias scores
If measuring accuracy is difficult, building a metric to detect toxic or biased output is even more challenging because the definitions can be so protean. Still, some teams have built LLMs that key on particular concepts or word choices. They can detect some of the most obvious red flags that could generate political trouble. Some well-known versions include Granica Screen and Perspective API.
PII leakage
One of the biggest fears is that LLMs will somehow absorb information that may be considered personal and private. Some of the simplest measures can be as simple as regular expressions that look for the sixteen digit numbers used for credit card transactions. Many of the model builders work on eliminating personally identifiable information (PII) from the training set before beginning.
Tool-calling accuracy
As models grow more complex and agentic, they often gain access to various tools or Model Context Protocol (MCP) gateways that can help them find the best answers. Not all models take advantage of this help. The tool-calling accuracy scores count how often the models choose the best tool for the job. One particular example of this measurement is BFCL (Berkeley Function Calling Leaderboard).
Prompt sensitivity
The value captures how small changes in the language of the prompt induces the model to produce different results. It’s like a derivative from calculus class, although it’s generally computed experimentally using some collection of test prompts. There are a number of different approaches that depend upon different types of changes. Some test sets are built with small rephrasing of the request that are semantically the same. Others mix together different ways of specifying the problem, some with examples, say, and some without. Some specific examples include PromptSE and ProSA.
Semantic similarity and conciseness
Some metrics evaluate the answers by comparing them to a set of gold standard answers. This often involves feeding them to a vector embedding model and searching a retrieval-augmented generation (RAG) database for similar answers. This can track how concise or fluffy the answers might be as well as looking for how much variability might be introduced through changing parameters like the temperature. One common example is the BERTScore.
Grounding score
Many systems that combine an LLM with a vector search tool for RAG measure the effectiveness of the combination with a benchmark like the grounding score. The LLM is presented with extra data from the vector search and the benchmark measures how closely it follows this extra information. That is, how much of the answer comes from the provided source documents and how much is synthesized using the data in its training set. Some examples include RAGAS, TruLens, ARES (Automated RAG Evaluation System), RGB (Retrieval-Augmented Generation Benchmark), HaluEval, and HalluHard. A similar concept is called “context adherence,” “context precision,” “context recall,” or “faithfulness.”
Model variability
Most LLMs fold in a certain amount of random entropy, and this amount is often controlled by a parameter called the “temperature.” The model variability is a measure of how much the answers will change between runs. Some applications like chatbots require a certain amount of variability because the randomness adds a bit of “life” to the answers. Other applications like those in mission-critical areas like law or medicine will undermine confidence if the answers vary.
Format compliance rate
In some roles, LLMs are asked to produce data in strict formats like JSON or CSV. This is often important if the data will be fed into some pipeline for further processing or storage. The format compliance rate tests a number of common formats and measures how often the LLM returns semantically correct data. Agentic systems that glue together multiple LLMs and other tools rely heavily on LLMs with good scores on this benchmark.
Instruction following
Some prompts include very specific instructions and the adherence can be measured empirically. For example, some prompts will ask the LLM to produce exactly 300 words or a poem in rhyming couplets. These tests use a collection of sample prompts that ask for answers that can be easily measured. Some specific examples include IFEval, FollowBench, and the BFCL (Berkeley Function Calling Leaderboard), a value that is mentioned above in the section on tool usage.
Subgoal success rate
As agentic models become more common, it’s helpful to track how well the model performs on each of the various parts of the agent’s strategic plan. All of the metrics here can be broken down and tracked for each of the subgoals.
Plan stability
Agentic models start with a plan. Some of them are smart enough to abandon the plan or at least adjust it as the work evolves. Plan stability measures how often the plans are adjusted. A high rate of adjustment could mean that the agent is a bad planner or just flexible or maybe both.
Self-correction score
Some agents are able to dive deeper and recognize their mistakes. The self-correction score measures how often the model will make a mistake and then recognize it, either on its own or after being prompted with the question, “Are you really sure?”
Jailbreak resistance
Some users try to find clever ways to lure the LLM into tossing aside any restrictions on topics or answers. In the past, some LLMs could be fooled by being told the answer was part of a play or a work of fiction. So discussing forbidden subjects wasn’t a problem because it was all pretend. Newer models have more elaborate defenses. Measures of the ability to resist deception include JailbreakBench, AgentHarm, and Tele-AI-Safety.
Prompt injection vulnerability
Sometimes untrusted data from extra sources or skills may include malicious instructions that can exploit the LLM. Benchmarks such as Skill-Inject and SPIKEE (Simple Prompt Injection Kit for Evaluation and Exploitation) work with known attack vectors and measure how susceptible a model is to targeted prompt injection attacks.
Copyright infringement score
Some LLMs can regurgitate the data in their training corpus in a way that seems like plagiarism or copyright infringement. This can be an issue when the training material wasn’t carefully licensed. The copyright infringement score measures how often the LLM may parrot the training material a bit too closely. Tools for defending against this include CopyrightCatcher and DE-COP.
RULER
How well can a model extract information from the entire context? NIAH (needle-in-a haystack) benchmarks measure how well a model can retrieve small, crucial bits of information from long contexts. RULER takes NIAH tests further with the ability to vary the types and quantities of needles, the size of the haystack, and the complexity of the task.
GSM8K
The developers of GSM8K (Grade School Math 8K) set out to benchmark an LLM’s ability to tackle multistep mathematical problems, so they gathered 8,500 problems that are common in grade school math classes. While the focus is explicitly on solving math homework problems, the benchmark also measures the ability to construct reasoning chains.
GPQA
The Graduate-Level Google-Proof Q&A is composed of hundreds of hard questions that might normally be answered by humans in graduate school, generally in science. To make the benchmark harder, the researchers focused on questions that non-experts often get wrong. The term “Google-proof” means that the benchmark includes questions that can’t be easily answered by asking a search engine.
MMLU-Pro
The MMLU-Pro benchmark builds on the Massive Multitask Language Understanding dataset to test a model’s understanding of a broad set of scientific knowledge. It includes more than 12,000 questions about general scientific fields like biology, chemistry, economics, and law.
MBPP
Google created MBPP (Mostly Basic Python Problems) to evaluate how well a model was solving coding questions. Each problem comes with a statement, a gold standard solution, and several similar test cases. The number of accurate answers to these questions is a good measure of how well the model will solve many of the simpler Python coding problems presented by users.
SWE-bench
This collection of several thousand software engineering challenges evaluates how well a model solves programming problems. The developers created it by selecting a number of issues and corresponding pull-requests from a dozen or so Python projects. After some limitations appeared, the creators expanded the set by creating SWE-Bench+, SWE Bench Verified, and SWE-Bench Pro.
LMSYS Chatbot Arena
Instead of creating a fixed set of test prompts, the Large Model Systems Organization’s Chatbot Arena is a dynamic system that feeds the same prompt to different models and then asks humans to pick the best results. These head-to-head contests produce an Elo-like rating that is similar to the one used to score chess players.
Price
The rest of these metrics are useful, but as the real estate agents say, the three most important numbers on a property listing are price, price, and price. The cost is a bit less important for measuring AIs, but only a bit. Price can make a huge difference between a project being profitable and a moneysink. When the cost for each inference is a tad too high, it’s impossible to make it up with volume.
The key caveat is that a cheaper model isn’t a good idea if it generates answers that are filled with hallucinations or worse. The quality of the answers can differ greatly, and saving a few pennies can be a mistake. To make matters more complicated, there’s an explosion in different styles and approaches. Sometimes it makes sense to pay a bit more for a model that delivers answers with the right vibe.
AI needs young developers – and old developers 15 Jun 2026, 4:00 am
Enterprises are increasingly investing copious amounts of cash in AI without a lot to show for it. This could be, in part, because the wrong people are leading the change.
As I’ve argued before, AI isn’t likely to eliminate developers so much as change what we need from them. For example, we keep asking whether junior developers are needed in a world where large language models can write code faster and cheaper. What this overlooks is the reality that these younger developers and their relative inexperience may be exactly what we need to rewrite the rules of software development.
This thought hit me while reading James Governor’s riff on something Ben Griffiths wrote about our industry’s habit of confusing age with authority. Griffiths remembered sitting through a conference talk in which a speaker tried to shame a young audience for not recognizing some of the older men who had shaped computing. The irony, Ben noted, was that many of those “old men” had done their world-changing work when they were younger than the people being lectured. Bill Joy wrote vi when he was 22, John Carmack created Doom at 23, Linus Torvalds launched Linux at 22, etc. Many of our industry’s titans made their biggest contributions before they had decades of experience.
The point isn’t that young people are smarter. They’re not. The point isn’t that the key to AI success is to ignore more experienced developers. That’s dumb. Rather, it’s a suggestion that Griffiths’ larger point is right: At the beginning of big shifts, experience can be a mixed blessing. It can help you see risk, but it can also make you overconfident in old ways. The most successful enterprises will find ways to balance youthful innovation with experienced guardrails.
The factory doesn’t redesign itself
Zara Zhang recently pointed to Paul David’s classic 1990 paper, “The Dynamo and the Computer,” as a way to understand why so many companies have “adopted” AI without much to show for it. David’s argument, simplified, is that electricity didn’t immediately transform factories. For a long time, factories simply swapped out the central steam engine for an electric motor while keeping the same layout, the same workflows, and the same assumptions.
Electricity was new, but we largely stifled its potential by force-fitting it into old factory systems.
The big productivity gains came later, when factories stopped treating electricity as a cleaner steam engine and started redesigning work around smaller motors distributed throughout the factory. Once each machine could have its own motor, the factory no longer had to organize itself around a single driveshaft. Work could instead be reorganized around the flow of production.
That’s a decent description of where many enterprises are with AI. Enterprises today are buying copilot licenses by the thousands, wiring agents into existing applications, etc., and then wondering why the results are so uneven, as I’ve written. This is the equivalent of swapping the steam engine for an electric one and declaring that the AI modernization work is done. It’s not. Not even close.
The real payoff won’t come from asking AI to write the same tickets a bit faster. It will instead come from changing how teams define work and how (and what) developers build. The “factory” has to change.
So here’s the uncomfortable question: Who is most likely to build the new factory?
Experience cuts both ways
There’s an obvious danger in romanticizing youth. Plenty of bad software has been written by people with unlimited confidence and limited context. Enterprises need software that works, yes, but “works” also means it complies, scales, respects security boundaries, and more.
This is where experienced developers matter. A lot.
As I pointed out recently, the agent era makes engineering judgment more important than ever. After all, AI makes it easier to generate code, but easier code generation can become easier technical debt generation. Hence, the limiting factor becomes less of “Can we create something?” and more of “Can we create the right thing, in the right place, with the right constraints?” Taste is required, in other words.
Senior engineers are often better at seeing those constraints because their experience gives them “taste.” They know why the weird validation rule exists, and they remember the customer who depended on the undocumented behavior. They understand why a simple schema change can turn into a multi-week migration.
But experience also has a shadow side, because it can make the current process feel inevitable. A senior engineer may see an AI assistant as a faster autocomplete because that’s the easiest way to fit AI into their existing mental model. A junior developer, less invested in the old workflow, may ask the more interesting questions: Why are we doing this ticket at all? Why isn’t the spec executable? Why can’t the agent generate the test harness first?
It’s not that the more experienced developers don’t know these questions. Rather, they may simply not have the energy to rage against the machine, as it were.
The value of inexperience
The worst way to use junior developers in the AI era is to treat them as cheaper versions of senior developers. That was always a bad idea, but AI makes it worse. If the job is “take this ticket, generate some code, and send it to a senior person for review,” the junior developer becomes a human wrapper around a coding assistant. That helps no one. The junior doesn’t learn much, the senior gets buried in review, and the enterprise ends up with more code, which, as I’ve said, is hardly a good thing.
Instead, junior developers should be given room to explore new workflows, with just enough oversight from experienced colleagues. That might mean giving these newer developers interesting questions to answer, such as:
- How would we redesign onboarding if every internal API had an AI-readable contract and examples that actually worked?
- How would we change code review if the agent produced a change summary, test evidence, dependency risk, and rollback plan with every pull request?
- How would we build features if product requirements were written as executable acceptance tests rather than vague prose?
- How would we reduce toil if agents could safely perform routine migrations, dependency updates, or incident triage within clearly defined boundaries?
These are not toy problems. They’re not “junior work.” They’re exactly the sort of process redesign that enterprises need but generally avoid because everyone is too busy running on the existing hamster wheel.
Finding the balance
So what should engineering leaders do? First, stop treating AI adoption as an individual productivity contest. We seem to be moving quickly away from the idea that “lots of tokens” equals “great engineer,” but the fact that we even flirted with it is damning. I love how Santiago Valdarrama eviscerates this vanity metric: “Measuring AI productivity in number of lines written is a stupid mistake. One day, everyone will have always been against this.” Instead we should be asking questions like, “What part of our software delivery process no longer makes sense?” AI’s biggest gains will come when we change how we specify, test, review, and ship software.
Second, mix up your AI workflow teams. No, not committees or PowerPoint-producing centers of excellence. I’m talking about combining two or three newer developers who are already fluent in AI-native tools with two or three senior engineers who understand production, security, architecture, and organizational constraints. Then give them a real workflow to redesign, such as dependency upgrades or test creation.
Third, make the senior engineer’s job less about saying no and more about defining the guardrails within which others can say yes. I’ve argued that golden paths are key to using AI effectively. Good senior engineers should define the paved roads: approved patterns, test requirements, observability standards, etc. Then let junior developers and agents move quickly inside those boundaries.
Fourth, reward deletion. This may be the most important point. Going back to the factory electricity metaphor, we’ll fail with AI modernization if we simply add AI without removing outdated processes.
Bring everyone to the table
The future of software development won’t belong to the young. It won’t belong to the old, either. It will belong to teams that combine the talents of both.
Newer developers often bring impatience. They’re less likely to accept the existing workflow as sacred. They’re more likely to try weird tools, compose them in unexpected ways, and wonder why enterprise software development feels like a ritualized exercise in waiting for permission.
Experienced developers bring judgment. They know that software has users, auditors, attackers, budgets, latency, history, and consequences. They know that the right answer is often boring, and boring is good.
Enterprises need both. They need the developer who asks why the factory is still organized around the old drive shaft, and they need the developer who knows which machines will kill someone if moved casually. In sum, every development team needs people who know why the old system exists… as well as those who don’t.
Google unveils DiffusionGemma, an AI model that breaks free of left-to-right processing 12 Jun 2026, 4:18 pm
Extremely powerful large language models (LLMs) still operate as though they’re typing on a keyboard, processing workloads in a simple left-to-right fashion. But in locally-run, single-user scenarios, this sequential processing can leave graphics processing units (GPUs) and tensor processing units (TPUs) underutilized.
Google is betting that DiffusionGemma can get around this bottleneck. The new experimental open model generates text “exceptionally fast,” creating entire blocks of text simultaneously through diffusion techniques rather than through token-by-token processing. The company says this technique results in 4x faster inference compared to auto-regressive models that rely on sequential processing.
It can also save users money. Technology analyst Carmi Levy noted that existing pay-per-token monetization models “penalize the use of less than optimally efficient AI solutions.”
But DiffusionGemma “could herald a new generation of task-defined, efficient solutions that can enable expanded compute capacity without draining the operations budget,” he said.
A contrast to left-to-right processing
Built on Google’s Gemma 4 family and its Gemini Diffusion research, DiffusionGemma is a 26B mixture-of-experts (MoE) model designed to maximize text output generation.
It essentially shifts how models use hardware, giving processors a larger hunk of work each cycle so it can draft full 256-token paragraphs in sequence. This allows the model to generate text up to 4x faster on GPUs, Google claims. It activates only 3.8B parameters during inference, and, when quantized, can fit within 18GB VRAM on high-end consumer GPUs like Nvidia RTX 5090.
“It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously,” Google research scientists Brendan O’Donoghue and Sebastian Flennerhag wrote in a blog post.
AI image generators begin with pure, random ‘visual noise’ and iteratively refine that into a finalized picture (what’s known as ‘diffusion’); DiffusionGemma applies this same process to text. It does not generate tokens in order, but begins with a “canvas of random placeholder tokens” that it processes in multiple passes, identifying the context tokens it feels are most relevant and using those to refine the rest.
The model has the ability to self-correct, using confidence scoring to re-evaluate tokens in the next pass. “The model iteratively refines its own output, allowing it to evaluate the entire text block at once to fix mistakes in real-time,” O’Donoghue and Flennerhag explained.
DiffusionGemma also has bidirectional attention, they wrote. “Generating 256 tokens in parallel with each forward pass allows every token to attend to all others.” This can be particularly helpful in domains that are non-linear in nature, such as mathematical graphs, code infilling, and in-line editing, they said.
DiffusionGemma is optimized across Nvidia’s hardware stack, making it compatible with consumer setups as well as with high-performance enterprise systems like Hopper and Blackwell.
Because it is released under the Apache 2.0 license, developers can freely use, modify, distribute, and commercialize the software using their preferred tools. It can be run on GPUs or in the cloud through Google Cloud Model Garden or Nvidia NIM, and is available on Hugging Face, GitHub, and vLLM, with support for the open-source library llama.cpp coming soon.
Key use cases
The model is particularly useful in local workflows that are “speed critical,” such as generation of non-linear text structures, and unlocks what Google calls “new patterns of model behavior” like multimodal understanding and generating and rendering code in near real-time.
Levy explained, “DiffusionGemma is particularly well suited for interactive coding and editing where its efficiency allows rapid processing and iterations,” noting that its ability to fit within 18GB of VRAM and its deployability on commonly available local GPUs can potentially benefit customer service-related workloads that lean heavily on real-time interaction and local processing.
“DiffusionGemma also incorporates a thinking mode that is especially adept at problem solving,” he said. For instance, the model was fine-tuned to play Sudoku, a typically challenging task for autoregressive models because each token depends on future tokens. This “rather handily” illustrates the model’s capability to solve more complex problems, Levy noted.
Limitations
Google freely admits that DiffusionGemma is geared to specific workflows, and there are “key trade-offs.”
The model is engineered for small batch size inferencing and low-latency, high-speed generation low-to-medium batch sizes on a “single capable accelerator.”
In high-QPS cloud serving environments, (where infrastructure is designed to handle tens or hundreds of thousands of requests per second with ultra-low latency), DiffusionGemma’s parallel coding “offers diminishing returns,” and can even result in higher serving costs, Google conceded. In addition, its overall output quality is lower than that of standard Gemma 4, which is built for apps demanding maximum quality.
However, Levy noted that while DiffusionGemma “can be less precise than other models in certain workloads,” subsequent refinement cycles could overcome this limitation.
While Google isn’t sharing runtime costs, it’s clear that this is an efficiency play, he added. “When deployed across the kinds of workloads that would optimally benefit from its architecture, DiffusionGemma seems to have the potential to reduce processing overhead and related costs,” he said.
OpenAI buys Ona to help rein in AI agents 12 Jun 2026, 3:01 pm
CIOs and CISOs have many strategic and operational fears when it comes to unleashing fully-autonomous agents on tasks and hoping that everything works out. Will the agent start to delete critical files? Will the agent go off on a mission tangent and generate a massive token bill for the team when they return the next morning? Will it be tricked by a state actor and engage in malicious actions?
To help alleviate those concerns, OpenAI announced on Thursday that it has agreed to acquire Ona, a 79 person cloud development environment (CDE) provider formerly known as Gitpod, to accelerate its efforts to make agentic AI enterprise-friendly.
An OpenAI statement said Ona’s technology “provides secure, persistent environments where agents can access the tools, systems, and context they need to make progress over time. By bringing Ona to OpenAI, we will expand Codex beyond work tied to a single device or active session and help more organizations deploy agents securely in production.”
An Ona statement attributed to CEO Johannes Landgraf shared similar sentiments.
“Ona brings the building blocks agents need for enterprise work: trusted, customer-controlled cloud environments where work continues across devices, inside the systems where software actually lives,” Landgraf said. “OpenAI brings frontier intelligence, product polish, and a scale of research and distribution we could never reach alone.”
Landgraf’s statement did not provide any annual revenue numbers, but did hint, without naming, at some large customers. “Since the beginning of the year, weekly Ona agent sessions have grown 13x in production across some of the world’s most demanding institutions: the oldest bank in the US, one of Europe’s largest pharma companies, one of Asia’s largest sovereign wealth funds and many others,” he wrote. “The largest enterprises out there love the platform and are expanding more rapidly than ever before.”
Arnal Dayaratna, research VP for software development at IDC, said IDC’s figures for Ona put its annual revenue for 2025 at “roughly $7 million.” He speculated that Ona’s revenue for 2026 would be higher: “Let’s say it’s $15 million. I am being generous. Maybe it’s really $10 million or $12 million.”
Dayaratna said if he uses a standard acquisition price of roughly a multiple of 30 times revenue, then depending on the actual 2026 figure, “that comes to $450 million or $500 million or so.”
But IDC sees this being a potentially good move for OpenAI, regardless of the specific acquisition price, given that OpenAI had the classic “buy or build” challenge.
OpenAI has a substantial Codex effort, Dayaratna said, but what they lack is a safe area to protect enterprise autonomous agent efforts. “This is outside of what OpenAI has now. These are secure environments where agents can have memory and operate securely,” he said. “This is the kind of technology that one would expect to be needed, but I don’t know how good it is, to be honest.”
Gartner’s First Take, published today, noted that the acquisition will bring Codex “the essential scaling capability it lacked,” but also pointed out it forces some difficult decisions on enterprises: “Software engineering leaders must weigh the benefits of a vendor-specific integrated stack against the flexibility of staying vendor-agnostic.”
In addition, Gartner wrote, “This acquisition appears to be OpenAI’s response to Anthropic supporting self-hosted sandboxes in Claude Managed Agents, starting May 2026.”
Tom Findling, CEO of Conifers.ai, said he also sees OpenAI’s fear of Anthropic playing a meaningful role in this deal.
“It feels like a move to keep pressure on Anthropic, especially as Claude Code gains traction with developers and enterprise buyers,” he said. “So I’d read this less as OpenAI taking out a small competitor and more as OpenAI trying to make sure Codex is enterprise-ready before Anthropic gets too far ahead. In the enterprise market, the battle is not just who has the smartest coding model, but who can make AI agents safe and useful enough for big companies to actually deploy.”
He added, “I don’t think this means OpenAI suddenly needs help making Codex better at writing code. The bigger issue is making Codex work inside real enterprise environments, where security, access controls, persistent cloud workspaces, audit trails, and integration with existing developer workflows matter just as much as the model itself. Ona gives OpenAI some of that missing plumbing.”
Jason Andersen, principal analyst for Moor Insights & Strategy, echoed the concerns about Anthropic.
“To be honest, I think it reinforces what I think, which is that OpenAI and Codex have given a lot of ground to Anthropic and Claude Code, who are winning right now,” he said. “But again, this is not about the market today, I think it’s about how OpenAI will need to position itself as more than just a model as we see the incumbent players, particularly Microsoft, bolster their enterprise coding infrastructure story.”
Andersen said that Moor doesn’t have any strong basis for a guess on the financials, but added, “I am going to assume it was a fairly high multiple, but on a small base. I would not speculate on an amount, but given the enterprise customers that Ona did have, it may be more than we think.”
He also reinforced the idea that OpenAI is going to need help to achieve its own objectives.
“We continue to see that AI adoption is strongest in coding, and other use cases are not as far along,” he said. “So, if you are a general-purpose AI company like OpenAI, you need to double down on development use cases. The meaningful investment and spending on development is happening at the enterprise level, and those customers have more demands for governance, security, etc. than Codex or Claude code can handle.”
That said, he noted, “what you’re seeing is traditional software and cloud plays building out the coding and ops infrastructure around the popular models. That increased competition, while good for selling tokens, is still keeping OpenAI and Anthropic on the outside looking in. So, OpenAI and Anthropic need a stronger enterprise dev story, or they are just another model that could be easily replaced.”
Jeremy Roberts, senior director at Info-Tech Research Group, said that he also sees this as likely a good move for OpenAI.
“OpenAI is growing up a little bit,” and they may be falling behind Anthropic, Roberts said. “I see Ona as a boring company, but not in a bad way. They are not flashy, but absolutely necessary.”
Ona is delivering a workspace for Codex that an enterprise can run in its own virtual private cloud, with governance and persistence and an environment where the company can apply their own controls including log management, credential management and resource access, he said. “It is a bucket for the agents to operate in” where IT can “make sure that access is properly credentialed and is controlled effectively to prevent the model doing what it shouldn’t be doing,” which includes managing read/write protections.
Software engineer reportedly wins religious exemption from AI use 12 Jun 2026, 11:19 am
When Pope Leo XIV wrote about the effect that AI is having on our world in his encyclical, Magnifica Humanitas, he may not have imagined the document being referenced in an HR environment.
But, according to a report by Business Insider, Erin Maus, a software developer in North Carolina, used the Pope’s message about the need for vigilance in how AI would be deployed to gain a special exemption from her employer about using the technology for coding.
Maus is not even a Catholic but a Unitarian Universalist, according to the report. However, it said, she maintained that the use of AI didn’t align with her religious beliefs.
Business Insider said that to make her case, she consulted an employment lawyer — a move to be expected — and her local chapter’s minister — which probably wasn’t. Her wishes were reportedly granted last month. “I’m writing my code and reviewing my code by hand, which seems crazy to say,” she told the publication.
She’s certainly not alone in wondering whether AI is always the way forward for techies: a journalist at PC World has also been rethinking its use after reading the encyclical.
It remains to be seen whether this will be the spur for a torrent of claims from Catholic workers, asking to be freed from the demands of using AI or whether Business Insider’s report is an outlier.
The causes of cloud outages are changing 12 Jun 2026, 4:00 am
For years, the cloud market has made a simple promise: Move workloads to large-scale platforms, gain better resilience, and worry less about downtime. That promise was never entirely wrong, but it is becoming less complete. The latest findings from Uptime Institute’s seventh Annual Outage Analysis suggest that the outage landscape is changing in ways that should concern both cloud providers and cloud customers. The biggest risks are no longer limited to broken physical infrastructure. They are increasingly tied to the complexity of the systems used to run, coordinate, update, and recover that infrastructure.
The most alarming number in the report is that IT and networking issues accounted for 23% of impactful outages in 2024. Uptime Institute links these increases to growing IT and network complexity; the long-term shift toward colocation, cloud, and third-party digital services; and the resulting increase in change-management failures and misconfigurations. That number is more than a statistical footnote. It points to a structural change in how outages happen and why cloud outages are becoming such a stubborn problem.
Hardware redundancy can protect against component failures, but it doesn’t help much when the outage stems from a bad configuration, an automation error, a faulty network change, or an underappreciated control-plane dependency. In those cases, the infrastructure itself may remain intact while the system that governs it breaks down. The industry is learning that resiliency is less about duplicating equipment and more about managing complexity. Today’s increasingly distributed and software-defined environments cannot operate safely at scale.
Failures at the operational level
Uptime’s findings show that power remains the leading cause of major outages, underscoring that traditional infrastructure engineering still matters a great deal. But even as providers continue to improve physical resilience, outages can still arise from the digital and procedural layers above it. Cloud platforms are now dense stacks of services, APIs, orchestration systems, software-defined networks, identity controls, failover logic, and third-party dependencies. That complexity creates more possible points of interaction and more opportunities for an error in one layer to cascade into several others.
This helps explain why outages can feel more surprising today than they did a decade ago. In older data center models, an outage often had a more apparent root cause, such as a power event, a cooling failure, or a hardware fault. In cloud environments, the trigger may be a small configuration change that propagates across regions, a policy update that unintentionally blocks service communication, or a network control failure that affects seemingly unrelated services. These are not failures of raw infrastructure capacity. They are failures of complexity management.
The report’s language around change management and misconfiguration is especially important because it challenges one of the most common assumptions in the cloud market: that scale automatically produces better operational outcomes. The reality? Scale can magnify both strengths and weaknesses. Large cloud providers have more engineering talent, more sophisticated tools, and more redundancy than almost any enterprise customer. But they also run far more interconnected systems at far greater speeds with far more automation. A single process failure can have a wider blast radius.
Another important lesson from the Uptime analysis is that automation has not removed the human factor. If anything, it has changed its form. Even in highly automated environments, human error remains central to the problem. The report notes that in 2025, the share of outages caused by human failure to follow procedures rose by 10 percentage points compared with 2024. A related industry summary of the report notes that 58% of human error-related outages were caused by staff failing to follow established procedures.
That matters because cloud providers often position automation as the answer to reliability. Automation is essential, but it only works as well as the operational model that surrounds it. If teams deploy changes too quickly, rollback paths are weak, approval chains are bypassed, or procedures are incomplete, automation can accelerate failure rather than prevent it. In a modern cloud environment, a human mistake is rarely just a single keystroke. It is more often a design weakness in process, governance, testing, or accountability.
This is also why customers should resist the comforting notion that outages are somebody else’s problem once workloads move to the cloud. Provider-side mistakes remain real, but customer architectures are increasingly entangled with provider networking, identity, observability, and platform services. When an outage occurs, the customer may not have caused it, but they still bear the business impact. The shared responsibility model does not end with security. It extends to resilience planning as well.
Better change management
The Uptime data points to a clear conclusion: Cloud providers need to treat operational discipline as a first-class design requirement. That starts with better change management. High-risk changes should be tested more aggressively, staged more gradually, and accompanied by stronger rollback mechanisms. Providers also need better dependency mapping to understand how a change in one control layer can affect services far beyond its immediate scope. If the system is too complex to clearly explain, it is too complex to operate.
Providers also need to improve procedural quality. The rise in outages caused by failing to follow procedures suggests that procedures are being ignored under operational pressure or that they are too cumbersome, outdated, or unclear for real production conditions. Neither explanation is comforting. Stronger runbooks, better training, more realistic failure drills, and tighter operational guardrails are not glamorous investments; they are increasingly central to resilience.
Another pressure point is visibility. Uptime notes that software-based and distributed resiliency tools can improve availability, but they also introduce new risks and complicate root-cause analysis. Cloud providers need more transparent and faster incident diagnosis, not just more layers of abstraction. Customers cannot build trust in resilience if every major incident becomes a long exercise in reconstructing opaque service dependencies after the fact.
Design with outages in mind
What’s the financial impact of more frequent problems? Uptime’s 2024 analysis found that 54% of respondents reported that their most recent significant outage cost more than $100,000, and 20% said it cost more than $1 million. These are not edge-case losses. They show that outages remain costly even if they are less frequent than in earlier years.
Customers need to stop evaluating cloud resilience through uptime promises and start evaluating it through failure behavior. How does a provider isolate faults? How transparent is incident communication? How portable are workloads if a major service degrades? How dependent is the architecture on a single region, network path, identity service, or control plane? These are not just technical questions; they are now critical business questions.
The core lesson from Uptime’s data is simple. Outages are becoming a bigger problem for cloud providers and customers because the cloud’s biggest vulnerabilities are increasingly tied to complexity, process failures, and control-plane mistakes, not just broken infrastructure. In addition to adding redundancy, the next phase of cloud improvement will focus on building systems that are easier to understand, safer to change, and more disciplined to operate.
Microsoft open sources AI evaluation framework for enterprise agents 11 Jun 2026, 7:36 am
Microsoft has open-sourced an AI evaluation framework that converts natural-language requirements into executable tests, expanding its push into enterprise AI governance as organizations struggle to validate agent behavior before production deployments systematically.
The framework, called ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), generates evaluation scenarios, datasets, metrics, and scorecards from written specifications, product requirements, and governance documents, Microsoft said in a blog post announcing the release.
“Agents fail in ways that are hard to see,” Microsoft wrote in the blog post. “They drift from policy, produce unsafe outputs in edge cases, and behave differently in production than they did in testing. Generic benchmarks do not catch these failures because they are not built around your policies, your agent, or your use case.”
Rather than requiring developers to manually create evaluation suites, ASSERT translates written intent into reusable tests that can be integrated into AI development pipelines, the company said in the blog post.
With ASSERT, Microsoft is entering an increasingly competitive AI evaluation market that already includes platforms such as LangChain’s LangSmith, Braintrust, Patronus AI, Galileo, Arize AI’s Phoenix, and Promptfoo, which help enterprises benchmark, monitor, and validate large language model applications.
Behavioral testing remains immature
The release comes as enterprises rapidly expand AI agent deployments while formal evaluation practices remain the exception rather than the rule.
“Most organizations, in fact, 99% of them, do not evaluate any AI agents pre-production,” said Anushree Verma, senior director analyst at Gartner.
According to Verma, the industry’s next competitive advantage will depend less on advances in reasoning models than on how effectively organizations simulate and stress-test AI agents before deployment.
“The next competitive moat in agentic AI is not about the sophistication of reasoning models or the underlying architecture,” she said. “It will be about the depth and realism of the training environment through agentic simulation, particularly for mission-critical deployments.”
Gartner estimates that by 2029, more than 75% of domain-specific agents designed without agentic simulation in regulated industries will fail to deliver value.
Forrester sees enterprises moving toward behavioral evaluation but says most organizations have yet to make it a formal production requirement.
“Most enterprises are still in an intermediate stage where behavioral evaluation is inconsistently applied rather than treated as a formal production gate,” said Biswajeet Mahapatra, principal analyst at Forrester.
According to Forrester data, more than 45% of organizations are already using AI agents, and another 25% are piloting them, yet many continue to struggle with scaling because of immature governance and limited operational rigor.
“The net is that behavioral evaluation is becoming important, but for most organizations it is still ad hoc or tool-driven rather than a standardized release gate enforced across the lifecycle,” Mahapatra said.
AI judges still need human oversight
Microsoft said ASSERT uses large language models as judges, with model-generated evaluations agreeing with human reviewers 80% to 90% of the time in the company’s internal validation.
That level of agreement can help automate large portions of AI testing, but should not be treated as a standalone governance mechanism, Mahapatra said.
“An 80% to 90% agreement rate with human reviewers indicates strong alignment but is not sufficient as a standalone control for governance or compliance,” he said.
Instead, enterprises should adopt layered oversight where AI evaluates AI at scale while humans retain supervisory accountability for high-risk, regulated, or ambiguous scenarios. Buyers should also watch for bias, consistency issues, and overreliance on a single model acting as both generator and evaluator, he added.
Open source reduces lock-in, not governance risk
Microsoft released ASSERT under the MIT open-source license, allowing organizations to inspect, modify, and integrate the framework into existing AI development workflows.
But open sourcing a framework does not eliminate questions around evaluation neutrality, Mahapatra said.
“Open sourcing under an MIT license reduces lock-in concerns and enables broad interoperability across model ecosystems,” he said. “However, it does not fully eliminate trust or conflict-of-interest questions because the originating vendor still influences how evaluation criteria, scoring logic, and definitions of acceptable behaviour are encoded.”
Instead of relying on a single evaluation framework, enterprises should validate AI systems against multiple evaluation approaches and retain ownership of internal evaluation policies, he said.
Databricks’ OpenSharing targets the ‘integration tax’ of enterprise AI 11 Jun 2026, 7:04 am
Databricks on Wednesday unveiled OpenSharing, a new open protocol designed to let enterprises share AI models, agent skills, dashboards, and unstructured data across platforms without having to copy or move those assets.
That sharing is made possible by OpenSharing’s zero-copy credential vending model that allows recipients to securely access shared assets directly from a provider’s cloud storage using temporary, scoped credentials rather than requiring the assets themselves to be copied, moved, or replicated, the company wrote on its GitHub page.
Reducing the integration tax of enterprise AI
The ability to share AI assets without creating duplicate copies could help reduce integration complexity, improve governance, and limit the operational overhead associated with operationalizing AI systems across environments for CIOs, said Ashish Chaturvedi, leader of executive research at HFS Research.
“Every organization building AI, such as multi-agentic systems, is hitting the same wall, i.e., the model, the skill, and the consumer reside on three different platforms. The integration tax is enormous, and it grows exponentially with every new partner, customer, or internal team,” Chaturvedi said.
Echoing Chaturvedi, The Futurum Group’s lead of the CIO practice, Dion Hinchcliffe, pointed out that the reduction in operational overhead could help CIOs cut down on the hidden costs of integration around AI deployments: “Today, hidden costs include more than just model development. It is the endless packaging, translation, sync, and governance effort required to operationalize AI assets across organizational boundaries.”
From data sharing to AI asset sharing
That cost reduction is becoming even more important because enterprises are beginning to treat AI assets as business assets that need to be shared, said Stephanie Walter, practice lead of the AI stack at HyperFRAME Research.
“Enterprises are quickly realizing that the value is no longer just in the dataset. It is in the governed context, logic, and intelligence built around the dataset. Existing approaches can share datasets well, but they often do not address the broader AI package,” Walter said.
“OpenSharing is directionally aligned with that shift because it extends the sharing model beyond tables and files toward the artifacts that power AI workflows,” Walter added.
For Hinchcliffe, that alignment should work in CIOs’ favor, trying to operationalize AI across their systems: “CIOs increasingly want AI supply chains, not isolated data lakes like before.”
Additionally, Chaturvedi pointed out that the new protocol can help CIOs accelerate the monetization of AI investments.
“For CIOs, the speed at which you can share AI assets across partners, subsidiaries, and customers determines the speed at which you can monetize your AI investments. If sharing an agent skill takes six weeks of integration work, you’ve lost the window. If it takes a protocol call, you’ve turned AI into a distribution business,” he said.
How OpenSharing could simplify AI development
Achieving those benefits, however, will require developers to move AI assets across disparate platforms more efficiently, and analysts pointed out that OpenSharing’s ability to reduce integration complexity could significantly improve productivity.
“Developer productivity depends on reducing platform translation work. Developers do not want to rebuild the same asset for every consuming environment, and enterprises do not want every partner or customer interaction to become a platform migration conversation,” Walter said.
In fact, Chaturvedi sees the new protocol as unique in the industry, in the specific sense that “no other open protocol covers agent skills and AI models as shareable, governed objects”.
Walter, in contrast, sees the openness of the protocol as novel: “What is more interesting is the combination: an open protocol, cross-platform interoperability, Linux Foundation governance, and a broader asset model that extends beyond datasets into AI models, agent skills, dashboards, applications, and unstructured data.”
“The novelty is not that Databricks invented sharing, zero-copy access, or marketplace-style distribution. Those capabilities already exist in various forms across the market,” Walter said, pointing towards Snowflake’s offerings, such as the Zero-Copy integrations.
The difference, though, the analyst noted, is that Snowflake allows data to be copied only if both provider and receiver are on Snowflake.
With Databricks’ OpenSharing, data can be copied across platforms, Walter added.
OpenSharing, which is an evolution of Databricks’ existing Delta Sharing protocol, is currently a sandbox project under the Linux Foundation AI & Data Foundation and is available via GitHub.
Its current list of generally available connectors includes Python, Apache Spark, Tableau, PowerBI, Snowflake, DuckDB, Clojure, Node.js, Java, Arcurate, Rust, Go, C++, and R.
Other connectors that are expected to be made generally available soon include Google Spreadsheet, Excel, Airflow, and Lakehouse Sharing.
It’s crunch time for Java modernization 11 Jun 2026, 4:00 am
Between 2029 and 2032, every currently supported long-term support (LTS) version of Java will reach end-of-support within a single three-year window: Java 17 in 2029, Java 8 in 2030, Java 21 in 2031, and Java 11 in 2032.
On paper, this looks like a manageable upgrade cycle. In practice, it creates a collision of timelines that most enterprises have failed to forecast. Organizations attempting to modernize incrementally—moving application by application, version by version—are operating on a model that the calendar has already rendered obsolete.
The primary danger here is the illusion of time. Traditional modernization plans rely on sequential upgrades and controlled pacing. However, when every major Java version expires in the same compressed window, sequential planning collapses. By the time this becomes obvious, organizations will be forced into reactive mode, making rushed decisions under extreme pressure.
The modernization illusion
For organizations planning traditional stepwise upgrades—Java 8 to Java 11 to Java 17 to Java 21—this convergence elevates a routine maintenance task into a structural crisis. Enterprises with large Java estates will be forced to upgrade multiple applications across multiple versions simultaneously to maintain security compliance and business continuity. Waiting until the late 2020s to act guarantees a modernization process under emergency conditions.
While modern Java versions maintain strong backward compatibility, they cannot offset the drag of what enterprises are carrying forward: decades of accumulated technical debt.
In large Java environments, technical debt is pervasive. It exists as unused libraries, obsolete logic, forgotten dependencies, and dormant features—quietly inflating the size, risk, and complexity of every modernization effort. In many organizations, a significant portion of the codebase no longer executes in production, yet it still consumes developer attention, security oversight, and planning effort.
As codebases grow older and larger, this drag compounds. What looks like a simple version upgrade on a roadmap becomes a massive operational burden in practice.
Why incremental planning fails
Most modernization strategies assume that upgrades can be sequenced and absorbed gradually. That assumption is now dangerous. When multiple Java versions reach end-of-support in the same narrow window, enterprises don’t face a single modernization project—they face parallel modernization across their entire estate.
This shifts the challenge from engineering complexity to organizational capacity.
Consider a typical enterprise with 100 developers. If even a fraction of their time is spent maintaining, investigating, or working around unused and obsolete code, the organization burns meaningful engineering capacity on work that delivers no business value. Multiply that across dozens or hundreds of applications, and the bottleneck becomes clear: modernization is limited by people, not frameworks.
Parallel modernization requires parallel capacity—something most organizations haven’t budgeted for.
This explains why traditional approaches struggle to scale. Tools that analyze code in isolation cannot distinguish what actually matters in production. Without clear visibility into what code is relevant, organizations default to caution, effectively converting their timelines into risk.
The real bottleneck: developer capacity
The Java modernization crunch is a crisis of resource allocation, not a technology problem.
Every hour developers spend maintaining obsolete code or investigating unused dependencies is an hour lost to modernization. When organizations face simultaneous upgrades across multiple applications, human capacity becomes the limiting factor. Sequential planning and parallel modernization require the time and capacity most enterprises no longer have.
Organizations that delay action are consuming their flexibility rather than preserving it. Each year of inaction increases the volume of code that must be moved, reviewed, secured, and modernized within the same fixed window. By the time deadlines become unavoidable, the only remaining options are compression, shortcuts, and uncomfortable trade-offs.
A different way to think about readiness
The organizations that navigate this transition successfully will prioritize clarity over immediate upgrades.
Modernization at scale requires an accurate understanding of what actually matters in production before attempting to move it forward. Without that visibility, every upgrade effort inherits unnecessary complexity, consumes excess capacity, and introduces avoidable risk.
The goal is not simply adopting better tools, but reducing the structural load enterprises carry into modernization. Leaner systems modernize faster. Simpler estates scale better. Complexity compounds under time pressure.
The timeline is already set
The Java modernization crunch is a timing problem that is already locked in.
Enterprises that treat the next few years as business-as-usual will discover that sequential plans cannot survive compressed timelines. Those that confront technical debt now—before the pressure hits—will find the coming transition difficult but manageable. Those that don’t will face rushed decisions and permanent trade-offs.
By the time 2029 arrives, the window for gradual modernization will have closed. The calendar won’t wait for us to be ready.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
Build an agent? Sell an agent 11 Jun 2026, 4:00 am
Modern AI systems have evolved beyond the simple chatbots that quickly became popular. Now they use semantic tools to manage workflows and link machines to machines, providing a flexible and effective framework for the next generation of business automation. What you used to build in Microsoft’s Power Platform or construct inside Biztalk is now an agent, built around large language models (LLMs) that can parse both your data and the APIs that you want to use your data with, orchestrating workflows with a level of autonomy that traditional tooling can’t match.
That shift has offered new opportunities, much like those that came with business platforms like Microsoft Dynamics and Salesforce. Here, tools built to solve one set of business problems could be turned into applications that could be sold to other companies. What worked for you to solve one of your problems could now be an added revenue stream, sold through platform marketplaces that helped customers manage installations and customizations.
Agents are business applications now
Modern agents are much like those business applications. Often developed to solve a specific need, but quicky adopted by organizations and refactored to apply enterprise standards (using tools like the Agent Governance Toolkit and frameworks like Microsoft’s Agent Framework), they’re rapidly maturing and are ready to be shared more widely. The process of sharing needs to be curated and controlled, and, if possible, tied to a revenue stream.
There’s certainly some urgency here. Until recently, subsidized tokens have kept costs artificially low. Now companies like GitHub and Anthropic are moving to a more sustainable (for them) pricing model, increasing the cost of inferencing and squeezing companies’ AI budgets. As a result, switching AI projects away from a cost to a revenue source is high on CIOs’ agendas. If those tuned and trained agents can be sold on a marketplace, then that token budget can be justified.
Microsoft has always been a company built on partner relationships, starting with individual developers and working all the way up to the largest software companies and consultancies. That reach is key to helping partners extract as much value as possible from their agents, as it allows Microsoft to integrate its partner sales tools into its own products and services, as well as into other platforms.
Extending the Microsoft Marketplace for AI developers
We’re already familiar with many of Microsoft’s marketplaces, built into individual tools like Teams, into platforms like Microsoft 365 or Visual Studio, or into the Windows Store. Now the company is doing the same for AI developers, extending Microsoft Marketplace to software agents. Announced at Build 2026, the updated Microsoft Marketplace provides ways to publish code—apps and agents—developed across all of Microsoft’s development platforms, including Copilot Studio, opening the marketplace up to traditional and non-traditional developers alike.
Perhaps the most important aspect of this new Marketplace is its own intelligence, using context to expose your code to the right audience. If you’ve developed an agent for use with Microsoft 365, it will be exposed inside the Microsoft 365 Copilot Agent Store; for Visual Studio, in the Visual Studio Marketplace; or for Teams, in the Microsoft Marketplace. All of these are different views on the same back end, using AI to ensure that the relevant agents are displayed.
Replacing search with Intelligent Discovery
This is extended by another new service, Intelligent Discovery, which adds natural language support to search, using AI to infer user intent and highlight the most relevant tools. Building on the familiar metaphor of the search bar, the initial smart search model offers a freeform way to explore the Marketplace. While there are suggested prompts, they’re not necessary. The search tooling allows you to generate comparisons between tools using your own criteria, with the Marketplace AI generating views based on your requirements.
Microsoft’s aim here is to shift discovery from keywords to use cases, so that buyers can quickly get the tools they need without having to evaluate different solutions, before completing a purchase. By handing that aspect of the buying process over to Marketplace’s AIs, customers can go straight to trials or even to buying agents and applications.
For a tool like this to be successful it needs to be trustworthy. By building it on top of the same development frameworks as your agents, Microsoft can take advantage of the AI guardrails built into Microsoft Foundry as well as low-level tooling like the Agent Governance Framework. Restricting the intelligent search to the Marketplace catalogue reduces the risk of hallucination, as output is grounded in Marketplace data and metadata.
A developer-friendly marketplace
Microsoft is providing tooling to help developers get their listings right. According to Cyril Belikoff, Microsoft’s vice president of Commercial Cloud and AI, “We actually have a separate AI tool that we give to software companies to optimize their listings, called a listing optimizer, funny enough, and that listing optimizer reviews their listing and then provides them with particular guidance on how to best improve it, so that it can be best discoverable in today’s search world.”
You can expect the listing optimizer to be tuned to work with the new Marketplace tooling, but for now it still focuses on traditional search. As Marketplace is a B2B platform, there’s a lower risk of spam applications, but even so, Microsoft remains aware of the possibility of a new system being gamed, and will be rolling out Intelligent Discovery carefully, monitoring its performance as more customers get access over time.
Having a new discovery method is one thing; getting quality AI applications in the Marketplace is another. Microsoft is validating all code submitted, though the criteria will differ between target platforms. An agent built for Teams will be treated differently than one built on Microsoft 365’s WorkIQ. It’s an approach that allows Microsoft to support new standards as they become available.
Alongside its agent development tooling, Microsoft is rolling out a new set of guidelines and processes to help developers get ready to sell their agents. Hosted on GitHub, these offer code templates as well as a link to the App Advisor guidance tools.
Still gaps to fill
This first release of Intelligent Discovery is promising, but some key features are missing. With agent token costs an increasing problem for businesses, it would be nice to see tools that help predict costs, integrating with finops tooling. We’re living in an age of shadow AI, and putting the AI we used to buy with credit cards in Microsoft Marketplace is one way to shine a light on those shadows — bringing the necessary control and governance to AI purchases, and maybe even providing support for site licensing.
Microsoft Marketplace is becoming a useful resource for AI application developers. It encompasses the entire development life cycle: offering tools that can help you build agents, the models that you need to power your agents, and finally a way to monetize your work. There’s a longer-term opportunity here, for both the Marketplace and Intelligent Discovery to offer Model Context Protocol (MCP) interfaces, ensuring that tooling and tool discovery become part of the developer workflow, and making developers aware of new tools that might help solve a problem or simplify a task.
GitHub finally pulls the plug on automatic install script execution for npm 10 Jun 2026, 7:42 pm
The ability for attackers to leverage automatic install script execution in npm will finally come to an end when expected changes arrive from GitHub in July. Coders will still be able to enable the function, but the default setting will block it.
In V12, default settings are changing, GitHub said in its changelog, noting, “it turns an npm install behavior that runs automatically today into one you explicitly opt into.”
Specifically, the post said, “allowScripts defaults to off: npm install will no longer execute preinstall, install or postinstall scripts from dependencies unless they are explicitly allowed in your project. This includes native node-gyp builds; a package with a binding.gyp and no explicit install script still gets blocked, because npm runs an implicit node-gyp rebuild for it. Prepare scripts from git, file, and link dependencies are blocked the same way.”
Analysts, consultants, and users generally applauded the change, but said that it would only narrow the exposure to supply chain attacks instead of eliminating it.
Attacks likely to move elsewhere
Sonu Kapoor, maintainer for CVE Lite CLI in the OWASP Incubator Project, said that this change is likely to force the supply chain attacks that leveraged the automatic execution to move elsewhere.
“This does not eliminate npm supply chain risk, it removes a major automatic execution path,” Kapoor said. “Attackers can still move to other paths: malicious package code that runs at application runtime, compromised maintainer accounts, dependency confusion, typo-squatting, poisoned GitHub Actions workflows, malicious transitive dependencies, or stolen publishing tokens. This closes one very dangerous door, but it does not secure the whole house.”
Still, attacks leveraging the setting have been regularly used in supply chain attacks.
However Alan Parkinson, director of secure medical device firm Threat Detective, said more sophisticated attackers have already moved beyond this hole.
“The install script attack vector has been known for years,” Parkinson said. “Most security teams marked it as low risk and moved on to higher risk threats. What raised its profile wasn’t the technical exploitability changing, it was a run of high-profile victims and some threat actors openly chasing notoriety.”
He added, “the pre and post install scripts was never a clever attack vector to begin with. Running code from an install hook is crude and noisy, which is why it caused such visible damage. The more capable actors are already moving to other methods, so v12 mainly shuts the door on less sophisticated threat actors.”
Although GitHub declined an interview, Zach Steindler, a GitHub principal engineer, answered InfoWorld’s questions by email. He said the volume and pace of supply chain attacks forced the default settings change.
“We’ve seen attackers target these capabilities to quickly propagate attacks from one compromised package to many. Years of security and usability research have shown that it’s not enough to make secure functionality available; the secure path has to be the default in order for it to be widely adopted,” Steindler said.
He added, “we believe that these changes are a great way to provide high impact secure defaults while still providing the option for some users to fall back on functionality they might need in some circumstances.”
Change overdue
Sanchit Vir Gogia, chief analyst at Greyhound Research, said that GitHub was the last of the repositories to make the setting default change. “Rivals moved first: Yarn, pnpm and Bun all block third party install scripts by default in their own ways,” Gogia said. “Npm is not inventing a new doctrine. It is finally adopting one.”
Steindler didn’t dispute Gogia’s comment.
“It’s not easy being the stewards of the largest package repository in the world. Community consensus on what security capabilities should be standard, and when it’s okay to make breaking changes shifts over time. From our continual conversations with the community, it was clear it was time to make this change,” Steindler said.
“The recent attacks are alarming,” he noted, “but stewarding these package repositories is a multi-decade effort, not just a moment in time. As attacks evolve, so will our defensive security capabilities. We’re in this for the long haul.”
Gogia said that the change, although overdue, is a good one.
“Npm is removing one of the most comfortable hiding places for software supply chain risk: code that executes the moment a developer types install,” Gogia said. “With npm v12, execution becomes something that must be approved, recorded in the project, and committed for review. That is not a design adjustment. It is a change in control philosophy.”
Bad defaults become infrastructure
Gogia had his own take on why GitHub waited so long.
“Npm waited because its risky default acquired a constituency. As far back as 2016, npm’s own position was that the convenience of install scripts outweighed the worm risk, with an opt-out flag for the cautious. The trade-off was a documented product decision, not an oversight,” he said.
“The trouble with bad defaults is that they become infrastructure,” he added. “Native module builds, browser installers such as Playwright and Cypress, Electron download flows and Husky hooks all grew around automatic execution. Turning it off became less a technical adjustment and more a constitutional reform.”
Liability changed hands
The real pressure for the change, however, came from regulators.
“The deeper answer is that the liability changed hands. Once regulation such as the EU Cyber Resilience Act and securities disclosure rules placed supply chain failure on corporate balance sheets, a documented unsafe default became indefensible,” Gogia said.
Kapoor agreed that long-used procedures enabled this security hole to survive longer than it should have.
“The reason this likely was not done long ago is compatibility,” he said. “Install scripts are not only used by attackers. Many legitimate packages use them to compile native modules, download platform-specific binaries, generate files, or complete setup steps. Changing the default breaks assumptions that have existed in the npm ecosystem for years. That is why these security changes often arrive slowly. The safer default is obvious from a security perspective, but painful from an ecosystem compatibility perspective.”
In addition, he noted, “the bigger point is that package managers are moving from implicit trust to explicit trust. That is the right direction. Developers should have to approve which dependencies are allowed to execute code during install. But approval cannot become a blind checkbox. Teams need visibility into which package wants to run a script, whether it is direct or transitive, why it is there, and whether it belongs in the project at all.”
Kapoor added that this change matters because install-time execution often happens in privileged environments with access to tokens, secrets, internal registries, build artifacts, or deployment paths. “Even if the script does not compromise production directly, it may be able to steal enough context to support the next stage of an attack,” he said.
Value in the pain
Cybersecurity consultant Brian Levine, executive director of FormerGov, agreed that the closing of this security hole is a very good thing.
“It seems like virtually every major supply chain attack of the last decade has had the same original sin: code that ran automatically because the ecosystem let it. Npm finally closing that door by default is overdue, but it’s genuinely significant. This is the package manager for hundreds of billions of downloads a month,” Levine said.
“When npm changes its defaults, it changes the security posture of practically every enterprise dev environment on the planet. It may have been the last large code repository to still allow this kind of automated execution.”
Levine added that this change might not merely stop a security hole, but the new process may meaningfully improve security.
“There’s actually something valuable buried in this migration pain. Having developers explicitly approve which packages can run code and commit that list to source control is a form of software supply chain governance that many organizations never had,” Levine said. “It creates an auditable record which is meaningful, especially for regulated industries.”
EU rules on securing IT products could affect open source software users beginning this week 10 Jun 2026, 5:09 am
Too many enterprises remain ignorant of the European Union’s 2024 Cyber Resilience Act, the first elements of which enter force on June 11, according to a new survey.
Two-thirds of respondents to the survey by Open Source Security Foundation said they were unfamiliar with the CRA, which aims to make hardware and software sold in the EU more secure.
As well as the CRA’s demands on vendors, it also has implications for users of open-source software, hence the Foundation’s interest in the topic. Among other measures, the CRA creates the role of open-source steward within the enterprise, with responsibility for ensuring that a security policy is in place for any software being used within the organization.
The first part of the CRA to enter force, on June 11, concerns the designation of conformity assessment bodies by member states. Then, from September 11, manufacturers will be required to begin reporting vulnerabilities in their products to the relevant authorities. The remaining obligations under the Act, which include substantial financial penalties, will apply from December 11, 2027.
The impending sanctions seem not to have concerned businesses: 56 percent of respondents to the OpenSSF survey were unaware that non-compliance fines could reach €15 million or 2.5 percent of global annual turnover.
The lack of knowledge about the implications of the Act surprised OpenSSF CTO Christopher Robinson. “We’ve been speaking on this topic for some time and we’re scratching our heads on why more companies are not aware of the implications of the Act,” he said.
Global concern
He surmised that some companies don’t think EU regulations on hardware and software security apply to them — but such concerns will soon be a global matter. “Other countries, like Japan, are considering similar laws,” he said.
One area of misunderstanding could be that the CRA applies to vendors, and their customers may think that the requirements under the Act didn’t apply to them. He said that this was a misguided approach, particularly when the CRA’s application to open-source software is taken into account.
“There are about 700 million projects in Git Hub. If you work for an organization like a bank, you have little idea which of those projects are being used,” he said.
Under the Act, software companies will have to supply a software bill of materials (SBOM) that has been passed as secure, he said.
Companies that supply US federal government organizations already face this requirement, he said: “If you’re selling to the US government — which is the largest customer on the planet – you should be providing an SBOM.”
Cybersecurity consultant Hans Study said that by addressing the supply chain issue, the CRA is a step in the right direction. “Almost every application has dependencies, whether that is free and open-source software, commercial packages, or some mix of both. The problem has always been responsibility, and the blame game that comes with it. What the CRA does is make it harder for companies to dodge that responsibility when they are building, selling, or placing products with digital elements on the market,” he said.
AI ignorance
According to Michael Callahan, VP of Cyber Strategy at Salt Security, one of the issues that could cause problems in the future is the growing use of AI in the software development process. “The Cyber Resilience Act assumes enterprises know what is in their software. That assumption breaks down when AI coding assistants are generating a significant share of code. An AI assistant has never read your organization’s security policies, your licensing obligations, or your open-source governance standards. The code it produces may contain dependencies, patterns, or vulnerabilities that your security team cannot easily trace back to a specific decision or a specific developer.”
Enterprises are quickly running out time to fix issues and many are pessimistic about their chances. According to the OpenSSF survey, only 41percent of manufacturers expect to be fully compliant by December 2027, while 39 percent do not know when they will be.
It may be that the proposed fines could concentrate minds. Robinson said that it could be like GDPR where a few heavy fines drew companies’ attention to the regulation. The upper limit on fines is per infraction, not per company, he said: “Something like that could wipe out an SME and seriously hit large corporations.” The legislation should be something that all businesses need to be aware of, but there is still a long way to go.
This article first appeared on CIO.
How to use virtual environments in Python 10 Jun 2026, 4:00 am
Of all the reasons Python is a hit with developers, one of the biggest is its broad and ever-expanding selection of third-party packages. Convenient toolkits for everything from ingesting and formatting data to high-speed math and machine learning are just an import or pip install away.
But what happens when those packages don’t play nice with each other? What do you do when different Python projects need competing or incompatible versions of the same add-ons? That’s where Python virtual environments come into play.
What are Python virtual environments?
A virtual environment is a way to have multiple, parallel instances of the Python interpreter, each with different sets of packages and different configurations. Each virtual environment contains a discrete copy of the Python interpreter, including copies of its support utilities (such as the package manager pip).
The packages installed in each virtual environment are seen only in that virtual environment and no other. Even large, complex packages with platform-dependent binaries can be corralled off from each other in virtual environments.
Why use Python virtual environments?
There are a few common use cases for a virtual environment:
- You’re developing multiple projects that depend on different versions of the same packages, or you have a project that must be isolated from certain packages because of a namespace collision. This is the most standard use case.
- You’re working in a Python environment where you can’t modify the site-packages directory. This may be because you’re working in a highly controlled environment, such as managed hosting, or on a server where the choice of interpreter (or packages used in it) can’t be changed because of production requirements.
- You want to experiment with a specific combination of packages under highly controlled circumstances, for instance to test cross-compatibility or backward compatibility.
- You want to run a “baseline” version of the Python interpreter on a system with no third-party packages, and only install third-party packages for each individual project as needed.
Nothing says you can’t simply unpack a Python library into a subfolder of a project and use it that way. Likewise, you could download a standalone copy of the Python interpreter, unpack it into a folder, and use it to run scripts and packages devoted to it.
But managing such cobbled-together projects soon becomes difficult. It only seems easier to do that at first. Working with packages that have binary components, or that rely on elaborate third-party dependencies, can be a nightmare. Worse, reproducing such a setup on someone else’s machine, or on a new machine you manage, is tricky.
The best long-term solution is to use Python’s native mechanisms for creating, reproducing, and working with virtual environments.
How to use virtual environments in Python 3
Python has native tooling for virtual environments that makes the whole process quite simple. This wasn’t always the case, but now all supported versions of Python use the native virtual environment tool, venv.
Create the Python virtual environment
To create a virtual environment in a given directory, type:
python3 -m venv /path/to/venv
For instance, to create the virtual environment in the current directory, using the subdirectory .venv type:
python3 -m venv .venv
On Microsoft Windows, you can use py instead of python3 to reliably access an installed Python version. (See this article for more about using the py launcher in Windows.)
The exact name for the virtual environment directory is arbitrary, but it’s typically .venv.
The whole process of setting up the virtual environment may take a minute or two. When it’s finished, you should see that directory with a few subdirectories in it. The most important subdirectory is bin on Unix or Scripts on Windows, which is where you’ll find the copy of the Python interpreter for the virtual environment along with its utilities.
Note that because each virtual environment contains its own copy of the Python interpreter, it can be fairly large. A Python 3.13 virtual environment will consume anywhere from 14MB to 26MB of disk space, depending on the operating system.
Venvs and version control
If you are setting up a virtual environment in a project directory managed with some kind of version control system (e.g., Git), exclude the environment directory from tracking before you make any commits after creating the venv.
Venvs should not be tracked along with their associated code, as they are meant to be destroyed and recreated as needed. You should, however, track the requirements.txt or pyproject.toml files associated with the project, as those are used to describe what gets installed in the venv for that project.
Activate the Python virtual environment
Before you can use this virtual environment, you need to explicitly activate it. Activation makes the virtual environment the default Python interpreter for the duration of a shell session.
You’ll need to use different syntax for activating the virtual environment depending on which operating system and command shell you’re using.
- On Unix or MacOS, using the bash shell:
source /path/to/venv/bin/activate - On Unix or MacOS, using the csh shell:
source /path/to/venv/bin/activate.csh - On Unix or MacOS, using the fish shell:
source /path/to/venv/bin/activate.fish - On Windows using the Command Prompt:
path/to/venv/Scripts/Activate.bat - On Windows using PowerShell:
path/to/venv/Scripts/Activate.ps1
Note that the activated environment only works for the context it was activated in. For instance, if you launch two instances of PowerShell, A and B, and you activate the virtual environment in instance A, that environment will apply only to A, not B.
Many Python IDEs will automatically detect and activate a virtual environment if one is found in the current project directory. Visual Studio Code, for instance, can do this when the Python extension is enabled. Opening a terminal inside Visual Studio Code will automatically activate the selected virtual environment. PyCharm automatically creates a virtual environment for each new project and enables it automatically.
Configure and use the Python virtual environment
Once you’ve activated the new virtual environment, you can use the pip package manager to add and change packages for it. You’ll find pip in the Scripts subdirectory of the virtual environment on Windows, and in the bin subdirectory on Unix OSes.
If you’re already familiar with the way pip works, you’re set. It should be just the same in a virtual environment. Just make sure you’re using the instance of pip that manages packages for the virtual environment in the context where it was activated—e.g., the bash session or Windows CLI/PowerShell session. If you want to verify that you’re using the right pip and the right virtual environment, type pip -V and check that the path it displays points to a subdirectory of your virtual environment.
Note that when you want to upgrade pip in a virtual environment, it’s best to use the command python3 -m pip install -U pip. This ensures the upgrade process is run in such a way that Python doesn’t lock crucial files. The command pip install -U pip may not be able to complete the upgrade properly.
To use the virtual environment you created to run Python scripts, simply invoke Python from the command line in the context where you activated it. For instance, to run a script, just run python3 myscript.py.
With PyCharm, you can use the IDE’s own package management interface to manage the packages installed in your project.
Managing packages in Python virtual environments
When you create a new virtual environment, pip will be installed, but that’s all. You’ll need to install any other packages you want to use in the environment. For projects with complex requirements, it is customary to keep in the root of the project a requirements.txt file that lists the requirements for the project. This way, if you need to recreate the virtual environment, you can reinstall all of the needed packages with the command pip install -r requirements.txt.
More recently, a new project metadata format has emerged for Python projects, called pyproject.toml. A pyproject.toml file contains the package requirements of the project, but also a great deal of other information about it. To install those requirements, you’d run pip install . in the same directory as the pyproject.toml file.
Note that the copy of pip that lives in a virtual environment is local to that virtual environment. Each virtual environment has its own copy, which will need to be updated and maintained independently. This is why you may get warnings about pip being out of date in some virtual environments but not others; pip has to be updated in each virtual environment separately.
Deactivating the Python virtual environment
When you’re done using the virtual environment, you can just terminate the session where you were using it. If you want to continue to work in the same session but with the default Python interpreter instead, type deactivate at the prompt. Windows users on the Command Prompt need to run deactivate.bat from the Scripts subdirectory, but Unix users and Windows users running PowerShell can simply type deactivate in any directory.
Removing the Python virtual environment
Virtual environments are self-contained. When you no longer need the virtual environment, you can simply delete its directory. Just make sure you first close any running copies of Python that use the virtual environment.
Whenever you want to refresh or recreate a virtual environment, you can simply delete the current environment directory and recreate the venv as described above (see “Managing packages in Python virtual environments”). This is typically the easiest way to upgrade a project to a newer version of Python: delete or temporarily re-name the venv directory (in case the new version doesn’t work), then create a new one and install the project requirements into it.
If you have many older projects that aren’t being actively used, you can remove their environments to save space. They can be recreated easily whenever needed, although you will want to note which version of Python was used to create them along with the needed packages (typically recorded in requirements.txt or pyproject.toml).
Relocating the Python virtual environment
It’s tempting to assume a virtual environment can be copied and moved around along with its project. Don’t do this. Virtual environments are tied to the location of the Python installation on the system where they’re created. If you want to move the project to another system, leave out the venv directory, and recreate the venv on the target machine. Do copy and move the requirements.txt or pyproject.toml file with the project, because those files are needed to recreate the venv on the other system.
How to use virtual environments in Python 2
With Python 2, virtual environments aren’t a native feature of the language. Instead, you need to install third-party libraries to create and manage virtual environments.
The most popular and widely used of these projects is virtualenv, which handles creating the directory structure and copying the needed files into a virtual environment. To install virtualenv, just use pip install virtualenv. To create a virtual environment directory with it, type virtualenv /path/to/directory. Activating and deactivating the virtual environment works the same way as it does for virtual environments in Python 3 (see above).
Note that Python 2 should not be used for any new development. Virtual environments in Python 2, like Python 2 itself, should be used only for the maintenance of legacy projects that should eventually be migrated to Python 3.
Virtual environments and the main Python package directory
Normally, when you create a venv, it cannot work with the packages already present in the Python installation that created it. This is by design, as it ensures that packages installed globally don’t interfere with local ones.
You can override this behavior, but only when you first create a virtual environment. If you pass the flag --system-site-packages to venv when you run it, the created venv will have access to the parent Python’s package directory.
Using Python virtual environments with Jupyter notebooks
If you’re using Jupyter notebooks (aka IPython notebooks), and you already have Jupyter installed systemwide, create your virtual environment and activate it. Then, from your virtual environment directory, run pip install ipykernel to add the needed components for IPython. Finally, run ipython kernel install —user —name=, where project_name is a name you want to associate with that particular project. From there you should be able to launch Jupyter and switch to the IPython kernel you installed inside the virtual environment.
Upgrading Python virtual environments
When you upgrade a Python runtime on your system, virtual environments that use that version of Python aren’t automatically upgraded. That’s your responsibility. And that’s by design, because unwitting upgrades to Python versions can break their attendant packages.
If you’ve upgraded an existing Python interpreter with a minor point upgrade—e.g., from Python 3.13.1 to Python 3.13.3—you can upgrade any corresponding virtual environments easily enough. From a command prompt in the project directory, enter:
python -m venv /path/to/venv --upgrade
Don’t activate the virtual environment beforehand, or the upgrade may not work.
Alternatively, as noted above (see “Removing the Python virtual environment”), you could elect to remove the venv completely and recreate it using your requirements.txt or pyproject.toml file.
If you’ve installed a major new version of Python—e.g., you already have Python 3.10 and you now install Python 3.11 alongside it—you’ll need to create a new virtual environment that specifically uses the new major point version. Do not attempt to upgrade an existing virtual environment to a higher major point version of Python.
The tokenmaxxing backlash is coming 10 Jun 2026, 4:00 am
I’ve been around long enough to remember when deploying an application meant copying a *.exe file from the developer’s machine right into production. I am not making this up. It was that simple, and that fraught with peril. Applications weren’t complex — they were often not anything more than that simple *.exe file — and the process around deployment didn’t need to be anything complex, but it probably should have been.
Proper deployment of an application is something we’ve learned to do over the years. The process of properly building, testing, and deploying an application has grown more complex for two reasons. First, the process must ensure that every deployment succeeds. Deploying complex applications can be convoluted and challenging, and a strict deployment process ensures everything happens properly and runs correctly. Second, the process must thoroughly test the application to make sure that all the moving parts work together to create a properly functioning application.
Today’s continuous deployment processes were hard-won from many lessons learned. Eventually, these practices became formalized, even to the point where the Sarbanes-Oxley Act was understood to require that IT departments formally document their deployment processes.
This kind of governance is what separates professional software development organizations from those that, well, don’t know what they are doing.
Agentic growing pains
Agentic development is headed in this same direction, but it’s all happening a bit more quickly. It was just a few months ago that people began to use AI to write code seriously. At first, most of us were doing it furtively, having Claude Code find and fix bugs, and then quietly checking in the solutions. Maybe we were a bit hesitant to mention that we had done this, but then we felt guilty about doing it and taking credit, and eventually we mentioned it. But it soon became apparent — like “within a week” soon — that Claude Code was up to the task, and we became pretty open about it.
Very quickly, it not only became accepted but actually encouraged, and we were off to the races. In a month, everyone was tokenmaxxing.
It almost seems a bit out of control. Sure, there is a lot of code being generated and non-trivial amounts of money being spent, but it isn’t quite clear if the results are worth the effort. I’m not at all sure if anyone can say that the money spent is returning the value needed.
At some point, as an industry, we are going to have to get control of all of this. I fear that there will be a rush to impose governance over the whole thing. This, like deploying an .exe directly, is also fraught with peril. Right now, there appears to be little control over what tools are being used where, how much is being spent for what purpose, and what that spending is actually getting us.
The governance over our deployment processes was successful and useful because it arose organically. The accepted, codified procedures arose from the lessons that practitioners learned by actually building and deploying applications. We all should work to ensure that a similar process happens with agentic coding.
Developers know best
Because agentic coding is happening so quickly and so furiously, the danger is that a governance process will be imposed over the top just as quickly and furiously. I want to encourage us all to take a deep breath, slow down a bit, and take a close look at what we are doing, and more importantly, how we are doing it.
Agentic coding will be governed in some manner. It’s critical that we practitioners take the lead in providing that governance, or we’ll have governance forced upon us. We are the ones that know what matters and how the tools should be used. We are the ones with skin in the game and the ones keeping pace with the technology.
Top-down governance of a technology moving this fast will never be able to keep up. Or as Uri Haramati, co-founder and CEO of Torii says, “The person closest to the tool usually understands why it’s being used, and governance works better when it includes those people instead of trying to control them.”
Copying that *.exe file into production is comically reckless. We don’t do that anymore because we know better. It took time, but we learned the right way to deploy software. Right now, we are in the “copying the *.exe” phase of agentic coding, and we need to figure out the right way to do it before someone comes along and does it for us.
Enterprises know AI-generated code is vulnerable; they’re shipping it anyway 9 Jun 2026, 10:01 pm
AI-generated code is riddled with security flaws, yet enterprises are shipping more of it than ever before. Why? Perhaps they’re over-confident, lack true visibility into security risks, or are simply choosing to ignore the problem and hope it goes away.
It’s a dangerous game to play at the dawn of the agentic AI era, as underscored in a new report from app security company Checkmarx.
The survey of thousands of security leaders exposes an underlying naivete about AI-built code and its vulnerabilities, even as tools like Anthropic’s Mythos are uncovering security flaws orders of magnitude faster than any human security team could ever hope to.
“Mythos-class models collapse the window between a vulnerability existing and a working exploit being available from months to minutes,” the report notes. Enterprises relying on traditional security tools and methods, it says, “cannot survive this reality.”
Security as an afterthought
Checkmarx’s survey of 2,350 CISOs, AppSec managers, and developers across 14 countries focused on how much AI-developed code enterprises are deploying, the vulnerabilities it introduces, how it impacts developer workflows, and overall sentiment about AI code and security posture.
Today, nearly half of production code is AI-generated, and the majority of enterprises also report that at least half their codebase is made up of open-source components, according to the report.
But the more AI-generated code that is pushed out, the more vulnerabilities are exposed. Enterprises who said 81% – 100% of their code is built by AI ship vulnerable code 3.4 times more often than businesses using AI more conservatively, relying on 20% or less AI code.
Additionally, 70% of developers said that AI code generation created vulnerabilities in 2025, and almost all enterprises surveyed (93%) had at least one security breach as a direct result of in-house developed apps.
Still, risk is becoming “normalized,” the report notes, with three-quarters of enterprises knowingly deploying vulnerable code as they face increased pressure for ROI. Startlingly, about 30% of respondents admitted they ship compromised code and hope the vulnerability won’t be found. Similarly, more than a third of organizations leave half of their known vulnerabilities unfixed for 90 days or more.
The report points out that the organizational bottleneck isn’t detection, “it’s the human decision to ship anyway, suppress the finding, or defer to the next sprint.”
Along with this, AppSec teams are often limited to reactive incident response as they deal with tool sprawl. And developers only continuously secure code a small percentage of the time (18%), even though nearly all are equipped with security tooling.
Ultimately, developers are “set up to fail,” the report contends. They face significant pressure to deliver, and are forced to choose quantity and speed over security. Yet, even as they face significant consequences when it comes to post-mortems, performance reviews, escalation, and blocked releases, the tools that contribute to security issues, delivering low-value findings, unclear guidance, or late feedback, continue to go unfixed.
“Developers remain accountable for outcomes, even when systems and workflows are not aligned to support them,” the report notes.
Overconfidence, outdated practices
Alarmingly, many enterprises seem to be deluded when it comes to their security posture. Of those that rate themselves as “highly mature” AI organizations, 42% often ship the most vulnerable code, and have breach rates “barely distinguishable” from other enterprises.
“Confidence isn’t protecting them,” the report notes. “It’s blinding them.”
Underscoring this, only 22% of organizations have formal AI governance, and developers still rely on manual code reviews to ensure their code meets compliance standards.
The result is a mismatch between the speed of software creation and the speed of governance, the report notes. “Compliance frameworks are evolving, but many organizations are still attempting to govern AI-scale development with processes designed for a slower era of software delivery.”
Strategic imperatives for enterprises
Enterprises do seem to have wised up (a bit) after Anthropic’s Mythos proved capable of not only discovering vulnerabilities across major operating systems and browsers, but exploiting them 100 times faster than previous Claude models. And the subsequent Project Glasswing almost immediately surfaced thousands of previously-unidentified security flaws.
Checkmarx’s survey, which, it should be noted, was conducted a month prior to Mythos’ arrival, found that enterprises are finally taking proactive measures, focusing more heavily on AI security threats overall, and investing more in DevSecOps practices, automation, and developer training.
The report emphasizes the importance of prioritizing risk over code volume; vulnerabilities should not be considered isolated incidents. Also, it’s critical to embed security into developer workflows rather than treating it as a checkpoint. Enterprises must have systems that reduce noise, provide clear guidance, and allow them to take action when an issue arises.
Security “must be integrated directly into how developers write, test, and ship code within the IDE, pipelines, and AI-assisted workflows where development now happens,” the report notes.
Similarly, enterprises would benefit by reducing fragmentation and tool sprawl and defining ownership of the AI tools. By simplifying security stacks, they can align responsibilities and ensure consistent tool use, according to the report.
Further, AI needs strong governance, and teams must move beyond outdated manual triage and “human-gated remediation.” AI can fight AI in a strong system built to prioritize, remediate, and resolve risk “without waiting for a human to approve each step,” the report notes.
Ultimately, it says: “Progress depends on embedding intelligence directly into workflows, enabling risks to be prioritized, remediated, and resolved, all within the systems that they operate in.”
This article originally appeared on CIO.com.
The GPU multitenancy mess 9 Jun 2026, 4:00 am
We’re seeing an interesting infrastructure tug of war today where GPU clouds are being pulled in two directions. For the economics of AI to work, the enterprise market needs to carve expensive hardware into smaller, shareable units and hand it to customers on demand, similar to how CPUs are doled in public cloud infrastructure. But the more the providers push GPUs to behave like elastic cloud infrastructure, the more they run into the reality that this GPU hardware was never built for safe multitenant use, fast fault recovery, or clean isolation between workloads. That tension is becoming one of the defining operational problems of the AI infrastructure market.
When a gamer launches Steam or the Epic Games Store on their laptop, they don’t have to worry about which GPU is being scheduled, how memory is going to be divided, or really any of the security boundaries or hardware assignment issues on their PC. For consumer PCs, these issues are not just hidden from view, they are irrelevant.
But for today’s IT teams managing GPU-driven AI workloads across distributed systems, those types of allocations and partitions need to be managed manually and carefully. This includes deciding which GPU to assign to each workload, how to divide memory, how to isolate tasks, and how to maximize utilization of this very expensive hardware. That’s why you heard so much about “Day 2” AI infrastructure themes around Nvidia’s GTC event this year.
The legacy hardware bottleneck
GPUs were originally developed to speed up graphic rendering and to perform local compute via shaders in service of graphical rendering. Their design assumes a trusted computing environment in which a single application controls the device. When a user runs an application on a GPU, it accelerates that particular workload.
But while GPUs are optimized for throughput, and therefore have thousands of simple cores designed to execute the same instruction over large datasets, this design paradigm creates several major technical limitations regarding context switching and memory isolation. GPUs were designed to produce pixels, not to run sensitive AI applications from multiple tenants using the same hardware.
As a result, GPU infrastructure today behaves less like elastic cloud infrastructure and more like carefully managed hardware appliances.
The partitioning paradox
Today’s AI infrastructure requires GPUs to behave like shared, elastic cloud resources. As inference workloads begin to outstrip large-scale training runs, the ability to slice and share expensive hardware among multiple tenants in real time while maintaining acceptable fault tolerance is no longer optional.
Hardware vendors have introduced new approaches to dividing GPUs into multiple isolated compute slices. Other frameworks approach partitioning through schedulers and container runtimes. But resource partitioning is just one slice of the overall GPU operations pie.
There is currently no widely adopted, cross-vendor operating model for achieving this safely at scale. Most providers are faced with either dedicating a single customer to a physical machine (thus wasting available capacity), or accepting the multi-tenancy security risks currently without a known solution. The current engineering challenge has moved from beneath the model layer and into the infrastructure layer. Success now relies on the ability to quickly launch new workloads and to rapidly contain hardware faults so that a single GPU failure does not bring down all workloads running on the server.
Untrusted code and tenants
The vast majority of current GPU programming models rely on the idea that the driver has complete control over memory protection and that no user will act maliciously. Unfortunately, that assumption completely falls apart in a cloud environment where one VM or container can leave behind data remnants in memory that another VM or container may access. Especially considering that how GPUs execute code is often completely opaque.
A single faulty workload or a single faulty driver failure in a shared GPU environment can also take down every workload (job) that was being run on the same server, increasing the amount of damage caused by an operational failure.
Currently, there are immature options for runtime inspection or behavioral auditing, limiting both visibility and control for security teams. GPU drivers provide a large attack surface and generally limited telemetry from the hardware. In these shared environments, embeddings, weights, prompts, and tokens are now all sensitive data points, creating significant blind spots for those attempting to protect intellectual property.
The high cost of cold starts
The real constraint in many GPU clouds is not model performance but operational efficiency. Right now GPU operations looks like 30 minute tenant spin-up times, 70% idle rates, and engineers continuously debugging infrastructure stacks. Today’s GPU clouds are stalled not due to inferior models, but because the infrastructure layer underpinning these clouds was never designed to support such a high degree of scale.
A 30 minute cold start is a fundamental limitation on the modern AI business model. Those GPU clouds that can spin up workloads in seconds will ultimately win against those that do not. Multitenancy is the only viable means of producing sufficient unit economics to make this very expensive hardware viable for the long term.
Bridging the orchestration gap
Platform teams are beginning to recognize that GPUs require a specialized operating layer between the hardware and workloads. Operators need a unified operating model that supports multiple hardware vendors and GPU models. Cloud providers need a method to safely slice and share servers among tenants, and to prevent cascading failures, while rapidly launching new workloads.
Enterprises are increasingly seeking to run sensitive AI applications with stronger isolation guarantees, and thus are turning to newer categories of software designed specifically to manage the “dirty work” of hardware orchestration.
The race to make GPUs more operational
Prior to Kubernetes standardizing container orchestration, the industry was constantly debating the efficiency of container scheduling and bin packing across clusters. Those operational concerns were eventually automated and incorporated into infrastructure layers that made the complexities invisible to the end user.
A similar evolution is occurring around GPUs today. While platform teams continue to argue over placement strategies and memory tuning, these decisions will likely be automated within five to 10 years. As AI infrastructure evolves, the most valuable layer may not be the GPUs themselves but the operating layers that make them secure, elastic, and efficiently sharable. So the winners in the AI race won’t necessarily be just those with the most silicon, but those who have the best operating models for making that silicon secure and elastic.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
8 cutting-edge web development tools you don’t want to miss 9 Jun 2026, 4:00 am
There is no ordained path. The hope that we were converging on some kind of consensus in web development has been eradicated by recent, ingenious developments that point in almost every direction. Yet, if there is a central theme uniting these efforts, it is the desire to mitigate the layers of liturgical embellishment that have grown up around the reactive canon. How can we look at things differently to attain the power that we need, without the heavy intricacy?
Here are eight cutting-edge web development tools that point the way.
Front-end maestro
If you put a bunch of classical musicians in a room together with sheet music and let them run, you might get to a cohesive piece—but you probably want a conductor, a maestro who coordinates all of the parts. That is Astro for your front-end frameworks.
Astro addresses the “hydration” of the front end, that is to say, the process of making the shell reactive. In conventional server-side rendering (SSR), like Next.js or Nuxt, the server not only sends the HTML, but also sends the massive framework runtime down the wire, just to attach event listeners to the page. Astro allows you to write components in React, Svelte, Vue, or Solid, and its compiler strips away all of the JavaScript before it reaches the browser. Astro ships zero JS by default, relying on its islands architecture to hydrate only the specific components that demand interactivity.
Because Astro isolates interactivity into distinct islands, sharing complex state between those islands (e.g., a complex filtering sidebar communicating with a separate dynamic data grid) is fundamentally harder than it is in a monolithic single-page application. If you are building a highly interactive, dashboard-heavy app where every component affects every other component, Astro’s isolated islands might begin to feel more like a straitjacket than a liberation.
See also: Qwik. If Astro unbloats by stripping away the JavaScript entirely, Qwik unbloats by delaying it. Qwik delivers instant HTML and serializes the application state, downloading and executing only the JavaScript code required for a specific interaction at the exact millisecond the user clicks a button.
Biome: Lint like it’s 2026
Rust is gradually replacing the underlying infrastructure in the JavaScript ecosystem. But while Rust gives Biome its close-to-the-metal speed, Biome’s true calling card is its unification of the sprawling toolchain under a cohesive umbrella.
The .eslintrc and .prettierrc files and the dozen associated plugins can become a dark and unhappy bog in a project. Biome is the way out of the mire. It is a single, blazingly fast binary that replaces your entire tangled formatting and linting ecosystem, providing a path to code quality that doesn’t require a sprawling web of dependencies.
Probably the biggest drawback to Biome is that you lose the wide-open extensibility—which is exactly the same feature that makes Biome lean.
See also: Rspack. Biome cleans up the linting. Rspack unbloats the build step. Also built on Rust for speed, Rspack challenges the new “unbundled” esbuild-based dev mode championed by Vite and uses bundled dev mode.
Bun: Fast and integrated back-end JavaScript
Most cutting-edge JavaScript enthusiasts are already well-aware of Bun. For those who haven’t yet experienced Bun’s enthralling blend of one-stop shopping and blistering speed first-hand, it’s a virtually irrefutable must-try.
Fast is probably an understatement. If you are used to Node and you try out Bun, you will likely be immediately impressed with the speed at which commands execute. The Bun team has also made an extensive, multi-year effort to bring its engine into close compatibility with Node’s APIs. Overall, Bun is an extraordinary engineering effort that every JS developer should explore.
However, while Bun’s Zig-based engine is in most respects a drop-in for Node, it isn’t perfect, especially when considering the gargantuan landscape of Node packages out there. Node remains the conservative, happy-path engine for server-side JavaScript.
See also: Deno. Although Bun has justifiably earned a reputation for bleeding-edge innovation, Deno has quietly pressed ahead with an appealing set of enterprise features like an integrated deployment platform and a front-end framework (Deno Fresh).
The Bun curious also may want to check out my interview with Bun creator Jared Sumner.
HTMX: Ajax KISS
If we are talking about clever ways to de-complexify the web, HTMX could reasonably be considered the poster-child. It takes the core mechanisms of the modern web client, like Ajax and partial updates, and turns them into simple HTML attributes. That means the state lives exclusively on the server, which is responsible for sending HTMX fragments.
Of course, there are trade-offs. Perhaps most unavoidable is the extreme dependence on the network. Because there is no client-side state machine, the browser will be orphaned and helpless without a connection to the server. That is, unless you get experimental with a local-first datastore.
Long story short: if your app falls into the realm of HTMX’s ability, HTMX is likely to be the most direct RESTful way to build it. And HTMX can in fact handle quite a lot.
See also: Hotwire. A collection of tools for building single-page-style applications using HTML over the wire, Hotwire has great features like page morphing, which can diff HTML instead of cold-loading it, with a simple import. True to classic “free as in speech” software culture, the HTMX and Hotwire projects freely exchange ideas.
PowerSync: Data layer redo
Although the local-first data revolution that PowerSync represents implies a fairly serious engineering deep dive, its core proposal — to entirely reshape the way data moves in web architecture — is something a web developer needs to be aware of.
Usually, we create architectures that require a complex middleware to broker between a reactive client and the datastore. PowerSync proposes a radical alternative: bypass the middleman entirely by dropping a robust SQLite Wasm database directly into the browser.
The UI works on local data using familiar SQL, synchronously. Latency is zero. The dreaded loading spinner vanishes entirely. In the background, PowerSync automatically reconciles your local store with your central Postgres database. It handles the complex syncing algorithms and network drops, effectively making your application offline-first by default.
The catch, of course, is that local-first development forces a massive mental shift. You have to define data slices (similar to a view) that each client user holds. The PowerSync engine does most of the hard work, but things like schema migrations and conflict resolution (when two users edit the same record while offline) require a significantly steeper initial setup than a standard REST API.
See also: RxDB. RxDB is a slightly different flavor of local-first datastore. Whereas PowerSync relies heavily on Postgres, SQLite, and background daemons, RxDB provides a NoSQL, offline-first, reactive database that treats queries as observable streams, pushing UI updates the exact millisecond the local data changes.
RooCode: Use any AI you want
The beauty of RooCode lies in its ability to orchestrate whatever AI providers you have—for free. RooCode is an extension to Visual Studio Code that provides an AI manager layer. This layer bridges between the general abilities of the LLM and your code-specific, project-level structures.
RooCode is strong enough to be somewhat agentic in its capabilities. It doesn’t reach the powerhouse abilities of something like Cursor or Antigravity, but it is quite able to handle most small to medium-sized requests. And it does so with a minimum of unnecessary overhead. I find myself often using RooCode alongside my AI-assisted IDE to knock out lesser requirements, for less cost and without interrupting the flow of ongoing epics.
RooCode keeps you free of proprietary ecosystems. It allows you to plug in your own API keys—whether that is Claude, OpenAI, or even a local model running on your own hardware.
The hidden tax of any AI coding assistant, however, is that it fundamentally shifts your job description from “writer” to “editor.” The unbloating of keystrokes can paradoxically lead to massively bloated codebases if developers blindly accept AI-generated boilerplate without actively reviewing its architectural impact. It is incredibly easy to let an agent spin up 500 lines of complex React when 50 lines of plain JavaScript would have done.
See also: Antigravity. RooCode is a lightweight extension that supercharges your existing environment. Google’s Antigravity is a custom-built editor designed from the ground up around AI, geared for agentic development workflows.
TanStack Query: Syncing made simple(r)
Even when client-side state management is addressed (see Zustand below), there is still a big, gaping hole in the plot: syncing across the server boundary. That is where TanStack Query steps into the breach.
Distributed computing is a notoriously thorny problem, and in fact our standard reactive model walks right into these thorns by holding the same state in two different places: on the client and the server. Tanstack Query tries to make this inherent architectural friction as painless as possible by acting as an intelligent asynchronous layer.
Instead of using a bunch of manual fetches tied to useState updates, along with fragile isLoading flags and complex state synchronization logic, TanStack Query abstracts the heavy lifting of API responses, background updates, and request deduplication into a few elegant hooks. You tell TanStack Query where to get the data, and it uses a pattern known as “stale-while-revalidate,” which means it will cache and reuse data on the front end (eliminating reload waits) and sync to the latest state in the background.
The catch, however, is that cache invalidation remains one of the hardest problems in computer science—and TanStack Query forces you to face it head-on. You will spend time thinking about “query keys” and deciding when a piece of data should be considered “stale.” No free lunches in software.
See also: SWR. While TanStack Query is an absolute powerhouse for complex data manipulation, SWR remains a champion of API minimalism, doing exactly what its name implies (stale-while-revalidate) with almost zero configuration.
Zustand: Minimalist state
If you have yet to encounter the monstrosity of large-scale state management in a reactive app, then spoiler alert: it can be nasty. Zustand proposes to dispense with the ceremonial boilerplate of reducers, providers, and unwieldy context wrappers in favor of a tiny, brutally simple global store.
Instead of forcing your entire application tree into a massive React context provider (sometimes leading to cascades of superfluous re-renders across the DOM), Zustand uses custom hooks to tie state directly to the specific components that need it. Zustand strives to achieve the specificity in the VDOM reactive model (instead of eliminating it entirely a la Signals).
You define a store, you call it, and the reactivity just works. It is an expression of the KISS philosophy applied to front-end architecture, scraping away the intricacies of Flux-like patterns. The trade-off for this liberation is the burden of discipline. Because Zustand is unopinionated, it won’t stop you from turning your global store into a cluttered junk drawer. You’ll need to impose your own conventions and guardrails to keep a large-scale project manageable.
See also: Jotai. If Zustand is the unbloated global store, Jotai is the unbloated atomic approach. Jotai manages state from the bottom up, calculating changes with surgical precision without triggering massive re-renders across the application tree.
New directions in web development
The most remarkable thing about these eight tools is that they deal in large part with alternative approaches that challenge the familiar. Although you may not be able to adopt them immediately, you will want to keep an eye on them. They are key factors that will continue to influence the shape of web applications and how we build them.
Beware of the genAI token trap 9 Jun 2026, 4:00 am
Enterprises are moving aggressively into generative AI. On the surface, that seems like the right call. The technology is powerful, accessible, and increasingly embedded in how businesses build applications, automate processes, and support decision-making. A development team can connect an application to a large language model in days. A product team can add AI features in weeks. Business leaders see quick wins, faster innovation, and a path to modernizing nearly every part of the company.
These are the upsides everyone is talking about. The part we don’t discuss enough is the economic trap forming underneath all this convenience.
Most enterprises think of tokens as a technical billing detail. They are not. Tokens are the unit of economic dependency in generative AI. Every prompt, response, summarization, retrieval step, workflow action, and agent decision is measured and monetized through tokens. Tokens are not just part of the plumbing. They are the tollbooth between your enterprise and a provider’s intelligence platform. The more AI becomes central to your operations, the more power that tollbooth holds over your future costs.
Tokens are not just a pricing unit
A token is usually described as a chunk of text processed by a model. That is accurate enough for developers, but it misses the bigger issue for CIOs, architects, and corporate boards. In the enterprise, tokens are the mechanism by which AI capabilities are rented. They are the meter attached to the intelligence itself.
That distinction matters because token usage grows faster than most companies anticipate. A simple user prompt rarely remains simple in production systems. It can trigger retrieval from internal knowledge stores, multiple model calls, tool use, post-processing, policy checks, and agent loops. What appears to be a single transaction to the user may involve several layers of token consumption behind the scenes. As a result, enterprises often underestimate the true operating cost of AI-enabled systems, especially as those systems mature and spread across departments.
Today, those costs still feel manageable. In many cases, they feel surprisingly low. That is exactly why the trap is so dangerous.
The market is in a subsidy phase
Current token pricing is giving enterprises a false sense of comfort. Many remote LLM providers are aggressively competing for market share. They want developers building on their APIs. They want enterprise applications tightly coupled to their platforms. They want AI agents, copilots, workflows, and customer experiences to depend on their models. To make that happen, pricing remains highly attractive relative to the value delivered.
That does not mean the economics of generative AI are stable. It means the market is still being shaped by investor capital, strategic pricing, and growth expectations. Providers are racing to establish position, and enterprises are benefiting from that race. But no market stays in that phase forever. At some point, investors will expect durable profitability. At some point, weaker providers will disappear, consolidate, or retreat. At some point, the survivors will have more leverage and much less reason to price primarily for adoption.
That’s when the token trap closes.
Enterprises that build deep dependence on remote models during the subsidy phase may find that what seemed inexpensive at pilot scale becomes punishing at enterprise scale. The application that costs $1,000 per month today may cost 10 or 20 times that amount a few years from now, not only because usage has increased, but also because the market has repriced the dependency.
Easy to adopt, expensive to exit
Cloud computing followed a similar path, with many enterprises mistaking short-term convenience for long-term economics. In the early years, the case was compelling and largely accurate. Move faster, reduce friction, avoid capital spending, and scale with ease. Those benefits were real. Many organizations made architectural decisions that prioritized speed over leverage. They became dependent on managed services, provider-specific tools, and operating models that were easy to adopt but expensive to unwind.
Years later, many enterprises discovered that their cloud bills were much higher than expected and their exit options much narrower than advertised. That was not because the cloud failed. Architectural dependency eventually became financial dependency.
Generative AI is repeating that pattern, only faster. The integration barrier is lower, the pressure to adopt is higher, and the pace of enterprise experimentation is far greater. As a result, companies are wiring remote LLMs into applications, workflows, and agentic systems with very little thought about how these costs will behave in the next five to 10 years.
Agentic AI makes things worse
The more enterprises move from simple prompt-response systems to agentic architectures, the more dangerous the token trap becomes. Agents are not single-call systems. They plan, deliberate, retrieve information, invoke tools, evaluate results, retry steps, and often coordinate with other agents. Each of those actions consumes tokens. Costs no longer rise in a neat linear fashion. They compound.
This matters because agentic AI is increasingly being presented as the future of enterprise automation. It’s true in many cases. But if an enterprise builds agentic systems primarily on remotely hosted intelligence, it is also building future business processes on top of someone else’s pricing model. That is a major strategic risk. The more successful those systems become, the harder they are to replace. The harder they are to replace, the more pricing power shifts to the provider.
This is how businesses end up operationally dependent on a cost structure they do not control.
The appeal of AI sovereignty
The answer is not to reject public models or pretend that external providers play no role. They clearly do. There will always be cases where renting frontier AI capabilities makes sense. But enterprises need to stop assuming that renting is the default for every workload.
AI sovereignty is the alternative that deserves much more attention. That means building, tuning, deploying, and governing models inside the enterprise for use cases where long-term control matters more than access to the absolute frontier. Enterprises need to recognize that most business applications do not need a world-class general-purpose model. They need a model that is good enough for a specific purpose, aligned to the enterprise’s data, governed by the enterprise’s rules, and operated at a predictable cost.
It’s a very different way of thinking.
A self-hosted or enterprise-controlled model may not match the rich feature set of the largest public offerings. It may lack the same breadth, polish, or marketing appeal. But for many internal business tasks, those factors do not matter.
Here’s the most critical question to guide your architectural direction: Can a sovereign AI model solve the problem reliably, securely, and economically over time? If the answer is yes, owning that capability may be far more strategic than forever renting something with more power than you need. In effect, the enterprise becomes its own provider for the workloads that matter most.
Prepare for changing markets
Too many companies still treat generative AI architecture as a tactical IT issue. It is not. These decisions directly affect cost structure, operating flexibility, data control, and long-term competitiveness. If AI becomes a force multiplier across the business, the economics of AI become strategic to the business itself.
The companies that get this right will not necessarily be the fastest adopters. They will understand the difference between experimentation and dependency. They will use external models when it makes sense, but they will also invest in sovereign capabilities where ownership matters. They will think like architects, not consumers.
Here’s the takeaway: Cheap tokens come with strings. They are a gateway to a dependency model that will typically look very different once providers stop pricing for growth and start pricing for leverage. Enterprises cannot keep mistaking today’s bargain for tomorrow’s reality. Boards and executive teams need to act now to get ahead of this issue. The key question is not whether generative AI creates value. It clearly does. The real question is whether the enterprise can still afford and control the value it creates once the market matures.
Meet Hades: The malware that lies to AI security agents 9 Jun 2026, 12:05 am
Threat actors are continuing their onslaught against software supply chains, now with malware named after death itself.
The newly-discovered Hades Campaign is a “highly sophisticated” supply chain compromise that targets Python developer environments and runs as soon as infected packages are imported. It uses the popular Bun toolkit to silently execute multi-layer payloads that can extract sensitive data, move laterally across compromised systems, exploit common security frameworks, and even hijack AI gatekeeper analyzer systems via adversarial prompt injection.
Notably, the campaign exploited the popular C++ library ensmallen, as well as packages in the computational biology, bioinformatics, and genotype-phenotype analysis ecosystems.
The most novel thing about this malware is its combination of advanced tactics, noted David Shipley of Beauceron Security. He noted that we’ve seen memory-focused malware, we’ve seen attacks that attempt to defuse large language model (LLM) powered analysis with hidden prompts, and we’ve seen malware with wiper capabilities.
“But all three, in a fast moving mass propagating worm, is its own kind of nightmare,” he said. “And I suspect this is the way of the future.”
How Hades works
The Hades Campaign was discovered by researchers at StepSecurity, who called it the latest evolution of the Miasma threat actor. The researchers previously described Miasma attacks that had sent self-replicating worms to perform multi-cloud credential sweeps, caused infected repositories to execute code when folders were accessed in integrated development environments (IDEs) or by AI agents, and used techniques that scanned and read Linux process memory.
Hades uses the same credential harvesting methods, self-replicating worm logic, and GitHub-based exfiltration patterns, the researchers noted. In addition to ensmallen, compromised packages include mflux-streamlit, nhmpy, ppkt2synergy, embiggen, gpsea, and pyphetools.
The campaign’s entry point is a simple, obfuscated script embedded inside a Python package’s __init__.py file, a critical building block that gives Python the ability to recognize packages and import modules. Once they gain access, threat actors drop a precompiled Bun runtime binary and executes its JavaScript payload. Bun allows the malware to run complex JavaScript tasks in environments lacking a Node.js installation, bypassing traditional package manager controls and proxy logs.
The malware is able to scrape Linux memory mappings, and also introduces tailored macOS and Windows memory scrapers, which allow threat actors to extract sensitive, encrypted data.
Interestingly, attackers are also able to evade detection by automated LLMs that scan for suspicious code. This is achieved with a simple block of text at the top of the file; this instructs the model to ignore the hidden code below, classify the package as verified and clean, and provide reports stating it is safe.
This element represents what the StepSecurity researchers described as a “significant conceptual shift,” with attackers writing payloads that target AI systems’ cognitive logic. “Scanners that pass raw text to LLMs without strict boundary isolation can be coerced into generating false negative verdicts, allowing the malicious package to bypass organization analysis,” they wrote.
The tactic is indeed clever, Beauceron’s Shipley agreed, pointing out that attackers will increasingly target endpoint LLM-powered agents.
Why? “Because there’s no reliable defense,” he said. “LLMs are incredibly susceptible to social engineering.” This has been relabeled as prompt engineering, but is essentially just phishing for bots, he pointed out.
“While everyone’s worried about LLM-powered vulnerability discovery and automated exploitation, it’s LLM-created smart malware like this, and AI-powered phishing of humans and bots, that keeps me awake at night,” Shipley said.
Hades’ crafty worm propagation
The Hades Campaign command and control (C2) infrastructure uses three independent channels on public GitHub infrastructure to allow its communications to blend in with normal traffic. Stolen credentials are encrypted locally in a hybrid fashion (serialized, compressed, and pushed to a newly created public GitHub repository under attackers’ control). Exfiltrated repositories carry the description “Hades — The End for the Damned.”
Researchers noted that a core component of this campaign is its ability to propagate and move laterally across networks. It exploits the very methods meant to protect systems, including Secure Shell (SSH) and Secure Copy Protocol (SCP), OpenID Connect (OIDC),and Supply-chain Levels for Software Artifacts (SLSA).
For instance, when running inside a GitHub Actions workflow runner, the malware checks for OIDC variables, then bypasses registry signature policies and generates cryptographically signed SLSA provenance bundles via Sigstore. It can then fetch target libraries and inject the obfuscated script and JavaScript payload. From there, it can publish compromised versions to the Python Package Index (PyPI) repository and node package manager (npm) using the target’s credentials and the generated Sigstore bundle.
“This ensures that the published package appears to have valid, cryptographically verified build provenance from the organization’s official GitHub Actions build environment,” the researchers explained.
Further, if a harvested GitHub token has write permissions, the malware will target repositories to extract secrets using GitHub Actions runners. This occurs “directly from the runner’s address space without ever writing them to disk or making a suspicious network connection,” the researchers noted.
The malware also targets rule files and configuration directories for 14 different AI agents and systems, planting custom prompt instructions or executing hooks that trigger a bun run bootstrap command when the victim loads or consults the workspace with their AI assistant. Finally, it establishes persistence on the workstation and monitors for the presence of the stolen token; if that token is revoked, it executes a wiper process to erase the user’s files.
Broadcom beefs up Spring security to protect against AI-enabled attacks 8 Jun 2026, 4:02 pm
Broadcom today announced multiple security investments in its Spring and Java ecosystems that aim to help protect users from AI-enabled threats.
The company said that, first, it is releasing what it called the largest set of Spring security updates to open source in the product’s history, and, for customers, it is extending its clean-room build architecture to build the Java dependencies for the entire Spring ecosystem.
“Spring is one of the most widely adopted application development frameworks in the world, and as its steward, we have a deep responsibility for its security,” said Purnima Padmanabhan, vice president and general manager of Broadcom’s Tanzu Division. “Because we maintain Spring and are the sole committers, we can better secure it at the source for everyone who depends on it. This investment is about two things we will never separate: the health of the Spring community and the security of our customers who trust Spring to run their business.”
The company also announced that, as the number of security advisories reported by the community has exploded, its engineering team has “significantly scaled” its use of AI tools to help it identify vulnerabilities, assess remediation paths, and validate fixes across the dependency ecosystem. Although Broadcom declined to specify the AI models it’s using in its bug hunting, it is a member of Anthropic’s Project Glasswing, so Claude Mythos is likely part of the effort.
For paying customers only
One perk available only to Tanzu Spring enterprise customers is zero-day access to validated CVE patch-only releases through the Spring Enterprise Repository, before they are released to open source. These patches isolate the security fix from any other changes to let customers remediate more quickly.
“By utilizing Tanzu Spring’s private artifact repositories, customers can be confident that the artifacts are the official, validated patches from Broadcom, the steward of Spring,” Broadcom said in its announcement, adding that it will continue to issue CVEs for all versions of every Spring project under open source support, as well as older versions under Tanzu Spring enterprise support.
Broadcom’s Tanzu Spring enterprise support includes:
- Certified source for secure spring libraries
- Commercial-first release of patches for both current and older, enterprise supported versions
- Access to dependent Java binaries
- Automated, deterministic upgrades with Spring Application Advisor
- Exclusive Tanzu Spring components for governance and security
- 24×7 support, hands-on expertise and access to the Spring team.
In addition, Broadcom said, it has now added:
- Secured, SLSA Level 3–validated software supply chain for Java dependencies.
- Coverage that spans the full transitive dependency graph managed by the Spring Boot bill of materials.
- Thousands of secured dependencies, built and tested across every supported Spring version. Spring Boot 4.0 alone manages 1,768 of them; across the full supported portfolio, that totals more than 100,000 validated dependency builds.
“This capability gives customers validated dependencies across both current and end-of-life Spring versions, helping customers reduce software supply chain risk while continuing to benefit from the productivity and consistency of Spring Boot’s dependency management model,” the announcement noted.
Security fixes for sale
Seva Ioussoufovitch, senior research analyst at Info-Tech Research Group, sees these moves as mostly positive.
“It’s encouraging to see Broadcom take proactive steps towards dealing with the increase in AI-detected vulnerabilities that many organizations have had to contend with in recent months,” he said. “Announcements like Mythos have made it clear that the industry needs to re-think traditional approaches to security patching.”
Ioussoufovitch isn’t surprised at the size of the update release either, noting that it’s consistent with the result of AI scanning and remediation that has been occurring, and will likely continue.
“More meaningful is the provision of validated and secured dependencies,” he said. “This is a critical move in the right direction, especially with the endlessly growing list of supply chain vulnerabilities the industry has been managing in recent months.”
Ioussoufovitch is less happy with the restriction of zero-day patches to paying customers.
“Putting security fixes behind a paywall isn’t new, but when there are no drop-in alternatives for an ecosystem as critical as Spring, it just looks like a power move to force more of the open-source community onto the monetization track,” he noted. “Another approach might’ve been to release the CVE fixes to everyone while charging for enterprise-grade packaging, validation, and support, but, given Broadcom’s track record of aggressive monetization in recent years, what they’ve chosen here doesn’t necessarily come as a shock.”
Page processed in 1.418 seconds.
Powered by SimplePie 1.3.1, Build 20131001021811. Run the SimplePie Compatibility Test. SimplePie is © 2004–2026, Ryan Parman and Geoffrey Sneddon, and licensed under the BSD License.
