AI Changes Software Engineering, Part 2: AI Exposes Enterprise Fault Lines
AI often gets blamed for problems it did not create. In practice, it stress-tests data architecture, ownership, governance, requirements, testing, and observability.
Answer summary
AI systems expose unclear ownership, inconsistent metadata, weak lineage, fragmented governance, and poor discoverability because they consume enterprise information continuously. The result is a software lifecycle where requirements become more experimental, architecture becomes more important, testing becomes evaluation, observability must explain behaviour, and governance moves into runtime.
Key takeaways
- AI often exposes pre-existing enterprise weaknesses rather than creating them.
- Requirements become more experimental because implementation feedback arrives faster.
- Architecture becomes more important because architectural mistakes can spread faster.
- Testing increasingly becomes evaluation of probabilistic behaviour.
- Governance moves from pre-release approval toward runtime execution and evidence.
Series navigation: Part 1 - The Bottleneck Has Moved · Part 2 of 3 · Next: Part 3 - The New Operating Model
Continued from Part 1.
AI exposes problems it does not create
Across the AI adoption programs I have observed over the last two years, one pattern recurs more reliably than almost any other: AI is frequently blamed for problems it did not create. The mechanism is usually mundane. A retrieval assistant returns inconsistent answers, and the conclusion drawn in the room is that the model is unreliable. A governance review discovers that nobody can confidently explain which datasets are feeding a copilot, and the matter is logged as an AI governance concern. An agent struggles to perform a task consistently, and the discussion turns to whether the model is mature enough for production. Occasionally the AI system genuinely is at fault. More often, in my experience, it is behaving as a stress test for an enterprise environment that was already carrying unresolved structural debt.
The weaknesses that surface tend to be familiar to anyone who has spent time inside large organizations: unclear ownership, inconsistent metadata, weak lineage, fragmented governance, duplicated business definitions, undocumented dependencies, and poor discoverability. None of these problems are new. Many of them were named explicitly years ago. The metadata catalog initiatives that spread through large enterprises toward the end of the last decade were, in part, an attempt to address exactly this class of weakness, on the assumption that improved discoverability would pull governance maturity along behind it. Discoverability usually improved. Operational accountability moved more slowly, and the distance between the two is roughly the distance that AI systems now expose.
These weaknesses persisted comfortably before AI arrived, largely because nothing in the environment consumed enterprise knowledge aggressively enough to surface them. AI systems do exactly that. They read across domains, systems, repositories, and organizational boundaries continuously, and in doing so they convert tolerable inconsistency into visible operational friction.
The retrieval case is the clearest illustration. During a demonstration, a retrieval assistant often performs impressively, because the scope is narrow and the underlying content has been curated by people who care about the outcome. The difficulty appears later, during broader rollout, when the system begins drawing on knowledge that no one curated for this particular purpose. Different departments turn out to use different definitions for the same term. Documentation quality varies by team and, frankly, by decade. Ownership is ambiguous. Some content is obsolete, and some should never have been retrievable at all. The assistant becomes, in effect, the first consumer that forces the organization to confront these inconsistencies systematically rather than incident by incident.
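The inconsistencies described above can be surfaced before rollout rather than by the assistant. What follows is a minimal sketch, not a description of any particular catalog product: the `Doc` fields, the `audit_corpus` function, and the sample records are all hypothetical, and the check for conflicting definitions is a deliberately naive string comparison.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical metadata record for one retrievable document."""
    doc_id: str
    term: str              # business term the document defines
    definition: str
    owner: Optional[str]   # None means nobody is accountable for the content
    obsolete: bool = False


def audit_corpus(docs: List[Doc]) -> Dict[str, List[str]]:
    """Report unowned documents, obsolete documents, and terms with conflicting definitions."""
    unowned = [d.doc_id for d in docs if d.owner is None]
    obsolete = [d.doc_id for d in docs if d.obsolete]

    # Collect the distinct live definitions per term (naive normalisation only).
    definitions = defaultdict(set)
    for d in docs:
        if not d.obsolete:
            definitions[d.term].add(d.definition.strip().lower())
    conflicting = sorted(term for term, defs in definitions.items() if len(defs) > 1)

    return {"unowned": unowned, "obsolete": obsolete, "conflicting_terms": conflicting}


# Illustrative data only.
corpus = [
    Doc("fin-001", "active customer", "Purchased in the last 12 months", "finance"),
    Doc("mkt-007", "active customer", "Logged in during the last 30 days", "marketing"),
    Doc("ops-113", "churn", "No purchase for 18 months", None),
    Doc("old-042", "churn", "Account closed", "sales", obsolete=True),
]
report = audit_corpus(corpus)
print(report)
```

In this sample, the two departments' definitions of "active customer" are flagged as conflicting, one document has no owner, and one is obsolete. These are the same three findings an assistant would otherwise surface as inconsistent answers.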
This is one reason I increasingly treat enterprise AI adoption as a data architecture problem rather than a model problem. The models are improving quickly while the underlying enterprise data landscape improves much more slowly, which means the gap between what these systems can do and what the surrounding architecture can defensibly support is widening rather than closing.
Requirements become more experimental
Requirements engineering is changing in ways that are easy to understate. For most of the history of enterprise software, projects followed a broadly linear path, at least in intention: teams worked to understand a problem, documented requirements, designed a solution, and then implemented it. The reality was always messier than the methodology diagrams suggested, but the sequence was familiar enough to plan around.
AI disturbs this sequence because exploration has become cheap. A business stakeholder can describe a workflow in the morning and review a working prototype the same afternoon. Product teams can evaluate several approaches before committing meaningful engineering effort, and experiments that once required weeks of implementation can now be examined in hours. This is a genuine improvement, and I do not want to diminish it.
The complication is that cheap implementation exposes a cost that expensive implementation used to hide. Many requirements were never as clear as they appeared; the ambiguity simply remained dormant, because the effort of building something consumed everyone's attention long before anyone discovered that the original assumptions were incomplete. AI shortens that feedback loop considerably, and the result is that requirements become more experimental and iterative whether or not the organization intended to work that way.
At the same time, an entirely new class of requirement appears, and it tends to be the harder class. Teams now have to decide what level of hallucination risk is acceptable for a given workflow, when a human must review a result before it takes effect, what degree of explainability is required, how confidence should be communicated to the person relying on the output, which datasets may legitimately participate in retrieval, and which decisions an agent may take without supervision. Several governance teams I have worked with reached the same realization independently: generating answers was rarely the difficult part, and defining acceptable behaviour was. That distinction matters, because it locates the real work not in solution discovery, which AI accelerates, but in the specification of acceptable behaviour, which AI does little to resolve.
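One way to make this class of requirement concrete is to write it down as data rather than prose. The sketch below is only an illustration under stated assumptions: `BehaviourSpec`, its field names, the thresholds, and the `claims-triage` workflow are hypothetical and do not come from any standard or framework.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class BehaviourSpec:
    """Hypothetical per-workflow statement of acceptable behaviour."""
    workflow: str
    max_unsupported_claim_rate: float      # tolerated hallucination risk; checked by offline evaluation, not here
    min_confidence_without_review: float   # below this, a human reviews the result
    allowed_datasets: FrozenSet[str]       # datasets that may participate in retrieval
    autonomous_actions: FrozenSet[str]     # actions an agent may take unsupervised


def escalation_reasons(spec: BehaviourSpec, action: str, confidence: float,
                       datasets_used: Iterable[str]) -> List[str]:
    """Return every reason a result must go to a human; an empty list means it may proceed."""
    reasons = []
    if action not in spec.autonomous_actions:
        reasons.append(f"action '{action}' requires human approval")
    if confidence < spec.min_confidence_without_review:
        reasons.append("confidence below review threshold")
    outside = sorted(set(datasets_used) - spec.allowed_datasets)
    if outside:
        reasons.append(f"datasets not approved for retrieval: {outside}")
    return reasons


# Illustrative values only.
claims_spec = BehaviourSpec(
    workflow="claims-triage",
    max_unsupported_claim_rate=0.02,
    min_confidence_without_review=0.85,
    allowed_datasets=frozenset({"policy_docs", "claims_history"}),
    autonomous_actions=frozenset({"draft_reply", "classify_claim"}),
)
ok = escalation_reasons(claims_spec, "classify_claim", 0.93, ["policy_docs"])
flagged = escalation_reasons(claims_spec, "approve_payout", 0.70, ["policy_docs", "hr_records"])
print(ok, flagged)
```

The code is trivial. What it shows is that every field is a decision someone has to own, and none of those decisions can be delegated to the model.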
Architecture becomes more important
A common assumption surrounding AI-assisted development is that architecture matters less once implementation becomes easy. My experience points the other way, and the reasoning is largely economic. When building software is expensive, architectural mistakes propagate slowly, because an organization simply cannot create that many systems and therefore cannot create that many problematic ones. When building software becomes cheap, the same mistakes propagate quickly and across more surfaces.
Several engineering organizations have described exactly this progression to me. AI-assisted development made it easier to produce services, automations, integrations, retrieval systems, agents, and internal tools, and the number of such systems grew accordingly. The importance of architectural consistency grew with it, because every shortcut now reproduces faster and in more places than it would have a few years ago.
AI also introduces architectural concerns that did not previously exist as first-class problems: retrieval architecture, vector infrastructure, prompt orchestration, model routing, evaluation pipelines, policy enforcement, runtime governance, observability layers, and identity propagation across all of them. These concerns sit above implementation. They are architecture problems in the most traditional sense, even though the vocabulary attached to them is new.
A platform architect I worked with recently summarized the situation in a way I have since repeated in several reviews: the industry has reduced the cost of writing software without reducing the cost of making architectural mistakes. That observation captures much of what is currently happening inside organizations that adopted these tools enthusiastically and are now living with the topology that resulted.
Testing changes more than coding
If coding is the most visible change AI introduces, testing is probably the most underestimated. Traditional software testing rests on an assumption of determinism: for a given input there is usually an expected output, and the test checks whether the system produces it. AI systems weaken that assumption. Outputs are probabilistic, behaviour varies with context, and quality depends on retrieval relevance and prompt structure at least as much as on the model itself.
The practical consequence is that testing increasingly becomes evaluation, and although that shift sounds subtle, it is substantial in operational terms. Teams now have to reason about hallucination rates, retrieval relevance, confidence levels, adversarial behaviour, policy compliance, response quality, and reasoning consistency, none of which reduce cleanly to a single pass-or-fail assertion. More than one organization that expected AI to reduce engineering effort discovered instead that testing effort rose, because the software became easier to generate while confidence in its behaviour became harder to establish.
This is why a growing share of AI engineering investment is flowing into evaluation frameworks, benchmark datasets, adversarial testing, model comparison pipelines, and automated quality assessment. The discipline of testing has not disappeared. What has changed is the object being tested, and that change is large enough to reorganize where engineering time is actually spent.
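The difference between a test and an evaluation can be shown in a few lines. This is a minimal sketch, assuming a stand-in `fake_model` that is right roughly nine times out of ten; the function names, the sample count, and the 0.8 threshold are hypothetical choices rather than recommendations.

```python
import random
from typing import Callable, Dict, List, Tuple

Model = Callable[[str, random.Random], str]


def fake_model(question: str, rng: random.Random) -> str:
    """Stand-in for a real model call: returns the right answer about 90% of the time."""
    return "4" if rng.random() < 0.9 else "5"


def pass_rate(model: Model, question: str, expected: str,
              samples: int, rng: random.Random) -> float:
    """Sample the model repeatedly and return the fraction of matching answers."""
    passed = sum(model(question, rng) == expected for _ in range(samples))
    return passed / samples


def evaluate(model: Model, cases: List[Tuple[str, str]], samples: int = 200,
             threshold: float = 0.8, seed: int = 0) -> Dict[str, Tuple[float, bool]]:
    """Map each question to (observed pass rate, whether it meets the threshold)."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    results = {}
    for question, expected in cases:
        rate = pass_rate(model, question, expected, samples, rng)
        results[question] = (rate, rate >= threshold)
    return results


results = evaluate(fake_model, [("What is 2 + 2?", "4")])
for question, (rate, ok) in results.items():
    print(f"{question!r}: pass rate {rate:.2f}, acceptable={ok}")
```

A deterministic test would assert a single output. Here the assertion is about a rate measured over many samples, which is why the threshold becomes a requirement in its own right and why the sample size becomes a cost.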
Observability moves from system health toward explanation
Observability evolved around a relatively clear operational question: is the system healthy? Organizations learned to monitor latency, throughput, availability, resource consumption, and error rates, and the discipline matured around those signals. AI adds a second question that is considerably harder to answer: why did the system behave the way it did?
Several platform teams have told me that they can already determine whether an AI system is operationally healthy, and that the genuinely difficult problem is explaining why a particular answer was produced: which retrieval path shaped it, which policy was evaluated, and which context influenced the outcome. Answering that requires visibility into a much larger surface than traditional telemetry ever covered: prompts, retrieval chains, model versions, token consumption, confidence signals, policy execution, evaluation outcomes, and agent interactions, correlated well enough to reconstruct a specific decision after the fact.
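In practice, reconstructing a decision means recording these signals under one correlation key at the moment the answer is produced. The sketch below assumes an in-memory store and a hypothetical `DecisionTrace` record; the field names and sample values are illustrative and do not follow any particular telemetry standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DecisionTrace:
    """Hypothetical record tying one answer to everything that shaped it."""
    trace_id: str
    prompt: str
    model_version: str
    retrieved_doc_ids: List[str] = field(default_factory=list)
    policy_checks: Dict[str, str] = field(default_factory=dict)  # policy name -> "allow" or "deny"
    total_tokens: int = 0
    answer: str = ""


class TraceStore:
    """In-memory stand-in for a real trace backend."""

    def __init__(self) -> None:
        self._traces: Dict[str, DecisionTrace] = {}

    def record(self, trace: DecisionTrace) -> None:
        self._traces[trace.trace_id] = trace

    def explain(self, trace_id: str) -> str:
        """Summarise which model, documents, and policies produced a given answer."""
        t = self._traces[trace_id]
        denied = sorted(name for name, outcome in t.policy_checks.items() if outcome == "deny")
        return (
            f"answer produced by {t.model_version} "
            f"from documents {t.retrieved_doc_ids}; "
            f"policies evaluated: {sorted(t.policy_checks)}; denied: {denied}"
        )


store = TraceStore()
store.record(DecisionTrace(
    trace_id="t-001",
    prompt="What is our refund policy?",
    model_version="assistant-v3",
    retrieved_doc_ids=["kb-12", "kb-40"],
    policy_checks={"pii_filter": "allow", "region_lock": "deny"},
    total_tokens=812,
    answer="Refunds are available within 30 days.",
))
explanation = store.explain("t-001")
print(explanation)
```

The design choice that matters is the single `trace_id`. Without it, prompts, retrieval results, and policy outcomes end up in separate logs that cannot be joined once someone asks about one specific answer.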
This is one reason observability is becoming a strategic capability rather than a purely operational one. As systems become more autonomous, the ability to reconstruct behaviour matters not only for debugging but for governance, and in regulated environments the second motivation tends to be the one that ultimately secures the budget.
Governance moves into runtime
The deepest shift AI introduces may be the way it relocates governance. Traditional governance operates largely before deployment. Systems are reviewed, documentation is approved, controls are validated, and applications are released, with the review itself occurring on a periodic cadence. That model works well enough when a system's behaviour is effectively fixed at release. It fits much less comfortably when the system continues making decisions long after deployment, which is precisely what AI systems do.
The governing question therefore migrates from whether something is acceptable to release toward how acceptable behaviour can be sustained while the system is running. Organizations responding to this find themselves building policy enforcement, access controls, retrieval restrictions, agent permissions, evidence generation, behavioural monitoring, and continuous evaluation into the runtime itself rather than into the approval process that precedes it.
Several banking and insurance organizations I have worked with arrived at a similar conclusion independently, and it surprised some of them: the hardest AI governance questions were not approval questions at all. They were operational ones: who can access what, under which conditions, how do we know, and can we prove it later? Those questions connect governance, architecture, and operations into a single concern, and I suspect the need to answer them continuously, with evidentiary continuity that survives an audit request months later, will become one of the defining characteristics of enterprise AI over the coming decade.
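Those four questions map quite directly onto code that runs at request time. The following is a sketch under stated assumptions, not a reference design: the role-based check, the document shape, and the function names are hypothetical, and the hash-chained log is one simple way to make evidence tamper-evident, not the only one.

```python
import hashlib
import json
from typing import Dict, Iterable, List

GENESIS = "0" * 64  # placeholder hash that precedes the first log entry


def _digest(entry: Dict) -> str:
    """Stable SHA-256 over the JSON form of an entry."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def authorize_retrieval(user: str, roles: Iterable[str], doc: Dict,
                        log: List[Dict], timestamp: str) -> bool:
    """Decide at runtime whether a document may be retrieved, and record the decision."""
    allowed = bool(set(roles) & set(doc["allowed_roles"]))
    entry = {
        "timestamp": timestamp,
        "user": user,
        "doc_id": doc["doc_id"],
        "decision": "allow" if allowed else "deny",
        "prev_hash": log[-1]["hash"] if log else GENESIS,  # chain to the previous entry
    }
    log.append({**entry, "hash": _digest(entry)})
    return allowed


def verify_log(log: List[Dict]) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True


# Illustrative data only.
evidence: List[Dict] = []
doc = {"doc_id": "claims-2031", "allowed_roles": ["claims_handler"]}
first = authorize_retrieval("alice", ["claims_handler"], doc, evidence, "2025-01-01T09:00:00Z")
second = authorize_retrieval("bob", ["marketing"], doc, evidence, "2025-01-01T09:05:00Z")
print(first, second, verify_log(evidence))
```

The access check answers who can access what and under which conditions. The log entry answers how we know, and the chain verification is what lets someone prove it months later.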
What hasn't changed
Discussions about AI tend to exaggerate, partly because they concentrate almost entirely on what changes. A fair amount has remained stable, and the stable parts are worth stating plainly.
Over the last two years I have sat in discussions across banks, insurers, delivery organizations, and engineering teams whose technologies and structures differed considerably. The difficult conversations kept returning to familiar territory. Teams still had to decide which problems deserved investment. Stakeholders still disagreed about priorities. Platform groups still negotiated ownership boundaries. Engineering leaders still balanced speed, cost, maintainability, and risk against one another. In regulated environments, accountability remained particularly stubborn, because someone still had to explain why a decision was made, who approved it, and who would carry responsibility if it went wrong. The models were new; the management and engineering questions surrounding them looked much as they had for years.
This is part of why I have grown skeptical of claims that AI fundamentally remakes software engineering. It clearly reshapes parts of it. What it does not remove is the need for judgment, and if anything judgment becomes more important, because software artifacts are now cheaper to produce while the consequences of poor decisions remain roughly what they always were.
Next in the series: Part 3 - The New Operating Model.
Author
Géza Kuti is a senior Data and AI executive based in Bülach (ZH), Switzerland, focused on data strategy, enterprise architecture, AI governance, hybrid cloud, and regulated delivery.