Artificial Intelligence (AI) is high on the agenda for many organizations. Many have started to dip their toes in the water and build AI solutions themselves, which has resulted in a large number of sandboxes, POCs, pilots, hackathons and other experimental efforts. That is a positive and necessary development for beginning to harness the qualities of AI, but it is not sufficient. It may also explain why a recent research project I am working on, about AI in practice, indicates that organizations do not yet see much value in AI.
It is no surprise that a new technology paradigm requires adjustments to how we architect solutions; the same was the case with desktop and cloud computing. But the focus so far has been on developing the core AI models, not on how they are applied in the real world, which is where the value is. For AI solutions to bring significant value in the real world they need to be robust, scalable and production grade. While AI is in a sense just another technology, it has not proven as simple to bring to production as traditional technologies. Concerns about the transparency and ethics of AI, model drift and unpredictability complicate deployment. In addition, specific technical skills and governance are often warranted. This is why most AI solutions never get out of beta mode, and when they do, it is often only nominally. The question is therefore how best to architect robust, production grade, scalable and sustainable AI solutions.
For the past 15 years I have worked on developing and implementing AI solutions in a number of different industries and settings. Based on these experiences it is possible to distinguish seven principles that help make AI solutions a success.
Manage the data ecosystem
Data is not an unknown requirement in traditional software development: when developing regular software solutions you also need data to test the functionality. But for AI, data is most often an integral part of the functionality itself. A computer vision system, for example, needs to be trained on a training set with a specific format and metadata. Preparing and designing this is an important aspect of AI development, and one that is often overlooked or insufficiently prioritized. Another data aspect is designing the data pipeline that will feed the AI solution. For AI in particular, garbage in, garbage out applies: even with the most perfect AI model, the result can be disastrous if the data is not handled with sufficient care.
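To make the pipeline point concrete, here is a minimal sketch in Python of a validation gate that checks incoming records against an expected schema before they reach the model. The field names, types and label set are hypothetical, and a production pipeline would typically use a dedicated validation framework rather than hand-rolled checks.

```python
# Minimal sketch: guard the model behind a schema check so bad data
# is rejected before it can corrupt training or inference.
# Field names, types and the label set are hypothetical examples.

EXPECTED_SCHEMA = {
    "image_id": str,
    "width": int,
    "height": int,
    "label": str,
}

VALID_LABELS = {"cat", "dog", "other"}  # hypothetical label set

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    if record.get("label") not in VALID_LABELS:
        problems.append(f"unknown label: {record.get('label')!r}")
    return problems

def clean_records(records):
    """Yield only valid records; log and drop the rest rather than failing silently."""
    for record in records:
        problems = validate_record(record)
        if problems:
            print(f"rejected {record.get('image_id')}: {problems}")
        else:
            yield record
```

The point is not the specific checks but that data quality is enforced as an explicit, testable component of the solution rather than assumed.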
Design for transparency
Transparency is not always important, but particularly when it comes to solutions that impact humans, a degree of transparency is frequently warranted. When I worked for the City of New York we helped lawmakers qualify a new law for algorithmic transparency, which became Local Law 49 of 2017. The purpose of this law was to bring transparency to automated decision systems that impact residents. Certain types of algorithms are black boxes and are not transparent, but it is possible to design around this. For example, in a system for release recommendations, AI can be used to identify the critical factors that lead to recidivism. These factors can then be used for scoring in the context of a probation hearing, so the judge gets a transparent recommendation. A deep learning or ChatGPT-like system used by the judge to get a recommendation might be more precise, but it would not evoke the same degree of trust.
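The two-stage pattern described above can be sketched as follows: a model identifies a handful of factors offline, and the production system is a plain, auditable scoring function over those factors. The factor names and weights below are invented for illustration; in practice they would come from the model and be reviewed by domain experts.

```python
# Minimal sketch of a transparent scoring function. The factors and
# weights are hypothetical; in the pattern described in the text they
# would be derived offline by a model and vetted by domain experts.

FACTOR_WEIGHTS = {
    "prior_violations": 3.0,
    "completed_program": -2.0,
    "stable_housing": -1.5,
}

def transparent_score(case: dict):
    """Return a score plus a line-by-line explanation of how it was reached."""
    score = 0.0
    explanation = []
    for factor, weight in FACTOR_WEIGHTS.items():
        value = case.get(factor, 0)
        contribution = weight * value
        score += contribution
        explanation.append(f"{factor}={value} contributes {contribution:+.1f}")
    return score, explanation

score, why = transparent_score(
    {"prior_violations": 2, "completed_program": 1, "stable_housing": 1}
)
print(score)           # 2.5
print("\n".join(why))  # every contribution is visible to the decision maker
```

Because every contribution is listed, the recommendation can be explained and challenged in the hearing, which a black-box model cannot offer.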
Plan for continuous evolution
In a conventional software development project the sequence would be to design, develop, test and deploy. After that, the solution is expected to continue behaving as designed and tested until changes are made. For AI, that is typically not the case. Model drift will slowly degrade the solution, and changes in the input data may also undermine its effectiveness. For AI solutions it is necessary to plan for continuous evolution, for example through retraining or continuous adjustment of the model. That means plans must be made for securing the data that will help the solution evolve continuously.
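One way to operationalize continuous evolution, sketched below under assumed thresholds, is to compare live accuracy on labeled production samples against the accuracy measured at deployment, and trigger retraining when it degrades beyond a tolerance. The baseline, tolerance and retraining hook are all hypothetical.

```python
# Minimal sketch of a retraining trigger. Baseline and tolerance are
# assumed example values; a real system would use richer drift metrics
# and an orchestration tool rather than a bare callback.

BASELINE_ACCURACY = 0.92   # accuracy measured at deployment (assumed)
TOLERANCE = 0.05           # acceptable degradation before retraining (assumed)

def live_accuracy(predictions, labels) -> float:
    """Accuracy on a labeled sample of recent production traffic."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def maybe_retrain(predictions, labels, retrain):
    """Invoke the retraining hook when accuracy drifts below the floor."""
    accuracy = live_accuracy(predictions, labels)
    if accuracy < BASELINE_ACCURACY - TOLERANCE:
        print(f"accuracy {accuracy:.2f} below floor, retraining")
        retrain()
    else:
        print(f"accuracy {accuracy:.2f} within tolerance")
```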
Keep a human in the loop
Even when it is technically feasible and functionally superior to use complete AI automation, there may be other reasons not to do it. For decades it has been possible to operate planes without pilots, yet passengers prefer to have a pilot in charge, at least for takeoff and landing. The situation is similar with autonomous driving, which is technically feasible and perhaps statistically safer, but resistance remains because accidents will still happen, and when they do, the entire solution takes the blame. The reason is that even though AI can be trained on the vast majority of possible situations, new and unique situations will occur that it does not have a good response for. It is therefore generally a good idea to make sure a human is in the loop somewhere to keep the AI in check. This can take the form of reviewing output, being ready to take over from the autopilot, or selecting between different AI-generated options.
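A simple way to keep a human in the loop is a confidence gate: the model acts automatically only when it is sufficiently sure, and routes everything else to a review queue. The sketch below assumes a hypothetical threshold and case identifiers.

```python
# Minimal sketch of a human-in-the-loop gate: predictions the model is
# unsure about go to a review queue instead of being acted on
# automatically. The confidence threshold is an assumed example value.

CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff for automatic action

human_review_queue = []

def handle_prediction(item_id: str, label: str, confidence: float):
    """Act automatically only when the model is confident enough."""
    if confidence >= CONFIDENCE_THRESHOLD:
        print(f"{item_id}: auto-applied label {label!r} ({confidence:.2f})")
    else:
        human_review_queue.append((item_id, label, confidence))
        print(f"{item_id}: queued for human review ({confidence:.2f})")

handle_prediction("case-001", "approve", 0.97)  # handled automatically
handle_prediction("case-002", "deny", 0.61)     # escalated to a human
```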
Test is monitoring
In the traditional software development life cycle, testing is done at the end; after that, the solution is deployed and the work is done. But AI systems do not behave like traditional software systems. When you develop an HR or CRM system it will continue to behave predictably indefinitely, whereas an AI system may drift and start exhibiting different behavior after a while. This makes traditional pre-launch tests insufficient: system behavior has to be monitored in production in a manner similar to how one would approach testing with functional assessments. It is necessary to think of testing not as a point-in-time activity but as something continuous. Audits may also be necessary to continuously verify that the system is on track and behaving within expected limits.
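As an illustration of continuous monitoring, the sketch below uses the Population Stability Index (PSI), a common drift measure, to compare the distribution of a live input feature against its distribution at training time. The bucket fractions and alert threshold are assumed example values.

```python
# Minimal sketch of drift monitoring with the Population Stability
# Index (PSI). Bucket fractions are assumed example values; the 0.2
# alert cutoff is a commonly cited rule of thumb.

import math

def psi(expected, actual, eps=1e-6):
    """PSI over pre-bucketed fractions; higher means more drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0)
        total += (a - e) * math.log(a / e)
    return total

training_dist = [0.25, 0.50, 0.25]  # feature distribution at training time (assumed)
live_dist = [0.10, 0.45, 0.45]      # distribution observed in production (assumed)

drift = psi(training_dist, live_dist)
print(f"PSI = {drift:.3f}")
if drift > 0.2:
    print("significant drift: investigate and consider retraining")
```

Run continuously over production traffic, a check like this turns testing from a launch gate into an ongoing activity.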
Leverage existing capabilities
When developing a pilot it is common to allow any type of technical development in order to produce a result. This poses a problem when an AI solution has to be made ready for production. To make the solution scalable it may have to be completely re-architected, and the preferred technologies of the organization need to be employed. Otherwise there is a risk of cultivating a parallel AI IT estate with its own technological patterns and agendas. Although AI is special, it is not that special. It still needs standard technical capabilities like scheduling, data storage and messaging, and these should be employed when moving a POC to production, because only that way can robustness and scalability be ensured. For traditional software, code versioning is usually sufficient to reconstitute the solution after a breakdown or migration, but for AI, checking the code into a repository is not enough, since the model and training data are integral to the solution. Redeploying an AI solution may require additional backups of models and data.
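A lightweight way to make an AI solution reconstructible is to version the model, a fingerprint of the training data, and the code revision together in a single release manifest. The paths and manifest layout below are hypothetical.

```python
# Minimal sketch of versioning a deployable AI artifact as one unit:
# model file, training-data fingerprint, and code revision together.
# File paths and the manifest layout are hypothetical examples.

import hashlib
import json
import pathlib

def fingerprint(path: str) -> str:
    """Content hash of a file so the exact artifact can be verified later."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def write_manifest(model_path: str, data_path: str, code_revision: str):
    """Record everything needed to redeploy the solution after a breakdown."""
    manifest = {
        "model": model_path,
        "model_sha256": fingerprint(model_path),
        "training_data": data_path,
        "training_data_sha256": fingerprint(data_path),
        "code_revision": code_revision,  # e.g. a git commit hash
    }
    pathlib.Path("release_manifest.json").write_text(json.dumps(manifest, indent=2))
```

With the manifest stored alongside the code, the solution can be redeployed or audited even after a migration.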
Understand human psychology
Don’t let tech specialists decide what is appropriate. As the debate around AI has shown, perceptions may differ significantly across the tech divide, with tech savvy engineers at one end and disenfranchised blue collar workers at the other. In addition, AI solutions exhibit curious patterns such as what I have called the technology optimization paradox: a solution may be perceived increasingly favorably as it improves, but only up to a point, after which it is viewed increasingly unfavorably as it improves further. This is what is behind the uncanny valley phenomenon, where a robot is viewed favorably as long as it does not get too close to being human. There are other quirks of human nature that can quickly erode the efficiency gains of an AI system and undermine its legitimacy altogether, so effort has to be made to understand any such psychological barriers.
Architecting for AI success
We are now at a point where the next step is to bring the experiments with AI to life, and that means thinking about how to architect production grade solutions. That is a completely different exercise from building a POC. AI poses challenges that can be managed, although the mitigations do not come naturally when working in a traditional software development framework. This creates pressure to revisit the operating model, the architectural documentation and governance processes, and in particular to rethink testing and monitoring from a risk-based approach. We have looked at seven principles that can help shape the focus necessary to start implementing robust, production grade AI solutions at scale and begin harnessing the value that AI promises. It is time to get started: dust off all the excellent experiments in the closet and dress them for success.