One of the bugbears in current AI is the inability of systems to explain their results, whether the task at hand is the generation of text, the classification of an image, or something else. Yet natural systems – humans – also have trouble providing explanations. Usually, a person would be expected to provide some logical justification for their answer: the reason this is a duck is that it looks like a duck, swims like a duck and quacks like a duck.
Yet such explanations require an understanding of the world, shared between the explainer and the explainee (the person to whom the explanation is being given). Here, this is an understanding of what a duck is. While humans sometimes follow chains of logic in providing answers to questions, the reality is that these are often post hoc, and may bear little resemblance to the actual mechanism of the decision. On the other hand, some decisions (such as those of a court of law, or decisions about providing credit to people or businesses) must be explainable, and the explanations must use the legal framework, or an understanding of the risks associated with providing credit. Sometimes there can be dubiety in these chains of reasoning (the law is not designed to be unambiguous, and one can argue that providing credit to a risky but possibly very profitable business proposition1 would be appropriate).
What we do not expect is a justification based on actual brain activity, for example which neurons or sets of neurons are firing, or which parts of the brain are most active, or which neurotransmitters are prevalent in different parts of the brain. While one can argue that this could constitute a basis for an explanation, finding this information is problematic and highly invasive.
Turning to artificial systems, the problem is worse. These systems are not normally set up to provide any form of explanation: they have been trained on vast volumes of digitised information (whether images, speech or encoded text), and their architecture enables them to pick out complex statistical relationships within this dataset. Textual systems use attention to pick out the important tokens (words or phrases), and multi-head systems can use a number of different sets of attended-to items. Image-based systems are trained on pixel-based images, whether static or moving, but the training data is necessarily sparse in the image space2, so that the system will be asked to classify images that lie outside the (convex hull of the) training data.
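As a rough illustration of what "attention" means mechanically, here is a minimal numpy sketch of multi-head scaled dot-product attention. It is not the architecture of any particular system mentioned above; the sequence length, dimensions and number of heads are arbitrary choices, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    """Toy scaled dot-product attention over a sequence of token vectors X.

    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_model).
    Each head works on a different projection of the same tokens, so each
    head can 'pick out' a different set of attended-to items.
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs, weights = [], []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)  # (seq_len, seq_len)
        attn = softmax(scores, axis=-1)                  # how much each token attends to each other token
        outputs.append(attn @ V[:, s])
        weights.append(attn)
    return np.concatenate(outputs, axis=-1), weights

# Toy usage: 5 "tokens" of dimension 8, 2 heads, random (untrained) weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn_per_head = multi_head_attention(X, Wq, Wk, Wv, n_heads=2)
print(attn_per_head[0].round(2))  # head 0's attention pattern over the 5 tokens
```

The per-head attention matrices are the nearest thing such a system has to an account of which inputs mattered, but, as argued below, that is still a long way from an explanation.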
Unless the system has been trained to provide explanations, attempts to provide explanations are necessarily limited to interpreting the internal states of the system. This is akin to examining the neurons of a real brain: less invasive in this case, but still difficult. Further, if two deep systems (like transformers) are trained from different startpoints (or even from the same startpoint, but with the training data reordered), they are unlikely to code their input data in the same way, so that even if one found a set of units correlating with a certain type of decision in one system, a different system would not show the same correlation.
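One common way of "interpreting the internal states" is a probing classifier: collect a layer's activations and ask whether some property of the decision is linearly decodable from them. The sketch below assumes nothing about any specific model – the two "networks" are just random stand-in encoders – but it illustrates the point above: the same property can be decodable from both systems even though matching unit indices in the two systems barely correlate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Stand-in for a trained network's hidden layer; in practice these activations
# would be collected by running real inputs through the real trained model.
def hidden_activations(X, W):
    return np.tanh(X @ W)

n, d_in, d_hidden = 500, 10, 32
X = rng.normal(size=(n, d_in))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # some property of the decision

W_a = rng.normal(size=(d_in, d_hidden))   # "system A"
W_b = rng.normal(size=(d_in, d_hidden))   # "system B": a different startpoint

H_a = hidden_activations(X, W_a)
H_b = hidden_activations(X, W_b)

# Probe: is the property linearly decodable from each system's hidden units?
probe_a = LogisticRegression(max_iter=1000).fit(H_a, y)
probe_b = LogisticRegression(max_iter=1000).fit(H_b, y)
print("decodable from A:", probe_a.score(H_a, y))
print("decodable from B:", probe_b.score(H_b, y))

# But the unit-by-unit story differs: unit i in A and unit i in B play
# different roles, so correlations found in one system do not transfer.
unit_corr = [np.corrcoef(H_a[:, i], H_b[:, i])[0, 1] for i in range(d_hidden)]
print("mean |corr| between matching unit indices:", np.mean(np.abs(unit_corr)))
```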
Trained transformer systems use very deep (and very wide) architectures, making the areas associated with correct or incorrect decisions very hard to identify. A recent development is the Co4 system (Co4 reference), which can produce good (though not perfect) results with fewer layers and fewer attention heads. This suggests that in this architecture the earlier layers are faster at finding useful representations. These representations (if localised) may provide interpretable features, suggesting that explanation may be easier in these systems – though they remain unable to provide a logical chain of reasoning for their decisions.
In Transformers (and also Co4 systems), the different units in the system receive different inputs but do not interact with each other within a layer. In actual brains there is lateral inhibition, mediated by inhibitory interneurons. This makes neighbouring excitatory neurons (layer 5 pyramidal neurons in mammalian neocortex) less likely to be active simultaneously, producing relatively localised representations at each instant. Adding this to Transformers or Co4 systems should make them easier to interpret.
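The biological detail aside, the computational effect being proposed is easy to sketch. The snippet below uses a simple k-winners-take-all rule over a layer's activations as a crude stand-in for lateral inhibition – one of several possible implementations, not drawn from any published architecture – so that only a few, nameable units are active for any given input.

```python
import numpy as np

def k_winners_take_all(activations, k):
    """Crude stand-in for lateral inhibition: keep the k most active units
    in each row and suppress the rest to zero, so the representation stays
    localised (few units active at any instant)."""
    out = np.zeros_like(activations)
    top = np.argsort(activations, axis=-1)[:, -k:]  # indices of the k largest per row
    rows = np.arange(activations.shape[0])[:, None]
    out[rows, top] = activations[rows, top]
    return out

rng = np.random.default_rng(2)
layer_output = rng.normal(size=(4, 16))             # 4 inputs, 16 units in the layer
sparse_output = k_winners_take_all(layer_output, k=3)
print((sparse_output != 0).sum(axis=1))             # -> [3 3 3 3]: only 3 active units per input
```

A subtractive interaction between neighbouring units would be closer to the biology; the hard winner-take-all form is chosen here only because it makes the resulting sparsity obvious.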
How should one proceed with improving explainability in AI systems? Firstly, one needs to understand the different meanings that explainability has, ranging from identifying particularly active elements of the hardware of the (real or artificial) system to providing a (comprehensible) set of logical steps leading from the premises to the conclusions. Note that comprehensibility is critical here: one can rewrite all the weights in a network as a set of equations and provide that as an explanation. While this is a set of logical steps, it is not comprehensible. Any explanation is likely to be post hoc (as human explanations are).
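To see why "the weights as a set of equations" fails as an explanation, it is enough to write out even a toy network; the two-layer, three-hidden-unit example below is an arbitrary illustration of my own, not any real system.

```latex
% A toy two-layer network with four inputs and three hidden units, written
% out as the "set of equations" it computes. Even at this size the expression
% explains nothing a reader can follow; a trained transformer has billions of
% such weights.
\[
  y = \sigma\!\Big( b + \sum_{j=1}^{3} v_j \,
        \tanh\!\Big( c_j + \sum_{i=1}^{4} w_{ji} x_i \Big) \Big),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
```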
This suggests that one solution would be to have an additional system which interprets the original decision-making system, providing the explanation after the decision has been made. This second system would need to be trained on the (trained) original system's decisions and classifications, meaning that training it would itself be time-consuming! It is counterintuitive that joining two systems, neither of which can explain its decisions, could create a single system that can explain its decisions. But explainability is sufficiently important for investigating this to be worthwhile.
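One well-worn version of this "additional system" is a post-hoc surrogate: train a simple, human-readable model on the inputs together with the original system's own outputs, and offer the surrogate's structure as the explanation. A minimal sketch with scikit-learn follows; the black box here is just a stand-in function, since in practice it would be the trained original system, and generating its decisions at scale is exactly the training cost raised above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)

# Stand-in for the trained, unexplainable original system.
def black_box_predict(X):
    return ((X[:, 0] > 0.2) & (X[:, 2] < 0.5)).astype(int)

# The "explainer" is trained on the black box's decisions, not on ground truth.
X = rng.uniform(-1, 1, size=(2000, 4))
y_black_box = black_box_predict(X)

surrogate = DecisionTreeClassifier(max_depth=3).fit(X, y_black_box)
print("agreement with the black box:", surrogate.score(X, y_black_box))
print(export_text(surrogate, feature_names=["f0", "f1", "f2", "f3"]))
```

The surrogate's rules are comprehensible and post hoc, which is the shape of explanation argued for here, but they describe the original system's behaviour rather than its mechanism.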
1. Once in the early 1980s a company of which I was a director (Silicon Glen Ltd.) was trying to get funding from a bank for some (early) computer-based work. The bank manager (this was a long time ago, when there were bank managers that one was able to talk to!) said he would have understood if we were looking for a loan for a combine harvester, but for computers? He was at a loss.
2. For an M by N monochrome pixel image with 8-bit element depth, this has 256^(M×N) = 2^(8MN) possible distinct images, so any feasible training set samples only a vanishingly small fraction of the image space.
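For scale, a worked instance of that count (the 28 by 28 size is my arbitrary choice for illustration):

```latex
% Number of distinct 8-bit monochrome images at a modest 28x28 resolution.
\[
  2^{\,8 \times 28 \times 28} \;=\; 2^{6272} \;\approx\; 10^{1888},
\]
% so even billions of training images sample a negligible fraction of the space.
```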