Image credit: CoPilot

Computer vision – Turning an image into a thousand words

Vision comes so naturally to humans that it is hard to see how difficult it could be to make computers see. Yet it has taken cognitive scientists decades to uncover how human vision works, and it is all the more complex to make computers see the way humans do. The initial goal was to replicate human vision; since cameras can record anything, it seemed straightforward from a technical perspective. But there is a huge difference between recording something in a visual field and understanding it. The basic question for computer vision has therefore been how to convert visual information into categorical information expressed in words, or, to put it even more simply, how to detect recognizable objects in a visual field.

Turning two-dimensional images into three-dimensional objects

The roots of computer vision can be traced back to the 1960s and 1970s, when researchers began experimenting with early image processing techniques. One early milestone was Larry Roberts’ 1963 Ph.D. thesis “Machine Perception of Three-Dimensional Solids,” in which Roberts explored the possibility of extracting three-dimensional information from two-dimensional images, a crucial problem for any kind of computer vision. This research set the stage for further exploration.

To recognize an object, it is also necessary to figure out where it starts and ends. That was the focus of the subsequent decades, in which algorithms for edge detection were developed. The Canny edge detector, developed by John F. Canny in 1986, marked a significant milestone toward being able to detect objects. These algorithms allowed computers to interpret images in a more structured way, identifying objects and patterns in images.
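To give a flavour of what edge detection produces, here is a minimal sketch using the Canny detector as implemented in the open-source OpenCV library (mentioned later in this article); the file names are placeholders and the thresholds are arbitrary illustrative values.

```python
import cv2

# Read an image and convert it to grayscale (file name is a placeholder).
image = cv2.imread("street_scene.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Canny edge detection: the two thresholds control which intensity
# gradients count as edges; values are usually tuned by experimentation.
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

# The result is a binary image in which white pixels mark detected edges,
# i.e. candidate boundaries where objects start and end.
cv2.imwrite("street_scene_edges.png", edges)
```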

Enter the public image datasets

Computer vision, however, saw only limited progress until the exponential growth of computational power and the advent of more sophisticated algorithms laid the basis for significant leaps ahead. The emergence of new machine learning techniques in the 1990s, which took advantage of the increased accessibility of computational power and data, enabled computers to improve their performance on visual tasks, but it was not until the 2010s, with the rise of neural networks, that computer vision really started to take off. The primary driver was not the increased computational and algorithmic power but the development of comprehensive, publicly shared training datasets. Keep in mind that developing a computer vision algorithm requires collecting and categorizing a large amount of image data, which can easily take months; if every research team has to do this on its own, that is time taken away from the research itself. A further problem is comparing the relative success of algorithms when they have been trained and tested on different data.

That was the problem facing Fei-Fei Li at the start of the new millennium. She was looking for a way to build large datasets for training computer vision algorithms and approached Christiane Fellbaum, one of the creators of WordNet, about making something similar for images. As an assistant professor at Princeton University, Li worked with volunteer collaborators and with annotators paid through Amazon’s Mechanical Turk service to classify the contents of millions of images in what became the ImageNet database. The vision of ImageNet was to map the entire visual world, and at its core it ended up being a colossal repository of over 14 million curated and annotated images.

While the dataset was valuable in itself, what really drove the development of computer vision algorithms was the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a competition in which teams vied to develop the best-performing algorithm. This allowed experimentation and comparison between different approaches and algorithms, which is crucial for advancing knowledge in an evolving field.

But ImageNet was primarily made up of images of single objects, which posed problems for real-world scenes: the performance of the algorithms dropped significantly where multiple objects occur in the same image. To remedy this, a new project was conceived by Tsung-Yi Lin in the lab of Serge Belongie at Cornell University. The aspiration was to develop a comprehensive dataset with a smaller number of common objects shown in context. In 2014, Tsung-Yi Lin, in collaboration with Microsoft and a consortium of universities, presented the Common Objects in Context (COCO) dataset. Rather than covering a large number of different object categories, COCO focused on only 91 of them, but annotated each object with a bounding box marking where it appears in the image. This boosted the possibility of developing algorithms that could detect objects in real-world images.
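As an illustration of what such annotations look like, a COCO-style annotation file is plain JSON listing images, object categories, and bounding boxes. The sketch below, which assumes a locally downloaded annotation file (the path is a placeholder), reads the boxes for one image with Python’s standard library.

```python
import json

# Load a COCO-style annotation file (path is a placeholder).
with open("annotations/instances_val2014.json") as f:
    coco = json.load(f)

# Map category ids to human-readable names, e.g. 18 -> "dog".
categories = {c["id"]: c["name"] for c in coco["categories"]}

# Print every annotated object in the first image.
# COCO bounding boxes are given as [x, y, width, height] in pixels.
first_image_id = coco["images"][0]["id"]
for ann in coco["annotations"]:
    if ann["image_id"] == first_image_id:
        x, y, w, h = ann["bbox"]
        print(f"{categories[ann['category_id']]}: box at ({x}, {y}), {w}x{h} px")
```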

Without the development of these public datasets to provide benchmarks against which AI algorithms could be measured, it is difficult to see how computer vision could have developed so fast. Convolutional neural networks (CNNs) transformed computer vision and drastically improved accuracy in image recognition and classification tasks. The availability of large labeled datasets like ImageNet and COCO, combined with the computational power of GPUs and deep learning algorithms, thus propelled computer vision very close to human performance when it comes to classifying objects in images.
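As a rough sketch of what such a network looks like in code, the following defines a deliberately tiny convolutional classifier in PyTorch; the layer sizes, input resolution, and 10-class output are arbitrary choices for illustration, not a model described in this article.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A deliberately small convolutional network for illustration."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Convolutions learn local filters (edges, textures, object parts).
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        # A final linear layer maps the pooled features to class scores,
        # i.e. it turns the image into (a distribution over) category labels.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)

# One random 64x64 RGB image as a stand-in for real data.
scores = TinyCNN()(torch.randn(1, 3, 64, 64))
print(scores.shape)  # torch.Size([1, 10]) -- one score per class
```

In practice, the networks that won ILSVRC were far deeper and were trained on millions of labeled images, but the principle is the same: stacked convolutions turn pixels into features, and a classifier turns features into words.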

Making images speak

From the early days, the story of computer vision has been one of turning images into words. The historical development of computer vision has gradually made it possible to identify objects in images with increasing success. The impact of computer vision is profound and multifaceted. It may not be something we encounter as directly every day as search and recommendations, but computer vision systems are ubiquitous, embedded in various aspects of daily life: on our phones, where they recognize people in photos; in traffic cameras; and in image search on the internet. They power facial recognition systems, enable self-driving cars, facilitate medical image analysis, and enhance security through surveillance systems.

Computer vision has also catalyzed advancements in other domains of AI through the development of increasingly powerful neural networks. The ability of machines to interpret and act upon visual information has opened up new possibilities in automation, augmented reality, and human-computer interaction. Yet although computer vision has found uses in many different areas, it has not created industry champions or mega markets the way search and recommendation algorithms have. Companies like Tesla, Maxar, and Waymo use computer vision, but it is a component of their products rather than a business model in itself.

The democratization of computer vision technologies through open-source libraries like OpenCV and TensorFlow has empowered a broader community of developers and researchers, fostering innovation and application across industries. As computer vision continues to evolve, it promises to drive further advancements in AI, making machines more perceptive and responsive to the complexities of the real world.

In essence, the trajectory of computer vision illustrates a remarkable convergence of improved training data, technological development, and practical application, cementing its role as a cornerstone of modern artificial intelligence but without any deep market impact. With the advent of generative AI, we are now able to make an image say more than a thousand words.

This was part three of the series The Six Waves of AI in the 21st Century

