We’ve covered the use of artificial intelligence in recruiting thoroughly on the RightJobNow blog over the past few years, certainly since it became a popular buzzword in the industry. But the one thing we haven’t touched on is how much data it actually takes to make a successful AI application.
I’ve mentioned in one of my posts how one Google experiment took a ton of computing power (a network of 16,000 computer processors) to have an AI model learn to identify cats from 10 million images. And Mark has touched on the kind of data needed in one of his posts. But most business applications don’t require datasets the size of Google’s machine learning experiment.
So how much, then, is enough? The short answer is that it depends on what you’re trying to accomplish.
Before we explore that further, let’s recap what AI is and then delve into some definitive answers on data requirements.
What Is AI?
Broadly defined, AI is a computer program that is capable of creating its own rules of operation.
With an ordinary computer program, someone provides the input and the instructions for handling it, and the computer produces the output. An AI model is instead given both the input and the output in the form of datasets, and it then finds the shortest path to link the two. That’s how it “learns.”
The problem with finding the shortest path is that there could be many paths from input to output and the shortest path may not be what the designer wants the AI to learn.
That’s why most successful AI models require lots of data.
The more sophisticated the desired output (like understanding the context of a spoken language) the larger the dataset required. With enough data the AI model learns to reach a desired output consistently and accurately.
A dataset in the above example might include the usage rules of a language, say English, including definitions, slang, idioms, dialect, etc. The more data you provide, the better the AI will understand English and the more accurately it will interpret a sentence, a question, or a conversation.
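To make the contrast between an ordinary program and a learning one concrete, here is a minimal sketch in Python. The temperature conversion is purely an illustrative assumption and does not come from any recruiting system. The first function has its rule written by a person. The second is only given example inputs and outputs and works out a straight-line rule that links them, using ordinary least squares. Real AI models learn far more complicated mappings, but the idea is the same.

```python
def celsius_to_fahrenheit(celsius):
    # Ordinary program: a person wrote the rule.
    return celsius * 9 / 5 + 32


def fit_line(xs, ys):
    # "Learning" program: derive slope and intercept from example
    # input/output pairs using ordinary least squares.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative dataset: Celsius inputs and the Fahrenheit outputs we want.
inputs = [-10, 0, 10, 20, 30]
outputs = [14, 32, 50, 68, 86]

slope, intercept = fit_line(inputs, outputs)


def learned_rule(celsius):
    # The rule the program found on its own from the data.
    return slope * celsius + intercept


print(celsius_to_fahrenheit(25), learned_rule(25))
```

With clean examples like these, the learned rule lands on the same answer as the hand-written one. With few or noisy examples it may not, which is one way to see why the amount and quality of data matter.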
The Bottom-Up AI Model
Traditional AI applications are built on the bottom-up approach using simple methods and systems that grow to become more complex as the AI learns.
With this kind of model the smaller systems are linked to a larger subsystem that accounts for all the information necessary.
Because AI needs a tremendous amount of data to learn, most of that data has to be high-quality training data. A high-quality dataset is typically judged on the following criteria:
- Completeness – are there missing records?
- Accessibility – can the data be accessed?
- Correctness – are the records free of mistakes?
- Connectivity – can different datasets be joined?
- Quantity – is the amount of data limited?
- Validity – does the data contain outliers?
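A few of these checks can be automated. The sketch below, in plain Python, covers only three of the criteria: completeness, quantity, and validity. Accessibility, correctness, and connectivity usually need knowledge of where the data lives and what it should say. The function name, the field names, the sample records, and the outlier rule (1.5 times the interquartile range) are all assumptions made for illustration, not a standard.

```python
from statistics import quantiles


def quality_report(records, required_fields, numeric_field, min_rows):
    # Completeness: count records where a required field is missing or None.
    incomplete = [
        r for r in records
        if any(r.get(field) is None for field in required_fields)
    ]

    # Quantity: is there at least the (assumed) minimum number of records?
    enough_rows = len(records) >= min_rows

    # Validity: flag values far outside the interquartile range.
    values = [r[numeric_field] for r in records
              if r.get(numeric_field) is not None]
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = [v for v in values if v < low or v > high]

    return {
        "incomplete_count": len(incomplete),
        "enough_rows": enough_rows,
        "outliers": outliers,
    }


# Hypothetical candidate records, made up for this example.
sample_records = [
    {"role": "engineer", "years": 2},
    {"role": "engineer", "years": 3},
    {"role": "analyst", "years": 3},
    {"role": "analyst", "years": 4},
    {"role": None, "years": 4},
    {"role": "designer", "years": 5},
    {"role": "designer", "years": 60},
]

report = quality_report(sample_records, ["role", "years"], "years", min_rows=10)
print(report)
```

Here the report flags one incomplete record, one suspicious value (60 years of experience), and too few rows for the assumed minimum of 10.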
If the dataset falls short in a critical category, like quantity, problems arise. When the amount of data is too limited, the AI will fail to achieve the expected results until the appropriate data is provided.
If you look at something like recruiting chatbots, the dataset has to be large enough to cover as many questions, phrased in as many ways, and as many varying candidate behaviors as possible. It also has to account for the specifics of each available job role and of your organization.
There is also the need for pre-training in some models. Referring back to the chatbot, this would require the language skills mentioned in the last section in order for it to understand the context of open requisitions and related organization info.
If the chatbot comes across a query it is not trained to answer, for example one it misreads because of how it is phrased, it’s a simple matter of adding the data necessary to close that gap. A chatbot, in that sense, is expected to have a higher “error tolerance.”
But when you get to more sophisticated AI applications such as video interview analysis, where bias has to be eliminated to prevent the possibility of discrimination, you have to narrow down the error tolerance, which requires more data.
Therefore the complexity of the model determines the scope of the dataset.
So, when you consider the level of training involved for even the simplest AI model, like a chatbot, you start to get an idea of just how much data may be required to create a functioning and successful AI.
The problem is that the more high-quality data you need, the harder it is to come by and the more expensive it becomes to acquire. That’s why some AI developers have looked to a top-down approach to cut costs.
The Top-Down Model
A top-down model uses pre-defined data to solve a problem or make a decision. It relies less on data collection and training and instead borrows from existing datasets.
Essentially, this is not machine learning, which is what most people picture when they think of artificial intelligence: a robot that acts human and makes human decisions.
AI is an umbrella term and machine learning is a subset of it. Most business and recruitment AI systems use the top-down approach as they tend to solve basic problems of automation.
Sticking with the chatbot example, the top-down approach is more appropriate, where the AI software is making a decision based on pre-programmed instructions.
It requires minimal training on data related to job queries and organization information. Add in the language presets and you’re good to go.
The thing with a top-down model is that it doesn’t necessarily use less data. It means the developer spends less time training it, which cuts down the cost of developing the model.
A top-down AI can actually increase the amount of data needed over time, because data has to be added each time the AI encounters a scenario for which it is not trained. That again becomes a problem if the data is hard to come by.
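A top-down chatbot of the kind described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular recruiting product works. The keywords, the canned answers, and the function names are all hypothetical. The bot matches pre-programmed keywords, falls back to a default reply when nothing matches, and keeps a list of the queries it could not answer so that someone can add data for them later.

```python
import re

# Hypothetical pre-programmed rules: keyword -> canned answer.
RULES = {
    "salary": "Salary ranges are listed on each job posting.",
    "remote": "Remote options depend on the role; please check the posting.",
    "apply": "You can apply through the careers page.",
}

FALLBACK = "Sorry, I don't have an answer for that yet."

# Queries the bot could not answer, kept for later review.
unanswered = []


def answer(query, rules=RULES):
    # Lowercase the query and split it into words, dropping punctuation.
    words = re.findall(r"[a-z']+", query.lower())
    for keyword, reply in rules.items():
        if keyword in words:
            return reply
    # No rule matched: remember the gap and return the default reply.
    unanswered.append(query)
    return FALLBACK


print(answer("Is this job remote?"))
print(answer("What are the benefits?"))

# Closing the gap means adding more data, here a new rule.
RULES["benefits"] = "Benefits are described in the offer package."
print(answer("What are the benefits?"))
```

No training happens here at all, which is the cost saving. The rule table still grows every time the bot meets a question it was not built for, which is why this approach does not necessarily mean less data.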
So, How Much Data Does A Successful AI Require?
A lot. Or maybe just enough to get the job done. It requires at least thousands of data points. Probably hundreds of thousands. Maybe even millions. It depends on the type of data available. And what you want the AI to accomplish.
The harder the problem, the more data you’ll need.
Does that answer the question?