Training an AI model with your data is easier than you think.
The advent of generally available Generative AI tools, like those provided by OpenAI, Microsoft, Google, and Amazon, has recently raised the hype level around AI. But the previous generation of AI tools, such as Machine Learning (ML), has been with us, processing our documents and other data, for well over a decade.
Whether you are interested in the new Generative AI or already using an ML-based technology, there is one key thing to remember: when you train AI models, they are only as good as the data they are trained on. That holds whether you are working with an open-source or commercially available Large Language Model (LLM) or with a training set of 100 of your invoices. To use an old maxim, even with AI, garbage in means garbage out.
There are steps you can take to prevent this, though, and luckily none of them are rocket science. So let's take a look at how to build an AI model, and how to ensure it is a high-quality model that will augment the automation of your business for processes like data discovery, data inventory, and data crawling.
Old School versus the New Wave
Depending on which technologies you plan to use, the steps you take will differ in their nuances and details, so let's quickly assess the different types of AI you might be looking to use:
- Commercially available, LLM-based Generative AI – you might want to upload your data to a commercial service built on an LLM-based system, such as Microsoft's use of OpenAI's GPT-4.
- Private, in-house development of an LLM-based Generative AI – you may decide to develop your own corporate LLM using widely available open-source tools.
- Commercial or open-source Machine Learning-based tools – this is now the 'old school' of AI, where you create your own model, or use commercially available services, but you train them on a set of your existing data.
The big difference between Generative AI and old-school ML is that LLMs don't need training on a set of your existing data. For example, the acronym GPT stands for Generative Pre-Trained Transformer. To simplify things, the Transformer is the very clever AI algorithm, and the Pre-Trained part means the model has already learned from a truly enormous corpus of text (or images, or audio), far bigger, broader, and more diverse than any training set you could easily pull together. That learning is 'unsupervised' – as opposed to a custom ML model, which uses 'supervised learning' because you feed it a curated training set of data.
Now that we have some background context, let's look at the steps required to prepare your data so that it can be used with an AI system or tool.
Data Labelling
Data Labelling, or Data Annotation, is a step in what can be considered the AI pre-processing stage: it is part of preparing your data for use by AI. ML models require labels to train their Natural Language Processing (NLP) or Machine Vision deep learning algorithms, but even Generative AI can benefit from them; more on that later.
The benefits of data labelling are:
- More precise predictions – accurate data labelling ensures better quality assurance, preventing the "garbage in, garbage out" scenario
- Better data usability – labelling can also improve the usability of the data variables within a model
Challenges of data labelling:
- Expensive and time-consuming – either setting up processes and pipelines for automated labelling, or putting people in place for manual labelling
- Prone to human error – QA checks are essential to maintaining data integrity
What exactly are these labels though?
Labels are just metadata, also commonly referred to as tags. You may already have many metadata fields attached to your data, or you may need more descriptive metadata fields for use as ML training labels.
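For illustration, here is a minimal sketch, in Python, of what a small labelled training set might look like. The documents and labels are entirely hypothetical; the point is simply that each record pairs a piece of content with a tag a supervised ML algorithm can learn to predict.

```python
# A hypothetical, minimal training set for a document-classification model.
# Each record pairs raw text with a human-assigned label (the metadata tag).
training_samples = [
    {"text": "Invoice #10442 - Total due: $1,250.00 by 2024-03-31", "label": "invoice"},
    {"text": "Statement of Work for the Q2 consulting engagement", "label": "contract"},
    {"text": "Please find attached the meeting minutes from Monday", "label": "correspondence"},
]

# The labels are what the supervised algorithm learns to predict;
# without them, the model has nothing to check its guesses against.
for sample in training_samples:
    print(f"{sample['label']:>15}: {sample['text']}")
```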
Now this is where it gets fun: you can use simpler AI algorithms to help prepare your data for ingestion into bigger or more complex models!
So, how does this work? Shinydocs, for example, uses AI capabilities like Named Entity Recognition (NER) and Machine Vision, alongside non-AI techniques like regular expressions for pattern matching, to automate the creation of additional metadata tags for all your content, providing those important "labels" for the ML algorithm to work with. Automating the creation of extra tags removes the time-consuming, expensive manual work, leaving experts free to review samples, or every file, to ensure quality. Remember: garbage in means a low-quality AI model, which means garbage out when you turn it on in 'production' and start feeding it data to process.
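To make that concrete, here is a minimal sketch of automated tag creation that combines NER with a regular expression. It assumes the open-source spaCy library and its small English model are installed, and uses a made-up invoice-number pattern; it illustrates the general technique rather than Shinydocs' actual implementation.

```python
import re
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

# Non-AI pattern matching: a regex for a hypothetical invoice-number format.
INVOICE_PATTERN = re.compile(r"\bINV-\d{5,}\b")

def auto_tag(text: str) -> dict:
    """Generate metadata tags for a document using NER plus regex matching."""
    doc = nlp(text)
    return {
        # Named Entity Recognition pulls out organizations, dates, monetary amounts, etc.
        "organizations": sorted({ent.text for ent in doc.ents if ent.label_ == "ORG"}),
        "dates": sorted({ent.text for ent in doc.ents if ent.label_ == "DATE"}),
        "amounts": sorted({ent.text for ent in doc.ents if ent.label_ == "MONEY"}),
        # Regular expressions handle rigid, predictable patterns.
        "invoice_numbers": INVOICE_PATTERN.findall(text),
    }

print(auto_tag("INV-10442 from Acme Corp, due March 31, 2024, total $1,250.00"))
```

Tags produced this way can be written back to each file's metadata, giving the downstream ML model its labels without anyone typing them by hand.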
But Generative AI doesn’t need training…?
Yes, it's true: we noted above that Generative AI systems come pre-trained on potentially massive amounts of data. However, that does not mean there is no great benefit to labelling your data, and here is why:
If you want to send your data to commercially available, cloud-based Generative AI tools, there are many reasons to be careful about which data you allow to be ingested into those tools. You may want to ensure you have identified all files that contain Personally Identifiable Information (PII) or commercially sensitive intellectual property, tagging them with appropriate metadata to ensure they are not added to the AI processing pipeline.
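As a rough illustration of that kind of pre-flight screening, the sketch below uses a few deliberately simple regex patterns to flag documents that appear to contain PII so they can be held back from the AI processing pipeline. A production PII scan would be far more thorough than these illustrative patterns.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_tags(text: str) -> list[str]:
    """Return the names of any PII patterns found in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

documents = {
    "quarterly_report.txt": "Revenue grew 12% year over year.",
    "hr_record.txt": "Contact Jane at jane.doe@example.com, SSN 123-45-6789.",
}

# Only documents with no PII tags are cleared to enter the AI processing pipeline.
safe_to_send = {name: text for name, text in documents.items() if not pii_tags(text)}
print("Cleared for upload:", list(safe_to_send))
```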
Or, if you are using freely available open-source tools to build your organization's own LLM based on your private data, once you have created a 'foundational model' you can adapt and customize it to optimize its performance for specific business processes. At this point you may take a "semi-supervised learning" approach, which brings back the use of labels to ensure high-quality data is used to tune the LLM-based model.
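To show what those labels might look like in practice, here is a minimal sketch that writes a couple of hypothetical labelled prompt/completion pairs to a JSON Lines file, the kind of curated, labelled dataset many open-source fine-tuning tools accept when you tune a foundational model for a specific business process. The example messages, labels, and file name are assumptions for illustration.

```python
import json

# Hypothetical curated examples: each labelled pair is a supervised signal
# used to tune the foundational model for one business process.
labelled_examples = [
    {
        "prompt": "Classify this incoming message: 'My order arrived damaged.'",
        "completion": "complaint",
    },
    {
        "prompt": "Classify this incoming message: 'Please renew our support contract.'",
        "completion": "contract_renewal",
    },
]

# Many open-source fine-tuning tools accept labelled data in a JSON Lines format like this.
with open("fine_tune_data.jsonl", "w", encoding="utf-8") as f:
    for example in labelled_examples:
        f.write(json.dumps(example) + "\n")
```

The quality of the tuned model rests directly on the quality of these labelled examples, which is why the labelling step still matters even when the underlying LLM arrives pre-trained.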
So, in summary: whether you are training an old-school ML algorithm to process your invoices, developing your own highly secure corporate foundational LLM and customizing it to process incoming complaints or contracts, or sending ten years of outbound orders to Microsoft's Azure Cognitive Services implementation of GPT-4 to look for patterns and insights, you will benefit from well-classified, well-labelled (or tagged) information that prevents garbage being input to your AI and, therefore, garbage being output by it.
In our recent webinar on how to prepare your data for AI, Shinydocs showed how straightforward it is to prepare your data, and demonstrated open-source AI tools as an illustration of how easy this end-to-end process can be.
About Shinydocs
Shinydocs automates the process of finding, identifying, and actioning the exponentially growing amount of unstructured data, content, and files stored across your business.
Our solutions and experienced team work together to give organizations an enhanced understanding of their content to drive key business decisions, reduce the risk of unmanaged sensitive information, and improve the efficiency of business processes.
We believe that there’s a better, more intuitive way for businesses to manage their data. Request a meeting today to improve your data management, compliance, and governance.