This is the first part of a 3-part piece. These articles are notes from my self-study of data pre-processing in ML problem solving. Major resource for study: Google AI educational resources.
Data pre-processing often takes more than half of the project time. This is because data can make or mar your model.
The following are the steps to take for data pre-processing:
- Construct the dataset
- Transform the data
Construct the dataset
The following are the steps to take to construct the dataset:
- Collect the data
- Identify feature and label sources
- Select a sampling strategy
- Split the data
Collect the data
The following are the things to consider when collecting the data:
- Size and Quality of your dataset
- Joining Logs
- Label sources
Size and Quality of Data
Take into account both the size and the quality of your dataset. More often than not, the more data you have, the better your overall model performance.
As a rough rule of thumb, your model should train on at least an order of magnitude more examples than it has trainable parameters. Your model is only as good as your data.
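The rule of thumb above can be written as a quick sanity check. This is only a minimal sketch: the function name `has_enough_examples` and the default factor of 10 (one order of magnitude) are my own choices for illustration, and the check is a rough guide rather than a guarantee.

```python
def has_enough_examples(num_examples: int, num_trainable_params: int, factor: int = 10) -> bool:
    """Return True if there are at least `factor` times more examples than parameters."""
    return num_examples >= factor * num_trainable_params


# Assumed example: a linear model with 20 features has 20 weights plus 1 bias.
linear_params = 20 + 1

enough = has_enough_examples(5_000, linear_params)   # 5000 >= 210
too_few = has_enough_examples(150, linear_params)    # 150 < 210
print(enough, too_few)
```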
However, the quality of the dataset is also paramount: a large dataset of poor quality is of little use.
Quality, though, is a fuzzy term. A good framework for determining what passes for quality data is to:
Use data that lets you succeed with the business problem you care about.
Data is good if it lets you accomplish the intended task.
Having said that, concrete measures of data quality include:
- Reliability
- Feature Representation
- Minimizing Skew
The following are questions to answer when determining the reliability of your data:
- how common are label errors?
- are your features noisy?
- is the data properly filtered for your problem?
The following are indicators of unreliable data:
- omitted values
- bad labels
- duplicate examples
- bad feature values
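The indicators above can be counted with a small audit pass over the data. The sketch below uses a made-up toy dataset of `(height_cm, label)` pairs; the allowed label set and the valid height range are assumptions chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy dataset: each example is (height_cm, label).
examples = [
    (172.0, "adult"),
    (None, "adult"),     # omitted value
    (172.0, "adult"),    # duplicate of the first example
    (-5.0, "child"),     # bad feature value: a height cannot be negative
    (120.0, "chlid"),    # bad label: a typo that is not in the allowed set
]
ALLOWED_LABELS = {"adult", "child"}  # assumed label vocabulary


def audit(rows):
    """Count the four indicators of unreliable data in `rows`."""
    counts = Counter(rows)
    return {
        "omitted_values": sum(1 for h, _ in rows if h is None),
        "bad_labels": sum(1 for _, y in rows if y not in ALLOWED_LABELS),
        # Every copy beyond the first counts as one duplicate.
        "duplicate_examples": sum(c - 1 for c in counts.values() if c > 1),
        # Assumed valid range for a height in centimetres.
        "bad_feature_values": sum(
            1 for h, _ in rows if h is not None and not 0 < h < 300
        ),
    }


report = audit(examples)
print(report)
```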
When starting on a new ML problem, it is usually best to begin with just 1 or 2 features. Here are some of the questions to answer when mapping data to useful features (feature representation):
- How is data shown to the model?
- Should you normalize numeric values?
- How should you handle outliers?
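Two common answers to the last two questions are z-score normalization and clipping. The sketch below is one possible approach, not the only one; the feature name `rooms_per_person` and the clipping bounds are assumptions made up for the example.

```python
from statistics import mean, pstdev


def clip(values, lower, upper):
    """Cap every value to the range [lower, upper] to limit the effect of outliers."""
    return [min(max(v, lower), upper) for v in values]


def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values; 50.0 is an outlier.
rooms_per_person = [1.0, 2.0, 3.0, 2.0, 50.0]

clipped = clip(rooms_per_person, lower=0.0, upper=4.0)
normalized = z_score(clipped)
print(clipped)
print(normalized)
```

Clipping before normalizing keeps the single outlier from dominating the mean and standard deviation.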
Minimizing skew means avoiding training/serving skew, which happens when your metrics give different results at training time than at serving time. The more closely your training task matches your prediction task, the better your ML system will perform.
Golden Rule: do unto training as you would do unto prediction.
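One practical way to follow the golden rule is to route training and serving through the same preprocessing function, so the two code paths cannot drift apart. This is a minimal sketch; the record fields `price` and `in_stock` and the function names are hypothetical.

```python
import math


def preprocess(raw: dict) -> list:
    """Turn one raw record into model features. Used for BOTH training and serving."""
    price = max(raw["price"], 0.0)               # guard against negative prices
    return [math.log1p(price), 1.0 if raw["in_stock"] else 0.0]


# Training time: build features from logged records (hypothetical data).
training_rows = [
    {"price": 9.0, "in_stock": True},
    {"price": 99.0, "in_stock": False},
]
train_features = [preprocess(r) for r in training_rows]


def serve(request: dict) -> list:
    """Serving time: reuse exactly the same function on the incoming request."""
    return preprocess(request)


print(train_features)
print(serve({"price": 9.0, "in_stock": True}))
```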
Joining Logs
When assembling a training dataset, you often need to combine multiple sources of data. There are 3 types of input data:
- transactional logs
- attribute data
- aggregate statistics
Transactional logs record a specific event. For example, a transactional log might record the date and time a post enters a database.
Attribute data contains snapshots of information. It isn't specific to an event or a moment in time, but it can still be used to make predictions, especially predictions that are not tied to a specific event, e.g. the number of blemishes in a particular product.
You can create a kind of attribute data by aggregating several transactional logs. The result is known as aggregate statistics, e.g. the frequency of user queries.
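To make the three input types concrete, here is a small sketch that aggregates a transactional log into a per-user query count and joins it with attribute data. All user IDs, timestamps, and field names are made up for the example.

```python
from collections import Counter

# Transactional log: one row per event (hypothetical query log).
query_log = [
    {"user_id": "u1", "timestamp": "2024-01-01T09:00", "query": "shoes"},
    {"user_id": "u2", "timestamp": "2024-01-01T09:05", "query": "bags"},
    {"user_id": "u1", "timestamp": "2024-01-02T18:30", "query": "socks"},
]

# Attribute data: a snapshot per user, not tied to any single event.
user_attributes = {
    "u1": {"country": "NG"},
    "u2": {"country": "KE"},
}

# Aggregate statistics: frequency of queries per user, derived from the log.
queries_per_user = Counter(row["user_id"] for row in query_log)

# Join the sources on user_id to form one row per user.
training_rows = [
    {"user_id": uid, **attrs, "query_count": queries_per_user.get(uid, 0)}
    for uid, attrs in user_attributes.items()
]
print(training_rows)
```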
Joining logs from different locations is often necessary when assembling your training data. Prediction (test) data, however, comes from one of the following sources:
- online sources
- offline sources
The choice between the two options can be made using this framework:
online: latency is a concern, so your system must generate input quickly. Attribute data and aggregate statistics may therefore need to be computed or looked up beforehand rather than on the fly, because computing them at request time adds latency to the system.
offline: you likely don't have compute restrictions (e.g. latency), so you can do complex computations similar to those used in training data generation.
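The online case can be sketched as a precomputed lookup table: the expensive aggregation runs ahead of time, and the request path only does a dictionary lookup. The names `build_lookup`, `QUERY_COUNTS`, and `online_features` are hypothetical, and a real system would likely use a key-value store rather than an in-memory dict.

```python
def build_lookup(logged_user_ids):
    """Offline job: scan the full log (no latency limit) and count queries per user."""
    counts = {}
    for user_id in logged_user_ids:
        counts[user_id] = counts.get(user_id, 0) + 1
    return counts


# Computed beforehand, e.g. by a nightly batch job (assumed schedule).
QUERY_COUNTS = build_lookup(["u1", "u2", "u1"])


def online_features(user_id):
    """Online path: a constant-time lookup instead of re-scanning the logs."""
    # Users missing from the table (e.g. brand-new users) fall back to 0.
    return {"query_count": QUERY_COUNTS.get(user_id, 0)}


print(online_features("u1"))
print(online_features("new_user"))
```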
Label sources
Labels are the outputs you want your model to predict. It's important to have well-defined labels, as they make machine learning easier. There are two types of labels: direct and derived. Direct labels are the best.
direct label: user is a fan of X
derived label: user has watched X's video on YouTube, therefore user is a fan of X
A derived label does not directly measure what you want to predict.
Your model will only be as good as the relationship between your derived label and your desired prediction.
Direct labels themselves come in two types:
- direct label for events (did the user click on X?)
- direct label for attributes (will the temperature be more than X in the next week?)
This is because the output of your model can be either an event or an attribute.
DIRECT LABEL FOR EVENTS
Answer the following questions as a guide:
- how are your logs structured?
- what is considered an event in your logs?
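For the click example, you would need logs where the events are impressions. With click logs alone, you would never see an impression that did not lead to a click.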
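As a sketch of how a direct label for the event "did the user click on X" could be built, the snippet below starts from impression logs and marks each impression as clicked or not by joining against click logs. The log layout and the `impression_id` join key are assumptions; real logs may be structured differently.

```python
# Hypothetical impression log: one row per item shown to a user.
impressions = [
    {"impression_id": "i1", "user_id": "u1", "item": "X"},
    {"impression_id": "i2", "user_id": "u2", "item": "X"},
    {"impression_id": "i3", "user_id": "u1", "item": "Y"},
]

# Hypothetical click log: only impressions that were clicked appear here.
clicks = [{"impression_id": "i1"}]

clicked_ids = {c["impression_id"] for c in clicks}

# Every impression becomes one example; the label is 1 if it was clicked, else 0.
labeled = [
    {**imp, "clicked": int(imp["impression_id"] in clicked_ids)}
    for imp in impressions
]
print(labeled)
```

Starting from impressions gives both positive and negative examples, which click logs alone cannot provide.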
DIRECT LABEL FOR ATTRIBUTES
Typically, data from previous days is used to predict what happens in the coming days (take the example: will the temperature be more than X in the next week?).
Seasonality and cyclical effects should be taken into consideration.
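The temperature example can be turned into labeled examples with a sliding window: the previous days are the features, and the label says whether the temperature exceeds the threshold at any point in the following week. This is a minimal sketch with made-up readings; the window sizes and the threshold are assumptions, and it does not model seasonality (a day-of-year feature would be one way to add that).

```python
def make_examples(daily_temps, history_days=3, horizon_days=7, threshold=30.0):
    """Build (history, label) pairs from a list of daily temperatures."""
    examples = []
    last_start = len(daily_temps) - history_days - horizon_days
    for start in range(last_start + 1):
        split = start + history_days
        history = daily_temps[start:split]              # features: previous days
        future = daily_temps[split:split + horizon_days]  # the coming week
        label = int(max(future) > threshold)            # 1 if any day exceeds the threshold
        examples.append((history, label))
    return examples


# Hypothetical daily temperatures in degrees Celsius.
temps = [25, 26, 27, 31, 29, 28, 27, 26, 25, 24, 23, 22]
examples = make_examples(temps)
for history, label in examples:
    print(history, label)
```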
Note: Direct labels need logs of past behaviour. We need historical data to use supervised machine learning.
If you do not have a log of past data to use, e.g. because your product does not exist yet, you could take one or more of the following actions:
- use a heuristic for a first launch and then train a system based on logged data
- use logs from a similar problem to bootstrap your system
- use human raters to generate data by completing tasks
Note: using human raters (human-labelled data) has the following pros and cons.
Pros:
- the data forces you to have a clear problem definition
- human raters can perform a wide range of tasks
Cons:
- good data typically requires multiple iterations
- the data is expensive for certain domains
IMPROVING QUALITY OF HUMAN RATED DATA
- label the data yourself and compare your labels with your raters'. In the event of discrepancies, do not assume your ratings are the correct ones
- help your raters reduce errors by giving them instructions
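Comparing your own labels with your raters' can start with a simple agreement rate and a list of the examples to review. The sketch below uses made-up example IDs and labels; the agreement rate says nothing about who is right, so each disagreement still needs a closer look.

```python
# Hypothetical labels for the same four examples.
my_labels = {"ex1": "spam", "ex2": "ham", "ex3": "spam", "ex4": "ham"}
rater_labels = {"ex1": "spam", "ex2": "spam", "ex3": "spam", "ex4": "ham"}

# Only compare examples that both sides have labelled.
shared = sorted(my_labels.keys() & rater_labels.keys())

# Examples where the two labels differ; these are the ones to review.
disagreements = [ex for ex in shared if my_labels[ex] != rater_labels[ex]]

# Fraction of shared examples on which both labels match.
agreement_rate = 1 - len(disagreements) / len(shared)

print(agreement_rate)
print(disagreements)
```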