Data Pre-processing in ML Problem-Solving - part 1

This is the first part of a 3-part piece. These articles are notes from my self-study of data pre-processing in ML problem solving. Major resource for study: Google's AI educational resources.

Data pre-processing takes more than half of the project time. This is because data can make or mar your model.
The following are the steps to take for data pre-processing:

  1. Construct the dataset
  2. Transform the data

Construct the dataset

The following are the steps to take to construct the dataset:

  1. Collect the data
  2. Identify feature and label sources
  3. Select a sampling strategy
  4. Split the data
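As a preview of the last step, splitting can be as simple as a seeded random shuffle and cut. This is only a minimal sketch; the function name, the 80/20 default, and the seed are my own choices, not part of the source notes:

```python
import random

def split_dataset(examples, train_frac=0.8, seed=42):
    """Shuffle and split examples into train/test sets (simple random split)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = examples[:]             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(100)))
```

A purely random split is only appropriate when examples are independent; time-series or grouped data usually need a different sampling strategy.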

Collect the data

The following are the steps to take in collecting the data:

  1. Size and Quality of your dataset
  2. Joining Logs
  3. Label sources

Size and Quality of Data

Take into account the size and quality of your dataset. More often than not, the more data you have, the better your overall model performance.

Your model should train on at least an order of magnitude more examples than it has trainable parameters. Your model is only as good as your data.
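The order-of-magnitude rule of thumb above can be expressed as a trivial sanity check. The function name and the default factor of 10 are my own framing of the heuristic:

```python
def has_enough_data(num_examples, num_parameters, factor=10):
    """Rule of thumb: train on ~10x (or more) examples per trainable parameter."""
    return num_examples >= factor * num_parameters

has_enough_data(1_000_000, 50_000)   # True: 20x more examples than parameters
has_enough_data(10_000, 50_000)      # False: fewer examples than parameters
```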

However, the quality of the dataset is also paramount: a large dataset of poor quality is just as useless.

Quality, though, is a fuzzy term. A good framework for determining what passes for quality data is:
use data that lets you succeed with the business problem you care about.
Data is good if it lets you accomplish the intended task.

Having said that, concrete measures of data quality include:

  1. Reliability
  2. Feature Representation
  3. Minimizing Skew

Reliability
The following are questions to answer when determining the reliability of your data:

  1. How common are label errors?
  2. Are your features noisy?
  3. Is the data properly filtered for your problem?

The following are indicators of unreliable data:

  1. Omitted values
  2. Bad labels
  3. Duplicate examples
  4. Bad feature values
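The four indicators above can be checked mechanically before training. Below is a minimal sketch of such an audit; the function name, the dictionary-based example format, and the notion of "bad feature value" as "outside a known valid range" are all my own assumptions for illustration:

```python
def audit_examples(examples, valid_labels, feature_range):
    """Count common reliability problems in a list of {feature, label} dicts:
    omitted values, bad labels, duplicate examples, out-of-range features."""
    issues = {"omitted": 0, "bad_label": 0, "duplicate": 0, "bad_feature": 0}
    seen = set()
    lo, hi = feature_range
    for ex in examples:
        if ex.get("feature") is None:
            issues["omitted"] += 1          # omitted value
        elif not (lo <= ex["feature"] <= hi):
            issues["bad_feature"] += 1      # bad feature value
        if ex.get("label") not in valid_labels:
            issues["bad_label"] += 1        # bad label
        key = (ex.get("feature"), ex.get("label"))
        if key in seen:
            issues["duplicate"] += 1        # duplicate example
        seen.add(key)
    return issues

rows = [
    {"feature": 0.5, "label": "spam"},
    {"feature": 0.5, "label": "spam"},      # duplicate
    {"feature": None, "label": "ham"},      # omitted value
    {"feature": 9.0, "label": "ham"},       # out of range
    {"feature": 0.2, "label": "unknown"},   # bad label
]
audit_examples(rows, {"spam", "ham"}, (0.0, 1.0))
# {'omitted': 1, 'bad_label': 1, 'duplicate': 1, 'bad_feature': 1}
```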

Feature Representation
In thinking about a new ML problem, starting with one or two features is best. Here are some of the questions to answer when mapping data to useful features:

  1. How is data shown to the model?
  2. Should you normalize numeric values?
  3. How should you handle outliers?
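Two of the questions above (normalization and outliers) have common concrete answers: z-score normalization and value clipping. A minimal sketch, using function names of my own choosing:

```python
def zscore_normalize(values):
    """Scale numeric values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0   # avoid division by zero for constant features
    return [(v - mean) / std for v in values]

def clip_outliers(values, lo, hi):
    """Cap extreme values at chosen bounds instead of dropping the examples."""
    return [min(max(v, lo), hi) for v in values]

clip_outliers([1, 2, 500], 0, 100)   # [1, 2, 100]
```

Clipping keeps the information that a value was "large" while preventing a single extreme example from dominating gradient updates.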

Minimizing Skew
Avoid training/serving skew. This means ensuring that the data your model trains on is not meaningfully different from the data it sees at serving time. The more closely your training task matches your prediction task, the better your ML system will perform.
Golden Rule: do unto training what you would do unto prediction
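One common way to follow this rule is to put the feature transformation in a single object that is fitted once on training data and reused verbatim at prediction time, so both paths share one code path and one set of statistics. A sketch, with a class name and mean-centering transform of my own choosing:

```python
class FeatureTransformer:
    """Fit normalization stats on training data, then reuse the SAME stats
    at serving time -- one code path for both, to avoid training/serving skew."""

    def fit(self, values):
        self.mean = sum(values) / len(values)   # stat computed once, at training time
        return self

    def transform(self, values):
        return [v - self.mean for v in values]  # identical logic for train and serve

transformer = FeatureTransformer().fit([10.0, 20.0, 30.0])  # training time
transformer.transform([25.0])                               # serving time: [5.0]
```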

Joining Logs

Often, in assembling a training dataset, it is necessary to combine multiple sources of data. There are three types of input data:

  1. transactional logs
  2. attribute data
  3. aggregate statistics

Transactional Logs
Transactional logs record a specific event, for example the date and time a post enters a database.

Attribute data
Attribute data contains snapshots of information. It isn't specific to an event or a moment in time, but it can still be used to make predictions, especially predictions that are not tied to a specific event, e.g. the number of blemishes in a particular product.

You can create a kind of attribute data by aggregating several transactional logs. This is also known as aggregate statistics.

Aggregate statistics
Aggregate statistics are obtained by creating attribute data from multiple transactional logs, e.g. the frequency of user queries.
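The query-frequency example above can be sketched as a simple aggregation over a transactional log. The log schema (`user_id`, `query`, `timestamp`) is an assumed shape for illustration:

```python
from collections import Counter

def query_frequency(transaction_log):
    """Aggregate transactional log entries into an attribute:
    how often each user issued a query (an aggregate statistic)."""
    return Counter(entry["user_id"] for entry in transaction_log)

log = [
    {"user_id": "u1", "query": "shoes", "timestamp": "2021-01-01T10:00"},
    {"user_id": "u1", "query": "boots", "timestamp": "2021-01-01T10:05"},
    {"user_id": "u2", "query": "hats",  "timestamp": "2021-01-01T11:00"},
]
query_frequency(log)   # Counter({'u1': 2, 'u2': 1})
```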

Joining logs from different locations is often necessary in assembling your training data. Prediction (test) data, however, comes from the following sources:

  1. Online sources
  2. Offline sources

The choice between the two options can be made using this framework:

Online: latency is a concern, so your system must generate input quickly. Attribute data and aggregate statistics may therefore need to be computed or looked up beforehand rather than on the fly, because of the additional latency on-the-fly computation adds to the system.

Offline: you don't have compute restrictions (e.g. latency), so you can do complex computations similar to training data generation.

Label Sources

Labels are the outputs of your data. It's important to have well-defined labels, as they make machine learning easier. There are two types of labels: direct and derived. Direct labels are the better type.
Direct label: user is a fan of X.
Derived label: user has watched X's video on YouTube, therefore user is a fan of X.
A derived label does not directly measure what you want to predict.
Your model will only be as good as the relationship between your derived label and your desired prediction.

There are two kinds of direct labels:

  1. Direct labels for events (did the user click on X?)
  2. Direct labels for attributes (will the temperature be more than X in the next week?)

The output of your model can be either an event or an attribute.

Answer the following questions as a guide:

  1. How are your logs structured?
  2. What is considered an event in your logs?

For the click-prediction example above, you would need logs where the events are impressions.
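Producing a direct label for the click event then amounts to joining the impression log with the click log. A minimal sketch, assuming each log entry is a dict and that impressions and clicks share an `impression_id` key (my assumed schema, not from the source):

```python
def label_impressions(impressions, clicks):
    """Direct label for events: did the user click on this impression?
    Joins an impression log with a click log on impression_id."""
    clicked_ids = {c["impression_id"] for c in clicks}
    return [
        {**imp, "label": imp["impression_id"] in clicked_ids}
        for imp in impressions
    ]

impressions = [{"impression_id": 1, "user_id": "u1"},
               {"impression_id": 2, "user_id": "u1"}]
clicks = [{"impression_id": 2}]
labeled = label_impressions(impressions, clicks)
# impression 1 -> label False (no click), impression 2 -> label True
```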

Typically, previous days of data are used to make predictions for the coming days (take the example: will the temperature be more than X in the next week?).
Seasonality and cyclical effects should be taken into consideration.

Note: direct labels need logs of past behaviour. We need historical data to use supervised machine learning.

If you do not have a log of past data to use, e.g. because your product does not exist yet, you could take one or more of the following actions:

  1. Use a heuristic for a first launch, then train a system based on logged data
  2. Use logs from a similar problem to bootstrap your system
  3. Use human raters to generate data by completing tasks

Note: using human raters (human-labelled data) has the following pros and cons.

Pros:

  1. The data forces you to have a clear problem definition
  2. Human raters can perform a wide range of tasks

Cons:

  1. Good data typically requires multiple iterations
  2. The data is expensive for certain domains

Tips:

  1. Label some data yourself and compare with your raters'. Do not assume your ratings are the correct ones in the event of discrepancies
  2. Help your raters reduce errors by giving them clear instructions

Andrej Karpathy's take on this
