I recently completed my first imitation learning model: a behavioral cloning model built for the Atari Breakout game. It took a lot of research and reading.

First of all here is the github link

Imitation learning is a technique in ML in which an agent learns from the recorded behavior of an expert or human and tries to replicate the behavior demonstrated by the expert. Behavioral cloning is a type of imitation learning that uses supervised learning to achieve the main purpose of imitation learning.

There were 3 key parts to the implementation of this problem:

- the expert data
- the environment
- the model

This is looking a bit like reinforcement learning, ...maybe but also, no, not really.

The data for this project was obtained according to the process described in this article

The data consists of images of the game being played by the expert as well as the actions taken by the expert for each image. We do not need the reward in our case because we are performing behavioral cloning, a supervised learning approach to imitation learning.

The images are then pre-processed further: they are converted to grayscale and reshaped. The following code achieves this for us:

```
import os
import pickle

import cv2
import numpy as np

# OBS_IMG_DIR must point to the directory holding the recorded frames (0.png, 1.png, ...)

def process_obs():
    '''
    loads the recorded frames in numeric order and converts them to grey scale
    '''
    gray_obs = []
    # sort by frame number so that 10.png comes after 9.png, not after 1.png
    frame_ids = sorted(int(name.split('.')[0]) for name in os.listdir(OBS_IMG_DIR))
    for frame_id in frame_ids:
        path = os.path.join(OBS_IMG_DIR, str(frame_id) + '.png')
        # convert to grayscale
        img_array_gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        gray_obs.append(img_array_gray)
    return gray_obs

#reshapes the images
X_gray = process_obs()
X_gray = np.array(X_gray).reshape(-1, 84, 84, 1)
with open("X_gray_full.pickle", "wb") as pickle_out:
    pickle.dump(X_gray, pickle_out)
```

The environment here refers to the OpenAI Gym environment, specifically its Atari Breakout game. The expert data was obtained by playing the game and recording each action together with the corresponding image of the game screen.

Once the model has been trained, the agent acts out what it has learnt in this same environment.

You can either run the gym environment locally or use it on Colab. Installing gym locally is fairly straightforward on most operating systems.

Google colab provides an inbuilt OpenAI gym environment that can be accessed the following way:

- The installation and import

```
!pip install gym
!apt-get install python-opengl -y
!apt install xvfb -y
!pip install gym[atari]
!pip install pyvirtualdisplay
!pip install piglet
#sets up the virtual display for the game
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()
#we need all these other modules because we're running it on colab
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) # error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
```

- The recording of the display screen

```
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)""
"""
def show_video():
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                 </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")


def wrap_env(env):
    env = Monitor(env, './video', force=True)
    return env
```

The above two sections are provided to us by default by colab. A simple search for how to use gym on colab will bring up a colab file containing the above two sections (with the code provided).

When working with gym locally, this is the only section required; on Colab it is the third section, after the two above. This is where the environment is actually created and used:

```
"""
where the model is used
"""
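import gym
import matplotlib.pyplot as plt

env = gym.make("Breakout-v0")
env = wrap_env(env)
# reset returns the first game screen, which the model needs for its first prediction
observation = env.reset()
while True:
    env.render()
    # your agent goes here
    # predict_classes returns one entry per input image, so take the first
    prediction = int(model.predict_classes(prepare(observation))[0])
    observation, reward, done, info = env.step(prediction)
    if done:
        break
env.close()
show_video()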
```

For the Breakout game, there are four possible actions: start (fire the ball), left, right and do nothing. These actions are encoded with the numbers 0-3. Our model predicts the action for a given game screen and the agent executes the predicted action.

```
# helper function for greyscale conversion (expects an RGB image, as returned by gym)
def grayConversion(image):
    grayValue = 0.07 * image[:,:,2] + 0.72 * image[:,:,1] + 0.21 * image[:,:,0]
    gray_img = grayValue.astype(np.float64)
    return gray_img

# prepares the current game screen to match the images the model was trained on
def prepare(obs):
    IMG_SIZE = 84
    # img_array = cv2.imread(filepath, cv2.IMREAD_GRAYSCALE)
    obs = obs/255
    gray_obs = grayConversion(obs)
    print("shape of gray_obs{}".format(gray_obs.shape))
    new_obs = cv2.resize(gray_obs, (IMG_SIZE, IMG_SIZE))
    print("shape of new_obs{}".format(new_obs.shape))
    return new_obs.reshape(-1, IMG_SIZE, IMG_SIZE, 1)
```

I based the model on the architecture described in DeepMind's Nature paper (3 convolutional layers, 1 fully connected layer and 1 output layer), although my version adds max pooling, dropout and a second fully connected layer.

The following code shows exactly what the architecture looks like:

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Activation, Conv2D, Dense, Dropout,
                                     Flatten, MaxPooling2D)

#----------------------------------------#
# create model#
#----------------------------------------#
# X holds the grayscale frames with shape (n, 84, 84, 1), y the expert actions (0-3)
model = Sequential()
# conv 1
model.add(Conv2D(64, (3, 3), input_shape=X[0].shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# conv 2
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# conv 3
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# fc 1
model.add(Flatten())
model.add(Dense(64))
model.add(Dropout(0.25))
model.add(Activation('relu'))
# fc 2
model.add(Dense(64))
model.add(Activation('relu'))
# output layer: softmax probabilities over the 4 actions
model.add(Dense(4))
model.add(Activation('softmax'))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
history = model.fit(x=X, y=y, batch_size=128, epochs=25,
                    validation_split=0.2)
```

- The entire dataset of over 500,000 examples could not be used because of Colab's limits (i.e. RAM), so I restricted the examples used to just 100,000.
- The model achieved an accuracy of about 65-70% after about 10 epochs.
- It's unclear whether the model architecture is too dense for the data, because not all of the data was used.
- The model has not really learned: it tends to overfit beyond 5-8 epochs.

I obviously needed to run a clustering algorithm on the data to build some intuition for why it's deemed to contain insights on edge server placement, service migration etc. And being a noob, I wanted to start from absolute scratch, so I decided on a partitioning clustering algorithm, ergo --> k-means.

I cleaned the data down to only 4 features, of which I ended up using only 2 for my first clustering algorithm.

Plotting the data and removing outliers gave me the graph below:

Then I determined the optimum value of k using scikit-learn's built-in KMeans class.

```
from sklearn.cluster import KMeans

k_rng = range(1,10)
sse = []
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['long', 'lat']])
    # inertia_ is the within-cluster sum of squared distances
    sse.append(km.inertia_)
```

Then I plotted the sum of squared errors (SSE), also called the within-cluster sum of squares (WSS), against the range of k values to obtain the following graph. This way of determining k is called **the elbow method**.

I decided to use a k value of 4. Again, using the scikit-learn library:

```
km = KMeans(n_clusters=4)
y_predicted = km.fit_predict(df[['long','lat']])
df['cluster'] = y_predicted
```

So I created a new feature, "cluster", which maps each data point to its cluster.

Then I identified my clusters and cluster centers:

Then plotted:

**the code** is available here

👩🏾🚀*decided to start my DL journey in parallel with my ML journey, so I've been a bit distracted. implementing the k-means algo from scratch in a bit (where the math behind all of the above will be explained in-depth). excited to do kNN, decision trees and random forests next*

sources:

codebasics youtube channel

simplilearn youtube channel

Classical ML algorithms include those used for supervised, unsupervised and reinforcement learning. Supervised learning algorithms include those for regression and classification.
The linear regression algorithm deals with numbers. Its aim is to find the relationship between an independent variable (or multiple independent variables) and a dependent variable.

[Figure: the equation of a line]

[Figure: the four cases of finding the best fit for the linear model]

The equation of a line, *y = mx + c*, is shown above. The purpose of the linear regression model is to find the line that best represents the data. We do this by repeatedly modifying a random line through *rotation and translation*.

rotation --> modifying *m*

translation --> modifying *c*

The rules for modifying these two variables depend on the location of the data-point in the x-y plane, as shown in the diagram above:

The algorithm goes thus:

- start with a random line
- pick the number of epochs (iterations)
- pick the learning rate (a really small number)
- LOOP (repeat for the number of epochs):
  - pick a random point (data-point)
  - if the point is above the line and to the right of the y-axis, rotate counterclockwise and translate up, i.e. *↑m↑c*
  - if the point is above the line and to the left of the y-axis --> *↓m↑c*
  - if the point is below the line and to the right of the y-axis --> *↓m↓c*
  - if the point is below the line and to the left of the y-axis --> *↑m↓c*
- enjoy your fitted line❣

```
import random

def firstlinregmodel(x, y, epoch, lr, x_test):
    """
    lr = learning rate
    epoch = number of iterations
    x : array-like, shape = [n_samples, n_features]
    Training samples
    y : array-like, shape = [n_samples, n_target_values]
    Target values
    both x and y are series. x is assumed to be just one series
    ****each datapoint has a vertical and horizontal distance
    """
    # y = mx + c
    # pick a starting line
    # which is to say pick starting numbers for m and c
    m = 0
    c = 0
    # pair the values up into data points (a list keeps repeated x values)
    data = list(zip(x, y))
    for _ in range(epoch):
        # to pick a random point
        x_random, y_random = random.choice(data)
        # the point is above the line when its y is larger than the predicted y
        above = y_random > (m*x_random)+c
        below = y_random < (m*x_random)+c
        # if the point is above the line and to the right of the y axis
        if above and x_random > 0:
            m += lr
            c += lr
        # if the point is above the line and to the left of the y axis
        elif above and x_random < 0:
            m -= lr
            c += lr
        # if the point is below the line and to the right of the y axis
        elif below and x_random > 0:
            m -= lr
            c -= lr
        # if the point is below the line and to the left of the y axis
        elif below and x_random < 0:
            m += lr
            c -= lr
    # assumed return value (the listing above stops before any return):
    # the predictions of the fitted line for x_test
    return m * x_test + c
```

The code for this algorithm is available here

To confirm that this algorithm works, we test it on a student performance dataset:

```
data = pd.read_csv('stuperf.csv')
train_data = data[0:700]
test_data = data[700:]
y = train_data['math score']
# y = y.values.reshape((700, 1))
x = train_data['reading score']
# x = x.values.reshape((700, 1))
x_test = test_data['reading score']
print(firstlinregmodel(x, y, 1000, 0.001, x_test))
```

and we obtain this graph:

The gradient descent algorithm is applicable across different parts of ML, including neural networks. It takes the difference between the predicted values and the actual values (i.e. the residuals/cost) and tries to minimize it as much as possible.

In linear regression, the gradient descent curve is a convex (bowl-shaped) curve representing *the cost vs the slope/intercept* of the model.

In linear regression, the loss function can be the *mean squared error*.

**Least squares**

The fastest way to determine the minimum for a linear regression model, *i.e. where the derivative of the error wrt the slope is exactly 0*.

Gradient descent does the same thing but in a series of iterative steps, whose size is scaled by the learning rate. The steps are big when far from the minimum and small when close to it.

In linear regression, gradient descent is performed wrt both the intercept and the slope, i.e. *we look for the minimum wrt both m and c*.

So we find the derivative of the loss function (in this case the mean squared error) wrt both m and c.

Then we update the values of m and c by essentially taking the steps toward the minima.
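For reference, here is a sketch of those derivatives and update steps written out. The notation is chosen to match the code in this post: *n* is the number of points, ŷ is the prediction *mx + c*, and *L* is the learning rate.

```latex
\begin{aligned}
E &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \hat{y}_i = m x_i + c \\
\frac{\partial E}{\partial m} &= -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - \hat{y}_i\right) \\
\frac{\partial E}{\partial c} &= -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right) \\
m &\leftarrow m - L\,\frac{\partial E}{\partial m}, \qquad c \leftarrow c - L\,\frac{\partial E}{\partial c}
\end{aligned}
```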

So here's the linear regression algorithm using gradient descent:

- pick initial values for m and c
- pick the number of epochs and the learning rate
- LOOP (repeat for the number of epochs):
  - determine the predicted values with the current values of m and c
  - calculate the partial derivatives wrt m and c using the equations above
  - update the values of m and c using the equation above
- enjoy your fitted line

The code:

```
import pandas as pd
import matplotlib.pyplot as plt

def mygradientdescent(X, Y, epoch, L):
    m = 0
    c = 0
    n = float(len(X)) # Number of elements in X
    # Performing Gradient Descent
    for _ in range(epoch):
        Y_pred = m*X + c # The current predicted value of Y
        D_m = (-2/n) * sum(X * (Y - Y_pred)) # Derivative wrt m
        D_c = (-2/n) * sum(Y - Y_pred) # Derivative wrt c
        m = m - L * D_m # Update m
        c = c - L * D_c # Update c
    #print(m, c)
    Y_pred = m*X + c
    plt.scatter(X, Y, s=10)
    plt.plot(X, Y_pred, color='r') # predicted (plot against the X passed in, not a global)
    plt.xlabel('reading score') # x
    plt.ylabel('math score') # y
    plt.show()
    return m, c

data = pd.read_csv('stuperf.csv')
train_data = data[0:700]
test_data = data[700:]
y = train_data['math score']
# y = y.values.reshape((700, 1))
x = train_data['reading score']
# x = x.values.reshape((700, 1))
x_test = test_data['reading score']
mygradientdescent(x, y, 1000, 0.00001)
```

we obtain this graph:

The code for this algorithm is also available here

Gradient descent is useful when it is not possible to solve directly for where the derivative of the curve is 0, as we can in least squares. So I imagine it matters most in cases where there are many minima and it's difficult to determine the best one -- neural networks.

NOTES for me 🧾:

- gradient descent moves in the direction of the negative derivative (the slope of the tangent) at a point on the curve.
- step size = slope at the current point * learning rate
- new intercept = old intercept - step size
- gradient descent depends on the loss function
- the gradient descent curve is a graph of *the sum of squared loss/residuals vs the slope/intercept (or the determining factor(s)/variables for the fit of your model)*

For our linear regression model above, the gradient descent curves wrt both slope and intercept are:

References:

In the previous post, I 'discovered' that I was dealing with time-series data, which refined my mindset about the nature of the data etc. I also created an additional feature called *duration*, which was obtained from the existing *start and end time series*.

I mentioned that there were some missing values within the data that I wasn't sure whether to replace by *data imputation*, or get rid of entirely.

The missing values were only in one of the columns, the *location* column as seen vaguely in the image above. However to really be sure, I checked the missing values by:

```
#checking the missing values
data.isna().sum()
```

This produced the following output:

The location column has about 47,000 missing values, which is approximately 7.7% of the total dataset.
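As a rough sketch of how such a percentage can be checked, here is the calculation in plain Python on a made-up column (the values and the `missing_fraction` helper are hypothetical). With a pandas DataFrame, the same number comes from `data['location'].isna().mean()`.

```python
def missing_fraction(values):
    """Return the share of entries that are missing (None)."""
    missing = sum(1 for v in values if v is None)
    return missing / len(values)


# toy stand-in for the location column: 2 of the 8 entries are missing
location = ["a", None, "b", "a", None, "c", "b", "a"]

fraction = missing_fraction(location)
print(f"{fraction:.1%} of the examples are missing a location")  # 25.0% ...
```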

A rule of thumb for deciding whether to delete examples with missing data or to use data imputation is to first check whether the missing data makes up 5% or less of the total data available in the dataset.

In my case, this is not so. Therefore, I opted to delete the examples with missing values.

- I converted the *location* series to strings even though it has numerical values. This was informed by knowledge of the domain and the nature of the data (longitudes and latitudes).

```
data['location']=data['location'].apply(lambda v: str(v) if str(v) != 'nan' else None).tolist()
```

The *NaNs* were replaced by *None* values as expected.

- I converted the dataset into tuples and looped through it while checking for empty strings or NaNs.

```
blank = []
nas = []
for index, month, date, start_time, end_time, location, user_id, duration in data.itertuples():
    if type(location) == str:
        if location.isspace():
            blank.append(index)
    else:
        # anything that is not a string is a missing value (None)
        nas.append(index)
```

- I found 0 blank strings, and the missing values (now stored in the list called *nas*) were 47592 in total. Then I dropped them from the dataset.

```
data.drop(nas, inplace=True)
```

- checking for missing values again produced:

- The total examples in the dataset reduced to about 564000

Overall, I learnt a lot of new things while cleaning up the missing values in this dataset. I learnt about GANs (generative adversarial networks) and how I might apply them in cases where I don't have access to a lot of the data that I require. I learnt about data imputation and the use of measures of central tendency (mean, median and mode) when deciding how to impute numeric and categorical data.

I discovered Jordan Harrod's YouTube channel and this very helpful piece by Okoh Anita. I also discovered DJ sankar and this article about categorical data transformation and encoding which, in fact, is more of what I will be doing next.

I will be performing more feature engineering and transformation and encoding of the features to then be able to see the data distribution of each feature and finally, (hopefully) identify the label that I am interested in.

Can't wait!🤸🏾♀️

I decided to do some personal ML projects separate from the tutorials and the AI resources I am using for self-study. I learnt it's better to do one complex in-depth project than to do several small projects that take a few days on Kaggle. So I thought about what I care about the most right now (apart from ML problem-solving of course): decentralized processing (edge computing). So I sought edge computing data. I couldn't find any, but found this: some China telecom data that apparently could provide insights to edge computing researchers wrt edge server placement, service migration, service recommendation etc.

Anyways, what got me was the 'telecom'. From the fields in the data, at this moment, I'm not exactly sure what I want to predict.

I was able to get access to the short version of the data (2 weeks' worth) of about 600,000 examples, even though the whole dataset has about 7 million examples and covers 6 months of data collection.

*this is the first part of the pre-processing stage of this data*

I am in the *identify feature and label sources* stage of this project.

I thought about the possibility of using the duration a person spends in a location to predict things like: how often people will spend time in that location, how long people will spend time in that location etc.

At this juncture it's imperative to show some images:

I decided to add an extra column, duration: the difference between the start and end time.

This was when I realised I was dealing with time series data :D

:D I realise how silly that sounds given that the fields *start time* and *end time* are so glaringly obvious.

In trying to create the extra column, *duration*, I struggled a little bit with pandas datetime and timestamp manipulation methods until I found this.

A couple of Stack Overflow searches later, plus the aforementioned help, I have my additional field and a full awareness of the kind of data I am handling :D So, just for the heck of it, I'm going to explain how I did it.

- I converted each of the time series to *str*

```
data['start time'] = [str(x) for x in data['start time']]
data['end time'] = [str(x) for x in data['end time']]
```

- Then I converted them back to Timestamp

```
data['start time'] = pd.to_datetime(data['start time'], infer_datetime_format=True)
data['end time'] = pd.to_datetime(data['end time'], infer_datetime_format=True)
```

- Then I evaluated the difference to create my extra column

```
data['duration(min)']=(data['end time']-data['start time']).astype('timedelta64[m]')
```

[Figure: the data with my duration column in mins]

I am already seeing some NaN values in my Jupyter notebook. Some examples in the data are missing some fields (*the location field*).

I will decide whether to replace those fields with 0 or to remove those examples entirely.
Then I will perform more cleaning, transformation, sampling and splitting (in the right order) :D

*alright!, just couldn't resist the urge to document the moment I knew I was dealing with my first time series data :D*

The data in this example is a small dataset, the Automobile data set, with 205 examples. The exercise attempts to predict the price of a car from its features.

Since the goal of this example is to examine modeling and data transformations, and because of the small size of the data, splitting into training, evaluation and test data was skipped.

The following transformations were done to the data:

- made sure all numeric data are actually numeric through coercion using `pd.to_numeric`

- filled missing values with 0

```
car_data.fillna(0, inplace=True)
```

- without normalisation, modified the model to achieve the lowest eval loss.

Here, poor hyperparameter choices (mainly the choice of optimizer) caused NaN losses during training.

Fixed this by using the **Adagrad optimiser**. Because of the small size of the data, pretty much any other solution didn't work.

*recall: the Adagrad and Adam optimisers are built-in tf optimisers just like the Gradient descent optimiser but, unlike the latter, they create separate effective learning rates per feature*

- visualized the model's predictions using scatter plots. The highlight of this step for me was the predict_input_fn

```
predict_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=x_df,
    batch_size=batch_size,
    shuffle=False)
#similar to the training and evaluation input functions
predictions = [
    x['predictions'][0]
    for x in est.predict(predict_input_fn)
]
```

- attempted to add normalisations to the numeric features. *z-score* and *scale to a range* normalisations did not work as NaN losses were still present

**visualising each feature in a histogram showed that most were approximately normally distributed; a few had crazy outliers, though not crazy enough for clipping I guess, as the instructor also used z-score first**

My 'scale to a range' normalisation (binding `name=feature_name` as a default argument so that each lambda uses its own column rather than the last one in the loop):

`model_feature_columns = [ tf.feature_column.numeric_column(feature_name, normalizer_fn=lambda val, name=feature_name: (val - x_df[name].min())/(x_df[name].max()-x_df[name].min()) ) for feature_name in numeric_feature_names ]`

- attempted to make a better model using only the categorical features. Using the Gradient descent optimiser again produced NaN losses, which were again corrected with either the Adam or Adagrad optimisers.

- the same behaviour was seen when both the categorical data and the numeric data were used together.

**overall**:

Getting more familiar with the input function syntax, the tf.estimator API and the feature column APIs was the highlight of this example.

*The model used a dense neural network algorithm for predictions*

The first two parts of this series have focused on data preparation. Now we focus on **feature engineering**, a term for the process of determining which features might be useful in training a model and then creating those features by transforming raw data found in log files and other sources.

**The following are the reasons for data transformation**:

**Mandatory transformations** for data compatibility, such as:

- converting non-numeric features into numeric, e.g. a string to numeric values
- resizing inputs to a fixed size, e.g. some models (feed-forward neural networks, linear models etc.) have a fixed number of input nodes, so the data must always have the same size

**Optional quality transformations** that can help the model perform better, e.g.:

- normalizing numeric features
- allowing linear models to introduce non-linearities into the feature space
- lower-casing text features

*Strictly speaking, quality transformations are not necessary--your model could still run without them. But using these techniques may enable the model to give better results*

**Transforming prior to training**: this code lives separately from your machine learning model

**pros**

- computation is performed only once
- computation can look at the entire dataset to determine the transformation

**cons**

- transformations need to be reproduced at prediction time, and are therefore prone to skew
- any change to the transformations means rerunning data generation, leading to slower iterations

*Skew is more dangerous for cases involving online serving. In offline serving, you might be able to reuse the code that generates your training data. In online serving, the code that creates your dataset and the code used to handle live traffic are almost necessarily different, which makes it easy to introduce skew*

**Transforming within the model**: the model takes in untransformed data as input and transforms it within the model.

**pros**

- easy iterations: if you change the transformations, you can still use the same data files
- you're guaranteed the same transformations at training and prediction time

**cons**

- expensive transforms can increase model latency
- transformations are per batch

**considerations for transformations per batch**

Suppose you want to normalize a feature by its average value--that is, you want to change the feature values to have mean 0 and standard deviation 1.

- while transforming inside a model, the normalization will have access to only one batch of data, not the full dataset
- you can either normalize by the average value within a batch. (not good if the batches are highly variant)
- you could pre-compute the average and fix it as a constant within the model.(better option)
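The difference between the two options can be sketched in plain Python. The numbers and the `z_score` helper below are made up for illustration, and the two "batches" are deliberately very different from each other:

```python
import statistics


def z_score(values, mean, std):
    """Normalise values with the given mean and standard deviation."""
    return [(v - mean) / std for v in values]


# toy feature, split into two batches with very different values
full_data = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0, 102.0, 103.0]
batches = [full_data[:4], full_data[4:]]

# option 1: statistics computed inside each batch
per_batch = [
    z_score(b, statistics.fmean(b), statistics.pstdev(b)) for b in batches
]

# option 2: statistics pre-computed once on the full dataset, fixed as constants
FULL_MEAN = statistics.fmean(full_data)
FULL_STD = statistics.pstdev(full_data)
fixed = [z_score(b, FULL_MEAN, FULL_STD) for b in batches]

# per-batch normalisation makes the two batches indistinguishable
print(per_batch[0] == per_batch[1])  # True
# with fixed constants the large values stay larger than the small ones
print(max(fixed[0]) < min(fixed[1]))  # True
```

With per-batch statistics the model can no longer tell the small values from the large ones, which is why fixing the constants is the safer option when batches vary a lot.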

Before transforming the data and during collecting and constructing the data, it's important to explore and clean up the data by:

- examining several rows of data
- checking basic statistics
- fixing missing numerical entries

Visualizing the data is important because your data can look one way in the basic statistics and another when graphed.

Before you get too far into analysis, look at your data graphically, either via scatter plots or histograms.

View graphs not only at the beginning of the pipeline but also throughout the transformations.

Visualizations will help you check your assumptions and see the effects of any major changes.

There are two types of transformations done to numeric data:
**normalising** and **bucketing**

normalising: transforming numeric data to the same scale as other numeric data

bucketing: transforming numeric data (usually continuous data) to categorical data

Normalisation is necessary:

- if you have very different values within the same feature, such that without normalising, your training could blow up with NaNs when a gradient update is too large
- or if you have two different features with widely different ranges, which will cause *gradient descent* to *bounce* and slow down convergence. Optimisers like Adagrad and Adam can help in this case by creating a separate effective learning rate per feature, but in the case of a wide range of values within a feature, you need to normalise

The goal of normalisation is to transform features to be on the same scale in order to stabilise training and ease convergence.

There are four common normalisation techniques:

- scaling to a range
- clipping
- log scaling
- z-score

**scaling to a range**:

x' = (x - xmin)/(xmax -xmin)

scaling to a range is a good choice when:

- you know the approximate min and max values of the data with few or no outliers
- your data is approximately uniformly distributed across that range

A good example for scaling is age. a bad example is income.
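A minimal sketch of the formula above in plain Python (the ages are made-up values and `scale_to_range` is just an illustrative helper name):

```python
def scale_to_range(values):
    """Min-max scaling: x' = (x - xmin) / (xmax - xmin)."""
    x_min, x_max = min(values), max(values)
    # note: this would divide by zero if all the values were identical
    return [(x - x_min) / (x_max - x_min) for x in values]


ages = [18, 25, 40, 62, 90]
scaled = scale_to_range(ages)
print(scaled[0], scaled[-1])  # 0.0 1.0
```

The smallest value always maps to 0 and the largest to 1, which is why a single extreme outlier (as with income) squashes everything else into a narrow band.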

**Feature Clipping**:

if your data contains extreme outliers, feature clipping is appropriate. Feature clipping caps all feature values above (or below) a certain value to a fixed value. For example, you might decide to cap all temperatures above 40 to be exactly 40.

feature clipping can be done before or after normalisations.

you could clip by z-score, i.e +-Nσ e.g. ( limit to +-3σ).
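As a small sketch, clipping can be written as a one-line helper in plain Python. The `clip` function and the temperatures are illustrative, reusing the cap of 40 from the example above:

```python
def clip(values, lower=float("-inf"), upper=float("inf")):
    """Cap every value to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


temperatures = [12, 25, 38, 41, 55]
print(clip(temperatures, upper=40))  # [12, 25, 38, 40, 40]
```

Clipping by z-score works the same way: compute the z-scores first, then call the helper with `lower=-3` and `upper=3`.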

**log scaling**:

x' = log(x)

log scaling is appropriate if a handful of your values have many points while most other values have few points i.e. a *power law distribution*.

log scaling can improve the performance of linear models on this kind of data
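A quick sketch of what the log does to a long-tailed feature, using made-up counts. Note that `math.log` needs strictly positive inputs; when zeros are possible, `math.log1p` (which computes log(1 + x)) is a common alternative.

```python
import math


def log_scale(values):
    """Apply x' = log(x) to every value (values must be > 0)."""
    return [math.log(x) for x in values]


view_counts = [1, 10, 100, 1000, 100000]
scaled = log_scale(view_counts)
print([round(s, 2) for s in scaled])  # [0.0, 2.3, 4.61, 6.91, 11.51]
```

A range that spanned five orders of magnitude now fits between 0 and about 11.5.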

**z-score**:

x' = (x-μ)/σ

z-score represents the number of standard deviations away from the mean.

you would use z-score to ensure your features have mean=0 and standard deviation=1(similar to transformation by batch)

z-score is useful when there are a few outliers but not so much that you need clipping

*z-score squeezes raw values that have a very large range to a very small range(see image)*

how to decide whether to use z-score:

- if you're not sure whether the outliers are truly extreme, start with z-score unless you have feature values that you don't want the model to learn.
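Here is a small sketch of the z-score formula in plain Python. The values are made up so that the mean is 5 and the (population) standard deviation is 2, which keeps the arithmetic easy to follow:

```python
import statistics


def z_scores(values):
    """Apply x' = (x - mean) / std using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_scores(values))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Each result reads directly as "this many standard deviations away from the mean".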

finally, **bucketing**:

this is the idea that sometimes the relationship between a feature and a label might not be linear although they are related, e.g. the relationship between latitude and housing values. Therefore the feature values are broken down into buckets.

In the case of the latitude v housing values example, we break latitudes down into buckets to learn something different about housing values for each bucket.

This means that we will be transforming numeric features into categorical features. This is called **bucketing**.

In the example of latitude v housing values (see image), the buckets are evenly spaced.

In the above picture, all the buckets span the same width even though they do not all capture the same number of points. This results in waste.

True equality of the buckets comes from ensuring they all capture the same number of points. This is the idea behind **quantile bucketing**.

In all, there are two types of buckets:

- buckets with equally spaced boundaries and
- quantile buckets in which equality comes from considering the number of points in each bucket
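The contrast between the two bucket types can be sketched in plain Python on a made-up, long-tailed feature. The helper names are hypothetical, and the quantile boundaries use a simple index-based rule rather than any particular library's definition:

```python
import bisect


def equal_width_boundaries(values, n_buckets):
    """Boundaries that split the value range into equally wide buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    return [lo + width * i for i in range(1, n_buckets)]


def quantile_boundaries(values, n_buckets):
    """Boundaries chosen so each bucket holds roughly the same number of points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_buckets] for i in range(1, n_buckets)]


def bucket_counts(values, boundaries):
    """Count how many values fall into each bucket."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect.bisect_right(boundaries, v)] += 1
    return counts


values = [1, 2, 3, 4, 5, 6, 7, 8, 50, 100, 200, 400]
print(bucket_counts(values, equal_width_boundaries(values, 4)))  # [10, 1, 0, 1]
print(bucket_counts(values, quantile_boundaries(values, 4)))     # [3, 3, 3, 3]
```

With equally spaced boundaries almost everything lands in the first bucket and one bucket is empty, while the quantile boundaries spread the points evenly.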

*Oftentimes, you should represent features that contain integer values as categorical data instead of as numerical data*

If the number of categories of a data field is small, such as the day of the week or a limited palette of colors, you can make a unique feature for each category

A model can then learn a separate weight for each category and the features can be indexed(mapped to numeric values) thereby creating a vocabulary

**one-hot encoding** is commonly employed to encode categorical data to transform them into numeric data.

Most implementations in ML will use a sparse representation to keep from storing too many zeros in memory.

**OOV (out of vocabulary)**: a catch-all category used to represent the outliers in categorical data, i.e. values that are not in the vocabulary.
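To make the vocabulary, one-hot and OOV ideas concrete, here is a small plain-Python sketch. The helper names are made up, and reserving the last slot for OOV is just one possible convention:

```python
def build_vocab(categories):
    """Map each distinct category to an index (the vocabulary)."""
    return {cat: idx for idx, cat in enumerate(sorted(set(categories)))}


def one_hot(value, vocab):
    """One-hot encode a value; the extra last slot is the OOV bucket."""
    vector = [0] * (len(vocab) + 1)
    vector[vocab.get(value, len(vocab))] = 1
    return vector


days = ["mon", "tue", "wed", "mon"]
vocab = build_vocab(days)
print(vocab)                  # {'mon': 0, 'tue': 1, 'wed': 2}
print(one_hot("tue", vocab))  # [0, 1, 0, 0]
print(one_hot("sun", vocab))  # [0, 0, 0, 1]  (not in the vocabulary -> OOV)
```

A sparse representation would store only the index of the 1 (here `vocab.get(value, len(vocab))`) instead of the whole vector.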

**Hashing**: can also be used instead of creating a vocabulary. It involves hashing every string into your available index space. It often causes collisions, so you are relying on your model to learn a shared representation of the categories in the same index that works well for the problem. Not having to create a vocab is advantageous, especially if the feature distribution changes heavily over time.
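A rough sketch of the hashing idea in plain Python. The `hash_bucket` helper and the colour names are made up; `hashlib` is used instead of the built-in `hash()` because string hashes from `hash()` change between Python runs:

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map a string to an index in [0, num_buckets) without any vocabulary."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


colours = ["red", "green", "blue", "cyan", "pink",
           "teal", "gold", "grey", "plum", "lime"]
buckets = {colour: hash_bucket(colour, 4) for colour in colours}

# 10 categories squeezed into 4 indices: some of them must share an index
print(len(set(buckets.values())) < len(colours))  # True
# an unseen category still gets a valid index, no vocabulary update needed
print(0 <= hash_bucket("magenta", 4) < 4)  # True
```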

**hybrid of hashing and vocab**:

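We can combine hashing with a vocab: the most important categories keep their own index in the vocab, and everything else is hashed into a small number of OOV buckets.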
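A minimal sketch of the hybrid approach, assuming a tiny hand-written vocabulary and five hash buckets for everything outside it. `stable_hash` and `category_index` are hypothetical helper names. Python's built-in `hash()` is salted per process for strings, so the sketch uses `hashlib` to get the same index on every run.

```python
import hashlib

def stable_hash(text, n_buckets):
    """Hash a string into [0, n_buckets) in a way that is stable across runs."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def category_index(category, vocab, n_hash_buckets):
    """Known categories use their vocab index; unknown ones go to a hashed OOV bucket."""
    if category in vocab:
        return vocab[category]
    # OOV buckets are placed after the vocabulary indices
    return len(vocab) + stable_hash(category, n_hash_buckets)

vocab = {"red": 0, "green": 1, "blue": 2}
N_HASH_BUCKETS = 5  # assumption: chosen arbitrarily for the example

for colour in ["green", "teal", "magenta"]:
    print(colour, category_index(colour, vocab, N_HASH_BUCKETS))
```

Two unseen categories can still collide in the same OOV bucket; that is the trade-off hashing makes in exchange for not maintaining a full vocabulary.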

*All the above transformations can be stored on a disk*

Embeddings are categorical features represented as continuous-valued features. Deep models usually convert the indices to an embedding.

*Embeddings cannot be stored on disk as a standalone transformation because they are trained, and are therefore part of the model. They are trained with the other model weights and are functionally equivalent to a layer of weights.*

**pretrained embeddings are still typically modifiable during training, ergo they are still technically part of the model**

Still on data collection; we've addressed the steps to take, how to ensure quality, how to label data, the different sources of both training and prediction data. Now we address the second part of **constructing the data set:**

This involves selecting a subset of available data for training in cases where there is too much available data.

The decision on how to select that subset ultimately depends on the problem: how do we want to predict? what features do we want?

**NOTE:** If your data contains **Personally Identifiable Information (PII)**, you may need to filter it out of your data, e.g. by removing infrequent features. This filtering will skew your distribution because you will lose information at the tail (the part of the distribution with very low values, far from the mean). The dataset will then be biased towards the head queries, so beware of this during your analysis.

At serving time, you may decide to use the tail you removed.

*imbalanced*: a classification dataset with skewed class proportions.

*majority classes*: classes that make up a huge proportion of the dataset.

*minority classes*: classes that make up a smaller proportion of the data set.

**what counts as imbalanced data?**

**why do we look out for imbalanced data?**

Because you may need to apply **downsampling and upweighting**.

**Downsampling and Upweighting**

*Downsampling*: training on a disproportionately low subset of the majority class examples, i.e. extracting a random sample from the dominant class.

*Upweighting*: adding an *example weight* to the downsampled class equal to the factor by which you downsampled. An example weight means counting an individual example as more important during training. An example weight of 10 means the model treats the example as 10 times as important (when computing loss) as it would an example of weight 1.

*example weight* = *original example weight* x *downsampling factor*

Consider this example of a model that detects fraud. Fraud happens once in every 200 transactions in the dataset, so in the true distribution, about 0.5% of the data is positive.

This is problematic because the training model will spend most of its time on negative examples and not learn enough from positive ones.

The recommendation is to first train on the **true distribution**

In the fraud dataset of 1 positive to 200 negatives:

- downsample the majority (negative) class by a factor of 20. That takes us from 1 positive per 200 negatives to 1 positive per 10 negatives, so now about 10% of our data is positive.
- upweight the downsampled class by adding example weights to it. Since we downsampled by a factor of 20, each remaining negative example gets an example weight of 20.
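The two steps can be sketched in a few lines of plain Python. This is only an illustration under assumptions: examples are `(features, label)` tuples, every example starts with a weight of 1, and the function name `downsample_and_upweight` is made up.

```python
import random

def downsample_and_upweight(examples, majority_label, factor, seed=0):
    """Keep 1/factor of the majority class and give each kept example a weight of `factor`.

    Returns (features, label, example_weight) tuples.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    majority = [e for e in examples if e[1] == majority_label]
    minority = [e for e in examples if e[1] != majority_label]

    kept = rng.sample(majority, len(majority) // factor)

    # example weight = original example weight (1) x downsampling factor
    weighted = [(x, y, 1.0 * factor) for x, y in kept]
    weighted += [(x, y, 1.0) for x, y in minority]
    rng.shuffle(weighted)
    return weighted

# 10 fraud examples (label 1) and 2000 legitimate ones (label 0): 1 positive per 200 negatives
examples = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 2010)]
train = downsample_and_upweight(examples, majority_label=0, factor=20)

n_pos = sum(1 for _, y, _ in train if y == 1)
n_neg = sum(1 for _, y, _ in train if y == 0)
neg_weight = sum(w for _, y, w in train if y == 0)
print(n_pos, n_neg, neg_weight)
```

The total weight of the negatives after upweighting equals the number of negatives we started with, which is what keeps the model's outputs calibrated.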

**Effects of Downsampling and Upweighting**

- **Faster convergence** during training because we will see the minority class more often.
- **Disk space management** is enhanced because by consolidating the majority class into fewer examples with larger weights, we spend less disk space storing them. This saving allows more disk space for the minority class, so we can collect a greater number and a wider range of examples from that class.
- **Calibration** is ensured through upweighting. The outputs can still be interpreted as probabilities.

After sampling, the next step is to split the data into:

- Training sets
- Validation sets
- Testing sets

Often, splitting is done randomly because it is the best approach for many ML problems. However, this is not always the best solution especially for datasets in which the examples are naturally clustered into similar examples.

Consider an example in which we want our model to classify the topic from the text of a news article:

A random split will be problematic because news stories appear in clusters: multiple stories about the same topic are often published around the same time. With a random split, stories from the same cluster land in both the training and test sets, so the test set is no longer independent of the training set and the evaluation is skewed.

A simple approach to fixing this problem would be to split our data based on when the story was published, perhaps by the day the story was published.

With tens of thousands or more news stories, a percentage of stories on the same topic may still get divided across the days. That's okay, though; in reality these stories were split across two days of the news cycle.

Alternatively, you could throw out data within a certain distance of your cutoff to ensure you don't have any overlap. For example, you could train on stories for the month of April, and then use the second week of May as the test set, with the week gap preventing overlap

So you could collect 30 days of data, split the data by time, train on data from days 1-29 and evaluate on data from day 30.
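A small sketch of such a time-based split, with an optional gap between training and test data. The story records, the `published` field and the `split_by_day` helper are all assumptions made for the example.

```python
from datetime import date, timedelta

def split_by_day(stories, cutoff_day, gap_days=0):
    """Train on stories published before cutoff_day, test on stories published
    on or after cutoff_day + gap_days. Stories inside the gap are thrown out."""
    test_start = cutoff_day + timedelta(days=gap_days)
    train, test = [], []
    for story in stories:
        if story["published"] < cutoff_day:
            train.append(story)
        elif story["published"] >= test_start:
            test.append(story)
    return train, test

# Hypothetical data: one story per day for 30 days
start = date(2021, 4, 1)
stories = [{"id": d, "published": start + timedelta(days=d)} for d in range(30)]

# Days 1-29 for training, day 30 for evaluation
train, test = split_by_day(stories, cutoff_day=start + timedelta(days=29))
print(len(train), len(test))

# Same data, but with a one-week gap after the cutoff to avoid overlap
train_gap, test_gap = split_by_day(stories, cutoff_day=start + timedelta(days=20), gap_days=7)
print(len(train_gap), len(test_gap))
```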

Time-based splits work best with very large datasets (in the order of 10s of millions of examples).

In projects with less data, the distributions end up quite different between training, validation, and testing.

You could end up with a skew: your model can be training on information it would not necessarily have access to at prediction time.

**Domain knowledge can inform you how to split your data**

**Never train on test data. If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. For example, high accuracy might indicate that test data has leaked into the training set.**

The next consideration is making your data generation pipeline reproducible, by making sure any randomisation in it can be made deterministic.

Say you want to add a feature to see how it affects model quality. For a fair experiment, your datasets should be identical except for this new feature. If your data generation runs are not reproducible, you can't make these datasets.

This can be applied during both sampling and splitting the data. It can be done in the following ways:

- **Seed your random number generators**: seeding ensures that the RNG outputs the same values in the same order each time you run it, recreating your dataset.
- **Use invariant hash keys**: hashing involves mapping data of arbitrary size to fixed-size values. The values returned by a hash function are called *hash values*, *hash codes* etc. You can hash each example, and use the resulting integer to decide in which split to place the example. **The inputs to your hash function shouldn't change each time you run the data generation program. Don't use the current time or a random number in your hash, for example, if you want to recreate your hashes on demand.**

**Note:** When hashing, **hash on query + date**, which results in a different hash each day, as opposed to hashing on just the query, which could lead to the following:

- Your training set will see a less diverse set of queries
- Your evaluation sets will be artificially hard, because they won't overlap with your training data. In reality, at serving time, you'll have seen some of the live traffic in your training data, so your evaluation should reflect that
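Both techniques can be sketched with the standard library. The 80/10/10 proportions, the key format and the function names (`seeded_sample`, `assign_split`) are assumptions for illustration. Note that Python's built-in `hash()` is salted per process for strings, so the sketch uses `hashlib` to keep the hash invariant between runs.

```python
import hashlib
import random

def seeded_sample(items, k, seed=42):
    """Sampling with a fixed seed returns the same items on every run."""
    return random.Random(seed).sample(items, k)

def assign_split(query, day, train_pct=80, validation_pct=10):
    """Hash on query + date and use the resulting integer to pick a split."""
    key = f"{query}|{day}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + validation_pct:
        return "validation"
    return "test"

population = list(range(100))
print(seeded_sample(population, 5) == seeded_sample(population, 5))

# The same query on the same day always lands in the same split;
# the same query on different days is hashed independently.
for day in ["2021-04-01", "2021-04-02", "2021-04-03"]:
    print(day, assign_split("weather today", day))
```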

Data pre-processing takes more than half of the project time. This is because data can make or mar your model.

The following are the steps to take for data pre-processing:

- Construct the dataset
- Transform the data

The following are the steps to take to construct the dataset:

- Collect the data
- Identify feature and label sources
- Select a sampling strategy
- Split the data

The following are the things to consider when collecting the data:

- Size and Quality of your dataset
- Joining Logs
- Label sources

Take into account the size and quality of your dataset. More often than not, the more data you have, the better your overall model performance.

*Your model should train on at least an order of magnitude more examples than it has trainable parameters. Your model is only as good as your data.*

However, the quality of the dataset is also paramount, as a large dataset of bad quality is just as useless.

Quality though, is a fuzzy term. A good framework for determining what passes for quality data is to:

*use data that lets you succeed with the business problem you care about*

Data is good if it lets you accomplish the intended task.

Having said that, concrete measures of data quality include:

- Reliability
- Feature Representation
- Minimizing Skew

**Reliability**

The following are questions to answer when determining the reliability of your data:

- how common are label errors?
- are your features noisy?
- is the data properly fitted for your problem?

The following are indicators of unreliable data:

- omitted values
- bad labels
- duplicate examples
- bad feature values
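Some of these indicators are easy to count automatically. The sketch below is one possible check, assuming each example is a dictionary and that the set of valid labels is known up front; the rows, field names and `reliability_report` function are invented for the example. It does not try to detect bad feature values, since what counts as "bad" depends on the feature.

```python
def reliability_report(rows, label_key, valid_labels):
    """Count rows with omitted values, bad labels, and exact duplicates."""
    report = {"omitted_values": 0, "bad_labels": 0, "duplicates": 0}
    seen = set()
    for row in rows:
        if any(value is None for value in row.values()):
            report["omitted_values"] += 1
        if row[label_key] not in valid_labels:
            report["bad_labels"] += 1
        # A row is a duplicate if an identical row was already seen
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            report["duplicates"] += 1
        seen.add(fingerprint)
    return report

# Hypothetical examples
rows = [
    {"price": 10.0, "region": "north", "label": "cheap"},
    {"price": None, "region": "south", "label": "cheap"},   # omitted value
    {"price": 99.0, "region": "north", "label": "chep"},    # mistyped label
    {"price": 10.0, "region": "north", "label": "cheap"},   # duplicate of the first row
]

report = reliability_report(rows, label_key="label", valid_labels={"cheap", "expensive"})
print(report)
```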

**Feature Representation**

When approaching a new ML problem, it is best to start with one or two features.
Here are some of the questions to answer when mapping data to useful features:

- How is data shown to the model?
- Should you normalize numeric values?
- How should you handle outliers?

**Minimizing Skew**

Minimizing skew means avoiding training/serving skew, i.e. ensuring that the data you train on is not so different from the data your model sees at serving (prediction) time. The more closely your training task matches your prediction task, the better your ML system will perform.

**Golden Rule**: *do unto training what you would do unto prediction*

Often, in assembling a training dataset, combining multiple sources of data is necessary. There are 3 types of input data:

- transactional logs
- attribute data
- aggregate statistics

**Transactional Logs**

Transactional logs record a specific event, for example the date and time a post enters a database.

**Attribute data**

Attribute data contains snapshots of information. It isn't specific to an event or a moment in time but can still be used to make predictions especially predictions that are not tied to a specific event e.g. the number of blemishes in a particular product.

*you can create a kind of attribute data by aggregating several transactional logs. This is also known as aggregate statistics*

**Aggregate statistics**

Aggregate statistics are obtained by creating attribute data from multiple transactional logs, e.g. the frequency of user queries.
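As a quick illustration, here is how a handful of transactional log entries could be rolled up into a per-user query frequency. The log format and field names are assumptions made for the example.

```python
from collections import Counter

def queries_per_user(logs):
    """Aggregate transactional log entries into a per-user query count."""
    return Counter(entry["user"] for entry in logs)

# Hypothetical transactional logs: one entry per query event
logs = [
    {"timestamp": "2021-04-01T09:00:00", "user": "ada", "query": "atari breakout"},
    {"timestamp": "2021-04-01T09:05:00", "user": "ada", "query": "behavioral cloning"},
    {"timestamp": "2021-04-01T10:30:00", "user": "bob", "query": "fuzzy logic"},
    {"timestamp": "2021-04-02T08:15:00", "user": "ada", "query": "quantile buckets"},
]

frequency = queries_per_user(logs)
print(frequency)
```

Each log entry is tied to a moment in time, while the resulting counts are attribute-like data about each user.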

Joining logs from different locations is often necessary when assembling your training data. Prediction data or test data, however, have the following sources:

- online sources
- offline sources

The choice between the two options can be made using this framework:

**online**: latency is a concern, so your system must generate input quickly. Attribute data and aggregate statistics may therefore need to be computed or looked up beforehand rather than on the fly, because computing them on the fly adds latency to the system.

**offline**: you likely don't have any compute restrictions (e.g. latency), so you can do complex computations similar to training data generation.

Labels are the outputs of your data. It's important to have well-defined labels, as this makes machine learning easier. There are two types of labels: direct and derived labels. The best label type is the direct label.

*direct label*: user is a fan of X

*derived label*: user has watched X's video on youtube therefore user is a fan of X

A derived label does not directly measure what you want to predict.

*your model will only be as good as the relationship between your derived label and your desired prediction*

**Label sources**

There are two types of labels:

- direct label for events (*did the user click on X*)
- direct label for attributes (*will the temperature be more than X in the next week*)

*The output of your model can be either an event or an attribute*

*DIRECT LABEL FOR EVENTS*

Answer the following questions as guide:

- how are your logs structured?
- what is considered an event in your logs?

*you would need logs where the events are impressions*

*DIRECT LABEL FOR ATTRIBUTES*

Typically, previous days of data are used to make predictions for the coming days (take the example *will the temperature be more than X in the next week?*).

*seasonality and cyclical effect should be taken into consideration*.

**Note: Direct label needs logs of past behaviour. We need historical data to use supervised machine learning**

If you do not have a log of past data to use, e.g. because your product does not exist yet, you could take one or more of the following actions:

- use a heuristic for a first launch and then train a system based on logged data
- use logs from a similar problem to bootstrap your system
- use human raters to generate data by completing tasks

**Note: using human raters or human-labelled data has the following pros and cons**

**pros**

- data forces you to have a clear problem definition
- human raters can perform a wide range of tasks

**cons**

- good data typically requires multiple iterations
- the data is expensive for certain domains

**IMPROVING QUALITY OF HUMAN RATED DATA**

- label the data yourself and compare with your raters'. do not assume your ratings are the correct ones in the event of discrepancies
- help your raters reduce errors by giving them instructions

Soft computing is a body of knowledge consisting of a group of computational techniques applied to complex computations, with the aim of getting as close to a human solution as possible. These techniques are based on artificial intelligence[1] and consist of evolutionary computations, probabilistic reasoning, neural networks and fuzzy logic.

The term was first coined by *Zadeh L.A(1994)[2]*.

Soft computing techniques are employed for complex computations such as optimizations.

*This piece is the first article of this blog. This blog will document my AI/ML journey, therefore starting with a piece about soft computing is apt given that it was my first foray into artificial intelligence.*

You’ll find this piece interesting if you’re a noob in AI/ML like me and have a generous disposition towards a fellow noob sharing what they hope passes for knowledge, or if you are just generally curious about stuff and generally have a generous disposition.

I recently completed my master's degree project, which was on intelligent scheduling in fog computing. The project examined the process of scheduling within a domain (fog computing) and used a soft computing heuristic to optimize its scheduling process with respect to the total energy consumption of the domain system.

The soft computing heuristic employed is called a genetic fuzzy rule based system (FRBS). This heuristic makes use of a genetic algorithm based learning technique called **Pittsburgh**[3] to produce optimized rule bases for the main fuzzy inference system algorithm.

Learning techniques for FRBS are based on algorithms such as evolutionary algorithms(GA and particle swarm optimizations) and neural networks.

Karr (1991) proposed the pioneering work in genetic learning for fuzzy inference systems (FIS).

The concept of a learning technique for fuzzy rule based systems stems from the concept of linguistic modelling which is considered to be the most important application of fuzzy logic in academia.

In soft computing, fuzzy logic is used instead of binary logic to achieve the aim of coming as close to human solutions as possible.

The various component parts of soft computing are as follows:

- Evolutionary computations
- Fuzzy logic
- Probabilistic Reasoning
- Neural Networks

Any two or three of these parts may be combined to form wholly new heuristics that make up the problem-solving algorithm an individual user requires.

In the case of my master's degree, we combined fuzzy logic and evolutionary computations for our heuristic. The heart of the heuristic is the fuzzy inference system, while the brain of the heuristic is the learning technique for the fuzzy inference system, which happens to be a genetic algorithm.

Fuzzy inference systems are used in various applications to make decisions that are as close to human decisions as possible[4]. Their most prevalent applications include those involving control systems. However, they do not have any self-learning abilities for the design of their knowledge bases, which is why learning techniques are essential.

The most common learning techniques are those involving evolutionary computations, particularly genetic algorithms. Particle swarm optimizations and simulated annealing algorithms are also used.

Apart from the foregoing, neural networks and interpolation methods are also used for FIS knowledge base learning.

There are 3 genetic algorithm based learning techniques: Pittsburgh, Michigan and Iterative Rule learning techniques.

In Pittsburgh, each chromosome is a whole rulebase or knowledge base.

A rulebase is a collection of fuzzy rules.

For example, consider a FIS with 2 inputs and 1 output, with 3 independent linguistic terms (at the input), as shown below:

inputs and output with their linguistic terms

The fuzzy rules for this FIS could include the following:

*if x1 is low and x2 is medium then y is poor*

*if x1 is medium and x2 is medium then y is medium*

*if x1 is high and x2 is medium then y is high*

A chromosome within the population of the genetic algorithm could encode rules like the ones above in the following way:

each of the fuzzy rules above is encoded in this one chromosome like so
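One possible way to write this down in code is sketched below. It is an assumption on my part rather than the exact encoding used in the project: each linguistic term is replaced by its integer index, each rule becomes three genes (x1, x2, y), and the rules are concatenated into one chromosome. The output terms are taken from the three example rules above.

```python
# Assumed linguistic terms; the index of each term is used as its gene value
INPUT_TERMS = ["low", "medium", "high"]
OUTPUT_TERMS = ["poor", "medium", "high"]

def encode_rulebase(rules):
    """Flatten a list of (x1, x2, y) rules into one integer chromosome."""
    chromosome = []
    for x1, x2, y in rules:
        chromosome += [INPUT_TERMS.index(x1), INPUT_TERMS.index(x2), OUTPUT_TERMS.index(y)]
    return chromosome

def decode_rulebase(chromosome):
    """Turn a chromosome back into a list of (x1, x2, y) rules."""
    rules = []
    for i in range(0, len(chromosome), 3):
        a, b, c = chromosome[i:i + 3]
        rules.append((INPUT_TERMS[a], INPUT_TERMS[b], OUTPUT_TERMS[c]))
    return rules

rulebase = [
    ("low", "medium", "poor"),       # if x1 is low and x2 is medium then y is poor
    ("medium", "medium", "medium"),  # if x1 is medium and x2 is medium then y is medium
    ("high", "medium", "high"),      # if x1 is high and x2 is medium then y is high
]

chromosome = encode_rulebase(rulebase)
print(chromosome)
print(decode_rulebase(chromosome))
```

With an encoding like this, the genetic algorithm's crossover and mutation operators work on plain integer lists, and every individual in the population decodes back to a complete rulebase.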

After encoding the fuzzy rulebases in each chromosome, the genetic algorithm is then employed to optimize the rulebases for the FIS.

- Decide what your ultimate end-goal is. In the case of optimizations, decide the parameters you want to optimize. This will inform the decision on the right inputs and outputs for the FIS.
- Encode the fuzzy rules in individual rulebases for each chromosome.
- Choose a learning technique.
- Optimize your rulebase with your learning technique.
- Integrate your heuristic into your desired point of optimization/domain environment.

**References**

[1] Choudhury, B., & Jha, R. (2016). Soft Computing Techniques. In Soft Computing in Electromagnetics: Methods and Applications (pp. 9-44). Cambridge: Cambridge University Press. doi:10.1017/CBO9781316402924.003

[2] Zadeh, L. (1994). Fuzzy logic, neural networks, and soft computing. Communications of the ACM, 37(3), 77-84.

[3] Alcala, R., Casillas, J., Cordon, O., Herrera, F., & Zwir, I. (1970). Techniques for Learning and Tuning Fuzzy Rule-Based Systems for Linguistic Modeling and their Application.

[4] Vaščák, J. (2013). Automatic Design and Optimization of Fuzzy Inference Systems. In: Zelinka I., Snášel V., Abraham A. (eds) Handbook of Optimization. Intelligent Systems Reference Library, vol 38. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30504-7_12
