I did not realise that I was dealing with time series data at first :D
I decided to do some personal ML projects separate from the tutorials and AI resources I am using for self-study. I learnt it's better to do one complex, in-depth project than several small Kaggle projects that take a few days each. So I thought about what I care about most right now (apart from ML problem-solving, of course): decentralized processing (edge computing). I went looking for edge computing data but couldn't find any; instead I found some China Telecom data that apparently could provide insights to edge computing researchers with respect to edge server placement, service migration, service recommendation, etc.
Anyway, what got me was the 'telecom' part. Looking at the fields in the data, I'm not exactly sure yet what I want to predict.
I was able to get access to the short version of the data (two weeks' worth, about 600,000 examples), though the full dataset spans six months of data collection and has about 7 million examples.
This is the first part of the pre-processing stage for this data. I am in the 'identify feature and label sources' stage of this project.
I thought about the possibility of using the duration a person spends in a location to predict things like how often people visit that location, how long they tend to stay there, and so on.
At this juncture it's imperative to show some images:
the initial data
I decided to add an extra column, duration: the difference between the start and end times.
This was when I realised I was dealing with time series data :D
:D I realise how silly that sounds, given that the fields 'start time' and 'end time' make it so glaringly obvious.
In trying to create the extra column, duration, I struggled a little bit with pandas datetime and timestamp manipulation methods until I found this.
A couple of Stack Overflow searches later, plus the aforementioned help, I have my additional field and full awareness of the kind of data I am handling :D So, just for the heck of it, I'm going to explain how I did it.
- I converted each of the time columns to str:
data['start time'] = [str(x) for x in data['start time']]
data['end time'] = [str(x) for x in data['end time']]
- Then I converted them back to Timestamp:
data['start time'] = pd.to_datetime(data['start time'], infer_datetime_format=True)
data['end time'] = pd.to_datetime(data['end time'], infer_datetime_format=True)
- Then I took the difference to create my extra column:
data['duration(min)'] = (data['end time'] - data['start time']).astype('timedelta64[m]')
the data with my duration column in mins
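For anyone following along on a recent pandas version, here is a minimal sketch of the same steps with a toy stand-in for the data (I only kept the 'start time' and 'end time' column names from above; the values are made up). The str round-trip isn't actually needed, since pd.to_datetime accepts strings directly, and newer pandas versions reject .astype('timedelta64[m]'), so I use total_seconds() / 60 instead:

```python
import pandas as pd

# Toy stand-in for the telecom data (only the column names are real)
data = pd.DataFrame({
    'start time': ['2018-06-01 08:00:00', '2018-06-01 09:15:00'],
    'end time':   ['2018-06-01 08:45:00', '2018-06-01 10:05:00'],
})

# pd.to_datetime parses strings directly, so no str round-trip is needed
data['start time'] = pd.to_datetime(data['start time'])
data['end time'] = pd.to_datetime(data['end time'])

# Subtracting two datetime columns yields a Timedelta column;
# total_seconds() / 60 converts it to minutes and works on current
# pandas, unlike .astype('timedelta64[m]')
data['duration(min)'] = (data['end time'] - data['start time']).dt.total_seconds() / 60
```

Same idea, same result, just without tripping over deprecated APIs.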
I am already seeing some NaN values in my Jupyter notebook; some examples in the data are missing fields (the location field).
I will decide whether to replace those fields with 0 or to remove those examples entirely. Then I will perform more cleaning, transformation, sampling and splitting (in the right order) :D
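Both options are one-liners in pandas. A small sketch, again on made-up rows (the column name 'location' is my guess at the missing field, and since it looks categorical, a placeholder label like 'unknown' is probably safer than 0, which a model could mistake for a real location code):

```python
import pandas as pd

# Toy frame with a missing location, mirroring the NaNs in the notebook
data = pd.DataFrame({
    'location': ['cell_12', None, 'cell_07'],
    'duration(min)': [45.0, 50.0, 30.0],
})

# Option 1: drop examples that lack a location entirely
dropped = data.dropna(subset=['location'])

# Option 2: keep them, marking the gap with a placeholder category
filled = data.fillna({'location': 'unknown'})
```

Which one is right depends on how many rows are affected and whether the missing locations are missing at random, something worth checking before throwing ~600,000 examples at a model.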
Alright! I just couldn't resist the urge to document the moment I knew I was dealing with my first time series data :D