
Machine Learning Project with Orbital Insight

Before We Start Talking about the Project…


This Machine Learning Project was a part of the Break Through Tech AI Program hosted by Cornell Tech. It was a one-semester program (Fall 2022), and in this project, I worked in a team of four students with the company “Orbital Insight.” I will talk about this in more detail later.

This was my **first** machine learning project and also my first group project. Not only that, but while doing this project, I noticed that I really enjoy analyzing data. It’s always fun to visualize data, and the most interesting part is pulling meaningful insights out of massive datasets. Before this project, I really did not know what kind of software developer I wanted to be. My major is computer science, and I enjoy its mathematical side, but whenever I studied other software engineering fields such as web/app development, computer graphics, parallel computing, and so on, I never really “clicked” with them. However, data analysis and everything related to ML/AI felt different. When I first started learning this field, my reaction was:

“Wow, this is my field.”

It’s a little bit late (since I am already a senior), but I think it’s better to realize it now and start doing something about it, so I started doing what I can right now. I got a machine learning specialist position on the NYU VIP team - AI for Scientific Research - where I will work on a project starting this spring semester, and I joined several other programs. I might also apply for an on-campus ML engineer job, and hopefully I can work there.

Anyway, during the fall 2022 semester, many things impacted and changed me, and this project was one of the factors that changed me the most.

Team Orbital Insight: The Effect of Typhoons on Shanghai Shipments


Orbital Insight is a *geospatial analytics* company. They analyze all kinds of geospatial data, such as satellite, drone, plane, vessel, vehicle, and cell phone signal data, to uncover business and other strategic insights. In our project, our team decided to analyze vessel data and build a model that can predict the "**dwell times**" of ships during and outside of typhoon season, specifically in the Shanghai area.

Group Work Experience


This was a team-based project, so our challenge was not only building our ML model successfully, but also practicing “soft skills” like communication, time management, collaboration, leadership, and so on.

Communication with Challenge Advisor and AI Studio TA


Our team mostly communicated with our challenge advisor and AI Studio TA through Slack. Whenever we ran into obstacles in our project, we asked our advisor for help, and she always connected us to data scientists at Orbital Insight. By discussing and solving those obstacles together with working data scientists, we were able to approach our project more professionally and consider it from many different angles. This experience taught me how to work with other data scientists and to ask for help whenever we encountered technical issues or obstacles.

This is part of our Slack messages with our challenge advisor and a data scientist at Orbital Insight.


This is also part of our Slack messages, with our TA.


We did not have a specific place or time to work together. Therefore, we always checked our availability using “When2meet” and mostly met on Zoom (or Google Meet). Sometimes we were able to meet at the Orbital Insight office.

This is a photo of Orbital Insight’s office!


Team Communication and Workload


Our team also started out communicating on Slack. However, we had a little bit of trouble communicating that way, so halfway through the project we switched our communication platform to text messaging.

Even though we did not have a dedicated physical space to work in, we tried to meet and work together in person. Whenever we needed to meet *in person*, we met at a public library. However, towards the end of the semester it became difficult to get together, so we mainly met through Zoom.

We did not divide our workload strictly. Sometimes we did, but *mostly we worked all together at the same time*. This may appear inefficient, but it had real advantages: first, everyone stayed at the same pace and no one fell behind. Second, whenever obstacles and challenges came up, we could discuss and solve them together right away. This actually reduced the amount of time we spent on our project.

Industrial Scale Project Experience


This project let me experience a “real-world” problem. From choosing the topic to finding a solution, every step was different from what I had experienced before. In the past, the topic was already given, there was always an “ideal way” to solve the problem, and the data was neat. In this project, however, every step needed to be discussed and thought about critically. For example, when we chose the topic, we needed to rethink whether or not it was feasible.

Our team was lost at first because none of us knew what to do, but we managed well afterward.

About the Project

This project description is based on our team’s final presentation slides, plus my own detailed description of the data preprocessing. Our team presented the project results in front of 20+ professionals from Orbital Insight and Cornell Tech.

Here is the link to our final presentation video:

https://youtu.be/bGd3z2A_Sko

AI Studio Project Overview


Team Challenge Goal

To build and test a model that will accurately predict dwell times of ships heading to Shanghai during and outside of typhoon season.

What is dwell time? The time a ship spends in the same position, area, or stage of a process.

Business Impact

  • Why did we choose this goal?
    • Merchants and businesses need to know when shipments arrive
    • Shanghai is one of the busiest ports in the world
      • Delays here impact the global supply chain, making this important to understand
    • Ship dwell times are difficult to predict in typhoon-affected waters
    • See the effect that worsening typhoons (caused by climate change) could have on shipping
  • Utilize geospatial data rather than a typical dataset, using AOIs (drawing polygons, QGIS)
  • Important economic impact, providing insights to businesses that can only be derived from AI


Our Approach


Resources we used for this project


At first, to decide on our project topic and explore vessel activity, we used Orbital Insight’s own program “Orbital Insight Go”. Based on that, we asked Olivia to gather AIS data and used QGIS to visualize it.

To work on our project, we used Google Colab to collaborate, and we used Pandas and Scikit-Learn for data preprocessing and for building our machine learning models.
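
Just for context, here is a minimal sketch of how AIS data like ours could be loaded into a pandas DataFrame in Colab. The file name is only a placeholder (not our actual data file), and the imports below are reused by the later snippets.

import pandas as pd
import numpy as np

# load the AIS ping data into a DataFrame; "shanghai_ais.csv" is a placeholder file name
df = pd.read_csv('shanghai_ais.csv')
df.head()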

Data Preprocessing and EDA


First of all, we had to deal with null values, drop unnecessary columns, and encode string values as dummy values.

df.drop("imo", axis = 1, inplace = True)
df.drop("vessel_name", axis = 1, inplace = True)
df.drop(df.loc[df['nav_status_code'] == 16].index, inplace=True)
df.drop(df.loc[df['length'] == 0].index, inplace=True)
df.drop(df.loc[df['width'] == 0].index, inplace=True)
df.drop(df.loc[df['draught'] == 0].index, inplace=True)

First, we dropped the “imo” column. This column holds the IMO ship identification number, which is used as a unique ship identifier. However, in our data it had many 0 values, and we already had the “device_id” column for identifying unique ships, so we decided to drop it.

We also dropped the “vessel_name” column for a similar reason. This column held string values, so we would have had to either drop it or encode it, but we already had the "**device_id**" column that could be used instead of “vessel_name”.

Next, we had to deal with “null values”. The first ones we handled were the rows where “df[‘nav_status_code’] == 16”. “nav_status_code” is the AIS navigational status, a signal reported by the vessel that describes **the status of the vessel** in real time. The codes have defined meanings from 0 to 15, but we noticed that some of our data had 16 as its value. After discussing it with our advisor, we decided that this value did not carry any meaning, so we dropped the rows whose “nav_status_code” was 16.

The next invalid values we had to take care of were rows where “df[‘length’] == 0, df[‘width’] == 0, df[‘draught’] == 0”. The columns ‘length’, ‘width’, and ‘draught’ do not mean much separately, but multiplying them (length x width x draught) gives the “tonnage” (the capacity of the ship, in our case the vessel’s volume under the water). For valid vessel data, this “tonnage” value should be larger than 0, but if any one of these values is 0, the “tonnage” becomes 0, so we decided to drop such rows. The reason we simply dropped them rather than substituting other values is that the number of rows with invalid values was much smaller than the whole dataset, so dropping them was reasonable given our time constraints.

Next, we calculated the tonnage of each vessel under the water (ocean), because we thought that a vessel’s underwater tonnage might be connected to its “dwell time”.

# Calculate Tonnage
df['tonnage'] = 0
df['tonnage'] = df['length'] * df['width'] * df['draught']
df = df[df['tonnage'].notna()]

Now, using the “unixtime” field of our AIS data, we first tried to verify our assumption that a *typhoon would affect a vessel's journey, especially its dwell time*. If there is a big difference in vessel activity before and after the typhoon period, we can say that a typhoon affects a vessel’s journey, and therefore also affects its dwell time. Then, when we build our model to predict the “dwell time” of a ship, we can essentially tell the model:

Oh, if vessels are on a journey when a typhoon happens, the dwell time will be different than in normal times. The vessels’ dwell times during the typhoon season look like this. Please be aware of this when you calculate the dwell time.

This was our plan, so now let’s begin with data analysis.

We converted “unixtime” to “datetime” for better readability.

# Convert Unixtime
df['unix_to_date'] = 0
df['unix_to_date'] = pd.to_datetime(df['unixtime'], unit='s').apply(lambda x: x.to_datetime64())

Then, from our AIS data, we grabbed the rows where “df[‘nav_status_code’] == 1”. As I mentioned before, we wanted to check whether a typhoon affects a vessel’s journey, especially its dwell time. If a typhoon really “affects” a vessel’s journey, there should be fewer “anchored” vessels before the typhoon and an unusually large number of “anchored” vessels after it. Therefore, we only needed the anchored vessels’ data, and *in AIS data a “nav_status_code” value of “1” means that the vessel is anchored*.

df_anchored = df[(df['nav_status_code']==1)].copy()

With this anchored-vessel data, we tabulated unique vessels by hour and by day to test our assumption, and it held: our data analysis shows a huge difference in the number of vessels anchored at the port before and after a typhoon.

# Functions we used for "tabulating" unique vessels by hour and by day

def hourlyData(df):
    df['hour'] = (df['unixtime'] / 3600).astype(int)
    # re-index hours to 1
    df['hour'] = df['hour'] - df['hour'].min() + 1
    return df.groupby('hour')['device_id'].agg('nunique')

def dailyData(df):
    df['day'] = (df['unixtime'] / 86400).astype(int)
    # re-index days to 1
    df['day'] = df['day'] - df['day'].min() + 1
    return df.groupby('day')['device_id'].agg('nunique')
# tabulate unique ships by hour
hourly_anchored = hourlyData(df_anchored)
hourly_anchored

df_anchored['num_anchored_hour'] = df_anchored['hour'].apply(lambda x: hourly_anchored[x] if (x > 0) else 0)

# tabulate unique ships by day
daily_anchored = dailyData(df_anchored)
daily_anchored

df_anchored['num_anchored_day'] = df_anchored['day'].apply(lambda x: daily_anchored[x] if (x > 0) else 0)

This is the correlation analysis of our data.

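The correlation figure itself was an image in our post. Purely as a hedged illustration, a correlation matrix like that can be computed with pandas and drawn with seaborn; the column list below is an assumption, not necessarily the exact set we plotted.

import seaborn as sns
import matplotlib.pyplot as plt

# correlation between numeric columns of the anchored-vessel data (column list is illustrative)
cols = ['length', 'width', 'draught', 'tonnage', 'num_anchored_hour', 'num_anchored_day']
corr = df_anchored[cols].corr()

sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation analysis of anchored-vessel data')
plt.show()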

After that, we started calculating the “dwell time” of the ships. This step was the hardest, because **dwell time** is calculated as "**time the vessel leaves - time the vessel arrives**". It looks simple, but in our AIS data some of the vessels made multiple trips. Therefore, we needed to differentiate between the **end of one trip** and the **start of another trip**.

To differentiate multiple trips, we first collected all the time data (the ‘hour’ column in our dataset) for each unique vessel and sorted the times in ascending order. After that, if there was a **big time gap** between consecutive records, we **considered that the start of another trip** and saved its index in a separate list so we could compute multiple dwell times.

How we differentiate multiple trips

# All anchored times for each unique ship
# (this snippet runs inside a loop over the unique vessels, where i is the current device_id)
  current_time = sorted_anchored.loc[sorted_anchored['device_id'] == i][['hour']]
  array_hour = current_time.values.tolist()
  hour_len = len(array_hour)
  biggest_dif = 0
  # trip_change holds the indices of list_hours (the unique ship's "hour" values) where a new trip seems to start
  trip_change = [0]
  list_hours = []

  for x in array_hour:
    list_hours.append(x[0])

  for j in list_hours:
    next_index = list_hours.index(j) + 1
    if(next_index < hour_len):
      biggest_dif = list_hours[next_index] - j
    # a gap larger than 12 hours means this ship has made multiple trips
    if(biggest_dif > 12):
      trip_change.append(next_index)

Get the dwell times of the multiple trips

  # (continuation of the same per-vessel loop)
  dwell = []
  start_times = []

  for h in range(len(trip_change)):
    # this "if" condition avoids an index range error
    if((len(array_hour) != trip_change[h])):
      # the case when there are only two different "trip changes"
      if(len(trip_change) == 2 and h == 1):
        dwell.append((list_hours[trip_change[h]-1] - list_hours[trip_change[h-1]]) + 1) # trip 1
        start_times.append(list_hours[trip_change[h-1]])

        dwell.append(list_hours[-1] - list_hours[trip_change[h]] + 1) # trip 2
        start_times.append(list_hours[trip_change[h]])

      # last index
      elif(h + 1 == len(trip_change)):
        dwell.append(list_hours[-1] - list_hours[trip_change[h]] + 1)
        start_times.append(list_hours[trip_change[h]])

      # other indices
      elif (h != 0):
        dwell.append((list_hours[trip_change[h]-1] - list_hours[trip_change[h-1]]) + 1)
        start_times.append(list_hours[trip_change[h-1]])

  dwell_times[i] = dwell
  trip_start[i] = start_times

We got each “dwell time” from two data points: one for the vessel’s arrival time and one for its departure time. Both points describe the same trip, so keeping both would duplicate the trip. We therefore decided to keep only the “vessel arrival” row and assigned the dwell time to that row.

new_sorted_anchored = sorted_anchored.copy()
# key == vessel's device id
dwell_times_keys = dwell_times.keys()
for i in dwell_times_keys:
  # Get all trips of one unique vessel
  target = sorted_anchored.loc[sorted_anchored['device_id'] == i]
  # Get the 'hour' data of that unique vessel
  target_hours = target['hour']
  hours_array = np.array(target_hours)
  for j in hours_array:
    # if that hour is in "trip_start" for that vessel, add that trip's dwell time
    # to the new "dwell_time" column on the arrival row
    if (j in trip_start[i]):
      # get the row index of that target data
      row_num = (target[target['hour'] == j].index)[0]
      # get the dwell time index of that target data
      dwell_idx = trip_start[i].index(j)
      # df.loc[rowIndex, 'New Column Title'] = "some value"
      new_sorted_anchored.loc[row_num, 'dwell_time'] = dwell_times[i][dwell_idx]
# only keep the rows that received a dwell time (the arrival rows)
final_sorted_anchored = new_sorted_anchored[new_sorted_anchored['dwell_time'] > 0]

Now, we need another column that can tell our machine learning model that a given dwell-time record occurred while a typhoon was happening. Let’s create a “typhoon” column that indicates whether a typhoon was happening during that time period.

final_sorted_anchored['typhoon'] = 0
final_sorted_anchored['typhoon'] = final_sorted_anchored['unixtime'].apply(lambda x: 1 if (x <= 1565459100 and x >= 1565372700) else 0)
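
As a quick sanity check (not part of our pipeline), the two unixtime boundaries above can be converted back to human-readable dates to confirm the 24-hour window they cover:

import pandas as pd

# the typhoon window used above, converted to UTC timestamps
start = pd.to_datetime(1565372700, unit='s')  # 2019-08-09 17:45:00
end = pd.to_datetime(1565459100, unit='s')    # 2019-08-10 17:45:00, exactly 24 hours later
print(start, end)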

We actually ran our machine learning models after this, but they seemed to be overfitting. Therefore, with advice from our advisor, we did a bit more feature engineering.

Here is what we tried afterward:

  1. Drop the data that has unusually long dwell times

     df = df.loc[df['dwell_time'] <= 600]

  2. Keep only the “Cargo” and “Tanker” vessel types. To get only those types, the “vessel_type_code” should be between 70 and 89.

     df = df[(df['vessel_type_code'] >= 70) & (df['vessel_type_code'] <= 89)]

  3. Add a new column that indicates the number of ships in the backlog at the time of each ship’s first anchored ping.

     # sort pings by device_id and time
     df = df.sort_values(by=['device_id', 'unixtime'], ascending=True)

     # max time between pings to consider them part of the same AOI visit
     max_gap = 3600*6

     # flag whether successive pings are from the same vessel
     df['same_vessel'] = (df['device_id'].astype(int).diff() == 0)

     # find the time gap between pings
     df['time_gap'] = (np.abs(df['unixtime'].astype(int).diff()) > max_gap)

     # flag as a new trip when the time since the previous ping is greater than max_gap
     # (or the previous ping is from a different vessel)
     df['new_trip'] = (np.invert(df['same_vessel']) | df['time_gap'])

     # add a unique trip id
     df['trip_id'] = df['new_trip'].cumsum()

     # get the first and last ping for each trip; also require a ~0 cog (course-over-ground) to
     # filter out the small subset of pings that erroneously report that the ship is anchored/moored
     # while it is still moving, by first filtering down to anchored/moored pings,
     # then grouping by trip_id with first/last agg
     df_anchored = df[df['cog']<0.1]
     first_anchored = df_anchored.groupby('trip_id').agg('first')
     last_anchored = df_anchored.groupby('trip_id').agg('last')

     # the only fields we care about from the last anchored pings are trip_id and unixtime
     df_trips = pd.merge(first_anchored, last_anchored[['unixtime']], suffixes=('_first', '_last'), left_index=True, right_index=True)
     df_trips.sort_values('unixtime_first', inplace=True)

     # calculate the number of ships in the backlog at the time of each ship's first anchored ping
     df_trips['new_trip'] = True
     df_trips['cumulative_ships_arrived'] = df_trips.sort_values('unixtime_first')['new_trip'].cumsum()
     df_trips['cumulative_ships_departed'] = [len(df_trips[df_trips['unixtime_last']<=arrival]) for arrival in df_trips['unixtime_first'].values]
     df_trips['n_ships_anchored'] = df_trips['cumulative_ships_arrived'] - df_trips['cumulative_ships_departed']

Model Selection and Evaluation


Model Selection Process

  • Features:
    • Vessel type code
    • Latitude & Longtitude
    • Typhoon Occurence
    • Number of ships anchored

Some people might notice that we didn’t use “tonnage” as one of our features! That’s because we suspected it might cause overfitting, so we dropped it in the end!

  • Label:
    • Dwell Time
  • Problem:
    • Regression - a supervised learning problem where the label is a real number
  • Our Chosen Models (a rough training and evaluation sketch follows below):
    • Linear Regression
    • Random Forest
    • Gradient Boosting
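
Our actual notebooks are not shown here, but as a rough sketch only - assuming the preprocessed data lives in a DataFrame named df_model, with 'latitude' and 'longitude' as placeholder column names - training and evaluating the three models with Scikit-Learn could look like this:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# features and label; df_model, 'latitude', and 'longitude' are placeholder names
features = ['vessel_type_code', 'latitude', 'longitude', 'typhoon', 'n_ships_anchored']
X = df_model[features]
y = df_model['dwell_time']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
}

# fit each model and report the same metrics we compared: MAE and R^2
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, 'MAE:', mean_absolute_error(y_test, preds), 'R^2:', r2_score(y_test, preds))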

Model Comparison

| Model Name | Description | Results | Pros | Cons |
| --- | --- | --- | --- | --- |
| Linear regression | Model where a straight line represents the data and the proximity of points to the line shows the relationship between feature and label | Mean absolute error: 26.36, R^2: 0.01 | Faster, more efficient, and simple; overfitting can be reduced | Prone to underfitting; sensitive to outliers; assumes data is independent |
| Random forest regressor | Regression model that involves multiple decision trees contained within one set and combined into an ensemble method | Mean absolute error: 29.55, R^2: -0.34 | High accuracy; easy data preparation; can handle larger data | Biased towards more complex variables; little control over the model |
| Gradient boosting regressor | Iterative model that progressively refines its predictions by combining multiple decision trees | Mean absolute error: 26.34, R^2: 0.00 | Generally more accurate compared to other models; lots of flexibility | Prone to overfitting; hard to interpret the final models |

Model Visualization

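The visualization in our presentation was an image. Purely as an illustrative sketch (not the exact figure we showed), a predicted-versus-actual plot for the chosen model could be drawn like this, reusing the models and test split from the sketch above:

import matplotlib.pyplot as plt

# predicted vs. actual dwell times for the gradient boosting model
gb = models['Gradient Boosting']
preds = gb.predict(X_test)

plt.scatter(y_test, preds, alpha=0.3)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # perfect-prediction line
plt.xlabel('Actual dwell time (hours)')
plt.ylabel('Predicted dwell time (hours)')
plt.title('Predicted vs. actual dwell times')
plt.show()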

Insights and Key Findings

  • Final Model Selection
    • Gradient Boosting Regressor
      • Why? It had the lowest mean absolute error (only by a negligible amount)
  • Mean Absolute Error is very high
    • Overfitted Model
    • Features may not lead to accurate predictions
  • Potential Solutions
    • Spend more time tinkering with features
    • Try using different and more datasets
    • Potentially look at approaching this as a classification problem

Final Thought


Trial and Error

Lessons Learned:

  • How to decide on a reasonable thesis and scale for our project
  • How to deal with AIS data
  • Deeper understanding of Python libraries (NumPy, Pandas, Scikit-Learn) & ML tools (Google Colab)
  • How important the data preparation process is
  • How to format the data to be suitable for our machine learning model
  • Why a correct understanding of input and output features is important
  • How to work as a team
  • The importance of understanding the business/economic impact of our project

Takeaways:

  • Exposure to a new side of machine learning
  • Gaining hands-on experience with Python in a fast-paced learning environment
  • To not be afraid to ask questions!

Obstacles & Potential Next Steps

Obstacles:

  • Scheduling a time when we are all available
  • Limited in-person meetings
  • Beginner knowledge in Python
  • New exposure to machine learning and data science
  • Never having done an industrial-scale project before
  • Working with raw geospatial data vs. clean, curated data

Next Steps:

  • Scale up to a different region or a different typhoon
  • Apply this model to other natural disasters
  • Apply this model to other modes of transportation, such as trucks, airplanes, trains, and so on
  • Focus on a specific company? (Amazon, eBay, etc.)