Running Code and Models Against Datasets

After you have identified the public dataset(s) you want to code or build a model against (see Discover and view public, yours or your team-mate datasets), or have uploaded your own dataset(s) (see Uploading your own dataset), the MiPasa SDK provides a dedicated module for working with those datasets quickly.

The typical actions that require the MiPasa SDK are:

  • Fetching a dataset
  • Decoding the chosen dataset’s country / state code values into universal country / state codes by fetching the respective Unbounded Taxonomy Representation (UTR) entry

We will go over both below, using an existing code entry as an example.

To load the MiPasa SDK and request the dataset, add the following lines:

import mipasa
client = mipasa.Client()
feed = client.get_feed_by_name('The COVID Tracking Project')
df_usa = feed.get_file('Output_CovidTracking_Data_Positive.csv').get_as_dataframe()

After this, you will have a Pandas DataFrame containing the contents of the file called “Output_CovidTracking_Data_Positive.csv” from the COVID Tracking dataset (the dataset itself can be seen here: The COVID Tracking Project).
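
If you load several files, the three calls above can be wrapped in a small helper. The following is only a sketch: the function name load_feed_file is hypothetical and not part of the MiPasa SDK, and it relies solely on the get_feed_by_name, get_file and get_as_dataframe calls shown above.

```python
def load_feed_file(client, feed_name, file_name):
    """Fetch one file from a named feed and return it as a DataFrame.

    Hypothetical convenience wrapper: `client` is expected to be a
    mipasa.Client() instance as created in the snippet above.
    """
    feed = client.get_feed_by_name(feed_name)
    return feed.get_file(file_name).get_as_dataframe()


# Example usage inside MiPasa (left commented out so that the sketch
# can be read without the SDK installed):
# import mipasa
# df_usa = load_feed_file(mipasa.Client(),
#                         'The COVID Tracking Project',
#                         'Output_CovidTracking_Data_Positive.csv')
```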

In our Output files, the states are always encoded. To decode a state, the process is very simple:

# get UTR (Unbounded Taxonomy Representation) table for States
# this is needed to translate state ID to name
utr_states = client.get_utr('states')

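# the first row is the CSV header; find the positions of the columns we need
utr_states_id = utr_states[0].index('id')
utr_states_name = utr_states[0].index('name')

# build a lookup table: state ID -> state name
state_names = {}
for row in utr_states[1:]:
  state_names[row[utr_states_id]] = row[utr_states_name]

# replace every state ID in the DataFrame with its readable name
df_usa['stateId'] = df_usa['stateId'].map(lambda x: state_names[x])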

This will convert state identifiers such as S00003956 to readable names like “New York”.
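
The lookup step can also be tried in isolation, without the SDK. The sketch below assumes the UTR table is a list of rows whose first row is the header, as the code above implies. The helper names build_lookup and decode_state are hypothetical, and apart from the S00003956 / New York pair the sample rows are made up. Unlike the snippet above, it falls back to the raw identifier when an ID is missing from the table instead of raising a KeyError.

```python
def build_lookup(table, key_column, value_column):
    """Turn a header-plus-rows table into a {key: value} dictionary."""
    header, rows = table[0], table[1:]
    key_idx = header.index(key_column)
    value_idx = header.index(value_column)
    return {row[key_idx]: row[value_idx] for row in rows}


# Hypothetical sample shaped like the UTR table: the first row is the header.
# Only the first data row is taken from the text; the second one is invented.
sample_utr = [
    ["id", "name"],
    ["S00003956", "New York"],
    ["S00000001", "Example State"],
]

state_names = build_lookup(sample_utr, "id", "name")


def decode_state(state_id):
    # Return the readable name, or the raw ID if it is not in the table
    return state_names.get(state_id, state_id)


print(decode_state("S00003956"))
print(decode_state("S99999999"))
```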
If you run the code now and try to print the DataFrame, you should be able to see an excerpt of the data:


Plotting a chart using the Code editor

MiPasa supports multiple libraries to plot charts, such as Matplotlib, Plotly, or Seaborn.
For this tutorial, we will create a simple chart using Pandas and Matplotlib.

Note: For this, we will reuse the code created in the previous chapter.
We’re assuming that you already have imported the MiPasa SDK and have a df_usa DataFrame with the United States data.

The chart will show the latest known COVID-19 confirmed case counts within the US.

For this, we need to find the current date, or rather, the latest date available in the dataset. The data is generally up to date. However, due to time zones and update times (for example, some feeds update at 4pm, some update at 12am…), it is safer not to assume that a specific date is present.

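# "import dateutil" alone does not load the parser submodule
import dateutil.parser

# keep only the most recent row for each state
df_usa = df_usa.sort_values('date').groupby('stateId').tail(1)

# this is to show the date visually
latest_date = df_usa['date'].max()
# %b (abbreviated month name) is portable; %h is not available on every platform
latest_date_visual = dateutil.parser.isoparse(latest_date).strftime('%d %b %Y')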

The date stored in latest_date_visual will be used in the chart. Stating explicitly which date the chart was created for avoids confusing or misinforming readers.
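
To see what the sort / groupby / tail(1) step does, here is a minimal sketch on made-up rows. It assumes the date column holds ISO 8601 strings in a single consistent format, so that sorting them as text is the same as sorting chronologically; the sample values are hypothetical and only the column names come from the code above.

```python
import pandas as pd

# Hypothetical sample rows; the real file has more rows and columns.
sample = pd.DataFrame({
    "stateId": ["New York", "New York", "Texas", "Texas"],
    "date": ["2020-05-01", "2020-05-02", "2020-05-01", "2020-04-30"],
    "positive": [100, 120, 50, 40],
})

# After sorting by date, the last row of each state group is its latest one.
latest = sample.sort_values("date").groupby("stateId").tail(1)

# Map each state to the date of the row that was kept.
latest_by_state = dict(zip(latest["stateId"], latest["date"]))
print(latest_by_state)

# States may not all share the same latest date.
distinct_dates = latest["date"].nunique()
print(distinct_dates)
```

In this sample the two states end up with different latest dates. The same can happen in real data, in which case the single date shown in the chart label is the most recent one across all states.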

Now that everything is ready, we can plot the chart! Simply add the following code:

import matplotlib.pyplot as plt

f = plt.figure(figsize=(10, 5))

# total positive cases per state, sorted ascending so the largest values come last
pd_input = df_usa.groupby("stateId")["positive"].sum().sort_values()

plt.barh(pd_input.index[-10:], pd_input.values[-10:], color="darkcyan")
plt.tick_params(size=5, labelsize=13)
plt.xlabel("Confirmed Cases (for %s)"%latest_date_visual, fontsize=18)
plt.title("Top 10 States: USA (Confirmed Cases)", fontsize=20)
plt.ticklabel_format(axis='x', style='plain')
plt.subplots_adjust(left=0.18, right=0.95, bottom=0.15, top=0.9)
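
As a side note on the selection step, the "top 10" slice above depends on the series being sorted in ascending order. A possible alternative, sketched here on made-up totals, is pandas' Series.nlargest, which picks the largest entries directly. The values and the choice of n = 2 are only for illustration.

```python
import pandas as pd

# Hypothetical per-state totals, standing in for pd_input above.
totals = pd.Series({"A": 5, "B": 30, "C": 12, "D": 7})

# Take the two largest entries, then sort ascending so that a horizontal
# bar chart draws the largest bar at the top.
top = totals.nlargest(2).sort_values()
print(top)
```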

Now, if you run the code, you should be able to see your new chart on the right: