Portfolio

Introduction

We split our analysis into a series of steps, each identified by a header. We begin by importing all the modules we will need.

from itertools import izip_longest, chain
import heapq
import scipy.stats
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.covariance import EllipticEnvelope
from ggplot import *
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import matplotlib

matplotlib.interactive(True)
SAMPLE_FRAC = 0.1
RANDOM_STATE = 1
CONTAMINATION = 0.1

Data Overview

Next, we identify the number of categorical and continuous variables, as this will affect how we preprocess the data before feeding it into a predictive model. We will ignore the huid column, as it is merely a user identifier and carries no information about the user's behavior; hence it has no predictive power, and we can drop it. Also, we will do our preprocessing with target_commits (the variable we wish to predict) appended to our feature space, for ease of coding and computation.

First, we isolate the continuous and categorical features and get a general overview of our data.

# read data
train_data_orig = pd.read_csv("data/train_data.tsv", delimiter='\t')
test_data_orig = pd.read_csv("data/test_data.tsv", delimiter='\t')

# drop huid
train_data = train_data_orig.drop('huid', axis=1)
test_data = test_data_orig.drop('huid', axis=1)

# identify types of features
text_feats = train_data.columns[train_data.dtypes == 'object']
continuous_feats_old = train_data.columns[train_data.dtypes != 'object']

An inspection of this data indicates that, with the exception of the huid, we only have continuous data and textual data, which, as we shall see, can also be mapped to a continuum of values. That is, after dropping the huid, there is no categorical data here. (Of note: we treat ‘is_org_member’ as a continuous variable, since it represents a lower bound on the number of organizations. It may look n-ary, but it is not.)

Now that we have classified our features, we next get an overview of nulls in our data.

def na_plotter(data=None, title=None):
    bar_data = data.isnull().sum().to_frame(name=title)
    display(bar_data)
    fig = bar_data.plot.barh(figsize=(8, 8))
    plt.xlabel('Count', fontsize=15)
    plt.ylabel('NaNs', fontsize=15)
    plt.title(title)
    return fig


tr_null_fig = na_plotter(train_data, 'Training Data Nulls')
plt.show()
te_null_fig = na_plotter(test_data, 'Test Data Nulls')
plt.show()
Training Data Nulls
user_age_days 0
is_org_member 0
total_repos 0
total_followers 0
total_stars_given 0
total_commit_comments 0
days_since_commit_comment 32169
total_commits 0
total_inline_pr_comments 0
days_since_inline_pr_comment 24129
total_issue_comments 0
days_since_issue_comment 1788
total_issues 0
issue_titles 0
days_since_issue 6843
total_prs 0
pr_titles 0
days_since_pr 5526
total_pushes 0
days_since_push 6313
target_commits 0

[Figure: bar chart of NaN counts per feature in the training data]

Test Data Nulls
user_age_days 0
is_org_member 0
total_repos 0
total_followers 0
total_stars_given 0
total_commit_comments 0
days_since_commit_comment 3643
total_commits 0
total_inline_pr_comments 0
days_since_inline_pr_comment 2668
total_issue_comments 0
days_since_issue_comment 237
total_issues 0
issue_titles 0
days_since_issue 738
total_prs 0
pr_titles 0
days_since_pr 589
total_pushes 0
days_since_push 708

[Figure: bar chart of NaN counts per feature in the test data]

Data Munging

Since our goal in this exercise is to evaluate against a test set, and obtain output for all huids in that test set, we must eliminate nulls by dropping features and not samples.

Note: filling the NaNs is itself a prediction problem: to fill a NaN for one huid, we need to know its relationship to the huids that do not have nulls. Given the time limitations of this exercise, it therefore makes more sense to drop the affected columns than to fill them.
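
As a hedged illustration of the alternative we are not pursuing, the snippet below fills the null columns with a per-column median. This is a crude imputation that ignores the relationships between users, which is precisely the objection above, so we do not use it in the analysis.

# sketch only: per-column median imputation (not used in this analysis)
cols_with_nulls = train_data.columns[train_data.isnull().sum() > 0]
train_data_imputed = train_data.copy()
train_data_imputed[cols_with_nulls] = \
    train_data_imputed[cols_with_nulls].fillna(train_data[cols_with_nulls].median())
print(train_data_imputed.isnull().sum().sum())  # expect 0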

# drop features with nulls column
cols_to_drop = train_data.columns[train_data.isnull().sum() > 0]
train_data_drop_col = train_data.drop(cols_to_drop, axis=1)
test_data_drop_col = test_data.drop(cols_to_drop, axis=1)
non_text_feats = [col for col in train_data_drop_col.columns if
                  col in continuous_feats_old]

# display
train_data_drop_col.describe()
user_age_days is_org_member total_repos total_followers total_stars_given total_commit_comments total_commits total_inline_pr_comments total_issue_comments total_issues total_prs total_pushes target_commits
count 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.00000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000
mean 5813.135860 2.825700 16.541420 47.352540 106.088360 3.31368 316.511180 42.065280 151.985820 20.167980 36.553500 284.865900 25.676840
std 8630.662233 3.651989 37.012601 345.532015 319.739966 18.23674 3707.591458 166.376516 467.072237 48.157965 868.074846 4236.626801 316.804479
min 365.000000 0.000000 1.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1458.000000 1.000000 5.000000 2.000000 4.000000 0.00000 22.000000 0.000000 9.000000 2.000000 3.000000 11.000000 0.000000
50% 2745.000000 2.000000 10.000000 8.000000 19.000000 0.00000 109.000000 1.000000 30.000000 6.000000 8.000000 76.000000 9.000000
75% 6729.750000 4.000000 18.000000 26.000000 81.000000 1.00000 329.000000 14.000000 119.000000 19.000000 27.000000 255.000000 25.000000
max 279124.000000 124.000000 5485.000000 30038.000000 13569.000000 1239.00000 792665.000000 5602.000000 43325.000000 3923.000000 188194.000000 790770.000000 64538.000000

Identifying Collinearity in Non-Textual Variables

Many of the non-textual variables should, intuitively, be correlated: for example, ‘total_stars_given’ and ‘total_commit_comments’. To confirm this, we use a PCA and examine the explained variance of the components.

# setup PCA so it identifies the dimensionality
pca = PCA(random_state=RANDOM_STATE)
pca_fit = pca.fit(train_data_drop_col[non_text_feats[:-1]])
print({"Explained Variance Ratio": pca_fit.explained_variance_ratio_})
{'Explained Variance Ratio': array([  6.94804924e-01,   2.72357926e-01,   2.17991968e-02,
         6.95808983e-03,   1.98187673e-03,   1.03073412e-03,
         8.63584292e-04,   1.77038588e-04,   1.79374166e-05,
         5.86099604e-06,   2.81604665e-06,   1.51831640e-08])}

Note that the first two components explain about 97% of the variance. It is clear that the effective dimensionality of the non-textual variables is really two.

Note: we will use this later when fitting our model to provide additional perspective on the feature importances the model implies.
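
As a quick sketch (not part of the main pipeline), we could project the non-text features onto those first two components; the column names 'pc1' and 'pc2' below are chosen purely for illustration.

# sketch: project the non-text features (excluding the target) onto the first
# two principal components, which capture roughly 97% of the variance
pca_2 = PCA(n_components=2, random_state=RANDOM_STATE)
projected = pca_2.fit_transform(train_data_drop_col[non_text_feats[:-1]])
projected_df = pd.DataFrame(projected, columns=['pc1', 'pc2'])
print(projected_df.describe())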

Outlier Detection

# display overview of features
display(train_data[text_feats].head(20))
display(train_data[non_text_feats].head(20))


#
# Glancing at the above, our data appears to contain a number of outliers.
# To see this, observe the large relative discrepancy (often more than 3
# standard deviations) between the 2nd quartile (the median) of some features
# and the maximum value of those features. Furthermore, some feature maxima,
# such as total_commits, are clearly bad data (792665 commits in a year comes
# out to an average of over 2,000 a day).

# To corroborate, we scale the data in each feature to the range 0 to 1,
# and use boxplots to gain an idea of the distribution of the data--and to
# visually see the outliers.
issue_titles pr_titles
0 [Change how documentation is built, New Change... [Don't create welcome pages without webroots, ...
1 [PHP Warning when converting] [Profile api, Fix Phone Number Order, Email te...
2 [Web connectivity test contains null in blocki... [Ui/redesigned, Result view, App v0.1.0 , Info...
3 [Profile CeCity, Add D3.js To The Toolbox, Sea... [Fixing The Dead Link]
4 [Nightscout test fails if url ends with /, His... [make the same, replaced references to old daw...
5 [Prevent nodes with duplicate names being regi... [Enable barman-wal-restore.py to pass -c/--con...
6 [Allow DNS domain on API request, Specify DNS ... [Fix comment typo on CarrierCore, Applied fixe...
7 [Wrong Projected Point, 32bit build fails, Loc... [Kk lesslua, trusty updates, pass args to inst...
8 [refactor code and released a new version, Per... [Support for AWS Elasticache Autodiscovery, Bu...
9 [Community suggestion: email should sequence t... [Fix link to Docker's user guide, Release 0.9....
10 [Use a SUID/sudo-approved shim to create names... [Note in the readme that this is in linux 3.17]
11 [Allow referencing from ERB templates other fi... [Document how to enable SSI, Do not delaycompr...
12 [Remove CommonContext, Icons on buttons, More ... [Add support for Gnome 3.20, Implement list me...
13 [Leaf Monocle, Switching to the Most Recent Ta... []
14 [Are updates thread-safe?, Running jpylyzer on... [Added an initial Dockerfile that can build an...
15 [Fastboot, Silent fail when network is big, Ad... [Temporary loggin on cluster node, Slurm defau...
16 [Control defaults per html demo pages, Add API... [A couple of additions to Hanbyuls lrm branch,...
17 [Resolve missing $error_occurred var, New "rep... [Fix typo and sync readme content with config ...
18 [Support blackfire on PHP 7, Middle-click on a... [Fix the handling of preincrement and predecre...
19 [Footnotes Extension?, activeElement can be nu... [Add observe method to Kefir.Observable, Assig...
user_age_days is_org_member total_repos total_followers total_stars_given total_commit_comments total_commits total_inline_pr_comments total_issue_comments total_issues total_prs total_pushes target_commits
0 7744 4 104 151 189 8 2003 190 969 13 489 985 55
1 959 1 7 0 4 0 711 31 78 1 61 549 15
2 6920 4 11 6 3 0 80 43 54 14 12 149 65
3 160776 56 118 307 152 2 66571 0 29 3 1 73006 1308
4 2024 0 19 5 51 0 821 5 17 5 24 763 85
5 1311 1 6 9 16 0 346 0 124 14 2 404 9
6 8515 5 164 71 450 34 677 873 2218 186 340 1994 38
7 6309 3 21 13 5 13 1044 259 433 60 222 1609 17
8 6236 2 23 1028 720 0 231 20 120 4 24 192 16
9 4280 4 10 38 6 1 861 435 1539 175 211 962 37
10 2771 1 9 4 12 0 2 0 14 5 1 0 28
11 4586 2 28 17 86 0 6 1 64 27 9 1 0
12 8825 5 27 27 89 0 481 10 140 67 12 396 9
13 4436 2 17 71 1625 1 306 0 25 7 0 226 14
14 22224 8 40 59 67 6 3127 6 181 72 15 2603 227
15 4696 2 9 4 36 0 202 0 22 12 8 1 41
16 1776 3 16 3 4 0 688 15 58 8 123 699 32
17 2296 1 26 20 175 5 421 131 638 15 252 460 11
18 27228 12 261 1241 188 69 336 2112 3438 110 133 172 33
19 7998 6 50 44 33 2 614 30 273 40 35 345 44
def plot_outliers(feat_df):
    data_for_plot = MinMaxScaler(feature_range=(0, 1), copy=True).fit_transform(
        feat_df)
    df_for_plot = pd.DataFrame(data=data_for_plot, columns=feat_df.columns)
    fig = df_for_plot.plot(kind='box', vert=False)
    return fig


out_fig = plot_outliers(train_data_drop_col[non_text_feats])
plt.show()

[Figure: boxplots of the min-max scaled non-text features, showing the outliers]

To reduce the influence of the outliers, we test both an Isolation Forest and an Elliptic Envelope on our dataset. The boxplots suggest that assuming at least 10% of the samples contain an outlier in one or more features is a reasonable starting point (the density of outlier points in many feature columns remains high out to a distance of 0.1 units). Note that we may later abandon outlier filtering if our model has predictive power without it, as filtering risks introducing bias into our model, which could in turn increase the RMSE against the test set. For now, however, we include it: a few extreme outliers that are due to noise rather than ‘true’ data may skew our model toward fitting that noise at the expense of the non-extreme points closer to the center of the data, increasing the variance of the model and, potentially, the RMSE.

def outlier_filter_ee(feat_df, feats_to_consider):
    env = EllipticEnvelope(contamination=CONTAMINATION)
    df_to_fit = feat_df[feats_to_consider]
    fitted = env.fit(df_to_fit)
    results = fitted.predict(df_to_fit)
    results_bool = pd.Series(results).apply(bool).values
    fin = feat_df[results_bool]
    return fin


def outlier_filter_if(feat_df, feats_to_consider):
    iso_forest = IsolationForest(max_samples=1.0,
                                 max_features=len(feats_to_consider),
                                 contamination=CONTAMINATION, n_estimators=1000)
    iso_forest.fit(X=feat_df[feats_to_consider])
    bool_ints = iso_forest.predict(feat_df[feats_to_consider])
    bools = [True if item == 1 else False for item in bool_ints]
    feat_df_filt = feat_df[bools]
    return feat_df_filt


out_fig_ee = plot_outliers(
    outlier_filter_ee(train_data_drop_col,
                      non_text_feats)[non_text_feats])
out_fig_if = plot_outliers(outlier_filter_if(
    train_data_drop_col,
    non_text_feats)[non_text_feats])
plt.show()

[Figure: boxplots of the scaled non-text features after Elliptic Envelope filtering]

[Figure: boxplots of the scaled non-text features after Isolation Forest filtering]

Comparing these two boxplots with our original outlier boxplot shows that the Isolation Forest does a better job of filtering outliers: the outlier density after applying the Isolation Forest is generally lower than after applying the Elliptic Envelope.

Hence, we opt for using the Isolation Forest on this dataset.
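
As a rough numeric companion to the visual comparison (a heuristic sketch, not part of the pipeline), we can count, per feature, the points lying above Q3 + 1.5×IQR before and after the Isolation Forest filter:

# sketch: count points beyond the upper boxplot whisker, per feature
def count_box_outliers(df):
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return (df > (q3 + 1.5 * iqr)).sum()


before_counts = count_box_outliers(train_data_drop_col[non_text_feats])
after_counts = count_box_outliers(
    outlier_filter_if(train_data_drop_col, non_text_feats)[non_text_feats])
print(pd.concat([before_counts, after_counts], axis=1,
                keys=['before', 'after']))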

Feature Extraction, and the Model Pipeline Strategy

We now seek to convert this textual data into numerical data that preserves the semantic relationships of the words in each issue_titles and pr_titles entry. A TfidfVectorizer is an appropriate approach.

We make sure to drop certain common English words (the ENGLISH_STOP_WORDS) that would have very low tf-idf weight and would needlessly increase the model's run time if left in.
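
As a toy illustration of this step (the titles below are made up), the TfidfVectorizer drops the stop words and turns each remaining word into a weighted column of a sparse matrix:

# toy example: tf-idf on two invented titles
toy_titles = ["Fix the broken build", "Add tests for the build pipeline"]
toy_vec = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, lowercase=True,
                          ngram_range=(1, 1))
toy_matrix = toy_vec.fit_transform(toy_titles)
print(toy_vec.get_feature_names())  # stop words such as "the"/"for" are gone
print(toy_matrix.shape)             # (2, number of distinct remaining terms)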

Then, we combine our newly generated text features with our continuous features. This is in general non-trivial (for one, the text features and continuous features will not have the same shape), but scikit-learn's FeatureUnion simplifies the task.

Lastly, we run our union of features through a model. Since our data is continuous, a LinearRegression is chosen, due to its high interpretability and because scikit-learn does not provide a Negative Binomial regressor, which is often used when the dependent variable is count data.

However, this may well not be an impediment: if the fit of the Linear Regression is good, its coefficients will serve as a measure of how much each feature impacts the number of target_commits for a user. We remark that since the dependent variable (target_commits) is count data, the LinearRegression may predict negative values. To counteract this, we run a RandomForestRegressor at the very end of our analysis to predict target_commits. It is more of a black-box model (which is why we use the LinearRegression first, to gain insight into our features), but it has the advantage that each tree routes a sample to a leaf and averages the training target values in that leaf, and the forest then averages over trees; hence its predictions are always non-negative for a non-negative dependent variable.

We choose sensible parameters for both the LinearRegression and the RandomForestRegressor, to possibly be altered later. We use normalization in the linear regression so that all the coefficients are on the same ‘scale’, eliminating the need to divide each of them by their standard error in order to compare them with each other. Also, we set the random_state to 1 in the RandomForest, so that repeated runs generate the same output; i.e. this analysis is reproducible.

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


class ContinuousFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]


def model(cont_feat_labels=None, regressor='lr'):
    issue_titles_tfidf = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS,
                                         lowercase=True,
                                         ngram_range=(1, 1))
    pr_titles_tfidf = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS,
                                      lowercase=True,
                                      ngram_range=(1, 1))
    cont_var_thresh = VarianceThreshold(threshold=0.0)
    if regressor == 'lr':
        model_tuple = ('lr', LinearRegression(normalize=True))
    elif regressor == 'rf':
        model_tuple = ('rf', RandomForestRegressor(n_estimators=100,
                                                   random_state=RANDOM_STATE,
                                                   criterion="mse"))
    else:
        model_tuple = None

    combined_features = Pipeline(
        [('union',
          FeatureUnion(
              transformer_list=[("continuous_data",
                                 Pipeline([('selector',
                                            ContinuousFeatureExtractor(
                                                keys=cont_feat_labels)),
                                           ('var_thresh',
                                            cont_var_thresh)])),
                                ("issue_titles",
                                 Pipeline([('selector',
                                            ItemSelector(
                                                key='issue_titles')),
                                           ('tfidf',
                                            issue_titles_tfidf)])),
                                ("pr_title",
                                 Pipeline([('selector',
                                            ItemSelector(
                                                key='pr_titles')),
                                           ('tfidf',
                                            pr_titles_tfidf)])),
                                ],
              transformer_weights={"continuous_data": 1,
                                   "issue_titles": 1,
                                   "pr_titles": 1,
                                   })),
         model_tuple])

    return combined_features
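
A short usage sketch of this pipeline factory (on a subsample, given the compute constraints discussed later); it assumes the data prepared earlier and mirrors the indexing style used in the function below:

# sketch: fit the combined pipeline on a subsample and predict target_commits
pipe = model(cont_feat_labels=non_text_feats[:-1], regressor='lr')
train_sample = train_data_drop_col.sample(frac=SAMPLE_FRAC,
                                          random_state=RANDOM_STATE)
pipe.fit(X=train_sample.ix[:, :-1], y=train_sample.ix[:, -1])
print(pipe.predict(test_data_drop_col)[:5])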

The Model

def output_test_train_prediction(train_xy, test_x, cont_feat_labels,
                                 sample_frac, regressor='lr'):
    # Part1: fit the model
    train_xy = train_xy.sample(frac=sample_frac, random_state=RANDOM_STATE)
    combined_features = model(cont_feat_labels, regressor=regressor)
    combined_features.fit(X=train_xy.ix[:, :-1], y=train_xy.ix[:, -1])

    # Part2: return coefficients/importances and ranks for the features.
    # Specifically, return ranks of the continuous features, and ranks of the
    # top 1000 features overall (the metric is the magnitude of the regression
    # coefficient, or the feature importance for the random forest)
    if regressor == 'lr':
        coefs = combined_features.named_steps['lr'].coef_
    elif regressor == 'rf':
        coefs = combined_features.named_steps['rf'].feature_importances_
    else:
        raise ValueError("Please supply a valid regressor.")
    union = combined_features.named_steps['union']
    cont_data_info = union.transformer_list[0][1].named_steps['var_thresh']
    issue_titles_info = union.transformer_list[1][1].named_steps['tfidf']
    pr_titles_info = union.transformer_list[2][1].named_steps['tfidf']
    cont_data_features = [item for item, boolean in
                          zip(cont_feat_labels,
                              cont_data_info.get_support().tolist())
                          if boolean is True]
    issue_title_features = issue_titles_info.get_feature_names()
    pr_title_features = pr_titles_info.get_feature_names()
    # chain all features into one iterator, where iterator consists of feature,
    # type tuples
    all_features = chain(izip_longest(cont_data_features, [],
                                      fillvalue='non_text'),
                         izip_longest(issue_title_features, [],
                                      fillvalue='issue_title'),
                         izip_longest(pr_title_features, [],
                                      fillvalue='pr_title'))
    # lowest rank is strongest, i.e. highest |coeff|
    ranks = map(lambda x: len(coefs) - x,
                scipy.stats.rankdata(
                    np.absolute(coefs),
                    method='ordinal'))
    feature_coeff_rank_pairs = [{'feat': item[0], 'type': item[1],
                                 'coeff': coeff,
                                 'rank': rank}
                                for item, coeff, rank in
                                zip(all_features, coefs, ranks)]
    best_feat_with_coef_ = heapq.nsmallest(1000, feature_coeff_rank_pairs,
                                           key=lambda x: x['rank'])
    best_feat_with_coef_df = pd.DataFrame(best_feat_with_coef_)
    cont_ranks_with_coef_ = feature_coeff_rank_pairs[:len(cont_data_features)]
    cont_ranks_with_coef_df = pd.DataFrame(cont_ranks_with_coef_)

    # Part3: create prediction dataframes

    cv = KFold(n_splits=3, shuffle=True,
               random_state=RANDOM_STATE)  # shuffle so that random_state
    # actually takes effect, keeping the folds reproducible
    # important note--we pass this same object to all functions below, in order
    # to have an identical collection of folds for all functions
    test_prediction = combined_features.predict(test_x)
    test_pred_df = pd.DataFrame(data={'target_commits': test_prediction})
    train_actual = train_xy.ix[:, -1]

    # cross validated prediction--see how our model performs against 3
    # folds generated from the training set, in order to approximately see how
    # our model will perform on
    # when run against our test set.

    train_prediction = cross_val_predict(combined_features, train_xy.ix[:, :-1],
                                         train_xy.ix[:, -1], cv=cv)
    train_pred_and_actual_df = pd.DataFrame(
        {'target_commits_train_actual': train_actual,
         'target_commits_train_pred': train_prediction})
    # Part4: error computation.
    rmse = np.sqrt(mean_squared_error(train_actual, train_prediction))
    return [test_pred_df, train_pred_and_actual_df, rmse,
            best_feat_with_coef_df,
            cont_ranks_with_coef_df]


# model outputs:
# 1. Test prediction,
# 2. Training set with hat(y) and actual y. Note: cross_val_predict produces
#    hat(y) for each sample from the model fitted on the folds that exclude
#    that sample. Hence, plots of hat(y) vs y are a sensible approximation of
#    how the real test target_commits will compare with our predictions
# 3. RMSE of the cross-validated predictions
# 4. best features, with ranks and regression coefs,
# 5. continuous features, with ranks and regression coefficients
test_pred, train_pred_actual, \
rmse1, best_feat_with_coef, \
cont_ranks_with_coef = output_test_train_prediction(
    outlier_filter_if(
        train_data_drop_col, feats_to_consider=non_text_feats),
    test_data_drop_col, non_text_feats[:-1],
    SAMPLE_FRAC)

Model Evaluation

Linear Regression With Outlier Filtering

Our model has generated an RMSE on the training data, a DataFrame ranking the features by their impact on the number of target_commits (the metric is the magnitude of the regression coefficient), and a DataFrame of the ranks of the non_text features. We now display them, respectively.

print({'RMSE': rmse1})
{'RMSE': 31.872082664174258}
display(best_feat_with_coef)
display(cont_ranks_with_coef)
coeff feat rank type
0 353.408477 html5validator 0 pr_title
1 319.178148 rewritten 1 issue_title
2 270.399477 hc 2 pr_title
3 264.217394 sindhi 3 pr_title
4 264.217394 get_currentuserinfo 4 pr_title
5 261.216512 dkms 5 pr_title
6 261.216512 bitdefender 6 pr_title
7 258.149520 pardot 7 pr_title
8 258.149520 fieldaware 8 pr_title
9 215.343756 sphincs 9 issue_title
10 215.343756 forwardref 10 issue_title
11 215.343756 consists 11 issue_title
12 215.343756 appcomponent 12 issue_title
13 210.631906 whiskers 13 pr_title
14 210.631906 tenders 14 pr_title
15 210.631906 nr 15 pr_title
16 210.631906 jslinting 16 pr_title
17 210.631906 infowindow 17 pr_title
18 210.631906 freetext 18 pr_title
19 210.631906 ebid 19 pr_title
20 210.631906 bids 20 pr_title
21 210.631906 bidperiod 21 pr_title
22 210.631906 awards 22 pr_title
23 205.526881 stepic 23 pr_title
24 205.526881 educational 24 pr_title
25 198.872260 subpages 25 pr_title
26 193.042166 antivirus 26 pr_title
27 191.495169 sindhi 27 issue_title
28 191.495169 gp_translation 28 issue_title
29 184.099065 cit_magic3 29 issue_title
... ... ... ... ...
970 62.443719 gravityview_edit_entry_render 970 issue_title
971 62.443719 gravityflow 971 issue_title
972 62.443719 edd_post_add_to_cart 972 issue_title
973 62.443719 edd_payment 973 issue_title
974 62.443719 edd_get_payment 974 issue_title
975 62.443719 edd_get_ip 975 issue_title
976 62.443719 edd_get_cart_discounts_html 976 issue_title
977 62.443719 date_created 977 issue_title
978 61.831541 happened 978 pr_title
979 61.742316 finishes 979 pr_title
980 61.356828 updatelinteruidebouncer 980 pr_title
981 61.356828 quickfixview 981 pr_title
982 61.356828 mainpaths 982 pr_title
983 61.356828 getfixesatcursorposition 983 pr_title
984 61.356828 elmjutsu 984 pr_title
985 61.192136 cilium 985 pr_title
986 -61.124326 verbosely 986 pr_title
987 -61.124326 upgreade 987 pr_title
988 -61.124326 troubleshoot 988 pr_title
989 -61.124326 syslinux 989 pr_title
990 -61.124326 ssllabs 990 pr_title
991 -61.124326 s6 991 pr_title
992 -61.124326 radclient 992 pr_title
993 -61.124326 logspout 993 pr_title
994 -61.124326 jordan 994 pr_title
995 -61.124326 hostfile 995 pr_title
996 -61.124326 freeradius 996 pr_title
997 -61.124326 dmi 997 pr_title
998 -61.124326 autostager 998 pr_title
999 -61.124326 2i 999 pr_title

1000 rows × 4 columns

coeff feat rank type
0 -0.000029 user_age_days 84101 non_text
1 -0.061859 is_org_member 83125 non_text
2 0.013421 total_repos 83666 non_text
3 0.003082 total_followers 84037 non_text
4 -0.001266 total_stars_given 84066 non_text
5 0.033636 total_commit_comments 83444 non_text
6 0.004059 total_commits 84023 non_text
7 0.001781 total_inline_pr_comments 84058 non_text
8 0.001247 total_issue_comments 84067 non_text
9 0.005186 total_issues 83995 non_text
10 0.002839 total_prs 84048 non_text
11 0.003126 total_pushes 84036 non_text

The RMSE, relative to the mean and spread of target_commits, is quite low, which is promising.
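
To make that comparison concrete (a rough yardstick only), we can print the RMSE alongside the mean and standard deviation of target_commits in the training data:

# context for the RMSE: mean and spread of the target in the training data
target_stats = train_data_drop_col['target_commits'].describe()
print({'RMSE': rmse1,
       'target_commits mean': target_stats['mean'],
       'target_commits std': target_stats['std']})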

To get a better overview of our model’s fit, we plot, for each sample, the actual value against the prediction (the x-axis value is the actual, while the y-axis value is the prediction). A perfect prediction would correspond to all points in the plot below lying on the line y=x.

def plot_pred_vs_actual(y_actual, y_pred):
    pred_vs_actual = pd.concat([y_actual.reset_index(drop=True),
                                y_pred.reset_index(drop=True)], axis=1)
    pred_vs_actual.columns = ['Actual_Target_Commits', 'Predicted']
    pred_vs_actual.reset_index(inplace=True)
    p = ggplot(pred_vs_actual,
               aes(x='Actual_Target_Commits', y='Predicted')) + \
        ggtitle("Predicted Target Commits Vs. Actual") + geom_point(
        color='blue') + \
        geom_abline(slope=1, intercept=0, color='red') + \
        xlim(0, 100) + ylim(0, 100)
    return p


p = plot_pred_vs_actual(train_pred_actual['target_commits_train_actual'],
                        train_pred_actual['target_commits_train_pred'])

print(p)

[Figure: predicted vs. actual target_commits for the linear regression, with the line y=x]


Note that our model does poorly at predicting large target_commit numbers, indicating a linear model may not be the best choice for this dataset.

Recall that the magnitudes of the regression coefficients for the non_text features are close to 0, and are dwarfed by the magnitudes of some of the text features we extracted from the ‘pr_titles’ and ‘issue_titles’ columns. Since we asked scikit-learn to run a normalized Linear Regression (see the LinearRegression passed into the model pipeline defined earlier) and since the sample size is large, the coefficients are on the same ‘scale’ (i.e. we do not need to divide them by their standard errors before comparing them with each other).

However, this may all be moot: since the model does not fit the large target_commits (the mean and above) as well as the smaller ones, we will have to take the regression weights with a grain of salt. Nonlinear interactions between the continuous features are perhaps not being captured by our model, and the model may not be capturing what precisely drives large target_commit numbers. A more complex model, such as a RandomForest, may be a better fit.

RandomForest With Outlier Filtering

Proceeding, we now run a RandomForest.

Important Note

In experiments, I ran the model on the full data set. It turned out to be too computationally heavy while debugging the code (either the computation did not terminate at all without user intervention, or did not terminate in a reasonable amount of time). The machine it was running on had the following specifications:

  • Model Name: MacBook Pro
  • Model Identifier: MacBookPro8,1
  • Processor Name: Intel Core i5
  • Processor Speed: 2.3 GHz
  • Number of Processors: 1
  • Total Number of Cores: 2
  • L2 Cache (per Core): 256 KB
  • L3 Cache: 3 MB
  • Memory: 8 GB
  • Boot ROM Version: MBP81.0047.B32
  • SMC Version (system): 1.68f99
  • Serial Number (system): C17G72S8DRJ7
  • Hardware UUID: B8EB34E9-952B-535D-8618-F5B2695F7257
  • Sudden Motion Sensor: State: Enabled

To avoid this difficulty, I randomly subsampled the dataset while debugging, before passing it to the model. After some iteration and debugging, runtime improved significantly for all model runs (to the point where I could set their sample_frac to 1), with the exception of this final one. The reason for wanting to avoid subsampling is that the RMSE improves as more samples are used.

# model output

test_pred_d, train_pred_actual_d, \
rmse_d, best_feat_with_coef_d, \
cont_ranks_with_coef_d = \
    output_test_train_prediction(
        outlier_filter_if(
            train_data_drop_col, feats_to_consider=non_text_feats),
        test_data_drop_col, non_text_feats[:-1],
        SAMPLE_FRAC, regressor='rf')

Again, recall that our model generates an RMSE on the training data (with nulls eliminated by dropping features), a DataFrame ranking the features in terms of their influence on the predictive model, and a DataFrame of the ranks of the non_text features. We now display them, respectively.

print({'RMSE': rmse_d})
{'RMSE': 23.494051688317477}
display(best_feat_with_coef_d)
display(cont_ranks_with_coef_d)
coeff feat rank type
0 0.436139 total_commits 0 non_text
1 0.008839 total_repos 1 non_text
2 0.007887 user_age_days 2 non_text
3 0.007818 bug 3 issue_title
4 0.007503 total_followers 4 non_text
5 0.007166 daemon 5 issue_title
6 0.006656 interface 6 issue_title
7 0.006283 implement 7 issue_title
8 0.005377 method 8 issue_title
9 0.004871 issues 9 issue_title
10 0.004393 laser 10 pr_title
11 0.003812 variable 11 issue_title
12 0.003736 total_issue_comments 12 non_text
13 0.003525 web 13 issue_title
14 0.003466 total_stars_given 14 non_text
15 0.003024 converting 15 issue_title
16 0.002989 complete 16 issue_title
17 0.002801 total_pushes 17 non_text
18 0.002681 document 18 issue_title
19 0.002680 add 19 issue_title
20 0.002640 uninstall 20 issue_title
21 0.002591 stuff 21 pr_title
22 0.002572 fix 22 pr_title
23 0.002561 crash 23 issue_title
24 0.002545 custom 24 issue_title
25 0.002399 xdebug 25 pr_title
26 0.002387 list 26 issue_title
27 0.002222 total_prs 27 non_text
28 0.002214 options 28 issue_title
29 0.002169 api 29 issue_title
... ... ... ... ...
970 0.000112 specific 970 issue_title
971 0.000112 module 971 issue_title
972 0.000111 istanbul 972 pr_title
973 0.000111 simplified 973 pr_title
974 0.000111 filterer 974 pr_title
975 0.000111 effect 975 issue_title
976 0.000111 recognize 976 pr_title
977 0.000111 super 977 issue_title
978 0.000111 md 978 issue_title
979 0.000111 showing 979 pr_title
980 0.000110 pattern 980 pr_title
981 0.000110 theme 981 issue_title
982 0.000110 functions 982 issue_title
983 0.000110 innerheightの値になっていた問題を修正 983 pr_title
984 0.000109 causes 984 issue_title
985 0.000109 unnecessary 985 issue_title
986 0.000109 documents 986 issue_title
987 0.000109 character 987 issue_title
988 0.000109 config 988 issue_title
989 0.000109 run 989 issue_title
990 0.000109 nullpointerexception 990 issue_title
991 0.000109 speed 991 pr_title
992 0.000109 filters 992 issue_title
993 0.000109 recipe 993 pr_title
994 0.000108 like 994 issue_title
995 0.000108 model 995 issue_title
996 0.000108 component 996 issue_title
997 0.000108 summaries 997 issue_title
998 0.000108 glossary 998 pr_title
999 0.000108 built 999 issue_title

1000 rows × 4 columns

coeff feat rank type
0 0.007887 user_age_days 2 non_text
1 0.001606 is_org_member 48 non_text
2 0.008839 total_repos 1 non_text
3 0.007503 total_followers 4 non_text
4 0.003466 total_stars_given 14 non_text
5 0.001367 total_commit_comments 61 non_text
6 0.436139 total_commits 0 non_text
7 0.001514 total_inline_pr_comments 54 non_text
8 0.003736 total_issue_comments 12 non_text
9 0.001835 total_issues 38 non_text
10 0.002222 total_prs 27 non_text
11 0.002801 total_pushes 17 non_text

We plot the actual vs the prediction:

plot_pred_vs_actual(train_pred_actual_d['target_commits_train_actual'],
                    train_pred_actual_d['target_commits_train_pred'])

[Figure: predicted vs. actual target_commits for the random forest, with the line y=x]


Conclusion, and Test Set Prediction Output

It is clear the fit has improved over the previous model, and it is more balanced in the sense that it has more predictive power with respect to larger target_commits. Furthermore, observe that ‘total_commits’ is the most important feature, roughly fifty times more important than the next most important feature (total_repos). This result accords with the PCA analysis we did earlier.
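
As a quick, hedged check of that claim, we can inspect the loadings of the first principal component from the earlier PCA fit; if the large count features such as total_commits dominate it, the PCA and the forest's importances tell a consistent story:

# sketch: loadings of the first principal component, sorted by magnitude
loadings = pd.Series(pca_fit.components_[0], index=non_text_feats[:-1])
print(loadings.reindex(loadings.abs().sort_values(ascending=False).index))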

We explain the difference in the driving features between this model and the Linear Regression as follows: intuitively, a user's total_commits should be strongly positively correlated with their target_commits; however, data for users with large target_commits is sparse relative to users with a low number of target_commits. Due partially to this, the LinearRegression failed to capture the significance of this behavior (a Ridge regression might do better, by penalizing the large text coefficients and thereby allowing the non_text components to have more impact; a sketch follows below).
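
A minimal sketch of that Ridge variant (not run here; the alpha value is an arbitrary placeholder that would need tuning):

# sketch: swap the final LinearRegression step for a Ridge regressor
from sklearn.linear_model import Ridge


def model_ridge(cont_feat_labels=None, alpha=1.0):
    pipe = model(cont_feat_labels, regressor='lr')
    # replace the final ('lr', LinearRegression) step in the pipeline
    pipe.steps[-1] = ('ridge', Ridge(alpha=alpha, normalize=True))
    return pipe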

We conclude by writing this model’s predictions on the test set to disk.

# Recall, we only dropped columns from test_data_orig in
# our final iteration
# of our model, so the test prediction
# order and the huids from our original test data set will line up correctly.
# Very important
test_pred_d.insert(0, column='huid', value=test_data_orig['huid'])
test_pred_d.to_csv("test_output.csv", index=False)

With More Time

Given more time and greater processing power for this data challenge, I would:

1. Rerun the final model, but without subsampling. This should improve the RMSE, since more sample data is fed into the model.
2. Experiment with many non-parametric models by putting them in a scikit-learn ensemble; this would allow me to quickly hypertune models via a cross-validated grid search, evaluate the hypertuned models on the data, and choose the model with the best RMSE score.
3. Investigate fitting a Ridge Regression, or writing and applying a Negative Binomial regressor.
4. Investigate, at a causal level, why certain text features influence the number of target commits. How are they related to ‘total_commits’? What do certain text features have to do with the nature of GitHub’s business, and what do they tell us about the types of products GitHub should be providing, or abandoning?
5. Try to feature-engineer another useful feature from the non-textual data. One approach would be to use the PolynomialFeatures module in scikit-learn to generate polynomial combinations of the existing features (see the sketch below).
6. Extract combinations of words as features, i.e. have the TfidfVectorizer vectorize over not just 1-grams, but n-grams (also sketched below).
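
A minimal sketch of items 5 and 6 (illustrative parameters only):

# item 6: n-grams (here, unigrams and bigrams) in the TfidfVectorizer
bigram_tfidf = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, lowercase=True,
                               ngram_range=(1, 2))

# item 5: polynomial combinations of the existing non-text features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_feats = poly.fit_transform(train_data_drop_col[non_text_feats[:-1]])
print(poly_feats.shape)  # the 12 original features expand to 90 columns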