The purpose of this project is to create a model that, given an Instagram user’s profile, predicts the user’s gender as accurately as possible. The motivation for this undertaking is to be able to target Instagram users of specific demographics for marketing purposes. The model is trained by passing labeled text-based profile data through a tuned logistic regression model. The model parameters are optimized using the AUROC metric to reduce variability in the precision and recall of predictions for each gender. The resulting model achieves 90% overall accuracy on a dataset of 20,000 examples, though its recall differs substantially between genders.
Introduction¶
All supporting files for this project can be found in its GitHub repository.
This project write-up assumes that the reader has a basic understanding of machine learning and statistics concepts including logistic regression, word encodings including bag-of-words, and the terminology surrounding false/true positives/negatives.
The following high-level details are of note:
- Instagram profiles are mostly text. The modeling methods that typically perform best for text classification are regressions and neural nets. This project employs logistic regression as its model of choice.
- The model is designed to perform equally well on both genders. This stipulation was a business constraint; its most significant consequence was replacing accuracy with AUROC as the cross-validation optimization metric.
- The data pipeline makes heavy use of n-grams and both word and character encodings. These more complicated bag-of-words encodings improve results substantially over simpler 1-gram word-based encodings; a brief illustration follows this list.
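As a minimal illustration (not part of the project pipeline) of how word and character n-gram encodings differ, the sketch below vectorizes a single toy caption both ways; the `vocabulary_` attribute of a fitted `CountVectorizer` shows which features each scheme produces:
from sklearn.feature_extraction.text import CountVectorizer

sample = ['brunch with my girls']
# Word 1-grams and 2-grams, e.g. 'brunch', 'brunch with', 'with my', ...
word_vect = CountVectorizer(ngram_range=(1, 2)).fit(sample)
# Character 1- to 3-grams, e.g. 'b', 'br', 'bru', ... (unlike the default
# word tokenizer, the character analyzer also keeps emojis).
char_vect = CountVectorizer(analyzer='char', ngram_range=(1, 3)).fit(sample)
print(sorted(word_vect.vocabulary_))
print(sorted(char_vect.vocabulary_))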
Due to the difficulty of obtaining reliable data about genders other than male and female, and the lack of marketing value in these smaller demographics, the following analysis eschews these additional labels. Rest assured, this omission is for economic as opposed to political or social reasons.
For reasons including logistical difficulty and the data constituting business trade secrets, the labeled profiles cannot be posted publicly. One way in which the results of this project can be replicated is by querying the Instagram User API and then labeling the data using Amazon Mechanical Turk.
Data Engineering and Cleaning¶
The code in the following section loads, organizes, and formats the labeled training data in such a way that it can later be passed into an off-the-shelf model.
"""
Jupyter notebook boilerplate setup code.
"""
%matplotlib inline
%load_ext autoreload
%autoreload 2
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
"""
Files from which to load datasets and labels. In this example, the labels
are separate from the rest of the user profile data, and these data are
related using a dictionary.
"""
DATA_FILES = {
    'doug_labeled_user_batch.json': 'doug_labels.json',
    'doug_finaly_labeled_cleaned_batch_2.json': 'doug_labels_batch_2.json'
}
"""
Load example data using Pandas.
"""
datasets = []
for profiles_file, labels_file in DATA_FILES.items():
    datasets.append({
        'profiles': pd.read_json(profiles_file, encoding='ISO-8859-1'),
        'labels': pd.read_json(labels_file, encoding='ISO-8859-1')
    })
"Loaded %d datasets" % len(DATA_FILES.items())
Check the loaded data¶
It’s best to examine the loaded data to verify that it is in the expected format.
total_examples = 0
for dataset in datasets:
    total_examples += len(dataset['profiles'])
"%d total examples" % total_examples
datasets[0]['profiles'].head()
datasets[0]['labels'].head()
Notice the presence of the `id` field in both datasets. This field is the one upon which the profile and label data will be merged.
Combine the profiles with the labels¶
In this step, the profiles and labels are merged on the `id` field present in each. The datasets are then concatenated to produce one large set of examples.
for dataset in datasets:
    dataset['merged'] = pd.merge(left=dataset['profiles'],
                                 right=dataset['labels'],
                                 left_on='id',
                                 right_on='id')
data = pd.concat(map(lambda dataset: dataset['merged'], datasets))
data.fillna(value='', inplace=True)
data.head()
Format `gender` to be compatible with AUROC¶
The `gender` field is encoded as a string with the value “male” or “female”. The AUROC optimization metric requires a binary (0 or 1) value. The code below re-encodes `gender` into a new `gender_enc` field in which males are encoded with the value 0 and females with the value 1.
data['gender_enc'] = data.apply(
    lambda x: 0 if x['gender'] == 'male' else 1,
    axis=1)
data[['gender', 'gender_enc']].head()
Prepare the `writing_example` field¶
One of the fields provided in the dataset is `media`, which is an array of metadata about each user’s photos, including the captions. The `caption` field provides a point of leverage because it is the one section of the user’s profile in which they can write freeform text. This field is also high leverage for the project because the way in which the data are prepared affects the results substantially.
Intuitively, there are two methods to vectorize the `caption`s:
- Each `caption` could be encoded in isolation, with each one treated as an entirely different field.
- The `caption`s could all be concatenated and encoded together.
Option #2 makes the most sense because there is no natural ordering of captions; for any given two users, there should be nothing in common between each of their first photo captions, or each of their second photo captions, etc.
It is possible that information can be lost by combining captions as in option #2. One example of this loss of data is when two photo captions with completely different sentiments are concatenated; however, the gender of the user who wrote the captions remains constant, which is the important detail.
def extract_writing_example(row):
    captions = []
    for medium in row.media['nodes']:
        if 'caption' not in medium:
            continue
        caption = medium['caption']
        if caption is not None:
            captions.append(caption)
    return ' '.join(captions)
data['writing_example'] = data.apply(lambda x: extract_writing_example(x), axis=1)
data[['username', 'writing_example']].head()
Prepare the `hash_tags` field¶
This section may seem unnecessary because hash tags will already be detected and appropriately prioritized by the `writing_example` vectorizer. The impetus for synthesizing a separate `hash_tags` field is to be able to apply additional constraints to its vectorizer. One example of a beneficial tuning is binarizing the field instead of retaining the number of times each hash tag occurs in a given `writing_example`.
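As a minimal sketch (toy strings, not the project’s data) of what binarizing changes, compare a counting vectorizer with its binary counterpart on a repeated tag:
from sklearn.feature_extraction.text import CountVectorizer

docs = ['ootd ootd fitness']
print(CountVectorizer().fit_transform(docs).toarray())              # [[1 2]]: raw counts
print(CountVectorizer(binary=True).fit_transform(docs).toarray())   # [[1 1]]: presence only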
import re

def extract_hash_tags(row):
    hash_tags = re.findall('#[a-zA-Z]*', row['writing_example'])
    hash_tags = [hash_tag.replace('#', '') for hash_tag in hash_tags]
    return ' '.join(hash_tags)
data['hash_tags'] = data.apply(lambda x: extract_hash_tags(x), axis=1)
data[data['hash_tags'].apply(lambda x: len(x) > 0)][['writing_example', 'hash_tags']].head()
Prepare the `first_name` field¶
The `full_name` field is deceptively ill-suited for bag-of-words encoding. Recall that 1-gram bag-of-words encoding discards information about where each word occurs in the text. Furthermore, people often have last names that could function as first names for the opposite gender, e.g. “Patricia James.” This scenario would be particularly bad if the weighting for the association of “James” with “male” were stronger than “Patricia”’s association with “female.”
To solve this problem, the `first_name` field is extracted and encoded separately as a crude compensation for the loss of positional information when encoding with bag-of-words. The use of n-grams does not solve this problem because very few people share the same `full_name`, so the model would be overfitted to the training data. The reason that the `first_name` field does not entirely replace the `full_name` field is that the latter may contain emojis or middle names with which predictions can be improved.
def extract_first_name(row):
    return row['full_name'].split(' ', 1)[0]
data['first_name'] = data.apply(lambda x: extract_first_name(x), axis=1)
data[['full_name', 'first_name']].head()
Data Exploration¶
The section that follows will more deeply explore the dataset to identify the data’s features and trends. The goal of this investigation is to identify which encodings are optimal for analysis.
Sample the fully formatted data¶
Looking at a subset of the dataset with all of the fields properly formatted is the best way to spot obvious relationships and potential paths forward.
data = data[['username', 'first_name', 'full_name', 'biography', 'writing_example', 'hash_tags', 'gender', 'gender_enc']]
data.head(10)
Several patterns are immediately apparent in this dataset:
- Many users make liberal use of emojis. One path forward is to perform character vectorization of all fields in which users may enter emojis.
- Users write freeform text with context. It may be beneficial to encode any user-inputted fields as n-grams rather than the default 1-grams to retain this context.
- Not everyone has a `writing_example`, but almost everyone has filled in at least one text field. This observation is good news for the model; users without any user-entered text fields filled in are typically less useful for marketing purposes.
These observations could be validated by checking their correctness statistically instead of visually from the small sample of ten examples.
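As one hedged sketch of such a check (reusing the `data` frame built above), the fraction of users missing a `writing_example` who nonetheless filled in some other text field can be computed directly:
text_fields = ['first_name', 'full_name', 'biography', 'writing_example', 'hash_tags']
has_text = data[text_fields].apply(lambda col: col.str.len() > 0)
no_writing = ~has_text['writing_example']
print("Users without a writing_example: %.1f%%" % (100 * no_writing.mean()))
print("...of whom filled in another text field: %.1f%%" % (
    100 * has_text[no_writing].drop(columns='writing_example').any(axis=1).mean()))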
Chart the frequency of word counts for each field and each gender¶
Following observation #3 above, charting the frequency with which users of each gender fill in text-based fields will give some idea as to how reliable the use of these fields would be.
field_lens_to_plot = ['writing_example', 'biography', 'full_name', 'hash_tags']
for field in field_lens_to_plot:
    for gender in [0, 1]:
        plt.figure()
        # Copy the slice so that adding the length column does not trigger
        # pandas' SettingWithCopyWarning on a view of the original frame.
        gender_data = data[data['gender_enc'] == gender].copy()
        gender_data["%s_len" % field] = gender_data[field].apply(
            lambda x: len(re.findall(r'\w+', x)))
        gender_data["%s_len" % field].plot(
            title="Frequency of %s for %s" % (field, 'males' if gender == 0 else 'females'),
            kind='hist', color='blue' if gender == 0 else 'red')
The only field with a substantial discrepancy between genders is `biography`: females seem to be more likely to enter more text. Unfortunately, this finding was a red herring; when the trained model included the length of the `biography` field as a predictor, there was no significant difference in predictive power.
Check for imbalances in the genders¶
A substantial imbalance in the data may require intervention.
"Number of males: %d; Number of females: %d" % (
len(data[data['gender_enc'] == 0]),
len(data[data['gender_enc'] == 1])
)
There is an imbalance in gender representation within the dataset, but the lopsidedness is insufficient to warrant drastic measures. One way in which the analysis can be made more robust is by using the AUROC metric in place of accuracy for model optimization. This technique is typically used to compensate for acute asymmetry in the data, but it can also be employed for less extreme corrections. One challenge with AUROC is that it is limited to binary classification, which would complicate extending the model later to support more than the binary genders.
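A minimal, self-contained sketch (toy labels, not the project’s data) of why AUROC is informative under imbalance: a degenerate classifier that assigns every example the same score achieves high accuracy on a lopsided dataset but only chance-level AUROC.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([1] * 90 + [0] * 10)    # 90/10 class imbalance
scores = np.full(100, 0.9)                # degenerate model: same score for everyone
y_pred = (scores >= 0.5).astype(int)      # always predicts the majority class
print(accuracy_score(y_true, y_pred))     # 0.90: misleadingly strong
print(roc_auc_score(y_true, scores))      # 0.50: no discriminative power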
Plot the most predictive words for each field¶
In this section, untuned logistic regression models are trained on each field in isolation, and the most extreme weights are plotted. This illustration is not particularly useful or actionable, but it is interesting.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from model_performance_plotter import plot_coefficients
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Biography Most Predictive Terms',
                  data['biography'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Writing Example Most Predictive Terms',
                  data['writing_example'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Full Name Most Predictive Terms',
                  data['full_name'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Hash Tags Most Predictive Terms',
                  data['hash_tags'], data['gender_enc'])
Model Training and Validation¶
The final step is to train and validate the model. In practice, this section took many iterations to reach its current state.
Split the main dataset into training and test datasets¶
scikit-learn’s grid search will automatically create cross-validation folds from the training dataset, so only the test dataset must be split off manually.
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
print("Train data set size: %d" % len(train))
print("Test data set size: %d" % len(test))
parameters = {
    # All candidate hyperparameters below are commented out, so the grid
    # search fits the pipeline once with the values hard-coded in its
    # definition. Uncomment entries to re-run the search over them.
    # 'clf__solver': ['liblinear', 'lbfgs', 'newton-cg', 'saga'],
    # 'clf__loss': ['squared_hinge', 'hinge'],
    # 'clf__penalty': ['l1', 'l2'],
    # 'clf__C': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
    # 'clf__C': [10, 15, 20, 25, 30, 35, 50, 80, 100, 120, 150],
    # 'clf__dual': [False, True],
    # 'clf__class_weight': [None, 'balanced'],
}
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from transformers import ItemSelector
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction import DictVectorizer
tfidf_transformer = TfidfTransformer()
encoding_args = {
    'decode_error': 'replace',
    'strip_accents': 'unicode',
}
word_vectorizer_args = {
    **encoding_args,
    'ngram_range': (1, 2)
}
char_vectorizer_args = {
    **encoding_args,
    'analyzer': 'char',
    'ngram_range': (1, 3)
}
word_vectorizer = CountVectorizer(**word_vectorizer_args)
char_vectorizer = CountVectorizer(**char_vectorizer_args)
transformers = {
    'username': {
        'char': char_vectorizer
    },
    'biography': {
        'word': word_vectorizer,
        'char': char_vectorizer
    },
    'full_name': {
        'word': CountVectorizer(**encoding_args),
        'char': char_vectorizer
    },
    'first_name': {
        'word': CountVectorizer(**encoding_args)
    },
    'hash_tags': {
        'word': CountVectorizer(**encoding_args, binary=True),
        'char': CountVectorizer(**char_vectorizer_args, binary=True)
    },
    'writing_example': {
        'word': word_vectorizer,
        'char': char_vectorizer
    }
}
transformer_list = []
for key, transformer_types in transformers.items():
    for transformer_type, transformer in transformer_types.items():
        transformer_list.append(
            ("%s_%s" % (key, transformer_type), Pipeline([
                ('selector', ItemSelector(key=key)),
                ('vect', transformer),
                ('tfidf', tfidf_transformer)
            ]))
        )
pipeline = Pipeline([
    ('union', FeatureUnion(transformer_list=transformer_list)),
    ('clf', LogisticRegression(C=150))
])
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=10, scoring=scoring, refit='AUC')
grid_search.fit(train, train['gender_enc'])
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
score = grid_search.score(test, test['gender_enc'])
print("Test score: %f" % score)
y_pred = grid_search.predict(test)
print("Test accuracy: %f" % accuracy_score(test['gender_enc'], y_pred))
# Use this to assess the probability of each classification.
# grid_search.predict_proba(test)
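For example, a hedged sketch of thresholding those probabilities (reusing the fitted `grid_search` and `test` objects above; `predict_proba` is available because logistic regression is the pipeline’s final step):
probs = grid_search.predict_proba(test)       # shape (n_examples, 2); column 1 is P(female)
confident_females = test[probs[:, 1] > 0.9]   # keep only high-confidence female predictions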
from sklearn.metrics import classification_report
print(classification_report(test['gender_enc'], y_pred))
from model_performance_plotter import plot_learning_curve, \
    plot_roc_curve, \
    plot_precision_recall_curve
title = 'Gender Classifier'
plot_roc_curve(title, y_pred, test['gender_enc'])
plot_precision_recall_curve(title, test['gender_enc'], grid_search.decision_function(test))
plot_learning_curve(grid_search.best_estimator_, title, train, train['gender_enc'])
Examine cases where the model makes correct predictions¶
It is good practice to verify that the model is making reasonable predictions and that the labels were accurate.
test[test['gender_enc'] == y_pred].sample(10)
Examine cases where the model makes incorrect predictions¶
It is also good practice to investigate the cases for which the model makes incorrect predictions. Note that in the list below, the `gender` field is the true label, and the opposite of this label is what the model predicted. The majority of these mistakes are due to incorrect labels.
test[test['gender_enc'] != y_pred].sample(10)
# Note: `sklearn.externals.joblib` was deprecated in scikit-learn 0.21;
# on newer versions, use `import joblib` directly.
from sklearn.externals import joblib
MODEL_FILE = 'ig_gender_classifier.pkl'
joblib.dump(grid_search, MODEL_FILE)
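To reuse the persisted model later, the pickle can be reloaded and applied to freshly prepared profiles. A minimal sketch, assuming a hypothetical DataFrame `new_profiles` with the same columns as the training data:
# `new_profiles` is a hypothetical DataFrame prepared exactly like `data` above.
loaded_model = joblib.load(MODEL_FILE)
predicted_genders = loaded_model.predict(new_profiles)  # 0 = male, 1 = female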
Conclusion¶
The model achieves 90% accuracy, with 90% precision and 85% recall for males, and 88% precision and 93% recall for females; it is therefore slightly better at picking out females.
In the future, this project could be improved in the following ways:
- Investigating why the model performs better on females than males. One possible cause for this discrepancy is that there are more females in the dataset, so the model has more data with which to identify females.
- Translating non-English text to English and then passing that through the model. One way to look at translation is that it is a poor man’s form of PCA; the model could share the weights of English terms rather than being spread thin across every input language. This experiment was attempted, but it proved too slow due to the need for a web request for every example.
- Redoing the project with a neural net instead of logistic regression. Neural nets typically require at least 50,000 to 100,000 examples to perform substantially better than classical models. This experiment was attempted early on in the project, but failed due to an insufficient number of examples.
- Incorporating user photos into the model via ensemble methods. Computer vision is expensive and slow, so this addition is unlikely to add substantial value to the end result.