The purpose of this project is to create a model that, given an Instagram user’s profile, predicts the user’s gender as accurately as possible. The motivation for this undertaking is to be able to target Instagram users of specific demographics for marketing purposes. The model is trained by passing labeled text-based profile data through a tuned logistic regression model. The model parameters are optimized using the AUROC metric to reduce variability in the precision and recall of predictions for each gender. The resulting model achieves 90% overall accuracy on a dataset of 20,000 examples, though its recall differs substantially between genders.
Introduction¶
All supporting files for this project can be found in its GitHub repository.
This project write-up assumes that the reader has a basic understanding of machine learning and statistics concepts including logistic regression, word encodings including bag-of-words, and the terminology surrounding false/true positives/negatives.
The following high-level details are of note:
- Instagram profiles are mostly text. The modeling methods that typically perform best for text classification are regressions and neural nets. This project employs logistic regression as its model of choice.
- The model is designed to perform equally well on both genders. This stipulation was a business constraint; its most significant consequence was replacing accuracy with AUROC as the cross-validation optimization metric.
- The data pipeline makes heavy use of n-grams and both word and character encodings. These more complicated bag-of-words encodings improve results substantially over simpler 1-gram word-based encodings; a brief illustration follows this list.
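As a minimal illustration (not part of the project pipeline) of how word and character n-gram encodings differ, the sketch below vectorizes a single toy caption both ways; the `vocabulary_` attribute of a fitted `CountVectorizer` shows which features each scheme produces:
from sklearn.feature_extraction.text import CountVectorizer

sample = ['brunch with my girls']
# Word 1-grams and 2-grams, e.g. 'brunch', 'brunch with', 'with my', ...
word_vect = CountVectorizer(ngram_range=(1, 2)).fit(sample)
# Character 1- to 3-grams, e.g. 'b', 'br', 'bru', ... (unlike the default
# word tokenizer, the character analyzer also keeps emojis).
char_vect = CountVectorizer(analyzer='char', ngram_range=(1, 3)).fit(sample)
print(sorted(word_vect.vocabulary_))
print(sorted(char_vect.vocabulary_))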
Due to the difficulty of obtaining reliable data about genders other than male and female, and the lack of marketing value in these smaller demographics, the following analysis eschews these additional labels. Rest assured, this omission is for economic as opposed to political or social reasons.
For reasons including logistical difficulty and the data constituting business trade secrets, the labeled profiles cannot be posted publicly. One way in which the results of this project can be replicated is by querying the Instagram User API and then labeling the data using Amazon Mechanical Turk.
Data Engineering and Cleaning¶
The code in the following section loads, organizes, and formats the labeled training data in such a way that it can later be passed into an off-the-shelf model.
"""
Jupyter notebook boilerplate setup code.
"""
%matplotlib inline
%load_ext autoreload
%autoreload 2
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
"""
Files from which to load datasets and labels. In this example, the labels
are separate from the rest of the user profile data, and these data are
related using a dictionary.
"""
DATA_FILES = {
    'doug_labeled_user_batch.json': 'doug_labels.json',
    'doug_finaly_labeled_cleaned_batch_2.json': 'doug_labels_batch_2.json'
}
"""
Load example data using Pandas.
"""
datasets = []
for profiles_file, labels_file in DATA_FILES.items():
    datasets.append({
        'profiles': pd.read_json(profiles_file, encoding='ISO-8859-1'),
        'labels': pd.read_json(labels_file, encoding='ISO-8859-1')
    })
"Loaded %d datasets" % len(DATA_FILES.items())
Check the loaded data¶
It’s best to examine the loaded data to verify that it is in the expected format.
total_examples = 0
for dataset in datasets:
    total_examples += len(dataset['profiles'])
"%d total examples" % total_examples
datasets[0]['profiles'].head()
datasets[0]['labels'].head()
Notice the presence of the `id` field in both datasets. This field is the one upon which the profile and label data will be merged.
Combine the profiles with the labels¶
In this step, the profiles and labels are merged on the `id` field present in each. The datasets are then concatenated to produce one large set of examples.
for dataset in datasets:
    dataset['merged'] = pd.merge(left=dataset['profiles'],
                                 right=dataset['labels'],
                                 left_on='id',
                                 right_on='id')
data = pd.concat(map(lambda dataset: dataset['merged'], datasets))
data.fillna(value='', inplace=True)
data.head()
Format `gender` to be compatible with AUROC¶
The `gender` field is encoded as a string with the value “male” or “female”. The AUROC optimization metric requires a binary (0 or 1) value. The code below re-encodes `gender` into a new `gender_enc` field in which males are encoded with the value 0 and females with the value 1.
data['gender_enc'] = data.apply(
    lambda x: 0 if x['gender'] == 'male' else 1,
    axis=1)
data[['gender', 'gender_enc']].head()
Prepare the `writing_example` field¶
One of the fields provided in the dataset is `media`, which is an array of metadata about each user’s photos, including the captions. The `caption` field provides a point of leverage because it is the one section of the user’s profile in which they can write freeform text. This field is also high leverage for the project because the way in which the data are prepared affects the results substantially.
Intuitively, there are two methods to vectorize the `caption`s:
- Each `caption` could be encoded in isolation, with each one treated as an entirely different field.
- The `caption`s could all be concatenated and encoded together.
Option #2 makes the most sense because there is no natural ordering of captions; for any given two users, there should be nothing in common between each of their first photo captions, or each of their second photo captions, etc.
It is possible that information can be lost by combining captions as in option #2. One example of this loss of data is when two photo captions with completely different sentiments are concatenated; however, the gender of the user who wrote the captions remains constant, which is the important detail.
def extract_writing_example(row):
    captions = []
    for medium in row.media['nodes']:
        if 'caption' not in medium:
            continue
        caption = medium['caption']
        if caption is not None:
            captions.append(caption)
    return ' '.join(captions)
data['writing_example'] = data.apply(lambda x: extract_writing_example(x), axis=1)
data[['username', 'writing_example']].head()
Prepare the `hash_tags` field¶
This section may seem unnecessary because hash tags will already be detected and appropriately prioritized by the `writing_example` vectorizer. The impetus for synthesizing a separate `hash_tags` field is to be able to apply additional constraints to its vectorizer. One example of a beneficial tuning is binarizing the field instead of retaining the number of times each hash tag occurs in a given `writing_example`.
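As a minimal sketch (toy strings, not the project’s data) of what binarizing changes, compare a counting vectorizer with its binary counterpart on a repeated tag:
from sklearn.feature_extraction.text import CountVectorizer

docs = ['ootd ootd fitness']
print(CountVectorizer().fit_transform(docs).toarray())              # [[1 2]]: raw counts
print(CountVectorizer(binary=True).fit_transform(docs).toarray())   # [[1 1]]: presence only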
import re

def extract_hash_tags(row):
    hash_tags = re.findall('#[a-zA-Z]*', row['writing_example'])
    hash_tags = [hash_tag.replace('#', '') for hash_tag in hash_tags]
    return ' '.join(hash_tags)
data['hash_tags'] = data.apply(lambda x: extract_hash_tags(x), axis=1)
data[data['hash_tags'].apply(lambda x: len(x) > 0)][['writing_example', 'hash_tags']].head()
Prepare the `first_name` field¶
The `full_name` field is deceptively ill-suited for bag-of-words encoding. Recall that 1-gram bag-of-words encoding discards information about where each word occurs in the text. Furthermore, people often have last names that could function as first names for the opposite gender, e.g. “Patricia James.” This scenario would be particularly bad if the weighting for the association of “James” with “male” were stronger than “Patricia”’s association with “female.”
To solve this problem, the `first_name` field is extracted and encoded separately as a crude compensation for the loss of positional information when encoding with bag-of-words. The use of n-grams does not solve this problem because very few people share the same `full_name`, so the model would be overfitted to the training data. The reason that the `first_name` field does not entirely replace the `full_name` field is that the latter may contain emojis or middle names with which predictions can be improved.
def extract_first_name(row):
    return row['full_name'].split(' ', 1)[0]
data['first_name'] = data.apply(lambda x: extract_first_name(x), axis=1)
data[['full_name', 'first_name']].head()
Data Exploration¶
The section that follows will more deeply explore the dataset to identify the data’s features and trends. The goal of this investigation is to identify which encodings are optimal for analysis.
Sample the fully formatted data¶
Looking at a subset of the dataset with all of the fields properly formatted is the best way to spot obvious relationships and potential paths forward.
data = data[['username', 'first_name', 'full_name', 'biography', 'writing_example', 'hash_tags', 'gender', 'gender_enc']]
data.head(10)
Several patterns are immediately apparent in this dataset:
- Many users make liberal use of emojis. One path forward is to perform character vectorization of all fields in which users may enter emojis.
- Users write freeform text with context. It may be beneficial to encode any user-inputted fields as n-grams rather than the default 1-grams to retain this context.
- Not everyone has a `writing_example`, but almost everyone has filled in at least one text field. This observation is good news for the model; users without any user-entered text fields filled in are typically less useful for marketing purposes.
These observations could be validated by checking their correctness statistically instead of visually from the small sample of ten examples.
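As one hedged sketch of such a check (reusing the `data` frame built above), the fraction of users missing a `writing_example` who nonetheless filled in some other text field can be computed directly:
text_fields = ['first_name', 'full_name', 'biography', 'writing_example', 'hash_tags']
has_text = data[text_fields].apply(lambda col: col.str.len() > 0)
no_writing = ~has_text['writing_example']
print("Users without a writing_example: %.1f%%" % (100 * no_writing.mean()))
print("...of whom filled in another text field: %.1f%%" % (
    100 * has_text[no_writing].drop(columns='writing_example').any(axis=1).mean()))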
Chart the frequency of word counts for each field and each gender¶
Following observation #3 above, charting the frequency with which users of each gender fill in text-based fields will give some idea as to how reliable the use of these fields would be.
field_lens_to_plot = ['writing_example', 'biography', 'full_name', 'hash_tags']
for field in field_lens_to_plot:
    for gender in [0, 1]:
        plt.figure()
        # Copy the slice so that adding the length column does not trigger
        # pandas' SettingWithCopyWarning on a view of the original frame.
        gender_data = data[data['gender_enc'] == gender].copy()
        gender_data["%s_len" % field] = gender_data[field].apply(
            lambda x: len(re.findall(r'\w+', x)))
        gender_data["%s_len" % field].plot(
            title="Frequency of %s for %s" % (field, 'males' if gender == 0 else 'females'),
            kind='hist', color='blue' if gender == 0 else 'red')
The only field with a substantial discrepancy between genders is `biography`: females seem to be more likely to enter more text. Unfortunately, this finding was a red herring; when the trained model included the length of the `biography` field as a predictor, there was no significant difference in predictive power.
Check for imbalances in the genders¶
A substantial imbalance in the data may require intervention.
"Number of males: %d; Number of females: %d" % (
len(data[data['gender_enc'] == 0]),
len(data[data['gender_enc'] == 1])
)
There is an imbalance in gender representation within the dataset, but the lopsidedness is insufficient to warrant drastic measures. One way in which the analysis can be made more robust is by using the AUROC metric in place of accuracy for model optimization. This technique is typically used to compensate for acute asymmetry in the data, but it can also be employed for less extreme corrections. One challenge with AUROC is that it is limited to binary classification, which would complicate extending the model later to support more than the binary genders.
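A minimal, self-contained sketch (toy labels, not the project’s data) of why AUROC is informative under imbalance: a degenerate classifier that assigns every example the same score achieves high accuracy on a lopsided dataset but only chance-level AUROC.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([1] * 90 + [0] * 10)    # 90/10 class imbalance
scores = np.full(100, 0.9)                # degenerate model: same score for everyone
y_pred = (scores >= 0.5).astype(int)      # always predicts the majority class
print(accuracy_score(y_true, y_pred))     # 0.90: misleadingly strong
print(roc_auc_score(y_true, scores))      # 0.50: no discriminative power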
Plot the most predictive words for each field¶
In this section, untuned logistic regression models are trained on each field in isolation, and the most extreme weights are plotted. This illustration is not particularly useful or actionable, but it is interesting.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from model_performance_plotter import plot_coefficients
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Biography Most Predictive Terms',
                  data['biography'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Writing Example Most Predictive Terms',
                  data['writing_example'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Full Name Most Predictive Terms',
                  data['full_name'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Hash Tags Most Predictive Terms',
                  data['hash_tags'], data['gender_enc'])
Model Training and Validation¶
The final step is to train and validate the model. In practice, this section took many iterations to reach its current state.
Split the main dataset into training and test datasets¶
scikit-learn’s grid search will automatically create cross-validation folds from the training dataset, so only the test dataset must be split off manually.
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
print("Train data set size: %d" % len(train))
print("Test data set size: %d" % len(test))
parameters = {
    # All candidate hyperparameters below are commented out, so the grid
    # search fits the pipeline once with the values hard-coded in its
    # definition. Uncomment entries to re-run the search over them.
    # 'clf__solver': ['liblinear', 'lbfgs', 'newton-cg', 'saga'],
    # 'clf__loss': ['squared_hinge', 'hinge'],
    # 'clf__penalty': ['l1', 'l2'],
    # 'clf__C': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
    # 'clf__C': [10, 15, 20, 25, 30, 35, 50, 80, 100, 120, 150],
    # 'clf__dual': [False, True],
    # 'clf__class_weight': [None, 'balanced'],
}
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from transformers import ItemSelector
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction import DictVectorizer
tfidf_transformer = TfidfTransformer()
encoding_args = {
    'decode_error': 'replace',
    'strip_accents': 'unicode',
}
word_vectorizer_args = {
    **encoding_args,
    'ngram_range': (1, 2)
}
char_vectorizer_args = {
    **encoding_args,
    'analyzer': 'char',
    'ngram_range': (1, 3)
}
word_vectorizer = CountVectorizer(**word_vectorizer_args)
char_vectorizer = CountVectorizer(**char_vectorizer_args)
transformers = {
    'username': {
        'char': char_vectorizer
    },
    'biography': {
        'word': word_vectorizer,
        'char': char_vectorizer
    },
    'full_name': {
        'word': CountVectorizer(**encoding_args),
        'char': char_vectorizer
    },
    'first_name': {
        'word': CountVectorizer(**encoding_args)
    },
    'hash_tags': {
        'word': CountVectorizer(**encoding_args, binary=True),
        'char': CountVectorizer(**char_vectorizer_args, binary=True)
    },
    'writing_example': {
        'word': word_vectorizer,
        'char': char_vectorizer
    }
}
transformer_list = []
for key, transformer_types in transformers.items():
    for transformer_type, transformer in transformer_types.items():
        transformer_list.append(
            ("%s_%s" % (key, transformer_type), Pipeline([
                ('selector', ItemSelector(key=key)),
                ('vect', transformer),
                ('tfidf', tfidf_transformer)
            ]))
        )
pipeline = Pipeline([
    ('union', FeatureUnion(transformer_list=transformer_list)),
    ('clf', LogisticRegression(C=150))
])
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=10, scoring=scoring, refit='AUC')
grid_search.fit(train, train['gender_enc'])
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
score = grid_search.score(test, test['gender_enc'])
print("Test score: %f" % score)
y_pred = grid_search.predict(test)
print("Test accuracy: %f" % accuracy_score(test['gender_enc'], y_pred))
# Use this to assess the probability of each classification.
# grid_search.predict_proba(test)
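For example, a hedged sketch of thresholding those probabilities (reusing the fitted `grid_search` and `test` objects above; `predict_proba` is available because logistic regression is the pipeline’s final step):
probs = grid_search.predict_proba(test)       # shape (n_examples, 2); column 1 is P(female)
confident_females = test[probs[:, 1] > 0.9]   # keep only high-confidence female predictions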
from sklearn.metrics import classification_report
print(classification_report(test['gender_enc'], y_pred))
from model_performance_plotter import plot_learning_curve, \
    plot_roc_curve, \
    plot_precision_recall_curve
title = 'Gender Classifier'
plot_roc_curve(title, y_pred, test['gender_enc'])
plot_precision_recall_curve(title, test['gender_enc'], grid_search.decision_function(test))
plot_learning_curve(grid_search.best_estimator_, title, train, train['gender_enc'])
Examine cases where the model makes correct predictions¶
It is good practice to verify that the model is making reasonable predictions and that the labels were accurate.
test[test['gender_enc'] == y_pred].sample(10)
Examine cases where the model makes incorrect predictions¶
It is also good practice to investigate the cases for which the model makes incorrect predictions. Note that in the list below, the `gender` field is the true label, and the opposite of this label is what the model predicted. The majority of these mistakes are due to incorrect labels.
test[test['gender_enc'] != y_pred].sample(10)
# Note: `sklearn.externals.joblib` was deprecated in scikit-learn 0.21;
# on newer versions, use `import joblib` directly.
from sklearn.externals import joblib
MODEL_FILE = 'ig_gender_classifier.pkl'
joblib.dump(grid_search, MODEL_FILE)
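To reuse the persisted model later, the pickle can be reloaded and applied to freshly prepared profiles. A minimal sketch, assuming a hypothetical DataFrame `new_profiles` with the same columns as the training data:
# `new_profiles` is a hypothetical DataFrame prepared exactly like `data` above.
loaded_model = joblib.load(MODEL_FILE)
predicted_genders = loaded_model.predict(new_profiles)  # 0 = male, 1 = female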
Conclusion¶
The model achieves 90% accuracy, with 90% precision and 85% recall for males, and 88% precision and 93% recall for females; it is therefore slightly better at picking out females.
In the future, this project could be improved in the following ways:
- Investigating why the model performs better on females than males. One possible cause for this discrepancy is that there are more females in the dataset, so the model has more data with which to identify females.
- Translating non-English text to English and then passing that through the model. One way to look at translation is that it is a poor man’s form of PCA; the model could share the weights of English terms rather than being spread thin across every input language. This experiment was attempted, but it proved too slow due to the need for a web request for every example.
- Redoing the project with a neural net instead of logistic regression. Neural nets typically require at least 50,000 to 100,000 examples to perform substantially better than classical models. This experiment was attempted early on in the project, but failed due to an insufficient number of examples.
- Incorporating user photos into the model via ensemble methods. Computer vision is expensive and slow, so this addition is unlikely to add substantial value to the end result.