Online advertising is a rapidly growing industry. Business models change quickly, which poses many challenges for both the engineering and the machine learning communities. Because technology solutions affect the business directly, it is important to keep improving factors such as scalability and optimization performance.
Komli has developed a robust real-time ad serving platform which performs better than* the most widely used system available in the market. An important component of the platform is the decisioning engine, which chooses the best ad for each impression request in order to optimize revenue. The engine uses a prediction system to predict the click-through rate (CTR) or conversion rate (CVR) of a campaign based on publisher-, advertiser- and user-level attributes. The model is trained on historical data for these attributes and their performance. This article focuses on one of the experiments performed for feature selection before building the model.
Feature Selection: As in any machine learning task, one of the most important steps before training the model is feature selection. Many features could be used for ad optimization. These features are simply the variables that you think affect the CTR/CVR and would therefore help in prediction; let’s call them predictor variables. Because the amount of data required by any statistical model grows exponentially with the number of variables (if performance is not to degrade), the best strategy is to base the prediction on a subset of the available variables. The following is the list of predictor variables that we considered for our optimization problem; this subset matters the most in prediction.
| Sr. no. | Variable |
|---|---|
| 1 | Country |
| 2 | City |
| 3 | DMA |
| 4 | Creative height |
| 5 | Creative width |
| 6 | Publisher Id |
| 7 | Site Id |
| 8 | Section Id or ad slot Id |
| 9 | Advertiser Id |
| 10 | Creative Id |
| 11 | Hour of the day |
| 12 | Referer URL |
| 13 | Publisher URL |
| 14 | Placement Id/fold position |
| 15 | Different user attributes obtained from a cookie |
| 16 | Page context, content category |
| 17 | Source Ad Network |
| 18 | Day of the week (0-6) |
| 19 | 4 hour (which 4 hour slot it is, 0-5) |
| 20 | 8 hour (which 8 hour slot it is, 0-2) |
| 21 | Language |
| 22 | Social (high, low, average) |
| 23 | Demographic segments like Gender, Age, Income etc. |
| 24 | Lifestyle (equivalent to interest) |
| 25 | Landing page URL |
| 26 | Root domain Id of the publisher |
| 27 | Root domain Id of the landing page |
| 28 | Resolution of the user’s screen |
| 29 | External events – proximity to these events |
| 30 | Frequency of the user on the publisher |
| 31 | Frequency of the user on the creative |
| 32 | Frequency of the user on the ad network |
| 33 | Recency – length of time between exposures to this creative |
| 34 | Device of the user (mobile, laptop etc.) |
| 35 | Browser of the user |
| 36 | Recent search history of the user |
| 37 | Type of the page the ad is on (can it be pop-up etc.) |
| 38 | Whether the user is a member of the site |
| 39 | Publisher category |
| 40 | Metadata for the creative – product category, type of the creative etc. |
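To make the point about data requirements concrete: if a model has to see examples for every combination of variable values, the number of combinations is the product of the variables' cardinalities, so it multiplies with each variable added. The sketch below illustrates this with made-up cardinalities; the variable names are taken from the list above, but the counts are assumptions for illustration only, not Komli's actual figures.

```python
# Hypothetical cardinalities (number of distinct values per variable).
# These numbers are illustrative assumptions, not measured data.
cardinalities = {
    "country": 50,
    "hour_of_day": 24,
    "day_of_week": 7,
    "creative_id": 1000,
}


def combination_counts(cards):
    """Return the cumulative number of value combinations as variables are added."""
    total = 1
    counts = []
    for name, k in cards.items():
        total *= k  # each new variable multiplies the number of combinations
        counts.append((name, total))
    return counts


for name, total in combination_counts(cardinalities):
    print(f"after adding {name}: {total} combinations")
```

With only four variables the hypothetical example already reaches millions of combinations, which is why a ranked subset is preferable to using all forty variables at once.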
A number of feature selection methods are available, such as correlation analysis, PCA (for dimensionality reduction), and feature ranking methods based on scores like mutual information, information gain and gain ratio. Our existing analysis of the importance of the predictor variables and their correlations gave us an idea of which variables are correlated with each other and which are not. This article does not focus on that discussion.
We will mainly discuss the feature ranking we performed and its results. We used mutual information as the scoring function to rank the variables. The mutual information of two discrete random variables X and Y can be defined as [http://en.wikipedia.org/wiki/Mutual_information]:

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\left(\frac{p(x,y)}{p_1(x)\,p_2(y)}\right)$$

where $p(x,y)$ is the joint probability distribution function of X and Y, and $p_1(x)$ and $p_2(y)$ are the marginal probability distribution functions of X and Y respectively.
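In practice the probabilities in this definition are not known and have to be estimated from logged data. A minimal sketch, assuming relative frequencies are used as probability estimates and a base-2 logarithm (so the score is in bits); the function name `mutual_information` and the toy data are hypothetical and not part of the platform described here:

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired samples of two discrete variables."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p1(x) * p2(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: ad slot position versus whether the impression was clicked.
slots = ["top", "top", "bottom", "bottom"]
clicks_dependent = [1, 1, 0, 0]    # click is fully determined by the slot
clicks_independent = [1, 0, 1, 0]  # click is unrelated to the slot

print(mutual_information(slots, clicks_dependent))
print(mutual_information(slots, clicks_independent))
```

Pairs that never occur together contribute nothing to the sum, so iterating over the observed pairs only is enough. The score is zero when the two variables are independent in the sample and grows as one variable tells more about the other.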
The graph below shows the relative mutual information scores for a small sample of attributes. Most of the attributes are identifiers of physical entities such as the creative or the publisher. The attribute CTR is derived from the click performance of CPA campaigns, and frequency is derived from the number of exposures of a creative to a user.

This simple experiment gives us an idea of which variables are important and which are not. The results here are only for a sample, but our overall results confirm that variables with a high mutual information value do get high importance in prediction.
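A ranking of this kind can be sketched as follows: compute the mutual information between each candidate variable and the outcome, then scale the scores relative to the highest one. This is only an illustration under assumptions; the helper names (`mi_bits`, `rank_features`), the record layout and the sample rows are hypothetical, and the scaling to a maximum of 1 is one possible reading of "relative" scores.

```python
from collections import Counter
from math import log2


def mi_bits(xs, ys):
    """Mutual information (in bits) estimated from relative frequencies."""
    n = len(xs)
    joint, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (cx[x] * cy[y])) for (x, y), c in joint.items())


def rank_features(rows, target):
    """Return (feature, relative score) pairs, highest mutual information first."""
    labels = [row[target] for row in rows]
    features = [name for name in rows[0] if name != target]
    scores = {f: mi_bits([row[f] for row in rows], labels) for f in features}
    top = max(scores.values()) or 1.0  # avoid dividing by zero if every score is 0
    ranked = [(f, s / top) for f, s in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical impression log; field names and values are made up for illustration.
rows = [
    {"creative_id": "c1", "hour_slot": 0, "clicked": 1},
    {"creative_id": "c1", "hour_slot": 1, "clicked": 1},
    {"creative_id": "c2", "hour_slot": 0, "clicked": 0},
    {"creative_id": "c2", "hour_slot": 1, "clicked": 0},
]

ranking = rank_features(rows, "clicked")
for feature, score in ranking:
    print(feature, round(score, 3))
```

One caveat when reading such a ranking: mutual information estimated from counts tends to be inflated for variables with many distinct values and few observations per value, which is typical of identifier attributes, so scores from a small sample should be checked against a larger one.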
The next steps could include performing bucket tests (A/B testing) in order to gauge the performance of different models built on different subsets of variables.
*As per our benchmarking experiments.
