Github link to full code + dataset

Dataset

The DataFrame has shape (8378, 195). Each participant has a unique ‘iid’, and each row represents a speed date between two participants.

Screen Shot 2020-08-13 at 12.09.18 AM.png

Screen Shot 2020-08-13 at 12.01.27 AM.png

Rows 20-29 are all the dates iid 3 went on; pid is the partner’s id. Note how rows 29 and 192 are related. These two rows are actually the same date: row 29 holds the information from iid 3’s perspective and row 192 holds it from iid 20’s perspective. Since each date is represented twice, I had to reshape the dataset so that no information is repeated between rows before using it for machine learning models and visualizations.

Age

The female age histogram is more right-skewed than the male one. This suggests the women were, on average, younger than the men in this speed dating experiment.

Attribute Analysis by Visual

Next I want to take a look at which attributes people think are most important. Participants were given a questionnaire before the speed dating rounds. In one section, participants are asked what they look for in the opposite sex. They are asked to allocate 100 points among 6 characteristics, which shows how much importance people place on each attribute relative to the others.

attr1_1 : attractive

sinc1_1 : sincere

intel1_1 : intelligence

fun1_1 : fun

amb1_1 : ambition

shar1_1 : shared interest
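A minimal sketch of how the average point allocation could be computed, overall and by gender. The toy numbers are made up, and the gender column with its coding (0 = female, 1 = male) is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for the questionnaire columns; all values are made up.
# The "gender" column and its coding (0 = female, 1 = male) are assumptions.
df = pd.DataFrame({
    "iid":      [1, 1, 2, 3, 3, 4],
    "gender":   [0, 0, 0, 1, 1, 1],
    "attr1_1":  [15, 15, 20, 40, 40, 30],
    "sinc1_1":  [20, 20, 20, 10, 10, 20],
    "intel1_1": [25, 25, 20, 20, 20, 20],
    "fun1_1":   [20, 20, 15, 15, 15, 15],
    "amb1_1":   [10, 10, 15, 5, 5, 5],
    "shar1_1":  [10, 10, 10, 10, 10, 10],
})

attributes = ["attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1"]

# Keep one row per participant so people with more dates are not counted more often.
participants = df.drop_duplicates(subset="iid")

# Average points allocated to each attribute, overall and split by gender.
overall_means = participants[attributes].mean()
means_by_gender = participants.groupby("gender")[attributes].mean()

print(overall_means)
print(means_by_gender)
```

Dropping duplicate iid rows first matters here because the questionnaire answers are repeated on every date row of the same participant.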

Screen Shot 2020-08-14 at 3.22.48 PM.png

Screen Shot 2020-08-14 at 3.22.56 PM.png

WOAH, interesting findings. At first glance, the chart on the left shows that the general population thinks attractiveness is the most important attribute in a significant other. However, after separating the dataset by gender, we can tell that the value for attractiveness was heavily driven by men. Men value attractiveness the most, whereas women value intelligence the most. According to this chart, men are extremely shallow and overlook all other qualities, while women value the qualities in a more well-rounded way. I can confidently say I am quite different from the average male: I think intelligence and ambition are quite important in a girl.

I want to look at another interesting comparison. In the questionnaire, participants were also asked what they think the OPPOSITE sex looks for. I can compare these results to what men and women ACTUALLY want.

Screen Shot 2020-08-14 at 3.43.33 PM.png

Screen Shot 2020-08-14 at 3.43.49 PM.png

Both genders underestimated how much people actually value intelligence and sincerity. In addition, they overestimated how much people value attractiveness. This is great news for all the ugly people out there! The person you’re trying to pursue doesn’t care about your physical attractiveness as much as you think; instead they care about how fast you can figure out that math equation. But in all seriousness, what the opposite gender wants and what we think they want don’t quite align. Hopefully this clears the air!

Attribute Analysis by Statistics

I want to show statistically how each characteristic affects a participant’s success rate. Success rate is calculated as the number of people who said yes to the participant divided by the total number of dates the participant went on. I plotted each participant’s success rate against the average score they were given for each characteristic. The jointplots also show the distributions of success rate and of each characteristic. My takeaway from these plots is that attraction and level of fun have the strongest linear relationships with success rate. The rest of the characteristics are also linearly related to success rate, but to a lesser extent.

attr_o : attraction score given by partner

sinc_o : sincerity score given by partner

intel_o : intelligence score given by partner

fun_o : fun score given by partner

amb_o : ambition score given by partner

shar_o : shared interest score given by partner
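The success rate described above can be sketched with a groupby. This is a toy example with made-up values, assuming one row per date from the participant’s side and dec_o coded as 0/1.

```python
import pandas as pd

# Toy rows: one row per date, seen from the participant's side. Values are made up.
df = pd.DataFrame({
    "iid":    [1, 1, 1, 1, 2, 2, 2, 2],
    "dec_o":  [1, 0, 1, 1, 0, 0, 1, 0],   # partner's decision about this participant
    "attr_o": [8, 6, 7, 9, 5, 4, 6, 5],   # attraction score given by partner
    "fun_o":  [7, 6, 8, 7, 5, 5, 6, 4],   # fun score given by partner
})

# Success rate = number of "yes" decisions received / number of dates.
# Because dec_o is 0/1, that is simply its mean per participant.
per_participant = df.groupby("iid").agg(
    success_rate=("dec_o", "mean"),
    avg_attr=("attr_o", "mean"),
    avg_fun=("fun_o", "mean"),
)

print(per_participant)
```

A table like per_participant, with one row per iid, is the kind of input a plot of success rate against an average score would need.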

Screen Shot 2020-08-14 at 6.46.27 PM.png

Screen Shot 2020-08-14 at 6.46.00 PM.png

Screen Shot 2020-08-14 at 6.46.11 PM.png

Screen Shot 2020-08-17 at 12.39.06 PM.png

Since we already have the success rate calculated for each participant, let’s see how it relates to other variables. I drew a box plot of success rate by participants’ field of study. The fields of study are numbered from 1 to 18, and the numbering is explained in the attached Word document on my GitHub. Although it seems like people who study architecture (16.0) have a high success rate, the sample size is rather small, so this is not a reliable claim. We can see more clearly that people who study languages (4.0) have a higher average success rate. I guess this makes sense because people who can speak multiple languages are automatically classified as cool in our society today.
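One way to check the sample-size concern is to put the group counts next to the averages. This is a toy sketch with made-up values; the column name field_cd is hypothetical and stands for the numeric field-of-study code.

```python
import pandas as pd

# One row per participant. "field_cd" is a hypothetical name for the field code;
# the success rates are made up.
people = pd.DataFrame({
    "field_cd":     [4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 16.0],
    "success_rate": [0.6, 0.5, 0.7, 0.4, 0.3, 0.5, 0.4, 0.9],
})

# Count, median and mean per field: a high average backed by a tiny count
# should not be trusted much.
summary = people.groupby("field_cd")["success_rate"].agg(["count", "median", "mean"])

print(summary)
```

In this toy table, field 16.0 has the highest mean but only one participant behind it, which is the same situation as the architecture group in the box plot.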

Participant Confidence Analysis

I wanted to check how confidence may affect dec_o (the partner’s decision about a second date). To calculate confidence, I divided each participant’s self rating by the average rating given by their partners. The self rating is the average of the attribute scores participants gave themselves. If the confidence proportion is close to 1, the self rating and the rating given by partners are quite close. Proportions over 1 mean the self rating is higher than the rating given by partners; proportions under 1 mean the average rating given by partners is higher than the self rating. After doing all the calculations and placing them in new columns, I plotted the proportion against success rate. From the graph on the right we can clearly see that people who rated themselves higher than their partners rated them have a lower success rate. So, in conclusion, confidence levels and success rate have an inverse linear relationship.
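The confidence proportion can be sketched as a simple column division. The column names self_rating and partner_rating below are hypothetical, and the values are made up to illustrate the calculation only.

```python
import pandas as pd

# Per-participant toy table. "self_rating" and "partner_rating" are hypothetical
# column names; all values are made up.
people = pd.DataFrame({
    "iid":            [1, 2, 3],
    "self_rating":    [8.0, 6.0, 7.0],   # average of the scores given to self
    "partner_rating": [6.4, 7.5, 7.0],   # average of the scores given by partners
    "success_rate":   [0.30, 0.60, 0.50],
})

# Ratio > 1: rates self higher than partners do. Ratio < 1: the opposite.
people["confidence"] = people["self_rating"] / people["partner_rating"]

# Pearson correlation between the confidence proportion and success rate.
corr = people["confidence"].corr(people["success_rate"])

print(people)
print(corr)
```

A negative correlation on the real data would agree with the inverse relationship seen in the plot; three made-up rows obviously prove nothing on their own.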

Screen Shot 2020-08-28 at 2.54.40 PM.png

Data Preparation for Machine Learning Algorithms

As mentioned before, the dataset has repeated information in separate rows: a single date is represented twice, from two different perspectives. To tackle this problem, I came up with a way to build a unique key representing each date. The variable pair_val is calculated as (iid * pid) + (race + race_o) + (int_cor * int_cor) + (age * age_o). The idea behind this variable is that rows with the same value of pair_val represent the same date.

Explanation of the pair_val calculation:

interest correlation (int_cor) is the same in both rows

iid * pid is the same in both rows (iid and pid are just flipped)

race and race of partner, and age and age of partner, are just flipped between the two rows
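The pair_val formula can be sketched on a toy frame. The first two rows describe the same date from both sides (in the spirit of rows 29 and 192) and the third is an unrelated date; all values are made up.

```python
import pandas as pd

# Rows 0 and 1 are the same date seen from each side; row 2 is a different date.
# All values are made up for illustration.
df = pd.DataFrame({
    "iid":     [3, 20, 4],
    "pid":     [20, 3, 11],
    "race":    [2, 4, 2],
    "race_o":  [4, 2, 3],
    "int_cor": [0.16, 0.16, -0.21],
    "age":     [25, 27, 23],
    "age_o":   [27, 25, 22],
})

# Every term is symmetric under swapping participant and partner,
# so both rows of the same date get the same key.
df["pair_val"] = (
    df["iid"] * df["pid"]
    + (df["race"] + df["race_o"])
    + df["int_cor"] * df["int_cor"]
    + df["age"] * df["age_o"]
)

# Sanity check: how many rows share each key.
key_counts = df["pair_val"].value_counts()

print(df[["iid", "pid", "pair_val"]])
print(key_counts)
```

Since the key is a sum of products, two different dates could in principle land on the same value, so on the full dataset it seems worth checking that every pair_val appears exactly twice before joining on it.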

In the next step, I separated the DataFrame into two: the left one holding the female participants and the right one holding the male participants. Finally, using the unique key, I did a join so that each date is represented on one row. I also renamed the variables from the right side to tell them apart. Basically, I wrangled the DataFrame to be less long and more wide while retaining the same information.
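A minimal sketch of the split-and-join step with pandas merge. The gender coding (0 = female, 1 = male), the attr rating column, the lowercase match column and the _m suffix are assumptions for illustration; the key values are made up.

```python
import pandas as pd

# Long format: each date appears twice, once per participant. Values are made up.
df = pd.DataFrame({
    "iid":      [3, 20, 4, 21],
    "pid":      [20, 3, 21, 4],
    "gender":   [0, 1, 0, 1],        # assumed coding: 0 = female, 1 = male
    "attr":     [7, 6, 8, 5],        # hypothetical rating column
    "match":    [1, 1, 0, 0],
    "pair_val": [741.0256, 741.0256, 590.01, 590.01],
})

# Left side: female rows. Right side: male rows, without the shared match column.
female = df[df["gender"] == 0]
male = df[df["gender"] == 1].drop(columns=["match"])

# Join on the key; overlapping column names from the right side get a suffix.
# validate raises an error if a key is not unique on either side.
wide = female.merge(male, on="pair_val", suffixes=("", "_m"), validate="one_to_one")

print(wide)
```

The result has half as many rows and roughly twice as many columns, which is the same long-to-wide change as in the shapes reported for the real data.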

Original DF (8062 X 24) -> New DF (4031 X 41)

Prediction on Match

I used a very basic logistic regression to predict the Match variable (0, 1). I want to see whether information such as the ratings of each other’s attributes, interest correlation, age difference and other existing variables can accurately predict whether there will be a second date. Match=1 means both participants agreed to a second date. I removed variables such as dec and dec_o (decision of participant and decision of partner), because together they determine Match and would make the prediction trivial. Below is a box plot of the accuracy scores from 100 iterations with different train/test splits.
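The repeated-split evaluation can be sketched with scikit-learn. The data below is synthetic (three made-up features and a 0/1 target), and the 25% test size and the max_iter setting are assumptions rather than the settings used for the results in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the wide table: 3 made-up features and a 0/1 target.
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 1.0).astype(int)

# Fit and score the model on 100 different random train/test splits.
accuracies = []
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

# Accuracy of always predicting the majority class, as a reference point.
baseline = max(y.mean(), 1 - y.mean())

print(np.mean(accuracies), baseline)
```

Comparing the mean accuracy with the majority-class baseline is useful when one class is rare, because a model that always predicts the common class can already score high.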

Screen Shot 2020-08-28 at 4.29.04 PM.png

Screen Shot 2020-08-28 at 4.29.13 PM.png

Overall average accuracy rate of 86%!