Well, we didn’t literally do it in a garage. We are just a “garage-type” startup.

How we measured Twitter demographics in a garage using AI

In minutes

Demografy
Mar 21, 2018

We are proud to publish our first case study. In it, we used Demografy to measure the age range and gender of US Twitter users and compared our results with two separate studies on Twitter demographics published by Pew Research and comScore. We also give a brief overview of existing technologies for measuring the demographics of website audiences.

Demografy is a B2B SaaS platform that uses noninvasive, machine-learning-based technology to derive demographic data from names alone. It can be used to get demographic insights or to fill in missing demographic data in existing lists.

Unlike traditional solutions, Demografy doesn’t require businesses to know or disclose their customers’ or prospects’ addresses, emails or other sensitive information. This makes Demografy a privacy-by-design solution and lets businesses get 100% coverage of any list, since names are all they need.

For readers who are mainly interested in the results, we start with the key findings and then dive deeper into comparing existing technologies and describing our methodology.

Key Findings

Accuracy

We evaluated Demografy’s accuracy against well-established existing solutions. Demografy’s gender accuracy was 96% (49% male, 51% female vs 51% male, 49% female) if we take the Pew Research survey results as the gold standard. Its age accuracy was 94% compared to comScore. These numbers are in line with our existing accuracy benchmarks, which are 97% for gender and 95% for age range when compared to self-reported data.

Time and cost

* Not including data collection time, though this process doesn’t take long.

** It’s hard to estimate the large-scale costs that comScore and Pew Research bear, but they are much higher than the cost of applying a machine learning algorithm to a set of easily available data.

Goal of the research

One reason for the research was to assess Demografy’s accuracy. However, before conducting it we had already tested Demografy’s accuracy by comparing its results with hundreds of thousands of publicly available self-reported records of real people.

So there are three key aspects of the research:

  • Explore and compare available approaches to measuring Internet demographics.
  • Benchmark our own performance against known and well-established solutions.
  • Implement a proof of concept of a brand new demographics measurement method that eliminates the disadvantages of traditional approaches.

As a case for our research, we chose the task of measuring US Twitter demographics: more specifically, the age range and gender distribution of US Twitter users.

Existing methods and problems

Many are curious how companies measure demographics of websites they don’t have access to. We can split traditional approaches into three main methods:

  1. Panel data from volunteer Internet users. This method relies on a large, diversified network of volunteer users with tracking software installed on their devices. These users provide demographic data about themselves, and the sites they visit are tracked automatically. The results are then sampled so that they can represent a wider population. Examples are comScore and Nielsen.
  2. Surveys of Internet users. These are traditional surveys of users that include questions about their demographics and Internet usage. Examples are Pew Research or any other polling organization.
  3. Cookie-based on-site analytics. The most popular example is Demographics in Google Analytics. It places third-party cookies on site visitors’ devices to track them across the Google Audience Network and infer their demographics from Internet usage patterns and/or Google+ profile data. Unlike the other methods, this one is available only to site owners.

Since the major goal of our research is measuring the demographics of third-party websites (for example, competitors’ websites), we will focus on the first two methods: they can measure not only your own websites but any other available site.

While these two approaches are recognized as fairly accurate, with a sampling error of approximately plus or minus 3 percentage points, they have some disadvantages:

  • Cost. Large diversified panels of Internet users and large-scale surveys are very expensive, requiring heavy infrastructure and many man-hours.
  • Time. These measurements are normally run over a long period of time to collect and process enough data from all respondents or panel users.

The two examples of these approaches we’re going to use in our research are comScore (panel data) and Pew Research (surveys).
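
For a sense of where a figure like plus or minus 3 percentage points comes from, here is a minimal Python sketch of the textbook margin of error for a proportion. It assumes simple random sampling and a 95% confidence level; real surveys and panels apply weighting, so their published margins can be somewhat wider. The function name `margin_of_error` is illustrative only.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a confidence interval for a proportion, in percentage points.

    n: sample size; p: assumed proportion (0.5 is the worst case);
    z: z-score for the confidence level (1.96 for roughly 95%).
    Assumes simple random sampling, which is a simplification.
    """
    return 100 * z * math.sqrt(p * (1 - p) / n)


# About 1,000 respondents are enough for roughly +/- 3 points.
print(round(margin_of_error(1000), 1))  # 3.1
```

Note that this covers sampling error only. It says nothing about errors introduced by the measurement itself, such as misreported answers or a model's misclassifications.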

Methodology and the new method

In this research we measured the gender and age distributions of US Twitter users and compared them with other studies. For this purpose we combined two studies, from Pew Research and comScore, on gender and age distribution respectively. Both studies were conducted during 2016, the same period in which the data we fed to Demografy was collected.

Normalizing data

  1. Pew Research, Demographics of US social media in 2016. A survey of 1,520 adult Americans conducted March 7-April 4, 2016. Pew Research samples landline and cellphone numbers to yield a random sample of the US population, with an average sampling error of 2.9 percentage points according to their methodology. The study provides both gender and age figures. However, its metrics differ from ours: it reports the share of each US population group that uses Twitter, while comScore reports the age distribution of US Twitter users, which is more convenient for our case. So we used only the gender data from Pew Research, to avoid excessive data normalization that could introduce significant estimation errors. For gender, we normalized Pew Research’s data using US Census gender data for 2016, so that 24% of US males and 25% of US females became 49% and 51% of the US Twitter population respectively.
  2. comScore, Distribution of Twitter users in the United States by age group. comScore uses the panel data approach: it has a large, diversified network of volunteers with self-reported demographic data and tracking software that analyzes their Internet usage. comScore normally updates its statistics each month, giving a 30-day data collection period similar to Pew Research’s 4-week survey period. It doesn’t publish information on its accuracy, but it samples its audience to represent the entire US population, so it probably has a comparable sampling error of about 3 percentage points. comScore’s metric, the age distribution of the US Twitter population, is more convenient for us. However, its age groups differ from the ones Demografy uses, so we normalized the data and transformed its 6 groups into 3. Where a new age boundary falls in the middle of an original age group, we split that group into two equal parts and added each part to the respective new age group. So 18–24 (17.7%), 25–34 (22.5%) and 35–44 (19.5%) became 18–39 (17.7% + 22.5% + 9.75% = 49.95%, or 50% after rounding). Though this is not a perfect way of normalizing data, we accepted this assumption as the best available option given the lack of additional data and the low probability of noticeably distorting the results.
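
The two normalizations described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Demografy’s actual code: the function names are hypothetical, and the 50/50 population split passed to `gender_split` is a placeholder rather than the Census figure used in the study. Only the three comScore groups quoted in the text are included.

```python
def gender_split(male_rate, female_rate, male_pop_share):
    """Turn per-gender usage rates into a male/female split of users.

    male_rate / female_rate: share of each gender that uses the service.
    male_pop_share: share of males in the reference population (0..1).
    """
    male_users = male_rate * male_pop_share
    female_users = female_rate * (1.0 - male_pop_share)
    total = male_users + female_users
    return male_users / total, female_users / total


def share_in_range(groups, low, high):
    """Sum the shares of (low, high, share) age groups inside [low, high].

    A group that straddles a boundary of the target range contributes
    half of its share, mirroring the equal split described in the text.
    """
    total = 0.0
    for g_low, g_high, share in groups:
        if g_low >= low and g_high <= high:
            total += share        # group lies fully inside the target range
        elif g_low <= high and g_high >= low:
            total += share / 2    # group straddles a boundary: take half
    return total


# Placeholder population split of 0.5; not the actual Census value.
male, female = gender_split(0.24, 0.25, male_pop_share=0.5)
print(round(male * 100), round(female * 100))  # 49 51

groups = [(18, 24, 17.7), (25, 34, 22.5), (35, 44, 19.5)]
print(round(share_in_range(groups, 18, 39), 2))  # 49.95
```

Splitting a straddling group in half ignores how ages are actually distributed within it, which is the simplification the text acknowledges.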

Demografying Twitter

For our own measurement, we already had a data set of 500,000+ random Twitter users, collected in 2016 for another project using the public Twitter API. The profiles were collected completely at random to avoid bias in the source data. For the research we cleansed this data to get a quality sample of US Twitter accounts only. To do so, we used the available profile location data and applied text mining algorithms to keep only accounts of people (those with personal names) located in the US. Additionally, we used only accounts that had been active during the 30 days before they were collected. Finally, we extracted a random sample of 10,000 US accounts.
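
The cleansing and sampling steps above could look roughly like the Python sketch below. It is an illustration, not the pipeline actually used: the `Account` fields and the `draw_sample` function are hypothetical, and the location and personal-name checks are assumed to have been computed already by whatever text mining is applied upstream.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Account:
    # Hypothetical fields; the post does not describe its data schema.
    name: str
    in_us: bool              # result of the location / text-mining check
    has_personal_name: bool  # result of the "real person" name check
    last_active: datetime
    collected_at: datetime


def draw_sample(accounts, size, active_days=30, seed=None):
    """Keep active US personal accounts, then sample uniformly at random."""
    window = timedelta(days=active_days)
    eligible = [
        a for a in accounts
        if a.in_us
        and a.has_personal_name
        and a.collected_at - a.last_active <= window
    ]
    rng = random.Random(seed)
    # Never ask for more accounts than survived the filtering.
    return rng.sample(eligible, min(size, len(eligible)))
```

Filtering first and sampling afterwards keeps the sample uniform over the eligible accounts; with the numbers in the text, `size` would be 10,000.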

After that we applied our proprietary technology to detect the age and gender distributions of these accounts. The resulting data was compared with Pew Research and comScore.
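
The post does not define how a measured distribution is scored against a reference. One reading that reproduces the 96% gender figure from the key findings is to subtract the summed absolute gaps, in percentage points, from 100. The Python sketch below shows that reading purely as an assumption; the function name is hypothetical and the actual metric may differ.

```python
def distribution_agreement(a, b):
    """100 minus the summed absolute gaps between two percentage distributions.

    a, b: dicts mapping the same category names to percentages.
    This is an assumed scoring rule, not one documented in the post.
    """
    gap = sum(abs(a[k] - b[k]) for k in a)
    return 100 - gap


print(distribution_agreement(
    {"male": 49, "female": 51},
    {"male": 51, "female": 49},
))  # 96
```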

Conclusion

Demografy can be used as a new viable solution for measuring demographics on a large scale. Unlike traditional approaches, it doesn’t require long-term, expensive investments while showing comparable accuracy. At the same time, it should be noted that Demografy is limited to audiences with personal names. For instance, it can’t be used to measure anonymous site audiences. However, it can be used to measure sites such as social networks that publish their users’ profile data. It can also be applied to marketing lists and other data sources containing names.

Demografy

Privacy by design AI platform that predicts customer demographics using only names - www.demografy.com