Each year we scrape data from tens of millions of unique consumers in the US alone, and from tens of millions more across numerous other markets. This generates hundreds of millions of data points, and the sampling is completely randomized. Our data comes from forums, comment sections on blogs, news, e-commerce, and video sites, and other types of long-form (deep) engagement platforms. We do this because our interest is to decode the culture around a topic: to identify the universe of topics surrounding a subject, as well as the depth of the relationships among them.
Ethnography (the study of a culture and its language to decode meaning) requires a minimum level of data quality.
When we scrape data, we look for platforms that satisfy two important quality criteria, which together make real ethnographic analysis possible.
- Some form of anonymity: Platforms that let users post under a pseudonym, giving them a feeling of anonymity that leads to more honest discussion on topics ranging from people's deep-rooted underlying beliefs to their opinions and attitudes on matters both public and private.
- Long-form discourse: Platforms that enable real discussion rather than encouraging users to hit 'like' or 'share'. The act of having to compose sentences to articulate one's feelings and opinions lets us not only understand what people are saying, but also get at why they're saying it. This is why we don't scrape platforms like Facebook, Instagram, or Twitter: they satisfy neither of these criteria and therefore deliver low-quality data from an ethnographic standpoint.
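The two criteria above amount to a simple screening rule over platforms. As a minimal illustrative sketch (the `Platform` fields, the word-count threshold, and the example platforms are all hypothetical assumptions, not the actual selection pipeline), such a screen might look like:

```python
from dataclasses import dataclass

@dataclass
class Platform:
    name: str
    allows_pseudonyms: bool   # criterion 1: some form of anonymity
    median_post_words: int    # rough proxy for criterion 2: long-form discourse

# Illustrative threshold, not taken from the source: treat posts with a
# median length of 50+ words as genuine long-form discussion.
MIN_POST_WORDS = 50

def is_scrapable(p: Platform) -> bool:
    """A platform qualifies only if it satisfies BOTH criteria."""
    return p.allows_pseudonyms and p.median_post_words >= MIN_POST_WORDS

# Hypothetical candidates, for demonstration only.
candidates = [
    Platform("niche-forum", True, 120),
    Platform("photo-feed", True, 8),       # short captions: fails long-form test
    Platform("news-comments", True, 65),
]

selected = [p.name for p in candidates if is_scrapable(p)]
```

In practice the proxies for each criterion would be richer, but the key design point survives: both criteria are required, so a platform full of pseudonymous one-line reactions is excluded just as surely as one with real names.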