Each year we scrape data from tens of millions of unique consumers in the US alone, and from tens of millions more across numerous other markets. This generates hundreds of millions of data points, and the sampling is completely randomized. Our data comes from forums, comment sections on blogs, news, e-commerce, and video sites, and other types of long-form (deep) engagement platforms. We do this because our interest is to decode the culture around a topic: to identify the universe of topics surrounding a subject, as well as the depth of the relationships among them.
Ethnography (the study of a culture and its language to decode meaning) requires a minimum level of data quality.
When we scrape data, we look for platforms that satisfy two important quality criteria, which together make real ethnographic analysis possible.
- Some form of anonymity: Platforms that let users post under a pseudonym, giving them a feeling of anonymity that leads to more honest discussion on topics ranging from people's deep-rooted underlying beliefs to their opinions and attitudes on matters both public and private.
- Long-form discourse: Platforms that enable real discussion rather than encouraging users to hit 'like' or 'share'. The act of having to compose sentences to articulate one's feelings and opinions lets us not only understand what people are saying, but also get at why they're saying it. This is why we don't scrape platforms like Facebook, Instagram, or Twitter: they satisfy neither of these criteria and therefore deliver low-quality data from an ethnographic standpoint.
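The two criteria above amount to a simple screening rule over platforms. As a minimal illustrative sketch (the `Platform` fields, the word-count threshold, and the example platforms are all hypothetical assumptions, not the actual selection pipeline), such a screen might look like:

```python
from dataclasses import dataclass

@dataclass
class Platform:
    name: str
    allows_pseudonyms: bool   # criterion 1: some form of anonymity
    median_post_words: int    # rough proxy for criterion 2: long-form discourse

# Illustrative threshold, not taken from the source: treat posts with a
# median length of 50+ words as genuine long-form discussion.
MIN_POST_WORDS = 50

def is_scrapable(p: Platform) -> bool:
    """A platform qualifies only if it satisfies BOTH criteria."""
    return p.allows_pseudonyms and p.median_post_words >= MIN_POST_WORDS

# Hypothetical candidates, for demonstration only.
candidates = [
    Platform("niche-forum", True, 120),
    Platform("photo-feed", True, 8),       # short captions: fails long-form test
    Platform("news-comments", True, 65),
]

selected = [p.name for p in candidates if is_scrapable(p)]
```

In practice the proxies for each criterion would be richer, but the key design point survives: both criteria are required, so a platform full of pseudonymous one-line reactions is excluded just as surely as one with real names.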