TubeCensus Reveals YouTube's Hidden Creator Economy with 20 Years of Data
New dataset covers 30-36% of all YouTube content using Internet Archive captures...
Understanding YouTube's creator economy has been hampered by the platform's closed API, which doesn't expose full channel metadata or historical subscriber counts. Researchers needed a way to study how algorithm changes shape creator incentives and content trends without relying on black-box data. TubeCensus addresses this directly: it compiles nearly 20 years of YouTube page captures from the Internet Archive, linking and organizing them into a longitudinal dataset of channels and subscriber counts.
This approach is fully transparent and replicable, requiring no interaction with the official API. Validation shows TubeCensus captures creators responsible for 30-36% of all YouTube content, with strong coverage of prominent channels. The dataset is distributed as an easy-to-use pip package that hides the complexities of YouTube's identifiers and the Internet Archive's systems. Early exploratory analysis already reveals patterns in channel growth and content dynamics, setting the stage for rigorous studies of platform economics and algorithmic impact.
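To make the archive-based approach concrete, here is a minimal sketch of how captures of a single YouTube channel page could be listed through the Internet Archive's public Wayback Machine CDX endpoint. This is not the TubeCensus package or its API, which this summary does not describe. The helper names (`build_cdx_query`, `parse_cdx_rows`, `snapshot_url`), the placeholder channel URL, and the sample row are all assumptions made for illustration, not real data.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint (lists captures of a URL).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, year_from, year_to):
    """Build a CDX query URL listing successful captures of one page."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(year_from),
        "to": str(year_to),
        "filter": "statuscode:200",   # keep only successful captures
        "collapse": "timestamp:6",    # at most one capture per month
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(payload):
    """Turn a CDX JSON response (header row + data rows) into dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(capture):
    """Build the replay URL for one capture returned by the CDX endpoint."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"


# Illustrative response for a placeholder channel (NOT real data).
sample_payload = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,youtube)/user/example", "20100101000000",
     "http://www.youtube.com/user/example", "text/html", "200", "AAAA", "1234"],
])

query = build_cdx_query("youtube.com/user/example", 2006, 2024)
captures = parse_cdx_rows(sample_payload)
replay_links = [snapshot_url(c) for c in captures]
```

In a real pipeline the query URL would be fetched over HTTP and each replay page parsed for the subscriber count shown at capture time. That parsing step is omitted here because page layouts changed repeatedly over the years and the summary does not say how TubeCensus handles them.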
- Built from nearly 20 years of Internet Archive snapshots, bypassing the YouTube API entirely.
- Captures creators responsible for 30-36% of all YouTube content, with strong coverage of prominent channels.
- Available as a pip package for easy, replicable research access.
Why It Matters
TubeCensus opens YouTube's creator economy to rigorous, reproducible research, replacing a black-box API with transparent, archive-derived data.