Fandom stats and surveys and so on

[sticky post]Welcome, and introductions!
[This sticky welcome post will stay at the top to welcome all newcomers!]

This is a community to discuss fandom (fanworks and/or fans) from a quantitative/analytical angle.

If you're new here, please feel free to leave a note in the comments on this post introducing yourself and sharing a bit about your fandom interests and past fandom analyses (if applicable).  You can also ask any questions about the community here.

If you're new here, please feel free to make a new post and say hi!  :)  Tell us about what you're interested in, or what your background is, or ask questions.

Currently, more discussion is going on in #fandom stats and #fandom stats discussion over on Tumblr, but I'd like to also keep this community going for people who are happier on LJ and conversations that are better suited to threaded interactions.  So feel free to jump in on either/both platforms!

AO3 stats: collection and reduction
A quick intro, since this is my first post: 2005-ish was when I entered fandom, but I've only been interested in the stats of it for a few years now. Having a maths/physics background means that actually collecting data is a whole new world for me! I'm not on tumblr, but I have the same user name on AO3 and Dreamwidth.

Plunging into the deep end, I wrote some JavaScript code to calculate the ratio of any of the basic AO3 stats (hits, kudos, comments etc.) to any other and bin the results into a grading system. This actually started because I wanted to measure "log(kudos/hits)" as a proxy for the quality of a fic. Then I realised that this is not a great approximation. Which made me think:
  • Are there (literal, not computer) applications for this sort of on-the-fly, small-dataset, stat-extracting code?

  • Which of these raw stats are meaningful? What hypotheses can I test with them?

  • By combining the variables, what sort of trends and/or noise am I removing?

  • Can we make derived quantities which are less prone to uncertainties or that answer more interesting questions?

  • How much per-fandom calibration would this require?

  • In a megafandom, what numbers would you use to select a sample of "quality" fic small enough to then be judged by hand?
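
For anyone who wants to see the idea without my actual code, here is a minimal sketch of the ratio-and-binning step, written in Python rather than JavaScript. The bin edges, the grade labels, and the sample works are all made up for illustration; they are not the values my script uses.

```python
import math

def log_ratio(numerator, denominator):
    """Return log10(numerator / denominator), or None when either count is zero."""
    if numerator <= 0 or denominator <= 0:
        return None  # the log of a zero ratio is undefined
    return math.log10(numerator / denominator)

def grade(value, edges=(-2.0, -1.5, -1.0), labels=("D", "C", "B", "A")):
    """Bin a score: values below edges[0] get labels[0], and so on upward.

    The edges and labels are hypothetical and would need per-fandom calibration.
    """
    if value is None:
        return "ungraded"
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

# Hypothetical works, not real AO3 data.
works = [
    {"title": "Example fic 1", "hits": 1000, "kudos": 200},
    {"title": "Example fic 2", "hits": 5000, "kudos": 40},
    {"title": "Example fic 3", "hits": 0, "kudos": 0},
]

for work in works:
    score = log_ratio(work["kudos"], work["hits"])
    print(work["title"], score, grade(score))
```

Swapping in comments or bookmarks for either argument of `log_ratio` gives the other ratios.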

In short, is what I am doing actually practical? Am I missing useful or important functionality in the code? (TBH I am a JS newbie, so my code is probably going to make some of you cry. Nonetheless, I'm happy to share it.)

Sorry for the sudden avalanche of questions, I'm just happy to find other stats geeks lurking in fandom! :)

Beginners Doing Stats: Daredevil Kink Meme
I decided to write a post about a project dustysoulss and I are currently working on: analyzing the Daredevil Kink Meme, hosted on Dreamwidth. We're two beginners when it comes to statistics, and our fumbling might be interesting to others like us. There are a lot of fan communities on Dreamwidth, so the technicalities of gathering data from it might help someone considering a similar project.
And honestly, we're a bit stuck right now and really could use some help.

It all started because dustysoulss got interested in the behavior of the Daredevil Kink Meme. She was curious about the way the fandom develops: Daredevil is a new show, so it would be a good time to observe the fandom's preferences (especially main ships) taking shape. She thought it might be possible to get data programmatically through the Dreamwidth API, which is where I (an obsessive API tinkerer) came into the project.

Well, as for the Dreamwidth API, the result is disappointing: the Dreamwidth API can't be used to gather public data.

I expected something like Tumblr's or Twitter's API. These are designed, basically, as a lightweight alternative to the website - you can authenticate as a user and modify your own blog, you can browse someone's public blog posts, or you can browse aggregate pages (e.g. tags).
The DW API (which was originally the LJ API, but let's not go there) is older, and its approach is different. It's designed as an interface for desktop publishing applications and chat clients. You can get data about your own blog, but not about any other. This, unfortunately, isn't really explained in the documentation - I suppose older folks would expect that, but I didn't, and I spent all day neck-deep in LJ documentation pages dating back to the early 2000s.

But, hey, at least now you don't have to.

I ended up downloading the HTML and parsing it with Python scripts. The DW devs are explicit about their dislike for scraping, but since I got the data once and worked on an offline copy, I feel I did my best. If any of you are considering a similar project, I uploaded my scripts to GitHub - it's nothing fancy, but it's a good starting point and it might make your life easier.
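
To give a flavour of the offline parsing step, here is a minimal sketch that uses only Python's standard library. It is not the script from my repository: the tag name `h4`, the class name `comment-title`, and the sample markup are assumptions made up for illustration, so check the saved pages for the real structure.

```python
from html.parser import HTMLParser

class ClassTextCollector(HTMLParser):
    """Collect the text inside every <tag> element that carries a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []    # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # same tag nested inside a match
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data

def extract_text(html, tag, css_class):
    """Return the whitespace-normalised text of each matching element."""
    collector = ClassTextCollector(tag, css_class)
    collector.feed(html)
    collector.close()
    return [" ".join(chunk.split()) for chunk in collector.chunks]

# Made-up markup standing in for a saved page (hypothetical structure).
saved_page = (
    '<div><h4 class="comment-title">Prompt: <b>coffee shop AU</b></h4>'
    '<p>body text</p><h4 class="other">not a prompt</h4></div>'
)
print(extract_text(saved_page, "h4", "comment-title"))
```

Working from files saved to disk means the site is only hit once, however many times the parser is rerun.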

As for the actual data, it turns out that we are somewhat lucky: the Daredevil kink meme admins maintain a Delicious account with links to the prompts, organized by tags.
Only 'somewhat' though, because the Delicious API is the same kind as DW's - and the fairly new site design means that scraping it is even harder than scraping DW. You literally have to scroll and scroll and scroll until you reach the first bookmark, then save the entire page, and then somehow parse the monster file. (This is the script I used for the parsing.)

After I got the additional data from Delicious, I merged both sets. Now we have a csv file and a json file with a list of prompts, with the date posted, title of the original prompt, url, the prompt itself, number of replies it had on DW, and a list (or array) of tags.
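
A minimal sketch of that merge, assuming both sets can be matched on the prompt URL (that key, the field names, and the sample rows are assumptions for illustration, not a description of my actual files):

```python
import csv
import io
import json

FIELDS = ["date", "title", "url", "prompt", "replies", "tags"]

def merge_by_url(dw_rows, delicious_rows):
    """Attach Delicious tags to DW prompts that share a URL; unmatched prompts get []."""
    tags_by_url = {row["url"]: row["tags"] for row in delicious_rows}
    merged = []
    for row in dw_rows:
        combined = dict(row)
        combined["tags"] = tags_by_url.get(row["url"], [])
        merged.append(combined)
    return merged

def to_csv(rows):
    """Serialise rows as CSV text, flattening the tag list into one ';'-joined cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(dict(row, tags=";".join(row["tags"])))
    return buffer.getvalue()

# Hypothetical sample rows, not real kink meme data.
dw = [
    {"date": "2015-04-20", "title": "Prompt 1", "url": "https://example.org/1",
     "prompt": "text of prompt 1", "replies": 3},
    {"date": "2015-04-22", "title": "Prompt 2", "url": "https://example.org/2",
     "prompt": "text of prompt 2", "replies": 0},
]
delicious = [{"url": "https://example.org/1", "tags": ["A/B", "au"]}]

merged = merge_by_url(dw, delicious)
print(to_csv(merged))
print(json.dumps(merged, indent=2))
```

JSON keeps the tags as a real list, while CSV needs them squashed into a single cell, which is one reason to keep both files.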

...and we're not sure what next.

Dusty originally had some vague ideas about what she'd like to look into:

  • what's being requested on the kinkmeme, and how does it change over time?

  • are there some "juggernaut ships" forming?

  • what kind of prompts are these - AUs, crossovers, RPF, hurt/comfort,...?

  • how many comments are on which prompts?

And as I gathered the data, I had some ideas of combinations that might be interesting:

  • total number of prompts per ship

  • a graph comparing the number of prompts per ship over time (per day? week?)

  • something with the non-ship tags? maybe non-ship tags vs ship tags - which non-ship tag is in majority for which ship

  • percentage of filled prompts per ship

  • number of comments on prompt by ship
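
Several of these reduce to counting, so here is a rough sketch of how the first few could be computed from the merged list. It assumes that any tag containing a "/" is a ship tag and that dates are ISO strings; both are guesses about the data, and the sample prompts are invented.

```python
from collections import Counter, defaultdict
from datetime import date

def ship_tags(tags):
    """Treat any tag containing '/' as a ship tag (an assumption about the tagging)."""
    return [tag for tag in tags if "/" in tag]

def prompts_per_ship(prompts):
    """Total number of prompts per ship."""
    return Counter(ship for p in prompts for ship in ship_tags(p["tags"]))

def prompts_per_ship_per_week(prompts):
    """Count prompts per (ship, ISO year, ISO week)."""
    counts = Counter()
    for p in prompts:
        iso = date.fromisoformat(p["date"]).isocalendar()
        for ship in ship_tags(p["tags"]):
            counts[(ship, iso[0], iso[1])] += 1
    return counts

def mean_replies_per_ship(prompts):
    """Average number of replies on prompts tagged with each ship."""
    replies = defaultdict(list)
    for p in prompts:
        for ship in ship_tags(p["tags"]):
            replies[ship].append(p["replies"])
    return {ship: sum(values) / len(values) for ship, values in replies.items()}

# Invented prompts with placeholder ship names.
prompts = [
    {"date": "2015-04-20", "tags": ["A/B", "au"], "replies": 4},
    {"date": "2015-04-22", "tags": ["A/B", "hurt-comfort"], "replies": 2},
    {"date": "2015-04-28", "tags": ["A/C"], "replies": 0},
]

print(prompts_per_ship(prompts))
print(prompts_per_ship_per_week(prompts))
print(mean_replies_per_ship(prompts))
```

The weekly counts are already in the shape most plotting tools want for a line-per-ship graph.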

But it's all vague, and aimless, and we don't know what to focus on. On top of that, neither of us has experience with making graphs, so we don't know what tools are out there.

Which is the point where I ask for help:

  • what kind of "output" would be best to have from this data? I mean, my knee-jerk reaction is "let's make graphs!", but I know it's not just about graphs, it's about giving insight into a thing... or, thinking as a journalist, about telling a story.

  • what can we use to produce it? What tools do you have experience with? What would you recommend to two complete beginners with varying experience with coding? I might be able to wrangle a graphing library, either in Python or JavaScript (preferably something more beginner-friendly than D3.js, though); Dusty knows some basics in Python.

What do you think?

Measuring representation in media
[also posted something similar about this on Tumblr; I still am not sure whether this LJ community is sustainable when people are mostly on Tumblr these days... but I do find threaded conversations a lot easier here.]

I'm interested in trying to assess the degree of representation in media -- of gender, initially, but if I can eventually figure out how to assess other representation (POC, queer characters or actors, religion, country of origin, etc.) that would also be awesome.  In part, I'd like to try to quantify disparities in modern popular media. But from a fandom perspective, I'm also really interested in how this ties into shipping.  How much of the predominance of M/M and lack of F/F in fandom is explained by the gender representation (or lack thereof) in the source material?

Figuring out how to get representation data about source media is tricky.  If I stick to TV and/or movies, IMDB at least lists what actors appear in, and the actors' genders.  I can look at billing order or number of episodes that way.  Maybe there are also other good metrics I should pay attention to?

Better still would be to measure screen time or lines of dialogue. I did an analysis for Sherlock of the number of lines of dialogue each character has based on arianedevere's transcripts, but that’s too time intensive to do for a large batch of shows (since there’s no central repository of transcripts that are all formatted the same way).

These are some numbers I got when horsing around with various different methods of measuring representation from the data on IMDB:

  1. I could compare the number of actors of different genders who were in at least X of the episodes -- e.g.,
    in BtVS, there are 8 women and 8 men who are in at least 20/145 episodes;
    in Xena, there are 4 women and 4 men who are in at least 10/134 episodes;
    in Sherlock, there are 4 women and 6 men who are in at least 5/15 episodes (note that as well as the 9 main episodes, IMDB also counts Unaired Pilot, Many Happy Returns, the Christmas Special, and the 3 episodes of s4 based on rumored casting so far)

  2. I could compare the total number of appearances by all male actors to total number of appearances by all female actors (again excluding minor guest roles) -- e.g.,
    in BtVS, there are 623 appearances by women and 551 appearances by men (of characters who appeared in at least 20 episodes);
    in Xena, there are 290 appearances by women and 107 appearances by men (of characters who appeared in at least 10 episodes);
    in Sherlock, there are 32 appearances by women and 64 appearances by men (of characters who appeared in at least 5 episodes)

  3. As I said, lines of dialogue is too labor intensive to do on a big scale -- but just by way of comparison for Sherlock:
    in Sherlock, there are 1K lines of dialogue by women and 7.5K lines of dialogue by men (of characters with at least 50 lines)
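
To make the three methods easier to compare, the counts above can be turned into the share attributed to women. A minimal sketch using only the numbers listed in this post (the function name is mine):

```python
def share(women, men):
    """Fraction of the total (women + men) that is attributed to women."""
    return women / (women + men)

# Method 1: recurring actors (women, men), from the counts above.
recurring_actors = {"BtVS": (8, 8), "Xena": (4, 4), "Sherlock": (4, 6)}

# Method 2: total appearances by recurring actors (women, men).
appearances = {"BtVS": (623, 551), "Xena": (290, 107), "Sherlock": (32, 64)}

# Method 3: lines of dialogue (women, men), Sherlock only, rounded as in the post.
dialogue_lines = {"Sherlock": (1000, 7500)}

for label, table in [("recurring actors", recurring_actors),
                     ("appearances", appearances),
                     ("lines of dialogue", dialogue_lines)]:
    for show, (women, men) in table.items():
        print(f"{show}, {label}: {share(women, men):.0%} women")
```

For Sherlock the share drops from 40% of recurring actors to 33% of appearances to about 12% of lines, which is why the choice of metric matters so much.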


Suggestions for how best to measure gender representation in TV shows from easily available online data (e.g., IMDB or Wikipedia) are welcome! Also for how to measure other representation -- POC, queer characters or actors -- but I suspect that’s much harder. :-/

Any ideas?  Thoughts?  Warnings?  Pointers to related work?  :)

Drawing data from tumblr
Hey, all! I wanted to let anyone interested know that I have some code up, with explanations, at my GitHub blog.   The topic is sampling from the Tumblr API and trying a simple clustering algorithm on tags related to the tag "Supernatural."

The results are not earth-shattering, but for me it was worth going through an exercise that drew from a web API, looked at fandom data, and did some basic data science--using python.

The explanations assume working knowledge of python.  I also assume some mathematical background, but I haven't gone into huge amounts of depth and I've tried to explain things in English as well using my example data set to illustrate the concepts.  A background in the show Supernatural is not assumed either, but could probably only help :)
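
For readers who want a taste before clicking through, here is a small sketch of one common first step for this kind of analysis: scoring how often two tags appear on the same posts with Jaccard similarity. This is not necessarily the algorithm used on the blog, just an illustration, and the sample posts are invented rather than pulled from the API.

```python
from itertools import combinations

def tag_index(posts):
    """Map each tag to the set of post indices that carry it."""
    index = {}
    for i, tags in enumerate(posts):
        for tag in set(tags):
            index.setdefault(tag, set()).add(i)
    return index

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)

def similar_pairs(posts, threshold=0.5):
    """Return (tag1, tag2, score) for tag pairs whose similarity meets the threshold."""
    index = tag_index(posts)
    pairs = []
    for t1, t2 in combinations(sorted(index), 2):
        score = jaccard(index[t1], index[t2])
        if score >= threshold:
            pairs.append((t1, t2, score))
    return sorted(pairs, key=lambda pair: -pair[2])

# Invented tag lists standing in for sampled posts.
posts = [
    ["supernatural", "dean winchester", "fanart"],
    ["supernatural", "dean winchester"],
    ["supernatural", "castiel", "fanart"],
    ["supernatural", "castiel"],
]

for t1, t2, score in similar_pairs(posts):
    print(f"{t1} ~ {t2}: {score:.2f}")
```

A similarity table like this can then be handed to whichever clustering method you prefer.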

Constructive criticism is welcome here, or on my tumblr (, or at

Searching for a collaborator
Hi folks,

I'm new to LJ in this incarnation (the less said about my teenaged-self's LJ, the better...), but I'm thrilled to find you! I have been working a bit on a project and I'd like to collaborate with someone. Destinationtoast pointed me towards this LJ.

It's not fandom stats; it's really narrative analysis that incorporates statistics. I'm thinking about how (or whether) specific character agency in Sherlock correlates to commonly-expressed fan reactions. Put another way, this project started with this question: Does some folks' frustration with S3 John come about because John really had unprecedented agency in that season? Wait, did he, in fact, have more agency than previously?

So I'm going through episodes scene by scene and identifying narrative agents and things. (I'm also figuring out scene timestamps and durations, which is helpful for calculating things like how much of each episode is spent in 221b!) I don't think I'm ultimately going to work on answering the prompting question, but I'm curious to see what shakes out after I've gathered the data, and what sorts of ideas present themselves. I've finished ASiP and TGG, and I'm about halfway done with TEH. Ideally, I'd like to do all the episodes, but this will take a while.
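
As an example of the kind of calculation the timestamps allow, here is a minimal sketch of the share of an episode spent in each location. The column names and the three sample scenes are hypothetical; they do not come from the real spreadsheet.

```python
from collections import defaultdict

def to_seconds(stamp):
    """Convert an 'H:MM:SS' or 'MM:SS' timestamp to seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def time_share_by_location(scenes):
    """Return each location's fraction of the total scene time."""
    totals = defaultdict(int)
    for scene in scenes:
        duration = to_seconds(scene["end"]) - to_seconds(scene["start"])
        totals[scene["location"]] += duration
    episode_total = sum(totals.values())
    return {location: secs / episode_total for location, secs in totals.items()}

# Made-up scenes for illustration only.
scenes = [
    {"start": "0:00:00", "end": "0:10:00", "location": "221B"},
    {"start": "0:10:00", "end": "0:25:00", "location": "elsewhere"},
    {"start": "0:25:00", "end": "0:30:00", "location": "221B"},
]

print(time_share_by_location(scenes))
```

Grouping by narrative agent instead of location gives a per-character total from the same rows.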

I'm looking for someone to help me with data analysis and visualization. I am handy with a spreadsheet and I can calculate whatever needs to be calculated myself, but I'm not sure I'm asking the most interesting questions of these data and I'm curious what some others might think. Also, I'm not entirely sure what appropriate visualizations might be, and could use some suggestions.

The spreadsheet for ASiP is here:

Does anyone want to play around in it with me? The organization and some of the content may be a bit opaque; let me know if you'd like me to explain how the ss and current calculations are set up.

(This is not the most straightforward question, but I'm still exploring the idea myself.)

On the challenges of comparing demographic data from AO3, FFN, and Wattpad
Making comparisons between data from multiple studies is delicate business. When I saw destntoast's post about the Wattpad stats, my experimentalist alarm bells started ringing because the numbers on each archive come from very different subsets of their users.

Sampling is a really important topic in empirical work. In most research, we can't measure all the things, so we do our best to measure some of the things (a sample or subset) in such a way that these fewer measurements reflect the true distribution across all the things.

I could talk about sampling for ages, so to keep the focus, this post specifically discusses the information about user ages. However, the same or similar factors matter for other kinds of demographics too.

Above the cut, I've tried to explain the sources of user ages from each archive, to the best of my knowledge. Below the cut, I'll get into their expected consequences on the relationship between these numbers and the true distribution of user ages.

AO3 census
The information about the users of AO3 comes from an anonymous survey conducted by Centrum Lumina in 2013. It was posted on tumblr, was widely reblogged, and a remarkable 10,000 people submitted information on their identities and their uses of the archive, whether or not they had accounts. Of those who participated, 99.9% reported their ages.

To put this number in perspective, nearly 4000 of those who answered reported archiving works on AO3. From my posting rates data, a quick and dirty estimate puts that at around 1% of people who had, by that time, posted works to the archive. If this was a random sample of archiving users of AO3, we could expect a margin of error of +/- ~1.5% on a lot of the resulting stats on that subset of respondents. And if the other types of users are represented in the same proportion, their numbers would have a similar degree of accuracy. But both of those ifs are pretty big, as carefully acknowledged by the diligent Centrum Lumina.
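
For anyone who wants to check the arithmetic, here is a minimal sketch of the standard 95% margin of error for a proportion under simple random sampling. The worst case p = 0.5 and the population of 400,000 (what a 1% sample of 4000 would imply) are my assumptions for illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96, population=None):
    """Margin of error for a proportion from a simple random sample of size n.

    p = 0.5 is the worst case; z = 1.96 corresponds to 95% confidence.
    If a population size is given, the finite population correction is applied.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error

print(f"{margin_of_error(4000):.2%}")
print(f"{margin_of_error(4000, population=400_000):.2%}")
```

Both come out at roughly 1.5%, because a 1% sample barely triggers the finite population correction. None of this helps if the sample is not random, which is the real worry here.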

FFNet User Accounts
The demographic data about FFNet comes from a (well sampled) subset of new user accounts over the year 2010. This is NOT a sample of people who use this site like the AO3 census: you don't need an account to read fic on FFNet or submit reviews, and we don't know how well this sample of new users in 2010 represents those who were active at that time but joined before 2010.

The data on user ages comes from the profiles of this sample of new user accounts. Specifically, the numbers are taken from the 9% of these new users who chose to declare their age in the text of their public profiles. Yeah. The stats reported in the FFNet analysis (the margin of error) treat this source of numbers as a random sampling of new users' ages, but, as I'll get into later, there are so many reasons for that not to be the case.

Wattpad numbers
To be honest, I don't know where the numbers on Wattpad user ages come from, but here is my best guess. I don't think they are from an independent anonymous survey of people who use the archive (like the AO3 census). I googled for such a thing and found nothing: were these numbers from a survey, it can't have been very big.

However, these numbers come straight from Wattpad itself, so it's more likely they are from user accounts like the FFNet data. Note: Whether or not they post content, the users of Wattpad are strongly encouraged to have accounts by both the website interface and functionality, much more so than FFNet or AO3. It is possible to read stories on Wattpad without an account, but you have to click past pop-up windows and start from a direct link to a work, rather than the splash page of

Anyway, users with accounts on Wattpad have the option of reporting their date of birth on their user profiles in a dedicated date field. They can further opt to make that information visible on their public profile. If they are from user accounts, the numbers passed on by Emily might reflect the ages across all accounts reporting date of birth or some selected subset, though what kind of subset I couldn't say.

So what?
The comparisons between the archives' user ages data in destntoast's Wattpad post sent me into a bit of a panic because each set of numbers is likely to be biased for different reasons and in different ways. Given their origins, I'd bet that:
1. the AO3 ages are either pretty accurate or skew a little older than the true population
2. the FFNet ages skew substantially younger than the true population
3. the Wattpad data could skew younger, older, or towards the late teens (specifically over seventeen) depending on how account holders expect Wattpad to use their profile details, and how/whether the Wattpad user accounts were sampled.

Below the cut I get into the hypothesized factors behind these expectations.


Edits warning: I will undoubtedly make edits, but they will probably be trivial orthographic corrections. If some sentence is unparseable, my dyslexic self will probably never notice, so please point it out.

Whoops! :)
Members of this community have raised good questions/issues with my Wattpad post here or over email (THANK YOU).  Unfortunately, I won't have time to fix it until tomorrow -- and it's already out in the world getting reblogged, so I can't just take it down without doing more harm than good.

So for now, I'm taking down the bit behind the Read More and leaving a brief explanation for why I did (and the fact that I'll correct/improve the content later).  But I don't want to just get rid of my original post.  So I'm putting it here for archiving and further discussion:

Please keep adding thoughts and suggestions (here or in my previous post), and I'll incorporate them in my New Improved Write-Up tomorrow/this weekend/as soon as I can.

(Also, if you have better suggestions about how to retract a Tumblr post/deal with errors, I'm all ears... the fact that reblogs make copies even after the original has changed limits the ways to do that.)

Edit: also, I'm not distraught about this and don't think it's a Big Deal. We all make mistakes or oversimplifications; I have done so in the past; I will do so again. (Frankly, some of my past posts could also have benefited from this kind of concrit.) I really appreciate you guys helping to point out problems and nuances!   That's one of the kinds of thing I hoped this community would be for.  :)  

Edit: new version of the post now up!

Wattpad stats and thoughts about dealing with for-profit sites
I just posted some stats about Wattpad, based on numbers provided to me by someone on the Wattpad staff (herself a fangirl, and working on developing their Fanfiction genre).  I welcome feedback about the stats, and ideas for what data to ask for next; my impression is that they're interested in fandom stats and open to sharing more data, but also ridiculously busy (it took them months to get me these stats).

I also welcome discussion about the process of working with a (for-profit) site directly to get and share fandom stats.  I feel slightly wary about the whole thing, because I feel like it would be easy to appear to be a shill for Wattpad.  (Though I hope I'm being transparent enough about everything that there's less risk of that?)   On the other hand -- DATA!  :D   Also, I'm trying not to let the fact that Wattpad now follows me on Tumblr dampen any of my critical thoughts, but who knows... always a risk.

(I also realize that it's not like I'm a NYTimes reporter, and nobody is actually holding me accountable for any of this.  This is all very low stakes.  But it's an interesting opportunity to think about #ethics in fandom stats.  /sorry )

Edit: I temporarily took down parts of the original post and explained why here:  Further discussion more than welcome in that thread or below!  :)

Data, Data, Data (About Fandom!) chat log and results
Today, Unlocked Con had a live chat/panel on fandom stats and surveys (see previous post). It was a blast, a big hectic blast, as can be seen in the transcript. Still, we did get through most of our intended topics. The Sherlock stats cheatsheet that destntoast gathered was a great start to the conversation, and the tumblr post includes extra links to some other fandom studies relevant to the discussion.

The panel was surprisingly interactive! When the chat turned to the topic of surveys and what we’d like to know about fans, shinysherlock suggested we take a poll of ages present. 56 participants volunteered for this impromptu study and the following plots report the results (my thanks to participant blacktail for catching a calculation error!)

In the chat, we had an average age of 34, but there was a wide distribution of ages participating, with a standard deviation of 12 years and a range from 15 to 55. This is older than many people expect (inside and outside of fandom), but our chat participants were hardly a random sample. I wouldn’t assume that all Sherlock fans get this excited about live spreadsheet editing!
We also tried to get a sense of where people were from, but participation in this poll was much smaller. Of those who answered, most were from North America and the rest were in Europe, but who knows about the other 50+ people who popped through during 1.5 hours of chatting.

