Occasionally, GroupLens receives requests for datasets that we possess. In many cases we are able to provide this data, as we have with the MovieLens rating datasets. One of the collections we have is a 10% sample of Wikipedia page requests (essentially every 10th HTTP request), gathered since April 2007. This data accumulates at a rate of about 5 GB/day, and we currently have around 4 TB of unprocessed compressed data, or approximately 40 TB uncompressed. While we sometimes get requests for this data, its sheer size makes it difficult for us to make it available for download.
Although we cannot make this data available for download, depending on your request and our availability, we may be able to collaborate with you by performing the analysis you need on our data.
Also, we are not the only ones who have Wikipedia page view data. There are several other sources that make page view data available. Here are some of these resources and the types of data they offer:
- stats.grok.se – Provides data on per-page view counts by month.
- dammit.lt/wikistats – Has files containing hourly per-page view count snapshots, with archives that currently go back to October 2009.
- Wikipedia Page Traffic Statistics on AWS – Hourly traffic statistics for a 7 month window (October 2008 - April 2009) are available on Amazon Web Services. This data was assembled from files that were available from dammit.lt/wikistats at the time.
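The hourly files from dammit.lt/wikistats (and the mirrored AWS dataset) use a simple space-separated layout; assuming each line reads "project page_title view_count bytes_transferred" (with underscores instead of spaces in titles), a minimal sketch for tallying views might look like this:

```python
from collections import Counter

def parse_line(line):
    """Split one "project page_title views bytes" line into typed fields."""
    project, title, views, nbytes = line.strip().split(" ")
    return project, title, int(views), int(nbytes)

def top_pages(lines, project="en", n=3):
    """Tally views per title for one project; return the n busiest pages."""
    counts = Counter()
    for line in lines:
        proj, title, views, _ = parse_line(line)
        if proj == project:
            counts[title] += views
    return counts.most_common(n)

# Hypothetical sample lines; real files contain millions per hour.
sample = [
    "en Main_Page 242332 4737756101",
    "en Special:Search 5012 38291023",
    "de Wikipedia 1301 9932011",
]
print(top_pages(sample, "en", 2))
# -> [('Main_Page', 242332), ('Special:Search', 5012)]
```

Since page titles encode spaces as underscores, a plain split is sufficient; at the real data volumes you would stream the (compressed) files rather than hold them in memory.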
I just read an article called "Why study time does not predict grade point average across college students" by Plant, Ericsson, Hill, and Asberg. The article is an interesting look at past data on what predicts GPA, and a small-scale (88 student) study at one university. The authors are big fans of the "deliberate practice" model of learning, and focus on seeing if that translates into academic performance. Some of the interesting information (mostly from past studies):
- studying without distraction predicts higher grades (no TV, no iPod, no study partners)
- students who study without distraction study for *fewer* hours, but get *higher* grades
- focused study is important. Just as many recreational tennis and golf players don't get better over 20 years of playing, just "reading" isn't enough. Deep thinking, analysis, and putting ideas together correlate with better grades.
- scheduling is important. Planning ahead for getting school activities done, and studying at regularly scheduled times correlate with higher GPA.
- going to class predicts higher GPA
- working too many hours, and partying too many hours both predict lower GPA
Overall there weren't a lot of big surprises, but I did find it interesting how important focused, uninterrupted study is. In fact, the total amount of study time did NOT predict good grades. A shorter amount of more focused study was more valuable. (Students tended to have to go to the library to get the more focused study time.)
What works for you?
Doctors: do no harm.
Authors: keep the reader turning the page.
Speakers: keep the listener, uh, listening.
The title of this post and the third aphorism represent the sine qua non for a successful research talk (or any kind of public speech). Once the audience stops listening, you, the speaker, might just as well stop speaking.
I've been thinking about this ever since the CSCW conference last week. I saw quite a few talks on subjects I'm interested in, with good research, good content in the presentation, and good - i.e., fluent - delivery. I was engaged by the content in many cases and asked a lot of questions.
However, in reflecting on my experience, many of the talks began to seem, hmmmm..., monotonous. The speakers didn't look animated. They didn't use much of a dynamic range in their speaking: they weren't loud sometimes and quiet others, fast sometimes and slow others. There weren't too many jokes (shout out to Cliff and Reid, two speakers who did joke a bit). The slides too were pretty homogeneous: none that shouted "I'm important - notice me!".
Again, the content was good - it wouldn't have gotten in otherwise!
But speakers, lively up yourselves! It'll keep your audiences' ears open, so that your great content will get in. (And please: if you do a lively presentation with poor content or poor organization or poor slides, it'll just seem ... poor.)
I've been spending a lot of time lately thinking about survey writing. In 2008, I took a short three-day course with Jon Krosnick of Stanford University, which made me think about survey writing in a new way. In particular, I started realizing that the surveys I wrote were poorly written.
Since then, it seems like I keep finding poorly written surveys everywhere I turn. Here are some examples I've found recently in my everyday life:
The US Postal Service recently sent me a Postal Customer Questionnaire because they were thinking about closing my branch. "If you now receive Post Office box service, you will be able to transfer your remaining box rent credit to another post office, or you may be eligible to receive a partial refund. How would you feel about consolidating the Dinkytown station with other postal stations? Better, Just as Good, No Opinion, or Worse" I answered No Opinion. Then I crossed it out and marked Worse. Then I crossed that out, marked No Opinion, and wrote a three-sentence explanation in the "Please explain" section. Why was this such a hard question to answer? Primarily because they never asked how I'd feel about the consolidation WITHOUT the refund. The question merged my opinion about the consolidation with my feelings about the refund. Personally, I was mad that they were consolidating, and I'd feel cheated if they didn't refund my money, but really the refund wouldn't change my opinion at all. Nowhere on the survey did they ask me anything to this effect.
This second example isn't exactly a survey, but is still getting at some of the problems with survey writing. I'm having problems with allergies and need to go see an allergy specialist. So the clinic sent me my paperwork so I could fill it out before my appointment. Leaving aside many of my other complaints (and there are many!), the first main page has a section entitled "Chief complaints of patient." For each option you are supposed to check "Yes" or "No." The options are Asthma, Rhinitis (Hay fever), Urticaria (Hives), Eczema, Sinusitis, Chronic recurrent bronchitis, Nasal polyps, Recurrent otitis media, Recurrent pneumonia, G.I. disturbances (colic, diarrhea, etc), Insect sting reaction, drug reactions, or blank lines. Now I'm a pretty smart person. I've been in school for a grand total of twenty-one years now, but I can't tell you what many of those things are, and I can't tell you which ones I should select. I have a runny nose and a cough. I've been diagnosed with something, but I forget what it is, and it didn't include the second two symptoms, just the runny nose. Why on earth is this questionnaire that is obviously for the patient or patient advocate full of doctor jargon instead of patient jargon?
Now that I know better, I want to do my best to avoid writing bad survey questions, but at the same time, it's incredibly difficult to write good survey questions. So what I've been doing is writing my same old, same old questions and then revising...and revising...and revising. Trying to revise them to turn them into good questions isn't easy, but I try. I also ask for a lot of feedback and am very self-critical. One proof-reading pass doesn't cut it for a survey, even if it's only going out to 10 people. That would reflect poorly on me, my advisor, my lab, and my university...so I do more work. Hopefully if you take one of my surveys, you'll see the result of this work, and if not, I hope you'll take a moment to let me know.
This article on Read/Write Web describes how Stack Overflow, the tech Q&A site, will let other sites use its software, changing the look and feel while keeping the Q&A goodness.
Since I started riding the bus to work, I've gained about 80 minutes of reading time a day, and lately I've been reading recent issues of interactions. I've found many of them quite interesting, probably because they're rather far afield from my usual concerns. They're mostly by and for HCI (broadly construed) practitioners, rather than for researchers. One particular article got my attention: Learning from Activists: Lessons for Designers, by Tad Hirsch.
Hirsch talks about how activists have been technology innovators, touches on some examples, and talks about what the design process is like under these conditions. For example, the "immediacy of activist projects, coupled with a perpetual lack of funding, forces a kind of rough-and-tumble innovation". Sounds right.
Things get more interesting later, as Hirsch says that "Activists' willingness to engage in extra-legal activity also enables unique design opportunities". He hastens to add that he doesn't mean violence or vandalism, but the "exploit[ation] [of] excess capacity", like squatting in abandoned buildings or using wireless networks without their owners' permission.
Finally, he talks about how "contestational designers" [i.e., those who design for activists] are "openly partisan practitioners who take sides in pressing issues of the day. They are neither objective technicians nor hired guns -- images that continue to dominate the technical development community".
It was this final point that I found most provocative. On the one hand, I too feel that it is imperative for all educated people -- and I hope that includes not just designers, but also software professionals, academics, and students -- to "take sides in pressing issues of the day". However, if we do that in our roles as professionals -- as designers, as researchers, as academics, etc. -- do we lose our professional community? Note that Hirsch isn't proposing this [that all designers, let alone all HCI researchers, should become "contestational designers"] -- I'm just tracing out the implications of his advice.
For example, it's obvious that the HCI community is heavily liberal and leftist. "Everyone" at CHI 2009 was ecstatic about Obama's election, but this fact was mostly "informal". That is, the conference program per se did not reflect it. But what if this changes? What if activist papers play a larger and larger role in our professional community? Would they all be liberal-activist papers? Would that drive out non-liberals? Or would conservative activists, Chinese nationalist activists, anti-abortion activists, gun-rights activists all be represented? Would you be happy about papers about providing technology support for radical environmentalists to shut down a coal plant, or for radical anti-abortion activists to shut down a family planning clinic?
I don't know... However, as I get older, I am more and more interested in reconciling my personal beliefs and my professional practice. Hirsch's article made this topic more urgent for me.
Hello all. We've been asked by several of the Netflix Prize teams whether they can use the MovieLens datasets in training their algorithms. The answer is yes! We're happy to encourage algorithmic experimentation using our datasets -- and you don't even have to share any of your winnings with us :). We only ask that you credit the MovieLens datasets on your web site and in any written descriptions of the resulting algorithms.
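For anyone getting started with the datasets, here is a minimal sketch of reading ratings, assuming the tab-separated layout of the MovieLens 100K u.data file (user id, item id, rating, timestamp per line); the sample rows below are hypothetical:

```python
import csv
from io import StringIO

def load_ratings(fileobj):
    """Yield (user_id, item_id, rating) tuples from a u.data-style file."""
    for user, item, rating, _timestamp in csv.reader(fileobj, delimiter="\t"):
        yield int(user), int(item), int(rating)

# Two hypothetical rows standing in for an open("u.data") handle.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"
ratings = list(load_ratings(sample_file := StringIO(sample)))
print(ratings)
# -> [(196, 242, 3), (186, 302, 3)]
```

The larger MovieLens releases use a different delimiter, so check the README that ships with each dataset before parsing.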
In a "too delicious to be true" story, Amazon has used one of the Kindle's features to erase copies of the book 1984 from its customers' devices. Yes, that 1984, the one about the futuristic society that controls and audits everything its citizens read or say.
Apparently a third-party seller uploaded an illegal version of 1984 to the Amazon website, and some users purchased it. When Amazon found out the version was illegal, they refunded the purchase price *and deleted the copies of the book from the Kindles*. Almost too funny to be true. (One of the users was a 17-year-old high school student whose notes on the book were also erased when Amazon deleted his copy.)
Amazon has already promised not to do something like this again. However, the story makes clear the deep danger in aggressive digital rights management. If the owners of the content can control what you read, when you read it, and how you read it, our access to media becomes only a temporary "right" that can be granted and taken away on a whim. We need to create a set of rules that ensure that information can never be controlled in this way.
One extreme example of the need for rules to protect the free flow of information is the hubbub over the new version of Hemingway's "A Moveable Feast". Depending on who one talks to, Hemingway's grandson Sean has either edited the book to make it truer to how Hemingway really felt about his second wife or has altered Hemingway's text to change history about that relationship. (It helps muddy the waters that the second wife is Sean's grandmother.) The publisher is releasing the new version, which scholars will now compare endlessly to the 1964 original. What would happen in the digital world of the future? Would the publisher be able to replace the text of everyone's original version with the new updated content? Presumably no one would lobby for such a world ... but if we aren't careful to constrain contracts between publishers and digital device owners, we could accidentally end up living in it!
How wonderful that Amazon made this mistake with the book 1984. It's not the greatest of the anti-utopian novels -- that's Huxley's Brave New World! -- but perhaps we were too quick to accuse it of wandering too far from reality ...
In the human body, there are two groups of cells that manage the production and refinement of bone. Osteoblasts create new bone, while osteoclasts break bone down. These cells are constantly working in parallel to maintain our bone structure and repair damage. When a fracture takes place, osteoblasts come in to calcify the tissue surrounding the break. They aren't very picky about what or where they calcify, so you end up with a large mound of bone where the break was. Over time the osteoclasts trim and refine this bone until it reacquires its original shape.
I feel that this is an excellent metaphor for how work is done in Wikipedia. There is a very large group of editors who do not make many edits on an individual basis, but they contribute the vast majority of content that makes it into the encyclopedia. They behave like osteoblasts in that they contribute large amounts of material but they don't have the experience to know what sort of content is encyclopedic. A smaller group of more active members of the encyclopedia (Wikipedians) perform the role of osteoclasts by trimming unencyclopedic content and refining what is left over into coherent articles.
In order for a human to have a healthy skeletal structure, a balance between bone formation and bone trimming has to be maintained. In the same way, the balance between content contributors and content refiners in Wikipedia must be maintained.
Atul Gawande has a terrific article in the New Yorker about how the way doctors organize themselves into social groups affects the effectiveness and cost of medical care. (I first got turned on to Gawande by my daughter Karen, who gave me two of his books to read. He's very thoughtful and very smart about the problems of medical care -- and a terrific writer as well.)
There are tons of interesting thoughts in the article, which is insightful as well as a great read. Here I'll just piece together the high-level flow of the argument about the structure of doctors' organizations within a locale.
1) The most expensive areas of the country for Medicare are 2-3 times the cost of the least expensive. If these most expensive areas could be changed to cost the same as the average areas, most of the expense problems of Medicare could be solved.
2) The most expensive areas of the country do NOT get better health outcomes than the less expensive areas. They do provide substantially more "services" -- hospitalizations, tests, surgeries, etc. -- but patients don't live longer, aren't healthier, and aren't happier with the results.
3) By comparing expensive locales with less expensive locales, we can rule out most of the obvious causes of the difference. The expensive locales are very similar in types of patients, the problems those patients have, the training their doctors received, etc.
4) One key difference is that in the LESS expensive locales, the doctors have organized themselves into a medical system that substantially changes their incentives. Doctors are evaluated on long-term patient outcomes and cannot make themselves richer by performing additional procedures on patients. The doctors work together collaboratively to learn how to serve patients better.
Fascinating article. Check it out! (Yes, it is a stretch for this blog. Perhaps we could argue that the connection is in understanding how big a difference social structure makes in the performance of an organization. In our work we're building computer tools to support those social structures; in this article, the doctors are inventing the structures themselves.)