GroupLens Blog

A wonderful innovation in the study of foreign languages is the use of the Internet to connect learners to native speakers. In some cases the learners write text that is commented on by the native speakers, while in other cases the two can talk with each other, such as in the Skype foreign language forums. These services provide a wonderful way for people to learn the truly important parts of a language: how to communicate with someone else from a different place and with a different background. Too often language skill acquisition is about formalisms and structure, rather than about communication.

An even more innovative way of learning language may be the ideas that Luis von Ahn is exploring in yet another one of his creative games. He is developing tools that allow native speakers of one language to help translate texts from another language that they do not know. The idea is that the tools will show the native speaker how to translate individual words, and the speaker will then fashion the result into idiomatically correct language in his or her native tongue. It is too early to know how well this will work, or if it does work whether the native speaker will actually be learning the other tongue or just volunteering his or her time in a useful way. In either case, the idea is fresh and interesting and I look forward to seeing how it works in practice.

This story in the New York Times (http://www.nytimes.com/2010/08/05/technology/05secret.html?_r=2) discusses negotiations going on between Google and Verizon so that Google services can get special access to Verizon's data network. These sorts of agreements are a serious threat to competition on the Internet. The problem is that only large established players are going to be able to afford to pay for the enhanced service. Startup companies will be unable to get fast access to their services for consumers who might otherwise be interested.

The growth of the Internet is threatened if innovation can be stifled by these sorts of pairwise agreements. In order to encourage freedom and innovation we need to find a way to regulate the types of agreements that can be formed, and to ensure that others have access to the same levels of service quality. These principles are important, and they require regulation in order to create and maintain a fair and competitive marketplace of ideas.

There's a lovely article about the movie Groundhog Day. The article talks about a lot of issues in the movie, but ends with the wonderful quote:

A/B testing is like sandpaper. You can use it to smooth out details, but you can't actually create anything with it.

This thought reminds me of Don Norman's comment that one of the risks for the field of CHI is that we become so focused on analysis that we never actually create anything new.

John

I finally got around to carefully reading "A Theory of the Critical
Mass..." by Oliver, Marwell, and Teixeira. Now I'm asking: what took
me so long?

The article formalizes the notion of critical mass in collective
action. It identifies two main independent variables that can
influence the "probability, extent, and effectiveness of group actions
in pursuit of collective goods":

  • The form of the "production function" that relates "contributions of
    resources to the level of the collective good". Two important
    categories of production functions are: (a) decelerating: the
    "first few units of resources contributed have the biggest effect on the
    collective good, and subsequent contributions progressively less"; (b)
    accelerating: "successive contributions generate progressively
    larger payoffs; therefore, each contribution makes the next one more
    likely."
  • The "heterogeneity of interests and resources" in the population of
    potentially interested actors.

The authors then show that the problems and opportunities for
collective action are very different for accelerating vs. decelerating
production functions and for homogeneous vs. heterogeneous populations
of actors. I'm not going to summarize the findings: the paper is a
joy to read, so I mostly want to urge you to do that.

However, there were a couple of ideas that I found particularly
relevant to issues in open content systems that I care about, so I did
want to mention them.

First, this work looks at critical mass in "public" goods, where all
the value is created by a group of people. This is true for many open
content systems: Wikipedia and OpenStreetMap are two good
examples. However, this isn't true of other systems, including our
Cyclopath bicycle routing system. Cyclopath began with a nearly
complete transportation map created from Mn/DOT data and with a good
objective route-finding algorithm that did not require user
input. While we have shown that user input improves route-finding
significantly and that algorithms based on user input are better than
purely objective algorithms, I think it's fair to say that most of the
value of the Cyclopath "good" already was present before any user
contributions were made. It's interesting to consider how the concepts
of this paper can be applied to a system like Cyclopath.

Second, Oliver et al. show that with decelerating production
functions, the optimal outcome would be achieved if the *least*
interested people contribute first and the *most* interested people
contribute later. This obviously isn't the way it usually works. They
point out that one way to make this happen is for the most interested
parties to "hold back"; perhaps they can offer "matching
contributions" to entice less interested parties to contribute early
in the process. This might suggest new
intelligent-task-routing-like strategies to elicit participation in
open content communities.

Third, many of the illustrative examples the authors give concern the
different opportunities for collective action in "upper middle class"
vs. "lower income" neighborhoods. I wonder: what's the equivalent of
an "upper middle class" open content system?

Fourth, the notion of "interest" presumed here is one of direct
tangible personal benefit: if I give N dollars, I'm increasing the
chances that I'll receive M dollars (M >> N). However, we know that
many contributors to open content systems (and many 'volunteers', too)
contribute for other types of reasons, e.g., they "believe" in the
public good, they are altruistic, or they want to build a
reputation. For example, in Cyclopath, our most active editors don't
request many routes. For another example, other researchers have shown
that there are many users in discussion forums who just answer
questions and don't ask any of their own.

Fifth, finally, and simply, I'd like to empirically measure the
production function in various open content systems. I suspect that in
many cases it is decelerating: i.e., early units of contribution are
proportionally more valuable. I'd also like to measure this for
individual users. Doing this calculation requires a way to measure the
global quality of an open content system as well as the quality for a
particular user. We can do both of these for Cyclopath. We can do the
latter for MovieLens... not sure about the former.
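
One way the measurement described above might be approached, as a minimal sketch: assume we already have some global quality score recorded after each successive unit of contribution. The function names and the sample numbers below are hypothetical illustrations, not anything taken from the Oliver, Marwell, and Teixeira paper or from Cyclopath data. The idea is simply to look at the marginal gain of each unit and check whether those gains shrink (decelerating) or grow (accelerating).

```python
def marginal_gains(levels):
    """Gain in the collective good from each successive unit of contribution."""
    return [later - earlier for earlier, later in zip(levels, levels[1:])]


def classify_production_function(levels, tol=1e-9):
    """Label a series of cumulative quality levels (hypothetical measurements).

    levels[i] is the assumed quality of the good after i units of contribution.
    Returns "decelerating", "accelerating", "linear", or "mixed".
    """
    gains = marginal_gains(levels)
    if len(gains) < 2:
        raise ValueError("need at least three quality levels")
    # Change in the marginal gain from one unit to the next.
    changes = [later - earlier for earlier, later in zip(gains, gains[1:])]
    if all(abs(c) <= tol for c in changes):
        return "linear"
    if all(c <= tol for c in changes):
        return "decelerating"  # early units matter most
    if all(c >= -tol for c in changes):
        return "accelerating"  # each unit is worth more than the last
    return "mixed"


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    print(classify_production_function([0, 10, 16, 19, 20]))
    print(classify_production_function([0, 1, 3, 7, 15]))
```

Real data would be noisy, so an actual analysis would probably fit a curve rather than demand strictly shrinking or growing gains; the sketch only shows what "decelerating" and "accelerating" mean operationally.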

For my teaching I've been using NetBeans this semester, and overall it has been wonderful: an even better experience than Eclipse for teaching, though both have a steeper learning curve than I'd prefer.
I've enjoyed NetBeans' built-in Subversion support.  (This is not a differentiator with Eclipse, just a comment.)  However, getting Subversion working reliably with NetBeans on a Windows box is a bit fiddly, and the online documentation makes it seem easier than it is.  It's easiest to break the setup into steps, and get each of them working before moving on to the next step.  (Part of what makes the documentation a bit complicated is that there are many alternatives.  I'm just going to describe one simple alternative, which assumes that you have a shell account on the Unix computer that contains your Subversion repository.)  Here are the steps:
1. Get plink (from PuTTY) working on your box.  Plink will be used by the CollabNet Subversion client to tunnel svn+ssh connections.  First install the full PuTTY package from the web site.  Then create an SSH key for PuTTY using ssh-keygen (plink reads keys in PuTTY's .ppk format; PuTTYgen, which ships with PuTTY, can convert an OpenSSH key), store it in a safe place on your Windows computer, and install the public key in the authorized_keys file on your Unix server.  Then test with:
./PLINK.EXE -v -l <username> -i c:/path/to/key/file/id_rsa_putty.ppk <remote-host>
The result should be an ssh session to your remote host.  (plink is not a good client to actually use for ssh; prefer PuTTY.  This is just a simple test that it's working.)  (I'm using forward slashes in the above because I run it in Cygwin shells.  You'll need backslashes if you run it in the traditional Windows command console.)
2. Install CollabNet's Subversion Client.  They have a simple installer.
3. Look in your Application Data directory for the Subversion subdirectory.  (It's possible you have to run the Subversion Client once to cause this directory to be created.)  Edit the config file in that directory.  Look for the section called "[tunnels]". In that section, after all the comments, add a line:
ssh = c:/Program Files/putty-0.60/plink.exe -v -l <username> -i c:/path/to/keyfile/id_rsa_putty.ppk
Here you use forward slashes, because the Subversion Client will translate them.  The path to plink.exe should be changed to wherever you put plink. Adding this line to the config file tells the Subversion Client what command to use with URLs of the form svn+ssh.
4. Test the subversion client from the command line with:
./svn ls svn+ssh://<remote-host>/path/to/remote/svn-repo
If this works you have a working Subversion client on Windows, which is 80% of the battle!
5. In NetBeans go to Tools/Options/Miscellaneous/Versioning and set the Path to the SVN Client to:
C:\Program Files\CollabNet\Subversion Client
(or wherever you installed Subversion).
6. Right click on a directory and you should be able to use Subversion Update and Commit commands!
Occasionally when things are tricky the NetBeans client gets confused.  I just use the command-line client to do an svn update, and all is usually well after that.
One issue to watch out for: Subversion is very sensitive to version changes.  The working copy (checked out version) will be upgraded by the Subversion client to the format that version of the client expects.  So if you use both a NetBeans client and a command-line client you should make sure they're the same "point" version number.  (E.g., they should both be 1.6.x, though they can have different xs.)
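
The version rule above can be checked mechanically. Here is a minimal sketch, assuming you have each client's version as a plain dotted string such as the one printed by `svn --version --quiet`. The function names are hypothetical helpers of mine, not part of Subversion or NetBeans.

```python
def major_minor(version):
    """Return the (major, minor) part of a dotted version string like '1.6.5'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def clients_compatible(version_a, version_b):
    """True when two clients agree on major.minor (both 1.6.x, for example)."""
    return major_minor(version_a) == major_minor(version_b)


if __name__ == "__main__":
    # Example version strings, made up for illustration.
    print(clients_compatible("1.6.5", "1.6.9"))
    print(clients_compatible("1.6.5", "1.5.4"))
```

If the two clients disagree, it is probably safest to use only one of them on a given working copy until they match.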
Good luck!
John

One of the premier research platforms around here is Cyclopath, a geowiki and route-finding service for Twin Cities bicyclists.

Now, we've expected Google's announcement that they were getting into the bicycle routing business for some time. But that doesn't mean yesterday was relaxed for us. :)

After sleeping on it (and speaking for myself), I think this development is actually either neutral or good. We're in a different niche than Google -- we're focused on open content and community, not just maps, and we're strongly local with personal connections to the cycling community and local agencies. And on the plus side: almost all of the reactions from the community I saw on the social web were very supportive of us, and I've never seen so much passion at Cyclopath Headquarters as I did yesterday!

We'll continue to write and publish consistent with our excellent track record (e.g., of the 5 papers we've submitted to top-tier conferences, 4 have been accepted on the first try and 2 have been nominated for Best Paper).

Details on what Google's announcement means for Cyclopath, from the user perspective, are here.

Lastly, and off-topic, please follow @grouplens and @cyclopath_hq on Twitter!

Occasionally, GroupLens receives requests for datasets that we possess. In many cases, we are able to provide this data, as we have with the MovieLens rating datasets. One of the data collections that we have is a 10% sample of Wikipedia page requests (essentially every 10th HTTP request), collected since April 2007. This data accumulates at a rate of about 5 GB/day, and we currently have around 4 TB of unprocessed compressed data. This is approximately 40 TB when uncompressed. While we sometimes get requests for this data, the sheer size of it makes it difficult for us to make it available for download.

Although we cannot make this data available for download, depending on your request and our availability, we may be able to collaborate with you by performing the analysis you need on our data.
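
To make the sampling idea concrete, here is a minimal sketch of how a systematic 1-in-10 sample relates to full page view counts. It assumes the sample keeps every 10th request in arrival order and that each request is reduced to a page title. The function names and the toy data are hypothetical, and this is not our actual processing pipeline.

```python
from collections import Counter


def sample_every_nth(requests, n=10):
    """Keep every nth request in arrival order (a systematic sample)."""
    return requests[n - 1::n]


def estimate_page_views(sampled_requests, n=10):
    """Scale sampled per-page counts back up by the sampling interval."""
    counts = Counter(sampled_requests)
    return {page: count * n for page, count in counts.items()}


if __name__ == "__main__":
    # Toy request log: 70 requests for page "A", then 30 for page "B".
    log = ["A"] * 70 + ["B"] * 30
    sample = sample_every_nth(log)
    print(len(sample))
    print(estimate_page_views(sample))
```

Scaled-up counts are only estimates, and they are least reliable for rarely viewed pages, which may not appear in the sample at all.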

Also, we are not the only ones who have Wikipedia page view data. Several other sources have data on page views. Here are some of these resources and the type of data that they have available:

  • stats.grok.se – Provides data on per-page view counts by month.
  • dammit.lt/wikistats – Has files containing hourly per-page view count snapshots, with archives that currently go back to October 2009.
  • Wikipedia Page Traffic Statistics on AWS – Hourly traffic statistics for a 7-month window (October 2008 - April 2009) are available on Amazon Web Services. This data was assembled from files that were available from dammit.lt/wikistats at the time.

I just read an article called "Why study time does not predict grade point average across college students" by Plant, Ericsson, Hill, and Asberg.  The article is an interesting look at past data on what predicts GPA, and a small-scale (88-student) study at one university.  The authors are big fans of the "deliberate practice" model of learning, and focus on seeing if that translates into academic performance.  Some of the interesting information (mostly from past studies):

  • studying without distraction predicts higher grades (no TV, no iPod, no study partners)
  • students who study without distraction study for *fewer* hours, but get *higher* grades
  • focused study is important. Just as many recreational tennis and golf players don't get better over 20 years of playing, just "reading" isn't enough. Deep thinking, analysis, and putting ideas together correlate with better grades.
  • scheduling is important.  Planning ahead for getting school activities done, and studying at regularly scheduled times correlate with higher GPA.
  • going to class predicts higher GPA
  • working too many hours, and partying too many hours both predict lower GPA

Overall there weren't a lot of big surprises, but I did find it interesting how important focused, uninterrupted study is.  In fact, the total amount of study time did NOT predict good grades.  A shorter amount of more focused study was more valuable.  (Students tended to have to go to the library to get the more focused study time.)
What works for you?
John

Doctors: do no harm.

Authors: keep the reader turning the page.

Speakers: keep the listener, uh, listening.

The title of this post and the third aphorism represent the sine qua non for a successful research talk (or any kind of public speech). Once the audience stops listening, you, the speaker, might just as well stop speaking.

I've been thinking about this ever since the CSCW conference last week. I saw quite a few talks on subjects I'm interested in, with good research, good content in the presentation, and good - i.e., fluent - delivery. I was engaged by the content in many cases and asked a lot of questions.

However, in reflecting on my experience, many of the talks began to seem, hmmmm..., monotonous. The speakers didn't look animated. They didn't use much of a dynamic range in their speaking: they weren't loud sometimes and quiet others, fast sometimes and slow others. There weren't too many jokes (shout out to Cliff and Reid, two speakers who did joke a bit). The slides too were pretty homogeneous: none that shouted "I'm important - notice me!".

Again, the content was good - it wouldn't have gotten in otherwise!

But speakers, lively up yourselves! It'll keep your audiences' ears open, so that your great content will get in. (And please: if you do a lively presentation with poor content or poor organization or poor slides, it'll just seem ... poor.)

I've been spending a lot of time lately thinking about survey writing. In 2008, I took a short three-day course with Jon Krosnick of Stanford University, which made me think about survey writing in a new way. In particular, I started realizing that the surveys that I wrote were poorly written.

Since then, it seems like I keep finding poorly written surveys everywhere I turn. Here are some examples I've found recently in my everyday life:

The US Postal Service recently sent me a Postal Customer Questionnaire because they were thinking about closing my branch. "If you now receive Post Office box service, you will be able to transfer your remaining box rent credit to another post office, or you may be eligible to receive a partial refund. How would you feel about consolidating the Dinkytown station with other postal stations? Better, Just as Good, No Opinion, or Worse." I answered No Opinion. Then I crossed it out and marked Worse. Then I crossed it out and marked No Opinion and wrote a three-sentence explanation in the "Please explain" section. Why was this such a hard question to answer? Well, primarily because they'd never asked me how I'd feel about the consolidation WITHOUT the refund. So now they were merging my opinions about the consolidation with my feelings about the refund. Personally, I was mad that they were consolidating and I'd feel cheated if they didn't refund my money, but really the refund wouldn't change my opinion at all. Nowhere on the survey did they ask me anything to this effect.

This second example isn't exactly a survey, but is still getting at some of the problems with survey writing. I'm having problems with allergies and need to go see an allergy specialist. So the clinic sent me my paperwork so I could fill it out before my appointment. Leaving aside many of my other complaints (and there are many!), the first main page has a section entitled "Chief complaints of patient." For each option you are supposed to check "Yes" or "No." The options are Asthma, Rhinitis (Hay fever), Urticaria (Hives), Eczema, Sinusitis, Chronic recurrent bronchitis, Nasal polyps, Recurrent otitis media,  Recurrent pneumonia, G.I. disturbances (colic, diarrhea, etc), Insect sting reaction, drug reactions, or blank lines. Now I'm a pretty smart person. I've been in school for a grand total of twenty-one years now, but I can't tell you what many of those things are, and I can't tell you which ones I should select. I have a runny nose and a cough. I've been diagnosed with something, but I forget what it is, and it didn't include the second two symptoms, just the runny nose. Why on earth is this questionnaire that is obviously for the patient or patient advocate full of doctor jargon instead of patient jargon?

Now that I know better, I want to do my best to avoid writing bad survey questions, but at the same time, it's incredibly difficult to write good survey questions. So what I've been doing is writing my same old, same old questions and then revising...and revising...and revising. Trying to revise them to turn them into good questions isn't easy, but I try. I also ask for a lot of feedback and am very self-critical. One proof-reading pass doesn't cut it for a survey, even if it's only going out to 10 people. That would reflect poorly on me, my advisor, my lab, and my university...so I do more work. Hopefully if you take one of my surveys, you'll see the result of this work, and if not, I hope you'll take a moment to let me know.
