Identifying Outreach Windows in Online Opioid Recovery


Note: Rudy Berry completed a Research Experiences for Undergraduates (REU) program at the University of Minnesota with Professor Stevie Chancellor in the summer of 2022. This blog post summarizes his project outcomes. Way to go, Rudy!

Summary: Identifying when relapse has occurred is a key factor to consider when determining how to reach out to individuals with Opioid Use Disorder. Information like time elapsed since a previous relapse influences the type of resources and language that should be presented. In this project, I wrote a script that successfully identifies the date of incidence of relapse from a relapse disclosure post in an opioid addiction recovery community on Reddit. With this information, we were able to determine how much time passed between an individual’s self-disclosed relapse and the time they reported it to the community. The ability to extract this kind of information from recovery posts may be a valuable tool for the future development of context-sensitive outreach systems.

Overview: Opioid Use Disorder (OUD), colloquially known as opioid addiction, is a highly stigmatized health issue that has fueled the growing opioid crisis in the United States for over two decades. The CDC reports that opioids were involved in about 75% of all U.S. drug overdose deaths in 2020 and have been linked to over 500,000 deaths since 1999 (CDC, 2021). In response to this crisis and the difficulty of finding support, there has been growing engagement in online recovery forums for substance abuse. These communities give members an anonymous space to seek advice, share success stories, and vent frustrations. Members of online addiction recovery communities frequently share feelings of shame and guilt (Mudry et al., 2012), so the ability to detach oneself from a real-world identity is a major draw of these forums. The popular discussion website Reddit is home to a large online recovery community: r/opiatesrecovery.

In this project, our research goal was to identify the date that someone had relapsed in their OUD recovery journey. Identifying when relapse has occurred is key to aiding in the recovery process because advice is dependent on when someone has relapsed. If an individual has relapsed very recently, it is important to direct them to resources that can provide more urgent forms of harm reduction in the moment. If a relapse occurred in the distant past, it may be more appropriate to provide them with resources focused on long-term sobriety tips or maintenance care. The existence of online recovery communities presents a unique opportunity in HCI for researchers to develop technology that could provide additional support and resources to individuals with OUD beyond what community members already provide.

Therefore, the primary goal of this project was to write a script that could identify the date of incidence of relapse from the context of a relapse disclosure post. The project focused on two specific datasets: a set of posts and a set of comments, both gathered from r/opiatesrecovery on Reddit. The ability to extract this kind of contextual information from recovery posts would allow outreach systems to provide more context-sensitive resources and messaging to individuals in OUD recovery based on an estimated date of relapse. We also wanted to determine the average window size between the incidence date of relapse and the post date across all relapse posts and comments on the subreddit.

What We Did: The first step we took was identifying posts where an individual had disclosed the occurrence of a relapse. Working with another team member, we created a regular expression that matches phrases indicating relapse, like “I relapsed” or “I just relapsed”. This was done in collaboration with another ongoing project in the lab to identify people who disclose that they have relapsed. This allowed us to create reduced datasets of relapse posts and comments from a larger, general dataset drawn from across the subreddit.
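As a rough illustration (not the exact expression we used, which covered more phrasings), a minimal version of such a pattern in Python might look like this:

```python
import re

# Illustrative sketch only: the actual expression developed in the lab
# matched a broader set of relapse disclosures than this minimal pattern.
RELAPSE_PATTERN = re.compile(
    r"\bi\s+(?:just\s+)?relapsed\b",  # e.g. "I relapsed", "I just relapsed"
    re.IGNORECASE,
)

def discloses_relapse(text: str) -> bool:
    """Return True if the post appears to contain a first-person relapse disclosure."""
    return RELAPSE_PATTERN.search(text) is not None

print(discloses_relapse("I just relapsed after six months clean."))  # True
print(discloses_relapse("My friend relapsed last year."))            # False
```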

Once we collected the relapse posts, the next step was to identify nearby temporal expressions, such as “yesterday” or “a week ago”, that indicate the relapse time frame. To do so we employed the SUTime library, a tool from the Stanford CoreNLP pipeline. SUTime is a powerful temporal tagging library that identifies temporal expressions by tokenizing text. It provides tags for four categories of temporal expressions: “Time”, “Duration”, “Set”, and “Interval”. When SUTime identifies a temporal expression, it returns the expression text, its type, a date resolved against a passed-in reference date (or the system date), and the start and end positions of the expression in the string of text.
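As a minimal sketch of what this looks like in practice, assuming the sutime Python wrapper around CoreNLP (which requires Java and the CoreNLP jars; the post text and date below are made up):

```python
# Minimal sketch, assuming the `sutime` Python wrapper around Stanford
# CoreNLP's SUTime; it requires Java and the CoreNLP jars to be available.
from sutime import SUTime

sutime = SUTime(mark_time_ranges=True, include_range=True)

post_text = "I just relapsed yesterday after being clean for a year."
post_date = "2022-07-15"  # hypothetical post date used as the reference date

# Each result is a dict describing one temporal expression: the matched text,
# its type (e.g. a time vs. a duration), a value resolved against the
# reference date, and the start/end character offsets in the input string.
for expr in sutime.parse(post_text, reference_date=post_date):
    print(expr["text"], expr["type"], expr["value"], expr["start"], expr["end"])
```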

For this project, we were particularly interested in expressions of the type “Time”, since these allowed for the extraction of the most specific dates. However, we realized that a handful of posts in our dataset matched the type “Duration”, including posts with phrases like “I relapsed for a week” or “I relapsed for 5 days”. These phrases were typically found in longer posts with many details and much more context to consider. We took this into account in our validation process and included durations to establish the limitations of our system. We wanted to know whether a human reader could identify a relapse date from the context surrounding a duration. To analyze this, we took a sample of twenty posts where relapse dates were identified and a sample of twenty where none were identified, and repeated this sampling with and without durations included. We then hand-annotated the text to identify false positive and false negative identifications.

The second part of our validation process involved experimenting with, and evaluating, the size of the character window around the matched relapse phrase used to identify relevant time words. We picked three different window sizes and analyzed the entire post dataset using accuracy: for each window size, we wanted to know for how many posts our script was able to accurately identify the day, week, or month of relapse.
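A simplified sketch of this windowing step, under the same assumptions as the sketches above (the function name, default window, and placement of the duration filter are illustrative, not the project’s actual code):

```python
import re
from typing import Optional
from sutime import SUTime  # same wrapper assumption as above

RELAPSE_PATTERN = re.compile(r"\bi\s+(?:just\s+)?relapsed\b", re.IGNORECASE)
sutime = SUTime(mark_time_ranges=True, include_range=True)

def extract_relapse_date(post_text: str, post_date: str,
                         window: int = 100) -> Optional[str]:
    """Return the resolved value of the first non-duration temporal expression
    found within `window` characters of the relapse phrase, or None."""
    match = RELAPSE_PATTERN.search(post_text)
    if match is None:
        return None

    # Take `window` characters on either side of the matched relapse phrase.
    start = max(0, match.start() - window)
    end = min(len(post_text), match.end() + window)
    context = post_text[start:end]

    # Tag temporal expressions in the window, resolved against the post date.
    # As described above, the script returns the first expression it finds,
    # skipping durations.
    for expr in sutime.parse(context, reference_date=post_date):
        if expr.get("type") != "DURATION":
            return expr.get("value")
    return None
```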

*The number of posts at each step in the identification process

 Findings:

The first part of the validation process revealed that the time-tagging system was much more accurate when duration temporal types were excluded. In a negative sample (posts with no relapse dates identified) of twenty posts with durations included, there was only one post where a human reader would be able to establish a relapse date; the system correctly identified that no relapse date was discernible from the other nineteen posts. When durations were excluded, our system correctly identified that no relapse date could be determined for all twenty posts in the negative sample.

Within the positive sample (posts with relapse dates identified), the inclusion of durations had a more dramatic effect on the results. With durations included, there were eleven posts where the system correctly determined that a relapse date could be identified from the text, but nine posts where the system incorrectly treated the beginning of a duration as a possible relapse date. So, for almost half the sample, the script would identify a relapse date where a human reader could not. This can be attributed to the fact that durations were typical of posts with more complexity to consider. For instance, given “I got out of rehab then relapsed for five months”, the system would incorrectly place the relapse date five months before the post date; a human reader would have to analyze the entire post to make a more accurate approximation. The results for the positive sample without durations were better, with only five posts incorrectly labeled as posts where a relapse date could be determined. Based on this outcome, we decided to work only with “Time” temporal types and to exclude durations.

During the second part of our validation process, we selected character counts of 100, 150, and 200 around our regular expression. The best performance was at 100 characters, with an accuracy of 73.4% across the entire dataset of posts. This was verified by reading each post and identifying the correct relapse date. The issue with wider character windows was the inclusion of many temporal expressions, and our script is written to return the first expression it finds. In text like “I started my recovery journey a year ago and today I relapsed”, the relapse date would be incorrectly identified as a year before the post date. Alternatively, in a phrase like “Starting all over again today after I started relapsing again last month”, the relapse date would be incorrectly identified as the post date, or “today”. A window size of 100 still fails in both of these cases, but such instances become more frequent past 100 characters. Further testing is necessary to determine the best way for the script to choose between multiple time expressions.

*This histogram shows the number of comments corresponding to certain window sizes in the dataset. For instance, the first bar shows that there were over 200 comments where relapse was disclosed to the subreddit within 0-10 days of occurrence.
*This histogram shows the number of posts corresponding to certain window sizes in the dataset.
*This histogram shows the portion of the comment histogram from 10-200 days.

The histogram data we collected reveals spikes in relapse disclosure within the first ten days after relapse, as well as at the one-month, two-month, one-year, and two-year marks. The post dataset had a mean window size of 64.6 days with a median of 7.0 days. The comment dataset had a mean window size of 177.8 days with a median of 30.0 days.
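For reference, the window statistics above come down to simple date arithmetic of the kind sketched below (the records and field names are made-up placeholders for the extracted data):

```python
from datetime import date
from statistics import mean, median

# Made-up example records: each pairs a post's creation date with the
# relapse date the script extracted for it.
records = [
    {"post_date": date(2022, 7, 15), "relapse_date": date(2022, 7, 14)},
    {"post_date": date(2022, 7, 15), "relapse_date": date(2022, 6, 15)},
    {"post_date": date(2022, 7, 15), "relapse_date": date(2021, 7, 15)},
]

# Disclosure window = days elapsed between the estimated relapse and the post.
windows = [(r["post_date"] - r["relapse_date"]).days for r in records]

print(f"mean window: {mean(windows):.1f} days, median: {median(windows):.1f} days")
```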

Overall, the script we created can extract information about relapse incidence dates and could be easily replicated and improved for an outreach system. This system could use the window size in conjunction with other information such as sentiment and prior relapse disclosures to send an individual a message with context-sensitive resources and word choice. 

One finding I found particularly interesting was how many people reached out to online communities to disclose a relapse so soon after it had occurred. This highlights a need for these systems to focus on how to support individuals during the immediate aftermath of a relapse. In the future, further modifications could be made to address the contextual limitations around durations and multiple time expressions. Through this project I learned a lot about the benefits of anonymity in online spaces. It was interesting to see people being open about their setbacks and experiences in real time. This work has made me more curious about the role that anonymous online communities play in de-stigmatizing OUD as well as mental health concerns like anxiety and depression, and about the types of systems that can safely support these communities.

https://journals.sagepub.com/doi/full/10.1177/1049732312468296

https://www.cdc.gov/drugoverdose/epidemic/index.html

What Wikipedians Want (but Struggle) to Prioritize


The English version of Wikipedia contains over 6.5 million articles… but only 0.09% of them have received Wikipedia’s highest quality rating. In other words, there’s still a lot of work to be done.

But where to start?

A group of highly experienced Wikipedia editors tried to answer that question. Through extensive discussion and consensus-building, they manually compiled lists of Vital Articles (VA) that should be prioritized for improvement. We analyzed their discussions to try to identify values they brought to the table in making those decisions. We found––among other things––a desire for Wikipedia to be “balanced”, including along gender lines. Wikipedia has long been criticized for its gender imbalance, so this was encouraging!

But how is this value reflected in the actual prioritization decisions in the lists of Vital Articles these editors developed?

Not so much.

Figure 4 from our paper shows what would happen if editors were to use Vital Articles to prioritize work on biographies: the proportion of highest-quality biographies about women would decrease––from 15.4% to 14.7%. By contrast, using pageviews (which indicate reader interest) to prioritize work would increase that proportion to 21.4%.

In short, if you want more gender balance, just prioritize what readers happen to read––not what a devoted group of editors painstakingly curated over several years with gender balance as one of the goals in mind. So what gives? Are Wikipedians just pretending to care about gender balance?

Not quite.

As it happens, only 7.5% of VA’s participants self-identify as women. For reference, that figure was 12.9% on all of English Wikipedia at the time we collected our data. Prior work gives plenty of evidence to help explain why a heavily male-skewed group of editors might have failed to include enough articles about women despite good intentions. Some of the reasons are quite intuitive too; as one Wikipedian put it, “On one hand, I’m surprised [the Menstruation article] isn’t here, but then as one of the x-deficient 90% of editors, I wouldn’t have even thought to add it.”

The takeaway: when it comes to prioritizing content, skewed demographics might prevent the Wikipedia editor community from fully enacting its own values. However, this effect is not the same for all community values; we find that VA would actually be a great prioritization tool for increasing geographical parity on Wikipedia. As for why? We have some ideas…

But for more on that (and other cool findings from our work), you’ll have to check out our research paper on this topic––coming to CSCW 2022! You can find the arXiv preprint here.

How Child Welfare Workers Reduce Racial Disparities in Algorithmic Decisions


I sat in a gray cubicle, next to a social worker deciding whether to investigate a young couple reported for allegedly neglecting their one-year-old child. The social worker read a report aloud from their computer screen: “A family member called yesterday and said they went to the house two days ago at 5pm and it was filthy, sink full of dishes, food on the floor, mom and dad are using cocaine, and they left their son unsupervised in the middle of the day. Their medical and criminal records show they had problems with drugs in the past. But, when we sent someone out to check it out, the house was clean, mom was one-year sober and staying home full-time, and dad was working. But, dad said he was using again recently.” The social worker scrolled down past the report and clicked a button; a screen popped up with “Allegheny Family Screening Tool” at the top and a bright red, yellow, and green thermometer in the middle. “The algorithm says it’s high risk.” The social worker decided to investigate the family.

Image: Allegheny County Department of Human Services

Workers in Allegheny County’s Office of Children, Youth, and Families (CYF) have been making decisions about which families to investigate with the Allegheny Family Screening Tool (AFST), a machine learning algorithm which uses county data including demographics, criminal records, public medical records, and past CYF reports to try to predict which families will harm their children. These decisions are high-stakes: an unwarranted Child Protective Services (CPS) investigation can be intrusive and damaging to a family, as any parent of a trans child in Texas could tell you now. Investigations are also racially disparate: over half of all Black children in the U.S. are subjected to a CPS investigation, twice the proportion for white children. One big reason why Allegheny County CYF started using the AFST in 2016 was to reduce racial biases. In our paper, How Child Welfare Workers Reduce Racial Disparities in Algorithmic Decisions, and its associated Extended Analysis, we find that the AFST gave more racially disparate recommendations than workers. In numbers, if the AFST had fully automated the decision-making process, 68% of Black children and only 50% of white children would have been investigated from August 2016 to May 2018, an 18 percentage point disparity. The process isn’t fully automated, though: the AFST gives workers a recommendation, and the workers make the final decision. Over that same period, workers (using the algorithm) decided to investigate 50% of Black children and 43% of white children, a smaller 7 percentage point disparity.

This complicates the current narrative about racial biases and the AFST. A 2019 study found that the disparity between the proportions of Black and white children investigated by Allegheny County CYF fell from 9 percentage points before the use of the AFST to 7 percentage points after it. Based on this, CYF said that the AFST caused workers to make less racially disparate decisions. Following these early “successes,” CPS agencies across the U.S. have started using algorithms just like the AFST. But how does an algorithm that gives more disparate recommendations cause workers to make less disparate decisions?

Last July, my co-authors and I visited workers who use the AFST to ask them this question. We showed them the figure above and explained how the algorithm gave more disparate recommendations and how they reduced those disparities in their final decisions. They weren’t surprised. Although the algorithm doesn’t use race as a variable, most workers thought the algorithm was racially biased because they believed it uses variables that are correlated with race. Based on their everyday interactions with the algorithm, workers thought it often scored people too high if they had a lot of “system involvement,” e.g., past CYF reports, criminal records, or public medical history. One worker said, “if you’re poor and you’re on welfare, you’re gonna score higher than a comparable family who has private insurance.” Workers thought this was related to race because Black families often have more system involvement than white families.

The primary way workers thought they reduced racial disparities in the AFST was by counteracting these patterns of over-scoring based on system involvement. A few workers we talked with said they made a conscious effort to reduce systemic racial disparities. Most, however, said reducing disparities was an unintentional side effect of making decisions holistically and contextually: Workers often looked at parents’ records to piece together the situation, rather than as an automatic strike against the family. For example, in the report I mentioned at the top of this article, the worker looked at criminal and medical records only to see if there was evidence that the parents abused drugs. The worker said, “somebody who was in prison 10 years ago has nothing to do with what’s going on today.” Whether they acted intentionally or not, workers were responsible for reducing racial disparities in the AFST.

For a more in-depth discussion, please read our paper, How Child Welfare Workers Reduce Racial Disparities in Algorithmic Decisions, and our Extended Analysis. All numbers in this blog are from the Extended Analysis. The original paper will be presented at CHI 2022. This work was co-authored with Hao-Fei Cheng, Anna Kawakami, Venkatesh Sivaraman, Yanghuidi Cheng, Diana Qing, Adam Perer, Kenneth Holstein, Steven Wu, and Haiyi Zhu. This work was funded by the National Science Foundation. Also see our concurrent work, Improving Human-AI Partnerships in Child Welfare: Understanding Worker Practices, Challenges, and Desires for Algorithmic Decision Support. We recognize all 48,071 of the children and their families on whom the data in our paper was collected and for whom this data reflects potentially consequential interactions with CYF.

Learning to Ignore: A Case Study of Organization-Wide Bulk Email Effectiveness


We’re at a university where an employee receives 27 bulk emails from the organization each week (untargeted, unpersonalized emails sent to a large list of recipients), and each of them contains over 8 messages on average. That means an employee receives over 250 unique pieces of content per week from central units (e.g., the president’s office or the provost’s office) — not from their students and their peers, but from the communicators in central units. By inputting a mailing list and pressing a button, a communicator can send an email to over 20,000 employees (see Figure 1).

Figure 1. We found that the burden of being aware was put collectively on recipients.

The current organizational bulk email system is not effective. For one, these bulk emails are not free: imagine that each employee spends 2 minutes reading a bulk email; at an average rate of $0.5 per minute of employee time, a single email sent to 20,000 employees costs 20,000 × 2 min × $0.5/min = $20,000. But of course, this cost isn’t paid by the sender; it is absorbed by all the departments where staff work. So the sender thinks the message is free, yet each department and unit has its employees’ time taken away bit by bit by these messages.
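That back-of-the-envelope calculation, using the illustrative figures above, is simply:

```python
# Back-of-the-envelope cost of one organization-wide bulk email,
# using the illustrative figures from the paragraph above.
recipients = 20_000      # employees on the mailing list
minutes_per_read = 2     # assumed reading time per recipient, in minutes
rate_per_minute = 0.5    # assumed average value of employee time, $/min

cost = recipients * minutes_per_read * rate_per_minute
print(f"${cost:,.0f}")   # $20,000
```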

For another, these bulk emails are not being remembered. Through a survey with 11 real bulk messages sent in the previous two weeks and 11 fake bulk messages, we found that the real bulk messages had only about a 22 percentage point recognition gain over messages that were never sent (on average, 38% of the real messages were recognized, while 16% of the fake messages were also claimed as “seen”).

So we carried out a study to examine the current practices and experiences of the stakeholders of this system. We conducted artifact walkthroughs with six communicators and nine recipients within our university, and we also interviewed two of the recipients’ managers. In these artifact walkthroughs, recipients walked us through the email messages they had previously received, and communicators walked us through the email messages they had previously sent out.

We found that:

First, recipients are burdened: they feel the responsibility for staying aware was shifted onto them by these bulk messages.

Second, stakeholders naturally have different preferences. Leaders and managers think that employees should know what’s going on in the university; the employees, however, feel that most of these messages are too high-level to be relevant to them.

On the other side, however, communicators have to send these emails even when they know that their recipients will dislike them; they have their own difficulties.

First, they have clients, the organizational leaders, who want everyone to know about their messages.

What’s more, communicators lack tools: they have very limited means of targeting or personalizing emails. For example, they can only target people by job code, and a general title like “program associate” tells you nothing about that employee’s actual job content.

Most important, the system appears to work because these emails have good open rates. Open rates are nearly the only metric in the current bulk email platform because they are easier to obtain than end-to-end metrics like recognition rate or reading time. However, most of our recipients open the email simply because they can’t get enough information from the subject line, read the first line, and then close it. In other words, we should not confuse a message that people open with one that actually contains content they find useful.

Figure 2. Summary of our findings.

To summarize, none of the stakeholders has a global view of the system or sees the costs that the current bulk email system imposes on the organization (see Figure 2). We’re working on a follow-up project to explore possible solutions for improving this system.

How to Infer Therapeutic Expressive Writing in the Real World?


“I enjoyed writing. Perhaps it was because I hardly heard the sound of my own voice. My written words were my voice, speaking, singing, … I was there on the page” – Jenny Moss

In difficult times, we have a natural desire to write expressively and hear the voice deep inside ourselves. Although previous studies have demonstrated the therapeutic effects of expressive writing, most of them studied the activity in controlled lab settings where the writing was guided by a researcher. We think that therapeutic expressive writing also happens spontaneously in the real world. We focus on spontaneous expressive writing on CaringBridge, an online platform where people write about and share their own or their loved ones’ health journeys and receive support. Our goal was to develop a computational model that infers whether a post does or does not contain expressive writing, in order to help people get more benefit from using online health communities like CaringBridge.

One major challenge in achieving this goal is that there is no existing data on therapeutic expressive writing in the wild. To address this challenge, we thought about how we could adapt expressive writing data that was collected in the lab. We looked at 47 past lab studies and what they could tell us about expressive writing. It turns out that the writing counted as “expressive” in these studies shared some common characteristics: it used emotion and cognitive words much more than writing that was not “expressive”. We used a clever statistical model (more details in the paper) to look at each CaringBridge post and tell us how closely it matched those characteristics. The research team also hand-labeled 200 posts to see how often the model would come to the same conclusion as we did about whether a post constituted expressive writing. We agreed about 67% of the time, so there is obviously a lot of room for improvement (we assume that humans are generally right and that the algorithm needs to get better at recognizing these posts).
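The paper’s model is more involved than this, but the intuition can be sketched roughly as follows: score each post by how heavily it uses emotion and cognitive words. The tiny word lists below are placeholders for a full lexicon, and this is an illustration of the idea, not the actual model:

```python
# Rough illustration of the intuition only, not the paper's actual model:
# measure how heavily a post uses emotion and cognitive words.
# The tiny word lists below stand in for a full lexicon.
EMOTION_WORDS = {"sad", "afraid", "happy", "grateful", "angry", "hope"}
COGNITIVE_WORDS = {"because", "realize", "understand", "think", "reason", "maybe"}

def word_rates(post: str) -> tuple[float, float]:
    """Return (emotion word rate, cognitive word rate) for a post."""
    tokens = [t.strip(".,!?;:").lower() for t in post.split()]
    if not tokens:
        return 0.0, 0.0
    emotion = sum(t in EMOTION_WORDS for t in tokens)
    cognitive = sum(t in COGNITIVE_WORDS for t in tokens)
    return emotion / len(tokens), cognitive / len(tokens)

print(word_rates("I realize I am afraid, but I have hope because of you all."))
```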

Despite the limitations of the model, it provides the first opportunity to understand how often expressive writing may occur in the wild. We applied our model to a dataset of 13 million CaringBridge journal posts and inferred that 22% were expressive and 78% were not. This provides evidence of spontaneous expressive writing in the wild.

To sum up, our paper makes three contributions. First, it demonstrates a way to use aggregated empirical data: in cases where no labeled data are available, we can use common characteristics reported in past studies to study the group we are interested in, as we did in the paper. Second, it provides a baseline model for inferring expressive writing that future work can improve upon, for example by using more sophisticated features and models, constructing a gold-standard dataset, or transferring knowledge from a related task. Third, it identifies expressive writing as a potential measure for online health communities. How much an individual engages in spontaneous expressive writing not only reveals their current writing practices but also the difficult times they are going through. Online health communities can then target their messaging by sending emotional support to those in difficult times and providing writing tips to those who are less expressive, so that people can gain the most benefit from their writing.