Going with the data flow
The first time we managed to sequence the human genome, it took scientists 13 years of painstaking work to pull it off. Today, machines can generate a sequence automatically in a matter of hours. But that comes with another problem: how to actually analyze all the data that comes out.
Ka Yee Yeung, a professor at the School of Engineering and Technology at the University of Washington Tacoma, wants to make that easy. It will take working with a lot of folks from a lot of different fields, but she has the idea and the team that might make it possible.
Here’s the problem biologists face: genetics has gotten big. While there are still discoveries to be made by taking an old-school approach and looking at just one or two genes at a time, many of our recent breakthroughs have taken a much broader — and data-intensive — approach. Take precision medicine, for example. To help treat cancers like acute myeloid leukemia, doctors nowadays can take a sample of a patient’s tumor or cells, use high-throughput sequencing to read out the cancer’s entire genome, then use this readout to identify specific mutations that can be targeted with cancer-killing drugs.
This can make the process of finding an effective cancer treatment a lot simpler. Unfortunately for the people actually doing this work, however, just making sense of a sequencer's output can be anything but simple. As a way to visualize just how long the human genome is: scientists once tried printing out just one person’s genome — all the A’s, T’s, G’s, and C’s — and the resulting series of books topped out at roughly 262,000 pages. Imagine having to look through all that for what may be just one errant letter — and some studies include dozens, if not hundreds, of patients all at once. It’s a challenge far too great for a person to do by hand.
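The page count above is easy to sanity-check with back-of-the-envelope arithmetic. The figures below are assumptions for illustration, not from the article: a haploid human genome of roughly 3.1 billion letters and a densely printed page holding about 12,000 characters.

```python
# Rough check on the "~262,000 pages" figure.
# Assumed values (not from the article):
GENOME_LETTERS = 3_100_000_000  # approximate haploid human genome size
CHARS_PER_PAGE = 12_000         # assumed dense, single-spaced printed page

pages = GENOME_LETTERS / CHARS_PER_PAGE
print(f"{pages:,.0f} pages")  # on the order of a quarter-million pages
```

With these assumed inputs the estimate lands in the high 200,000s, the same ballpark as the article's figure; the exact number depends on font and page size.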
“So, this is great,” joked Yeung. “It keeps us in business.”
In 2019, Yeung, a computer scientist by training, teamed up with a colleague, Ling-Hong Hung. Funded by the National Institutes of Health, they created an open-source platform called the BioDepot Workflow Builder, with the vision of helping users without technical training perform big-data analysis efficiently.
The attractive thing about BioDepot’s workflow is that scientists don’t need to know how to code to get it to work — they’re presented with a collection of widgets that they can combine like building blocks to accomplish a task. One widget might clean up the data, for example, while another might compare samples and highlight differences. A third might visualize the data, turning it into charts and graphs. The exact workflow and the input parameters can be tweaked depending on what the scientist needs, and Yeung’s lab works continually with its users to customize the software to their needs.
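The building-block idea can be sketched in a few lines of code. This is a toy illustration, not BioDepot's actual API: each "widget" is a self-contained step, and a workflow is simply a chain of widgets applied in order, each consuming the previous widget's output. All names and the sample data below are hypothetical.

```python
from functools import reduce

def clean(data):
    """Widget 1: drop records with missing values."""
    return [row for row in data if None not in row.values()]

def compare(data):
    """Widget 2: flag samples whose expression differs from the first sample."""
    baseline = data[0]["expression"]
    for row in data:
        row["differs"] = row["expression"] != baseline
    return data

def summarize(data):
    """Widget 3: count how many samples differ from the baseline."""
    return sum(row["differs"] for row in data)

def run_workflow(data, widgets):
    """Feed each widget the previous widget's output, in order."""
    return reduce(lambda d, widget: widget(d), widgets, data)

# Hypothetical input: three samples, one with a missing measurement.
samples = [
    {"sample": "A", "expression": 1.0},
    {"sample": "B", "expression": 2.5},
    {"sample": "C", "expression": None},
]
result = run_workflow(samples, [clean, compare, summarize])
print(result)  # → 1 (only sample B differs from the baseline)
```

Swapping, reordering, or reconfiguring steps means editing the widget list, not the analysis code — which is the appeal of the drag-and-drop approach for scientists who don't program.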
Of course, that does mean there can be a bit of a learning curve when two academic worlds collide. In some cases, Yeung says, you don’t even start with a common vocabulary. But she insists that the resulting need for flexibility and learning with each new project isn’t a downside. “That’s what I love most about this job,” she says.
In fact, this interdisciplinary nature — this crossroads between different areas of knowledge — is part of what attracted Yeung to computational biology in the first place. She studied computer science as an undergraduate, enjoying the puzzle-like nature of coding and the ability to flex her math skills.
After graduating, she pursued graduate school to study computational biology. The field, with its heavy use of algorithms and statistics, was a natural fit for her passion. It did mean diving into biology, but Yeung did so with such enthusiasm that after earning her degree she spent ten years in the Department of Microbiology at the UW Seattle School of Medicine before returning once again to the more computational side of things.
Today, Yeung’s day-to-day work has remained highly interdisciplinary — and not just among those prospective BioDepot users. Her students are similarly diverse in terms of interests and disciplines. “Everyone has a different background,” she says.
Lots of Yeung’s students come from engineering backgrounds, but the lab has also been home to geographers, artists, and even an anthropologist — plenty of whom have been international students. Yeung says these students bring a great deal to their projects, and she is proud that UW Tacoma offers many different programs to help students from all walks of life get into computer science.
“One time I had a student in my class who told me he was a truck driver and wanted to transition,” says Yeung.
During her tenure at UW Tacoma, Yeung has produced over 50 peer-reviewed publications, most of which have included students as co-authors. She also serves as co-principal investigator on a National Science Foundation grant focused on the retention of STEM students.
Recently, Yeung was awarded the Virginia and Prentice Bloedel Professorship from UW. The endowment that comes with it will let Yeung supplement and extend her support for students. “I’m really grateful,” she says.
“This endowed professorship gave us the incredible opportunity to value a faculty member who was not only a leader in their scholarship, but also an outstanding teacher,” says Andy Harris, the executive vice chancellor for academic affairs at UW Tacoma. “As a sign of just who she is, when she found out that she had the endowed professorship, we had a preliminary conversation about what she would want to do with it. Her first response, without hesitation, was, ‘I just want to do something that helps our students be successful.’ And that's what she's done.”
Looking to the future, Yeung hopes that her lab can one day refine BioDepot to the point where they can start to spin off commercial prototypes, making the software much more widely available to prospective users. It’ll take a lot more work, though, particularly around optimization and user experience.
While adding brand-new functions can be its own puzzle, Yeung says it’s often the painstaking work of making things as polished, easy-to-use, reliable, and low-cost as possible that is the real challenge. The more processor- and data-hungry a system is, for example, the more expensive it is to run. And when you’re processing thousands or even millions of data points per project, even small inefficiencies can add up quickly.
Nevertheless, Yeung is hopeful that one day the software will be so customizable, so intuitive, and so efficiently run that even folks with no computer science experience at all will be able to use it off the shelf. At that point, maybe even the biggest, most complicated data sets — and in turn, the subtlest truths about diseases, treatments, and clinical outcomes — could finally be teased out.