Are data scientists erasing domain experts?

Outcomes vary across workplaces and institutions, so generalizations need to be nuanced. Also, the term "domain expert" is used ambiguously in the literature.

Jun 18, 2024

What is this thing called data science? Image source.

One of the themes of my research, which is based on the ethnographic fieldwork I did amongst programmers and educational researchers working with MOOCs, is the question of “data science.” What does it mean to do data science? Who does it and why do some workers call themselves “data scientists”? What is it for? Does introducing the data scientist into an organization disrupt existing workplace hierarchies and if so, how? In this series of posts, I plan to write about some of the existing research on this that I find exciting, especially research that relates to the last question, on the relationship between data scientists and the so-called “domain experts.”

But before I do that, I want to set the context for this debate. I will argue that when we look at the breadth of this research—which ranges across different institutions from law enforcement to medicine to academic physics and biology laboratories—we need to be careful about how we generalize. Concluding that “data scientists are taking over” simply based on one particular site is not correct, especially since, as we’ll see, outcomes are different at different sites. Part of the reason for the confusion is that researchers have used the term “domain expert” to describe very different sets of workers.

This post will have three sections.

Section 1 will start with the history and some basic definitions of terms like “data science,” “data scientists,” and “domain experts” and also outline some of the big claims that have been made—claims that I think are too hasty. The next two sections will say why.

Section 2 will describe some of the ethnographic studies (and some studies using other methods) that have been conducted across different sites. A comparison makes clear that interactions between data scientists and domain experts vary from site to site, with radically different outcomes.

Section 3 will describe another problem that hinders generalization: different researchers have very different definitions of what it even means to be a “domain expert.”

And in future posts, I will try to work through some other studies that I have found useful that complicate the picture but also give us useful insights.

1. Data science and domain expertise: Describing the debate

In October 2012, the Harvard Business Review declared “data scientist” to be the “sexiest job of the 21st century.” Part of a "Spotlight package" on the power of "big data" and its potential to change organizations and management, the articles in the issue collectively argued that with the growth of the internet, as customers interacted with businesses through software-driven web applications, "companies that make a sophisticated analysis of the huge data streams now available can unlock deep insights and value." For a company to do this well, they needed “data scientists,” a "hybrid of data hacker, analyst, communicator, and trusted adviser." Data scientists, the article argued, had to be good not just at writing code or doing statistics but also in "speaking the language of business and helping leaders reformulate their challenges in ways that big data can tackle."

We are now twelve years past this declaration, so it is safe to say that the position of “data scientist” is now somewhat more established and institutionalized. For instance, we now have established universities offering undergraduate and graduate degrees in data science. At the University of California, Berkeley, where I teach, the first “data science” majors graduated in 2018. Data science is currently the third-most popular major at Berkeley after computer science and economics and it may overtake them since its whole point is that it is less heavy on the kinds of mathematics and theory that people are expected to learn in CS and economics programs.

One of the raging battles about the new data science is about whether the data scientists take away power from the established “domain experts.” Many analysts of the digital believe that they do. Where they differ is how they evaluate it. Some—especially if they are engineers or managers—think this is a good thing. Others—often social scientists of the critical persuasion—think this is a bad thing.

You can see this in the HBR special issue itself. In one of the articles in that issue, the management scholars Andrew McAfee and Erik Brynjolfsson contrast their putative "data scientists" with already existing "domain experts." They suggest, without saying so outright, that current domain experts traffic in opinions rather than analysis. They exhort managers to replace their reliance on "HiPPOs," i.e., highest-paid person’s opinions, presumably provided by domain experts, with data science and data scientists. But they don’t think domain experts should disappear. Instead, what they suggest is that the role of the domain experts should change and they should be "valued not for their HiPPO-style answers but because they know what questions to ask." In other words, they suggest that domain experts should think of what questions to ask and data scientists should answer them.

Now contrast this to a review essay that summarizes the impact of the digital on society written by the sociologists Jenna Burrell and Marion Fourcade. They argue that people like data scientists, by virtue of their ability to "touch and understand computer code," construct their techniques as "universal and domain nonspecific." In so doing, they "increasingly carve[s] away at and lay[s] claim to tasks [of domain experts] [...] in every occupation from business management, medicine, and the criminal justice system to national defense, education and social welfare."[1]

So we have two contrasting takes that nevertheless agree on one thing: that data scientists do take over from domain experts.

2. Outcomes are different across data-driven domains

Let’s leave aside the question of whether data scientists taking over from domain experts is a good or bad thing. Let’s first ask if it is true: are data scientists, by virtue of their claimed “domain nonspecific” techniques, indeed taking over from established experts in particular occupations and institutions?

There is reason to suspect that the situation is more complicated. For one thing, when people talk about “data science,” they often mean at least two, relatively distinct, entities.

Data-driven science: In the sciences, by which I mean academic disciplines like physics and biology, the data that is analyzed (DNA sequences, outputs from particle accelerators and radio telescopes) is often in digital format. In the last two decades, with the advances in computing power, the datasets that these researchers work on are often massive. And because these data are digital, there are new possibilities and techniques that researchers can bring to bear on analyzing them for insights.
Corporate data science: Separately, corporations and businesses have been moving online and, because they are able to record every click made on their websites, they have amassed huge amounts of data that promise insights into customers and users. This is, of course, true for native digital businesses like Google and Facebook but it is also true for institutions of journalism, law enforcement, or even medicine. The HBR issue I mentioned before is about corporate data science, not data-driven science.

Obviously there are overlaps in these two areas. For instance, academic social scientists can also draw on publicly available data (or negotiate with corporations for data) to create a data-driven sociology or computational social science. And there are scientists who work for corporations (e.g., big data driven education researchers who work for Khan Academy; machine learning researchers working for Spotify) who are interested in publishing findings in academic journals. And data-driven scientists are obviously inspired by the use of techniques in corporate data science. But the fact remains that the data scientists in data-driven science and corporate data science differ in terms of their incentives, goals and institutional context, if not their techniques.

It turns out that the presumed conflict between “data scientists” and “domain experts” plays out very differently in these two settings.

In the natural sciences like physics and biology, it seems as if it is the "domain experts" who are ascendant. The data scientists (essentially scientists who have taught themselves how to build software to do data analysis) increasingly complain that their colleagues, fellow physicists and biologists who are not programmers, mostly treat them as technicians whose job is to program and build software tools rather than as real scientists with careers.

For example, in 2013, the physicist Jake VanderPlas wrote a viral blogpost titled "The Big Data Brain Drain: Why Science is in Trouble" in which he argued that the kind of work he did (building software for data analysis) was not rewarded by the physics community. VanderPlas argued in his post that he and others like him would eventually decamp to corporate life (which paid a lot more) if this was not rectified. And indeed, that's exactly what happened: VanderPlas, who had a PhD in Astronomy and Astrophysics, eventually left the University of Washington (where he worked as a post-doctoral data scientist) to be a software engineer for Google Research in Seattle.

In the broader social sciences (in the academy but also in corporations, for what is corporate data science if not social science in the broadest sense?), the situation is reversed. Here, it is often the “data scientists” who are able to set the agenda. This difference can be traced to how the data are produced and who the audience is. The bulk of the data in physics and biology is "crafted" by domain experts, whereas social scientific data (e.g., clickstream data from Twitter or Facebook) is often produced for reasons other than research. Moreover, the audience for knowledge claims that come from data-intensive physics or biology is often other physicists or biologists themselves, while the audience for social science is often far larger.

But there are even more complications here because the situation in corporations varies from sector to sector. Consider these three cases:

Much has been made about the racial bias of the COMPAS algorithm that creates a recidivism score for criminal defendants that can then be used for bail hearings. The question of bias here is very complex. But do judges actually use the COMPAS score when they make sentencing or bail decisions? When the ethnographer Angèle Christin examined the practices of judges, she found that they often refused to use COMPAS scores because these scores went against their conception of their job. So who is winning here: the engineers who created the COMPAS score or the judges/domain experts who refuse to use it?

Something similar seems to be happening in law enforcement and medicine. Police officers in the US have an enormous appetite for data and have integrated a number of information technologies into their practices. But when ethnographer Sarah Brayne looked at how they actually used these technologies in practice, she found that police officers absolutely refused to surrender control to algorithmic software on matters such as quantified risk assessments or predictive analytics, both of which were likely to lead to results which disagreed with their intuitions and would have lessened their work autonomy. Similarly, Claire Maiers, another ethnographer, studied clinicians in a Neonatal Intensive Care Unit (NICU) whose prematurely born patients were hooked up to a machine-learning-based system that scored them for risk based on their vital signs. She found that the clinicians almost never used the algorithm as the sole resource for their decisions.

On the other hand, software companies like Uber, Lyft, and DoorDash have almost single-handedly created the app-driven gig economy of just-in-time service. Here, clearly, the data scientists/engineers designing the apps have power over the “domain experts” (drivers, deliverers) who provide the actual services advertised in those apps. As ethnographers like Alex Rosenblat have shown, a tweak to a software feature can improve or damage the livelihoods of thousands of people.

So, to conclude, it is too soon to argue that data scientists have taken over from domain experts. The interactions between the two groups are complicated and vary sector by sector: they differ between academia and corporations, and they differ across particular fields (medicine, law, etc.).

3. Who is the domain expert exactly?

There is a final point and you may have already guessed it from section 2. The term “domain expert” itself remains unclear and scholars have used it to refer to everyone from existing high-status researchers to (low or high-status) front-line workers.

Researchers as domain experts: In “Prospecting (in) the data sciences,” Stephen Slota and colleagues describe the organization of the NSF BDHubs initiative which aims to create big data infrastructure for several natural sciences on a regional basis. Here, the domain experts are the scientists; the authors argue that the data scientists, who are “hungry for categories of domains,” are constantly “prospecting,” i.e., making “the data, knowledge, expertise, and practices of worldly domains available and tractable to data science method and epistemology.” Prospecting is about finding (or inventing) gaps in a domain and then collaborating with the existing domain experts to fill those gaps. This allows the data scientists to construct data science itself as a “domain-agnostic” field, thereby expanding their influence and jurisdiction.

Users as domain experts: In “The Deskilling of Domain Expertise in AI Development,” the CSCW scholars Nithya Sambasivan and Rajesh Veeraraghavan interview data scientists[2] who work in “low-resource contexts” such as education and agriculture and require the labors of fieldworkers (e.g., teachers, transportation workers) to create the data which they then use to build, train, and test their models. The authors find that these data scientists consistently treat these fieldworkers as resources rather than partners, often configure them as lazy or incompetent, and seek to control them computationally rather than engaging with them as experts.

Both papers argue that data scientists erase domain expertise in some way. But they have very different definitions of who a domain expert is and point to very different mechanisms of erasure. In the first, the domain experts are high-status researchers; in the second, they are low-status fieldworkers. For NSF BDHubs, the claim is that the domain experts end up partnering with the data scientists but are subtly undermined. In AI development, the claim is that domain experts were treated as resources but should have been treated as (junior?) partners.

There are even more complications if you take some of the cases I used in section 2. Let’s take Northpointe, the company that makes COMPAS. One would assume that Northpointe employs both people who identify as data scientists and also those who identify as criminologists. What kind of relationship exists between these groups who both work for Northpointe? But on the other hand, Northpointe serves state governments and the criminal justice system; judges are their clients and/or users. Do the relationships between the data scientists and the domain experts look different when they are co-workers as opposed to when they are clients or users?

These complexities need to be worked through before we get into a solid theory of just how exactly data science is undermining domain expertise.

In my subsequent posts in this series, I will work through some of these issues by looking at some of the really interesting studies of data scientists that I have read and encountered. Thanks for reading!

[1] To be fair, Burrell and Fourcade say this about a broader category of people they call “the coding elite” whom they define as "a nebula of software developers, tech CEOs, investors, computer science and engineering professors, among others, often circulating effortlessly between these influential roles." Nevertheless, their key citations for this claim are a series of papers by David Ribes, Stephen Slota, Andrew Hoffman, Geoffrey Bowker, et al. (some of which are discussed in section 3) that are explicitly about academic data scientists and that I will discuss in more detail in future posts.

[2] They call them AI developers.

michael

Jun 19, 2024

As someone unfamiliar with data science, this was informative to both get a sense of what it is, in addition to how popular understanding of its consequences have been too simplistic.

I'm curious if you plan to create a general framework from the current studies on domain experts and data scientists in terms of when or where they may be more dominant in distinct fields. From the examples of judges, police, and nurses, I wonder if one can expect any instance of domain experts having the final word when they can choose to apply data science findings, versus front-line workers like drivers who have no autonomy.
