The debate over "prospecting"
Slota et al. argue that data science, conceived as a content-free science, actively prospects into and is hungry for domains. But what is prospecting, really? And does it disempower domain experts?
In my last post, I described the work of David Ribes and colleagues on the history and logic of the term “domain,” as it’s been used within computer science, data science, and policy. The authors point out that the term “domain” is used in a specific way: computer scientists see themselves as domain-independent and see their job as building software that will allow better computation in these domains as well as enable the transfer of data and concepts across domains. They call this the “logic of domains.”
But in another paper, the authors (Slota et al. 2020, [1]) go further. They argue that “data science demands kinds of things to analyze, and analysis is hungry for categories of domains (e.g., biology, geology, chemistry, etc.) and domains’ own categories which can be worked upon”; they call the process through which a domain is made legible to the data science method “prospecting.” The logic of domains, they argue, by definition leads to a praxis of prospecting, which, in turn, constitutes data science and the data scientific method as something universal.
This is a fascinating argument, and I will delve into it in depth, but it should be said that Slota et al. get some strong pushback from Tanweer and Steinhoff (2024) [2]. Based on interviews with practicing data scientists (many of whom are in the same data science cluster at UW that Slota et al. also observed), they argue that data scientists exhibit two distinct styles of thinking through which they understand themselves and their science. In the first, “transdisciplinary,” style, they see data science as a content-free set of techniques that can be applied to any domain, and their own goal as striving for a sort of universal technique of data processing and manipulation. This transdisciplinary style reflects a prospecting approach to domains. But there is also a second style of thinking, which Tanweer and Steinhoff call “extradisciplinary.” In this style, data scientists don’t see their science as universal at all; rather, they see it as craftwork that is an integral part of existing disciplines and fields and that, in turn, facilitates the transfer of data and techniques across these fields. Clearly, this is the opposite of what a “prospecting” worldview would entail.
So how to square the circle? Tanweer and Steinhoff argue that the prospecting or transdisciplinary worldview is more common among those data scientists who are trying to establish data science as an institution (they label such actors “sociotechnical vanguards”). These vanguards, as part of their efforts to make a space for data science within academia, end up articulating the transdisciplinary worldview, which forms a key part of their boundary work. The rank-and-file data scientists, meanwhile, those who are plugging away on their datasets and their research, are really working through the extradisciplinary worldview.
So who’s right? And what does all this mean when it comes to the question of whether data scientists are taking over from domain experts? This post will get into this in detail over three sections.
In Section 1, I will analyze [1] in more depth to understand what “prospecting” really means and what it looks like. I’ll argue that [1] is not so successful at demonstrating the practice of prospecting; data science is certainly pervaded by the logic of domains, but the relationship between this logic and data scientific practice is underdetermined.
In Section 2, I will delve into the reconciliation of the two findings proposed by Tanweer and Steinhoff: that institution-builders tend to perpetuate the transdisciplinary worldview while rank-and-file data scientists subscribe to the extradisciplinary worldview.
In Section 3, I will bring both papers together to address the question of whether data scientists are taking over from domain experts. I will argue that the distinction between “data scientists” and “domain experts” in both papers is unclear; this is not a hard-and-fast boundary as much as it is a matter of what people choose to call themselves. However, it is clear that even if data scientists are not taking over from domain experts, there is certainly something like a data science “style of reasoning” that is percolating into various research activities. But the precise practice of this style of reasoning is still unclear and probably varies from institution to institution.
All right, so let’s get into it.
What is “prospecting”?
The key concept underlying Slota et al. [1] is “prospecting.” But what is prospecting exactly? Prospecting is how data scientists intervene in new areas: by turning an area of intervention into a “domain.” “Prospecting” means making “the data, knowledge, expertise, and practices of worldly domains available and tractable to data science method and epistemology.” Because data science is “hungry for categories of domains,” data scientists are driven to understand the institutional dynamics in the domain, discover (or invent) the disorder that exists in these domains (for instance, that different practitioners within the domain use data differently), and find partners, i.e., domain experts, with whom to work in coming up with solutions to these disorders. Throughout the process of prospecting, the data scientists construct data science itself as a “domain-agnostic” field, thereby expanding their influence and jurisdiction.
The authors also argue that:
Prospecting is vital to the praxis of domains, the connecting step that produces connection and asymmetry between the domain and the data scientist and enables the coordination of data toward an application regardless of the context of its initiation. [Emphasis in the original]
In other words, prospecting is the constituting practice that both underlies and is the result of what they had earlier described as the “logic of domains” (discussed in my previous post). Or to put it in the authors’ slightly turgid language:
Prospecting, in turn, can be thought of as encoding the logic [of domains] within a praxis: as the functional means by which the interfacing of “worldly” domains—with the “principle” data science disciplines, and with each other—is enabled and structured, and through which a given domain’s resources are rendered visible, available, and amenable to being intervened upon for any given set of ends, from the merely technical to the deeply embedded social.
The authors argue that because it is a praxis, the concept of prospecting is different in kind from concepts like “digital colonialism” or “data extractivism.”1 It is not an analyst’s description of a state of affairs (as those other terms often are) but a description of what the actors themselves do (though perhaps at a somewhat abstract level).
But what does prospecting look like as an actual practice? I will say that I found the paper a little vague when it came to describing exactly how prospecting works.
Partly, this is because the paper is not an ethnographic one, even though it relies on ethnographic fieldwork. That is not my interpretation; the authors state as much at the outset:
[Our] objective here is not so much to provide an ethnographic recounting of the BDHubs as such, but rather to deploy insights we gleaned during our study of this initiative as a means of furnishing a broader understanding of data science as an emergent universal(izing) science, with particular emphasis placed on prospecting […] which we argue is an enabling force driving the broader datafication of science and society
But the vagueness also comes through in the concrete examples that the authors provide. Here is one, which I will lay out in full:
Case in point: the 2017 National Transportation Data Challenge, a BDHubs-led initiative that aimed to contribute to the international “Vision Zero” strategy of eliminating traffic fatalities on highways. When considering the problem of traffic accidents from a data scientific approach, the data scientists involved in this endeavor reached out to researchers and practitioners in government, commercial, and academic organizations to discover what data and computational resources were out there, and in what form. The data scientists were then able to evaluate the data according to their own needs (Is there good metadata? Is it consistently structured? How difficult or expensive would it be to gain access?) and engage with the various domains producing that data in order to better understand it and, ultimately, to apply it to research into the causes of accidents and possible solutions for avoiding highway deaths. We observed actors testing various sources of data in initial analyses to gauge its suitability in answering their questions, all of which took place before the analysis of the data began. It is this process of selecting, testing, and evaluating available data that structures what the results of that analysis would look like, while remaining relatively invisible in the final product.
What exactly is this example saying? It starts from a problem that some of the BDHubs researchers were interested in: eliminating traffic fatalities on highways. To solve this problem, the researchers (researchers who, it seems, identified specifically as “data scientists”) reached out to people who were well-versed in this issue (what we might call the “domain experts”) and asked them what data was “out there.” They then proceeded to “evaluate” the quality of this data and used that evaluation as the basis on which to structure the research collaboration.
Why does this matter? The authors say that this evaluation structures the actual research that will follow (but has not yet started) in some non-trivial way. And because it will shape the research in that way, it’s an important step that is taken but not acknowledged.
But it isn’t clear, at least in this paragraph, how it will shape the research that follows and why it matters.
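To make the step itself concrete, here is a minimal sketch, in Python, of what the pre-analysis “selecting, testing, and evaluating” of data sources might look like in practice. Everything here is my own invention for illustration: the field names, file names, and thresholds are hypothetical and not drawn from the BDHubs project.

```python
import pandas as pd

# Hypothetical pre-analysis "prospecting" check: before any modeling begins,
# each candidate data source is scored for metadata completeness and
# structural consistency. All names and thresholds below are illustrative.
REQUIRED_FIELDS = {"timestamp", "latitude", "longitude", "severity"}

def evaluate_source(path: str) -> dict:
    """Score one candidate crash-data source on simple suitability criteria."""
    df = pd.read_csv(path)
    present = REQUIRED_FIELDS & set(df.columns)
    # Fraction of non-null cells among the required columns that are present
    completeness = float(df[sorted(present)].notna().mean().mean()) if present else 0.0
    return {
        "source": path,
        "has_required_fields": present == REQUIRED_FIELDS,
        "missing_fields": sorted(REQUIRED_FIELDS - present),
        "fraction_complete": completeness,
        "n_rows": len(df),
    }

# Hypothetical candidate files gathered through outreach to "domain experts."
candidates = ["state_dot_crashes.csv", "city_open_data.csv"]
reports = [evaluate_source(p) for p in candidates]

# Only sources that pass the checks enter the analysis. Everything downstream
# is conditioned on this filtering step, which rarely appears in the write-up.
usable = [r for r in reports if r["has_required_fields"] and r["fraction_complete"] > 0.9]
```

If something like this is what prospecting amounts to in practice, the authors’ point would presumably be that the choices baked into such a filter quietly shape what the eventual analysis can show while “remaining relatively invisible in the final product.” But the paper never takes us down to this level of detail.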
Here’s another example the authors use. It’s fairly long, so I will only extract the relevant parts:
Following from the instruction and agreements established to support Big Data at the national scale, and with a particular emphasis on bridging across the academy, industry, and the public sector, a cornerstone of the [BDHubs Initiative] was a series of workshops and design charrettes that brought together representatives from these different sectors who sought to identify and characterize the forthcoming challenges for Big Data. It was here that NSF leadership and program staff sought to assess, understand, and in many ways render the widely claimed, broadly applied field of data science in some way tractable to development under a cohesive funding effort.
In reports and presentations drawn from these workshops, we see initial prospecting efforts working to establish an agenda and mode of work that would persist throughout the process of developing, funding, and rolling out the BDHubs consortium. Introducing and oriented around the loosely defined notion of “partnership,” workshop attendees were primarily recruited through existing social networks and familiarity with research work undertaken by its organizers. “Why are you here?,” one slide asked: “You have made some connection about Big Data with OSTP, the Big Data Senior Steering Group and/or one of the agencies involved.” Workshop attendees were then charged with a set of four tasks: “Fact finding: Collect data and information . . . Idea finding: Listen for new ideas, models, partnerships, etc . . . Partner finding: Search for your Big Data ‘soulmates’ . . . Solution finding: Discern promising ideas that can be applied and that would make a difference.” Note that each of these “charges” for action during the workshop are activities that loosely fit under the umbrella of prospecting: discovery of social connection; provision of access to data; mapping and understanding existing social, organizational, and institutional ties; and discovering those basic questions that might be readily answered.
Again, what exactly is happening here? To set the parameters and evaluate the feasibility of the BDHubs Initiative, the NSF organized a series of workshops with “representatives from these different sectors.” In these workshops, the participants were asked to think about the outstanding problems in their field and then to see if they could forge “partnerships” with their Big Data “soulmates” through which those problems could be solved.
Sounds fairly benign to me. So why does this matter? The authors’ contention is that these agenda-setting activities are structuring the research even before the research actually starts and that this is, in some way, problematic. But it’s not clear how exactly these activities structure the research or how they influence the actual research products. Of course, this means it’s not clear why exactly this is problematic.
What’s clear, though, is that both vignettes I quoted above are shot through with the “logic of domains.” And in that sense, most data science is structured by the idea that there is something, call it “data science,” that is domain-less and whose techniques can be applied to particular domains, and moved between domains, for their betterment.
But how this “logic of domains” translates into research practices is not so clear. It certainly pervades the agenda-setting activities of data science. When the NSF creates funding initiatives, or when schools start data science bachelor’s or master’s programs, they need a way to instantiate the logic of domains, and they do so through activities like the ones I described above (reaching out to people, creating workshops); those activities can certainly be called prospecting insofar as they are saturated with the logic of domains. But the relationship between these activities and the actual practice of data science is far from clear (more on this in Section 3).
Last but not least, it seems fairly clear to me that the “social” aspects of prospecting are far more important than the strictly “technical” ones. Consider the first vignette. When the BDHubs leaders started to think about traffic fatalities as a problem they could work on, they reached out to various researchers in government, industry, and academia to ask them what issues they were facing and what kind of data they worked with. They did do a technical evaluation of the data that existed, but the bigger purpose of this seems to have been to bring these researchers (domain experts?) into the initiative. This social aspect is even more explicit in the second vignette, which is organized around workshops where researchers try to find “soulmates” with whom they can partner to solve the problems that matter to them through data. To put it differently, prospecting seems to be a profoundly social activity.
Transdisciplinary versus extradisciplinary views of data science
Before I dive into the implications of all this for the question of whether data scientists erase domain experts, let’s turn to the paper by Tanweer and Steinhoff [2], which offers a critique of prospecting grounded in empirical data.
Tanweer and Steinhoff interviewed more than 100 practicing data scientists affiliated with the University of Washington’s eScience Institute, asking them what it meant to them to be a data scientist. They found a spectrum of views, which they sort into two ideal types: the transdisciplinary view and the extradisciplinary view. Table I in the paper (below) captures the essential difference between these two views.
In the transdisciplinary view, which they found only among a minority of their participants, the data scientists treated their science as “transcendent,” “appropriative,” and “impositional.” Which is to say, they argued that data science is a set of methods and techniques that can be applied across domains. As the authors put it in explaining a quote from one of their informants:
Data science and the domains have tidy and discrete roles to play in [the transdisciplinary] framing, with different goals and rewards: the domains provide raw material in the form of data, while data science provides the tools to transform it; the domains need data science to provide technical advances for dealing with an unprecedented deluge of data, and data science uses the domains to stay relevant and innovative. When those expectations are not met, it [data science] is deemed a failure.
But the authors argue that many more data scientists in their sample had an “extradisciplinary” (a term they coin) view of data science. Here, their participants understood data science as “grounded,” “relational,” and “adaptive.” Which is to say, they did not think that their data science work could generalize across domains; rather, they believed that data science methods emerged from specific problems that people (in domains) were trying to solve, and these methods could be adapted across domain boundaries but they would never be fully general.
A quote from one of their informants, a biology post-doc, illustrates this process:
You let the astronomers come up with new image analysis techniques because that's what they do all the time and then you apply it to your data…. there’s no reason we shouldn't be going dumpster diving in their fields to take things that we can apply to biology and other domain sciences…. I see the things that you could do by combining multiple fields as the thing that's very exciting and we call it data science and that's cool.
This biology post-doc is a data scientist (which is to say, also a “domain expert”) who enjoys going “dumpster diving” in other fields like astronomy, taking techniques that data-driven astronomers use in their own work, and adapting those techniques to biology. This, the post-doc says, is data science, and “that’s cool.”
Tanweer and Steinhoff argue that the transdisciplinary worldview aligns with the idea that the heart of data science is the logic of domains and prospecting; but this is not the worldview of practicing data scientists, at least in their field-site, who see data science as much more tentative and improvisatory. For most of their interviewees, they argue, data science is a craft activity, requiring the data scientist “to develop skills that cannot be formally taught but instead must be acquired through iterative contact with their medium (data) and tools (software).”
This comes through in a powerful vignette the authors use to open their paper.
A statistician, a computer scientist, and an astrophysicist walk into a panel discussion about data science. The moderator asks the panelists, what does data science mean to you? […] After the other panelists at the National Science Foundation-sponsored Data Science Workshop provide their answers, the astrophysicist offers his own interpretation of their responses: ‘It’s interesting that data science, to a computer scientist, is computer science. And data science, to a statistician, is statistics.’ The astrophysicist then delivers the punchline: ‘Data science, to a scientist, is just science.’ As the audience chuckles, he continues, ‘It’s what we’ve been doing for many years and it’s what we will be doing for many years, but it’s with larger data sets.’
The computer scientist and the statistician, in this telling, are offering up a transdisciplinary view of data science (perhaps modeling it on their own disciplines) while the scientist offers up the extradisciplinary view, saying that it is “just science” and, by implication, a craft activity.
So is there a way to reconcile the diametrically opposed findings of [1] and [2] with respect to the essence of data science? Tanweer and Steinhoff argue that it is often the “sociotechnical vanguards” of data science who are likely to articulate the transdisciplinary worldview: expert researchers, probably very senior, who are also institution-builders trying to make a space for data science within institutions (universities, corporations) and within disciplines (biology, geology). For these people, the transdisciplinary view, the vision of data science through the logic of domains and prospecting, seems like the best way to establish it institutionally. But most practicing data scientists don’t seem to think that way and subscribe more to the extradisciplinary view. (By definition, there are “few” sociotechnical vanguards.)
This, the authors say, is also reflected in how their interviewees speak. Those who advocate for the transdisciplinary view often speak in “hypotheticals, comparisons to other fields, or generalities,” whereas those who advocate for the extradisciplinary worldview “tended to describe specific personal experiences.”
However, the authors offer some caveats, especially with reference to their field-site. They tell us that the University of Washington has been an outlier when it comes to how academic data science is practiced. Unlike other universities that pushed to accelerate data science, UW did not create stand-alone data science undergraduate or graduate degrees, focusing instead on “a non-degree granting research center that coordinated with other degree-granting units to create specializations in data science within existing undergraduate majors and doctoral programs.” In other words, data science at UW “has been a diffuse enterprise by design” (my emphasis), with more data scientists who have “backgrounds in the natural, social, and human sciences” than in “math, statistics, physics, computer science, and engineering,” relative to comparable institutions.2
Do data scientists erase domain experts? No. Does data science? Maybe.
So, if we put these two papers together, where does that leave us in terms of the question that began this series: are data scientists erasing domain experts?
As I also pointed out in my first post, the term “domain expert” is quite ambiguous. There is, first, the question of whether it means a meta-expert or a front-line worker. But even in this case, where everyone is a high-status researcher, it’s not clear what exactly distinguishes a domain expert from a data scientist.
This comes across most clearly in [2]. Tanweer and Steinhoff are at pains to point out that the “data scientists” they interview seem to have dual identities. They are dedicated to adapting techniques from other domains, but first and foremost, they consider themselves researchers within a particular domain. As they put it:
As noted earlier, the neat bifurcation of ‘data science’ and ‘domain science’ is commonly deployed rhetorically in our field site. For example, the eScience Institute often runs programs that pair ‘domain scientists’ from outside the eScience Institute with ‘data scientists’ employed by the institute. But as mentioned previously, eScience ‘data scientists’ often have advanced training in what are typically considered to be ‘domains’ and have not left their home disciplines to exclusively embrace the identity of a data scientist. They continue to identify with their own ‘domains’ by, for example, holding joint appointments in other departments, writing for disciplinary-specific publications, and organizing data science trainings within their own disciplinary communities. (My emphasis.)
But this seems equally true in [1], especially when we focus on the concrete events described in the paper. For instance, in trying to solve the problem of traffic fatalities, when the “data scientists” reach out to the “researchers and practitioners in government, commercial, and academic organizations to discover what data and computational resources were out there, and in what form,” this is essentially an invitation to these “domain experts” to partake in an endeavor where they will assume, perhaps temporarily, the identity of data scientists. When the NSF conducts workshops that ask particular experts to find their Big Data “soulmates,” this is an invitation to take up the data science method and thereby become data scientists.
But here, I want to make a distinction between “data scientists” and “data science” as a style of reasoning.3 This is the distinction that the sociologist Gil Eyal has famously made between “experts” and “expertise,” or that Nikolas Rose makes between psychologists and psychology. Eyal and Rose argue that what’s most important is to understand the style of reasoning adopted in an institution, the ways in which an institution defines its object of analysis and intervention that lead to the creation of tasks for experts and workers. This can be quite different from the question of jurisdiction—i.e., the division of labor, or who does a particular task—and confusing the two is an analytical error. As Rose has put it, “the social consequences of psychology [or psychological expertise] are not the same as the social consequences of psychologists [or psychology experts].”
What if we think of data science as a style of reasoning? This might suggest that even if data scientists are not erasing domain experts—indeed, many domain experts now call themselves data scientists—it is conceivable that many domain experts are now operating with a different underlying style of reasoning.
But that leads to two other questions, one descriptive, the other normative.
The descriptive question is: what is this data science style of reasoning exactly? Is it characterized mainly by prospecting and the logic of domains (i.e., transdisciplinary) or is it extradisciplinary? This is an open question.
Then there is a normative question: is the fact that many researchers are adopting this style of reasoning a bad thing? Both papers discussed in this post argue that there is something fundamentally problematic about one style of reasoning: prospecting, or the transdisciplinary worldview. Slota et al. think that a focus on prospecting means that data scientists analyze their data in ways that foreclose potentially better approaches. Even Tanweer and Steinhoff note that while “the sociotechnical vanguards of data science […] strategically indulge in a sweeping transdisciplinary vision while continuing to support a more modest extradisciplinary quotidian reality on the ground,” they are also “organiz[ing] institutions around a transdisciplinary ideal in which data science is distanced from academic disciplines in its transcendent, appropriative, and impositional guise.” But why is this a problem exactly? I will return to these questions in later posts.
Papers discussed or mentioned in this post:
[1] Slota, Stephen C., Andrew S. Hoffman, David Ribes, and Geoffrey C. Bowker. "Prospecting (in) the data sciences." Big Data & Society 7, no. 1 (2020).
[2] Tanweer, Anissa, and James Steinhoff. "Academic data science: Transdisciplinary and extradisciplinary visions." Social Studies of Science 54, no. 1 (2024): 133-160.
1. To be fair, these concepts, in my opinion, tend to be quite nebulous and abstract, removed from the realm of practice.
2. The authors are arguing, in other words, that their field-site might be an outlier when it comes to the predominance of the extradisciplinary worldview. But of course, this raises the question of why the institution-builders at UW, who designed the data science programs in this way, still seem mostly to articulate the transdisciplinary worldview.
3. Slota et al. use the term “style of organizing,” which is quite similar to “style of reasoning,” when they refer to the logic of domains; but when it comes to prospecting, they call it a praxis rather than a style of organizing.