One Health Data Co-Analysis: Considerations, Pitfalls and Biases
One Health surveillance (OHS) brings together data from multiple sources and sectors (public health, animal health, food safety, environment) so that they can be analysed in a common context. In this co-analysis it is important to be aware of potential pitfalls and sources of bias arising from the great diversity of One Health data, and to take them into account when analysing and presenting the data.
Below are some examples of aspects to consider, produced through a discussion workshop held within Work Package 6 of MATRIX on 28 October 2021, in which surveillance experts from a variety of countries and sectors participated.
Terminology and definitions
The language and terminology used to describe disease surveillance activities may differ between sectors, or the same terms may be used to describe different things. For example, a case may not be defined in the same way, which poses problems when comparing case data if the differences are unknown or unconsidered.
Awareness of potential terminological differences is important both when analysing OH data and when presenting the results to the end user. Before analysis, consider whether any differences have operational relevance. For example, an outbreak may be defined differently in different sectors because of differences in needs, objectives and response capacities. Such differences may be harder to circumvent and require more reasoning, but they are nonetheless important to identify.
When presenting OH surveillance results in a website or dashboard, one way to avoid terminological misunderstandings is to include a glossary. Here, a useful tool to refer to is the OHEJP Glossary, in which a member of the OH community can look up terms and therefore be aware of potential differences in definitions across sectors. Building surveillance dashboards with the OHEJP Glossary in mind from the start ensures that terms are used correctly and consistently while also saving the effort of writing a glossary “from scratch”.
Methodologies
The laboratories involved in surveillance and the methodologies, practices and standards that they operate under may differ, both within and between sectors. One reason for this is that different sectors deal with different strains and pathogens, to which laboratory practices need to adapt. Information about which laboratory methods have been used and for which strains is important for interpreting OH data and assessing whether they can even be analysed in a common context. Another issue related to laboratory methods when co-analysing OH data is differences in the resources available for data management. Different sectors, and different actors within sectors, have different resources, which may have consequences for how data are stored and managed, which in turn could affect the ability to combine the data with data from other sectors.
To improve the reporting of the designs, implementations and methodologies used to describe outcomes of surveillance activities, the One Health Consensus Report Annotation Checklist was developed within the OHEJP ORION project. The checklist proposes that those producing surveillance results provide as much information as possible, such as which methods were used and how different strains and pathogens were dealt with. Following the checklist helps ensure that the data produced can be compared across sectors in the long run, which makes them more valuable for the community.
Potential routes of disease transmission
Exploring and understanding the underlying routes of disease transmission tells us how the data we observe came to be in the first place, which also helps inform how the data should be analysed and presented. For example, when studying a zoonotic agent for which human cases have been observed, it is very useful to be able to trace the infection backward and link the cases to their animal source, especially if the goal is to predict and prepare for future outbreaks. Finding this link can inform about factors such as geographical connections, the expected time delay between cases, and what data should be collected to achieve timelier surveillance.
If transmission did not occur through direct contact, there exists some environmental route which needs to be studied. When doing so, one needs to be specific about what one means by “the environment” and which potential routes that may include, to avoid searching in the dark. To make a proper risk assessment it is important to collect as much information and as many parameters as possible about the potential transmission routes so that they can be properly defined. However, in normal surveillance contexts there is usually a limit to what information is available. In that case there needs to be a discussion about which potential routes have not been explored due to missing data. This requires a surveillance system that is highly adaptable, especially when facing an emerging threat for which there may be little to no understanding of the routes of transmission.
A practical example can be found in the Campylobacter surveillance system set up in Norway (Swanson et al., 2022). There, two routes of transmission are analysed, through which poultry farm outbreaks are related to increases in human gastrointestinal disease.
The first is a water route: when there is heavy rainfall, waterworks cannot decontaminate wastewater as efficiently, which leaves the ground susceptible to pollution by pathogens from farms. These pathogens can then infect humans through drinking water. Such a route implies a very local geographical scope and a shorter delay between animal and human infection. To better predict outbreaks of this nature, the Norwegian surveillance system includes weather data (in particular, precipitation data) at the municipal level.
The other route is through the food industry. There is a risk of infected broilers making it to slaughter and their contaminated meat being sold to consumers, who become infected if the meat is not handled and prepared correctly. This route has a larger geographical scope, as abattoirs can sell and distribute meat products to grocery stores all over the country, which also implies a greater time delay. Surveillance of this route therefore requires analysis of data at the national level.
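As a minimal, illustrative sketch of how such a co-analysis could look in practice, the snippet below merges made-up weekly human case counts with municipal precipitation sums using pandas. The municipality codes and column names are assumptions for illustration, not the actual data model of the Norwegian system.

```python
import pandas as pd

# Hypothetical weekly human case counts per municipality
cases = pd.DataFrame({
    "municipality": ["0301", "0301", "1103"],
    "week": ["2021-W26", "2021-W27", "2021-W26"],
    "human_cases": [4, 9, 2],
})

# Hypothetical weekly precipitation sums per municipality
precipitation = pd.DataFrame({
    "municipality": ["0301", "0301", "1103"],
    "week": ["2021-W26", "2021-W27", "2021-W26"],
    "precipitation_mm": [12.0, 55.5, 8.3],
})

# Join the two sources into a common (municipality, week) context
combined = cases.merge(precipitation, on=["municipality", "week"], how="inner")
print(combined)
```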
Data properties and statistical considerations
Data alone are not very useful if not accompanied by a description of the data’s properties, also known as metadata. This is especially important when matching and co-analysing data across sectors, as the metadata provide information about how data can be combined. For example, data sampled in different ways (e.g. random sampling vs. triggered sampling) cannot be compared directly unless metadata are available that describe the sampling process (how and why) in a way that is meaningful for statistical methods. Again, some of these concepts may have different interpretations in different sectors (e.g. active vs passive sampling/surveillance). For posterity it is also valuable to document how OH data were integrated and subsequently analysed, and what assumptions that integration and analysis were based on.
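A minimal sketch of what such metadata could look like in code follows, assuming illustrative field names rather than any standard schema: each dataset carries a description of its sampling design and case definition, which is checked before the data are pooled.

```python
from dataclasses import dataclass

@dataclass
class SurveillanceDataset:
    name: str
    sector: str               # e.g. "animal health", "public health"
    sampling_design: str      # e.g. "random", "triggered", "passive"
    case_definition: str      # reference to the definition used

def directly_comparable(a: SurveillanceDataset, b: SurveillanceDataset) -> bool:
    """Flag datasets that cannot be pooled without adjusting for design differences."""
    return (a.sampling_design == b.sampling_design
            and a.case_definition == b.case_definition)

# Hypothetical datasets used only to illustrate the check
poultry = SurveillanceDataset("broiler flocks", "animal health", "random", "PCR-positive flock")
human = SurveillanceDataset("notified cases", "public health", "passive", "lab-confirmed case")

if not directly_comparable(poultry, human):
    print("Sampling designs or case definitions differ: document assumptions before co-analysis")
```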
Time frame
Data collected and analysed in different contexts and for different purposes may follow different time frames. There may be variation in collection frequency, in the delay from collection to analysis, or in the time scale at which the data are reported (e.g. daily, workdays, weekly, monthly…). Which time frame is relevant to use depends highly on the surveillance question. This is closely related to the topic of terminology and case definitions: whenever the time frame used has operational relevance, one should think of ways to make the different needs match.
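As a small illustration, the snippet below aggregates a made-up daily case series to weekly totals with pandas so that it can be compared with weekly-reported data; the figures and the choice of week boundary are assumptions.

```python
import pandas as pd

# Made-up daily human case counts
daily = pd.Series(
    [1, 0, 3, 2, 5, 1, 0, 4, 2, 6],
    index=pd.date_range("2021-06-07", periods=10, freq="D"),
    name="human_cases",
)

# Aggregate to weekly totals (weeks ending on Sunday) to match weekly reporting
weekly = daily.resample("W-SUN").sum()
print(weekly)
```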
Sometimes the issue is simply one of how time is defined and which formats are used, something that depends on where in the world and in what context the data were recorded. For statistical work, it is often a problem that time data are not stored in a format that can be worked with programmatically. One way to circumvent this is to store time in the universal Unix format (the number of seconds since 1 January 1970), but this requires some effort to subsequently translate it into a format that makes sense for the users.
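A minimal sketch of that translation step in Python, assuming the timestamps were recorded in UTC:

```python
from datetime import datetime, timezone

sample_collected = 1635379200                        # seconds since 1 January 1970 (UTC)
readable = datetime.fromtimestamp(sample_collected, tz=timezone.utc)
print(readable.isoformat())                          # 2021-10-28T00:00:00+00:00
```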
In disease surveillance, there will always be some delay between infection, sample collection, sample analysis, reporting, data analysis and presentation of results. It is important to be aware of what delays exist and how they affect the surveillance, especially when aligning case curves from different sectors or training statistical models to predict future development.
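One simple way to handle a known (or assumed) delay when aligning curves is to lag one series before comparing it with the other. The sketch below uses made-up numbers and a two-week lag purely for illustration; it is not a published estimate for any particular pathogen.

```python
import pandas as pd

idx = pd.date_range("2021-06-06", periods=6, freq="W")
positive_flocks = pd.Series([2, 5, 8, 3, 1, 0], index=idx)     # animal-sector signal
human_cases = pd.Series([10, 12, 18, 30, 25, 14], index=idx)   # human case curve

# Shift the animal signal forward by an assumed two-week reporting/incubation delay
lagged_flocks = positive_flocks.shift(2)
comparison = pd.DataFrame({"flocks_lag2": lagged_flocks, "human_cases": human_cases})
print(comparison.corr())
```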
Geographical scope
All geographical data need to be aggregated to some level to be useful, and the level of aggregation may differ between sources (e.g. animal health data collected at farm level vs. human case data at municipal level). When this is the case, it is necessary to think about if and how these data can be translated into a common context. Matching geographical surveillance data may be difficult depending on what kind of location metadata is available. How the data should be matched also depends on the underlying route of transmission, as previously illustrated by the Norwegian Campylobacter example, where food-borne zoonoses are expected to be able to spread far from the infection source.
Differences may not just be due to varying structures and needs of different sectors; there may also be intra-sector variation due to different possible levels of granularity. In countries with a high degree of regional autonomy, for example, data may not be collected and aggregated the same way in all regions. This becomes especially relevant in international surveillance contexts. Geographical data can also vary structurally in terms of data format (e.g. vector data vs raster data) or visualisation properties such as raster size (geographical resolution) and which coordinate systems were used. Such technical aspects are equally important to ensure that data are matched correctly.
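As a small illustration of lifting data to a common geographical level, the sketch below aggregates hypothetical farm-level test results to the municipal level used by the human case data, via a farm-to-municipality lookup table; all identifiers and codes are made up.

```python
import pandas as pd

# Hypothetical farm-level test results
farm_results = pd.DataFrame({
    "farm_id": ["F001", "F002", "F003", "F004"],
    "positive_samples": [3, 0, 1, 2],
})

# Hypothetical lookup table mapping farms to municipality codes
farm_lookup = pd.DataFrame({
    "farm_id": ["F001", "F002", "F003", "F004"],
    "municipality": ["0301", "0301", "1103", "1103"],
})

# Aggregate farm-level results to the municipal level used by the human case data
municipal_animal = (
    farm_results.merge(farm_lookup, on="farm_id")
                .groupby("municipality", as_index=False)["positive_samples"]
                .sum()
)
print(municipal_animal)
```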
User group considerations
The desired outcome of the surveillance and an associated dashboard will naturally depend on the target audience. Experts from different sectors will be interested in different perspectives, but these sometimes overlap. It is this overlap, where the results are valuable for more than one sector, that is the core of One Health surveillance. Knowledge of who will be interested in the results and what they want to see in a dashboard is therefore crucial for creating a useful product. This is not just a consideration for the design stage of dashboard development; it is also vital in deciding what types of data are collected and how they are analysed and presented. As such, the target audience should be defined as early as possible. For the development of dashboards, a general rule of thumb is that every piece of information presented in the dashboard should be justified from a user point of view, which means that one must define what use the audience is expected to make of it and whether the information meets that expectation.
During the COVID-19 pandemic, it has become evident that there can be great interest in surveillance data outside the realms of expert authorities and academia, from the public as well as companies, statistical institutes, etc. This places great technical requirements on the surveillance institutes, which poses challenges if the IT infrastructure is not prepared for a wide audience. The pandemic has also created a great demand for real-time surveillance data, where results are expected to be available as soon as possible; this brings its own technical challenges but also risks causing confusion, as numbers and figures need to be adjusted a posteriori when new data keep coming in. All in all, the pandemic has clearly shown that when working with disease surveillance of interest to the wider public, it is important to have the infrastructure capacity to handle large amounts of traffic and to consider the timeliness with which data should be published. With the user group in mind, there also needs to be a consideration of when and how often the information presented in a dashboard should be updated. What is the operational relevance of updating the dashboard with a certain frequency? How real-time is real-time enough? In urgent situations it may be advantageous to prioritise timeliness, while in other contexts data completeness should be favoured.