Generative AI presents answers derived from a Large Language Model (LLM). Tools like ChatGPT are trained on vast amounts of text that people have published on the internet, and a subset of that ingested material is reproduced in answer to questions. If you work in drug discovery, it would be great to benefit from what competitors are working on. But if you put your team’s ideas online, they potentially become available to everyone else.
Using ChatGPT-like tools for serious data gathering should therefore be approached with caution. Yet generative AI is a highly useful tool for acquiring general-purpose domain knowledge at the start of a project. The other main challenge in working with generative AI is how to derive value from integrating your own datasets with it.
Keep high-value data exploration in-house
It therefore makes sense to keep potentially high-value data exploration in-house. By creating your own LLM based on your team’s results, field trials, or both, and supplementing it with publicly available scientific or medical database information, you can develop a robust, proprietary LLM. You could then use a ChatGPT-style interface to analyse this private LLM while ensuring your organisation’s intellectual property (IP) is safeguarded.
In addition, ChatGPT may not provide the depth of answers, or the level of accuracy, required from an unstructured heap of data. While it is possible to ingest PubMed’s 30 million peer-reviewed articles, their data quality is not guaranteed: even in peer-reviewed articles, errors or differences in scientific interpretation can lead to inaccurate information.
Enter knowledge graphs
To create a knowledge base that is useful for research purposes, it is essential to structure and organise the data so that patterns and connections can be identified easily and put to use by ChatGPT. One increasingly popular approach is the knowledge graph, which can transform your private LLM into a powerful research engine.
Knowledge graphs, as defined by the Turing Institute, “organise data from multiple sources, capture information about entities of interest in a given domain or task (such as people, places, or events), and create connections between them.” A knowledge graph covering a specific, targeted biomedical domain can support a great deal of valuable data analysis work.
During research on COVID-19, connecting data on disease-associated genes in this way made it possible to query and analyse relationships quickly. Researchers could examine relationships between the modules, or sub-graphs, of the genes in the virus, and also see the connections between genes and their corresponding proteins. This allowed them to identify a number of genes and affected metabolic pathways and to understand more about what was going on in this area of disease research.
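The kind of traversal described above can be sketched with a toy knowledge graph held as labelled triples. The gene, protein, and pathway names below are placeholders for illustration, not real COVID-19 findings:

```python
# A minimal sketch of a biomedical knowledge graph stored as labelled triples.
# All entity names (GENE_A, PROTEIN_A, etc.) are hypothetical placeholders.
triples = [
    ("GENE_A", "associated_with", "COVID-19"),
    ("GENE_B", "associated_with", "COVID-19"),
    ("GENE_A", "encodes", "PROTEIN_A"),
    ("PROTEIN_A", "participates_in", "Glycolysis"),
]

def query(subject=None, relation=None, obj=None):
    """Return every triple matching the given pattern (None acts as a wildcard)."""
    return [(s, r, o) for s, r, o in triples
            if (subject is None or s == subject)
            and (relation is None or r == relation)
            and (obj is None or o == obj)]

# Which genes are associated with the disease?
disease_genes = [s for s, _, _ in query(relation="associated_with", obj="COVID-19")]

# Follow the graph: which proteins do those genes encode, and which
# pathways do those proteins participate in?
for gene in disease_genes:
    for _, _, protein in query(subject=gene, relation="encodes"):
        pathways = [o for _, _, o in query(subject=protein, relation="participates_in")]
        print(gene, "->", protein, "->", pathways)
```

A production system would run equivalent pattern matches in a graph database rather than in Python lists, but the principle is the same: once entities and relationships are explicit, gene-to-protein-to-pathway questions become simple traversals.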
In these cases, the ability to use a natural language interface instead of writing code means the research team could more easily navigate complex data structures and identify specific patterns or relationships of interest.
Researchers made an important breakthrough with the creation of BioCypher, a FAIR (findable, accessible, interoperable, reusable) framework that builds biomedical knowledge graphs while preserving all the links back to the source data.
The BioCypher team took a large corpus of medical research papers, built a large language model around them, and then derived a knowledge graph from the model. This approach allowed researchers to interrogate and work with that mass of previously unstructured, but now very well-organised and well-structured, data far more effectively. And having the data in a knowledge graph makes it transparent: answers can be traced back to their sources and deliberated over with more confidence.
This approach can be replicated effectively, using an LLM to do the heavy lifting of natural language ingestion. Life science professionals could thereby create a knowledge graph that helps them make the most sense of their data.
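The ingestion step amounts to turning free text into (subject, relation, object) triples. In a real pipeline an LLM would perform the extraction; the sketch below substitutes a simple pattern match for that step so the overall shape of the process is clear. The sentences and entity names are invented examples:

```python
import re

# Hypothetical sketch: an LLM would normally extract entities and relations
# from free text; a regular expression stands in for that step here.
RELATION_PATTERN = re.compile(r"(\w+)\s+(encodes|inhibits|is associated with)\s+(\w+)")

def extract_triples(sentence):
    """Turn one sentence into (subject, relation, object) triples."""
    return [(s, r.replace(" ", "_"), o)
            for s, r, o in RELATION_PATTERN.findall(sentence)]

# A made-up fragment of abstract text for illustration.
abstract = "GENE_A encodes PROTEIN_A. PROTEIN_A inhibits PROTEIN_B."

knowledge_graph = []
for sentence in abstract.split(". "):
    knowledge_graph.extend(extract_triples(sentence))

print(knowledge_graph)
```

In a BioCypher-style build, each extracted triple would also carry provenance metadata pointing back to the source paper, which is what makes the resulting graph transparent and auditable.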
Accessible to non-computer scientists
Using a natural language interface to translate questions or statements into Cypher code, which then queries an in-house data store, is far superior for research purposes to imprecise ChatGPT answers. This natural language understanding (NLU) approach makes it much more accessible for non-computer scientists to work with a database and ask meaningful questions. Researchers do not ordinarily want to interact with the database querying layer, and do not feel comfortable with technical terms like “Cypher” or “Python.”
NLU interfaces enable real-life interrogation of patient data: for example, asking whether Type 2 diabetic patients in a trial have a secondary clinical condition, such as liver cancer. This type of query is precisely what life science researchers need as they seek to ask meaningful questions and obtain useful answers from their clinical trials.
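What such a layer produces behind the scenes is ordinary Cypher. The sketch below shows the target output for the comorbidity question above; a production system would use an LLM to generate it from the user's phrasing, and the node labels (Patient, Condition) and relationship type are assumptions about the graph's schema:

```python
# Hypothetical sketch of the Cypher a natural-language layer might emit.
# Labels, relationship types, and property names are assumed, not from a
# real schema; parameters keep the query safe from injection.

def comorbidity_query(primary, secondary):
    """Build a parameterised Cypher query finding patients with both conditions."""
    cypher = (
        "MATCH (p:Patient)-[:HAS_CONDITION]->(:Condition {name: $primary}), "
        "(p)-[:HAS_CONDITION]->(:Condition {name: $secondary}) "
        "RETURN p.id"
    )
    return cypher, {"primary": primary, "secondary": secondary}

cypher, params = comorbidity_query("Type 2 diabetes", "liver cancer")
print(cypher)
print(params)
```

The researcher only ever sees the question and the answer; the query string and its parameters stay hidden in the translation layer, which is the point of the NLU approach.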
Using ChatGPT-like systems to convert natural language into data queries is key to getting valuable answers from complex data systems. It is clear that NLU, in conjunction with a knowledge graph, is set to unlock the next generation of healthcare research.