OntoGPT

LLM-based ontological extraction tools, including SPIRES

Introduction

OntoGPT is a Python package for generating ontologies and knowledge bases using GPT. It is a knowledge extraction tool that uses Large Language Models (LLMs) to extract semantic information from text.

It makes use of so-called instruction prompts in LLMs such as GPT-4.

Three different strategies for knowledge extraction are currently implemented in the ontogpt package:

  • SPIRES: Structured Prompt Interrogation and Recursive Extraction of Semantics
    • Zero-shot learning (ZSL) approach to extracting nested semantic structures from text
    • This approach takes two inputs: (1) a LinkML schema and (2) free text, and outputs knowledge in a structure conformant with the supplied schema, in JSON, YAML, RDF, or OWL format
    • Uses text-davinci-003 or gpt-3.5-turbo (gpt-4 untested)
  • HALO: HAllucinating Latent Ontologies
    • Few-shot learning approach to generating/hallucinating a domain ontology given a few examples
    • Uses code-davinci-002
  • SPINDOCTOR: Structured Prompt Interpolation of Narrative Descriptions Or Controlled Terms for Ontological Reporting
    • Summarize gene set descriptions (pseudo gene-set enrichment)
    • Uses text-davinci-003 or gpt-3.5-turbo (gpt-4 untested)

Prerequisites

  • Python 3.9+
  • OpenAI account

An OpenAI key is necessary for using OpenAI's GPT models. This is a paid API and you will be charged based on usage. If you do not have an OpenAI account, you may sign up here. You will need to set your API key using the Ontology Access Kit:

poetry run runoak set-apikey -e openai <your openai api key>

You may also set additional API keys for optional resources:

  • BioPortal account (for grounding). The BioPortal key is necessary for using ontologies from BioPortal. You may get a key by signing up for an account on their web site.
  • NCBI E-utilities. The NCBI email address and API key are used for retrieving text and metadata from PubMed. You may still access these resources without identifying yourself, but you may encounter rate limiting and errors.
  • HuggingFace Hub. This API key is necessary to retrieve models from the HuggingFace Hub service.

These optional keys may be set as follows:

poetry run runoak set-apikey -e bioportal <your bioportal api key>
poetry run runoak set-apikey -e ncbi-email <your email address>
poetry run runoak set-apikey -e ncbi-key <your NCBI api key>
poetry run runoak set-apikey -e hfhub-key <your HuggingFace Hub api key>

Setup

For feature development and contributing to the package:

git clone https://github.com/monarch-initiative/ontogpt.git
cd ontogpt
poetry install

To simply start using the package in your workspace:

pip install ontogpt

Note that some features require installing additional, optional dependencies.

These may be installed as:

poetry install --extras extra_name
# OR
pip install ontogpt[extra_name]

where extra_name is one of the following:

  • docs - dependencies for building documentation
  • web - dependencies for the web application
  • recipes - dependencies for recipe scraping and parsing
  • gpt4all - dependencies for loading LLMs from GPT4All
  • textract - the textract plugin
  • huggingface - dependencies for accessing LLMs from HuggingFace Hub, remotely or locally
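
For example, to install the dependencies for the web application:

pip install ontogpt[web]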

Examples

Strategy 1: Knowledge extraction using SPIRES

Input

Consider some text from one of the input files used in the ontogpt test suite. You can find the text file here. You can download the raw file from the GitHub link, or copy its contents into another file, say, abstract.txt. An excerpt:

The cGAS/STING-mediated DNA-sensing signaling pathway is crucial for interferon (IFN) production and host antiviral responses

... [snip] ...

The underlying mechanism was the interaction of US3 with β-catenin and its hyperphosphorylation of β-catenin at Thr556 to block its nuclear translocation ...

We can extract knowledge from the above text into the GO pathway data model by running the following command:

Command

ontogpt extract -t gocam.GoCamAnnotations -i ~/path/to/abstract.txt

Note: The value accepted by the -t / --template argument is the base name of one of the LinkML schemas / data models found in the templates folder.

Output

The output returned from the above command can optionally be redirected into an output file using the -o / --output option.
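
Ex. (output file name is arbitrary):

ontogpt extract -t gocam.GoCamAnnotations -i ~/path/to/abstract.txt -o output.yaml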

The following is a small part of what the larger schema-compliant output looks like:

genes:
- HGNC:2514
- HGNC:21367
- HGNC:27962
- US3
- FPLX:Interferon
- ISG
gene_gene_interactions:
- gene1: US3
  gene2: HGNC:2514
gene_localizations:
- gene: HGNC:2514
  location: Nuclear
gene_functions:
- gene: HGNC:2514
  molecular_activity: Transcription
- gene: HGNC:21367
  molecular_activity: Production
...
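
If you redirect the output to a file, a short Python script can post-process it. A minimal sketch, assuming the result was saved as output.yaml (hypothetical name) and has the shape of the fragment above; requires PyYAML:

```python
# Minimal sketch: load saved SPIRES output and separate grounded entities
# (CURIEs such as HGNC:2514 or FPLX:Interferon) from ungrounded raw-text
# mentions (such as US3). File name is hypothetical.
import yaml

with open("output.yaml") as f:
    result = yaml.safe_load(f)

for gene in result.get("genes", []):
    status = "grounded" if ":" in gene else "ungrounded"
    print(f"{status}\t{gene}")
```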

Working Mechanism

  1. You provide an arbitrary data model, describing the structure you want to extract text into
    • This can be nested (but see limitations below)
  2. Provide your preferred annotators for grounding NamedEntity fields
  3. OntoGPT will:
    • Generate a prompt
    • Feed the prompt to a language model (currently OpenAI GPT models)
    • Parse the results into a dictionary structure
    • Ground the results using a preferred annotator
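
To make the parsing step concrete, here is a minimal, self-contained sketch of how a completion of "field: value; value" lines can be turned into a dictionary; the function and example completion are illustrative, not ontogpt internals:

```python
# Illustrative sketch of the SPIRES parsing step: "field: v1; v2" lines from
# an LLM completion become {field: [v1, v2]}. Not ontogpt's actual code.
def parse_completion(completion: str) -> dict:
    record = {}
    for line in completion.splitlines():
        if ":" not in line:
            continue
        field, _, values = line.partition(":")
        record[field.strip()] = [v.strip() for v in values.split(";") if v.strip()]
    return record

completion = """\
genes: cGAS; STING; US3
gene_localizations: cGAS to cytoplasm
"""
print(parse_completion(completion))
# {'genes': ['cGAS', 'STING', 'US3'], 'gene_localizations': ['cGAS to cytoplasm']}
```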

Strategy 2: HALO

Documentation to come

Strategy 3: Gene Enrichment using SPINDOCTOR

Given a set of genes, OntoGPT can find similarities among them.

Ex.:

ontogpt enrichment -U tests/input/genesets/sensory-ataxia.yaml

The default is to use ontological gene function synopses (via the Alliance API).

  • To use narrative/RefSeq summaries, use the --no-ontological-synopses flag
  • To run without any gene descriptions, use the --no-annotations flag

Features

Define your own extraction model using LinkML

There are a number of pre-defined LinkML data models already developed here - src/ontogpt/templates/ which you can use as reference when creating your own data models.

Define a schema (using a subset of LinkML) that describes the structure in which you want to extract knowledge from your text.

Example custom LinkML data model:

```yaml
classes:
  MendelianDisease:
    attributes:
      name:
        description: the name of the disease
        examples:
          - value: peroxisome biogenesis disorder
        identifier: true  ## needed for inlining
      description:
        description: a description of the disease
        examples:
          - value: >-
              Peroxisome biogenesis disorders, Zellweger syndrome spectrum (PBD-ZSS) is a group
              of autosomal recessive disorders affecting the formation of functional peroxisomes,
              characterized by sensorineural hearing loss, pigmentary retinal degeneration,
              multiple organ dysfunction and psychomotor impairment
      synonyms:
        multivalued: true
        examples:
          - value: Zellweger syndrome spectrum
          - value: PBD-ZSS
      subclass_of:
        multivalued: true
        range: MendelianDisease
        examples:
          - value: lysosomal disease
          - value: autosomal recessive disorder
      symptoms:
        range: Symptom
        multivalued: true
        examples:
          - value: sensorineural hearing loss
          - value: pigmentary retinal degeneration
      inheritance:
        range: Inheritance
        examples:
          - value: autosomal recessive
      genes:
        range: Gene
        multivalued: true
        examples:
          - value: PEX1
          - value: PEX2
          - value: PEX3

  Gene:
    is_a: NamedThing
    id_prefixes:
      - HGNC
    annotations:
      annotators: gilda:, bioportal:hgnc-nr

  Symptom:
    is_a: NamedThing
    id_prefixes:
      - HP
    annotations:
      annotators: sqlite:obo:hp

  Inheritance:
    is_a: NamedThing
    annotations:
      annotators: sqlite:obo:hp
```
  • Prompt hints can be specified using the prompt annotation (otherwise description is used)
  • Multivalued fields are supported
  • The default range is string; such fields are not grounded (e.g., disease name, synonyms)
  • Define a class for each NamedEntity
  • For any NamedEntity, you can specify a preferred annotator using the annotators annotation

We recommend following an established schema like BioLink Model, but you can define your own.

The next step is to compile the schema. Place the schema YAML in the directory src/ontogpt/templates/, then run the make command at the top level. This will compile the schema to Python (Pydantic classes).

Once you have defined your own schema / data model and placed it in the correct directory, you can run the extract command.

Ex.:

ontogpt extract -t mendelian_disease.MendelianDisease -i marfan-wikipedia.txt
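
The compiled Pydantic classes can also be used directly from Python. A hedged sketch, assuming the make step generated src/ontogpt/templates/mendelian_disease.py from the schema above (the import path and field names are assumptions about the generated code, not verified API):

```python
# Hedged sketch: instantiate the compiled Pydantic model directly.
# Field names follow the schema above; the generated class may differ.
from ontogpt.templates.mendelian_disease import MendelianDisease

disease = MendelianDisease(
    name="peroxisome biogenesis disorder",  # the identifier attribute
    synonyms=["Zellweger syndrome spectrum", "PBD-ZSS"],
    genes=["PEX1", "PEX2", "PEX3"],
)
print(disease)
```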

Multiple levels of nesting

Currently no more than two levels of nesting are recommended.

If a field has a range which is itself a class and not a primitive, it will attempt to nest.

Ex. the gocam schema has an attribute:

  attributes:
      ...
      gene_functions:
        description: semicolon-separated list of gene to molecular activity relationships
        multivalued: true
        range: GeneMolecularActivityRelationship

The range GeneMolecularActivityRelationship has been specified inline, so it will nest.

The generated prompt is:

gene_functions : <semicolon-separated list of gene to molecular activity relationships>

The output of this is then passed through further SPIRES iterations.

Text length limit

Currently SPIRES must use text-davinci-003, which has a total 4k token limit (prompt + completion).

You can pass in a parameter to split the text into chunks. Returned results will be recombined automatically, but more experiments are needed to determine how reliable this is.
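
A naive version of such chunking might look like the following sketch (word count as a rough proxy for tokens; this is illustrative, not ontogpt's implementation):

```python
# Naive chunking sketch: split text into word-bounded pieces intended to fit a
# 4k-token budget shared between prompt and completion. Illustrative only.
def chunk_text(text: str, max_words: int = 1500) -> list:
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```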

Schema tips

It helps to have an understanding of the LinkML schema language, but it should be possible to define your own schemas using the examples in src/ontogpt/templates as a guide.

OntoGPT-specific extensions are specified as annotations.

You can specify a set of annotators for a field using the annotators annotation.

Ex.:

  Gene:
    is_a: NamedThing
    id_prefixes:
      - HGNC
    annotations:
      annotators: gilda:, bioportal:hgnc-nr, obo:pr

The annotators are applied in order.
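
Conceptually, grounding is then a first-match lookup across that ordered list. A minimal sketch, with dictionaries standing in for the real annotators:

```python
# First-match grounding sketch: the first annotator that maps a mention to a
# CURIE wins. The dicts stand in for gilda:, bioportal:hgnc-nr, etc.
def ground(mention: str, annotators: list) -> str:
    for lexicon in annotators:
        if mention in lexicon:
            return lexicon[mention]  # first match wins
    return mention                   # ungrounded mentions keep their raw text

first = {"STING1": "HGNC:27962"}
second = {"STING1": "HGNC:27962", "CGAS": "HGNC:21367"}
print(ground("CGAS", [first, second]))  # -> HGNC:21367 (from the second annotator)
```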

Additionally, when performing grounding, the following measures can be taken to improve accuracy:

  • Specify the valid set of ID prefixes using id_prefixes
  • Some vocabularies have structural IDs that are amenable to regexes, you can specify these using pattern
  • You can make use of the values_from slot to specify a Dynamic Value Set
    • For example, you can constrain the set of valid locations for a gene product to be subclasses of cellular_component in GO or cell in CL

Ex.:

classes:
  ...
  GeneLocation:
    is_a: NamedEntity
    id_prefixes:
      - GO
      - CL
    annotations:
      annotators: "sqlite:obo:go, sqlite:obo:cl"
    slot_usage:
      id:
        values_from:
          - GOCellComponentType
          - CellType

enums:
  GOCellComponentType:
    reachable_from:
      source_ontology: obo:go
      source_nodes:
        - GO:0005575 ## cellular_component
  CellType:
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000000 ## cell
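
To preview which terms such a dynamic value set would admit, you can query the source ontology with the Ontology Access Kit. A sketch, assuming a recent oaklib (exact API may vary by version):

```python
# Hedged sketch: list is_a descendants of cellular_component (GO:0005575),
# i.e., the terms the GOCellComponentType value set would admit.
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

adapter = get_adapter("sqlite:obo:go")
for curie in adapter.descendants("GO:0005575", predicates=[IS_A]):
    print(curie, adapter.label(curie))
```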

OWL Exports

The extract command will let you export the results as OWL axioms, utilizing linkml-owl mappings in the schema.

Ex.:

ontogpt extract -t recipe -i recipe-spaghetti.txt -o recipe-spaghetti.owl -O owl

src/ontogpt/templates/recipe.yaml is an example schema that uses linkml-owl mappings.

See the Makefile for a full pipeline that involves using robot to extract a subset of FOODON and merge in the extracted results. This uses recipe-scrapers.

OWL output: recipe-all-merged.owl

Classification: (image omitted)

Web Application Setup

There is a bare bones web application for running OntoGPT and viewing results.

Install the required dependencies by running the following command:

poetry install -E web

Then run this command to start the web application:

poetry run web-ontogpt

Note: The agent running uvicorn must have the API key set, so for obvious reasons don't host this publicly without authentication, unless you want your credits drained.

OntoGPT Limitations

  1. Non-deterministic
  • This relies on an existing LLM, and LLMs can be fickle in their responses
  2. Coupled to OpenAI
  • You will need an OpenAI account to use their API. In theory any LLM can be used, but in practice the parser is tuned for OpenAI's models

SPINDOCTOR web app

To start:

poetry run streamlit run src/ontogpt/streamlit/spindoctor.py

HuggingFace Hub

A select number of LLMs may be accessed through HuggingFace Hub. See the full list using ontogpt list-models.

Specify a model name with the -m option.

Example:

ontogpt extract -t mendelian_disease.MendelianDisease -i tests/input/cases/mendelian-disease-sly.txt -m FLAN_T5_BASE

Using local models

OntoGPT supports using language models released by GPT4All.

Specify the name of a model when using the extract command with the -m or --model option and OntoGPT will retrieve the model.

For example:

ontogpt --verbose extract -t mendelian_disease.MendelianDisease -i mendelian-disease-sly.txt -m ggml-gpt4all-j-v1.3-groovy

will download the ggml-gpt4all-j-v1.3-groovy.bin file, generate a prompt, and try that prompt against the specified model.

Citation

SPIRES is described further in: Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning.

arXiv publication: http://arxiv.org/abs/2304.02711

Contributing

Contributions of recipes to test are welcome from anyone! Just make a PR here. See this list for accepted URLs.

Acknowledgements

We gratefully acknowledge Bosch Research for their support of this research project.
