Leveraging Biochemical Knowledge Extraction from Academic Literature through Large Language Models: A study of fine-tuning and similarity search

Tracking #: 3941-5155

Authors: 
Paulo Carmo
Marcos Gôlo
Jonas Gwozdz
Edgard Marx
Stefan Schmidt-Dichte
Caterina Thimm
Matthias Jooss
Pit Fröhlich
Ricardo Marcacini

Responsible editor: 
Mehwish Alam

Submission type: 
Full Paper
Abstract: 
The discovery of new drugs based on natural products is related to the efficient extraction of biochemical knowledge from scientific literature. Recent studies have introduced several enhancements to the NatUKE benchmark, improving the performance of knowledge graph embedding methods. These enhancements include refined PDF text extraction, named entity recognition, and improved embedding techniques. Notably, some approaches have incorporated large language models (LLMs). Building on these advances, this study investigates fine-tuning LLMs combined with similarity search, covering both open-source and proprietary models for the automatic extraction of biochemical properties. We fine-tune the LLMs with similarity search to mitigate textual inconsistencies and enhance the prediction of five target properties: compound name, bioactivity, species, collection site, and isolation type. Experimental results demonstrate that similarity search consistently improves performance, and open-source models can be competitive, occasionally outperforming proprietary models. We also find that the effectiveness of fine-tuning varies across models and biochemical properties. Overall, our findings highlight the potential of LLMs, particularly when fine-tuned and augmented with similarity search, as powerful tools for accelerating the extraction of biochemical knowledge from scientific texts.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 25/Oct/2025
Suggestion:
Major Revision
Review Comment:

The paper presents an approach for NER that explores LLMs (both open-source and proprietary), enhanced with fine-tuning and similarity search techniques. The proposed method is evaluated using the NatUKE benchmark, and the experimental results indicate improvements across five distinct tasks: identifying compound names, bioactivity, species, collection site, and isolation type.
The topic is timely and relevant, especially given the increasing role of LLMs in biomedical information extraction.
While my own expertise lies primarily in biomedical Knowledge Graphs (KGs) and KG embedding methods, rather than in LLMs, I find the paper’s focus on integrating LLM-based NER with domain-specific datasets promising. Nonetheless, I have several concerns regarding potential data leakage, the fairness of the comparisons, and the lack of standard metrics.
I leave some more detailed comments below:

- The motivation/introduction of the paper is not entirely clear to me. While KGs and KG embeddings appear to play a role in the introduction, they seem to be used mainly as baselines in the experiments. I would focus the introduction on LLMs and their limitations for NER. Moreover, the comparison between LLM-based methods (for NER) and KG-based methods (for link prediction) seems conceptually unfair, since the latter task differs in nature and does not benefit from access to the same textual knowledge or contextual information.

- There is a significant concern that some of the LLMs, especially proprietary ones, might have been trained on the NatUKE dataset or related data. This raises the risk of data leakage, which could compromise the validity of the reported performance gains. I would recommend that the authors explicitly discuss this issue.

- The paper introduces Hits@k metrics with variable k values for each task. While this might capture certain nuances, it makes comparison with prior work more difficult. It would be preferable to report standard ranking-based metrics such as Hits@1, Hits@10, and Hits@100, which are widely used in link prediction tasks.
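For context, Hits@k is simply the fraction of queries whose gold answer appears among the top-k ranked predictions. The function below is a generic illustration of the metric, not code from the paper:

```python
def hits_at_k(ranked_predictions, gold_answers, k):
    """Fraction of queries whose gold answer is among the top-k predictions."""
    hits = sum(
        1 for preds, gold in zip(ranked_predictions, gold_answers)
        if gold in preds[:k]
    )
    return hits / len(gold_answers)

# Two toy queries: the first gold answer is ranked 1st, the second is ranked 3rd.
ranked = [["quercetin", "rutin"], ["luteolin", "rutin", "quercetin"]]
gold = ["quercetin", "quercetin"]
print(hits_at_k(ranked, gold, 1))   # 0.5
print(hits_at_k(ranked, gold, 10))  # 1.0
```

Reporting the standard fixed cutoffs (k = 1, 10, 100) would be a one-line change to such a function, which is precisely why variable k values hinder comparability.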

- The paper seems to evaluate one model per entity type (compound, bioactivity, species, etc.). It would be interesting to know whether the authors experimented with a single multi-task model capable of recognizing all entity types simultaneously. This could offer insights into the model’s generalization ability.

- Figure 3 is not clear to me. It is difficult to understand how it differs from the results presented in Table 2, and what exactly the k parameter represents in these plots. Providing an explanation in the caption would help.

Minor comments:
- In Figure 1, consider simplifying the figure by removing the repeated "LLM output" box to make the diagram less cluttered.
- In citations within the text, consider using only the author's last name (e.g., change "data semantically integrable, do Carmo et al…" to "data semantically integrable, Carmo et al…").
- In line 38 of page 10, ensure that the dataset name "NaTUKE" is written consistently throughout the paper.

Review #2
Anonymous submitted on 07/Dec/2025
Suggestion:
Reject
Review Comment:

----------
Overview
----------

This paper proposes using LLMs with fine-tuning and similarity search for extracting five biochemical properties (compound name, bioactivity, species, collection site, isolation type) from scientific literature. The method consists of: (1) extracting text from PDFs using Nougat, (2) prompting LLMs (LLaMA 3.1, Qwen 2.5, Phi 4, GPT-4o) to extract properties, and (3) mapping LLM outputs to known answers via embedding-based similarity search. The authors evaluate configurations with and without fine-tuning across the NatUKE benchmark's four evaluation stages.
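The three stages above can be sketched end to end. In the toy sketch below, the PDF extractor and the LLM call are stubs (the real pipeline uses Nougat and LLaMA/Qwen/Phi/GPT-4o), and a character-trigram overlap stands in for the paper's embedding-based similarity search; all names are illustrative:

```python
# Toy end-to-end sketch: PDF text -> LLM prompt -> similarity-search mapping
# onto a closed vocabulary of known valid answers.

VOCAB = ["quercetin", "rutin", "luteolin"]  # toy index of valid answers

def extract_pdf_text(path):      # stand-in for the Nougat PDF extractor
    return "The flavonoid quercitin was isolated from the leaves."

def prompt_llm(text, prop):      # stand-in for the actual LLM call
    return "Quercitin"           # typical free-text (and misspelled) output

def trigrams(s):                 # toy embedding: set of character trigrams
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):            # Jaccard overlap of trigram sets
    return len(a & b) / len(a | b)

def map_to_vocab(output):        # the similarity-search mapping step
    return max(VOCAB, key=lambda v: similarity(trigrams(output), trigrams(v)))

text = extract_pdf_text("paper.pdf")
raw = prompt_llm(text, "compound name")
print(map_to_vocab(raw))  # quercetin, despite the misspelling
```

The sketch makes the reviewers' central observation concrete: the final prediction is produced by the mapping step, not by the LLM's raw output.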

----------
Strengths
----------

- Comprehensive model comparison: The paper systematically evaluates multiple LLMs (three open-source, one proprietary) across all five extraction tasks with multiple configurations (zero-shot, fine-tuned, with/without similarity search), providing useful empirical observations about open-source vs. proprietary model performance

- Practical resources released: The fine-tuned single-task models (Bike-name, Bike-bioactivity, Bike-specie, Bike-site, Bike-isolation) released on HuggingFace offer immediately usable resources for practitioners working in this domain

- Structured experimental framework: The six research questions provide clear organization, and the authors transparently report results across all configurations rather than cherry-picking favorable outcomes

- Honest acknowledgment of limitations: The authors candidly note that similarity search "limits the extraction to a known indexed world" (p.16), which is an important limitation for real-world deployment

-------------
Weaknesses
-------------

1. Fundamental Methodological Concern: Retrieval vs. Extraction

- The similarity search mechanism raises a critical question about what this system actually accomplishes:

-- The LLM generates free-text output that is often incorrect or imprecisely formatted
-- A separate embedding model maps this output to the nearest valid answer from a pre-indexed vocabulary
-- The retrieved answer becomes the final "prediction"

- This design has significant implications:

-- The system cannot discover novel entities: New compound names, species, or locations not in the index cannot be extracted, which limits applicability to real-world scenarios where the goal is often to identify previously unknown information
-- Performance metrics may be inflated: Since outputs are constrained to valid answers from the same vocabulary used in evaluation, the system cannot produce out-of-vocabulary errors
-- The LLM's extraction capability becomes secondary: Table 2 shows pre-trained LLMs score approximately 0.00-0.10 without similarity search but 0.70-1.00 with it—the retrieval step, not the LLM, drives performance

- The authors acknowledge this limitation but do not adequately justify why this trade-off is acceptable for a knowledge extraction system, nor do they discuss paths toward addressing it.
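The score gap the Table 2 comparison points to is mechanically easy to reproduce in a toy setting: exact matching fails on formatting noise, while nearest-neighbor mapping over a closed vocabulary can never return an out-of-vocabulary answer. A hedged sketch using a toy trigram similarity (not the paper's embedding models):

```python
VOCAB = ["quercetin", "artemisinin", "ginkgolide b"]

def trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def nearest(output, vocab):
    # Always returns some vocabulary entry, so OOV errors are impossible.
    return max(vocab, key=lambda v: len(trigrams(output) & trigrams(v)))

# Noisy free-text LLM outputs vs. their gold vocabulary entries.
outputs = ["Quercetin.", "ARTEMISININ (compound)", "Ginkgolide-B"]
gold = ["quercetin", "artemisinin", "ginkgolide b"]

exact = sum(o.lower() == g for o, g in zip(outputs, gold)) / len(gold)
mapped = sum(nearest(o, VOCAB) == g for o, g in zip(outputs, gold)) / len(gold)
print(exact, mapped)  # 0.0 1.0 in this toy example
```

Every raw output fails exact matching on punctuation or casing alone, yet all are "recovered" by retrieval, which is why the retrieval step, rather than the LLM, drives the reported scores.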

2. Disconnect from Knowledge Graph Literature

- The NatUKE benchmark was designed around NUBBEkg, a knowledge graph with rich relational structure connecting papers to compounds, species, bioactivities, and locations.

- Prior work on this benchmark leveraged graph topology:

-- EPHEN: Propagates embeddings through graph structure via regularization
-- Zope et al.: BFS walks over the KG capture structural relationships
-- Metapath2Vec: Meta-path-based walks exploit heterogeneous node types

- This paper abandons graph structure entirely—the knowledge graph is reduced to a flat lookup table of valid property values.

- No justification is provided for why ignoring relational structure is appropriate, nor is there analysis of what structural information is lost.

- The comparison to graph embedding methods (Table 2) is therefore not straightforward, as the approaches solve fundamentally different problems: graph-based link prediction vs. text-to-vocabulary retrieval.

3. Incoherent Use of Embeddings

- The paper uses separate embedding models for similarity search (sentence-transformers for open-source, text-embedding-ada-002 for GPT-4o) rather than the LLMs' own representations.

- The recent EPHEN++ work (do Carmo et al., SAC'25—reference [11] in the paper) found that BERT embeddings outperform LLaMA and Gemma embeddings for this task, but this paper does not build on or engage substantively with that finding.

- The rationale for the specific embedding model choices is not provided, and different similarity implementations for open-source vs. proprietary models (nearest-neighbor vs. FAISS with HNSW) introduce unnecessary variation.
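For reference, both backends answer the same nearest-neighbor query over embedding vectors; exhaustive scan is exact, while FAISS's HNSW index approximates it at scale. A minimal exhaustive cosine scan over hand-made toy vectors (illustrative only; the paper's embeddings are hundreds of dimensions):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d "embeddings" of indexed vocabulary entries.
index = {
    "quercetin": [0.9, 0.1, 0.0],
    "artemisinin": [0.1, 0.8, 0.2],
    "ginkgolide b": [0.0, 0.2, 0.9],
}

def search(query_vec, k=1):
    # Exhaustive scan; an approximate index (e.g., HNSW) trades a little
    # recall for speed but targets the same ranking.
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]),
                    reverse=True)
    return ranked[:k]

print(search([0.85, 0.15, 0.05]))  # ['quercetin']
```

Because the two backends only differ in speed/recall trade-offs, using one for open-source models and the other for GPT-4o adds variation without a clear benefit, which is the concern raised above.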

4. Experimental Design Concerns

- Training data limitations: Fine-tuning uses only 20% of data (first evaluation stage) due to computational constraints, which limits the validity of conclusions about fine-tuning effectiveness.

- Inconsistent input lengths: Text truncation varies (3000/2000/1500 words) across folds and properties, introducing confounding variables.

- Different prompt templates: Figure 2 shows different prompts for open-source models vs. GPT-4o (different role definitions, different output format instructions), making direct comparison problematic.

- Unexplained anomalies:

-- GPT-4o-SS achieves perfect 1.0 hits@50 on species extraction—no analysis of whether this reflects task triviality, pre-training data overlap, or genuine capability
-- Fine-tuning often degrades performance (e.g., Phi-SS: 0.89 → Phi-FT-SS: 0.70 on compound name, Table 2), contradicting the paper's framing of fine-tuning as beneficial

5. Related Work and Positioning

- Sections 2.1 (Automation extraction) and 2.5 (Large Language Models) cover related ground and would benefit from integration, contrasting traditional pipelined approaches with contemporary LLM-based methods.

- Missing discussion of:

-- Modern LLM-based information extraction approaches (structured extraction with schema guidance, function calling, JSON mode)
-- Retrieval-augmented generation (RAG) as an alternative architecture
-- Broader biomedical IE benchmarks (ChemProt, BC5CDR, SciREX) for context

- The paper does not adequately differentiate its contribution from the BiKE challenge papers, particularly Fröhlich et al., which also used LLMs (ChatGPT) in the pipeline.

6. Writing and Presentation

- Inconsistent capitalization throughout: "Similarity Search," "Knowledge Extraction," "biochemical" in section headers should follow standard conventions.

- Acronyms introduced inconsistently: FAISS appears on page 9, line 11, then expanded to "Facebook AI Similarity Search (FAISS)" on lines 38-39.

- The QLoRA mathematical formulations (Equations 1-4, pages 7-8) are unnecessary for a paper that applies these techniques rather than contributing to them; a citation and brief description would improve readability.

- Figure 2 prompt templates define the LLM role as "chemist" or "scientist trained in chemistry," but the domain is biochemistry and natural products—this mismatch may confuse readers.

----------------------------
Suggestions for Improvement
----------------------------

1. Clarify the contribution's scope: If the goal is closed-vocabulary retrieval (matching text to known entities), frame it as such and compare to retrieval baselines. If the goal is open extraction, address how the system would handle novel entities not in the index.

2. Justify abandoning graph structure: Explain why ignoring the knowledge graph's relational information is appropriate, or better, incorporate graph-based signals. A hybrid approach combining LLM extraction with graph-based validation could leverage both paradigms.

3. Analyze the similarity search dependency: Provide ablation studies examining:

- What semantic properties make LLM outputs retrievable even when incorrect?
- How does retrieval performance degrade as vocabulary size increases?
- What error modes occur (e.g., retrieving semantically similar but factually wrong entities)?

4. Standardize experimental conditions: Use consistent prompt templates, input lengths, and similarity search implementations across all models to enable valid comparisons.

5. Investigate anomalous results: The perfect species extraction and cases where fine-tuning hurts performance deserve deeper analysis—these may reveal important insights about the task or method.

6. Strengthen related work integration: Merge Sections 2.1 and 2.5 to provide a coherent narrative from traditional IE to contemporary LLM approaches, and position the contribution more precisely within this landscape.

7. Improve writing consistency: Standardize capitalization, introduce acronyms at first use, and streamline mathematical content to focus on novel contributions.

--------
Summary
--------

This paper presents a thorough empirical study comparing LLM configurations for biochemical property extraction. However, the core methodological design—using similarity search to map LLM outputs to a closed vocabulary—fundamentally limits the system's capability to perform genuine knowledge extraction. The approach cannot discover novel entities, which is often the primary goal of extracting information from scientific literature. Additionally, the abandonment of knowledge graph structure represents a departure from prior NatUKE work without adequate justification. The experimental design has multiple confounds that complicate interpretation of results.

The authors have conducted substantial experimental work and provided useful practical resources. With significant revision to address the methodological concerns—particularly clarifying the system's actual capabilities and limitations, or redesigning to enable open extraction—this work could make a meaningful contribution. In its current form, however, the paper does not sufficiently advance the state of knowledge extraction to warrant publication as a full research article.

Review #3
Anonymous submitted on 23/Feb/2026
Suggestion:
Reject
Review Comment:

Summary: This study investigates fine-tuned large language models augmented with similarity search for extracting biochemical properties from scientific literature. Results show that similarity search improves performance, with open-source models sometimes outperforming proprietary ones in biochemical knowledge extraction.

--> The introduction is too broad, and the research questions put forth are also very broad.
--> Page 2, lines 8-11: There is no need to introduce what RDF is.
--> The introduction describes a dataset in great detail to explain the problem statement, but it would be better if the problem were motivated with an example.
--> Page 2, line 14: could use a reference.
--> Lines 28 and 29 claim that biomedical properties are under-explored with LLMs, but there is a lot of ongoing discussion around this; there are even LLMs fine-tuned on biomedical data.
--> There are too many research questions, and they are still generic. For example, RQ1 makes it look as if this article is also surveying existing work.

--> In the related work, the authors write "automation extraction", but the section talks about information extraction. Is that a typo?
--> The authors do not need to introduce the SoTA on graph and knowledge graph embeddings, since it is not the main contribution of this paper.
--> The authors could refer to this survey for the existing SoTA: https://github.com/quqxui/Awesome-LLM4IE-Papers; it also includes papers on the biomedical domain.

--> In the main methodology, the authors perform post-processing to obtain optimal results; however, the model itself does not bring much improvement.
--> Did the authors explore LLMs that are already fine-tuned on the biomedical domain? How would the results change?
--> The authors describe the library used in the methodology section; this could move to the experimentation section.
--> There is no need to introduce LoRA in the main methodology. If the background is really needed, it should go in a preliminaries section.
--> The authors add a lot of mathematical equations about QLoRA; why are they needed? The authors can simply refer to the original paper, since this is not the main contribution of this paper. Section 3.1 (page 8, lines 38-41) is the contribution of this paper.
--> Section 3.2 says "we noticed that most of the incorrect extractions output similar answers, albeit with synonyms or slightly different formats". Are there any precise examples? Was this check systematic, like an error analysis?
--> Why can't FAISS also be tried? What is the main motivation behind introducing your own way of doing similarity search, and what are the advantages?
--> Section 3.2 is ill-written and hard to follow: it is repetitive in making its points and sometimes changes focus in each sentence.
--> Since the main contribution is unclearly defined, it is hard to see what this submission brings to the table.
--> Experiments are conducted on only one dataset.
--> The definition/formula of Hits@K is not needed.
--> In Table 2: why does the value of K change for each type?

Overall, the study is very experimental and lacks a clear scientific contribution. It still needs a lot of work before it can be accepted as a journal article.