Review Comment:
----------
Overview
----------
This paper proposes using LLMs with fine-tuning and similarity search for extracting five biochemical properties (compound name, bioactivity, species, collection site, isolation type) from scientific literature. The method consists of: (1) extracting text from PDFs using Nougat, (2) prompting LLMs (LLaMA 3.1, Qwen 2.5, Phi 4, GPT-4o) to extract properties, and (3) mapping LLM outputs to known answers via embedding-based similarity search. The authors evaluate configurations with and without fine-tuning across the NatUKE benchmark's four evaluation stages.
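Concretely, step (3) reduces to a nearest-neighbor lookup in embedding space. A minimal sketch of the mechanism, using a toy character-bigram embedder as a dependency-free stand-in for the paper's sentence-transformers / text-embedding-ada-002 encoders (all names and the example vocabulary are illustrative, not taken from the paper):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy character-bigram embedding; a stand-in for the paper's
    sentence-transformers / text-embedding-ada-002 encoders."""
    vec = np.zeros(26 * 26)
    letters = [c for c in text.lower() if c.isalpha()]
    for a, b in zip(letters, letters[1:]):
        vec[(ord(a) - 97) * 26 + (ord(b) - 97)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def map_to_vocabulary(llm_output: str, vocabulary: list[str]) -> str:
    """Snap free-text LLM output to the closest indexed answer by
    cosine similarity (exact nearest neighbor over the vocabulary)."""
    index = np.stack([embed(v) for v in vocabulary])
    similarities = index @ embed(llm_output)
    return vocabulary[int(np.argmax(similarities))]

species_index = ["Aspidosperma nitidum", "Croton cajucara", "Piper aduncum"]
print(map_to_vocabulary("The species studied was Aspidosperma nitidum Benth.",
                        species_index))
```

Whatever the LLM emits, the returned prediction is always one of the indexed values — a property that matters for the first weakness discussed below.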
----------
Strengths
----------
- Comprehensive model comparison: The paper systematically evaluates multiple LLMs (three open-source, one proprietary) across all five extraction tasks with multiple configurations (zero-shot, fine-tuned, with/without similarity search), providing useful empirical observations about open-source vs. proprietary model performance
- Practical resources released: The fine-tuned single-task models (Bike-name, Bike-bioactivity, Bike-specie, Bike-site, Bike-isolation) released on HuggingFace offer immediately usable resources for practitioners working in this domain
- Structured experimental framework: The six research questions provide clear organization, and the authors transparently report results across all configurations rather than cherry-picking favorable outcomes
- Honest acknowledgment of limitations: The authors candidly note that similarity search "limits the extraction to a known indexed world" (p.16), which is an important limitation for real-world deployment
-------------
Weaknesses
-------------
1. Fundamental Methodological Concern: Retrieval vs. Extraction
- The similarity search mechanism raises a critical question about what this system actually accomplishes:
-- The LLM generates free-text output that is often incorrect or imprecisely formatted
-- A separate embedding model maps this output to the nearest valid answer from a pre-indexed vocabulary
-- The retrieved answer becomes the final "prediction"
- This design has significant implications:
-- The system cannot discover novel entities: New compound names, species, or locations not in the index cannot be extracted, which limits applicability to real-world scenarios where the goal is often to identify previously unknown information
-- Performance metrics may be inflated: Since outputs are constrained to valid answers from the same vocabulary used in evaluation, the system cannot produce out-of-vocabulary errors
-- The LLM's extraction capability becomes secondary: Table 2 shows pre-trained LLMs score approximately 0.00-0.10 without similarity search but 0.70-1.00 with it—the retrieval step, not the LLM, drives performance
- The authors acknowledge this limitation but do not adequately justify why this trade-off is acceptable for a knowledge extraction system, nor do they discuss paths toward addressing it.
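The no-out-of-vocabulary property is easy to demonstrate. In the sketch below (stdlib string similarity stands in for embeddings to keep it dependency-free; the vocabulary is illustrative), a slightly garbled answer is usefully repaired, but a genuinely novel entity is silently overwritten rather than flagged:

```python
import difflib

vocabulary = ["Aspidosperma nitidum", "Croton cajucara", "Piper aduncum"]

def snap(llm_output: str) -> str:
    """Return the best-matching vocabulary entry; by construction this
    never fails and never produces an out-of-vocabulary answer."""
    return max(vocabulary,
               key=lambda v: difflib.SequenceMatcher(
                   None, llm_output.lower(), v.lower()).ratio())

# A slightly garbled answer is repaired (the useful case) ...
print(snap("Piper aduncun"))            # -> "Piper aduncum"
# ... but a genuinely novel species is overwritten, not flagged:
print(snap("Ocotea novel-species XY"))  # still returns an indexed entry
```

Any evaluation run over such outputs can only ever record in-vocabulary predictions, which is why the reported metrics conflate extraction ability with retrieval.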
2. Disconnect from Knowledge Graph Literature
- The NatUKE benchmark was designed around NUBBEkg, a knowledge graph with rich relational structure connecting papers to compounds, species, bioactivities, and locations.
- Prior work on this benchmark leveraged graph topology:
-- EPHEN: Propagates embeddings through graph structure via regularization
-- Zope et al.: BFS walks over the KG capture structural relationships
-- Metapath2Vec: Meta-path-based walks exploit heterogeneous node types
- This paper abandons graph structure entirely—the knowledge graph is reduced to a flat lookup table of valid property values.
- No justification is provided for why ignoring relational structure is appropriate, nor is there analysis of what structural information is lost.
- The comparison to graph embedding methods (Table 2) is therefore not straightforward, as the approaches solve fundamentally different problems: graph-based link prediction vs. text-to-vocabulary retrieval.
3. Incoherent Use of Embeddings
- The paper uses separate embedding models for similarity search (sentence-transformers for open-source, text-embedding-ada-002 for GPT-4o) rather than the LLMs' own representations.
- The recent EPHEN++ work (do Carmo et al., SAC'25—reference [11] in the paper) found that BERT embeddings outperform LLaMA and Gemma embeddings for this task, but this paper does not build on or engage substantively with that finding.
- The rationale for the specific embedding model choices is not provided, and different similarity implementations for open-source vs. proprietary models (nearest-neighbor vs. FAISS with HNSW) introduce unnecessary variation.
4. Experimental Design Concerns
- Training data limitations: Fine-tuning uses only 20% of data (first evaluation stage) due to computational constraints, which limits the validity of conclusions about fine-tuning effectiveness.
- Inconsistent input lengths: Text truncation varies (3000/2000/1500 words) across folds and properties, introducing confounding variables.
- Different prompt templates: Figure 2 shows different prompts for open-source models vs. GPT-4o (different role definitions, different output format instructions), making direct comparison problematic.
- Unexplained anomalies:
-- GPT-4o-SS achieves perfect 1.0 hits@50 on species extraction—no analysis of whether this reflects task triviality, pre-training data overlap, or genuine capability
-- Fine-tuning often degrades performance (e.g., Phi-SS: 0.89 → Phi-FT-SS: 0.70 on compound name, Table 2), contradicting the paper's framing of fine-tuning as beneficial
5. Related Work and Positioning
- Sections 2.1 (Automation extraction) and 2.5 (Large Language Models) cover related ground and would benefit from integration, contrasting traditional pipelined approaches with contemporary LLM-based methods.
- Missing discussion of:
-- Modern LLM-based information extraction approaches (structured extraction with schema guidance, function calling, JSON mode)
-- Retrieval-augmented generation (RAG) as an alternative architecture
-- Broader biomedical IE benchmarks (ChemProt, BC5CDR, SciREX) for context
- The paper does not adequately differentiate its contribution from the BiKE challenge papers, particularly Fröhlich et al., which also used LLMs (ChatGPT) in the pipeline.
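To make the first missing alternative concrete: under schema-guided extraction (e.g., JSON mode or function calling), the pipeline validates the model's output rather than snapping it to a known answer, so novel entities pass through and malformed outputs surface as errors. A minimal sketch (the response string is a hypothetical model output; the field names are illustrative, chosen to mirror the five NatUKE properties):

```python
import json

REQUIRED_FIELDS = {"compound_name", "bioactivity", "species",
                   "collection_site", "isolation_type"}

def parse_extraction(response: str) -> dict:
    """Validate a JSON-mode style LLM response: malformed or incomplete
    outputs raise instead of being silently mapped to an indexed answer."""
    record = json.loads(response)  # raises on non-JSON output
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

# Hypothetical JSON-mode response; a novel compound name passes through unchanged.
response = '''{"compound_name": "example-compound-1",
               "bioactivity": "antitumoral",
               "species": "Croton cajucara",
               "collection_site": "Amazon basin",
               "isolation_type": "bark extract"}'''
print(parse_extraction(response)["compound_name"])
```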
6. Writing and Presentation
- Inconsistent capitalization throughout: "Similarity Search," "Knowledge Extraction," "biochemical" in section headers should follow standard conventions.
- Acronyms introduced inconsistently: FAISS first appears unexpanded on page 9, line 11, and is only expanded to "Facebook AI Similarity Search (FAISS)" later, on lines 38-39; acronyms should be expanded at first use.
- The QLoRA mathematical formulations (Equations 1-4, pages 7-8) are unnecessary for a paper that applies these techniques rather than contributing to them; a citation and brief description would improve readability.
- Figure 2 prompt templates define the LLM role as "chemist" or "scientist trained in chemistry," but the domain is biochemistry and natural products—this mismatch may confuse readers.
----------------------------
Suggestions for Improvement
----------------------------
1. Clarify the contribution's scope: If the goal is closed-vocabulary retrieval (matching text to known entities), frame it as such and compare to retrieval baselines. If the goal is open extraction, address how the system would handle novel entities not in the index.
2. Justify abandoning graph structure: Explain why ignoring the knowledge graph's relational information is appropriate, or better, incorporate graph-based signals. A hybrid approach combining LLM extraction with graph-based validation could leverage both paradigms.
3. Analyze the similarity search dependency: Provide ablation studies examining:
- What semantic properties make LLM outputs retrievable even when incorrect?
- How does retrieval performance degrade as vocabulary size increases?
- What error modes occur (e.g., retrieving semantically similar but factually wrong entities)?
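The second of these ablations can be prototyped cheaply before touching the real data. A skeleton using synthetic names and stdlib string matching (all parameters are illustrative; a real run would substitute the paper's embedders and vocabularies):

```python
import difflib
import random
import string

random.seed(0)

def rand_name(length: int = 10) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=length))

def nearest(output: str, vocab: list[str]) -> str:
    """Exact nearest-neighbor retrieval by string similarity."""
    return max(vocab,
               key=lambda v: difflib.SequenceMatcher(None, output, v).ratio())

def corrupt(name: str, edits: int = 3) -> str:
    """Mimic an imprecise LLM output by mutating a few characters."""
    chars = list(name)
    for i in random.sample(range(len(chars)), edits):
        chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)

pool = [rand_name() for _ in range(2000)]
for size in (50, 200, 1000, 2000):
    vocab = pool[:size]
    queries = random.sample(vocab, 30)
    accuracy = sum(nearest(corrupt(q), vocab) == q for q in queries) / 30
    print(f"vocab={size:5d}  retrieval accuracy={accuracy:.2f}")
```

With random strings the distractors are dissimilar, so degradation may be mild; with realistic vocabularies of highly similar entries (e.g., congeneric species names), the same probe should show it much more sharply.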
4. Standardize experimental conditions: Use consistent prompt templates, input lengths, and similarity search implementations across all models to enable valid comparisons.
5. Investigate anomalous results: The perfect species extraction and cases where fine-tuning hurts performance deserve deeper analysis—these may reveal important insights about the task or method.
6. Strengthen related work integration: Merge Sections 2.1 and 2.5 to provide a coherent narrative from traditional IE to contemporary LLM approaches, and position the contribution more precisely within this landscape.
7. Improve writing consistency: Standardize capitalization, introduce acronyms at first use, and streamline mathematical content to focus on novel contributions.
--------
Summary
--------
This paper presents a thorough empirical study comparing LLM configurations for biochemical property extraction. However, the core methodological design—using similarity search to map LLM outputs to a closed vocabulary—fundamentally limits the system's capability to perform genuine knowledge extraction. The approach cannot discover novel entities, which is often the primary goal of extracting information from scientific literature. Additionally, the abandonment of knowledge graph structure represents a departure from prior NatUKE work without adequate justification. The experimental design has multiple confounds that complicate interpretation of results.
The authors have conducted substantial experimental work and provided useful practical resources. With significant revision to address the methodological concerns—particularly clarifying the system's actual capabilities and limitations, or redesigning to enable open extraction—this work could make a meaningful contribution. In its current form, however, the paper does not sufficiently advance the state of knowledge extraction to warrant publication as a full research article.