Review Comment:
---------
Overview
---------
The paper, overall, is well written. In terms of content there is hardly any redundancy or unnecessary repetition in the ideas presented. Indeed, the way the prompting strategies are discussed in Section 3 with corresponding examples sheds very clear light on the methodology; this section almost reads like a manual for other researchers interested in trying out various prompting strategies. That said, the paper is full of short paragraphs, which makes it feel choppy to read; it would benefit from another round of edits that merges them into longer paragraphs. Furthermore, as a research paper for the SWJ, the current form is very lean on empirical insights: it reads more like an application of existing methods to a single dataset, and thus leans more in the direction of a solid workshop paper on the theme of LLMs.
---------
Strengths
---------
Various flavors of prompt engineering strategies are introduced, very well described, appropriately applied, and soundly tested.
----------
Weaknesses
----------
> I think the paper is relatively lean in terms of contextualizing the work w.r.t. existing work in the field. I remain unsure what the end objective or purpose of this work is. Is it just to test various prompt engineering strategies? … and for which application exactly?
> As far as knowledge extraction goes, the open-domain Wikipedia is certainly one of the more tried-and-tested sources out there, but there are others too, such as the biomedical domain. A notable mention is the BioCreative shared task series (https://biocreative.bioinformatics.udel.edu/tasks/), which each year releases invaluable databases of biomedical relational knowledge, or even the UMLS (https://www.nlm.nih.gov/research/umls/index.html). As an SWJ paper, I would have liked a submission along these lines to set a broader vision in terms of applications and insights on the theme of prompt engineering for knowledge extraction. The present version of the work seems too narrowly focused and a bit lean on the empirical generalizability or applicability of the methodologies.
> The empirical evaluations are also lean. Only the GPT-4 model is tested. The downside of using proprietary, closed-source models is that no insights can be obtained for open-source LLM development in future work. The paper sets the tone of an empirical evaluation work, and while various flavors of prompting strategies are tested, a well-rounded empirical evaluation warrants tests with at least two or three additional, preferably non-proprietary, models, e.g. Mistral [1] or Llama 2 [2], to offer the reader well-rounded insights.
> I like that the paper references the prompt engineering guide (https://www.promptingguide.ai/techniques), but in terms of insights, tying in to the previous point, as a journal paper for the SWJ I would have liked to see more insights and recommendations derived from more LLMs. Is a prompting method consistently strong across the various tested LLMs? Which prompting method is recommended for the task given each respective LLM?
> In general, even if multiple LLMs were not tested, another direction for contribution could be to introduce a new prompt engineering method, which would show how this work goes beyond existing prompt engineering work as a research paper. Otherwise, tests with other relation extraction datasets would also count as offering strong empirical insights. For instance, as referenced above, the biomedical domain (e.g., BioCreative, BioNLP) has various strong datasets to support RE.
***********
References
***********
1. Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand et al. "Mistral 7B." arXiv preprint arXiv:2310.06825 (2023).
2. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
--------------------------------------
Comments and questions for the authors
--------------------------------------
> Page 4, Line 44: Aren’t “task description” and “direct instruction” the same thing? You seem to say that the prompt in lines 47 to 51 follows a zero-shot setting and thereby (see line 44) “there is no task description … incorporated into the prompt.” I think there is a naming problem here. The x-shot setting (or in-context learning) only concerns itself with task examples; I do not think it has anything to do with the task description, or do the authors see this differently? In my view, “task descriptions” and “task instructions” are the same thing. Without a “task description,” how is the LLM to know what is expected of it? Maybe consider alternative naming to describe precisely what you mean.
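To make the distinction I have in mind concrete, below is a minimal sketch in Python; the prompt wording and helper names are hypothetical and not taken from the paper. The point is that the "shot" terminology only counts in-context examples, and both variants still carry a task description so the LLM knows what is expected of it.

    # Hypothetical illustration: "zero-shot" vs. "one-shot" concerns only the
    # number of in-context examples; both prompts still contain a task
    # description (instruction).
    TASK_DESCRIPTION = (
        "Extract (subject, relation, object) triples from the text below "
        "and return them as a JSON list."
    )

    def zero_shot_prompt(text: str) -> str:
        # Zero-shot: task description + input text, but no worked example.
        return f"{TASK_DESCRIPTION}\n\nText: {text}\nTriples:"

    def one_shot_prompt(text: str) -> str:
        # One-shot: the same task description plus exactly one in-context example.
        example = (
            "Text: Audi is a German automobile manufacturer.\n"
            'Triples: [["Audi", "country", "Germany"]]'
        )
        return f"{TASK_DESCRIPTION}\n\n{example}\n\nText: {text}\nTriples:"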
> Page 5: I had trouble interpreting the phrase “a one-shot RAG prompt” as intended by the authors. My first reading was that it referred to the prompt used by the retriever that performs the RAG operation, rather than the prompt for the LLM. Maybe the authors could consider rewording; the sketch below shows the two readings I had in mind.
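This is a minimal, hypothetical sketch (not the authors' implementation) of the two places a "prompt" appears in a RAG pipeline; my first reading corresponded to the retriever query rather than the augmented LLM prompt.

    # Hypothetical sketch of the two possible readings of "a one-shot RAG prompt".
    def retrieve(query: str, store: list[str], k: int = 3) -> list[str]:
        # Reading 1: the "prompt" as the query handed to the retriever.
        # (Toy keyword match standing in for an actual retriever.)
        hits = [doc for doc in store if query.lower() in doc.lower()]
        return hits[:k]

    def rag_llm_prompt(question: str, passages: list[str], example: str) -> str:
        # Reading 2 (presumably the intended one): the prompt handed to the LLM,
        # augmented with the retrieved passages plus one in-context example.
        context = "\n".join(passages)
        return f"{example}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"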
> This is relatively minor and offered as a point of reflection for the authors. Does the content in subsection 3.2 contribute to the theme stated by the overall section title “Prompt Engineering Methods”? To me it does not quite merit a subsection per se. Instead, the text of subsection 3.2 could be incorporated into subsection 3.3 as motivation.
> Page 9, section 3.6, lines 23 to 33: for the in-context example, i.e. the Audi-related text, was the knowledge instance, i.e. “Knowledge: [“Automobile is a/an concept.”, …]”, generated automatically in a separate step, or is it human-written? What does the whole workflow look like?
> Page 9, section 3.6, lines 23 to 33: I also do not understand how the generated knowledge is helpful to the desired end goal, i.e. the generation of triples. The content listed as knowledge, i.e. “Knowledge: [“Automobile is a/an concept.”, …]”, is not part of the final set of triples elicited, is it? An explanation of what the authors deem as knowledge, and of how they have encoded it, would add clarity.
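For reference, the two-step workflow I would have expected for generated knowledge prompting looks roughly like the sketch below (hypothetical prompt wording; the llm argument stands in for any completion API). My question is essentially which of the two steps is automated in the paper, and how the generated statements feed into the extraction step if they never appear among the final triples.

    from typing import Callable

    # Hypothetical sketch of generated knowledge prompting as I understand it.
    def generate_knowledge(llm: Callable[[str], str], text: str) -> str:
        # Step 1: ask the model for background statements about the input text,
        # e.g. '["Automobile is a/an concept.", ...]'.
        return llm(f"List background knowledge statements about the entities in:\n{text}")

    def extract_triples(llm: Callable[[str], str], text: str, knowledge: str) -> str:
        # Step 2: the generated statements are prepended as extra context; they
        # guide the extraction but are not themselves part of the output triples.
        return llm(
            f"Knowledge: {knowledge}\n\nText: {text}\n\n"
            "Extract (subject, relation, object) triples as a JSON list."
        )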
> The “Self-Consistency Prompts” method and the “Reason and Act Prompts” method seem quite similar. I think more explanation, justification, or clarification is needed as to when one would be used over the other.
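For what it is worth, my own mental model of the difference is roughly the sketch below (hypothetical code, not the authors' implementation): self-consistency samples several independent reasoning paths and majority-votes the final answers, whereas ReAct interleaves reasoning with external actions in a single trajectory. If that matches the authors' understanding, stating it explicitly would help the reader see when each method is appropriate.

    from collections import Counter
    from typing import Callable

    # Hypothetical contrast; llm and tool stand in for an arbitrary
    # completion API and an arbitrary external action (e.g. a search call).
    def self_consistency(llm: Callable[[str], str], prompt: str, n: int = 5) -> str:
        # Sample n independent reasoning paths, then majority-vote the final lines.
        answers = [llm(prompt).strip().splitlines()[-1] for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]

    def react(llm: Callable[[str], str], tool: Callable[[str], str],
              question: str, max_steps: int = 5) -> str:
        # Interleave Thought -> Action -> Observation within one trajectory.
        transcript = f"Question: {question}\n"
        for _ in range(max_steps):
            thought = llm(transcript + "Thought:")
            transcript += f"Thought: {thought}\n"
            if "Final Answer:" in thought:
                return thought.split("Final Answer:")[-1].strip()
            transcript += f"Observation: {tool(thought)}\n"
        return transcript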
> Page 13, line 15: it says “seventeen” distinct prompting strategies, but Section 3 introduces six (considering 3.2 as part of 3.3). I see from Table 1 what is meant by seventeen; maybe consider rephrasing it as “six different prompting strategies under three different prompting settings…”.
> Page 14, lines 37 and 38: again minor and for readability. When explaining the results, I would suggest including text that directly points the reader to the cell being read, for instance something along the lines of: “compare the numbers in the last columns for the row ‘zero-shot’ versus ‘one-shot’, …”.