Why Not Use Only Raw Papers?#


The following examples walk through the same research question, once using only raw papers and once using derived data, to illustrate why structured artifacts make a meaningful difference.

Overview#

  • Solving the Same Research Question with Only Raw Papers
  • Breaking Down the Same Research Question with Derived Data

Solving the Same Research Question with Only Raw Papers#

Imagine an agent tasked with answering: "What are the main approaches to making LLMs robust to jailbreak attacks, and which ones could realistically be implemented?"

Without structured artifacts, the agent must begin by guessing how to search the literature. It typically performs a semantic search over titles, abstracts, or full papers using terms like jailbreak, prompt injection, or instruction robustness. The set of papers returned by this search becomes the foundation for everything that follows.

The agent then needs to read those papers and infer the key elements of each one: what problem the paper addresses, what method it proposes, how the method is evaluated, and what results or limitations are reported. To do this, it parses abstracts, introductions, method sections, evaluation sections, and results tables, trying to reconstruct the same dimensions that a researcher would naturally look for when assessing a paper.

However, because these dimensions are not explicitly defined, the agent must decide on the fly what information matters and how to extract it. Each step is therefore dependent on the previous one: the initial search determines which papers are analyzed; the interpretation of those papers determines how the problem space is framed; and that framing shapes how the agent compares results or identifies approaches.

This process also creates a memory and context problem. All of the paper content, intermediate extractions, reasoning steps, and comparisons must coexist within the same working context. As the agent reads more papers and generates more intermediate reasoning, the context becomes increasingly crowded with partial interpretations, earlier assumptions, and chains of thought. Earlier conclusions may influence later interpretation, important details may be dropped as the context window fills up, and the agent may begin summarizing information in ways that subtly change its meaning.

Inefficiencies and risks from this approach#

Search fragility
The initial query determines which papers are considered, and missing key terminology can exclude relevant work entirely.

Uncontrolled interpretation
The agent decides how to extract problems, methods, results, and assumptions from text, which can vary between runs.

Dependency chains
Every step relies on the accuracy of the previous one, so early misinterpretations propagate through the analysis.

Context pollution
The growing mixture of raw paper text, intermediate summaries, and reasoning steps increases the chance of confusion or drift.

Memory limits
As more papers are processed, the agent must repeatedly compress or discard information to stay within context limits.

Inconsistent comparisons
Methods and results may be interpreted differently across papers because the agent lacks a fixed schema.

High computational cost
Large volumes of text must be repeatedly processed and reinterpreted to reconstruct the same structured understanding.

In practice, the agent is attempting to recreate the structure of the research landscape dynamically from raw text, relying on its own reasoning at each step. This makes the process fragile, expensive, and highly sensitive to early assumptions, especially when the goal is to build a clear, reliable map of a research field.


Breaking Down the Same Research Question with Derived Data#

Imagine a research team asking the same question: "What are the main approaches to making LLMs robust to jailbreak attacks, and which ones could realistically be implemented?"

Instead of starting by reading papers, they begin by mapping the problem space over structured records: they run a semantic search across Problem artifacts using terms like jailbreak, prompt injection, or instruction robustness. Because each Problem is already a compact record, the search operates over small structured entries rather than entire papers.

This returns a set of Problem records, each linked to paper_ids. These IDs define the initial working dataset.
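This first step can be sketched in a few lines. The records below are invented for illustration, and plain keyword overlap stands in for a real embedding-based semantic search; only the paper_ids field name comes from the workflow described here.

```python
# Illustrative Problem records: compact structured entries, not full papers.
PROBLEMS = [
    {"problem_id": "P1", "summary": "defending LLMs against jailbreak prompts",
     "paper_ids": ["paper-101", "paper-102"]},
    {"problem_id": "P2", "summary": "prompt injection attacks on tool-using agents",
     "paper_ids": ["paper-103"]},
    {"problem_id": "P3", "summary": "efficient fine-tuning of vision models",
     "paper_ids": ["paper-104"]},
]

def search_problems(query: str, problems: list[dict]) -> list[dict]:
    """Return Problem records whose summary shares a term with the query.

    A stand-in for semantic search: a real system would rank by
    embedding similarity rather than exact word overlap.
    """
    terms = set(query.lower().split())
    return [p for p in problems if terms & set(p["summary"].lower().split())]

hits = search_problems("jailbreak prompt robustness", PROBLEMS)

# The matched Problems define the initial working dataset of paper IDs.
paper_ids = sorted({pid for p in hits for pid in p["paper_ids"]})
```

Because each record is a short summary plus a handful of fields, the whole corpus of Problems can be scanned cheaply, and the output is just a list of paper IDs for the next step.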

Next, they examine how the problem is being solved. Using the paper_ids returned from the previous step, they fetch the Method artifacts associated with those papers. At this stage they can also filter by simple attributes (such as method_type, core_idea_tags, or compute_profile) or run additional semantic searches within just this subset of Methods. Because the dataset is already narrowed to relevant papers, the search space is much smaller and more precise.

To understand how these approaches were evaluated, they retrieve Evaluation Pattern artifacts linked to those same papers. They can filter or search across attributes like benchmark, dataset, metric, or evaluation protocol, making it easy to isolate methods that were tested under comparable conditions.
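The same pattern applies to Evaluation Patterns: a plain attribute filter isolates evaluations run under comparable conditions. The records and benchmark values below are invented for illustration; only the benchmark and metric field names come from the text.

```python
# Illustrative Evaluation Pattern records for the working dataset.
EVALS = [
    {"paper_id": "paper-101", "benchmark": "AdvBench", "metric": "attack_success_rate"},
    {"paper_id": "paper-102", "benchmark": "AdvBench", "metric": "attack_success_rate"},
    {"paper_id": "paper-103", "benchmark": "CustomSet", "metric": "refusal_rate"},
]

def comparable_evals(evals: list[dict], benchmark: str, metric: str) -> list[dict]:
    """Keep only evaluations run under the same benchmark and metric."""
    return [e for e in evals
            if e["benchmark"] == benchmark and e["metric"] == metric]

shared = comparable_evals(EVALS, benchmark="AdvBench", metric="attack_success_rate")
```

Anything that falls outside the shared protocol (paper-103 here) is set aside rather than compared on unequal terms.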

Next, they look at performance by fetching Result artifacts tied to those evaluations. These records contain structured fields such as dataset, metric, and baseline comparison, allowing straightforward filtering and comparison across approaches.
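Because Result records carry structured numeric fields, comparison reduces to ordinary sorting. The values below are invented; the dataset, metric, and baseline-comparison fields mirror those named in the text.

```python
# Illustrative Result records tied to the comparable evaluations.
RESULTS = [
    {"paper_id": "paper-101", "dataset": "AdvBench",
     "metric": "attack_success_rate", "value": 0.12, "baseline_value": 0.65},
    {"paper_id": "paper-102", "dataset": "AdvBench",
     "metric": "attack_success_rate", "value": 0.30, "baseline_value": 0.65},
]

def rank_by_improvement(results: list[dict]) -> list[dict]:
    """Sort Results by improvement over baseline.

    For attack success rate, lower is better, so improvement is
    baseline_value - value.
    """
    return sorted(results,
                  key=lambda r: r["baseline_value"] - r["value"],
                  reverse=True)

ranked = rank_by_improvement(RESULTS)
# paper-101 ranks first: it cuts attack success from 0.65 to 0.12.
```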

Finally, they evaluate feasibility and limitations by retrieving Resource Requirement artifacts (compute, memory, dependencies) and Failure artifacts describing known weaknesses.
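The feasibility step can be sketched the same way: join Resource Requirement and Failure records against the shortlist and filter by an available budget. All records, fields beyond those named in the text (compute, memory, dependencies), and the feasibility_report helper are hypothetical.

```python
# Illustrative Resource Requirement and Failure records.
REQUIREMENTS = {
    "paper-101": {"gpus": 8, "memory_gb": 320, "dependencies": ["training pipeline"]},
    "paper-102": {"gpus": 0, "memory_gb": 4, "dependencies": ["classifier checkpoint"]},
}
FAILURES = {
    "paper-101": ["degrades helpfulness on benign prompts"],
    "paper-102": ["bypassed by paraphrased attacks"],
}

def feasibility_report(paper_ids: list[str], max_gpus: int) -> list[dict]:
    """Keep approaches whose requirements fit the budget, attaching
    known failure modes so the final judgment stays informed."""
    report = []
    for pid in paper_ids:
        req = REQUIREMENTS.get(pid, {})
        if req.get("gpus", 0) <= max_gpus:
            report.append({"paper_id": pid, "failures": FAILURES.get(pid, [])})
    return report

report = feasibility_report(["paper-101", "paper-102"], max_gpus=2)
# With only 2 GPUs available, the training-heavy approach drops out,
# but its cheaper alternative comes with a documented weakness to weigh.
```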

Throughout this process, each step operates on progressively smaller subsets of structured records:

  • Semantic search across Problem records
  • Filter by returned paper_ids
  • Fetch and search across Methods within that subset
  • Filter or search across Evaluation Patterns
  • Compare Results
  • Retrieve Resource Requirements and Failures

Because the searches are performed on small structured artifacts rather than full paper text, each query is lightweight and precise. Researchers can combine semantic search, attribute filtering, and subset queries to iteratively narrow the landscape without repeatedly parsing large documents.

This approach solves several inefficiencies present in raw-paper analysis. The system never needs to repeatedly process full PDFs to rediscover the same structure; each dimension (problems, methods, evaluations, results, and constraints) can be retrieved directly. Attribute filtering becomes a deterministic operation over known fields, and reasoning can happen on a much smaller, cleaner set of data.

In practice, this means the team can map the research landscape with a sequence of fast retrieval steps, and only open a small number of canonical papers once the relevant approaches and evidence have already been identified.