Publication Harvester Agent: Guide To Automated Crawling
This document serves as an instruction manual for operating the ChatGPT Atlas Agent when integrated with the Publication Harvester backend. The primary objective is to automate large-scale web crawling, extract PDFs, convert them to Markdown using Gemini, and store them using pgvector.
1. Agent Responsibilities: The Core of Automated Publication Harvesting
At its core, the agent streamlines and automates the process of gathering, processing, and storing academic publications. Its specific responsibilities are described below.
Responding to User Requests
One of the primary functions of the agent is to handle diverse user requests related to publication harvesting. For example, a user might ask the agent to crawl specific repositories or websites, such as "Crawl DESY P05 + P07 and index all open-access PDFs from 2007–2025." The agent must interpret these requests accurately to initiate the appropriate crawling operations. This includes understanding the scope of the request, such as the specific sources to be crawled, the time frame for the publications, and any specific constraints, like only considering open-access materials.
Initiating Crawl Jobs
Once a user request is interpreted, the agent uses the /crawl backend action to start a job. This action triggers the backend system to begin crawling the specified sources according to the parameters defined in the user request. The initiation process involves setting up the necessary configurations, such as defining the URLs to be crawled, the depth of the crawl, and any filters to be applied during the crawling process. The agent ensures that the crawl is set up correctly to avoid unnecessary processing and to focus on the relevant publications.
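As a rough illustration, the sketch below shows what starting a job might look like from a Python client. The base URL, the payload field names (sources, year_start, year_end, open_access_only), and the job_id response field are assumptions made for the example, not a confirmed backend schema.

```python
import requests

BASE_URL = "https://harvester.example.org"  # placeholder for the actual backend host

def start_crawl(sources, year_start=None, year_end=None, open_access_only=True):
    """Kick off a crawl job via the /crawl action (field names are illustrative)."""
    payload = {
        "sources": sources,                   # list of repository / archive URLs to crawl
        "year_start": year_start,             # optional lower bound on publication year
        "year_end": year_end,                 # optional upper bound on publication year
        "open_access_only": open_access_only, # constraint: only freely accessible PDFs
    }
    resp = requests.post(f"{BASE_URL}/crawl", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["job_id"]              # assumed response field identifying the job
```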
Monitoring Job Progress
During the crawl, the agent continuously monitors the progress of the job via /job-status/{job_id}. This monitoring is crucial for providing real-time updates on the status of the crawl, including the number of PDFs discovered, the number of PDFs processed, and the number of PDFs remaining to be processed. The agent also tracks any errors encountered during the crawl, providing detailed information about the nature of the errors and their potential impact on the overall process. This continuous monitoring allows for timely intervention and adjustments to the crawl parameters if necessary.
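A minimal polling sketch follows, assuming the endpoint returns JSON with fields such as pdfs_discovered, pdfs_processed, state, and errors; these names are illustrative, not the backend's confirmed response schema.

```python
import time
import requests

def wait_for_job(job_id, base_url="https://harvester.example.org", poll_interval=60):
    """Poll /job-status/{job_id} until the job finishes, reporting progress along the way.

    Field and state names below are assumptions about the backend's response schema.
    """
    while True:
        resp = requests.get(f"{base_url}/job-status/{job_id}", timeout=30)
        resp.raise_for_status()
        status = resp.json()
        print(f"{status.get('pdfs_processed', 0)} / {status.get('pdfs_discovered', 0)} PDFs processed")
        if status.get("state") in ("completed", "failed"):
            return status
        time.sleep(poll_interval)  # throttle: one request per interval, never a tight loop
```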
Handling Browser Automation
In many cases, publication sources are not easily accessible through simple HTTP scraping. This is where the agent's browser automation capabilities come into play. The agent is equipped to handle various complex website interactions, such as:
- Logging into Portals: Many academic repositories require users to log in before accessing publications. The agent can automate the login process, using predefined credentials or prompting the user for credentials if necessary.
- Accepting Cookie Banners: Websites often display cookie banners that require user interaction before the site's content can be accessed. The agent can automatically detect and accept these banners, ensuring uninterrupted access to the publications.
- Navigating JavaScript-Heavy Sites: Modern websites often rely heavily on JavaScript to render content dynamically. The agent can execute JavaScript code, enabling it to navigate complex site structures and extract the required information.
Extracting Publication Links or PDFs
When simple HTTP scraping fails to extract publication links or PDFs, the agent employs advanced techniques to identify and extract these resources. This may involve analyzing the website's structure, identifying relevant HTML elements, and extracting the URLs or PDFs from these elements. The agent is designed to handle various website layouts and content structures, ensuring that it can extract publications from a wide range of sources.
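For static pages, this kind of extraction can be as simple as collecting anchors that point at PDF files. The sketch below uses requests and BeautifulSoup purely to illustrate the step; it does not reflect the backend's actual parser.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

def extract_pdf_links(page_url):
    """Return absolute URLs of links on a static page that appear to point at PDFs."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.lower().split("?")[0].endswith(".pdf"):
            links.add(urljoin(page_url, href))  # resolve relative links against the page URL
    return sorted(links)
```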
Providing Stable Job Summaries
The agent is responsible for providing stable summaries of running jobs without spamming calls to the backend. This means that the agent must efficiently manage its interactions with the backend, avoiding excessive requests that could overload the system. The agent uses intelligent caching and throttling techniques to ensure that it provides timely and accurate summaries without impacting the performance of the backend.
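One way to implement this is a small cache that refreshes at most once per interval, so repeated user questions about the same job reuse the last response instead of triggering new backend calls. The interval and summary format below are assumptions for the sketch.

```python
import time

class ThrottledStatus:
    """Cache the latest job status and refresh it at most once per `min_interval` seconds."""

    def __init__(self, fetch_status, min_interval=30):
        self.fetch_status = fetch_status  # callable that actually queries /job-status/{job_id}
        self.min_interval = min_interval
        self._cached = None
        self._last_fetch = 0.0

    def summary(self):
        now = time.monotonic()
        if self._cached is None or now - self._last_fetch >= self.min_interval:
            self._cached = self.fetch_status()  # only hit the backend when the cache is stale
            self._last_fetch = now
        s = self._cached
        return (f"{s.get('pdfs_processed', 0)} / {s.get('pdfs_discovered', 0)} PDFs processed, "
                f"{len(s.get('errors', []))} errors")
```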
2. Workflow Overview: A Step-by-Step Guide to Publication Harvesting
The agent's workflow is meticulously structured to ensure efficient and accurate publication harvesting. Here's a breakdown of the key steps involved:
Step A: Interpreting the User Request
The initial step involves a thorough interpretation of the user's request. This includes extracting essential information such as the following (a structured sketch of these parameters appears after the list):
- List of Publication Source URLs: Identifying the specific websites or repositories from which publications should be harvested.
- Year Range (Optional): Determining the desired publication years, allowing users to focus on specific periods of research.
- Special Constraints: Recognizing any additional criteria, such as open-access only, specific domains, or publication types.
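An interpreted request can be represented as a small structured object before anything is sent to the backend. The field names below are illustrative assumptions, and the example corresponds to "Crawl DESY P05 + P07 and index all open-access PDFs from 2007–2025", with placeholder URLs standing in for the real archive pages.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CrawlRequest:
    """Structured form of an interpreted user request (field names are illustrative)."""
    sources: list[str]                                        # publication source URLs
    year_start: Optional[int] = None                          # optional lower bound on publication year
    year_end: Optional[int] = None                            # optional upper bound
    open_access_only: bool = True                             # constraint: only freely accessible PDFs
    allowed_domains: list[str] = field(default_factory=list)  # optional domain restriction

# "Crawl DESY P05 + P07 and index all open-access PDFs from 2007–2025" might become:
request = CrawlRequest(
    sources=["https://example.org/p05-archive", "https://example.org/p07-archive"],  # placeholders
    year_start=2007,
    year_end=2025,
    open_access_only=True,
)
```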
Step B: Calling the Backend to Start the Crawl
Once the user request is interpreted, the agent invokes the /crawl action in the backend. This action triggers a comprehensive process that includes:
- Crawling Specified Sources: Systematically navigating the provided URLs to discover all available publications.
- Identifying Available Years: Automatically detecting the range of publication years present in the sources.
- Identifying Open-Access PDFs: Filtering the discovered publications to focus on those that are freely accessible.
- Downloading PDFs: Retrieving the PDF files for all identified open-access publications.
- Sending to Gemini for Markdown Extraction: Leveraging the power of Google's Gemini model to convert the PDFs into structured Markdown format.
- Splitting Markdown into Chunks: Dividing the Markdown content into manageable sections and subsections.
- Embedding Each Chunk: Creating vector embeddings for each chunk to enable semantic search and analysis.
- Storing in Postgres + pgvector: Saving the Markdown content, embeddings, and metadata in a Postgres database with pgvector for efficient storage and retrieval (a sketch of the chunk, embed, and store steps follows this list).
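The last three steps might look roughly like the sketch below. It assumes a hypothetical paper_chunks table, a fixed-size character chunker, and an embed_fn callable standing in for whatever embedding model the backend uses; none of this is the Harvester's confirmed schema.

```python
import numpy as np
import psycopg                                 # third-party: psycopg (v3)
from pgvector.psycopg import register_vector   # third-party: pgvector

def store_chunks(conn_info, paper_id, markdown, embed_fn, chunk_size=2000):
    """Split Markdown into fixed-size chunks, embed each one, and store them with pgvector."""
    chunks = [markdown[i:i + chunk_size] for i in range(0, len(markdown), chunk_size)]
    with psycopg.connect(conn_info) as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        register_vector(conn)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS paper_chunks (
                id          bigserial PRIMARY KEY,
                paper_id    text NOT NULL,
                chunk_index int  NOT NULL,
                content     text NOT NULL,
                embedding   vector(768)   -- dimension depends on the embedding model used
            )
        """)
        for i, chunk in enumerate(chunks):
            conn.execute(
                "INSERT INTO paper_chunks (paper_id, chunk_index, content, embedding) "
                "VALUES (%s, %s, %s, %s)",
                (paper_id, i, chunk, np.array(embed_fn(chunk))),
            )
```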
Step C: Monitoring Progress
The agent provides continuous monitoring of the crawl's progress using the /job-status/{job_id} endpoint. This monitoring includes displaying the following (a formatting sketch follows the list):
- Total PDFs Discovered: The cumulative number of PDF files identified during the crawl.
- PDFs Processed: The number of PDFs that have been successfully converted to Markdown and stored in the database.
- Remaining PDFs: The number of PDFs that are yet to be processed.
- Any Errors: Detailed information about any errors encountered during the crawl, facilitating troubleshooting and resolution.
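A status payload with those values might be rendered for the user as in the sketch below; the field names are assumptions about the backend's response, consistent with the earlier polling example.

```python
def format_status(status: dict) -> str:
    """Render a /job-status payload as a user-facing progress summary (field names assumed)."""
    discovered = status.get("pdfs_discovered", 0)
    processed = status.get("pdfs_processed", 0)
    errors = status.get("errors", [])
    lines = [
        f"Total PDFs discovered: {discovered}",
        f"PDFs processed:        {processed}",
        f"Remaining PDFs:        {discovered - processed}",
        f"Errors:                {len(errors)}",
    ]
    return "\n".join(lines)
```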
Step D: Using the Browser When Necessary
In certain scenarios, the backend may be unable to scrape a site due to factors such as:
- Authentication: Websites requiring login credentials to access content.
- Dynamic Content: Content generated dynamically using JavaScript.
- JS-Rendered Archives: Archives that rely on JavaScript to display publication links.
- CSRF Protections: Token-based safeguards that reject requests made outside a real browser session.
In these cases, the agent employs browser automation to:
- Navigate with Browser Commands: Controlling a web browser to navigate the site and interact with its elements.
- Extract Necessary Data: Identifying and extracting the required publication links or PDFs.
- Send Extracted PDF URLs to Backend: Submitting the extracted URLs to the backend via the /manual-pdf-submit endpoint for further processing (see the sketch after this list).
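A minimal sketch of that hand-off, assuming /manual-pdf-submit accepts a JSON body with a job_id and a list of PDF URLs (the payload shape is an assumption, not a confirmed contract):

```python
import requests

def submit_manual_pdfs(job_id, pdf_urls, base_url="https://harvester.example.org"):
    """Send browser-extracted PDF URLs to the backend for normal processing."""
    payload = {
        "job_id": job_id,
        "pdf_urls": sorted(set(pdf_urls)),  # de-duplicate before submitting
    }
    resp = requests.post(f"{base_url}/manual-pdf-submit", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```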
3. Agent Behavior Rules: Ensuring Responsible and Efficient Operation
To ensure responsible and efficient operation, the agent follows a set of well-defined behavior rules.
When to Use Backend Actions
The agent utilizes backend actions whenever the operation is related to:
- Starting a Crawl: Initiating the crawling process for a set of specified sources.
- Checking Crawl Progress: Monitoring the status and performance of an ongoing crawl.
- Submitting Manually Obtained Links: Providing a mechanism for submitting publication links that were not automatically discovered.
When to Use Browser Control
Browser navigation is reserved for situations where:
- Login or Forms are Required: Accessing content behind authentication walls or requiring form submissions.
- Content is Hidden Behind JS: Extracting content that is dynamically generated using JavaScript.
- PDF Links are Not Present in the Page HTML: Discovering PDF links that are not directly embedded in the HTML source code.
Avoid These Common Pitfalls
To maintain optimal performance and avoid potential issues, the agent is programmed to avoid:
- Repeating Actions Too Frequently: Minimizing unnecessary requests to prevent overloading the backend.
- Starting Duplicate Jobs for the Same URLs: Preventing redundant crawling by detecting and skipping jobs whose sources match an already running job (a fingerprinting sketch follows this list).
- Attempting to Scrape Inside ChatGPT: Relying on the backend for scraping operations rather than attempting to perform them within the ChatGPT environment.
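One simple client-side guard against duplicate jobs is to fingerprint each request before calling /crawl and skip the call when the fingerprint matches a job already started in the session. This is an illustrative technique, not a feature the backend is documented to provide.

```python
import hashlib

def crawl_fingerprint(sources, year_start=None, year_end=None):
    """Derive a stable fingerprint for a crawl request so identical jobs can be detected."""
    # Normalising and sorting means the same sources in a different order still match.
    normalised = sorted(url.strip().rstrip("/").lower() for url in sources)
    key = "|".join(normalised) + f"|{year_start}|{year_end}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

seen_jobs: dict[str, str] = {}  # fingerprint -> job_id for jobs started in this session

def start_crawl_once(sources, start_crawl, **kwargs):
    """Start a crawl only if no identical job was started earlier; otherwise return its job_id."""
    fp = crawl_fingerprint(sources, kwargs.get("year_start"), kwargs.get("year_end"))
    if fp not in seen_jobs:
        seen_jobs[fp] = start_crawl(sources, **kwargs)
    return seen_jobs[fp]
```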
4. Expected Outputs to the User: Providing Clear and Informative Feedback
The agent is designed to provide clear and informative outputs to the user, ensuring a seamless and transparent experience. These outputs include:
Progress Updates
Regular updates on the progress of the crawl, such as "17 / 63 papers processed," providing a real-time view of the operation's status.
Final Summary
A comprehensive summary upon completion of the crawl, including:
- Total Papers Indexed: The total number of publications successfully processed and stored.
- Years Covered: The range of publication years covered by the crawl.
- Number of Sections/Chunks Stored: The number of individual content chunks stored in the database.
Clear Success Confirmation or Error Report
A clear indication of the crawl's outcome, whether it was successful or encountered errors. In case of errors, a detailed report is provided to facilitate troubleshooting.
5. Example User Requests: Illustrating the Agent's Versatility
To demonstrate the agent's capabilities, here are a few example user requests:
Example 1: Indexing Open-Access Publications
"Index all open-access publications from DESY P05 from 2015–2024."
Example 2: Extracting PDFs from a Publisher Portal
"Log in to this publisher portal, extract all PDF links from 2020–2021 issues, and index them."
Example 3: Checking Crawl Status
"Check if yesterday’s crawl finished."
6. Future Expansion: Enhancing the Agent's Capabilities
The agent's design allows for future enhancements, including:
- Automatic Rate Limiting: Dynamically adjusting the crawling rate to avoid overloading target websites.
- Scheduling Recurring Crawls: Automating the process of periodically crawling specific sources.
- Automatic Retry of Failed PDFs: Implementing a mechanism to automatically retry processing PDFs that initially failed.
- Synchronization with Lab Internal Datasets: Integrating the agent with internal datasets to enrich the harvested publications.
This guide provides a comprehensive overview of the Publication Harvester Agent, its functionalities, and its operational guidelines. By adhering to these guidelines, users can effectively leverage the agent to automate the process of gathering, processing, and storing academic publications.