The Anonymized Agentic Multimodal Retrieval-Augmented Generation (RAG) Pipeline for large-scale internal document processing features an AI-powered chatbot based on GPT-5, designed to assist with any type of documentation. It offers quick and accurate responses to queries related to the documentation provided.
Towards the end of the first quarter of the 21st century, Artificial Intelligence (AI) has emerged as a game changer in the business world. Demand is high for an intelligent enterprise documentation retrieval system that addresses the critical challenge of accessing information in large-scale software development environments. Such a system can significantly improve productivity by locating relevant documents, enhancing knowledge sharing, reducing the time spent searching for information, and accelerating development workflows.
The Agentic RAG pipeline uses an architecture of three microservices that combine preprocessing (including anonymization), semantic vectorization, and conversational AI to create a knowledge-sharing platform. It is a cloud application that leverages Large Language Models to answer questions about the documentation, and it can be integrated as a Slack bot to improve accessibility for users. A continuous workflow updates the database regularly.
A working prototype has been developed using Flask APIs and deployed to Cloud Foundry. It integrates a spaCy-based anonymizer for anonymized data processing, a PostgreSQL vector database for similarity search, and an Azure OpenAI-powered conversational interface with LangGraph orchestration for context-aware responses. The method’s versatility across diverse document types and large language models, together with its support for rapid database changes, holds early promise for robust performance in future applications.
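
The flow through the three stages (anonymize, vectorize, retrieve) can be illustrated with a toy sketch. This is not the project's implementation: the prototype uses a spaCy-based anonymizer, a PostgreSQL vector database, and an LLM, whereas the sketch below swaps in an e-mail regex, bag-of-words counts, and in-memory cosine similarity so that it runs with the standard library alone. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a regex for e-mail addresses stands in for the spaCy-based
# anonymizer used by the prototype.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize(text: str) -> str:
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL.sub("<EMAIL>", text)


def vectorize(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k stored texts most similar to the query."""
    q = vectorize(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical sample documents.
docs = [
    "Contact jane.doe@example.com to request deployment access.",
    "The vectorizer writes embeddings to the database in batches.",
]

# Anonymize first, then vectorize, so no personal data reaches the store.
store = [(clean, vectorize(clean)) for clean in map(anonymize, docs)]
top = retrieve("how are embeddings written to the database", store)
```

Anonymizing before vectorizing means that only placeholder tokens are ever indexed, which is the reason the preprocessing stage comes first in the sketch.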

If you have questions or problems of any kind, do not hesitate to contact the developer directly:
requirements.txt)
Get the repository
git clone https://github.com/kadirtastepe/Multimodal-RAG-Pipeline.git
Before you start working on this project, ensure that all the required Python libraries and the environment are properly set up. You must set up the software environment each time you work on this project.
cd Multimodal-RAG-Pipeline.bash setup.source admin_setup.go.
| Command | Meaning |
|---|---|
| `make all IMAGES=<app>` | Build the selected image (`anonymizer`, `vectorizer`, or `api`). |
| `make clean` | Delete all Docker images. |
| `make fetch` | Fetch the data to the anonymizer. |
| `make anonymize_md` | Anonymize all fetched md files. |
| `make anonymize_json` | Anonymize the fetched JSON data. |
| `make preprocess_md` | Preprocess all fetched md files. |
| `make preprocess_json` | Preprocess the fetched JSON data. |
| `make tar` | Create a tar file containing the preprocessed data. |
| `make vectorize` | Vectorize the preprocessed data and send it to PostgreSQL. |
| `make update` | Process outdated entries after fetching the JSON data. |
| `make chain` | Execute fetch and anonymize together. |
| `make help` | Show all configuration options. |
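
As an illustration of how the `make` targets listed above might be chained for one file type, here is a small Python sketch. The ordering (fetch, anonymize, preprocess, tar, vectorize) is an assumption inferred from the target descriptions, not a documented workflow, and the helper names `pipeline_commands` and `run_pipeline` are hypothetical. By default it only prints the commands, so nothing is executed unless `dry_run=False` is passed from inside the repository.

```python
import shlex
import subprocess


def pipeline_commands(file_type: str = "md") -> list[list[str]]:
    """Return the assumed order of make invocations for one file type."""
    if file_type not in ("md", "json"):
        raise ValueError("file_type must be 'md' or 'json'")
    return [
        ["make", "fetch"],                    # fetch the data to the anonymizer
        ["make", f"anonymize_{file_type}"],   # anonymize fetched files
        ["make", f"preprocess_{file_type}"],  # preprocess fetched files
        ["make", "tar"],                      # package the preprocessed data
        ["make", "vectorize"],                # vectorize and send to the database
    ]


def run_pipeline(file_type: str = "md", dry_run: bool = True) -> list[list[str]]:
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    commands = pipeline_commands(file_type)
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


# Dry run: shows the command sequence without executing anything.
run_pipeline("md")
```

Using `check=True` makes a failing step abort the sequence, so later stages never run on incomplete data.
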

| Command | Options | Meaning |
|---|---|---|
| `set_mode` | `anonymizer/vectorizer/api` | Change the build option. |
| `vectorize` | `md/json number_of_batches` | Change the file type and the number of batches for vectorization. |
| `set_slack` | `on/off` | Toggle Slack communication on or off without redeploying. |
| `s3 ls` | `<none>` | Check the files in the blobstore. |
| `s3 delete` | `file_name` | Delete the given file from the blobstore. |
| `get_logs` | `<none>` | Retrieve real-time logs from targeted apps. |