Generate an analysis/summary of an IT support discussion

Hi guys, I really need your expertise on a few topics if you have a minute. So: the goal is to create an AI agent chatbot for helpdesk IT support. I have the conversations between employees and the tech guys stored in a separate text file for each ticket.

What I expect is simply what happens when I upload a TXT file in a ChatGPT chat: it automatically detects that it's an IT support ticket, makes a basic summary of the ticket, describes the initial issue, gives the resolution it detected, and so on…

I tried using Pinecone as the DB, SentenceTransformer with multilingual-e5-large, the GPT-4 model, NLP models and so on, but in the end I didn't succeed :frowning:

The goal is to be able to ask my chatbot a question, and if the question is related to a topic or issue already raised in the past in the imported tickets, the chatbot will provide the corresponding solution. What would be the best / most efficient method to do it? I never figured out how ChatGPT was able to provide exactly the output I expected haha, I just want that, no more :slight_smile:

If you have a minute to provide a few pointers, it would be awesome, thanks!

Have an AI produce an output similar to the knowledge base articles and metadata fields, trying to replicate what would already be there.

Run embeddings against that. Return top ranks.

Ask your question.
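
As a minimal sketch of that loop (the model choice and all names here are illustrative, not a fixed recommendation):

# Sketch: embed AI-generated KB-style articles, then rank them against a question.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

# KB-style summaries your extraction step would produce (invented examples).
# e5 models want a "passage: " prefix on documents and "query: " on questions.
articles = [
    "passage: Technician comments truncated between mobile app and server.",
    "passage: VPN drops after scheduled password rotation.",
]
article_embeddings = model.encode(articles, normalize_embeddings=True)

query_embedding = model.encode("query: why are technician comments cut off?",
                               normalize_embeddings=True)

# Cosine similarity, then print the top-ranked articles.
scores = util.cos_sim(query_embedding, article_embeddings)[0]
for i in scores.argsort(descending=True)[:3]:
    print(f"{float(scores[i]):.3f}  {articles[int(i)]}")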

You might be able to get similar suggested answers, but if you have a large database, you will not be able to feed it entirely into an AI model to produce a true negative, not without an extensive search.


“Has this question never been answered before in the forum?” is a very hard question to produce a reliable “yes” answer to.


Hey, thanks for your help man. Well, let me give a concrete example. I upload a file with the GPT-4 model (the txt file just contains raw sentences with usernames, random values, as if you had hit Ctrl+A and copy/pasted the whole web page of a ticket discussion… a real mess), and here is what ChatGPT gives as output:


The second document outlines another issue with Lokia-Mobile, specifically regarding comment acceptance not being fully transmitted between systems.

Issue Summary
Problem: The comment acceptance field in Lokia-Mobile allows more characters than are transmitted to Lokia and further to the GMAO system.
Character Limits:
Lokia Mobile: Allows approximately 300 characters.
Lokia (server-side): Only 160 characters are retained.

Impact: The full comment entered by a technician is not completely sent to the SYNERGEE system.
People Involved
Reported by: kurt (Testeur Logiciel, testcompany)

Action Items Identified
Modification Needed in Lokia:
Adjust column size and text field size in Lokia to match the maximum characters allowed in Lokia Mobile.
John was assigned this task.
Validation Required:
mark to confirm the exact character limit in Lokia Mobile.
Suggested Next Steps
Technical Fix: Ensure that Lokia's database and API allow at least 300 characters in the relevant field.
Testing: Verify that:
The full comment is stored correctly in Lokia.
The full comment transmits to SYNERGEE without truncation.
Deployment: Once fixed, validate in PREPROD before pushing to production.

Concretely, how can I generate this output from a txt file that is a real mess inside? Then what kind of database should I use to store this information? I started testing with Pinecone, but is it a good choice?
Thanks!

It sounds like you need to develop a structured schema for your data. That's not just someone writing random stuff; there are fields being filled in and reported.

Extracting from the problem database directly into JSON, or having an AI transform tickets into an output schema, will unify the data and produce better semantic search results.

{
  "name": "issue_report",
  "schema": {
    "type": "object",
    "properties": {
      "issue_summary": {
        "type": "object",
        "properties": {
          "problem": {
            "type": "string",
            "description": "Description of the problem related to comment acceptance."
          },
          "character_limits": {
            "type": "object",
            "properties": {
              "lokia_mobile": {
                "type": "string",
                "description": "Character limit allowed in Lokia Mobile."
              },
              "lokia_server_side": {
                "type": "string",
                "description": "Character limit retained in Lokia server-side."
              }
            },
            "required": [
              "lokia_mobile",
              "lokia_server_side"
            ],
            "additionalProperties": false
          },
          "impact": {
            "type": "string",
            "description": "Impact of the issue on the transmission to the SYNERGEE system."
          }
        },
        "required": [
          "problem",
          "character_limits",
          "impact"
        ],
        "additionalProperties": false
      },
      "people_involved": {
        "type": "object",
        "properties": {
          "reported_by": {
            "type": "string",
            "description": "Name of the person who reported the issue."
          }
        },
        "required": [
          "reported_by"
        ],
        "additionalProperties": false
      },
      "action_items_identified": {
        "type": "object",
        "properties": {
          "modification_needed_in_lokia": {
            "type": "string",
            "description": "Description of the modification needed in Lokia."
          },
          "assigned_person": {
            "type": "string",
            "description": "Name of the person assigned to the task."
          },
          "validation_required": {
            "type": "string",
            "description": "Description of what needs to be validated."
          }
        },
        "required": [
          "modification_needed_in_lokia",
          "assigned_person",
          "validation_required"
        ],
        "additionalProperties": false
      },
      "suggested_next_steps": {
        "type": "object",
        "properties": {
          "technical_fix": {
            "type": "string",
            "description": "Description of the technical fix needed."
          },
          "testing": {
            "type": "array",
            "description": "List of testing requirements.",
            "items": {
              "type": "string"
            }
          },
          "deployment": {
            "type": "string",
            "description": "Description of the deployment process."
          }
        },
        "required": [
          "technical_fix",
          "testing",
          "deployment"
        ],
        "additionalProperties": false
      }
    },
    "required": [
      "issue_summary",
      "people_involved",
      "action_items_identified",
      "suggested_next_steps"
    ],
    "additionalProperties": false
  },
  "strict": true
}
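
To actually apply it, you could pass that schema to the model as a structured output, so every messy ticket is forced into the same shape. A sketch assuming the openai>=1.0 Python client, with the schema above saved as issue_report_schema.json and a hypothetical ticket file:

# Sketch: extract a uniform issue report from a raw ticket via structured outputs.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("issue_report_schema.json", encoding="utf-8") as f:
    issue_report_schema = json.load(f)  # the {"name", "schema", "strict"} object above

with open("ticket_12345.txt", encoding="utf-8") as f:  # hypothetical ticket file
    raw_ticket = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # assumes a model that supports structured outputs
    messages=[
        {"role": "system", "content": "Extract the issue report from this raw IT support ticket."},
        {"role": "user", "content": raw_ticket},
    ],
    response_format={"type": "json_schema", "json_schema": issue_report_schema},
)

report = json.loads(response.choices[0].message.content)
print(report["issue_summary"]["problem"])

The resulting JSON documents are what you embed and store, so every record has the same fields no matter how messy the source ticket was.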

The problem still remains if you want to implement “create a new problem report ONLY if nothing like this exists at all”.

You’d be able to implement “create a new problem report if the search finds nothing and would continue to find nothing for similar inputs.”
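
In code, that usually comes down to comparing the best match's score against a cutoff. A rough sketch against the Pinecone index used later in this thread (the 0.85 threshold is an invented starting point you would tune on your own tickets):

# Sketch: treat a ticket as new only when the best existing match scores below a cutoff.
SIMILARITY_THRESHOLD = 0.85  # invented value; calibrate on labelled ticket pairs

def is_new_issue(index, query_embedding, threshold=SIMILARITY_THRESHOLD):
    """Return True when no stored ticket is similar enough to the query embedding."""
    results = index.query(vector=query_embedding, top_k=1, include_metadata=False)
    matches = results["matches"]
    return not matches or matches[0]["score"] < threshold

The caveat above still applies: a score below the cutoff only means the search found nothing, not that nothing exists.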

Yeah, that's what I was planning at first. To recap:

:white_check_mark: Automatic Categorization → Uses keywords to classify tickets
:white_check_mark: Entity Extraction → Extracts important names & terms
:white_check_mark: Summarization → AI generates short descriptions
:white_check_mark: Pinecone Integration → Stores embeddings (1024D) automatically
:white_check_mark: AI Chatbot (GPT-4) → Finds & suggests past solutions
:white_check_mark: FastAPI Endpoint → Deploys as a REST API
:white_check_mark: Versioning → Tracks changes & upgrades

but I haven't succeeded so far :frowning: So based on your thoughts, you would recommend first transforming my text files into corresponding .json files? Do you see what is missing in my script below to process the text file content the same way ChatGPT does it? (I mean classify/categorize, extract relevant sentences, apply “situation”, “context” and similar labels.)

# ✅ IMPORTS & CONFIG (values below are placeholders; set your own)
import json
import logging
import os
import re
import unicodedata
from datetime import datetime

import chardet
import openai  # assumes the legacy openai<1.0 ChatCompletion API used below
import pandas as pd
import pinecone
import spacy
from sentence_transformers import SentenceTransformer

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
PINECONE_ENV = "us-east-1"    # placeholder region
INDEX_NAME = "it-tickets"     # placeholder index name
TICKET_FOLDER = "tickets"     # folder containing the .txt ticket exports
CSV_OUTPUT = "tickets.csv"
JSON_OUTPUT = "tickets.json"
VERSION = "1.0"

logging.basicConfig(level=logging.INFO)

# ✅ INITIALIZE PINECONE
pc = pinecone.Pinecone(api_key=PINECONE_API_KEY)

# ✅ CHECK IF INDEX EXISTS, CREATE IF MISSING
if INDEX_NAME not in pc.list_indexes().names():
    from pinecone import ServerlessSpec
    pc.create_index(
        name=INDEX_NAME,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region=PINECONE_ENV)
    )
    logging.info(f"✅ Created new Pinecone index: {INDEX_NAME}")

index = pc.Index(INDEX_NAME)

# ✅ INITIALIZE NLP MODELS
openai.api_key = OPENAI_API_KEY
nlp = spacy.load("fr_core_news_md")  # French NER model (loaded but never used below)
embedding_model = SentenceTransformer("intfloat/multilingual-e5-large")

# ✅ FUNCTION: NORMALIZE TICKET ID (ASCII-SAFE)
def normalize_id(ticket_id):
    ticket_id = unicodedata.normalize('NFKD', ticket_id).encode('ascii', 'ignore').decode('ascii')
    return re.sub(r'\W+', '_', ticket_id)

# ✅ FUNCTION: DETECT FILE ENCODING
def detect_encoding(file_path):
    with open(file_path, "rb") as f:
        return chardet.detect(f.read())["encoding"]

# ✅ FUNCTION: EXTRACT RELEVANT TEXT FOR SUMMARIZATION
def extract_relevant_text(text):
    """
    Extracts only relevant content from the ticket, removing names, emails, and metadata.
    """
    lines = text.split("\n")
    # crude heuristic: drop lines containing ":" or "@" (names, emails, metadata) and pure-digit lines
    relevant_lines = [line for line in lines if ":" not in line and "@" not in line and not line.strip().isdigit()]
    return "\n".join(relevant_lines[:10])  # Keep first 10 meaningful lines

# ✅ FUNCTION: SUMMARIZE TICKET CONTENT USING GPT-4
def summarize_with_gpt4(text):
    """
    Summarizes the given French text using GPT-4.
    """
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Tu es un assistant qui génère des résumés précis et concis en français pour des tickets informatiques."},
            {"role": "user", "content": f"Résumé ce ticket de support en moins de 100 mots :\n{text}"},
        ]
    )
    return response["choices"][0]["message"]["content"]

# ✅ FUNCTION: LOAD TICKETS FROM FOLDER
def load_tickets_from_folder(folder_path):
    tickets = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            encoding = detect_encoding(file_path)
            with open(file_path, 'r', encoding=encoding) as file:
                content = file.read()

            # Extract core ticket information (ignore metadata)
            ticket_content = extract_relevant_text(content)

            # ✅ Generate summary using GPT-4
            summary = summarize_with_gpt4(ticket_content)

            ticket_data = {
                "ticket_id": normalize_id(filename.replace(".txt", "")),
                "filename": filename,
                "content": content,
                "summary": summary,
                "timestamp": datetime.now().isoformat(),
                "version": VERSION
            }
            tickets.append(ticket_data)
    
    return tickets

# ✅ FUNCTION: SAVE TO CSV & JSON
def save_tickets_to_files(tickets):
    df = pd.DataFrame(tickets)
    df.to_csv(CSV_OUTPUT, index=False)
    with open(JSON_OUTPUT, "w", encoding="utf-8") as json_file:
        json.dump(tickets, json_file, ensure_ascii=False, indent=4)

# ✅ FUNCTION: DELETE OLD DATA FROM PINECONE
def clear_pinecone():
    index.delete(delete_all=True)
    logging.info("✅ Pinecone index cleared. Ready to re-upload correct embeddings.")

# ✅ FUNCTION: UPLOAD TO PINECONE (BATCH UPLOAD FOR SPEED)
def upload_to_pinecone(tickets):
    batch_size = 10  # Upload in batches of 10 for efficiency
    for i in range(0, len(tickets), batch_size):
        batch = tickets[i:i+batch_size]
        # intfloat/multilingual-e5-large expects a "passage: " prefix on stored texts
        index.upsert([
            (ticket["ticket_id"], embedding_model.encode("passage: " + ticket["summary"]).tolist(),
             {"summary": ticket["summary"], "version": ticket["version"]})
            for ticket in batch
        ])
    logging.info(f"✅ Uploaded {len(tickets)} tickets to Pinecone.")

# ✅ FUNCTION: SEARCH SIMILAR TICKETS
def search_similar_tickets(query_text, top_k=5):
    # e5 models expect the "query: " prefix at search time (matching "passage: " above)
    query_embedding = embedding_model.encode("query: " + query_text).tolist()
    results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return results["matches"]

# ✅ FUNCTION: AI RESPONSE WITH GPT-4
def generate_ai_response(user_query):
    similar_tickets = search_similar_tickets(user_query)
    past_solutions = "\n".join([ticket['metadata']['summary'] for ticket in similar_tickets])

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Tu es un assistant IT francophone qui répond aux problèmes informatiques en utilisant des tickets de support passés. Tes réponses doivent toujours être en français."},
            {"role": "user", "content": user_query},
            {"role": "assistant", "content": f"Basé sur des tickets similaires, voici des solutions passées:\n{past_solutions}"},
        ]
    )

    return response["choices"][0]["message"]["content"]

# ✅ RUN PIPELINE
if __name__ == "__main__":
    logging.info("🔹 Clearing old data from Pinecone...")
    clear_pinecone()
    
    logging.info("🔹 Loading tickets...")
    tickets = load_tickets_from_folder(TICKET_FOLDER)
    
    save_tickets_to_files(tickets)
    upload_to_pinecone(tickets)
    
    logging.info(f"✅ Processed {len(tickets)} tickets and uploaded embeddings to Pinecone."

)
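
Once the index is populated, querying past solutions is a single call, e.g. (the question is just an illustrative example):

answer = generate_ai_response("Les commentaires des techniciens sont tronqués entre Lokia Mobile et Lokia, une solution ?")
print(answer)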