Struggling to Convert Book Metadata from Text to Complete JSON Array Using GPT-3.5 - Seeking Advice!

Hello!

I’m currently working on a project where I need to organize a large amount of book metadata text (around 8000 tokens) into a structured JSON format.

Each entry in the JSON object represents either a book or an article, and I’m aiming to create a unified JSON object that contains multiple arrays, each array representing a different book or article.

However, the model doesn't seem to understand my prompt, and no matter how I tweak it, it only includes one or a few books but not all of them.

Here is my prompt:

Organize the following text into a single JSON structure that contains multiple arrays, where each array represents a book or article. The JSON object should include fields like 'course_code', 'course_title', 'title', 'author', 'publisher', 'ISBN', 'year', 'edition', and 'article_or_book' for each entry. Here's the text from multiple reading lists that were PDFs: "{clean_text}". Present all course books as arrays within a single JSON object in the following format:
{json.dumps([
    {
        "course_code": "",
        "course-title": "",
        "title": "",
        "author": [],
        "publisher": "",
        "ISBN": "",
        "year": "",
        "edition": "",
        "article_or_book": ""
    },
    {
       "course_code": "",
        "course-title": "",
        "title": "",
        "author": [],
        "publisher": "",
        "ISBN": "",
        "year": "",
        "edition": "",
        "article_or_book": "" 
    }
], indent=4)}s
Follow these steps: 1. Create a JSON array entry for one book or article, 2. Add to the array with the next book or article, continuing until all items in the text you received are included in your JSON structure. Make sure to use this single object format, and for any missing values within an object, use 'n/a'. Ensure that the structure maintains consistent element counts across arrays"""

And here are the messages:

messages=[{"role": "system", "content": "This is an exercise in extracting and structuring information from text to JSON-format, you shall provide JSON-format."},
                  {"role": "user", "content": json_prompt}]

Running the gpt-3.5-turbo-0125 model.

Has anyone been in a similar situation or found a solution to this?

Converting book metadata from text to a complete JSON array involves turning information about books, like their titles, authors, publication years, etc., from a plain text format into a structured format that computers can easily understand. To do this, you need to first figure out what details you want to include for each book. Then, you break down the text containing the book information into smaller parts, like the title and author name. After that, you organize these parts into a special format called JSON, which stands for JavaScript Object Notation. JSON lets you represent data in a way that’s easy for computers to work with. Once you’ve organized all the book details into JSON format, you put them together into a list, which forms the JSON array. This array contains all the book information, neatly organized and ready to be used by computer programs.

  1. I would give the AI a JSON Schema. It can have much more information provided about the acceptable output to generate.

  2. All information about how the AI should permanently operate should be in the system message. You can reinforce this with further instructions in the user role, especially to indicate that the default operation is to be performed on specific data.

  3. I would give the AI the actual text from the book also. Your messages don’t include the book extract. :smiley:


I used GPT-4 (non-turbo), with no examples, to transform your array into such a schema:

Let’s create a system message JSON schema for your data. Here’s a basic idea that should meet your needs:

system

// AI task

- Perform entity data extraction on the book info provided.
- Fill all fields of the output JSON directly from book metadata that is found.
- Leave field blank if indeterminate. 
- (more instructions...)

// output format: only JSON that complies with this schema

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/course.json",
    "title": "Course",
    "description": "A representation of a course and its associated book or article",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "course_code": {
                "description": "The unique code for the course",
                "type": "string"
            },
            "course-title": {
                "description": "The title of the course",
                "type": "string"
            },
            "title": {
                "description": "The title of the book or article",
                "type": "string"
            },
            "author": {
                "description": "The list of authors of the book or article",
                "type": "array",
                "items": {
                    "type": "string"
                }
            },
            "publisher": {
                "description": "The publisher of the book or article",
                "type": "string"
            },
            "ISBN": {
                "description": "The ISBN of the book",
                "type": "string"
            },
            "year": {
                "description": "The year of publication of the book or article",
                "type": "string"
            },
            "edition": {
                "description": "The edition of the book",
                "type": "string"
            },
            "article_or_book": {
                "description": "Indicates whether the resource is a book or an article",
                "type": "string",
                "enum": ["book", "article"]
            }
        },
        "required": ["course_code", "course-title", "title", "author", "publisher", "ISBN", "year", "edition", "article_or_book"]
    }
}

This schema defines an array of objects, each with properties for course_code, course-title, title, author, publisher, ISBN, year, edition, and article_or_book. Each property is defined as a string, except for author which is an array of strings, and article_or_book which is an enum that must be either “book” or “article”. All properties are required. You can review and improve the descriptions.

Please note that the $id is a placeholder, and in a real schema, would be replaced with the actual URI where your schema will be hosted.
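
Once the model replies, it can help to check the reply against the schema before trusting it. Below is a minimal sketch using only the Python standard library. It checks just the required keys and a couple of basic types rather than performing full JSON Schema validation, and the name check_entries is made up for illustration. The field names follow the "required" list in the schema above, including its "course-title" spelling.

```python
import json

# Field names copied from the "required" list in the schema above.
REQUIRED = ["course_code", "course-title", "title", "author", "publisher",
            "ISBN", "year", "edition", "article_or_book"]


def check_entries(raw_reply, required=REQUIRED):
    """Parse a model reply and return (entries, problems).

    Hand-rolled check for illustration, not full JSON Schema validation.
    """
    problems = []
    entries = json.loads(raw_reply)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(entries, list):
        return [], ["top-level value is not a JSON array"]
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"item {i}: not an object")
            continue
        missing = [key for key in required if key not in entry]
        if missing:
            problems.append(f"item {i}: missing {missing}")
        if not isinstance(entry.get("author", []), list):
            problems.append(f"item {i}: 'author' is not an array")
        kind = entry.get("article_or_book")
        if kind is not None and kind not in ("book", "article"):
            problems.append(f"item {i}: article_or_book={kind!r}")
    return entries, problems
```

Counting len(entries) against the number of books you expect is also a quick way to spot dropped items.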

Then you will need to provide the book info:

User message

Carefully examine the metadata extracted from multiple books below, finding the delineation between individual book entities. Particularly note the course being taught that the book corresponds to.

Produce a JSON array with an item for each book metadata, with no other output, with no markdown enclosing your output.

(documents inserted)
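
Put together, the system and user messages above might be assembled as in the sketch below. The names build_messages, SYSTEM_TEMPLATE and USER_TEMPLATE are hypothetical, the schema is abbreviated to two fields to keep the example short, and the returned list would be passed as messages to whichever chat-completion call you already use.

```python
import json

# Abbreviated for the example; in practice use the full schema shown above.
SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "author"],
    },
}

SYSTEM_TEMPLATE = """// AI task

- Perform entity data extraction on the book info provided.
- Fill all fields of the output JSON directly from book metadata that is found.
- Leave field blank if indeterminate.

// output format: only JSON that complies with this schema

{schema}"""

USER_TEMPLATE = """Carefully examine the metadata extracted from multiple books below, finding the delineation between individual book entities. Particularly note the course being taught that the book corresponds to.

Produce a JSON array with an item for each book metadata, with no other output, with no markdown enclosing your output.

{documents}"""


def build_messages(documents, schema=SCHEMA):
    """Return a system message carrying the schema and a user message carrying the text."""
    system = SYSTEM_TEMPLATE.format(schema=json.dumps(schema, indent=2))
    user = USER_TEMPLATE.format(documents=documents.strip())
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```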


Hey, thanks for helping me!

I took your advice, but it’s simply not working…

Here is my system message

"role": "system", "content": """ 

System Prompt for Extracting Book Metadata into JSON
Task Explanation:
You are tasked with analyzing a continuous text that contains multiple entries of book metadata. This text includes several pieces of information for each book related to university courses, such as course codes, book titles, authors, publication years, and ISBN numbers. Note that some entries might also list editions and publishing houses. Your goal is to parse these details accurately from a raw textual format.

JSON Schema Description:
The output should conform to a specific JSON schema. Each book entry must be converted into a JSON object that includes the following properties:
- course_code: the unique identifier for the course associated with the book.
- course_title: the full title of the course.
- title: the title of the book.
- author: an array of authors associated with the book.
- publisher: the name of the publishing house.
- ISBN: the international standard book number.
- year: the year of publication.
- edition: the edition of the book, if applicable.
- article_or_book: a string indicating whether the item is a 'book' or an 'article'.

Detail Extraction:
Carefully extract and accurately parse the details such as the names of multiple authors, distinguishing between editions and publication years, and categorizing the resource as either a book or an article. Ensure that each attribute is placed correctly according to the specified JSON schema.

Output Requirements:
The final output must be formatted as a JSON array, where each book entry is a separate object within the array. Emphasize the importance of strict adherence to the JSON formatting and accuracy in the representation of extracted data.

Handling Ambiguities:
In cases where certain information is missing or ambiguous, leave the respective field blank in the JSON object. If there are educated guesses to be made based on the context of the information provided, do so cautiously and mention these assumptions explicitly in your processing.

Iterative Improvement:
You are encouraged to refine and optimize your data parsing logic based on the accuracy of initial outputs. If the initial data extraction contains errors or inaccuracies, adjust the parsing mechanisms to improve data quality in subsequent attempts.
       
        JSON Schema:{
            "$schema": "https://json-schema.org/draft/2020-12/schema",
            "$id": "https://example.com/course.json",
            "title": "Course",
            "description": "A representation of a course and its associated book or article",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "course_code": {
                        "description": "The unique code for the course",
                        "type": "string"
                    },
                    "course_title": {
                        "description": "The title of the course",
                        "type": "string"
                    },
                    "title": {
                        "description": "The title of the book or article",
                        "type": "string"
                    },
                    "author": {
                        "description": "The list of authors of the book or article",
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    },
                    "publisher": {
                        "description": "The publisher of the book or article",
                        "type": "string"
                    },
                    "ISBN": {
                        "description": "The ISBN of the book",
                        "type": "string"
                    },
                    "year": {
                        "description": "The year of publication of the book or article",
                        "type": "string"
                    },
                    "edition": {
                        "description": "The edition of the book",
                        "type": "string"
                    },
                    "article_or_book": {
                        "description": "Indicates whether the resource is a book or an article",
                        "type": "string",
                        "enum": ["book", "article"]
                    }
                },
                "required": ["course_code", "course_title", "title", "author", "publisher", "ISBN", "year", "edition", "article_or_book"]
            }
        }
        """
    },

Here is my user message


        {"role": "user", "content": """
            Introduction:
            You are provided with a text that includes a series of book entries extracted from university course syllabi. This text is rich in metadata detailing the books assigned or recommended for various courses. Each entry is distinct and corresponds to specific courses offered at the university.

            Instruction on Details:
            As you review the text, please note that each book entry contains comprehensive metadata such as the course code it is associated with, the book title, author(s), publication details including year and publisher, ISBN, and possibly the edition. Each entry is meant to be parsed and then structured according to these details.

            Submission Format:
            The text is presented in a plain, unformatted manner. It does not include any additional markup or styling, which ensures that the focus remains strictly on the raw data provided. This setup is crucial for accurately parsing the text into the required JSON format.

            Feedback Request:
            After processing, if any part of the text or its details appear unclear or incomplete, feedback may be requested to ensure the accuracy of the information before final processing. If you anticipate any ambiguities or potential issues with the entries, please indicate these upfront to facilitate a smoother extraction process. Here is the text of all the books: """ + clean_text},

Note that the variable clean_text contains the long text with all the titles. Here is what the structure of clean_text looks like, although the real text is longer:

Here comes the next reading list: Lund University

Reading list for IBUG41, International Business: Business
Ethics and Sustainability, effective from the spring semester
2022
The reading list is established by the Director of Studies at the Department of Business Administration on 2023-09-15 to be effective from 2023-09-15

Scientific articles on Business Ethics and Sustainability. About 250 pages

Here comes the next reading list: Lund University

Reading list for ENTA70, Entrepreneurship and
Project Management, effective from the autumn semester 2022
The reading list is established by the Director of Studies at the Department of Business Administration on 2022-05-01 to be effective from 2022-05-01

Landström, H & Löwegren, M (2022): Entrepreneurship - from thought to action.
Student literature

How should I go about it?
The model keeps giving me a response with the correct JSON array format and an understanding of the task, but it only includes two titles.

Here is what you can try. To convert book metadata from text to a JSON array using GPT-3.5, follow these simple steps. First, make sure all the information about the books, like the title, author, ISBN, and other details, is written consistently and clearly. You can separate each book’s details using a special marker like “—”. Next, you’ll need to create a prompt to tell GPT-3.5 exactly what to do. Your prompt should instruct GPT-3.5 to read the book details and turn them into a structured list called a JSON array, where each book’s details are neatly organized in a format that computers can easily understand.

To execute this, you can use the OpenAI API if you’re coding, by sending your prompt to GPT-3.5 and asking it to process the text. After GPT-3.5 returns the JSON array, check to make sure it has all the book details correctly formatted. You might need to adjust your prompt a bit or fix how the books are described in your text to get the best results. This way, you can turn a list of book details into a neat JSON array that can be used in databases, websites, or any application that handles data.
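
One way to act on the separator idea: the sample clean_text above already repeats the phrase "Here comes the next reading list:" before each list, so the text could be split on that phrase, each reading list sent to the model on its own, and the resulting arrays merged. The sketch below assumes that marker appears throughout the real text. The helpers split_reading_lists and extract_all are hypothetical, and extract_fn stands in for whatever function calls the model and returns its raw JSON reply.

```python
import json

# Marker taken from the sample clean_text; assumed to precede every reading list.
MARKER = "Here comes the next reading list:"


def split_reading_lists(clean_text, marker=MARKER):
    """Return one text chunk per reading list, dropping empty pieces."""
    return [part.strip() for part in clean_text.split(marker) if part.strip()]


def extract_all(clean_text, extract_fn):
    """Call extract_fn once per reading list and merge the JSON arrays.

    extract_fn is a placeholder: it takes one chunk of text and must
    return a JSON array as a string (for example, the model's reply).
    """
    merged = []
    for chunk in split_reading_lists(clean_text):
        merged.extend(json.loads(extract_fn(chunk)))
    return merged
```

Smaller inputs make it easier to see when an entry is dropped, though whether this fully resolves the missing titles would need to be tested on the real data.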

Something to note here: if you indent text inside a triple-quoted (docstring-style) string, the indentation is sent to the model exactly as written, along with any leading and trailing whitespace.

The lines must be flush against the left margin, and you can use """string""".strip() to avoid sending confusing line feeds before and after the text.
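
To illustrate the whitespace point, the sketch below compares an indented triple-quoted string as written with the same string after cleanup. It uses textwrap.dedent from the standard library as an alternative to left-aligning the text by hand. The function and variable names are only for illustration.

```python
import textwrap


def make_prompt():
    # Indented to match the surrounding code, as in the user message above.
    raw = """
        Introduction:
        You are provided with a text.
        """
    # dedent removes the indentation shared by all lines;
    # strip removes the blank lines at the start and end.
    cleaned = textwrap.dedent(raw).strip()
    return raw, cleaned


raw, cleaned = make_prompt()
print(repr(raw))      # the leading newline and the indentation are still there
print(repr(cleaned))  # shared indentation and outer whitespace are gone
```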

JSON Schema can carry a description and examples for each field. I would pick either your natural-language description or the draft 2020-12 schema, not both; sending both divides the AI's attention.
