How can I write a good article based on scraped content?

Hi, I scraped content from many websites. How can I use this data to write a good article? Is there a framework for doing this?

If I just feed all the data to OpenAI with a single prompt, it can run out of tokens. I tried summarizing the content of each website before feeding it to OpenAI, but that loses some useful data. Is there a better approach?


Use the Embeddings API on each piece of scraped content and store the resulting vectors somewhere. Then, using RAG and a specialized system prompt, the model can generate whatever article you want based on the scraped data.
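A minimal sketch of the first step, using only the standard library to call the Embeddings REST endpoint. The model name and the split into a payload-builder plus a sender are my own illustrative choices, and `embed` needs a real `OPENAI_API_KEY` to run:

```python
import json
import os
import urllib.request

def embedding_payload(text: str, model: str = "text-embedding-3-small") -> bytes:
    """Build the JSON body for a /v1/embeddings request."""
    return json.dumps({"model": model, "input": text}).encode("utf-8")

def embed(text: str) -> list[float]:
    """Call the OpenAI Embeddings API and return the vector (needs OPENAI_API_KEY)."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/embeddings",
        data=embedding_payload(text),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]

# Then store the vector per source, e.g. store["mango.txt"] = embed(mango_text)
```

The official `openai` Python package wraps the same endpoint if you prefer a client library over raw HTTP.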

To test this idea, I copied the Wikipedia articles about mango and coconut, got their embeddings, and stored them. Note that I also have other files in the store. Then I asked the AI:

user: write a brief paragraph about the mango fruit and coconut and their significance in asian culture?

assistant: The mango fruit and coconut hold significant cultural importance in Asian cultures. The mango, originating from Myanmar, Bangladesh, and India, has been cultivated in South and Southeast Asia for centuries. It is revered for its delicious taste and is used in various culinary dishes, desserts, and beverages. The coconut, on the other hand, has religious significance in South Asian cultures, particularly in Hindu rituals. It is also used in wedding and worship ceremonies. Additionally, coconuts played a crucial role in the long sea voyages of Austronesians, providing them with food, water, and building materials for their boats. These fruits have become integral parts of Asian culture, both in terms of their practical uses and symbolic meanings. (Sources: “mango.txt” and “coconut.txt”)
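The retrieval step behind an answer like that can be sketched in pure Python: rank the stored vectors by cosine similarity to the question's embedding and hand the best-matching sources to the model. The tiny 3-d vectors below are made-up stand-ins for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_sources(query_vec: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k source names whose vectors best match the query."""
    ranked = sorted(store, key=lambda name: cosine(query_vec, store[name]), reverse=True)
    return ranked[:k]

store = {
    "mango.txt":   [0.9, 0.1, 0.0],
    "coconut.txt": [0.8, 0.0, 0.2],
    "notes.txt":   [0.0, 1.0, 0.1],
}
query = [1.0, 0.1, 0.1]  # pretend embedding of the user's question
print(top_sources(query, store))  # -> ['mango.txt', 'coconut.txt']
```

The texts behind the winning sources then go into the system prompt as context, which is also how the model can cite “mango.txt” and “coconut.txt” in its answer.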


It depends on whether you know the articles you wish to reference by name, I suppose, and on the quality of your database.

Suppose you have lots of OpenAI papers and web scrapings, and then you put some functions in to search and retrieve.

You then could ask the AI a simple question like this:

“From your knowledgebase, get and use the OpenAI blog about WebGPT and the OpenAI blog about function-calling models to synthesize a new article: ‘How to use functions to browse your own database of knowledge’.”
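One way to wire up that “get from your knowledgebase” step is to expose a search function to the model via tool/function calling. The schema and the naive keyword search below are illustrative assumptions; in practice the handler would run your vector search:

```python
import json

# Tool definition handed to the chat API so the model can request a search.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_knowledgebase",
        "description": "Search the scraped-article store and return matching documents.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "What to look for."},
            },
            "required": ["query"],
        },
    },
}

def handle_tool_call(arguments_json: str, store: dict[str, str]) -> list[str]:
    """Naive keyword search standing in for a real vector search."""
    query = json.loads(arguments_json)["query"].lower()
    return [name for name, text in store.items() if query in text.lower()]

docs = {
    "webgpt.txt": "WebGPT browses the web to answer questions...",
    "functions.txt": "Function calling lets models request your code...",
}
print(handle_tool_call('{"query": "function calling"}', docs))  # -> ['functions.txt']
```

When the model responds with a `search_knowledgebase` tool call, you run the handler, append the retrieved documents as a tool message, and let the model write the synthesized article from them.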


Good suggestions and hints already in the thread!
If there is a database with articles, then what you really have is an unsorted pile of data.
And if the goal is “to write articles”, then you have unclear expectations regarding the output.
Looking at the task from this perspective, you now have two distinct entry points to arrive at a good solution:

  1. Structure your input data by summarizing, labeling, vectorizing, adding metadata, etc.
  2. Create a few sample outputs that match your desired level of quality, and define exactly what is needed to arrive at that quality.

Then you can take the more hands-on approaches like the ones suggested above and work towards a process that enables the LLM to produce the desired outputs.
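Entry point 1 can be as simple as giving every scraped page one uniform, structured record before any generation step. The field names here are my own illustrative choices, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    source: str                                        # where it was scraped from
    text: str                                          # raw scraped content
    summary: str = ""                                  # purpose-driven summary, filled in later
    labels: list[str] = field(default_factory=list)    # topical tags
    vector: list[float] = field(default_factory=list)  # embedding of the text

doc = Document(source="mango.txt", text="The mango is a stone fruit...")
doc.labels.append("fruit")
print(doc.source, doc.labels)
```

Once every page carries the same fields, the pile of data becomes filterable and searchable, and the sample outputs from entry point 2 tell you which fields the generation prompt actually needs.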



I’ll add my own tip to this:

Tip: you can improve the summarization a lot by telling GPT what the summary is for, e.g. “summarize the following for a blog post about x” :laughing:
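The tip above as code: a small sketch that bakes the summary's purpose into the messages before sending them to the chat model. The prompt wording and message shape are assumptions, not a fixed recipe:

```python
def summarize_prompt(content: str, purpose: str) -> list[dict]:
    """Build chat messages asking for a purpose-driven summary."""
    return [
        {
            "role": "system",
            "content": f"Summarize the user's text for {purpose}. "
                       "Keep every detail relevant to that purpose.",
        },
        {"role": "user", "content": content},
    ]

messages = summarize_prompt(
    "Mangoes are native to South Asia...",
    "a blog post about tropical fruit",
)
print(messages[0]["content"])
```

Because the purpose is in the system message, the model can decide which details are “useful data” for the final article instead of dropping them in a generic summary.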