How would you write this jsonl file?

I’d like your advice. I am not an expert on jsonl. I want to be able to do semantic search and properly write a file. I have this large document and it has some structure. The navigation tree to the document looks like this.

INTRODUCTION
PREAMBLE
TITLE 1
---CHAPTER 1…n
------SECTION 1…n
---------ARTICLE 1…n
------------TEXT IN ARTICLE n

My concern is that certain parts of the document do not follow that structure. On rare occasions I find the following:

A title with no chapters or sections, only articles:
TITLE n
---(empty)
------(empty)
---------ARTICLE n
------------TEXT IN ARTICLE n

A title with no sections:
TITLE n
---CHAPTER n
------(empty)
---------ARTICLE n
------------TEXT IN ARTICLE n

A title with no chapters:
TITLE n
---(empty)
------SECTION n
---------ARTICLE n
------------TEXT IN ARTICLE n
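
One way to cope with the irregular cases above is to track the most recent heading at each level while reading the document, and to clear the lower levels whenever a higher-level heading appears. The sketch below assumes, purely for illustration, that each heading sits on its own line as `TITLE 3`, `CHAPTER 2`, and so on; the `parse` function and its regular expression are hypothetical and would need adapting to the real document.

```python
import re

# Heading levels from outermost to innermost, as in the navigation tree.
LEVELS = ["TITLE", "CHAPTER", "SECTION", "ARTICLE"]
# Assumed heading format: a level name, a space, and a number, alone on a line.
HEADING = re.compile(r"^(TITLE|CHAPTER|SECTION|ARTICLE)\s+(\S+)\s*$")

def parse(lines):
    """Return one record per article; levels that are absent stay None."""
    current = dict.fromkeys(LEVELS)
    body = []
    records = []

    def flush():
        # Emit the article collected so far, if there is one.
        if current["ARTICLE"] is not None and body:
            records.append({**current, "TEXT": " ".join(body)})
        body.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            flush()
            level, number = match.groups()
            current[level] = number
            # A new heading invalidates everything nested below it.
            for lower in LEVELS[LEVELS.index(level) + 1:]:
                current[lower] = None
        else:
            body.append(line)
    flush()
    return records

sample = ["TITLE 1", "CHAPTER 1", "SECTION 1", "ARTICLE 1", "First text.",
          "TITLE 2", "ARTICLE 2", "Second text."]
for record in parse(sample):
    print(record)
```

With this approach an article that sits directly under a title simply gets `None` for CHAPTER and SECTION, so every record has the same keys regardless of which levels are present.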

@dandrade.jose can you describe your task a little more? What are you searching for? Is the navigation tree structure part of the text itself, or just metadata?

Hi @asabet Thanks for your attention.

It is metadata, but I’d like the information in the tree to be part of the text in a reply if it comes to that.

I’ll give you an example. Consider the following text, part of a transcription of the Constitution as it was inscribed by Jacob Shallus.

We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.

Article. I.

Section. 1.

All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.

Possible user queries:

1: What does Article. I. Section. 1. say?
Response: All legislative Powers herein …

2: Where in our Constitution can I find references to the institution of the Congress of the United States?
Response:
Article. I. Section. 1.
Article. I. Section. 2.
Article. I. Section. 4.

Article. II. Section. 1.

jsonl file right now:

{ "DOCUMENT": "name of a legal document",
"TITLE": "title",
"CHAPTER": "chapter",
"SECTION": "section",
"ARTICLE": "85",
"TEXT": "Article 85.- Opinion immunity. The members of both chambers enjoy immunity for the opinions they express in the sessions." }

I was thinking about putting everything that is not TEXT as metadata in an array. Is this ok?
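
A minimal sketch of that split, assuming the upload format wants one JSON object per line with a `text` field and a `metadata` field (the field names used for the endpoint discussed further down the thread). Here the metadata is a key/value object rather than an array, so each value keeps its label; whether the endpoint accepts an object there, rather than a plain string, is an assumption to check against its documentation. The helper names are made up for illustration.

```python
import json

def to_jsonl_line(record):
    """Turn one flat record into a single-line JSON string."""
    # Everything except the article body becomes metadata.
    metadata = {key.lower(): value for key, value in record.items() if key != "TEXT"}
    # json.dumps escapes newlines, so the result always fits on one line.
    return json.dumps({"text": record["TEXT"], "metadata": metadata}, ensure_ascii=False)

def write_jsonl(records, path):
    """Write one JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(to_jsonl_line(record) + "\n")

record = {
    "DOCUMENT": "name of a legal document",
    "TITLE": "title",
    "CHAPTER": "chapter",
    "SECTION": "section",
    "ARTICLE": "85",
    "TEXT": "Article 85.- Opinion immunity. The members of both chambers "
            "enjoy immunity for the opinions they express in the sessions.",
}
line = to_jsonl_line(record)
print(line)
```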

I see. It depends on how often headings are missing, i.e. can you fix them by hand? If the missing metadata is negligible, it might not be worth worrying about now; you could leave it for a later stage, unless you can identify why it’s missing and fix it easily, or unless it would seriously degrade the UX. If a user query references a missing heading, you can try handling that case by prompting the user for more information to refine the query.

Well, I got something working today. I followed the advice of putting the article number in the metadata field and the article content in the text field.

Now I am reading the documentation for the /answers endpoint.

The metadata property is optional and does not alter answers behavior. It is instead arbitrary data that you can choose to return alongside each document in the response by setting return_metadata parameter to true.

I am thinking about putting TITLE, CHAPTER, SECTION and ARTICLE NUMBER in metadata field. If it doesn’t work, then in text field, but preceding article content.
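
The fallback mentioned here, placing the headings in front of the article content, could be sketched as follows. The record keys match the flat example earlier in the thread; the `with_heading_prefix` helper and the exact prefix wording are only illustrative. Levels that are missing or empty are skipped, so irregular parts of the document still produce a readable prefix.

```python
LEVELS = ("TITLE", "CHAPTER", "SECTION", "ARTICLE")

def with_heading_prefix(record):
    """Prepend the available headings to the article text."""
    parts = [f"{level.capitalize()} {record[level]}"
             for level in LEVELS if record.get(level)]
    prefix = ", ".join(parts)
    return f"{prefix}: {record['TEXT']}" if prefix else record["TEXT"]

example = {"TITLE": "4", "CHAPTER": None, "SECTION": "2", "ARTICLE": "85",
           "TEXT": "The members of both chambers enjoy immunity."}
print(with_heading_prefix(example))
# Title 4, Section 2, Article 85: The members of both chambers enjoy immunity.
```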

I’m working on a similar problem. I use the metadata field to assign a unique code to each text passage, so my metadata fields look something like “ni-45-102-pt-4-s3”. In your case, it could be something like “doc-X-title-y-Ch-z-Sec-a-Art-b”, or it could just be a random unique code.

Then, after retrieving the results from the semantic search (with the “return metadata” parameter set to true), you can feed the unique codes into a separate database to retrieve the “real” metadata, i.e. your full document titles, chapters, sections, etc.

I didn’t see another solution with JSON Lines, since only two fields are permitted and the metadata field is meaningless from the search/answers perspective. Since the text field is the key for semantic search and answers, it’s better not to clutter it up with metadata that will also cost you tokens. After you fetch the metadata in the second step, you can give the user the text/answer along with the document, chapter, title, etc.

What do you think?

Leslie
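
The two-step lookup described above might look like the sketch below. A plain dictionary stands in for the separate database, and the shape of the search results (a list of objects with `text` and `metadata` keys) is an assumption made for illustration, not the documented response format. `make_code`, `register`, and `resolve` are hypothetical helper names.

```python
def make_code(doc, title, chapter, section, article):
    """Build a code such as 'doc-X-title-1-Ch-2-Sec-3-Art-85', skipping missing levels."""
    parts = [("doc", doc), ("title", title), ("Ch", chapter),
             ("Sec", section), ("Art", article)]
    return "-".join(f"{label}-{value}" for label, value in parts if value is not None)

# Stand-in for the separate database: code -> full metadata.
lookup = {}

def register(doc, title, chapter, section, article):
    """Store the full metadata under its code and return the code."""
    code = make_code(doc, title, chapter, section, article)
    lookup[code] = {"document": doc, "title": title, "chapter": chapter,
                    "section": section, "article": article}
    return code

def resolve(results):
    """Attach the full metadata to each search result via its code."""
    return [{"text": r["text"], **lookup[r["metadata"]]} for r in results]

code = register("X", "1", None, None, "85")
# Assumed result shape: the search returns the text plus the stored code.
results = [{"text": "The members of both chambers enjoy immunity.", "metadata": code}]
print(code)  # doc-X-title-1-Art-85
print(resolve(results))
```

Only the short code travels with each line of the file; the descriptive titles live in the lookup table and can be changed without re-uploading anything.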

I think it’s a sound solution. I’ll give it a try.

I like it as well, the thing with the metadata is a good idea…

On my end, I found this, it may be useful:

It is pretty old (2013) but still helps in thinking about the best way to handle it.

Let me know what you think.

Hi @dandrade.jose

I’m working on an experiment and I have a document very similar to yours.
Did you have good results? From the example of your jsonl file (in the 1st post), it was not clear to me whether you are repeating the section, chapter and article in each line. Wouldn’t it be possible to have a hierarchical JSON object to do that?
Something like:

{ "chapter 1":   
   { "section 1": 
       { "article 85": "text of article 85" }
   }
}

Would this work? Or is metadata better?

Will have to try, but for now I followed @lmccallum’s approach.

The file has to be JSON Lines; plain JSON is not allowed. It would certainly be easier if we could use JSON and specify which field(s) are to be used for search.

Yes, but json lines means only that each line has a JSON object. As far as I understand, I can have a JSON object with hierarchy… no?

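My background is not technical, so I may be wrong here. As I understand it, the restriction comes from the upload format (a text field plus an optional metadata field on each line) rather than from JSON Lines itself, which only requires one JSON value per line.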
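
As a quick check of the point debated here: the JSON Lines format only asks for one JSON value per line, so a nested object is fine as far as the format goes, provided it is serialized without line breaks. What the upload endpoint accepts is a separate question. The sketch below round-trips the nested example from this thread and then flattens it into one `text`/`metadata` record per article; the `flatten` helper and the ` / ` path separator are illustrative choices.

```python
import json

nested = {"chapter 1": {"section 1": {"article 85": "text of article 85"}}}

# Serialized compactly, the whole hierarchy fits on a single line.
line = json.dumps(nested)
restored = json.loads(line)

def flatten(tree, path=()):
    """Yield one text/metadata record per leaf of the hierarchy."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            yield {"text": value, "metadata": " / ".join(path + (key,))}

for record in flatten(restored):
    print(json.dumps(record))
# {"text": "text of article 85", "metadata": "chapter 1 / section 1 / article 85"}
```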
