As this was my first attempt, I decided to take a pretty basic approach, see what the results were like, and optimise later.
Content is stored in Django as posts, so I wrote a custom document reader that created a new LlamaIndex document for each post, attaching the post id, title, link and published date as metadata. This gave better results than just loading in all the content as a text or CSV file, which I tried first.
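A minimal sketch of what a reader like this might look like. The `Post` dataclass is a hypothetical stand-in for the Django model, and the field names (`id`, `title`, `link`, `published`, `body`) and the `posts_to_documents` helper are assumptions for illustration, not the actual code. The fallback `Document` class is only there so the sketch runs without LlamaIndex installed.

```python
from dataclasses import dataclass, field
from datetime import date

try:
    # Import path used by recent LlamaIndex releases; older releases differ.
    from llama_index.core import Document
except ImportError:
    @dataclass
    class Document:
        """Minimal stand-in so the sketch runs without LlamaIndex."""
        text: str
        metadata: dict = field(default_factory=dict)


@dataclass
class Post:
    """Hypothetical stand-in for the Django post model."""
    id: int
    title: str
    link: str
    published: date
    body: str


def posts_to_documents(posts):
    """Build one Document per post, carrying identifying fields as metadata."""
    return [
        Document(
            text=post.body,
            metadata={
                "post_id": post.id,
                "title": post.title,
                "link": post.link,
                # Store the date as an ISO string so the metadata stays serialisable.
                "published": post.published.isoformat(),
            },
        )
        for post in posts
    ]
```

In a real Django project the list of posts would come from a queryset rather than hand-built objects. Keeping one document per post means each retrieved chunk can be traced back to its title and link.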
I tried a bunch of different techniques for splitting the content into chunks, including splitting by sentence count and using larger and smaller token counts. In the end I decided to stick with the LlamaIndex default just to get it working.
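To make the two splitting strategies concrete, here is a rough standard-library sketch. It is not how LlamaIndex splits text: the function names and parameters are made up for illustration, sentences are detected with a naive regex, and "tokens" are approximated by whitespace-separated words rather than a real tokenizer.

```python
import re


def split_by_sentences(text, sentences_per_chunk=3):
    """Group consecutive sentences into chunks of a fixed sentence count."""
    # Naive boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def split_by_tokens(text, chunk_size=256):
    """Group whitespace-separated words into chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```

Smaller chunks tend to give more precise matches but less surrounding context, and larger chunks do the opposite, which is why it is worth trying a few sizes against real queries.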