I have close to 10K JSON files (very small) and would like to provide search functionality over them. Since these JSON files are fixed for a specific release, I am thinking of pre-indexing the files and loading the index when the website starts. I don't want to use an external search engine.
I am looking for libraries that support this. Lucene.NET is one popular library, but I am not sure whether it supports loading pre-indexed data.
I am not sure whether this is possible. What options are available?
Since S3 is not a .NET-specific technology and Lucene.NET is a line-by-line port of Lucene, you can expand your search to include Lucene-related questions. There is an answer here that points to an S3 implementation meant for Lucene that could be ported to .NET. But, by the author's own admission, performance of the implementation is not great.
NOTE: I don't consider this to be a duplicate question due to the fact that the answer most appropriate to you is not the accepted answer, since you explicitly stated you don't want to use an external solution.
There are a couple of implementations for Lucene.NET that use Azure instead of AWS here and here. You may be able to get some ideas that help you to create a more optimal solution for S3, but creating your own `Directory` implementation is a non-trivial task.
Can `IndexReader` read index file from in-memory string?
It is possible to use a `RAMDirectory`, which has a copy constructor that moves the entire index from disk into memory. The copy constructor is only useful if your files are on disk, though. You could potentially read the files from S3 and put them into `RAMDirectory`. This option is fast for small indexes but will not scale if your index is growing over time. It is also not optimized for high-traffic websites that have multiple concurrent threads performing searches.
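A minimal sketch of the copy-constructor approach, assuming Lucene.NET 4.8 and an index that has already been downloaded to a local folder. The names `InMemoryIndexLoader`, `OpenInMemory`, and `indexPath` are made up for illustration and are not part of Lucene.NET.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class InMemoryIndexLoader
{
    // indexPath is a hypothetical local folder that already contains
    // the pre-built index files (for example, downloaded from S3).
    public static DirectoryReader OpenInMemory(string indexPath)
    {
        using (FSDirectory onDisk = FSDirectory.Open(indexPath))
        {
            // The copy constructor reads every index file into heap memory,
            // so the on-disk directory can be disposed afterwards.
            var inMemory = new RAMDirectory(onDisk, IOContext.DEFAULT);
            return DirectoryReader.Open(inMemory);
        }
    }
}
```

With roughly 10K small documents the index would likely stay well under the size the warning below talks about, but that is an assumption to verify against the real index.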
From the documentation:
Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.
It is recommended to materialize large indexes on disk and use `MMapDirectory`, which is a high-performance directory implementation working directly on the file system cache of the operating system, so copying data to heap space is not useful.
When you call the `FSDirectory.Open()` method, it chooses a directory that is optimized for the current operating system. In most cases it returns `MMapDirectory`, which is an implementation that uses the `System.IO.MemoryMappedFiles.MemoryMappedFile` class under the hood with multiple views. This option will scale much better if the size of the index is large or if there are many concurrent users.
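As a rough sketch of the on-disk option, assuming Lucene.NET 4.8, the index can be opened once at startup and a single searcher shared by all requests. `IndexHost` and `indexPath` are hypothetical names chosen for this example.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public sealed class IndexHost : IDisposable
{
    private readonly FSDirectory _directory;
    private readonly DirectoryReader _reader;

    // IndexSearcher is safe to share across threads, so one instance
    // can serve every request for the lifetime of the application.
    public IndexSearcher Searcher { get; }

    public IndexHost(string indexPath)
    {
        // FSDirectory.Open picks the implementation for the current platform.
        _directory = FSDirectory.Open(indexPath);

        // Print the concrete type that was chosen on this machine.
        Console.WriteLine(_directory.GetType().Name);

        _reader = DirectoryReader.Open(_directory);
        Searcher = new IndexSearcher(_reader);
    }

    public void Dispose()
    {
        _reader.Dispose();
        _directory.Dispose();
    }
}
```

Because the index is fixed for a release, the reader never needs to be reopened, which is why this sketch keeps one reader for the whole process.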
To use Lucene.NET's built-in index file optimizations, you must put the index files in a medium that can be read like a normal file system. Rather than trying to roll a Lucene.NET solution that uses S3's APIs, you might want to look into using S3 as a file system instead, although I am not sure how that would perform compared to a local file system.
Thanks @NightOwl888. Now I have a fair understanding of the possible options. The simplest solution is to download and extract the index to a directory the first time the application starts, then point IndexReader to that directory.
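The extract-on-first-start step could look something like the sketch below. It assumes the index is shipped as a zip archive that has already been downloaded to local disk; the download itself is left out. `IndexBootstrap`, `EnsureExtracted`, `archivePath`, and `targetDir` are hypothetical names.

```csharp
using System.IO;
using System.IO.Compression;

public static class IndexBootstrap
{
    // Extracts the archive only when targetDir is missing or empty,
    // so later application starts reuse the files that are already there.
    public static string EnsureExtracted(string archivePath, string targetDir)
    {
        bool needsExtract = !Directory.Exists(targetDir)
            || Directory.GetFileSystemEntries(targetDir).Length == 0;

        if (needsExtract)
        {
            Directory.CreateDirectory(targetDir);
            ZipFile.ExtractToDirectory(archivePath, targetDir);
        }

        // The returned path is what would be passed to FSDirectory.Open.
        return targetDir;
    }
}
```

This sketch does not handle an extraction that is interrupted halfway; a real deployment might extract to a temporary folder and rename it once the extraction completes.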