
Wednesday 17 September 2014

IGeLU 2014
Deep search or harvest: how do we decide?
Simon Moffatt and Illaria Corda, British Library

Context: increasing content, especially digital content from a diversity of sources, as well as migration from other systems. There are two options for integrating all this data: harvesting or deep search.
Harvest: put all the data in the central index (Primo)
Deep search: add a new index, so the user searches Primo and this other index in a federated way (see the sketch below)
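A minimal sketch of the difference, in Python with toy record structures (this is only an illustration, not the BL's actual code): harvesting normalises everything into the one central index in advance, while deep search leaves the second index where it is and merges results at query time.

def harvest(source_records, central_index):
    """Harvest: normalise records up front and load them into the central
    index, so a single index answers every search."""
    for rec in source_records:
        central_index.append({
            "title": rec.get("title", ""),
            "source": rec.get("source", "unknown"),
        })

def deep_search(query, central_index, external_search):
    """Deep search: leave the other index where it is and query both at
    search time, merging the two result lists."""
    local_hits = [r for r in central_index if query.lower() in r["title"].lower()]
    remote_hits = external_search(query)  # call out to the other index's API
    return local_hits + remote_hits

# Toy usage: the "external" index is just a function standing in for an API call.
index = []
harvest([{"title": "Magna Carta exhibition"}], index)
print(deep_search("magna", index, lambda q: [{"title": "Archived page about " + q}]))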

The BL had to make this decision for two major indexes: one for the replacement of the Content Management System (CMS) and the other for the Web Archive.

CMS: harvest

The main reason for choosing harvesting: the CMS has its own index, updated daily, but that index is not good enough for a deep search, and the CMS only has about 50,000 pages, which for Primo is not that much. The work involved in harvesting would be normalisation rules, setting up daily processes, etc.
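Primo's real normalisation rules are configured in its back office and produce PNX records; purely as an illustration of the kind of field mapping a harvest pipe involves, here is a sketch in Python with invented CMS field names.

def normalise_cms_page(cms_page):
    """Map a CMS web page record onto a simplified, PNX-like structure."""
    return {
        "recordid": "CMS" + str(cms_page["id"]),
        "title": cms_page.get("page_title", "Untitled page"),
        "creationdate": cms_page.get("last_modified", ""),
        "rsrctype": "website",
        "linktorsrc": cms_page.get("url", ""),
    }

print(normalise_cms_page({"id": 42, "page_title": "Reading Rooms",
                          "url": "https://example.org/reading-rooms"}))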

Records harvested into Primo
13m from Aleph
5m from Sirsi Dynix for the sound archive
48m articles from the BL's own article catalogue
1m records from other catalogues

Service challenges for 67m records
The index is a very big file (100GB). Overnight schedules are tight and fast data updates are not possible. Re-indexing always takes at least 3hrs, sometimes more; a system restart takes 5hrs; a re-sync of the search schema takes a whole day; and the failover system must be available at all times. They must be careful with Primo service packs and hotfixes: rebuilding the whole index in one go would not be possible, so beware when the documentation says "delete index and recreate".

Development challenges
Major normalisations can run only three times a day, and the impact on services has to be considered, as does implementing Primo enhancements. In their development environment they have created a sample of all their different datasets; it is also used to test new Primo versions, etc. Errors have to be found at this stage.

But the compensations of a big index are:
Speed
Control over the data
Consistency of rules
Common development skills
A single point of failure (only one system to manage)

Web Archive: deep search

Figures
1.3m documents (pages)
Regular crawl processes
Index size 3.5 terabytes
80-node Hadoop cluster where crawls are processed for submission

Implications of choosing the deep search
On the Primo interface there is a tab for the local content and one for the web archive, but otherwise the GUI is the same, as are the types of facets, etc. Clicking on a web archive record takes the user to the Ericom system (the digital rights system for e-legal deposit material).
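At search time the deep search has to query the web archive's own Solr index rather than the Primo index. A hedged sketch of such a call, assuming only the standard Solr /select parameters; the host, core name and field names are hypothetical.

import requests

def search_web_archive(query, rows=10):
    # Query the (hypothetical) web archive Solr core and return matching docs.
    resp = requests.get(
        "http://solr.example.org:8983/solr/webarchive/select",
        params={"q": query, "rows": rows, "wt": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

for doc in search_web_archive("magna carta"):
    print(doc.get("title"), doc.get("url"))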

Service challenges for deep search
Introduction of a different point of failure
New troubleshooting procedures
Changes to the Solr schema could break the search
Primo updates can also potentially break the search

Development challenges
Significant development work, e.g. accommodating Primo features such as multi-select faceting (see the sketch below), keeping the local and the deep search consistent, etc.
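Multi-select faceting, mentioned above, can be approximated against a Solr index with tagged filter queries and excluded facet fields, which is standard Solr functionality; the core URL and field names below are invented for the example.

import requests

params = {
    "q": "magna carta",
    "wt": "json",
    "facet": "true",
    # Tag the filter so it can be excluded when counting its own facet field,
    # which keeps the other crawl_year values selectable in the UI.
    "fq": "{!tag=yearTag}crawl_year:2013",
    "facet.field": "{!ex=yearTag}crawl_year",
}
resp = requests.get("http://solr.example.org:8983/solr/webarchive/select",
                    params=params, timeout=10)
print(resp.json()["facet_counts"]["facet_fields"]["crawl_year"])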

But the compensations are:
Ideal for a large index with frequent updates
In-depth indexing
Maintenance of existing scheduled processes
Active community of API developers

Conclusion: main criteria for the decision

If updates are frequent and there are lots of records, deep search is probably better
Budget to buy servers etc.
Development time and skills expertise
Impact on current processes

Questions
There is a security layer in the API being developed, so the Solr index is not exposed directly.
Limitations of Primo around scalability: would Ex Libris make local indexes bigger? The BL's Primo is not hosted. Before the work on Primo Central there were genuine problems, but these were worked on and resolved; Primo is actually scalable.
