IGeLU 2014
Deep search or harvest, how do we decide
Simon Moffatt and Illaria Corda, British Library
Context: increasing content, especially digital content from a diversity of sources, as well as migration from other systems. There are two options for integrating all this data: harvesting or deep search.
Harvest: put all the data in the central index (Primo)
Deep search: add a new index, so user searches Primo and this other index in a federated way
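The difference between the two options can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions: the Index class and the harvest and deep_search functions are hypothetical names made up for the example, not part of Primo or of the BL's setup, and matching is a simple substring test rather than real relevance ranking.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Index:
    """A toy search index: a name plus a list of record titles (hypothetical)."""
    name: str
    records: list[str] = field(default_factory=list)

    def search(self, term: str) -> list[str]:
        # Case-insensitive substring match, standing in for a real search engine.
        needle = term.lower()
        return [r for r in self.records if needle in r.lower()]


def harvest(central: Index, source: Index) -> None:
    """Harvest: copy the source's records into the central index."""
    central.records.extend(source.records)


def deep_search(term: str, central: Index, remote: Index) -> dict[str, list[str]]:
    """Deep search: query both indexes at search time and keep the results apart."""
    return {central.name: central.search(term), remote.name: remote.search(term)}


# Made-up sample records, for illustration only.
primo = Index("local", ["Magna Carta manuscript"])
cms = Index("cms", ["Magna Carta exhibition page"])
web_archive = Index("web archive", ["Archived page about Magna Carta", "Archived news page"])

harvest(primo, cms)  # CMS content now lives in the central index
results = deep_search("magna carta", primo, web_archive)
print(results)
```

With harvesting, the extra content has to be copied (and re-copied on every update) into the central index; with deep search, the second index stays where it is and is queried alongside the central one.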
The BL had to make this decision for two major indexes: one for the replacement of the Content Management System (CMS) and the other for the Web Archive.
CMS: harvest
The main reason for choosing harvesting is that the CMS has its own index, updated daily. Work involved for deep search would be normalisation rules, setting up daily processes etc. The index is not good enough. It only has 50,000 pages, which for Primo is not that much.
Records in the CMS
13m from Aleph
5m Sirsi Dynix for sound archive
48m articles from own article catalogue
1m records from other catalogues
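A quick check that the figures above add up to the 67m total quoted in the talk (the dictionary keys are just my labels for the sources listed):

```python
# Record counts in millions, as listed in the notes above.
records_millions = {
    "Aleph": 13,
    "Sirsi Dynix (sound archive)": 5,
    "own article catalogue": 48,
    "other catalogues": 1,
}

total_millions = sum(records_millions.values())
print(f"{total_millions}m records")  # 67m records
```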
Service challenges for 67m records
The index is a very big file (100GB). Overnight schedules are tight. Fast data updates are not possible. Re-indexing always takes at least 3 hours, sometimes more. A system restart takes 5 hours, a re-synch of the search schema takes a whole day, and the failover system must be available at all times. Care is needed with Primo service packs and hotfixes. Doing the whole index in one go wouldn't be possible, so beware when the documentation says "delete index and recreate".
Development challenges
Major normalisations only 3 times a day, need to be careful about impact on services. Implementing Primo enhancements also needs to be considered. In their development environment they have created a sample of all their different datasets. It's also used to test which version of Primo to use etc. Errors have to be found at this point.
But the compensations of big index are:
Speed
Control over the data
Consistency of rules
Common development skills
Single point of failure
Web Archive : deep search
Figures
1.3m documents (pages)
Regular crawl processes
Index size 3.5 terabytes
80-node hadoop cluster where crawls are processed through submission
Implications of choosing the deep search
On the Primo interface there is a tab for the local content and one for the web archive, but otherwise the GUI is the same, as are the types of facets etc. Clicking on a web archive record takes the user to the Ericom system (the digital rights system for e-legal deposit material).
Service challenges for deep search
Introduction of a different point of failure
New troubleshooting procedures
Changes to the Solr schema could break the search
Primo updates can also potentially break the search
Development challenges
Significant development work, e.g. accommodate Primo features such as multi-select faceting, consistency between local and deep search, etc.
But the compensations are:
Ideal for a large index with frequent updates
In-depth indexing
Maintenance of existing scheduled processes
Active community of API developers
Conclusion: Main criteria to decide
Frequency of updates and number of records: with frequent updates and lots of records, deep search is probably better
Budget to buy servers etc.
Development time and skills expertise
Impact on current processes
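As a rough way of reading these criteria together, here is a small sketch. It is my own simplification, not something presented in the talk: the function name, the yes/no inputs and the rule itself are assumptions, and the fourth criterion (impact on current processes) is left out because the notes give no way to weigh it.

```python
def suggest_approach(
    frequent_updates: bool,
    very_large_index: bool,
    server_budget: bool,
    dev_capacity: bool,
) -> str:
    """Hypothetical rule of thumb based on the criteria in the notes.

    Deep search is suggested only when the data calls for it (frequent
    updates and lots of records) and the institution can pay for it
    (servers, development time and skills). Otherwise harvest.
    """
    data_calls_for_deep = frequent_updates and very_large_index
    can_support_deep = server_budget and dev_capacity
    return "deep search" if data_calls_for_deep and can_support_deep else "harvest"


# Roughly the two BL cases as described above (my reading of the notes).
print(suggest_approach(True, True, True, True))     # web archive
print(suggest_approach(False, False, True, True))   # CMS
```

In practice the inputs are matters of degree rather than yes/no answers, so this is only a way of seeing how the criteria interact.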
Questions
Security layer in the development API that doesn't expose the Solr index
Limitations of Primo: scalability; would Ex Libris make local indexes bigger? BL's Primo is not hosted. Before work on Primo Central, we had genuine problems and we worked on them to resolve them. Primo is actually scalable.
Wednesday, 17 September 2014
Tuesday, 16 September 2014
IGeLU 2014
FINNA, the best of both worlds
National Library of Finland
FINNA includes material from libraries, museums and archives. It is a discovery interface, though it's more than that. It is a "national view" of the collections, but it enables each institution to have its own view. About one hundred institutions have been harvested in test, with 50 in production. FINNA is based on the open source discovery service VuFind.
FINNA architecture
Usually the various indexes have their own repositories. The descriptions and links of the electronic resources are kept in MetaLib and the linking is done with SFX. Universities are always worried that the Primo Central Index doesn't have all the content they want, so that's why there's a need to keep MetaLib. But the Primo Central Index is included in the architecture.
The interface contains a tab for each separate index. The only material harvested is that which can be accessed by everyone, one of the reasons being that it's difficult to integrate the indexes together. That's one of the reasons why they didn't make a deal with ProQuest (before they had an agreement with Ex Libris). So there's a "split view" between the local content and that from the Primo Central Index. With a combined view it's not possible to tell the user all that we want, so it's a showcase for a single search box and filtering happens later. Blending doesn't allow proper relevancy ranking.
Another problem is that there is some duplication between the different indexes, e.g. ebooks. It has also been a challenge naming the tabs, so in the end the universities can name them whichever way they want. Finally, how to make sure users get the best or correct content? One solution is authentication (? I think).
IGeLU 2014
The Primo graphical user interface
Matthias Einbrodt, University of Bolzano
Relatively small library. Live with Primo since 2011, one of the early adopters, so live in Jan 2013. New interface was developed and will go live soon.
- First had a custom stylesheet but it became too big and difficult to maintain. We're aware that users don't wait longer than 2 or 3 seconds for a webpage to load
- Local broadband doesn't perform very well, which doesn't help
- Reduction of HTTP requests
- Only loaded one font file
- Didn't use css images
- Rewrote css from scratch using dynamic stylesheets, media queries, object oriented css, variables, nesting, mixins, etc.
- Produce LESS files which are less numerous than css
- Advantages of dynamic stylesheets language: you can break down into parts
- Creation of central configuration files
Object Oriented CSS
- don't repeat unnecessarily
- code reuse
- separation of structure from skin: build the instructions (structure), then build the outside look (visual features). That means defining a base class so that, whatever the context/container is, the content/properties shouldn't change
- iFrames create problems, as there is no control over those elements
IGeLU 2014
Meeting the discovery challenge: user expectations, usability and Primo
Andy Land, Manchester University
The collection is about 4 million items. They have used Primo since 2010 and moved to Alma in 2013. They carried out research into students' needs and the usability of Primo in this context. They call it "Library Search" (LS). The results show:
- Most heavily used service in the library
- Google comes first but LS is second
- Same for accessing digital content although LS comes first for science related subjects
- LS comes first when wanting to find a known item
- Some functionalities were liked, e.g. facets
- Others were less liked, e.g. pre-search filtering
- Changed pre-filters with tabs: search everything vs library catalogue only
- Modified display of facets e.g. their order
- FRBRisation was a problem for users so work was done to keep things simpler
- Improved display with visible colours, use of white spaces, tabs etc. to better highlight things
- Spelling suggestions made more visible
- Add "sign in" links in numerous places
- Suggestions to improve metadata in the community zone e.g. authors presented with different names
- Launched a live chat service with library experts
- Other projects to use things such as crowdsourcing etc.
Monday, 15 September 2014
IGeLU 2014 - Oxford
How subversive! And how it takes to subvert...
Alma Swan, director of SPARC Europe, Key Perspectives Ltd, Enabling Open Scholarship
Open access was proposed as an idea in June 1994, though it wasn't called that yet. It was recommended that authors post their papers on anonymous FTP sites. Shortly afterwards the World Wide Web came along. Now over 35% of publications across all disciplines are open access: the majority is green (in repositories), a small part is gold (in subscription journals) and the rest is delayed open access, i.e. where publishers make the content of their journal available after a period of time (maybe 12 to 18 months).
But the increase has been very slow and has certainly not progressed in the way that was expected. In terms of authors, there is a lot of misunderstanding or lack of awareness, as well as fear of repercussions, especially from the publishers or for their career. The changes that came with the web have been rather coldly received in academia... In terms of publishers, there has been some hindrance, which reinforces the sense of uncertainty. Finally, libraries have been hooked into big deals and therefore their room for manoeuvre has been limited, budgets have been frozen and policies are made elsewhere (e.g. at the funder level, which can be national). There have also been varying levels of buy-in to the notion of OA, as everywhere else.
The drivers have been advocacy, as well as technical developments for appropriate infrastructure, new publishing venues and policies. Advocacy has collected evidence about the benefits for authors, to the point that OA has to become the way of working in the digital age. Benefits include visibility, usage and impact. This is also valid for institutions because they can better monitor and assess usage; it gives them competitive intelligence and facilitates outreach and a better return on investment. The benefit for funders is that they can also monitor and assess their investment (ROI).
In terms of the infrastructure, we've developed systems from print to electronic, hyperlinking, interoperability and linked data (possibly). The EU has done some research on OA. Amongst other things, they've built OpenAIRE, a harvester for metadata as well as content; readers can go there to collect articles they're interested in. Open Access policies have also been developed. There are now 222 institutional, 44 sub-institutional and 90 funder policies, so significant things are happening.
The areas of promise and their issues and challenges are:
- Books, because up until now the focus has been on journals
- Policies are growing in number and they must be mandatory and supported by good implementation; there's also a strong recommendation about convergence and alignment at a European level
- Humanities are increasingly a point of interest, with lots of new developments for OA journals and OA monographs, so publishers are changing their business model; funders are waking up and institutions are developing new initiatives, e.g. covering the costs, institutional publishing (university presses)
- Libraries also have a role to play, they have the right skills, the knowledge about users needs
- Technical initiatives, e.g. hypothes.is (kind of interactive book?...)
- Data developments with massive interest in Open Data and it may be the basis of open scholarship in the future - However the preservation and curation of data is a challenge!
- Changes in legislation and thinking about licensing and copyright, which frees up the research community, although more thinking is required; we need responsible licensing (e.g. it is not wise to sign agreements with publishers that limit OA or obstruct its aims)
- Text and data mining - we need a full research literature that can be open and mined - it is institutions' responsibility to support this; see TDM
- Ensuring that the OA system is sustainable