
Tuesday, 9 June 2015

Datamazed – analysing library data flows, manipulations and redundancies

ELAG 2015
By Lukas Koster, Library Systems Coordinator at the Library of the University of Amsterdam

View slides of presentation


It’s more about transformations than manipulations. They tried to build a dataflow repository, both for efficiency and as a blueprint for improvement. The initial problem is that the system environment is complex: a lot is going on just to maintain it, and the data is all over the place. Labyrinths are easy, but a maze is much more complex, and that is what our systems look like. It is worth spending time on developing new environments, because currently everything is very fragmented. The data is held hostage and we need to free it.

Goal of the project: describe the nature and content of all internal and external datastores and the workflows between internal and external systems in terms of object types and data formats, thereby identifying the overlap, redundancy and bottlenecks that stand in the way of efficient data and service management.

The methodology used is enterprise architecture, which distinguishes between the business layer (the what), the application layer (the how) and the technology layer. They looked at other, similar projects and knew of BIBNET, the Flemish Public Library Network, and its architecture study, which focuses on the big picture rather than on dataflows.

DFD (dataflow diagramming) is a fairly easy model; other techniques such as data modelling and visualisation were used as well. They chose Business System Modelling, a relatively open tool with a number of export/import options and a lot of documentation and reporting facilities.

The dataflow repository describes all the elements, including the systems that use them, etc. Their Visual Paradigm project model is subdivided into meaningful folders that can also be used to generate reports. They have also made a data dictionary for all object types, data elements and so on. The model is organised in layers:
Business layer top level
Business layer level 2
Business layer level 3: data management
Application layer: data exchange


Dataflows can be defined by type (they identified five). In every dataflow there is an element of selection: what you do, and with which data. This has to be documented to support decisions and so that you know what to expect and what actually happens (especially if you are going to change systems). The same goes for transformations: they have to be transparent.
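As an illustration only (not the presenter's actual repository format), such a dataflow entry could be captured as structured data, with explicit fields for the selection and the transformation that need documenting. All field names and values in this Python sketch are hypothetical.

from dataclasses import dataclass

@dataclass
class DataFlow:
    """One documented dataflow between two datastores (illustrative sketch)."""
    name: str                # e.g. "catalogue-to-discovery"
    source: str              # system the data comes from
    target: str              # system the data goes to
    object_types: list[str]  # e.g. ["bibliographic record"]
    data_format: str         # e.g. "MARC21 XML"
    selection: str           # which records are selected, and why
    transformation: str      # what is changed on the way
    schedule: str = "daily"  # how often the flow runs

# Example entry; every value is invented for illustration.
flow = DataFlow(
    name="catalogue-to-discovery",
    source="ILS",
    target="discovery index",
    object_types=["bibliographic record"],
    data_format="MARC21 XML",
    selection="only records with at least one holding",
    transformation="fields mapped to the discovery schema; diacritics normalised",
)
print(flow)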

Data redundancy is also an important issue and can have various causes. The one solution put forward: linked data!
The main benefit of all of this is not only a good overview of the available data, of dataflow dependencies and of efficiencies, but also the opportunity to experiment with linked data. It may be the beginning of something else, such as data consolidation and exchange. Descriptions of how things are automated should also be recorded.
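A minimal sketch of the linked data idea, using the Python rdflib library: instead of copying the same descriptive fields into several datastores, each system stores only a link (URI) to one canonical description. The namespace, URIs and properties below are made up for illustration.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# One canonical description of a work...
work = URIRef("http://example.org/work/123")
g.add((work, RDF.type, EX.Work))
g.add((work, RDFS.label, Literal("Example title")))

# ...and other systems just point at it instead of duplicating its fields.
local_record = URIRef("http://example.org/ils/record/456")
g.add((local_record, EX.describes, work))

print(g.serialize(format="turtle"))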

ELAG 2015 - Stockholm - Opening presentations


Opening note – Gunilla Herdenberg, KB National Librarian
Older collections and prints are stored in two buildings 40 m underground. Legal deposit dates from 1661; in 1979 audio-visual material (films, moving images, etc.) was added, and last year digital material. The library wants to unlock its data to facilitate use and reuse.

Faith, hope and codification – Janis Kreslins, Senior Librarian for Academic Affairs
Libraries dream of comprehensiveness. The best memory aid is [a picture of] a tree in a garden, because the purpose of a garden is to create a whole world. We are seeking permanency. The new keyword is codification. Libraries have always been concerned with objects. What we are missing is the way we communicate; faced with the challenge of losing memory, that is why codification is so important. It concentrates on little objects: we are not talking about collections any more but about codes, like a garden that can create the entire world. At the end of the day it has to be relevant.

Keynote: Sometimes I feel sorry for the data – Magnus Oman, Daniel Gillard
What big data really is about is scalability; that is the main point, at least for the techies. For the memory institution people it is the fact that it is so big we can't possibly work with it! Silly…
We throw away so much data, and we don't have to. There is a whole lot we could know now; we have the data, but we are not using it. For example, we could use data about people's movements for public transport planning. But there are moral issues even with anonymisation, and that is why it is not being exploited.
Another example is how computers understand language. Taking data from Wikipedia, for example, which is really not that much, we can analyse the proximity or similarity of words and make connections. This is a mechanistic view, of course, but it serves as an example. Analytics is what can make sense of it, or get things wrong…
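As a toy illustration of the "proximity of words" idea (not the speaker's actual method), the Python sketch below builds co-occurrence vectors from a tiny corpus and compares two words with cosine similarity; the corpus and window size are arbitrary.

import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Count how often each word appears near each other word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [
    "the library holds books and journals",
    "the archive holds documents and letters",
    "books and journals are on the shelves",
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["library"], vecs["archive"]))  # similar contexts -> higher score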
One question was about analysing text rather than numbers: can we do that? The answer is that it is not that easy or reliable; some companies say they can do it, but the algorithms are not straightforward.

Wednesday, 17 September 2014

IGeLU 2014
Deep search or harvest, how do we decide
Simon Moffatt and Ilaria Corda, British Library

Context: increasing content, especially digital content from a diversity of sources, as well as migration from other systems. There are two options for integrating all this data: harvesting or deep search.
Harvest: put all the data in the central index (Primo).
Deep search: add a new, separate index, so the user searches Primo and this other index in a federated way.
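A rough Python sketch of the difference, assuming two purely hypothetical HTTP search endpoints (neither URL nor API is real): with harvesting everything sits in one central index that is queried once, while with deep search the discovery layer fans the query out to the external index at search time and combines the answers.

import requests

LOCAL_INDEX = "https://discovery.example.org/api/search"     # harvested content (central index)
REMOTE_INDEX = "https://webarchive.example.org/solr/select"  # separately maintained index

def harvested_search(query):
    """Harvest model: a single query against the central index."""
    return requests.get(LOCAL_INDEX, params={"q": query}, timeout=10).json()

def deep_search(query):
    """Deep-search model: query both indexes at search time and combine the results."""
    local = requests.get(LOCAL_INDEX, params={"q": query}, timeout=10).json()
    remote = requests.get(REMOTE_INDEX, params={"q": query, "wt": "json"}, timeout=10).json()
    return {"local": local, "web_archive": remote}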

The decision the BL had to make concerned two major indexes: one for the replacement of the Content Management System (CMS) and one for the Web Archive.

CMS: harvest

The main reasons for choosing harvesting: the CMS has its own index, updated daily, but that index is not good enough, and at only 50,000 pages the content is not that much for Primo to take in. The work involved is normalisation rules, setting up daily processes, etc.
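Primo's real normalisation rules are written in its own rules format, mapping source records to PNX; purely to illustrate the kind of field mapping this work involves, here is a hypothetical Python version for a CMS page record (all field names are invented).

def normalise_cms_page(cms_record: dict) -> dict:
    """Map a hypothetical CMS page export to a simple discovery record."""
    return {
        "id": "CMS-" + cms_record["page_id"],
        "title": cms_record.get("page_title", "[untitled]"),
        "description": cms_record.get("summary", ""),
        "url": cms_record["public_url"],
        "type": "webpage",
        "last_updated": cms_record.get("modified_date"),
    }

record = normalise_cms_page({
    "page_id": "1234",
    "page_title": "Opening hours",
    "public_url": "https://www.example.org/opening-hours",
    "modified_date": "2014-09-01",
})
print(record)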

Records in the central index
13m from Aleph
5m Sirsi Dynix for sound archive
48m articles from own article catalogue
1m records from other catalogues

Service challenges for 67m records
The index is a very big file (100 GB). Overnight schedules are tight, and fast data updates are not possible. Re-indexing always takes at least 3 hours, sometimes more; a system restart takes 5 hours; a re-sync of the search schema takes a whole day; and the failover system must be available at all times. Care is needed with Primo service packs and hotfixes: rebuilding the whole index in one go would not be possible, so beware when the documentation says "delete index and recreate".

Development challenges
Major normalisations can run only three times a day, and the impact on services needs to be considered; implementing Primo enhancements also needs to be thought through. In their development environment they have created a sample of all their different datasets, which is also used to test Primo versions, etc. Errors have to be found at this stage.

But the compensations of a big index are:
Speed
Control over the data
Consistency of rules
Common development skills
Single point of failure

Web Archive: deep search

Figures
1.3m documents (pages)
Regular crawl processes
Index size 3.5 terabytes
80-node Hadoop cluster where submitted crawls are processed

Implications of choosing the deep search
On the Primo interface there is a tab for the local content and one for the web archive, but otherwise the GUI is the same, as are the types of facets, etc. Clicking on a web archive record takes the user to the Ericom system (the digital rights system for e-legal-deposit material).

Service challenges for deep search
Introduction of a different point of failure
New troubleshooting procedures
Changes to the Solr schema could break the search
Primo updates can also potentially break the search

Development challenges
Significant development work, e.g. to accommodate Primo features such as multi-select faceting, to keep consistency between the local and the deep search, etc.
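To give an idea of what "multi-select faceting" implies when querying Solr directly, here is a hedged sketch of building the tagged filter query and excluded facet that Solr's local-params syntax supports; the field name and tag are invented, and this is not the BL's actual implementation.

def build_multiselect_params(query, selected_values, facet_field="domain"):
    """Build Solr params so one facet stays multi-selectable.

    The filter query is tagged and the facet excludes that tag, so counts
    for the other values stay visible after one value has been selected.
    """
    params = {
        "q": query,
        "wt": "json",
        "facet": "true",
        "facet.field": "{!ex=sel}" + facet_field,
    }
    if selected_values:
        values = " OR ".join('"%s"' % v for v in selected_values)
        params["fq"] = "{!tag=sel}" + facet_field + ":(" + values + ")"
    return params

print(build_multiselect_params("churchill", ["gov.uk", "ac.uk"]))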

But the compensations are:
Ideal for a large index with frequent updates
In-depth indexing
Maintenance of existing scheduled processes
Active community of API developers

Conclusion: Main criteria to decide

Frequency of updates and number of records (with frequent updates and lots of records, deep search is probably better)
Budget to buy servers etc.
Development time and skills expertise
Impact on current processes

Questions
A security layer in the development API means the Solr index is not exposed directly (see the sketch after these notes)
Limitations of Primo: scalability; would Ex Libris make local indexes bigger? The BL's Primo is not hosted. Before the work on Primo Central there were genuine problems, but they were worked on and resolved; Primo is actually scalable.
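The answer about a security layer suggests something like a thin proxy in front of Solr that only forwards whitelisted parameters. The sketch below uses Flask, which the talk does not mention, and invented hostnames; it is purely illustrative.

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
SOLR_SELECT = "http://solr.internal.example:8983/solr/webarchive/select"  # hypothetical
ALLOWED_PARAMS = {"q", "start", "rows", "fq", "sort"}  # everything else is dropped

@app.route("/api/search")
def search():
    # Forward only whitelisted parameters, so clients never talk to Solr
    # directly and cannot reach admin handlers or arbitrary local params.
    params = {k: v for k, v in request.args.items() if k in ALLOWED_PARAMS}
    params["wt"] = "json"
    upstream = requests.get(SOLR_SELECT, params=params, timeout=10)
    return jsonify(upstream.json())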

Tuesday, 16 September 2014

IGeLU 2014
FINNA, the best of both worlds
National Library of Finland

FINNA includes material from libraries, museums and archives. It is a discovery interface, though it is more than that: it is a "national view" of the collections, but it also enables each institution to have its own view. About one hundred institutions have been harvested in test, with 50 in production. FINNA is based on the open source discovery software VuFind.

FINNA architecture
Usually the various indexes have their own repositories. The descriptions and links of the electronic resources are kept in MetaLib, and the linking is done with SFX. Universities are always worried that the Primo Central index does not have all the content they want, which is why MetaLib needs to be kept. But the Primo Central index is included in the architecture.

The interface contains tabs for each separate index. The only material harvested is that which can be accessed by everyone, one of the reasons being that it is difficult to integrate the sources together. That is also one of the reasons why they did not make a deal with ProQuest (they already had an agreement with Ex Libris). So there is a "split view" between the local content and the content from the Primo Central index. With a combined view it is not possible to show the user everything we want, so it is a showcase for a single search box, and filtering happens later. Blending does not allow a proper relevancy ranking.
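A small made-up example of why blending is hard: relevance scores from two different engines live on different scales, so merging by raw score lets one source dominate, and the usual fallback, simple interleaving, is not a real relevancy ranking either. None of this is FINNA's code.

# Scores from two engines are not comparable, so sorting the union by raw
# score says little about actual relevance.
local_results = [("Local record A", 12.4), ("Local record B", 9.8)]   # e.g. local Solr scores
central_results = [("PCI article X", 0.91), ("PCI article Y", 0.87)]  # e.g. scores on a 0..1 scale

blended_by_score = sorted(local_results + central_results, key=lambda r: r[1], reverse=True)
# -> all local records come first, regardless of how relevant they really are.

def interleave(a, b):
    """Naive round-robin merge: fair to both sources, but not true relevancy ranking."""
    merged = []
    for x, y in zip(a, b):
        merged.extend([x, y])
    merged.extend(a[len(b):] or b[len(a):])
    return merged

print(interleave(local_results, central_results))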

Another problem is that there is some duplication between different indexes, e.g. e-books. Naming the tabs has also been a challenge, so in the end each university can do it whichever way they want. Finally, how do you make sure users get the best or correct content? One solution is authentication (? I think).