IGeLU 2014
Deep search or harvest, how do we decide
Simon Moffatt and Illaria Corda, British Library
Context: increasing content, especially digital, from a diversity of sources, as well as migration from other systems. There are two options for integrating all this data: harvesting or deep search.
Harvest: put all the data in the central index (Primo)
Deep search: add a new index, so user searches Primo and this other index in a federated way
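The difference between the two options can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the indexes are plain lists, and `search_index`, `harvest` and `deep_search` are hypothetical names, not Primo functions.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def search_index(index: List[Record], term: str) -> List[Record]:
    """Naive title match standing in for a real search engine."""
    needle = term.lower()
    return [record for record in index if needle in record["title"].lower()]


def harvest(central_index: List[Record], source_records: List[Record]) -> None:
    """Harvest: copy the source records into the central index ahead of time."""
    central_index.extend(source_records)


def deep_search(
    term: str,
    central_index: List[Record],
    remote_search: Callable[[str], List[Record]],
) -> Dict[str, List[Record]]:
    """Deep search: query the central index and a separate index at search time."""
    return {
        "local": search_index(central_index, term),
        "remote": remote_search(term),
    }


# Hypothetical data: two harvested records and a separate web archive index.
central: List[Record] = []
harvest(central, [{"title": "Annual report"}, {"title": "Sound archive guide"}])
web_archive = [{"title": "Archived report page"}]
results = deep_search("report", central, lambda term: search_index(web_archive, term))
```

With harvesting, the cost is paid when the central index is built and updated; with deep search, the second index stays where it is and is queried alongside the central one, with the results kept in separate groups.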
The decision that the BL had to make was for two major indexes: one was for the replacement of Content Management System and the other one was for the Web Archive
CMS: harvest
The main reason for choosing harvesting is that the CMS has its own index, updated daily. Work involved for deep search would be normalisation rules, setting up daily processes etc. The index is not good enough. It only has 50,000 pages, which for Primo is not that much.
Records in the CMS
13m from Aleph
5m Sirsi Dynix for sound archive
48m articles from own article catalogue
1m records from other catalogues
Service challenges for 67m records
The index is a very big file (100GB). Overnight schedules are tight. Fast data updates are not possible. Re-indexing is always at least 3hrs, sometimes more. System restart takes 5hrs, re-synch of the search schema is a whole day and the failover system must be available at all times. Must be careful for Primo service packs and hotfixes. Doing the whole index in one go wouldn't be possible so beware when the documentation says "delete index and recreate".
Development challenges
Major normalisations only 3 times a day, need to be careful about impact on services. Implementing Primo enhancements also needs to be considered. In their development environment they have created a sample of all their different datasets. It's also used to test which version of Primo to use etc. Errors have to be found at this point.
But the compensations of big index are:
Speed
Control over the data
Consistency of rules
Common development skills
Single point of failure
Web Archive : deep search
Figures
1.3m documents (pages)
Regular crawl processes
Index size 3.5 terabytes
80-node Hadoop cluster where crawls are processed through submission
Implications of choosing the deep search
On the Primo interface there is a tab for the local content and one for the web archive, but otherwise the GUI is the same, as are the types of facets etc. Clicking on a web archive record takes the user to the Ericom system (digital rights system for e-legal deposit material).
Service challenges for deep search
Introduction of a different point of failure
New troubleshooting procedures
Changes to the Solr schema could break the search
Primo updates can also potentially break the search
Development challenges
Significant development work, e.g. accommodate Primo features such as multi-select faceting, consistency between local and deep search, etc.
But the compensations are:
Ideal for a large index with frequent updates
In-depth indexing
Maintenance of existing scheduled processes
Active community of API developers
Conclusion: Main criteria to decide
Frequency of updates and number of records: with frequent updates and lots of records, deep search is probably better
Budget to buy servers etc.
Development time and skills expertise
Impact on current processes
Questions
Security layer in development: an API that doesn't expose the Solr index
Limitations of Primo: scalability; would Ex Libris make local indexes bigger? BL's Primo is not hosted. Before work on Primo Central, we had genuine problems and we worked on them to resolve them. Primo is actually scalable.
Wednesday, 17 September 2014
Tuesday, 16 September 2014
IGeLU 2014
FINNA, the best of both worlds
National Library of Finland
FINNA includes material from libraries, museums and archives. It is a discovery interface, though it's more than that. It is a "national view" of the collections but it enables each institution to have its own view; about one hundred institutions have been harvested in test, with 50 in production. FINNA is based on the open source discovery service VuFind.
FINNA architecture
Usually the various indexes have their own repositories. The description and the links of the electronic resources are kept in MetaLib and the linking is done with SFX. Universities are always worried that the Primo central index doesn't have all the content they want so that's why there's a need to keep MetaLib. But Primo central index is included in the architecture.
The interface contains tabs for each separate index. The only material harvested is that which can be accessed by everyone, one of the reasons being that it's difficult to integrate them together. That's one of the reasons why they didn't make a deal with ProQuest (before they had an agreement with Ex Libris). So there's a "split view" between the local content and that from Primo central index. With a combined view it's not possible to say to the user all that we want, so it's a showcase for a single search box and filtering happens later. Blending doesn't allow a proper relevancy ranking.
Another problem is that there is some duplication in different indexes, e.g. ebooks. It has also been a challenge naming the tabs, so in the end the universities can do it whichever way they want. Finally, how to make sure users get the best or correct content? One solution is authentication (? I think).
IGeLU 2014
The Primo graphical user interface
Matthias Einbrodt, University of Bolzano
Relatively small library. Live with Primo since 2011, one of the early adopters, so live in Jan 2013. New interface was developed and will go live soon.
- First had a custom stylesheet but it became too big and difficult to maintain. We're aware that users don't wait longer than 2 or 3 seconds for a webpage to load
- Broadband is not very fast, which doesn't help
- Reduction of HTTP requests
- Only loaded one font file
- Didn't use CSS images
- Rewrote the CSS from scratch using dynamic stylesheets, media queries, object oriented CSS, variables, nesting, mixins, etc.
- Produce LESS files, which are less numerous than CSS
- Advantage of a dynamic stylesheet language: you can break it down into parts
- Creation of central configuration files
Object Oriented CSS
- don't repeat unnecessarily
- code reuse
- separation of structure from skin: build the instructions (structure), then build the outside look (visual features); that means defining a base class so that, whatever the context/container is, the content/properties don't change
- iFrames create problems: there is no control over those elements
IGeLU 2014
Meeting the discovery challenge: user expectations, usability and Primo
Andy Land, Manchester University
The collection is about 4 million items. They have used Primo since 2010 and moved to Alma in 2013. They carried out research into students' needs and the usability of Primo in this context. They call it "Library Search". The results show:
- Most heavily used service in the library
- Google comes first but LS is second
- Same for accessing digital content although LS comes first for science related subjects
- LS comes first when wanting to find a known item
- Some functionalities were liked, e.g. facets
- Others were less liked, e.g. pre-search filtering
- Changed pre-filters with tabs: search everything vs library catalogue only
- Modified display of facets e.g. their order
- FRBRisation was a problem for users so work was done to keep things simpler
- Improved display with visible colours, use of white spaces, tabs etc. to better highlight things
- Spelling suggestions made more visible
- Add "sign in" links in numerous places
- Suggestions to improve metadata in the community zone e.g. authors presented with different names
- Launched a live chat service with library experts
- Other projects to use things such as crowdsourcing etc.
Monday, 15 September 2014
IGeLU 2014 - Oxford
How subversive! And how it takes to subvert...
Alma Swan, director of SPARC Europe, Key Perspectives Ltd, Enabling Open Scholarship
Open access was proposed as an idea in June 1994, though it wasn't called that yet. It was recommended that authors post their papers on anonymous FTP sites. Shortly afterwards the World Wide Web came along. Now over 35% of publications across all disciplines are open access: the majority is green (in repositories), a small part is gold (in subscription journals) and the rest is delayed open access, i.e. where publishers make the content of their journal available after a period of time (maybe 12 to 18 months).
But the increase has been very slow and has certainly not progressed in the way that was expected. In terms of authors, there is a lot of misunderstanding or lack of awareness, as well as fear of repercussions, especially from publishers or for their careers; the changes that came with the web have been rather coldly received in academia. In terms of publishers, there has been some hindrance, which reinforces the sense of uncertainty. Finally, libraries have been hooked into big deals, so their room for manoeuvre has been limited; budgets have been frozen and policies made elsewhere (e.g. at the funder level, which can be national). There have also been varying levels of buy-in to the notion of OA, as everywhere else.
The drivers have been advocacy, as well as technical developments for appropriate infrastructure, new publishing venues and policies. Advocacy has collected evidence about benefits for authors, to the point that it has to become the way of working in the digital age. Benefits include: visibility, usage, impact. This is also valid for institutions because they can better monitor and assess usage; it gives them competitive intelligence and facilitates outreach and a better return on investment. The benefit for funders is that they can also monitor and assess their investment (ROI).
In terms of the infrastructure, we've developed systems from print to electronic, hyperlinking, interoperability and linked data (possibly). The EU has done some research on OA. Amongst other things, they've built OpenAIRE, a harvester for metadata as well as content; readers can go there to collect articles they're interested in. Open Access policies have also been developed. There are now 222 institutional, 44 sub-institutional and 90 funder policies, so significant things are happening.
The areas of promise and their issues and challenges are:
- Books, because up until now the focus has been on journals
- Policies are growing in number and they must be mandatory and supported by good implementation; there's also a strong recommendation about convergence and alignment at a European level
- Humanities are increasingly a point of interest, with lots of new developments for OA journals and OA monographs, so publishers are changing their business model; funders are waking up and institutions are developing new initiatives, e.g. covering the costs, institutional publishing (university presses)
- Libraries also have a role to play, they have the right skills, the knowledge about users needs
- Technical initiatives, e.g. hypothes.is (kind of interactive book?...)
- Data developments with massive interest in Open Data and it may be the basis of open scholarship in the future - However the preservation and curation of data is a challenge!
- Changes in legislation and thinking about licensing and copyright which free up the research community, although more thinking is required; we need responsible licensing (e.g. it is not wise to sign agreements with publishers that limit OA or obstruct its aims)
- Text and data mining - we need a full research literature that can be open and mined - it is institutions' responsibility to support this; see TDM
- Ensuring that the OA system is sustainable
Friday, 13 June 2014
Closed stacks - improved workflow
ELAG 2014
Going digital in the closed stacks – Library logistics with a smart phone
Eva Dahlbäck and Theodor Tolstoy, Stockholm University Library
(see description of the talk)
Approx. 300 orders a day. Up until recently, orders would have been printed on paper. Orders from closed stacks, interlibrary loans and missing books. Before: digital -> paper -> digital (email sent to user to let them know item available).
The workflow is essentially the same, even if the orders come in differently and come out differently. So now there is a new digital workflow: all the orders are collected together in a list managed by the Viola programme, in which they are sorted. The librarian downloads the list, goes to the stacks and comes back with the items. It's managed with a mobile phone.
The list can be ordered by location. It is downloaded to the mobile phone. Each book has a separate entry which includes the shelfmark info. The librarian also has a small printer for slips. The phone scans the barcode and a slip is printed and goes in the book. Once a book has been scanned, it receives a green star (stars can be various colours for various situations).
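The list handling described above can be sketched as follows. This is a minimal illustration in Python, not the Viola code (which, per the talk, is ASP.NET with an Android app); the `Order` fields, the status values and the function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Order:
    barcode: str
    shelfmark: str
    location: str
    status: str = "pending"  # hypothetical status; "fetched" plays the green star


def build_pick_list(orders: List[Order]) -> List[Order]:
    """Sort orders by location, then shelfmark, so the stacks are walked once."""
    return sorted(orders, key=lambda order: (order.location, order.shelfmark))


def scan(pick_list: List[Order], barcode: str) -> str:
    """Mark the scanned book as fetched and return the text of its slip."""
    for order in pick_list:
        if order.barcode == barcode:
            order.status = "fetched"
            return f"Slip: {order.shelfmark} ({order.location})"
    raise KeyError(f"No order for barcode {barcode}")


# Hypothetical orders from the three sources mentioned in the talk.
orders = [
    Order("B002", "Q 12", "Stack 2"),
    Order("B001", "A 7", "Stack 1"),
    Order("B003", "C 3", "Stack 1"),
]
pick_list = build_pick_list(orders)
slip = scan(pick_list, "B003")
```

Sorting once on the server side and marking status on scan is what removes the paper step: the same list drives both the walk through the stacks and the notification that follows.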
Benefits are fewer manual steps and a unique workflow. It's faster and easier to follow the order steps. It requires fewer people. Viola is also connected to the invoice-system (what does this imply?...). However collection is only twice a day...
A close collaboration between developers and librarians was necessary. They worked with "user stories" to help in the collaborative work, to prioritise tasks and break them into smaller steps and to follow progress. The user stories included staff, such as the book fetcher (what does he/she need?). The intention is to release this as open source!
The technology used was ASP.NET MVC, the database is SQL Server (replaceable thanks to the ORM PetaPoco) and an Android app (Xamarin MonoDroid). Most of these are open source.
Identifying and identifiers
ELAG 2014
Integrating ORCiD – A two way conversation
Tom Demeranville, software engineer specialising in digital identifiers and identities at the British Library
(see description of the talk)
ODIN (ORCID and DataCite Interoperability Network) is a 2-year project concerned with linking authors with research output. It's also looking at datasets, grey literature, etc. What do we mean by identifying authors? The answer varies. One person can have lots of identifiers and profiles, including an institutional profile, an ISNI, a Scopus ID etc. or an ORCiD.
So the first distinction is between an identifier and a profile. We usually think of identifiers as unique ID but a profile can be much more. Another important point is that no one wants to type the same thing twice. Profiles can be automated or manual. Then there is the difference of identifiers as Institutions or Users, with conflicting notions of control but we all want disambiguation...
So what we need... One identifier and many profiles that solve different use cases.
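The "one identifier, many profiles" idea can be shown as a small data structure. This is only a sketch in Python: the `Researcher` class, its fields and the profile contents are hypothetical, and the iD used is the sample one from ORCID's own documentation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Researcher:
    identifier: str  # the single persistent ID, e.g. an ORCID iD
    profiles: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_profile(self, system: str, data: Dict[str, str]) -> None:
        """Attach a system-specific profile; the identifier itself never changes."""
        self.profiles[system] = data


def same_person(a: Researcher, b: Researcher) -> bool:
    """Disambiguation relies on the identifier, not on names typed into profiles."""
    return a.identifier == b.identifier


# Two systems hold different profiles, with different name forms, for one person.
library_view = Researcher("0000-0002-1825-0097")
library_view.add_profile("institution", {"name": "J. Carberry"})
publisher_view = Researcher("0000-0002-1825-0097")
publisher_view.add_profile("publisher", {"name": "Josiah Carberry"})
```

Each profile can serve its own use case and be filled in automatically or by hand, while the shared identifier is what lets systems agree they are talking about the same author.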
ORCiD is meant to be a more open identifying system, managed for people with many different use cases. Relevant to publishers, unis, funders and libraries. It would help systems to talk to each other.
EThOS is the e-theses online service; import demo at http://ethos-orcid.appspot.com
The ODIN project is working to integrate ORCiD and DataCite.
The Mechanical Curator
ELAG 2014
The Surprising Adventures of the Mechanical Curator, and Other Tales
Ben O’Steen, technical leader of British Library Labs
(see description of the talk and the slides)
This project started last year, as an accident! Taking the stuff that's technically accessible... and making it accessible! Engaged with the researchers, formally and informally through yearly competitions. What they win is our time and effort! The unifying theme to (pretty much) all the requests is: Give us everything! But this is quite depressing: so librarians don't take part in research, they're only there to provide content? Another theme is to have tools to interpret the content, to be able to work on broad sweeps of content rather than one at a time.
The Sample Generator shows the chasm between the collection and the digitised material. Not only is much of the content not digitised, but what is digitised is not as accessible as it could be.
The challenge was that researchers didn't want to work with APIs but to access large amounts of data. An experiment was made: face detection on 19th-century illustrations. It wasn't very successful: the depiction is usually "clean" and posed, males are represented differently from females and therefore less often detected, etc. But it gave the idea of the mechanical curator, who digs in the collection of digitised images and tries to find visually similar images, based on a calculated match. It has now been doing that for a couple of months (and tweets about it). An unguided way of discovering material.
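One common way to compute a "calculated match" between images is to reduce each image to a numeric feature vector and compare vectors. The sketch below uses cosine similarity in Python; it is an assumption for illustration, not the Mechanical Curator's actual method, and the feature values and image names are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return 1.0 for vectors pointing the same way and 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_id: str, features: Dict[str, List[float]]) -> Tuple[str, float]:
    """Find the image whose feature vector is closest to the query image's."""
    query = features[query_id]
    candidates = (
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in features.items()
        if other_id != query_id
    )
    return max(candidates, key=lambda pair: pair[1])


# Hypothetical feature vectors (for instance, coarse brightness histograms).
features = {
    "map_01": [1.0, 0.0, 0.0],
    "map_02": [0.9, 0.1, 0.0],
    "portrait_01": [0.0, 1.0, 0.0],
}
best_id, score = most_similar("map_01", features)
```

Repeating this step from each newly chosen image gives the unguided walk through the collection that the talk describes.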
Images were published on Flickr, with many views in 4 days. They are published as CC0 and there are already examples of creative re-uses, such as colouring-in for children, an artist's interpretation etc. But this doesn't bring money to the Library, which is always hard to justify. But this is encouraging creativity; it may not be research but it's no less important. An animation student used images to represent them in 3-D. Moments, by Joe Bell
So the impact is hard to measure. Accessible is great, but can we make it more useful? A group of UCL Big Data CS students will be given access to all the book data and to cloud computing, and will make an experiment for broader and more direct access to the collections.
Thursday, 12 June 2014
Data visualisation
ELAG 2014
Data visualization as a library service? Examples from Chalmers library
Stina Johansson, Librarian/bibliometrician at Chalmers Library, Sweden
(see description of the talk)
Father of Data? William Playfair (1759-1823), Scottish engineer and political economist. Developed the first data charts. "The king at once understood the charts and was highly pleased. He said they spoke all languages..."
Graphics can make data much more "readable". A lot of communication is done through images. Chalmers Library uses a lot of visualisation to communicate about its data. E.g. topical analysis through keywords or citations networks or geospatial visualisations. E.g. Author co-citation analysis: a map representing the most cited authors, with links between them meaning various things, size or centrality of "dots" providing visual information.
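The data behind an author co-citation map is a count of how often two authors are cited together in the same paper. A minimal Python sketch, assuming each paper is represented simply as a list of the authors it cites (the author labels are placeholders):

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def cocitation_counts(reference_lists: List[List[str]]) -> Counter:
    """Count, for each pair of authors, the papers that cite both of them."""
    counts: Counter = Counter()
    for cited_authors in reference_lists:
        # Sort and de-duplicate so each pair is counted once per paper,
        # always under the same (alphabetical) key.
        for pair in combinations(sorted(set(cited_authors)), 2):
            counts[pair] += 1
    return counts


# Three hypothetical papers and the authors each one cites.
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author C", "Author B"],
]
counts = cocitation_counts(papers)
strongest: List[Tuple[Tuple[str, str], int]] = counts.most_common(2)
```

The pair counts become the link weights in the map; a tool such as Gephi or VOSviewer can then lay out the network so that size and centrality carry the visual information.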
Some of the visualisations are interactive - clicking brings to additional or linked info. Also visualisations of country level collaboration with environmental department data, for example.
See: http://chalmeriana.lib.chalmers.se/visuals/journal_citation/
Tools
Gephi: open source visualisation tool
Raw: open web app to create vector-based visualisations (from spreadsheets, potentially?), built on top of the D3.js library (JavaScript) through a simple interface
VOSviewer: can use with a raw text file, easy to use
Data has to be clean and structured though! Be creative and play!
EuropeanaBot
ELAG 2014
EuropeanaBot – using open data and open APIs to present digital collections
Peter Mayr, administrator for the ILL-system at the North Rhine-Westphalian library consortium (hbz) in Cologne
(see description of talk)
Serendipity vs standard search. The library is a "precious provider of unpredictability"
TwitterBots are a class of software. EuropeanaBot is based on the Europeana API: it uses open data collections to sweep up interesting things and make some kind of catalogue enrichment, e.g. a list of Nobel Prize winners, the Guardian API, place names etc. The Guardian API allows it to get news and keywords with corresponding images. Wordnik API: every day at 1pm a word is published and the bot looks in Europeana for relevant images. The Wikipedia API works in a similar way.
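The bot's basic loop (take a keyword from some open source, search the collection, post the first hit) can be sketched without touching any real API. In the Python below the `search` callable, the record fields and the 140-character limit are assumptions for illustration; the real bot's code is in the repository linked at the end of these notes.

```python
from typing import Callable, Dict, List, Optional

Item = Dict[str, str]


def compose_post(
    keyword: str,
    search: Callable[[str], List[Item]],
    max_len: int = 140,
) -> Optional[str]:
    """Build a short post for the first search hit, or None if nothing fits."""
    results = search(keyword)
    if not results:
        return None
    item = results[0]
    prefix = f"{keyword}: "
    suffix = f" {item['url']}"
    room = max_len - len(prefix) - len(suffix)
    if room <= 0:
        return None
    # Trim the title so that keyword, title and link fit the length limit.
    return prefix + item["title"][:room] + suffix


def fake_search(keyword: str) -> List[Item]:
    """Stand-in for a collection search API (hypothetical data)."""
    catalogue = {"cat": [{"title": "Cat on a chair", "url": "http://example.org/1"}]}
    return catalogue.get(keyword, [])


post = compose_post("cat", fake_search)
nothing = compose_post("dog", fake_search)
```

Because the keyword source and the search are passed in, the same function works whether the word comes from a word-of-the-day service, a news feed or a list of place names.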
There's of course a Digital Persona behind the EuropeanaBot (he likes to post images of cats).
Conclusion: we hide great objects behind search forms, so we need more serendipity! Let our collections speak for themselves. Not too much work and maintenance is required and it brings results. People out there will listen.
Code behind the EuropeanaBot api: https://github.com/hatorikibble/twitter-europeanabot
The Revolution
ELAG 2014
The black box opens
Marina Muilwijk, software developer, University of Utrecht
(see description of talk)
Web services replace a combination of files and SQL. E.g. a system that can talk from the repository to the digitisation process or to the catalogue etc. Without an API this would be really complicated. So you must think about what the API should do and where the data is coming from.
Example: a mobile view of an individual's loans. Instead of talking directly to the circulation system, the mobile view talks to the outside layer via the API and finds the info there.
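A rough sketch of that layering, under stated assumptions: `CirculationBackend` is an in-memory stand-in for the circulation system and `loans_endpoint` is a hypothetical handler of the kind an API layer might expose. Neither name comes from the talk; the point is only that the client sees JSON from the API and never the backend itself.

```python
import json


class CirculationBackend:
    """Stand-in for the circulation system (hypothetical, in-memory)."""

    def __init__(self, loans_by_patron):
        self._loans_by_patron = loans_by_patron

    def loans_for(self, patron_id):
        return self._loans_by_patron.get(patron_id, [])


def loans_endpoint(backend, patron_id):
    """Return the JSON body a loans request for one patron might get back."""
    loans = backend.loans_for(patron_id)
    body = {"patron": patron_id, "count": len(loans), "loans": loans}
    return json.dumps(body, sort_keys=True)


backend = CirculationBackend(
    {"p1": [{"title": "The Lean Startup", "due": "2014-07-01"}]}
)
print(loans_endpoint(backend, "p1"))
```

Swapping the backend for another system would not change what the mobile view receives, which is the benefit the talk describes.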
The "revolution" is inspired by the book "The Lean Startup": start with a hypothesis, test it and measure, i.e. build, measure, learn and so on (in a circle with no starting point). At the outset, it is not clear what the end product will be. If you ask users you may not get very far... So from wrong to right: success is when requirements are implemented vs success is when users use it!
Revolution represented here
Useful and Usable Web Services
ELAG 2014
Building Useful & Usable Web Services
Steve Meyer, Technical Product Manager, OCLC WorldShare Platform
(see also description of this talk)
API = a system of tools and resources in an OS enabling developers to create software applications. For data, an example would be to create an aspirational view of what your data could look like, i.e. expose it the way you want it to look. Standards should be used to give a clear understanding of your data model, serialisation and the statements you want to make. Think of the community you're aiming at but also the one you want to belong to. A good standard will be stable.
SQL is a way of creating APIs. The language is not that distant from HTTP methods (such as POST, GET etc.) for providing access to data. As API creators we can use any respectable programming language.
WorldCat metadata as a case study. It is made of core assets with an intuitive API. Data modelling is used to carry out all sorts of tasks. An example of an API would be the validation of a MARC record, with messages such as "008 must be present" etc.
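To illustrate what such validation messages could look like, here is a small sketch. It is not OCLC's API: the record is simplified to a dictionary of tag to list of values, and the set of checks (008 and 245 required, 008 exactly 40 characters as in MARC 21 bibliographic records) is an assumption chosen for the example.

```python
# Tags treated as mandatory in this sketch (an assumption for illustration).
REQUIRED_TAGS = ("008", "245")


def validate_marc(record):
    """Return a list of human-readable problems found in a simplified record.

    `record` maps a MARC tag to a list of field values, e.g.
    {"245": ["A title"]}. An empty list means no problems were found.
    """
    messages = []
    for tag in REQUIRED_TAGS:
        if not record.get(tag):
            messages.append(f"{tag} must be present")
    for value in record.get("008", []):
        if len(value) != 40:
            messages.append("008 must be 40 characters long")
    return messages


print(validate_marc({"245": ["Building useful and usable web services"]}))
```

Returning messages rather than raising on the first error lets a client show every problem in one round trip, which fits the usability point made in the talk.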
Issues around authentication. In a web service context, it's not a person you're authenticating. Access to a dataset by a machine is not without consequence. An API "key" allows me to enter but I still need to provide identification. At OCLC they try to provide equivalence between the API and the web service after authentication (?)
Other issues: software is never perfect, documentation always has biases... E.g. OCLC's "holding availability" API, with intended use: connect a patron to a library that holds an item, but actual use: read high-volume holdings info for analysis. Sometimes things go wrong.
Consider the principles of usability in the way that you would in a user interface context.
Questions:
Is it best to build an API on top of your own system rather than someone else's? Yes, though that is not always possible.
How to guarantee openness and re-sharing? Most OCLC APIs are operational data, so we don't think about that. But we think licensing and rights can be integrated in the data itself.
Who should write the documentation: the technical people, an editor etc.? It can be done as part of the development process, but is then mostly useful for highly technical people. The other option is to use people closer to the customers, such as product managers etc.
Europeana: Collection level description
ELAG 2014
Discovering libraries’ gold through collections-level descriptions
Valentine Charles, Data Specialist at The European Library and Europeana
(See also description of talk)
Europeana works in a large-scale aggregation ecosystem and now collaborates with Cendari. Digitisation is still only the tip of the iceberg of the content of European libraries. Digital objects are displayed attractively as they are very visual. There is also full text available. But most of these items are disconnected from one another. Mostly it is item-level description, with different levels of quality. Wouldn't it be nice to link an image to the relevant journal page?
New strategy for collection-level description. Looking at specific topics, talking to historians, surveying members. E.g. Old Slavic Manuscripts - not digitised but at least there is some description. Another example would be not so much a subject but the specific collections from a particular library, e.g. National Library of Serbia.
Collaboration with Cendari, with libraries and archives, on integrating digital data about medieval times and the First World War. Researchers working on a project should use this data and we try to facilitate the research activity. The aim is to link the material from different libraries. An environment called the Archival research guide is being built to support research. The idea is: when a researcher starts on a topic, he/she writes some paragraphs which are incorporated in the guide so that it becomes a narrative linking to specific collections (the sources). Researchers are encouraged to use and re-use data directly in the research, rather than only talking about it. Tools such as NER (named-entity recognition), to help identify the entities used in the research, or facets etc. are made available. The Archival research guides provide access points to relevant contemporary research and connect collection descriptions to other resources via domain-specific ontologies.
Beyond collection description, the most interesting part is linking it to other data and other types of material, e.g. with some full text, enriching with annotations, vocabularies, NER etc. They are also interested in getting this data re-integrated in the various libraries. Cendari is a 4-year project; there are 2 more years to go.
Wednesday, 11 June 2014
MIF and Europeana Inside
ELAG 2014
Metadata Interoperability Framework (MIF)
Naeem Muhammad, Software Architect at LIBIS KULeuven and Sam Alloing, Business Consultant at LIBIS KULeuven, Belgium
Made for Europeana Inside. This is a technical project with different partners, content providers and software providers. Its aim is a better integration of the different content in Europeana. So the goal is to create a component that developers of the different systems can add directly into their content management system and that will talk to Europeana. The end of this project is planned for Sept. 2014.
The content is enriched by Europeana and the content providers can get it back. This is still in discussion and development. The enriched metadata is not always correct so there are still issues to resolve.
ECK = Europeana Connection Kit. Technical providers use this to transform and push data to Europeana. The ECK local is the part to be integrated in the local system itself. Core ECK services include:
- Metadata definition
- Set Manager
- Statistics
- PID generation
- Preview service
- Validation of metadata
- Data push (SWORD) / data pull (OAI-PMH): some content providers would rather push their data to Europeana than have Europeana pull it, but Europeana is afraid of compatibility issues, so data pull is still the one in use
- Mapping and Transformation
Mapping and Transformation supports MARC to EDM and LIDO to EDM because that's what is used in Europeana. LIDO = XML format used by museums, EDM = RDF format from Europeana. So the input has to be MARC XML or LIDO. The output is only EDM at the moment. There are core classes and additional classes. EDM uses Dublin Core. For MARC, it works like this:
[command],[marc tag + subfield],[edm field], e.g. COPY, marc506a,dc:rights
It doesn't use indicators at the moment; that could change.
Commands are: COPY, APPEND, SPLIT, COMBINE (multiple source fields can be combined in one target field), LIMIT (to limit the number of characters in a field), PUT, REPLACE, CONDITION (combine different actions and use a conditional flow; can be used with IF).
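The rule format above can be sketched in a few lines. This is not the MIF implementation: the record is simplified to a dictionary keyed by (tag, subfield), only COPY and APPEND are handled, and their semantics here (COPY sets the target field, APPEND adds to it) are an assumption. The other commands are left out because the notes do not give their syntax.

```python
import re

# Source fields look like "marc506a": three-digit tag plus one subfield code.
RULE_SOURCE = re.compile(r"^marc(\d{3})([a-z0-9])$")


def parse_rule(line):
    """Split "[command],[marc tag + subfield],[edm field]" into its parts."""
    command, source, target = [part.strip() for part in line.split(",")]
    match = RULE_SOURCE.match(source)
    if match is None:
        raise ValueError(f"unrecognised source field: {source!r}")
    return command.upper(), match.group(1), match.group(2), target


def apply_rules(record, rule_lines):
    """Apply COPY/APPEND rules to `record`, a {(tag, subfield): [values]} dict."""
    output = {}
    for line in rule_lines:
        command, tag, subfield, target = parse_rule(line)
        values = record.get((tag, subfield), [])
        if command == "COPY":
            output[target] = list(values)           # assumed: replaces target
        elif command == "APPEND":
            output.setdefault(target, []).extend(values)  # assumed: adds to target
        else:
            raise NotImplementedError(f"{command} is not covered by this sketch")
    return output


record = {("506", "a"): ["Open access"], ("245", "a"): ["A title"], ("245", "b"): ["a subtitle"]}
rules = ["COPY, marc506a,dc:rights", "COPY,marc245a,dc:title", "APPEND,marc245b,dc:title"]
print(apply_rules(record, rules))
```

Keeping the rules as plain text lines, as in the talk, means a cataloguer only needs to know the MARC field and the EDM field, not the RDF serialisation.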
The plan is that EDM has to be as easy for users as possible, even though some understanding will help. The important thing is to know the EDM field, not the format.
It's a web service, so there is no user interface. It is meant to be integrated in a CMS or used with a REST client. Parameters: records (can be a zip file, XML), mappingRulesFile, sourceFormat (LIDO or MARC), targetFormat (EDM)
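As a sketch of how a client might assemble those four parameters before sending them, assuming nothing about the real endpoint: the function name `build_transform_request` is hypothetical, the file names are placeholders, and only the parameter names and allowed formats come from the talk.

```python
# Formats named in the talk; anything else is rejected in this sketch.
ALLOWED_SOURCE_FORMATS = {"LIDO", "MARC"}
ALLOWED_TARGET_FORMATS = {"EDM"}


def build_transform_request(records, mapping_rules_file, source_format, target_format="EDM"):
    """Check the formats and return the request parameters as a dict."""
    source_format = source_format.upper()
    target_format = target_format.upper()
    if source_format not in ALLOWED_SOURCE_FORMATS:
        raise ValueError(f"unsupported sourceFormat: {source_format}")
    if target_format not in ALLOWED_TARGET_FORMATS:
        raise ValueError(f"unsupported targetFormat: {target_format}")
    return {
        "records": records,                      # path to a zip or XML file
        "mappingRulesFile": mapping_rules_file,  # path to the rules file
        "sourceFormat": source_format,
        "targetFormat": target_format,
    }


print(build_transform_request("records.zip", "rules.txt", "marc"))
```

A REST client or CMS plug-in would then post these fields to the service; how the files are uploaded depends on the service and is not covered in the notes.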
Future: add input formats, such as CSV, FileMaker XML, some custom XML etc. Update/add output formats (add EDM contextual classes, other formats...), extend/add update actions, add queuing (near future), add mapping interface (or integrate with MINT, another Europeana project for mapping)
Details, links and interface
ELAG 2014
The LIBRIS upgrade
Niklas Lindström, Lina Westerling, Swedish National Library
Abstract: Starting in earnest in 2012, The Swedish National Library (Kungliga Biblioteket – KB) began the development of a new infrastructure and system, based at its core on Linked Data. It directly employs the linked entity description model represented by RDF, and has the capacity to mesh with other linked data on the web, through minimal engineering efforts.
(see more description of talk)
Need for a modern product: a platform for making data searchable and for describing it, a method for mapping existing data to contemporary models of description and a user interface for editing (cataloguing, curating, linking). A web-based cataloguing tool.
The platform is Open Source and works with all data formats, including RDF etc. Limits of MARC: it is especially hard to find things and to define things. RDF is not a solution in itself but a means to help solve this problem, because of how it describes data. The tool is a simple expression independent of formats, terms etc. MARC is transformed into JSON-LD, with use of prefixes and URIs.
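For readers who have not seen JSON-LD, here is a minimal sketch of what a MARC to JSON-LD step with prefixes and URIs could look like. It is not the LIBRIS mapping: the field map, the `marc_to_jsonld` helper and the example.org record URI are assumptions; only the Dublin Core terms namespace is a real vocabulary URI.

```python
import json

# Prefixes are declared once in the JSON-LD context; "dc" is Dublin Core terms.
CONTEXT = {"dc": "http://purl.org/dc/terms/"}

# Hypothetical mapping from (MARC tag, subfield) to a prefixed property.
FIELD_MAP = {
    ("245", "a"): "dc:title",
    ("100", "a"): "dc:creator",
}


def marc_to_jsonld(record_uri, fields):
    """Turn simplified MARC fields into a JSON-LD description.

    `fields` is a list of (tag, subfield, value) tuples; fields without a
    mapping are skipped in this sketch.
    """
    doc = {"@context": CONTEXT, "@id": record_uri}
    for tag, subfield, value in fields:
        prop = FIELD_MAP.get((tag, subfield))
        if prop is not None:
            doc.setdefault(prop, []).append(value)
    return doc


example = marc_to_jsonld(
    "http://example.org/bib/1",
    [("245", "a", "Example title"), ("100", "a", "Example, Author"), ("999", "z", "local")],
)
print(json.dumps(example, indent=2))
```

The `@id` gives the description a URI that other data can link to, and the prefixes keep property names short while still resolving to full URIs.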
The Utter Denormalisation of turning JSON-LD back into MARC. This is a temporary measure, because the system needs to integrate with union catalogues, extract data etc. But the idea is that there's a new interface and new formats.
The design is intuitive, simple, inspiring, user-centred. See the beta at devkat.libris.kb.se (test/test). It is quite similar to an end-user search tool. It is based on linked data. It needs to handle all data. Normalising the catalogue will not be able to cover everything.
Doing the mapping is challenging: the data expressed in MARC isn't always normalised, so it's not clear if the description is an expression or a manifestation. MARC is very structured but sometimes meaningless, there's lots of convolution in the specificity, the perspectives of different domains are not well coordinated etc. But there are possibilities of capturing the specificity, better coordinating the vocabularies and so on. Then by linking to external resources we add more value to our resources. SPARQL is used to help in the linking of sources. Value can also be added by linking to internal data.
Another of the main challenges is convincing people, especially cataloguers, so we need to be open.
...in the precioussss knowledge
ELAG 2014
Lord of the strings – a somewhat expected journey
- Karen Coyle, Digital Libraries Consultant, USA
- Rurik Greenall, Developer, Norwegian University of Science and Technology (NTNU) Library
- Lukas Koster, Library Systems Coordinator, Library of the University of Amsterdam
- Martin Malmsten, Head of Development and Design, National Library of Sweden/LIBRIS
- Anders Söderback, Head of the Department of Publishing, Stockholm University Library
Sir Marc of the SubFields thinks the question is a waste of time: where is the lingering Gold?
Whilst Richard the Evangelist of the Dubliners sold us the Discovery Tool with a Search Box.
We need to find how to turn this straw... books, into gold: Drink from the magic cup of Sir Tim then wander in The Cloud. The RDA rules... so secret no one knows what it is.
The Story of Link-a-Lot: the Format Monster follows any format, the Deep Sea Owl jumped in the water and disappeared, the wolves Frbrooooo, the non-dead Marc and twin brother Mods... all enter the story.
The answer is in GOD On Tology, Godot. Begin your search waiting for Godot.
For a complete view of the story, see The Lord of the Strings slides
Role of libraries in supporting digital scholarship
ELAG 2014
Keynote: The Role of libraries in supporting digital scholarship
Stella Wisdom, Digital Curator, The British Library
(see description of the talk)
Need to change the services to meet the needs of researchers. There is more and more digital content, and increased collaborative working and re-purposing of content. The BL wants researchers to do innovative research with its content. The BL has been digitising for at least two decades and aims to do much more.
Digital content - examples of recent developments at the BL
- Georeferenced maps and new interactive tool (http://www.bl.uk/maps)
- Europeana 1914-1918 Roadshows - visited museums in different parts of the country, showing some of their digitised images
- Off the Map: video games festival, following a conference on the preservation of complex objects which Stella attended and where she gave her ideas of what the BL can do in this area. There's a museum about video games; the Victoria & Albert Museum also organised a competition. The BL made a special feature on the web archive. The BL organised a competition: Crytek Off the Map: a visual trip through 17th-century London made by six 2nd-grade students (winners of last year's competition)
- Work done on sound collections, with permissions to re-use (under certain conditions) see the Flying Buttress
- Organising exhibitions such as Beautiful Science (picturing data, inspiring insight)
- Dora's lost data game
- British Library labs: one of the main actions is a yearly competition to identify innovative ideas that showcase the Library's collections
- The Victorian Meme Machine, to preserve Victorian jokes (one of the winners of the labs competition) - it will combine jokes with images, all coming from the BL collections