Automatic index construction for multimedia digital libraries

https://doi.org/10.1016/j.ipm.2009.10.006

Abstract

Indexing remains one of the most popular tools provided by digital libraries to help users identify and understand the characteristics of the information they need. Despite extensive studies of the problem of automatic index construction for text-based digital libraries, the construction of indexes for multimedia digital libraries continues to represent a challenge, because multimedia objects usually lack sufficient text information to ensure reliable index learning. This research attempts to tackle the problem of automatic index construction for multimedia objects by employing Web usage logs and limited keywords pertaining to multimedia objects. The two proposed algorithms are tested on two data sets that contain different amounts of textual information. Web usage logs offer valuable information for building indexes of multimedia digital libraries with limited textual information. The proposed methods generally yield better indexes, especially for the artwork data set.

Introduction

The rapid advances of information technologies have allowed for the inclusion of vast amounts of electronic information in digital libraries. This electronic information initially was primarily text-based, but it has expanded to include graphics, animation, audio, video, and interactive media (Tjondronegoro & Spink, 2008). Thus, the ability to help users easily, efficiently, and conveniently retrieve multimedia information from the vast array available presents both an opportunity and a challenge for modern digital libraries.

In traditional text-based digital libraries, indexing provides the main tool to help users seek information and understand the topics contained within documents of interest to them (Berry & Castellanos, 2007). Many researchers address the challenge of indexing text-based information by leveraging content features derived from titles, keywords, abstracts, or full texts and thereby determining similarities among objects (Berry & Castellanos, 2007; Boley et al., 1999; Zhao, 2002). A clustering technique then develops a set of clusters, each of which receives a label, assigned manually or automatically (Yang & Pederson, 1997), that distinguishes the documents in that cluster from those in other clusters. The clusters then form an index for organizing text-based information.

Indexing multimedia information, however, is more challenging, because these data comprise opaque collections of bytes with limited textual information, such as short titles, the date of creation, or names of the artists (Mehtre, Kankanhalli, & Lee, 1997). Despite the existence of some techniques for automatic keyword extraction from multimedia objects in specific domains, the number of derived keywords and their accuracy remain limited (Tsai, McGarry, & Tait, 2006). Furthermore, with limited textual information for multimedia objects, traditional text-based clustering approaches may not work well. Therefore, a pressing need emerges, namely, to integrate other sources of data to cluster objects in a multimedia digital library.

With the advent of the World Wide Web, an overwhelming number of digital libraries now provide interfaces that allow for ubiquitous information access. The usage data associated with Web-based digital libraries are automatically recorded in Web usage logs by Web servers. Each user click within a Web-based digital library therefore results in one or more records in the Web usage log, such that each record contains the source IP, access time, access method/URL/protocol, referring URL, status, bytes transferred, browser type, and so forth. Table 1 displays a sample Web usage log for the electronic thesis and dissertation (ETD) system at National Sun Yat-Sen University. The first record in Table 1 shows that an entry with a universal resource number etd-0717101-163917 was accessed on 01/Apr/2004:00:00:02 by a user with the IP address 218.165.248.55. The second and third records in Table 1 indicate a user who has chosen to view an entry identified by etd-0130101-140550. Objects of the same category logically should have a higher chance of being accessed together compared with objects in different categories. Therefore, we propose to tackle the problem of indexing multimedia information by employing Web usage logs, in combination with limited keywords attached to multimedia information.
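To make the log fields and the co-access intuition concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes the server writes the common Apache/NCSA "combined" log layout, which the article does not confirm. The sample lines are hypothetical: only the IP address 218.165.248.55, the timestamp, and the two entry identifiers come from the text, whereas the URL path, the second IP address, byte counts, and browser strings are invented for illustration. Treating one IP address as one user is also a simplifying assumption.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumed layout: Apache/NCSA "combined" log format (not confirmed by the article).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Entry identifiers shaped like etd-0717101-163917 (pattern inferred from two examples).
OBJECT_ID_PATTERN = re.compile(r'etd-\d{7}-\d{6}')

# Hypothetical records; the "/view?id=" path and the 192.0.2.10 address are made up.
SAMPLE_LOG = [
    '218.165.248.55 - - [01/Apr/2004:00:00:02 +0800] '
    '"GET /view?id=etd-0717101-163917 HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
    '192.0.2.10 - - [01/Apr/2004:00:00:05 +0800] '
    '"GET /view?id=etd-0130101-140550 HTTP/1.1" 200 4096 "-" "Mozilla/4.0"',
    '192.0.2.10 - - [01/Apr/2004:00:00:09 +0800] '
    '"GET /view?id=etd-0130101-140550 HTTP/1.1" 200 4096 "-" "Mozilla/4.0"',
    '192.0.2.10 - - [01/Apr/2004:00:01:30 +0800] '
    '"GET /view?id=etd-0717101-163917 HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]


def parse_record(line):
    """Split one log line into named fields; return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    found = OBJECT_ID_PATTERN.search(record["url"])
    record["object_id"] = found.group(0) if found else None
    return record


def co_access_counts(lines):
    """Count, for each pair of objects, how many IP addresses accessed both."""
    objects_by_ip = defaultdict(set)
    for line in lines:
        record = parse_record(line)
        # Keep only successful requests that name a recognizable object.
        if record and record["object_id"] and record["status"] == "200":
            objects_by_ip[record["ip"]].add(record["object_id"])

    counts = defaultdict(int)
    for object_ids in objects_by_ip.values():
        # Sorting gives each unordered pair a single canonical key.
        for first, second in combinations(sorted(object_ids), 2):
            counts[(first, second)] += 1
    return dict(counts)


if __name__ == "__main__":
    print(parse_record(SAMPLE_LOG[0]))
    print(co_access_counts(SAMPLE_LOG))
```

Pair counts of this kind are one simple way to turn raw log records into a usage-based signal: pairs that many users access together are candidates for the same index category. A real deployment would likely need session boundaries and robot filtering, which this sketch omits.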

This article reports our endeavor to integrate textual data and usage data pertaining to multimedia objects and thus build the index. We develop two methods to construct an index for multimedia objects, each of which employs both the (possibly limited) textual data associated with the objects and their usage data over a specified period of time, as recorded by Web servers. One method, called MCAT, applies both clustering and classification techniques, and the other, MCLU, uses only clustering techniques. We apply both methods, as well as methods that use only textual data, to two data sets derived from the World Art digital library from Airiti, Inc., in Taiwan and the ETD system at National Sun Yat-Sen University. The World Art digital library involves only a limited amount of textual information pertaining to images of artwork. The evaluation results using this data set show that an index constructed by considering both usage and content data better matches the predefined index than does an index that uses only one source of data. In addition, the resulting index effectively reduces users' efforts to find the information they require. The ETD system contains a substantial amount of textual information, in addition to usage data, so we use it to investigate how our proposed methods perform for a digital library with rich textual data. Compared with traditional text-based approaches, the indexes created by our proposed methods are only slightly inferior in terms of matching the predefined index. Nevertheless, our proposed indexes retain the advantage of enabling users to identify the information they need quickly. We thus conclude that the proposed methods offer promising improvements for building indexes for multimedia digital libraries.

The remainder of this article is organized as follows: In Section 2, we review related research efforts. In Section 3, we describe our methods for indexing multimedia information, using textual content information and Web usage logs. We report the results of our experiments in Section 4 and evaluate the various methods by applying real-world data collected from the World Art digital library and an ETD system. Finally, in Section 5, we summarize and point to some further research directions.

Literature review

Digital libraries attract tremendous interest, including several research projects that attempt to address the vast challenges in this field, such as the Alexandria Digital Library (ADL) project at the University of California at Santa Barbara (Manjunath & Ma, 1996), the DLI project at the University of Illinois (Chen et al., 1996), the Informedia project at Carnegie Mellon University (Wactlar, Kanade, Smith, & Stevens, 1996), the Variations2 project at Indiana University (Byrd & Isaacson, 2003

Proposed methods

Most previous work employs textual information to construct document indexes, though more recent work also facilitates clustering with Web usage logs. We observe, though, that multimedia digital libraries often lack sufficient textual information, and we propose constructing an index of multimedia objects by employing both (textual) content and usage data. We define the content similarity between two multimedia objects according to their textual data. Specifically, every multimedia object can be

Data sets

To evaluate our proposed methods, we collected data from two test beds: the World Art Digital Library from Airiti, Inc. (http://www.airiti.com/Arts), whose home page (in Chinese) is in Fig. 3, and the ETD System at National Sun Yat-Sen University (NSYSU) (http://www.lib.nsysu.edu.tw/eThesys/), whose English home page appears in Fig. 4. The World Art Digital Library contains a limited amount of textual information, whereas the NSYSU ETD System provides abundant textual content. We also obtained

Conclusions

In this article, we address the problem of index construction for multimedia digital libraries by developing two index construction methods, MCAT and MCLU. These two methods employ primitive keywords and usage data to develop an index. The empirical experiments reveal that compared with traditional content-based clustering methods, our methods, when applied to digital libraries with limited textual data, generate indexes that exhibit better content and usage entropies. For digital libraries
