Importance of HTML structural elements and metadata in automated subject classification

Author

  • Koraljka Golub
  • Anders Ardö

Summary, in English

The study aimed to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several methods: total and partial precision and recall, semantic distance, and multiple regression. The results showed that all the elements have to be included in the classification process for best results. The exact way of combining the significance indicators turned out not to be overly important: measured by F1, the best combination of significance indicators performed no more than 3% better than the baseline.
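
As a rough illustration of the idea in the summary, the sketch below combines per-element match scores into a single class score using one weight (significance indicator) per element, and computes F1 for a set of assigned classes. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the weight values, and the class labels are all hypothetical, and a plain weighted sum is assumed as the combination rule.

```python
from typing import Dict, Set

# Hypothetical significance indicators, one per Web page element.
# These values are illustrative only and are not taken from the study.
WEIGHTS: Dict[str, float] = {
    "metadata": 2.0,
    "title": 2.0,
    "headings": 1.0,
    "text": 0.5,
}


def combined_score(element_scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted sum of the match scores a class received in each element.

    Assumption: elements are combined linearly; the summary does not
    state the exact combination rule.
    """
    return sum(weights[element] * score
               for element, score in element_scores.items())


def f1(assigned: Set[str], correct: Set[str]) -> float:
    """F1 (harmonic mean of precision and recall) for one page."""
    true_positives = len(assigned & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(assigned)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)


# Usage with made-up scores and class labels.
score = combined_score({"title": 2.0, "text": 3.0}, WEIGHTS)
quality = f1({"class_a", "class_b"}, {"class_b", "class_c"})
```

Comparing F1 for different weight settings against a baseline in which every element has the same weight corresponds to the kind of comparison the summary reports.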

Publishing year

2005

Language

English

Pages

368-378

Publication/Series

Research and Advanced Technology for Digital Libraries / Lecture Notes in Computer Science

Volume

3652

Document type

Conference paper

Publisher

Springer

Topic

  • Electrical Engineering, Electronic Engineering, Information Engineering

Conference name

9th European Conference, ECDL 2005

Conference date

2005-09-18 - 2005-09-23

Conference place

Vienna, Austria

Status

Published

ISBN/ISSN/Other

  • ISSN: 0302-9743
  • ISSN: 1611-3349
  • ISBN: 3-540-28767-1
  • doi:10.1007/3-540-45747-X