
Linking Tabular Columns to Unseen Ontologies

  • Conference paper
  • The Semantic Web – ISWC 2023 (ISWC 2023)

Abstract

We introduce a novel approach for linking table columns to types in an ontology unseen during training. As the target ontology is unknown to the model during training, this may be considered a zero-shot linking task at the ontological level. This task is often a requirement for businesses that wish to semantically enrich their tabular data with types from their custom or industry-specific ontologies without the benefit of initial supervision. In this paper, we describe specific approaches and provide datasets for this new task: training models on open domain tables using a broad source ontology and evaluating them on increasingly difficult tables with target ontologies having different levels of type granularity. We use pre-trained Transformer encoder models and a range of encoding strategies to explore methods of encoding increasing amounts of ontological knowledge, such as type glossaries and taxonomies, to obtain better zero-shot performance. We demonstrate these results empirically through extensive experiments on three new public benchmark datasets.
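To make the task concrete, the sketch below shows one way such zero-shot linking can be set up: a serialized column and serialized candidate types from the (unseen) target ontology are embedded with a pre-trained Transformer encoder, and the column is linked to the most similar type. This is a minimal illustration rather than the paper's implementation; the `bert-base-uncased` checkpoint, the serialization formats, and mean pooling are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): zero-shot column-to-type linking by
# embedding a serialized column and serialized candidate types with a
# pre-trained Transformer encoder, then choosing the most similar type.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-uncased"  # assumption: any pre-trained encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states into one vector."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, dim)

def serialize_column(header: str, cells: list) -> str:
    # Assumption: a simple "header: cell, cell, ..." serialization of the column.
    return f"{header}: " + ", ".join(cells)

def serialize_type(label: str, gloss: str = "") -> str:
    # Candidate type from the unseen target ontology: label plus optional glossary.
    return f"{label}. {gloss}".strip()

def link_column(header, cells, ontology_types):
    """Return the label of the ontology type most similar to the column."""
    col_vec = embed(serialize_column(header, cells))
    scores = {
        label: torch.cosine_similarity(col_vec, embed(serialize_type(label, gloss))).item()
        for label, gloss in ontology_types.items()
    }
    return max(scores, key=scores.get)

# Hypothetical DBpedia-style target types, never seen during training.
target_types = {
    "dbo:City": "a relatively large and permanent human settlement",
    "dbo:Country": "a distinct territorial and political entity",
}
print(link_column("capital", ["Paris", "Berlin", "Madrid"], target_types))
```

In the paper's setting, the encoder would first be trained on open-domain tables against a broad source ontology (e.g., Wikidata labels) and only then applied, without further supervision, to the custom target ontology.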



Author information

Correspondence to Sarthak Dash.

A Appendix

A.1 Model Predictions

The tables below show examples of predictions returned by our proposed model, built using a pre-trained TinyBERT encoder. The model is trained using Wikidata labels. For the top two tables it is asked to predict from the DBpedia target ontology; for the bottom two tables it predicts from the UMLS Semantic Network (UMLS SN).

Within each block titled Top model prediction, the first row shows predictions made using type labels only, the second row shows predictions using type labels and their associated glossaries, and the final row shows predictions using our proposed encoding strategy. Note that the BioDivTab benchmark does not contain table metadata.
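The three rows thus correspond to increasingly rich ways of turning a target-ontology type into text before encoding it. The sketch below is a hypothetical illustration of such serializations, assuming label-only, label-plus-glossary, and label-plus-glossary-plus-taxonomy variants; the field order and separators are illustrative choices, not the paper's exact format.

```python
# Illustrative sketch (an assumption, not the paper's exact encoding): three ways
# of serializing an ontology type before passing it to the encoder, mirroring the
# three rows in the prediction tables.
from dataclasses import dataclass

@dataclass
class OntologyType:
    label: str                 # e.g. a UMLS SN or DBpedia type label
    glossary: str = ""         # short textual definition, if the ontology provides one
    ancestors: tuple = ()      # path towards the taxonomy root, nearest ancestor first

def encode_label_only(t: OntologyType) -> str:
    return t.label

def encode_label_glossary(t: OntologyType) -> str:
    return f"{t.label}. {t.glossary}".strip()

def encode_label_glossary_taxonomy(t: OntologyType) -> str:
    # Append the ancestor chain so the encoder also sees the type's place
    # in the taxonomy; the "[taxonomy]" marker and " > " separator are assumptions.
    path = " > ".join((t.label,) + t.ancestors)
    return f"{t.label}. {t.glossary} [taxonomy] {path}".strip()

# Hypothetical UMLS SN-style type used only to show the three serializations.
t = OntologyType(
    label="Antibiotic",
    glossary="a pharmacologically active compound that kills or inhibits bacteria",
    ancestors=("Pharmacologic Substance", "Chemical", "Entity"),
)
for fn in (encode_label_only, encode_label_glossary, encode_label_glossary_taxonomy):
    print(fn.__name__, "->", fn(t))
```

Each serialized string would then be embedded by the same encoder used for the columns, as in the sketch following the abstract.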

[Figure a and Figure b: tables of example model predictions]


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dash, S., Bagchi, S., Mihindukulasooriya, N., Gliozzo, A. (2023). Linking Tabular Columns to Unseen Ontologies. In: Payne, T.R., et al. (eds.) The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol. 14265. Springer, Cham. https://doi.org/10.1007/978-3-031-47240-4_27


  • DOI: https://doi.org/10.1007/978-3-031-47240-4_27


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47239-8

  • Online ISBN: 978-3-031-47240-4

  • eBook Packages: Computer Science, Computer Science (R0)
