
Linking Tabular Columns to Unseen Ontologies

  • Conference paper
  • The Semantic Web – ISWC 2023 (ISWC 2023)

Abstract

We introduce a novel approach for linking table columns to types in an ontology unseen during training. As the target ontology is unknown to the model during training, this may be considered a zero-shot linking task at the ontological level. This task is often a requirement for businesses that wish to semantically enrich their tabular data with types from their custom or industry-specific ontologies without the benefit of initial supervision. In this paper, we describe specific approaches and provide datasets for this new task: training models on open domain tables using a broad source ontology and evaluating them on increasingly difficult tables with target ontologies having different levels of type granularity. We use pre-trained Transformer encoder models and a range of encoding strategies to explore methods of encoding increasing amounts of ontological knowledge, such as type glossaries and taxonomies, to obtain better zero-shot performance. We demonstrate these results empirically through extensive experiments on three new public benchmark datasets.
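To make the task concrete, the sketch below shows one way such zero-shot linking can be set up: a serialized column and serialized candidate types from the (unseen) target ontology are embedded with a pre-trained Transformer encoder, and the column is linked to the most similar type. This is a minimal illustration rather than the paper's implementation; the `bert-base-uncased` checkpoint, the serialization formats, and mean pooling are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): zero-shot column-to-type linking by
# embedding a serialized column and serialized candidate types with a
# pre-trained Transformer encoder, then choosing the most similar type.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-uncased"  # assumption: any pre-trained encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states into one vector."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, dim)

def serialize_column(header: str, cells: list) -> str:
    # Assumption: a simple "header: cell, cell, ..." serialization of the column.
    return f"{header}: " + ", ".join(cells)

def serialize_type(label: str, gloss: str = "") -> str:
    # Candidate type from the unseen target ontology: label plus optional glossary.
    return f"{label}. {gloss}".strip()

def link_column(header, cells, ontology_types):
    """Return the label of the ontology type most similar to the column."""
    col_vec = embed(serialize_column(header, cells))
    scores = {
        label: torch.cosine_similarity(col_vec, embed(serialize_type(label, gloss))).item()
        for label, gloss in ontology_types.items()
    }
    return max(scores, key=scores.get)

# Hypothetical DBpedia-style target types, never seen during training.
target_types = {
    "dbo:City": "a relatively large and permanent human settlement",
    "dbo:Country": "a distinct territorial and political entity",
}
print(link_column("capital", ["Paris", "Berlin", "Madrid"], target_types))
```

In the paper's setting, the encoder would first be trained on open-domain tables against a broad source ontology (e.g., Wikidata labels) and only then applied, without further supervision, to the custom target ontology.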



Author information

Correspondence to Sarthak Dash.

A Appendix

A.1 Model Predictions

The tables below show examples of predictions returned by our proposed model, built using a pre-trained TinyBERT encoder. The model is trained using Wikidata labels. For the top two tables it is asked to predict from the DBpedia target ontology; for the bottom two tables it predicts from the UMLS Semantic Network (UMLS SN).

Within each block titled Top model prediction, the first row shows predictions made using type labels only, the second row shows predictions using type labels and their associated glossaries, and the final row shows predictions using our proposed encoding strategy. Note that the BioDivTab benchmark does not contain table metadata.
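The three rows thus correspond to increasingly rich ways of turning a target-ontology type into text before encoding it. The sketch below is a hypothetical illustration of such serializations, assuming label-only, label-plus-glossary, and label-plus-glossary-plus-taxonomy variants; the field order and separators are illustrative choices, not the paper's exact format.

```python
# Illustrative sketch (an assumption, not the paper's exact encoding): three ways
# of serializing an ontology type before passing it to the encoder, mirroring the
# three rows in the prediction tables.
from dataclasses import dataclass

@dataclass
class OntologyType:
    label: str                 # e.g. a UMLS SN or DBpedia type label
    glossary: str = ""         # short textual definition, if the ontology provides one
    ancestors: tuple = ()      # path towards the taxonomy root, nearest ancestor first

def encode_label_only(t: OntologyType) -> str:
    return t.label

def encode_label_glossary(t: OntologyType) -> str:
    return f"{t.label}. {t.glossary}".strip()

def encode_label_glossary_taxonomy(t: OntologyType) -> str:
    # Append the ancestor chain so the encoder also sees the type's place
    # in the taxonomy; the "[taxonomy]" marker and " > " separator are assumptions.
    path = " > ".join((t.label,) + t.ancestors)
    return f"{t.label}. {t.glossary} [taxonomy] {path}".strip()

# Hypothetical UMLS SN-style type used only to show the three serializations.
t = OntologyType(
    label="Antibiotic",
    glossary="a pharmacologically active compound that kills or inhibits bacteria",
    ancestors=("Pharmacologic Substance", "Chemical", "Entity"),
)
for fn in (encode_label_only, encode_label_glossary, encode_label_glossary_taxonomy):
    print(fn.__name__, "->", fn(t))
```

Each serialized string would then be embedded by the same encoder used for the columns, as in the sketch following the abstract.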

[Figure a and Figure b: tables of example model predictions]


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dash, S., Bagchi, S., Mihindukulasooriya, N., Gliozzo, A. (2023). Linking Tabular Columns to Unseen Ontologies. In: Payne, T.R., et al. (eds.) The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol. 14265. Springer, Cham. https://doi.org/10.1007/978-3-031-47240-4_27


  • DOI: https://doi.org/10.1007/978-3-031-47240-4_27


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47239-8

  • Online ISBN: 978-3-031-47240-4

  • eBook Packages: Computer Science, Computer Science (R0)
