ABSTRACT
Extracting geographical tags from webpages is a well-motivated application in many domains. In illicit domains with unusual language models, like human trafficking, extracting geotags with both high precision and recall is a challenging problem. In this paper, we describe a geotag extraction framework in which context, constraints and the openly available Geonames knowledge base work in tandem in an Integer Linear Programming (ILP) model to achieve good performance. In preliminary empirical investigations, the framework improves precision by 28.57% and F-measure by 36.9% on a difficult human trafficking geotagging task compared to a machine learning-based baseline. The method is already being integrated into an existing knowledge base construction system widely used by US law enforcement agencies to combat human trafficking.
- Using contexts and constraints for improved geotagging of human trafficking webpages