Protocol to discover machine-readable entities of the ecosystem management actions taxonomy

Summary
The ecosystem management actions taxonomy (EMAT) consists of actions taken by humans and wildlife that affect an ecosystem. Here, I present a protocol for discovering machine-readable entities of the EMAT. I describe steps for acquiring stories from online locations, collecting them into a story file, and processing them through a software package to extract those actions that match EMAT taxa. I then detail procedures for using the story file to learn new EMAT taxa.

In general, a taxon is one of the categories of a taxonomy. Because the EMAT categorizes actions, in what follows, an EMAT taxon is referred to as an EMAT action. Also, in database theory, an entry in a database, such as an existing student in a school's enrollment database, is referred to as an entity. Therefore, a real-world action reported by a machine-readable source and extracted by this protocol as an exemplar of a particular EMAT action is referred to as an EMAT entity.
The EMAT is an extension of a political actions taxonomy developed by Leng. 3 Currently, the EMAT consists of 119 militaristic actions, 191 diplomatic actions, 198 economic actions, 92 ecosystem-directed anthropogenic actions, and 37 ecological actions. Each action is associated with a set of archetypal actors. Ecological actions are of various types including species abundance, habitat metrics such as vegetation index, wildlife disease outbreak events, events of wildlife-caused damage to crops, and events of wildlife attacks on humans.
Each EMAT action can be decomposed into at most three sentence components: an m-word verb, a direct object phrase, and/or a prepositional phrase. Letting m be a positive integer, an m-word verb subsumes single-word verbs (either regular or irregular) and multi-word verbs (those that use more than one word to convey their meaning, e.g., ''picked up''), also known as phrasal verbs. 4 These decompositions are realized in the file parsedematacts.dat, wherein each EMAT action has been manually parsed into three equivalence sets: a set of semantically equivalent m-word verbs, a set of semantically equivalent direct object phrases, and a set of semantically equivalent prepositional phrases, respectively.
The scripts and JAVA program listed in the key resources table will run on a Windows 11 computer. The id software package can extract entities at scale by analyzing stories in parallel. This is accomplished by taking advantage of the embarrassingly parallel search (see Malapert et al. 5) characteristic of the entity extraction step: The extraction of entities from one story is independent of the extraction of entities from any other story. Therefore, a cluster computer having m compute nodes, each with at least four processors, can deliver an m-times speed-up in the processing time of n stories when n ≫ m.
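The speed-up claim can be checked with a little arithmetic. The following Python sketch is illustrative only and is not part of the id software package: it assumes that every story takes the same time to process and that the m nodes work in lock-step rounds, and the function name ideal_speedup is hypothetical.

```python
import math

def ideal_speedup(n_stories: int, m_nodes: int) -> float:
    """Ideal speed-up over one node, assuming equal per-story processing time.

    With m nodes, n independent stories need ceil(n / m) rounds instead of n,
    so the speed-up is n / ceil(n / m).
    """
    rounds = math.ceil(n_stories / m_nodes)
    return n_stories / rounds

if __name__ == "__main__":
    # The speed-up approaches m (here, 4) as n grows relative to m.
    for n in (3, 10, 1000):
        print(n, ideal_speedup(n, 4))
```

Under these assumptions the speed-up falls short of m when n is small or not a multiple of m, and reaches m only as n becomes large relative to m.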

STEP-BY-STEP METHOD DETAILS
Acquire stories and ecological data from machine-readable sources
Timing: 4 h
Timing: 30 min (for step 1)
Timing: 1 h per 50 Alert emails (for step 2)
Timing: 1 h per 50 Alert emails (for step 3)
Timing: 30 min per 100 stories (for step 4)
Timing: 1 h (for step 5)
Scrape stories from either existing machine-readable files or the World Wide Web utilizing a variety of program files (see the key resources table). Download all batch, PowerShell, and Outlook macro files listed in the key resources table to a single folder, e.g., C:\pedatacq (for political-ecological data acquisition).
1. Aggregate existing, separate story files. Produce a story file, formatted for the EMAT extraction and learning steps below, from a set of separate individual story files.
a. Use a file naming scheme that allows the collection of files to be referred to with wildcard notation.
b. Edit catstories.bat to specify the collection of files to be aggregated and to specify the aggregation file as follows.
c. Run catstories.bat. This (a) inserts the line ''beginarticle 0'' as line 1 of each file, and (b) concatenates all of these files into one file.
CRITICAL: Do not click more files on Exchange because this will cause the URL of each Alert's story link to not be written to allalerts.dat.
e. Open each alert email and click each story's link and write it to a file as an HTML-only file type (''webpage, HTML only'' format).
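The aggregation performed in step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the contents of catstories.bat: the function name aggregate_stories is hypothetical, files are assumed to be UTF-8 text, and the marker line is written ahead of each story in the output file rather than into the original files.

```python
import glob
from pathlib import Path

MARKER = "beginarticle 0"  # line the protocol places at the top of every story

def aggregate_stories(pattern: str, out_path: str) -> int:
    """Concatenate all files matching a wildcard pattern into one story file.

    Each story is preceded by MARKER. Returns the number of files aggregated.
    Keep out_path outside the wildcard pattern so it is not read as input.
    """
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            text = Path(path).read_text(encoding="utf-8", errors="replace")
            out.write(MARKER + "\n")
            # Make sure each story ends with a newline before the next marker.
            out.write(text if text.endswith("\n") else text + "\n")
    return len(paths)
```

A call such as aggregate_stories("C:/pedatacq/story*.txt", "C:/pedatacq/stories.dat") would then build the story file; both file names here are placeholders.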

Substeps
iii. After the macro finishes, verify that this folder is now empty on both the local version of Outlook and on OWA.
iv. Hit any key to continue the getalerts batch job.
b. This batch file continues by scraping the stories off the web pointed to in the downloaded emails. The program does this by concatenating and cleaning these downloaded Alert files as follows.
i. Concatenate all Google Alerts emails into one alerts file.
ii. Clean this file by first removing ^@ control characters by converting the file from UTF-16 to ANSI. Second, replace the beginning and ending string added by Google and Outlook in every story URL with a space so that getnews.ps1 can successfully read and then connect to these URLs. These replacements need to be performed in the following order.
Replace %3A with :
Replace %2F with /
Replace %3D with a space.
Replace %26 with a space.
Replace <https://nam02.safelinks.protection.outlook.com/?url= with a space.
Replace https://www.google.com/ with a space.
c. Download each story pointed to by a link in the alerts file by executing the following command.
powershell c:/pedatacq/getnews.ps1 googlealerts alerts filename append filename
The second argument to this command is the alerts file, and the third is the file to append the downloaded stories to.
ii. Click Properties from the menu on the right of the screen. Then, click the characteristic of the task to be edited and then click Edit.
iii. For this task's Action, specify Run a Program and then enter the full path to the program getnews.bat, e.g., c:/pedatacq/getnews.bat.
iv. Set the General options of the task to: runs only when the user is logged on.
5. Acquire ecological data from machine-readable sources.
a. Acquire species abundance data as follows.
i. Enter into a web search engine phrases that mention the species of interest and its general location.
Note: For example, East African cheetah was used to acquire cheetah abundance data. 2
ii. Read the research articles that are returned from this search and manually extract abundance data given in each article. Otherwise, download the data sets from data repositories pointed to in these articles.
b. Acquire satellite images of large mammals and/or large flora as follows.
i. Identify the box of longitude and latitude coordinates of the area of interest.
ii. Contact a commercial satellite imagery provider such as MAXAR (www.maxar.com) or Airbus Defence and Space (www.intelligence.airbus.com) and purchase images of this area taken at satellite fly-over dates that are closest to those desired.
iii. Once purchased, download the desired images.
iv. Create a one-sentence story for this data set.
Note: From Haas 2: ''Entities in the data set reference table (table ecodatref in Figure 1) are observations on the collect data EMAT action and have seven attributes: source, species, type, country, region, startdate, and enddate. The source attribute is either observation or model. The species attribute indicates the observed species, e.g., cheetah, rhino, or cycad (Cycadophyta). The type attribute can take on the values of abundance, presence/absence, capture-recapture, rainfall, NDVI, and landuse. For these latter three values, the species attribute is set to N/A. These data set references are preprocessed into one-sentence stories of the form ''group name collected type data on species in region, country during the period startdate to enddate.'' group name is the group who collected data, e.g., Kenya Wildlife Services, SANParks Scientific Services, or TerraServer.''

Figure 1. Expected outcomes from running this protocol
The protocol begins by producing a file of stories scraped from machine-readable sources. The protocol then extracts actions from these stories that match actions in the Ecosystem Management Actions Taxonomy (EMAT). The protocol finishes by assisting humans to learn both new phrases to add to a particular EMAT action's equivalence sets and entirely new EMAT actions.
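The ordered URL clean-up applied to the alerts file during Google Alerts processing can be sketched in Python as follows. This is an illustration rather than the code in getalerts.bat: the function name clean_alert_text is hypothetical, and the sketch covers only the string replacements, not the UTF-16 to ANSI conversion.

```python
# Replacement pairs in the order given in the protocol. Order matters:
# the percent-encoded characters are decoded first so that the Google
# prefix appears in plain form before it is replaced.
REPLACEMENTS = [
    ("%3A", ":"),
    ("%2F", "/"),
    ("%3D", " "),
    ("%26", " "),
    ("<https://nam02.safelinks.protection.outlook.com/?url=", " "),
    ("https://www.google.com/", " "),
]

def clean_alert_text(text: str) -> str:
    """Apply the protocol's replacements, one after another, to alert text."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

if __name__ == "__main__":
    # Hypothetical wrapped link; example.org stands in for a news site.
    raw = ("<https://nam02.safelinks.protection.outlook.com/?url="
           "https%3A%2F%2Fexample.org%2Fstory%26data%3Dabc>")
    print(clean_alert_text(raw))
```

After cleaning, the story URL is separated by spaces from the tracking parameters that followed it, so a downstream reader can pick it out as a whitespace-delimited token.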

Extract EMAT entities
Timing: 1 h per 1,000 stories
This step extracts EMAT entities from the story file.
6. Locate the story file that was created by substeps 1 through 4.
7. Run the id software package's relation, parse stories(), to extract sentences, m-word verbs, direct object phrases, prepositional phrases, and EMAT entities from the stories in this story file.
Note: For example, the id software package's input file, extract.id (see the key resources table), extracts actions from the story file ef9-13.txt.
The parse stories() relation executes the following steps.
a. Scan each story for the story's source.
b. Remove a pre-selected set of HTML tags from the story to produce a tag-filtered story.
c. Form a text fragment of the story that consists of textual content sentences only.
Note: A sentence contains textual content if (a) it contains at least three common words defined by the list {the, a, of, is, by, to, be, from, and, have, in, that, on, with, as, at, inside}; and (b) less than 80% of its words are irrelevant as defined by the list {content, copyright, stylesheet, subscribe, subscription, login, header, sidebar, wrapper, label, navigation, class, column, http:, republish, div}.
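The textual-content test in the note above can be sketched in Python as follows. This is an illustration, not the id package's implementation. Three details are assumptions: words are split on whitespace, surrounding punctuation other than the colon is stripped, and repeated common words each count toward the minimum of three.

```python
COMMON = {"the", "a", "of", "is", "by", "to", "be", "from", "and", "have",
          "in", "that", "on", "with", "as", "at", "inside"}
IRRELEVANT = {"content", "copyright", "stylesheet", "subscribe", "subscription",
              "login", "header", "sidebar", "wrapper", "label", "navigation",
              "class", "column", "http:", "republish", "div"}

def has_textual_content(sentence: str) -> bool:
    """Return True if the sentence passes both textual-content conditions."""
    # Strip surrounding punctuation but keep ':' so that 'http:' still matches.
    words = [w.strip(".,;!?\"'()") for w in sentence.lower().split()]
    words = [w for w in words if w]
    if not words:
        return False
    n_common = sum(w in COMMON for w in words)
    n_irrelevant = sum(w in IRRELEVANT for w in words)
    # (a) at least three common words; (b) strictly less than 80% irrelevant words.
    return n_common >= 3 and n_irrelevant * 100 < 80 * len(words)
```

The percentage test uses integer arithmetic so that a sentence with exactly 80% irrelevant words is rejected without relying on floating-point comparison.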
d. Search each tag-filtered story for its date, groups, regions, and EMAT entities. In particular, extract EMAT entities with the EMAT entity extraction algorithm.
Note: These four searches are performed simultaneously by running each search in its own independent thread. A speed-up will accrue when the code is run on a stand-alone computer or cluster computer node that has at least four processors, one for each thread.
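The one-thread-per-search arrangement can be sketched in Python as follows. The four search functions are hypothetical stand-ins, not the id package's searches. The sketch shows the structure only: in CPython, threads do not speed up CPU-bound work, whereas the protocol's JAVA program can use one processor per thread.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def find_date(story: str):
    """Stand-in date search: first four-digit year in the story, if any."""
    match = re.search(r"\b(?:19|20)\d{2}\b", story)
    return match.group(0) if match else None

def find_groups(story: str):
    """Stand-in group search over a tiny, hypothetical group lexicon."""
    return [g for g in ("poachers", "rangers") if g in story.lower()]

def find_regions(story: str):
    """Stand-in region search over a tiny, hypothetical region lexicon."""
    return [r for r in ("Kenya", "Tanzania") if r in story]

def find_entities(story: str):
    """Stand-in for the EMAT entity extraction algorithm."""
    return []

SEARCHES = {"date": find_date, "groups": find_groups,
            "regions": find_regions, "entities": find_entities}

def search_story(story: str) -> dict:
    """Run the four independent searches, each in its own thread."""
    with ThreadPoolExecutor(max_workers=len(SEARCHES)) as pool:
        futures = {name: pool.submit(fn, story) for name, fn in SEARCHES.items()}
        return {name: future.result() for name, future in futures.items()}
```

Because none of the searches reads another's result, no locking is needed; the results are simply collected once all four threads finish.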

Note: The EMAT entity extraction algorithm searches a sentence for m-word verbs that partially match m-word verbs that are members of an EMAT action's m-word verb equivalence set. For example, say that some hypothetical story contains the sentence,
Five poachers were arrested on June 10, 2019 and sentenced to prison on August 8, 2019.
This sentence contains two 1-word verbs: arrested and sentenced. Similar searches are executed to find direct object phrases and prepositional phrases 6 that partially match members of corresponding equivalence sets. Parsing is accomplished with a modified version of the shallow parsing algorithm of Daelemans et al. 7
The EMAT entity extraction algorithm is as follows.
i. Using the phrase similarity sub-algorithm, search a sentence for an m-word verb that best matches entries in m-word verb equivalence sets. Declare a match if a pair's similarity (SIM) is greater than 0.95. Let this best-matching EMAT action be a member of action cluster j. Phrases are allowed to appear in any order within a sentence.
Note: In the following phrase similarity sub-algorithm, an n-gram is a sub-sequence of n words in a natural language phrase. One definition of the degree of similarity between two phrases is the Phrasal Overlap Measure of Ponzetto and Strube. 8 Haas 9 describes a modified version of this measure. In turn, an improved version of that measure is defined in terms of the following quantities: N = |ph_1| ≤ |ph_2|, the number of words in phrase 1; s, the number of times n-gram pairs are formed by starting at the same location in each phrase; and m_n, the number of n-grams that are common to the two phrases.
Note: A pair of n-grams is declared to be common if 1.0 minus the Levenshtein distance 10,11 between the two is greater than 0.99.
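The n-gram commonality test can be sketched in Python as follows. This is a sketch under stated assumptions, not the published measure. The Levenshtein distance is computed over characters and divided by the length of the longer n-gram so that it lies in [0, 1]; this normalization is an assumption, as is counting m_n as the number of n-grams of phrase 1 that have a common counterpart in phrase 2.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_common(gram1: str, gram2: str, cutoff: float = 0.99) -> bool:
    """True if 1.0 minus the normalized distance exceeds the cutoff."""
    longest = max(len(gram1), len(gram2))
    normalized = levenshtein(gram1, gram2) / longest if longest else 0.0
    return 1.0 - normalized > cutoff

def ngrams(phrase: str, n: int) -> list:
    """All sub-sequences of n consecutive words in the phrase."""
    words = phrase.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def common_ngram_count(phrase1: str, phrase2: str, n: int) -> int:
    """Number of n-grams of phrase1 that have a common n-gram in phrase2."""
    grams2 = ngrams(phrase2, n)
    return sum(any(is_common(g1, g2) for g2 in grams2) for g1 in ngrams(phrase1, n))
```

With a cutoff of 0.99 under this normalization, two n-grams are common only if they are identical or differ by less than 1% of their characters, so in practice the test behaves almost like exact matching for short n-grams.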
ii. If no matches are found, return null.
iii. Initialize the variables SIMD and SIMP to 0.0.
iv. Search the sentence for a direct object phrase that matches an entry in direct object phrase equivalence sets of EMAT actions in action cluster j. Declare a match if SIM > 0.95. Store the matched EMAT action with the highest similarity score in actionD and store its similarity value in SIMD.
v. Search the sentence for a prepositional phrase that matches an equivalence set prepositional phrase of one of the actions in action cluster j. Declare a match if SIM > 0.95. Store the matched EMAT action with the highest similarity score in actionP and store its similarity value in SIMP.
vi. If actionD and actionP are both null, return null.
vii. If SIMP > 0.8, return actionP.
viii. If SIMD > SIMP, return actionD. Otherwise, return null.
e. In the final list of extracted entities, check each entity for possible duplicates of it. Do this as follows.
i. When/if a duplicate is found, remove it.
ii. After a duplicate is found, search the list again for additional duplicates.
iii. Move to the next entity only when no duplicates are found for the current entity.
8. Examine the output file, parsefailed.dat, for stories that the parsing algorithm failed on.
9. Rerun all story files after improvements to the EMAT entity extraction algorithm are made and/or associated lexicon files are updated.
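The selection logic in steps iii through viii of the EMAT entity extraction algorithm can be sketched in Python as follows. This is an illustration of the stated rules only. The function names best_match and select_action are hypothetical, and the candidate similarity scores are assumed to have been computed beforehand by the phrase similarity sub-algorithm.

```python
from typing import Iterable, Optional, Tuple

def best_match(candidates: Iterable[Tuple[str, float]],
               threshold: float = 0.95) -> Tuple[Optional[str], float]:
    """Steps iv and v: keep candidates with SIM > threshold, return the best.

    Returns (None, 0.0) when nothing matches, mirroring the initialization
    of SIMD and SIMP to 0.0 in step iii.
    """
    matches = [(action, sim) for action, sim in candidates if sim > threshold]
    if not matches:
        return None, 0.0
    return max(matches, key=lambda pair: pair[1])

def select_action(action_d: Optional[str], sim_d: float,
                  action_p: Optional[str], sim_p: float) -> Optional[str]:
    """Steps vi through viii: choose between actionD and actionP."""
    if action_d is None and action_p is None:
        return None          # step vi
    if sim_p > 0.8:
        return action_p      # step vii
    if sim_d > sim_p:
        return action_d      # step viii
    return None
```

For example, a sentence whose direct object phrase matches one action with SIM = 0.97 and whose prepositional phrase matches nothing yields SIMD = 0.97 and SIMP = 0.0, so the direct-object match is returned.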

Learn new equivalence set members and EMAT actions
Timing: 30 min
EMAT actions and their equivalence sets do not define a static taxonomy but rather a dynamic one, as both the language evolves and new interactions between humans and ecosystems emerge. This dynamic characteristic of the EMAT is operationalized with a software-assisted human learning algorithm that can identify either a new equivalence set member of an existing EMAT action or, more fundamentally, an entirely new EMAT action.
10. Create several strings of one to four words that are thought to describe, in part, an EMAT action.
Note: For example, the EMAT action Elephants trample crops could also be described with the strings elephants destroyed fields or smashed by elephants.
11. Locate the story file that was created by substeps 1 through 4.
12. Run the file phrases.bat with these strings on this story file. View the output file for common usage of these strings in stories contained in the story file. These instances will suggest m-word verbs, objects, and prepositional phrases to add to that EMAT action's equivalence sets in parsedematacts.dat.
13. Detect those sentences in a set of stories that have a SIMP score between 0.7 and 0.8, or where SIMD > SIMP. Create a list of these {sentence, EMAT action} pairs.

Note: The EMAT extraction algorithm, above, uses a SIMP value greater than 0.8 to extract an EMAT entity.
Note: These scores compare phrases to existing EMAT actions. Therefore, there is always a particular EMAT action that is associated with a SIMP score or a SIMD score.
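The candidate selection in step 13 can be sketched in Python as follows. The record layout (keys sentence, action_p, simp, action_d, simd) and the function name learning_candidates are hypothetical. The sketch also assumes that the bounds 0.7 and 0.8 are inclusive, and that a sentence selected through its SIMP score is paired with the action behind that score.

```python
def learning_candidates(records):
    """Build the list of (sentence, EMAT action) pairs for human review.

    A record qualifies if its SIMP score lies in [0.7, 0.8], or otherwise
    if its SIMD score exceeds its SIMP score.
    """
    pairs = []
    for r in records:
        if 0.7 <= r["simp"] <= 0.8:
            pairs.append((r["sentence"], r["action_p"]))
        elif r["simd"] > r["simp"]:
            pairs.append((r["sentence"], r["action_d"]))
    return pairs

if __name__ == "__main__":
    # Hypothetical scored sentences; scores are illustrative, not measured.
    scored = [
        {"sentence": "s1", "action_p": "A", "simp": 0.75, "action_d": "B", "simd": 0.2},
        {"sentence": "s2", "action_p": "A", "simp": 0.90, "action_d": "B", "simd": 0.5},
    ]
    print(learning_candidates(scored))
```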
14. Examine each sentence in this list to determine if it is clearly describing an occurrence of any existing EMAT action.For such a sentence, add the m-word verb, direct object phrase, and prepositional phrase to the equivalence sets of the associated, existing EMAT action.
CRITICAL: When updating parsedematacts.dat, if there are highly relevant words that need to be detected for a particular EMAT action, keep the prepositional phrase that contains the needed words short. That way, the similarity measure will be large only if those words are present in the story's sentence.

Note: As an example of discovering new equivalence set members, the following sentence gives a high SIMD score for the EMAT action Sell a few rhino horns.
In 2014, Kenya enacted tough new laws that make ivory poaching and trafficking punishable by fines of 200,000 USD or even life in prison compared to the maximum fines of about 400 USD that were handed out previously.
Clearly, this sentence describes an entity of the EMAT action: tighten wildlife agreement or laws.
Hence, the 1-word verb enacted should be added to this action's m-word verb equivalence set, and the phrase tough new laws that make ivory poaching and trafficking punishable should be added to the action's direct object phrase equivalence set.

EXPECTED OUTCOMES
The protocol's story-scraping step produces a file of stories scraped from machine-readable sources (Figure 1). These stories are formatted for further processing by the protocol's EMAT entity extraction step. This step, in turn, produces a file of EMAT entities that are tagged by date, location, reporting source, actor, and target. The protocol's learning step produces a list of potentially new equivalence set members for existing EMAT actions and a list of potentially new EMAT actions along with their initial equivalence sets.

QUANTIFICATION AND STATISTICAL ANALYSIS
Assessing extraction accuracy
Haas 9 discusses algorithms designed to extract EMAT entities from media: The algorithm's accuracy can be assessed by comparing the actions extracted from a random sample of stories to those extracted by a human reading the same set of stories. Using the set of human-extracted actions as the benchmark, the algorithm can make two types of errors: failing to extract an action in a story; and extracting an action that does not exist in the story, referred to here as a spurious action.
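The two error types can be tallied with a short Python sketch such as the following. The representation of an extracted action as a (story identifier, action name) pair, the function name extraction_errors, and the two rate definitions are assumptions made for illustration; they are not taken from Haas 9.

```python
def extraction_errors(human, algorithm):
    """Compare algorithm-extracted actions to the human benchmark.

    Both arguments are collections of (story_id, action_name) pairs.
    A missed action is in the human set only; a spurious action is in
    the algorithm set only.
    """
    human, algorithm = set(human), set(algorithm)
    missed = human - algorithm
    spurious = algorithm - human
    return {
        "missed": sorted(missed),
        "spurious": sorted(spurious),
        # Fraction of benchmark actions the algorithm failed to extract.
        "miss_rate": len(missed) / len(human) if human else 0.0,
        # Fraction of algorithm-extracted actions that are not in the benchmark.
        "spurious_rate": len(spurious) / len(algorithm) if algorithm else 0.0,
    }
```

Reporting the two rates separately keeps the distinction the text draws: a low miss rate can coexist with a high spurious rate, and the two call for different remedies in the equivalence sets.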
Substeps (a) through (e), above, have been automated in the file getalerts.bat. Manually run this batch job by typing C:/pedatacq/getalerts at a Windows command prompt.
CRITICAL: Only after starting this batch job, start a local version of Outlook.
a. After this local version of Outlook has finished updating, do the following.
i. Click Developer, Macros, and then downloadalertsvba.
2. Manually process Google Alerts. Produce a story file, formatted for the EMAT extraction and learning steps below, from a set of Google Alert emails.
Once the Google Alerts service has been started, it will autonomously and continually send Alert emails to the designated folder at the user's email address. This can amount to 500 emails per month. Weekly monitoring of this folder can avoid email access failures due to the receipt of an excessive number of emails.
Get into the desired folder, e.g., googleeastaf, and then click the Select All icon.
iv. Click File and then Save As. Take the default (Text Only). Give a file name with a .dat extension, e.g., allalerts.dat. Call this the alerts file.
4. Acquire stories from commercial news aggregators. Use a free commercial news aggregator to find stories. The PowerShell program getnews.ps1 (called from getalerts.bat) is reused from above to scrape stories from the commercial news aggregator, Newsapi.
a. Set up a free account at www.newsapi.org.
b. Schedule the program getnews.bat to run at a specified time and day every week with Task Scheduler, found at Windows Administrative Tools > Task Scheduler. Do this as follows.
i. Edit this task by clicking Task Library and then clicking on the task.
15. Examine the sentence for an ecosystem-relevant militaristic, diplomatic, economic, ecosystem-directed, or ecological action. If a new EMAT action is being described, use the sentence's m-word verb, direct object phrase, and (if present) the sentence's prepositional phrase as the initial members of this new EMAT action's three equivalence sets, respectively.
Note: As an example of discovering a new EMAT action, the following sentence from the file ef15-16.txt (see Table 1) produces a high SIMP score for the EMAT action Sell a few rhino horns but is clearly not describing that or any other existing EMAT action.
TenBoma is a communications based initiative that uses modern technology and sophisticated data analysis to allow law enforcement agencies to predict poaching plots in advance and thwart the incidents.
Instead, this sentence appears to be describing a new ecosystem management action that is about the introduction of a new technology to combat wildlife trafficking. Hence, the action and associated equivalence set members as shown in Table 2 should be added to the EMAT definition file, emat.dfn, and parsedematacts.dat, respectively.

Table 2. A new EMAT action along with its initial equivalence set members

Table 1. Story, database, and entities history files in the author's collection
The file name pattern *sdb.dat is a story database file (one set of entities per story), and *acts.dat is an entities history file.