Dis/similarities in the design and development of legal and algorithmic normative systems: the case of Perspective API

ABSTRACT For several decades now, legal scholars and other social scientists have been interested in conceiving of technologies as regulatory media and comparing their normative affordances to law’s regulatory characteristics. Recently, scholars have also started to explore the normative nature of Machine Learning and Artificial Intelligence systems. Most of this scholarship, however, adopts a largely theoretical perspective. This article takes a different approach and attempts to provide the discussion with a more factual grounding. It does so by investigating the construction of one particular Machine Learning system, the content moderation system Perspective API developed by Google. Its open-source development and a voluminous trove of publicly available documentation render Perspective API a virtually unique resource to study the inner logics of Machine Learning systems development. Based on an in-depth analysis of these logics, the article fleshes out similarities and dissimilarities concerning the normative structure of algorithmic and legal systems regarding four different subjects: practical constraints, evaluative diversity, modes of evolution and standards of evaluation. The article then relates the case of Perspective to the European Union’s proposal for an Artificial Intelligence Act and shows how the study’s insights might help in readying the Act for the realities of contemporary multi-party AI development.


Introduction
Historian Melvin Kranzberg in 1986 famously declared it his first law of technology that 'Technology is neither good, nor bad; nor is it neutral'. 1 It is this law's third proposition, the assertion of technology's fundamental non-neutrality, that also nicely epitomises the central concern of a decidedly legal genre of academic literature: a literature interested in the normativity of (digital) technologies. 2 This literature conceives of law and technology as sharing a central common characteristic: both regulate human behaviour. Just like the law, technology can forbid or permit, sanction or enable specific human actions. 2 See infra ns 3-5.
Once this assumed functional equivalence between law and technology is accepted, a second question arises: how do technologies' and law's regulatory natures differ? Such comparisons can investigate issues like a normative medium's force (the practical (in)evitability of its regulatory effects), its grammar (e.g. the specificity or flexibility of its norms), or its legitimacy (looking at issues like norm-making, accountability or contestability). Scholars have used this conceptual framework to compare the law to a range of (digital) technologies, including 'code', 3 'big data' 4 and 'blockchains'. 5 Consistent with the current proliferation of these technologies, academics have recently taken an interest in Machine Learning (ML) and Artificial Intelligence (AI) systems. Scholars are speculating about what kinds of normativities these technologies might bring about. 6

This article attempts to contribute to this discussion. However, different from most other contributions, it does not engage in primarily theoretical considerations, but attempts to provide the debate around ML as a regulatory medium with a more factual grounding. To this end, the article presents an in-depth analysis of one particular ML-based algorithmic system: Perspective API. 7 Perspective API is a ML-driven, Natural Language Processing (NLP)-based online content moderation system developed by Google subsidiary Jigsaw. It is advertised as being able to 'identify "toxic" comments' and '[help] platforms and publishers create safe environments for conversation'. 8 Its main uses lie in assisting human moderators to detect 'toxic' content as well as filtering 'toxic' comments automatically. 9 Its open-source code and the voluminous trove of documentation and internal discussions made available by Jigsaw render Perspective a virtually unique resource to study the internal dynamics of ML development processes and their external normative effects. In an interdisciplinary approach, interweaving empirical findings with interviews, 10 Computer Science literature and legal theory, this article explores this practice of AI development to elucidate how the practicalities of AI production shape the resulting normative systems differently from legal systems.

The rest of the article is organised as follows: Section 2 gives a structured account of the Machine Learning development process, or what is often called the Machine Learning pipeline, attempting to equip jurists with a better understanding of the processes shaping the production of algorithmic systems. Section 3 uses this structure to retrace the development of Perspective API. Section 4 then takes the case of Perspective to illustrate similarities and dissimilarities in the conception, development and evolution of algorithmic normative systems on the one side and legal normative systems on the other. After contextualising the current inquiry and identifying a number of (potential) limitations, it points out dis/similarities with respect to four structural characteristics: practical constraints, evaluative diversity, modes of evolution and standards of evaluation. Finally, Section 5 shows how these insights relate to the EU's current proposal for an Artificial Intelligence Act and makes a number of proposals for regulatory amendment.

Modelling the machine learning pipeline
To better understand the subsequent in-depth analysis of Perspective API, it is useful to give a brief run-through of how the development process of supervised 11 Machine Learning systems usually proceeds. 7 <https://www.perspectiveapi.com/> accessed 20 December 2022. 8 Ibid. 9 See infra 3.4. 10 For a description of the sources and materials used in this study, see infra 4.1. 11 Next to Supervised Learning, two other big Machine Learning paradigms exist: Unsupervised Learning and Reinforcement Learning. As most decision-oriented Machine Learning systems rely on Supervised Learning, the article focuses on this approach. However, it is to be noted that, as the case of Perspective also shows, the rapid proliferation of large pretrained (language) models, which emerge through unsupervised learning, and their application for downstream transfer on a growing number of (language-related) tasks (including content moderation), starts to blur the distinctions between unsupervised and supervised learning.

Generally, the development of supervised Machine Learning systems can be split into twelve consecutive, intertwined stages, which in turn can be bundled into four larger chapters. 12 It is important to note, however, that software development is often far from a strictly linear, sequential process and that feedback may lead to revisions, recursions and iterations at all stages.

Project conception
No software writes itself. Rather, at the source of any algorithmic system there is always a task to be completed, a process to be optimised, a problem to be solved. To reach a 'solution', this problem must first be identified and more narrowly conceived (1. Problem statement). Software developers will then transpose the problem into one that is digitally computable. In other words, the problem is truncated, reframed or reformulated in such a way that a solution in the form of code and data shows promise. This will lead to a first preliminary set of decisions concerning the system's main characteristics (e.g. traditional, static algorithms vs. Machine Learning; potential data sources; possible application scenarios) (2. Project map). Finally, developers need to determine the system's target variable. This is the data point the system will ultimately be asked to generate (3. Target variable definition). This could be an e-mail filter's 'spam/no spam' label or the grade a student should be given for their English essay.

Data collection and data preparation
ML algorithms rely on data to learn. Analysing old data, ML systems identify patterns which in turn are used to classify new, unseen examples. Therefore, one central element of the ML pipeline is collecting (enough, useful) data (4. Data collection). Such 'raw' data, however, can almost never be directly processed. For algorithms to be able to process the data, it must be simplified, structured and cleaned (5. Data cleaning). To take two examples from the sphere of NLP, this 'data wrangling' can include largely formal adjustments, such as lowercasing all text so as to enable case-insensitive text analysis, but can also comprise more substantial alterations, such as disposing of so-called stopwords (e.g. 'the', 'this', 'all', 'not', 'never', …), which are often seen as semantically irrelevant. Finally, for supervised learning systems, data must be annotated with human-created labels that will serve as the supposed ground truth for the ensuing training and testing stages (6. Annotation). 13 Annotation includes the design and execution of a specific annotation system (e.g. number of annotations per object, selection of annotators, annotation instructions, …) as well as choosing a method for integrating diverging annotation values (i.e. when annotators have given disparate target values).
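A minimal sketch of what such cleaning and label integration can look like in code (a hypothetical illustration with an assumed stopword list, not any real system's preprocessing):

```python
import re

# A tiny, illustrative stopword list; real pipelines use much larger ones
# (e.g. from NLTK or spaCy) or skip stopword removal altogether.
STOPWORDS = {"the", "this", "all", "not", "never", "a", "an", "is"}

def clean_comment(text: str) -> list[str]:
    """Lowercase, keep only alphabetic tokens and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def aggregate_labels(votes: list[int]) -> float:
    """Collapse diverging annotator votes (0 = fine, 1 = attack) into one value."""
    return sum(votes) / len(votes)

print(clean_comment("This is NOT a personal attack!"))  # ['personal', 'attack']
print(aggregate_labels([0, 0, 0, 1, 1]))                # 0.4
```

Note that even the innocuous-looking stopword step is a substantive choice: dropping 'not' in the example above removes the very negation on which the comment's meaning turns.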

Model building
Model building, firstly, entails choosing an algorithm to perform the learning and prediction function (7. Model selection). Programmers, in other words, need to decide how the system is supposed to 'make sense' of the data and acquire knowledge for future predictions. Options range from relatively well-traceable methods such as linear or logistic regression to more complex approaches such as Random Forests, k-Nearest Neighbours, Naïve Bayes or Neural Networks. 14 In the field of Natural Language Processing, such architectures are increasingly built on top of existing language models, such as Google's BERT family or OpenAI's GPT-3, which have been pretrained on immense amounts of other text data. 15

Once an architecture has been chosen, the model is trained on a subset of the data called training data (8. Training). Here, the model, through reiterative processes of trial, error and adjustment, learns the predictive relevance of different variables (e.g. the presence of a word or the presence of a word in co-presence of another word) and thereby gradually approximates the optimal mathematical model for predicting the desired target value. However, while the model after this training might be able to perfectly predict all target values in the training data, we still lack understanding of its performance on new data. The model's performance is therefore evaluated on a separate, held-out subset of the data, the test data, 16 usually on the basis of one or several quantitative performance metrics, 17 such as the ROC-AUC (9. Testing). 18 Inter alia, it is crucial that developers understand which value to optimise for.

13 That is, if developers do not possess labelled data already, as might be the case where a system can be trained on existing decision data, such as when a bank automates its lending operations by training a ML system with data on past credit decisions. 14 Different architectures often also require a different representation of the data. For example, logistic regressions in NLP often work with simple bag-of-words representations, where textual units (e.g. a sentence) are understood as syntactically undefined collections ('bags') of smaller units (e.g. words). Neural Network-based systems, on the other hand, typically work with so-called word embeddings, where words or sentences are represented as vectors in multidimensional (vector-)spaces. 15 Language models represent probability distributions of words in a given language. The foundational task most language models initially become trained on is that of inferring some output text given some input text (e.g. input = 'Barack Obama was once ___ of the United States of America', most probable output = 'president'). Once a model's training corpus is sufficiently large – GPT-3, one of the currently most popular language models, has been trained on 500 billion tokens of an average size of four characters, for example – this process leads to language production skills virtually indistinguishable from those of adult humans. Metaphorically speaking, language models can be thought of as encoding linguistic knowledge, such as information on a language's syntax and semantics; for an instructive and interactive introduction into language models, see 16 Alternatively, models are sometimes tested through so-called cross validation. Cross validation works by repeatedly partitioning the whole amount of data available at the time of model-building into different training and testing subsets and combining validation results at the end of several iterations. 17 For an instructive introduction, see Alaa Tharwat, 'Classification Assessment Methods' (2021) 17(1) Applied Computing and Informatics 168; performance metrics for regression analysis, i.e. for predictive models that do not produce (binary) classifications, but rather compute full numerical values, of course cannot work with this scheme. Instead, regression models are typically evaluated on the basis of metrics measuring a system's mean error rate. 18 More accurately, the ROC-AUC measures 'the probability that a randomly chosen negative example will receive a lower score than a randomly chosen positive example, i.e. that the two will be correctly ordered', see
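To make stages 7-9 concrete, the following is a minimal, purely illustrative sketch using scikit-learn; the toy comments, labels and parameters are assumptions for demonstration (the character n-gram/logistic regression combination happens to resemble the type of model Perspective started with, see section 3, but this is not that system's code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy corpus with binary 'attack' labels (purely illustrative).
comments = ["you are an idiot", "thanks for the helpful edit",
            "nobody wants you here", "great point, well sourced",
            "go away, you moron", "I respectfully disagree"]
labels = [1, 0, 1, 0, 1, 0]

# 7. Model selection: character-level n-grams fed into a logistic regression.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4))
X = vectorizer.fit_transform(comments)

# 8. Training on one subset of the data ...
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0, stratify=labels)
model = LogisticRegression().fit(X_train, y_train)

# 9. Testing: evaluate on the held-out data, here with the ROC-AUC metric.
scores = model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, scores))
```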

Deployment
If such efforts have not been made before, deployment will firstly include identifying concrete use scenarios and contacting potential customers (10. Distribution). Amongst other options, developers can sell their system to one or more specific customers, offer it as a rentable service or make it free for open use. Any of these decisions will have different consequences for how the model will be advertised as well as for what kind of front end (i.e. a program's client-facing user interface) needs to be developed. Finally, the system must be implemented (11. Implementation). This entails the model's integration into existing workflows and its operational activation. However, while implementation marks a potential close for the development process, it is more common for developers to stay in the loop, keep monitoring the system's operations and performance and overhaul components where necessary (12. Monitoring). 20 Such overhauling, for its part, can also comprise broader restructurings, such as retraining based on newly available data generated by the running system. All in all, ML software development thus often results in what is better described as a continuously updating and updated 'living' system, rather than a stable final product (Figure 1). 20 One reason why monitoring may be important is to identify performance shortcomings not identified through the testing process, e.g. because the type of data which the system performs poorly on was not included in the test set.
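Where a single train/test split seems too fragile, or where retrained models need to be re-evaluated during monitoring, developers often rely on the cross-validation described in n 16 above. A minimal sketch, with toy data and an illustrative fold count (not any production setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

comments = ["you are an idiot", "thanks for the helpful edit",
            "nobody wants you here", "great point, well sourced",
            "go away, you moron", "I respectfully disagree"]
labels = [1, 0, 1, 0, 1, 0]
X = CountVectorizer(analyzer="char", ngram_range=(1, 4)).fit_transform(comments)

# 3-fold cross-validation: train on two folds, validate on the third,
# rotate, and average the resulting ROC-AUC scores.
scores = cross_val_score(LogisticRegression(), X, labels, cv=3, scoring="roc_auc")
print(scores, scores.mean())
```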
The case of Perspective API: a more detailed look

This conceptual breakdown can also help us retrace and understand the development of Perspective API. The extensive publicly available resources documenting the development of Perspective provide exceptional material to become acquainted with (some of) the trade-offs software developers are usually faced with and (some of) the heuristics they follow to resolve them. The following account – which is primarily based on Perspective's textual outputs, but has also been enhanced and corroborated by interviewing one of Perspective's early lead researchers/developers – is therefore rather thorough. The hope is that it can give jurists and other social scientists a sense of the internal logics driving ML development processes.
Before we dive into the analysis, it seems important to note that the team behind Perspective has put a substantial amount of work and resources into developing a free-of-charge moderation engine that is accessible to everyone and, to the author's knowledge, constitutes the most transparent automated moderation system currently operating. It has also significantly expanded academic discussions around and reflection on automated content moderation and deliberative AI development more broadly. 21

Project conception
Perspective API in 2016 started off as a collaboration between Wikimedia Research and Google's public charity think-tank Jigsaw, intended to understand the spread of 'toxicity and harassment' on Wikipedia discussion pages. 22 Internal Wikimedia surveys had suggested that verbal abuse and hostility were not only a widespread phenomenon on Wikipedia discussion pages, but also drove some users away from engaging on the site. 23 As a response to these findings, a small team of researchers was tasked to lead a community research project to deliver a more detailed picture of the factual situation as well as brainstorm potential countermeasures. Despite early user concerns and alternative proposals to rethink and improve Wikipedia's poorly functioning 24 (human-operated) moderation system, 25 the project's administrators seem to have quite quickly chosen some form of AI-powered automation as the preferred output of their undertakings. 26 Soon after, the team trained its first models, 29 uploaded a public demo 30 and published a research paper on the system's architecture and performance. 31

Despite virtually immediate findings that Perspective's filters were not only achingly vulnerable to simple adversarial work-arounds, but also seemed to block worrying amounts of benign text, 32 and with Wikimedia appearing to lose interest in continued collaboration, Perspective began to expand its operations (see infra 3.4). However, as other platforms did not have the same comment policy as Wikipedia, having the system zero in on 'personal attacks' seemed misplaced and a new, broader target variable necessary. Perspective's engineers therefore had to define a new standard according to which the system was to distinguish acceptable from non-acceptable comments. They believed that the system had to go beyond established speech prohibitions as stipulated by legal or policy documents (e.g. proscriptions against speech intended to incite violence or 'hate speech'), which they perceived as too demanding to efficiently prevent online conversations from derailing. 33 At the same time, they wanted (and needed) 34 individuals to agree as much as possible on the content that was to fall under the prohibition. 35 Coherent with these objectives, and after some experimentation, the team's choice ultimately fell on 'toxicity', defined primarily as material 'likely to make people leave a discussion'. Whereas Perspective has over the years added a number of 'toxicity' subcriteria, such as 'identity attack' ('comments targeting someone because of their identity') and 'insult' ('insulting, inflammatory, or negative comment towards a person or a group of people'), 36 which the system can 'predict' if so configured, the system is still chiefly promoted as being able to 'identify [online] toxicity' 37 and 'toxicity' remains the system's primary target variable. 38 38 Besides, at least the New York Times has access to a further registry of site-tailored target variables, such as 'incoherent', 'inflammatory' or 'unsubstantial'.

Data collection and data preparation
The data for Perspective's first models consisted of comments drawn from Wikipedia discussion pages in the period between 2004 and 2015. 39 For annotation, a total of 4053 crowdworkers on the platform CrowdFlower were then asked to identify whether the specific comment they were shown 'contain[ed] a personal attack or harassment'. 40 This procedure led to the annotation of altogether 115,737 comments, with each comment labelled by at least 10 different annotators. 41

Consistent with its post-launch reorientation towards tackling 'toxicity', Perspective quickly had to obtain new data. Again, Perspective turned to its Wikipedia talk page corpus, 42 this time having 120,000 comments labelled on a five-step scale from 'very healthy' to 'very toxic'. 43 The instructions given to annotators defined 'toxic' comments as comments that were 'hateful, aggressive or disrespectful' and 'likely to make you leave the discussion', whereas 'healthy' comments were described as 'polite, thoughtful or helpful, [and] likely to make you want to continue the discussion'. 44

Only months after that, however, Perspective was faced with a new wave of public criticism: journalists, testing the model's public demo, discovered that Perspective's algorithm exhibited disconcerting bias against comments making (self-identifying) references to minority identities. 45 Anodyne comments such as 'I am a black man.' (assumed 80% toxic), 'I am a dyke.' (assumed 60% toxic) and even 'Islam is a major world religion.' (assumed 66% toxic) all returned alarmingly high 'toxicity' scores. Observers, quite understandably, worried that such errant inferences could exacerbate, rather than remedy, the silencing of already marginalised users. 46 To address these concerns, Perspective attempted to 'improve models by balancing their training data' 47 with non-toxic examples of identity-term usage mined from Wikipedia articles. 48

Doubling down on its de-biasing efforts, Perspective in March 2019 introduced yet another dataset. Instead of relying on unreviewed Wikipedia text, however, this time a new set of human-labelled data was presented, consisting of roughly 1.8 million comments taken from the archive of a commenting plugin provider called 'Civil Comments'. 49 Crowdworkers had labelled all comments with toxicity scores and also tagged a subset of 450,000 comments with additional identity attributes indicating whether the comment mentioned a certain identity (e.g. female, male, homosexual, Jewish or disabled). 50 The hope was that this new data could make up for the lack of complexity and authenticity the previous synthetically generated de-biasing dataset had exhibited. 51 Finally, while Perspective has tapped into yet other data sources, the shape and form of these is somewhat unclear. 52 Private statements by the Perspective support team indicate that the system does not make (continual) use of the huge volume of comments flushed through its servers by the institutions using its public API. 53

Model building

For its first classifier, the team built and compared a range of candidate models, varying the learning algorithm, the n-gram representation (word-level or character-level) 55 and the way annotations were encoded: annotations could be coded either binarily ('attack' or 'no attack') or as numerical averages (e.g. 0.3 or 0.95 depending on the distribution of 'attack' or 'no attack' votes). After having trained and tuned all models over a 3:1:1 training:development:testing split, 56 all 8 models were evaluated on the basis of two classic 'performance' metrics: ROC-AUC 57 and Spearman's rank correlation coefficient ρ. 58 The model that ultimately ranked highest on both metrics consisted in a logistic regression model trained on character-level 4-grams with annotations input as numerical averages. 59 55 N-grams are contiguous sequences of n smaller textual units (units can be words, characters or other linguistic items such as syllables etc.). To give an example: the sentence 'To be or not to be' could be represented in the following ways: To, be, or, not, to, be (word-level 1-gram); To be, be or, or not, not to, to be (word-level 2-gram), etc.; T, o, _, b, e, _, o, r, … (character-level 1-gram); To, o_, _b, be, e_, _o, or, r_, … (character-level 2-gram), etc. An n-gram model then calculates the desired output variable, such as a classification label, on the basis of the n-grams' probabilistic distribution. 56 More information can be found here: <https://github.com/ewulczyn/wiki-detox/tree/master/src/modeling> accessed 20 December 2022.
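As an aside, Spearman's ρ simply measures whether a model orders comments in the same way as the (averaged) human annotations; a toy illustration (not Perspective's evaluation code):

```python
from scipy.stats import spearmanr

# Averaged annotator scores vs. a model's predicted scores (toy numbers).
human_avg = [0.0, 0.1, 0.3, 0.7, 0.9]
predicted = [0.05, 0.20, 0.25, 0.60, 0.95]

rho, _ = spearmanr(human_avg, predicted)
print(rho)  # 1.0: the model ranks the five comments exactly like the annotators
```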
As with the system's training data, these algorithmic models, too, had to be modified at the time of Perspective's broader redeployment as a system for 'toxicity control'. To this end, Perspective launched a public coding competition promising the three teams that could develop the strongest 'toxicity' classifier cash prizes totalling $35,000. 60 Out of the 4,500 contending teams, ranked with regard to their models' ROC-AUC, the winning contribution presented a significant complexification of Perspective's previous models. Amongst other things, the simple bag-of-words representation was replaced by constructing the algorithm on top of pre-trained word embeddings (namely FastText and GloVe) 61 and the previously used logistic regression algorithm was exchanged for a convolutional Neural Network. 62

After the discovery of Perspective's concerning discriminatory effects, two other remodelling projects were launched to reduce the system's bias. First, the model was retrained with the extended Wikipedia dataset. 63 A subsequent research contribution by Perspective, however, showed that the mitigation strategies' effect on real-world identity-referencing data was altogether weak. 64 Second, using the data sourced from Civil Comments, another public coding competition asked participants to mitigate bias while not reducing the model's overall performance. 65 As Perspective used a specifically designed new metric to measure the system's (reduced) bias, the improvements achieved through this remodelling are somewhat hard to assess. What is clear, however, is that the team that ultimately won the competition did so by introducing further complexities to the existing model, such as replacing the previous (non-contextual) word embeddings with even bigger (contextual) pretrained language models (namely a combination of BERT, GPT-2 and XLNet) and adding a model-combination algorithm called Stochastic Weight Averaging. 66

Finally, in 2022 Jigsaw presented 'a new generation of toxic content classifiers', which they call Unified Toxic Content Classification (UTC). 67 This new framework consists of a large transformer architecture, 68 pretrained on a mix of proprietary data from Perspective itself (4.6B comments to online fora, partly derived from the live traffic sent to Perspective's production API) and an existing 'massively multilingual' corpus called mT5. 69 On the reported evaluation benchmarks, including a new test set constructed from Perspective's abovementioned new proprietary data set as well as a number of preexisting evaluation sets, 70 this new architecture largely outperformed Perspective's previous model and especially increased performance on non-English languages. 71 After this latest overhaul, most of Perspective's functions are now available in a total of 18 different languages. 72
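For a rough sense of what 'building on top of' a pretrained language model looks like in practice today, consider the sketch below using the Hugging Face transformers library; the model identifier is a deliberate placeholder, since neither Perspective's production models nor UTC are publicly released:

```python
from transformers import pipeline

# The model identifier below is a placeholder: substitute any pretrained
# toxicity classifier published on the Hugging Face hub.
classifier = pipeline("text-classification", model="some-org/some-toxicity-model")

print(classifier("I am a black man."))
# Output format (label names, score ranges) depends entirely on the chosen model.
```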

Deployment
As mentioned above, Perspective started as a collaboration with Wikipedia, which was probing ways to counter the harassment present on some of its discussion pages. 73 Yet, as the collaboration with Wikipedia concluded after the first model was developed and potentially ready for implementation, Perspective needed to forge new links. One interested party was the New York Times, which started working with Perspective early on. 74 After having handed Perspective access to its archive of moderated comments, the Times in mid-2017 announced that it had implemented a new system called Moderator based on Perspective for its internal moderation operations. 75 Moderator would rank all incoming comments with a score indicating 'the probability that it would be rejected by a Times moderator' so that the Times' moderators could prioritise review and (potential) clearance of those comments the system deemed most likely in breach of the Times' standards. 76 Comments with a low probability of rejection could (and later would) 77 be waved through automatically, while higher-scoring contributions could be authorised in bulk whenever a moderator had ascertained that other comments with similarly high toxicity scores were, indeed, acceptable. 78 The Times also warned its moderators of Perspective's limitations and emphasised that all human moderators were free to overrule the system's scorings at any point. 79

Disqus, a comment hosting service used on sites such as ABC News and Rotten Tomatoes, which processes an average of 50 million comments per month, was the next large adopter of Perspective. 80 Different from the implementation at the Times, however, Disqus kept instructions on how to interpret and use the filter short and also provided a rather simplistic, uncritical front end: website administrators could have Disqus show them all incoming comments labelled 'toxic', but could not access information on what threshold was chosen for distinguishing toxic and non-toxic comments, nor remit comments they deemed wrongly labelled. 81

Other newspapers and commenting services followed. 82 Some of these use Perspective not (only) to assist their moderation staff internally, but also to warn users before posting whenever the system considers their comments (intolerably) 'toxic'. 83 Gaming platform FaceIt lets Perspective screen all chat messages in real time and hand out warnings and bans automatically, 84 while the Latin American social media platform Taringa! labels comments with a score above 0.9 'NSFW' or blocks them automatically. 85 Freelance programmers furthermore developed Perspective-based moderation plug-ins for a number of major online platforms such as Discord, 86 Wordpress, 87 Reddit 88 and Telegram. 89 In early 2021, Jigsaw announced that Perspective was processing 500 million requests a day. 90

Finally, Perspective has also been able to establish itself as a much-used and well-respected tool and benchmark for other researchers and software developers. Inter alia, Perspective has been used to define and classify types of 'toxic users', 91 to identify 'toxic comments' related to political news cycles and hypothesise about the drivers of such 'toxicity', 92 and to evaluate the efficiency of mitigation strategies attempting to prevent large language models from generating 'harmful' 93 or 'biased' 94 text. Beyond its practical relevance, Perspective has by now thus also acquired considerable epistemic authority.
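Technically, adopters integrate Perspective through simple HTTP requests. The sketch below follows the publicly documented request format at the time of writing; the API key is a placeholder and exact field names may have changed since:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: keys are issued via Google Cloud
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "You are a wonderful person."},
    "requestedAttributes": {"TOXICITY": {}, "INSULT": {}},
}
scores = requests.post(URL, json=payload).json()
toxicity = scores["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Adopters then apply their own policies, e.g. Taringa!'s reported 0.9 cut-off.
if toxicity > 0.9:
    print("flag as NSFW or block")
```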

Dis/similarities: the development of legal and algorithmic normative systems compared
The following section will now compare the regulatory affordances of algorithmic systems as uncovered by the case of Perspective API with the normative structure of legal systems. The section will first zoom in on the issue of practical constraints (4.2), and then take a closer look at questions of evaluative diversity (4.3). Subsequently, it will explore differences in systems' modes of evolution (4.4), and finally weigh in on diverging standards of evaluation (4.5). What is meant by these themes of comparison will (hopefully) become clear in the following paragraphs. Before we dive into these issues, however, it is important to contextualise some of the following considerations and highlight (potential) limitations to this study as well as similar trans-systemic examinations of normativity more broadly (4.1).

Context and limitations
The present inquiry focuses on the technical side of things. It does not direct (as) much attention to the broader institutional, cultural, political, economic, ideological … context within and through which Perspective came about. It is clear that an analysis of Jigsaw's institutional setting 95 would be able to uncover other important determinants of Perspective's trajectory. It could perhaps uncover specific organisational incentives (economic and other), a specific institutional memory or specific individual beliefs and goals guiding Jigsaw's staff. The present inquiry does not analyse or weigh in on any of these aspects in greater detail. It may thus overemphasise the contribution of technical aspects to the detriment of other factors. 95 Such an analysis may consider a number of aspects, such as that Jigsaw was formally an independent Alphabet subsidiary but has been operating as a unit within Google since 2020, that it started with an original 'mission […] to use technology to tackle the toughest geopolitical challenges', now reformulated to '[u]pholding technology as a force for good' against 'threats to open societies', that it was initially directed by former U.S. Secretary of State staff member Jared Cohen and is now headed by former Google strategist Yasmin Green, or that overall it seems to occupy some rather hard-to-define space between public-benefit thinktank, corporate philanthropy and strategic (macro)market development; it may also be noted that most of Jigsaw's other projects, such as its VPN software (Outline), its anti-DDoS service (Project Shield) or its ads targeting methodology trying to expose radi-
Perspective, its development process and the identified normative characteristics may also not be representative of other content moderation systems or of other decision-oriented ML systems more generally. Concerning other moderation systems, Perspective's development is presumably more similar to moderation tools developed by moderation-as-a-service providers, moderation tools annexed to large language models 96 as well as moderation tools pertaining to the spectrum of practices that have elsewhere been called 'artisanal' 97 than to those employed by large social media platforms like Facebook or YouTube. In contrast to these largest moderation efforts, these systems, like Perspective, by and large do not interact with industrially scaled teams of human moderators and, relatedly, do not possess capabilities to quickly retrain or remodel systems in light of novel (linguistic) phenomena. 98 As concerns comparisons with other (decision-oriented) ML systems, the case of Perspective may be more representative of systems/institutions sharing some of Perspective's structural characteristics (such as the fact that it has to perform deeply value-laden evaluations on a continuously shifting range of inputs or that it centralises systems development while implementation and usage take place in a decentralised manner) than of those that do not. Findings may thus (perhaps) transfer well to other API-connected systems, but less well to intra-institutionally developed systems.

Finally, on a more fundamental level, it may be objected that comparing algorithmic systems with law is like comparing apples with oranges or grandmothers and toads. 99 The differences between these things would just be too structural for comparisons to spawn any kind of interesting or scientifically worthwhile insights or, perhaps more problematically, would lead to false analogies and obstruct rather than advance an accurate understanding of their nature. To a certain extent this is true. Private companies and their computational products clearly operate on a structurally different level from the law. Citizens also rarely hold the exact same expectations of legitimacy (and thus objectivity, equal treatment, democratic participation etc.) with regard to private companies as they do with regard to states and other legislating bodies. At the same time, more and more people do believe that many large tech companies should wield their power more responsibly. What is more, Jigsaw proactively presents itself as an altruistic benefactor and rather confidently assumes competencies (e.g.
resolving conflicts or providing public order) that would usually (need to) be fulfilled by governments. Finally, even if one rejects the idea that law and private technological infrastructures can (sometimes) be held to the same or similar normative requirements, seeing the undeniable growth of tech companies' (regulatory/normative) power, it still seems worthwhile to at least attempt to compare their normative structures with a mere descriptive interest. That is, by and large, what the following sections aim to do. 100 98 In particular, this might mean that such smaller systems exhibit similarities with regard to the aspects of evaluative diversity, modes of evolution and standards of evaluation, infra 4.3-4.5. 99 Of course, many comparisons between apples and oranges are worthwhile and constructive given certain rather widespread (epistemic) interests, such as comparisons as to their mean calorie content, their mean vitamin content or their mean carbon footprint; see also Scott A. Sandford, 'Apples and oranges: a comparison' (1995) 1.3 Annals of Improbable Research 2; grandmothers and toads may be compared as to their cognitive capacities or with regard to certain aspects of their social behaviour. 100 Compare Hildebrandt (n 6) 2 'The question of the moral evaluation of a specific technology is not equivalent with an assessment of the "normativities" it affords'.

Practical constraints
The first, and perhaps most obvious, thing that stands out when comparing algorithmic and legal normative systems concerns the vastly different practical constraints these systems have to deal with. With practical constraints we refer here to the manifold requirements, limitations and impossibilities that arise from a normative system's practical operationalisation, its practicalities. Practical constraints shape, restrict and condition normative projects, but do not themselves (claim to) further any positive normative objectives. The question of which phenomena exactly are to be understood as such practical constraints will, of course, never have one univocal answer.

With regard to the law, one may claim that one big set of practical constraints emerges from law's linguistic nature, from the fact that law is (generally) expected to be represented in the form of (written) human-readable language. 101 This linguistic nature, for one thing, poses limitations where the law attempts to regulate phenomena naturally expressed in other kinds of languages, such as computational languages. 102 It also generates a much larger sphere of limitations related to the fact that languages are syntactically and semantically finite (limiting the space of possible normative projects), 103 that they are pre-existing (limiting law's capacity to rid itself of undesired connotations) and that they are fundamentally indeterminate (limiting law's capacity to generate normative certainty). 104 A second set of practical constraints arises from law's enforcement structures. Law, unlike certain other normative media, 105 by and large is not self-executing; its effects on the real world, generally speaking, depend on secondary enforcement through authorities, courts and other institutions. This dependence, of course, firstly introduces a structural lack of resources: authorities possess insufficient resources to investigate all cases falling within their jurisdiction, courts possess insufficient resources to conduct all-encompassing fact-finding and legal evaluation, individuals possess limited resources to pursue all well-grounded legal action. It also introduces other limitations, however, such as those caused by the fact that enforcement usually has to follow clearly defined procedures (e.g. an adversarial or inquisitorial procedural regime) or that enforcement bodies can structurally distort regulatory projects (e.g. because of pre-existing bias or exposure to lobbying influence). Attentive regulators may be aware of such constraints and attempt to anticipate their effects, but not all of these hurdles are remediable. 101 For a good introduction into many of these issues, see Timothy Endicott, 'Law and Language' in Edward N. Zalta (ed), The Stanford Encyclopedia of Philosophy (Spring 2022), <https://plato.stanford.edu/archives/spr2022/entries/law-language/> accessed 20 December 2022. 102 For example, where the law is concerned with certain problems in data protection or algorithmic bias, it cannot address them in their own 'natural' languages, which are computational, but instead has to translate such problems to human-readable approximations. Similar considerations might apply to phenomena expressed in yet other 'languages', such as with regard to copyright questions in the fields of music or dance. 103 The consequences of this feature are well illustrated by the discipline of comparative legal linguistics, see e.g. Comparative Legal Linguistics. 104 Hildebrandt (n 6) (framing this last characteristic as a feature rather than a bug of law's normativity). 105 Computer code, for example, is often framed as possessing a form of self-executing normativity. Of course, this view is based on a specific understanding of what 'code' comprises, which is probably too narrow to include applications like Perspective.

Algorithmically operationalised normative systems encounter a (partly) different set of practical constraints. One group of such constraints, easily overlooked, arises from the fact that (most) Machine Learning models, in order to perform 'well', 106 not only require loads of data to train on, but also require training data that is fairly homogeneous, incorporating fairly consistent ratings. This has a specific impact on the phase of data annotation, where one 'half' of the system's training data, its ground truth, is constructed. Annotation has to be geared towards obtaining the greatest level of inter-annotator agreement possible. As the interviewed Perspective engineer stated: 'If you are trying to get a machine to do something and you don't have agreement, the machine can't do it.' 107 Within the context of Perspective, this need for agreement influenced both the selection of 'toxicity' as the system's primary target variable as well as the instructions given to annotators on how to identify 'toxic content'. As emerges from some of Jigsaw's publications and was further corroborated by the interviewed engineer, 'toxicity' and the corresponding annotation instructions were (also) chosen because they generated higher inter-annotator agreement than other potential criteria. 108 This makes sense from a ML point of view: high inter-annotator agreement generates at least some justification, providing a measurable metric indicating 'high data reliability', 109 and, as was mentioned, is generally believed to increase the model's performance. At the same time, it seems questionable that these technical considerations/incentives should constrain, or even govern, such an essential and chiefly normative choice as that of a system's target variable. This holds all the more when the choice falls on a variable such as 'toxicity', which can cite no national or international legal standard backing it, 110 is seen by many as irrelevant or undesirable and carries a rather marked risk of metastasising into a more diffuse marker for merely inconvenient or objectionable content. 111

Another set of constraints, again exemplified by the development of Perspective, emerges from the integration of pre-trained large language models, which, due to the accuracy boosts they often create, have quickly become the new state of the art for NLP tasks such as text classification. Large language models, of course, first of all introduce a new level of complexity to language classification, which makes it harder to predict potential shortcomings and biases. Furthermore, the costs of creating and even running these models will be too high for many actors to create or modify such models themselves, rendering them dependent on models created by third-party institutions. Where firms resort to cheaper and less compute-intensive adaptations of large language models, such as distillations, 112 they may find that previous bias-mitigation strategies prove less effective. 113
Finally, to name one last example of practical constraints, operationalising speech prohibitions through algorithmic systems also makes it harder to integrate context into the evaluation of speech. While it is at least possible, albeit difficult, to include immediate textual context into speech evaluation (e.g. by adding a comment's surrounding comments into the training database), 114 there are so far no available strategies for including non-textual (non-immediate) context (such as information on current relevant political happenings or any other information not represented or representable within the text to be evaluated itself). This impossibility of enriching text evaluation with external context represents another practical constraint encountered by algorithmically, but not legally, operationalised speech evaluation systems.
Many more practical constraints can be imagined for legal and algorithmic normative systems alike. Clearly, no normative system can ever be free of practical constraints. This, however, should not mean that consideration of such constraints, especially in a normative project's early phase of conception, should be foregone, but rather that the undesired effects of such constraints should be counteracted through conscious, foresighted design.

Evaluative diversity
Another point of difference between algorithmic and legal systems might be seen in the degree of evaluative diversity. Judging from the case of Perspective, it seems that legal systems generally allow for higher degrees of evaluative diversity.
Any (complex) normative system can be characterised as to the degree of variance and disagreement it affords, or what might be called evaluative diversity. This degree will be influenced, inter alia, by the number and independence of institutions authorised to issue evaluative judgments, by the ambiguity of such judgments and the ambiguity of the evaluative standards themselves, as well as by the system's need for and means to synchronise different actors' evaluative positions. 115 Legal systems generally seem to afford a rather high degree of evaluative diversity. This diversity can be seen as a function of a number of central institutional principles. Judicial independence, the limited availability of appeal mechanisms and the possibility of published dissent, as well as the brute reality of a highly decentralised, multi-party application apparatus, all systematically limit the convergence of judgments. 115 To give an example, the evaluative diversity of a system of faith could be related to the number of worshipped deities, the (non-)ambiguity of its foundational texts as well as the number and independence of institutions accredited to put forward authoritative interpretations.
In the production of technologically driven normative systems, the opposite tendency seems discernible. Here, normative convergence is often the default, if not the only, option.
One phase in which this became evident in the development of Perspective was the phase of annotation. Much in line with the previous decision for 'toxicity' as the system's target variable, the annotation process, too, was geared towards obtaining maximum evaluative convergence. Both the annotation instructions as well as the annotators themselves were (albeit perhaps unconsciously) chosen for the reason that they generated the highest degree of uniformity possible. 116 Again, there are several reasons for such an approach. For one thing, more distinct rating distributions generally facilitate the development of a more 'confident' model, generating higher rating probabilities. 117 For another, within software engineering communities the belief to some degree persists that diverging ratings indicate mistakes in the annotation process; an error-free procedure would lead to perfect uniformity. 118 Needless to say, this is wrong: any task that is not completely 'objective' – which the normative evaluation of human language most certainly is not – will naturally elicit varied responses, none of which can be (readily) ascertained as true or false. 119

Much more fundamentally still, the bias towards convergence in many supervised ML classifiers finds expression in the simple fact that annotations are (uncritically) unified into single value aggregates. Whereas annotation starts with a range of diverging ratings, reflecting the underlying variety of diverging judgments, it usually ends up with one single, 'harmonised' value, 120 subsequently used to develop one single model. To give one example: what exactly does an aggregated value of 0.58, composed of seven 0.4 ratings (lightly leaning towards non-toxicity) and three 1.0 ratings (firmly set on toxicity), represent? What seems lost on most developers here is that such aggregation not only eliminates relevant information and fabricates a non-ambiguity that does not exist in reality, but that it can also lead to a significant artificial inflation of performance scores. 121 While research on how more accurate systems, reflective of the diversity of real-life judgment, can be developed has started, 122 any system that ultimately winds up with one single value will have serious difficulty conveying the evaluative diversity occurring in reality. Therefore, the only approach to overcome such simplifications seems to be the development of systems able to 'propagate uncertainty [and diversity] downstream'. 123

None of this is to say that more evaluative diversity is always and necessarily preferable. While evaluative diversity increases a system's flexibility, rendering it more responsive to the diversity of factual circumstances, 124 and can abet normative innovation, 125 it can also decrease the predictability, consistency and equality of a normative system. What degree of evaluative diversity one should strive for is thus ultimately a normative question itself. Algorithmic and legal normative systems present different potentials for obtaining different levels of evaluative diversity, with algorithmic systems currently showing stronger centripetal tendencies.
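The information such aggregation discards can be made concrete with toy numbers echoing the 0.58 example above (an illustration, not anyone's actual annotation data):

```python
from statistics import mean, stdev

minority_alarm  = [0.4] * 7 + [1.0] * 3   # seven mild ratings, three maximal ones
uniform_concern = [0.58] * 10             # ten identical middling ratings

for votes in (minority_alarm, uniform_concern):
    print(round(mean(votes), 2), round(stdev(votes), 2))
# Both distributions aggregate to 0.58, but only the first contains a group of
# annotators who consider the comment unambiguously toxic.
```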

Modes of evolution
The procedures through which Perspective was fitted with new data, in particular, allow for interesting observations on the different modes of evolution characterising legal and algorithmic normative systems. 126 Here, modes of evolution refer to the methods and dynamics pursuant to which a normative system develops.
While differences exist between legal systems and legal fields, it is safe to say that law, for the most part, evolves incrementally. This deeply entrenched conception 127 can be explained with reference to two central, albeit countervailing forces. On the one hand, law's case-based exposure to ever new arguments in ever new factual circumstances and its obligation (and capacity) to respond to such case-specific arguments effect an enduring contestation, modulation and normative rejuvenation. On the other hand, the principle of separated powers and the related understanding of the judiciary as performing an essentially retrospective-resolutive function, as well as the principles of legal coherence and legal stability limit the possibility of more extensive, coordinated legal innovations. Lodged between these two countervailing forces, the law routinely ends up in a situation where it can learn and develop bit by bit, but never too much at one time. In other words, the law routinely ends up with incremental development. 128 If we look at the construction of Perspective, a different developmental dynamic seems to prevail.
Starting with the training of Perspective on the first dataset from Wikipedia, the system until now has progressed mainly through the one-at-a-time addition of large, discrete datasets. Except for the tailored implementation built for the New York Times, 129 Perspective does not, in fact, automatically incorporate the scoring feedback adopters can submit through the API's 'SuggestCommentScore' feature or make use of it to retrain its model. 130 Put differently, Perspective does not, to use the prevalent Computer Science terms, perform online or continual learning.
There are different reasons why continual learning might not be implemented in operative ML algorithms. From a technical perspective, continual learning algorithms face a problem called the stability-plasticity dilemma and can be struck by so-called catastrophic forgetting. 131 Different from humans, ML systems can have significant difficulty retaining old knowledge under the influx of new learnings. While research into solving these problems is prolific and diverse, 132 it is often highly task- and architecture-specific and, therefore, hard to port. 133 Besides these technical reasons, offline training also provides considerable safety and control benefits. For instance, developers can evaluate the effects of a (re-)training before 'letting the algorithm loose on reality', something which cannot be equally guaranteed under continual learning conditions. Finally, when system control is centralised, as in the case of Perspective, dataset holders might fear that taking in annotations from decentralised adopters could expose their dataset to manipulation, sloppiness or, perhaps most importantly, ungeneralisable semantic idiosyncrasies. While 'perfect' continual ML might therefore be theoretically possible, practical circumstances significantly hinder its realisation.
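For contrast, what online or continual learning looks like in code can be sketched with scikit-learn's partial_fit interface; the data and model choice are illustrative assumptions and bear no relation to Perspective's actual architecture:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(analyzer="char", ngram_range=(1, 3))
model = SGDClassifier(loss="log_loss")

# Initial batch ...
X0 = vec.transform(["you idiot", "thanks a lot", "what a moron", "nice work"])
model.partial_fit(X0, [1, 0, 1, 0], classes=[0, 1])

# ... later, a new batch arrives (e.g. feedback submitted through an API):
# the existing model is updated in place rather than retrained from scratch.
X1 = vec.transform(["utterly brainless take", "appreciate the reply"])
model.partial_fit(X1, [1, 0])
```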
For these reasons, many supervised Machine Learning systems follow a learning path progressing through non-continuous, discrete updates or what might be called transformative evolution. Instead of the gradual progression characterising the legal system, changes are introduced through one-off additions of (large) data batches which at once update the entire system.
It is not immediately apparent whether one type of evolution is preferable over the other. Of course, a more incremental developmental dynamic bears the advantage of adapting to changing circumstances more quickly. The possibility of planned, purposive transformation, on the other hand, gives system administrators much more control for targeted improvement.
Ultimately, the suitability of different development types might depend on the concerned regulative task and parameters such as variability, predictability and context-sensitivity. In contexts like content moderation, where both the data (usage and meaning of words) and the evaluative thresholds (communication norms) are in constant shift, the benefits of agile, incremental evolution are hard to dismiss. At the same time, algorithmic systems provide much larger opportunities for pre-adoption impact assessments and allow developers to evaluate how modifications will influence the treatment of known or contrived cases, something not equally possible within legal contexts.

Standards of evaluation
Finally, legal and algorithmic systems also differ with regard to their respective standards of evaluation. Standards of evaluation, here, refer to the kinds of arguments that can be, or usually are, made in evaluating a system's normative soundness.
The legal system's standards of evaluation, in essence, are twofold. 134 The first source of standards consists of the legal system's own, internal stock of arguments: existing precedent, rules, doctrines, principles, values, etc.. Especially as petitioners facing unfavourable legal conditions will attempt to modify the existing standards (e.g. through recourse to the system's more open-textured elements such as constitutional norms), these internal standards are in constant flux.
The second source of law's standards of evaluation, in line with what was said above on law's practical-contextual embeddedness, is practical reason. Practical reason, here, refers to the virtually boundless mass of practical evaluative arguments brought forward in contesting (or defending) the 'real world' justifiability of a given legal rule.
The standards of evaluation in algorithmic systems differ rather markedly from these. As suggested by the case of Perspective, in algorithmic systems the dominant evaluative standards undoubtedly are standards of accuracy, i.e. standards concerning the system's ability to correctly predict as many instances as possible. While there is nothing inherently wrong or dubious about accuracy metrics, 135 their (uncritical) centring also raises questions. First of all, the language of accuracy often obscures that every metric selection always also implies a choice of normative priorities. This is rather obvious and, therefore, well-established for metrics focusing on one particular type of error, such as recall or precision. 136 Even the ROC-AUC used by Perspective, however, which is prevalence-independent and therefore often depicted as objective and value-neutral, 137 contains clear normative choices: it assigns equal importance to False Positives and False Negatives and prioritises correct ranking (i.e. that a positive example will receive a higher predicted likelihood than a negative example) over correct probability estimates (i.e. that estimated probabilities approximate their true values). 138 Yet, it is neither evident that wrongly limiting unoffensive speech should have the same relevance as wrongly rubber-stamping invective, nor that a system built to inform human moderators with numerical toxicity scores should not prioritise the accuracy of precisely these numbers.
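The point about ranking versus probability estimates is easy to demonstrate: two sets of scores that order the examples identically receive the same, perfect ROC-AUC, even though only one of them provides meaningful probabilities. The toy numbers below use the Brier score as a simple measure of probability accuracy:

```python
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = [0, 0, 1, 1]
calibrated = [0.1, 0.2, 0.8, 0.9]      # plausible probability estimates
rank_only  = [0.48, 0.49, 0.51, 0.52]  # same ordering, meaningless probabilities

for scores in (calibrated, rank_only):
    print(roc_auc_score(y_true, scores), brier_score_loss(y_true, scores))
# Both receive a perfect ROC-AUC of 1.0; only the Brier score (which measures how
# close the estimated probabilities are to the true outcomes) tells them apart.
```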
Furthermore, accuracy can conflict with other normative desiderata. 139 The case of Perspective illustrates that, whereas developers may not necessarily actively prioritise accuracy over other values, they may often follow an unconscious cognitive bias according to which overall accuracy represents the system's most central measurement and should therefore never incur impairments. The debiasing efforts of Perspective, for instance, all operated on the proviso that changes should not have (significant) negative effects on overall accuracy. Effectively, accuracy can thus end up constraining a system's development and neutralising overhauling efforts.
What is more, the focus on accuracy metrics draws resources away from a stronger engagement with external evaluation methods 140 and distracts from simple, yet highly revealing evaluation heuristics, such as asking how similar the training data used is to the actual field of application. It also suppresses the creation of new, innovative evaluation standards that could explore the normative ramifications of algorithmic systems in greater depth and detail. These could include more localised inquiries (such as how Perspective affects speech on specific topics, certain vocabularies, 'political speech' generally, …), targeted diagnoses of a model's weaknesses 141 or testing on data sources not used in the training phase.
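The first of these heuristics can be made concrete in a few lines of Python; the file names are placeholders, and the token-overlap measure is only one of many possible ways to compare training and deployment data:

# How much of the vocabulary actually used in the target platform's comments was ever
# seen in the training corpus? A low share would signal that accuracy figures computed
# on the training distribution say little about the intended field of application.
from collections import Counter
import re

def vocabulary(path):
    with open(path, encoding="utf-8") as f:
        return Counter(re.findall(r"\w+", f.read().lower()))

train_vocab = vocabulary("training_comments.txt")     # e.g. Wikipedia talk page comments
target_vocab = vocabulary("deployment_comments.txt")  # e.g. a news site's comment section

seen = sum(count for word, count in target_vocab.items() if word in train_vocab)
total = sum(target_vocab.values())
print(f"Share of deployment tokens covered by the training vocabulary: {seen / total:.1%}")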
Developers, here, could draw inspiration from legal systems. Legal systems have grappled with many speech issues endemic to online conversations, too. Of course, the concrete range, framing and reactions to these issues vary from one jurisdiction to another. 142 The point, however, is not that developers should replicate existing legal 'solutions' (which in any case cannot simply be 'automated'), but rather that legal discourses provide rich troves of speech policy analysis which can help broaden developers' evaluative perspectives.
Finally, juxtaposing algorithmic and legal normative systems also highlights how little interest legal systems seem to take in 'accuracy'. Of course, the issue of legal 'accuracy' is a complicated one: few lawyers would maintain that legal cases have one true answer, which amounts to denying the existence of a 'ground truth'. 143 However, a burgeoning body of rules committing legislators to oversee and review the effects of legislative projects shows that the law is no complete stranger to formal 'quality' controls either. 144 Future research could assess whether, and if so to what degree, more formalised quality controls could and/or should play a role for the judicial system, too. 145

Summary
Legal and algorithmic normative systems thus show similarities and dissimilarities when compared as to their broader normative structures. While both types of systems have to deal with practical constraints (although partly different ones), they have rather marked differences in terms of evaluative diversity and standards of evaluation. Concerning modes of evolution, much seems to depend on algorithmic systems' specific design. While online learning systems may approach the law's incremental mode of development, offline, batch learning-type systems, such as Perspective, again seem to show more differences than similarities.

Regulatory relevance: the EU's proposal for an Artificial Intelligence Act
Until recently, few laws attempted to regulate algorithmic systems directly. 146 At present, however, regulatory efforts are ramping up globally. 147 One proposal that stands out, both for its comprehensiveness and its level of detail, is the EU's proposal for an Artificial Intelligence Act (hereinafter 'AI Act' or 'the Act'). 148 As the proposal's (current) recitals state, the Act's twofold objective is to 'foster the development, use and uptake of artificial intelligence in the internal market [while] at the same time meet[ing] a high level of protection of public interests, such as health and safety and the protection of fundamental rights'. 149 Since the EU Commission first proposed the Act in April 2021, the text has been revised and is currently being discussed in the EU Parliament. It will presumably find its final form through the so-called trilogue negotiations within the next one to two years. Whereas some decisions on the Act's scope and substance by now seem more or less final, much may still be subject to change. This raises the question whether the case of Perspective and the above-identified insights into dis/similarities in the development of legal and algorithmic normative systems can help identify ambiguities or shortcomings in (the current form of) the AI Act. Three issues seem noteworthy here: clarifying ambiguities as to the applicable law; rethinking the obligations on providers of 'general purpose AI systems'; and extending the obligations on 'users'.

Clarifying ambiguities as to the applicable law
Content moderation systems like Perspective, in all likelihood, will not themselves fall within the Act's scope of application: they do not operate within any of the eight areas currently foreseen to trigger the Act's obligations for high-risk AI systems, 150 nor do they constitute 'general purpose AI', for which the Act envisages a tailor-made set of requirements to be established through implementing acts adopted by the Commission. 151 This leaves open the question whether AI systems not covered by the Act may still be assessed under other pertinent legislation, such as the EU's GDPR. Some provisions seem to implicitly acknowledge the GDPR's continued application. 152 Article 2(5), on the other hand, states explicitly only that the regulation shall not affect the application of the EU's provisions on the (non-)liability of online intermediaries. 153 This may lead some to conclude that, e contrario, all other regulations potentially covering AI systems would be disapplied. Certainly, this is the interpretation some commercial developers of AI systems will endorse. If the EU does not intend to leave data subjects worse off than before, it should thus clarify that the AI Act does not abrogate the GDPR or any other existing EU data protection legislation. 154
Should GDPR provisions continue to apply, one provision that operators (and developers?) 155 of ML systems will need to pay attention to is Article 22 GDPR, which limits the legality of automated decision-making (ADM) and stipulates that whenever data controllers do use ADM they 'shall implement suitable measures to safeguard the data subject's rights and freedoms and legitimate interests, at least the right to obtain human intervention on the part of the controller, to express his or her point of view and to contest the decision'. Although Article 22 GDPR requires that a decision be 'based solely on automated processing' and that it 'produces legal effects […] or similarly significantly affects [the data subject]', under the EDPB's rather extensive interpretations of these criteria, as well as in light of the existing consumer-friendly case law, 156 it does not seem all too improbable that content moderation systems like Perspective could indeed fall under the provision's scope of application.

150 Annex III; the eight areas include: biometrics; critical infrastructure; education and vocational training; employment, workers management and access to self-employment; access to and enjoyment of essential private services and essential public services and benefits; law enforcement; migration, asylum and border control management; and administration of justice and democratic processes. Of course, content moderation systems may be used in some of these contexts (e.g. within automated job interviews). On the face of it, however, this does not seem to fulfil the AI Act's requirement that an AI system be 'intended to be used', e.g., 'for recruitment or selection of natural persons'. Currently, the Act does not seem to foresee any obligations for developers whose systems are not intended to be used in high-risk contexts but may still be used in such contexts in a foreseeable manner. The Act also provides no definition of what constitutes a relevant 'intention'.
151 Articles 4a-4c. Again, it would seem advisable to clarify the current definition of 'general purpose AI system', which currently includes 'AI system[s] that [are] intended by the provider to perform generally applicable functions such as image and speech recognition, [… and] may be used in a plurality of contexts […]'; Perspective API does perform such functions (i.e. 'speech recognition' and 'pattern detection') and may be used in a plurality of contexts, but does not seem to belong to the set of technologies that the EU, as far as publicly known, had in mind when crafting these rules, see n 157 and corresponding text.
152 Namely Article 10(5) (granting an exception from the prohibition on processing special category data under Article 9(1) GDPR for the purposes of AI bias monitoring, detection and correction), Article 29(6) (stipulating that users of AI systems shall use the information provided by systems' providers (developers) to conduct their data protection impact assessments as required by Article 35 GDPR) and Article 54 (lifting certain prohibitions on further processing under Article 6(4) GDPR for participants in AI regulatory sandboxes); see also recitals 32, 44, 44a-72a.
153 Set out previously in the eCommerce Directive and now replaced by the EU's Digital Services Act.

Rethinking the obligations on providers of 'general purpose AI systems'
One of the biggest changes the proposal for the AI Act has seen since its initial conception is the introduction of a title on 'general purpose AI systems' (GPAIS). Although the Act so far mentions no practical examples of GPAIS, its definition of these systems ('AI system[s] that [are] intended by the provider to perform generally applicable functions such as image and speech recognition, audio and video generation, pattern detection, question answering, translation and others[, and which] may be used in a plurality of contexts and be integrated in a plurality of other AI systems') indicates that the title is aimed at what are now often called 'foundation models'. 157 Foundation models can be characterised as possessing general, multipurpose capabilities (e.g. language 'understanding', language generation, object recognition or text-to-image synthesis) that can be leveraged, adapted and fine-tuned for a wide range of more specific downstream tasks (e.g. a chatbot application or logo-generation software). As suggested by their name, they function as 'foundations' for more specialised, single-purpose systems.
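This pattern can be made concrete with a minimal sketch using the Hugging Face transformers library; the model name, example data and hyperparameters are merely illustrative, and this is not Perspective's actual training code:

# A general-purpose pretrained language model (here BERT, as an example of a 'foundation')
# is given a new classification head and fine-tuned on task-specific labels,
# e.g. toxic vs. non-toxic comments.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # new head for the downstream task

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One (placeholder) fine-tuning step on labelled downstream data
comments = ["you are a wonderful person", "you are an idiot"]
labels = torch.tensor([0, 1])
batch = tokenizer(comments, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()

Whatever representational quirks the pretrained 'foundation' carries, the downstream classifier built on top of it inherits them as its starting point.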
Foundation models can suffer from the same ills and deficits that afflict more specialised, application-oriented ML systems, such as weak performance, compromise of private data or biased or discriminatory outputs. Applications built on top of such foundation models are likely to inherit these problems, thereby multiplying their effects. Especially as foundation models are swiftly turning into the de facto state of the art for an increasing number of contexts and tasks, it is imperative that the AI Act's obligations extend to these systems.
It seems rather imprudent, however, that obligations are currently foreseen to apply to GPAIS only with respect to those risks that may emerge where a GPAIS is intentionally or foreseeably adapted to a high-risk context, 158 but not regarding the risks that may emerge from GPAIS adaptation in non-high-risk systems. To some extent, this is a problem simply because of the sheer influence and leverage of many foundation models. Models like Google's BERT, OpenAI's GPT and DALL·E suites, or Meta's OPT-175B have already been or will soon be implemented in hundreds or thousands of applications and contexts. They are also run by the market's most powerful and well-resourced players. It seems both ill-advised and unnecessary to exempt these providers from assessing and mitigating their systems' potential security risks. 159 A second and even more important reason why safety obligations on GPAIS providers should not be limited to the risks arising from high-risk scenarios is that mitigation strategies at the GPAIS level can affect downstream adopters in diverse and potentially conflicting ways. Whereas specific mitigation efforts may benefit some downstream tasks, they may create new risks for others. For instance, researchers have shown that plausible strategies for debiasing pretrained language models reduce bias on some downstream adaptations while exacerbating it on others. 160 It is also easy to imagine that rendering a language model oblivious to certain protected attributes may benefit certain tasks (such as unbiased biography classification), 161 but not others (such as unbiased hate speech detection). 162 Last but not least, trade-offs between 'accuracy' and other desiderata such as fairness may be judged differently in different contexts. 163 For these reasons, mitigation activities that GPAIS providers perform to attenuate risks in (what the AI Act defines as) high-risk scenarios may end up rendering GPAIS adoption more dangerous in non-high-risk settings. Considering that non-high-risk uses are likely to outnumber high-risk adaptations considerably, this seems unfortunate to say the least. Ultimately, it would thus seem most appropriate to shed the restriction of risk management obligations to high-risk scenarios. 164

Extending the obligations on 'users'

[…] generated logs, and 5) inform the provider or distributor of a system and 6) suspend its use where they have reason to believe that use in accordance with the instructions may result in risks to the health, safety or fundamental rights of concerned persons or where they have identified a so-called 'serious incident'. 166 These are good starting points. All in all, however, the proposal still seems to underestimate the leverage many users hold over preventing or minimising system-generated risks. 167 In the case of Perspective API, for example, users had significant influence on the level of risk likely to emerge from Perspective's operations through the ways in which they implemented the provided classifier (e.g. the degree of automation/human involvement, the target variable and threshold chosen to trigger consequences, the consequences triggered, the information provided to users interacting with the system, etc.). The situation will be similar in many other cases, where AI systems are not delivered as 'plug and play' applications but allow for or require some modification and implementation on the part of the user. Realistically, the implementation level will indeed often be the only stage where certain system-generated risks can be addressed (effectively).
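A simple sketch may illustrate how much of the eventual risk profile is decided at this implementation level; the thresholds, action names and the score-returning function are hypothetical stand-ins for whatever the deploying platform chooses:

# The same toxicity score can be wired into very different moderation policies depending
# on the thresholds chosen and the degree of automation. get_toxicity_score() stands in
# for a call to the deployed classifier (e.g. the Perspective API).

AUTO_REMOVE_THRESHOLD = 0.95   # act without human involvement
REVIEW_THRESHOLD = 0.70        # route to a human moderator

def moderate(comment: str, get_toxicity_score) -> str:
    score = get_toxicity_score(comment)
    if score >= AUTO_REMOVE_THRESHOLD:
        return "remove"                      # fully automated decision
    if score >= REVIEW_THRESHOLD:
        return "queue_for_human_review"      # human-in-the-loop decision
    return "publish"

Raising or lowering these thresholds, or removing the human-review tier altogether, are choices made by the deploying user, not by the system's provider.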
The AI Act should take such insights seriously and broaden users' obligations correspondingly. One way of doing so would be to 1) stipulate a duty on users to assess, prior to adoption, whether a system poses an acceptable level of risk within the setting they intend to use it in, and 2) stipulate a duty on users to implement the system in such a way as to minimise the remaining risks. Not providing for such an extension of user duties would seem to waste valuable regulatory leverage.

166 Article 29; 'serious incidents' are defined as a malfunctioning of an AI system that leads to serious damage to a person's health, a serious disruption of critical infrastructure, a breach of obligations under Union law intended to protect fundamental rights or serious damage to property or the environment, Article 3(44).
167 This failure to appreciate users' control and influence seems to stem from a view of AI as a 'product', delivered to customers as plug-and-play software. While some systems may be adequately captured by such an understanding, many others need to be further specified, modified and implemented by users, all of which involves decisions on the system's ultimate functioning.

Conclusion
As this article has hopefully been able to show, a more factual approach, engaging in in-depth explorations of how algorithmic systems are designed and developed, can unearth significant, yet often overlooked dis/similarities concerning the normative structure of legal and algorithmic systems. While this article highlighted the four themes of practical constraints, evaluative diversity, modes of evolution and standards of evaluation, other heuristics would certainly be no less productive. Importantly, such findings should also interest policymakers on the lookout for effective approaches to the societal challenges of our digital age.
The case of Perspective also points to a number of weaknesses in the current proposal for an EU Artificial Intelligence Act. Next to clarifying ambiguities regarding the potential, yet undesirable, disapplication of existing EU (data protection) law, the Act would also seem to benefit from adjustments concerning the allocation of responsibilities in multi-party AI development settings, e.g. where general purpose AI systems are adapted to a more specific downstream task. In particular, the Act would presumably do well to extend GPAIS providers' safeguarding duties to dangers arising in non-high-risk scenarios and should also broaden the set of obligations falling on AI users.
Looking ahead, we may discover that the AI Act's ontology (its identification of AI providers and users as the two relevant types of actors) lacks complexity and breadth. Already today, one can observe that AI development ecologies include actors, such as dataset developers, ML model providers 168 or ML development platforms, 169 that cannot easily be captured within this ontology. Of course, no law can construct its regulatory precepts on a lossless replication of reality's factual complexity. Inevitably, the AI Act, like any other regulatory proposal, will ignore certain relevant aspects of AI development. This, however, makes it all the more important that academic legal analysis keep abreast of the changing realities of AI systems development, so as to be able to incorporate new developments when applying or critiquing existing laws or advocating for new ones. Not least for this reason, detailed examinations of specific AI development processes, such as the present study of the construction of Perspective API, would seem to remain important in the future, too.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributor
Paul Friedl is a doctoral candidate and research assistant at the DFG Research Training Group DynamInt at Humboldt University Berlin. After completing his legal studies in Berlin and Rome, Paul obtained a Master of Laws in Comparative, European and International Laws at the European University Institute. His primary academic interests lie in data protection and data governance law, law and technology more broadly, as well as AI regulation. Paul has published on European data protection law and European constitutional law.