Ready, set, Go! Data-race detection and the Go language

Data races are often discussed in the context of lock acquisition and release, with race-detection algorithms routinely relying on vector clocks as a means of capturing the relative ordering of events from different threads. In this paper, we present a data-race detector for a language with channel communication as its sole synchronization primitive, and provide a semantics directly tied to the happens-before relation, thus forgoing the notion of vector clocks.


Introduction
One way of dealing with complexity is by partitioning a system into cooperating subcomponents. When these subcomponents compete for resources, coordination becomes a prominent goal. One common programming paradigm is to have threads cooperate around a pool of shared memory. In this case, coordination involves, for example, avoiding conflicting accesses to memory. Two concurrent accesses constitute a data race if they reference the same memory location and at least one of the accesses is a write. Because data races can lead to counterintuitive behavior, it is important to detect them.
The problem of data-race detection in shared memory systems is well studied in the context of lock acquisition and release. When it comes to message passing, the problem of concurrent accesses to channels, in the absence of shared memory, is also well studied; the goal, in these cases, is to achieve determinism rather than race-freedom [6,7,37]. What is less prominent in the race-detection literature is the study of channel communication as the synchronization primitive for shared memory systems. In this paper, we present exactly that: a dynamic data-race detector for a language in the style of Go, featuring channel communication as a means of coordinating accesses to shared memory.
We fix the syntax of our calculus in Section 3 and present a corresponding operational semantics. The configurations of the semantics keep track of memory events (i.e., of read and write accesses to shared variables) such that the semantics can be used to detect races. Proper book-keeping of the events also involves tracking happens-before information. In the absence of a global clock, the happens-before relation is a vehicle for reasoning about the partial order of events from different threads [17]. Unlike other race detectors, which often employ vector clocks (VCs) as a mechanism for capturing the happens-before relation, we dispense with the notion of VCs and tie our formalization directly to the concept of happens-before. Our race detector is built upon a previous result [9], where we formalize a weak memory model inspired by the Go specification [11]. The core of that paper was a proof of the DRF-SC guarantee; that is, we proved that the proposed relaxed memory model behaves Sequentially Consistently (SC) when running Data-Race Free (DRF) programs. The proof hinges on the fact that, in the absence of races, all threads agree on the contents of memory. The scaffolding used in the proof contains the ingredients for the race detector presented in this paper.
Supported by the bilateral project UTF-2018-CAPES-Diku/10001 "Modern Refactoring".
We should point out, however, that the operational semantics presented here and used for race detection is not a weak semantics (footnote 1). Apart from the additional information for race detection, the semantics is "strong" in that it formalizes a memory guaranteeing sequential consistency. To focus on a form of strong memory is not a limitation. Since we have established that a corresponding weak semantics enjoys the crucial DRF-SC property [9], the strong and weak semantics agree up to the first encountered race condition. Given that even a racy program behaves sequentially consistently up to the point at which the first data race is encountered, a complete race detector can safely operate under the assumption of sequential consistency.
The remainder of the paper is organized as follows. Section 2 presents background information on data races and synchronization via message passing that is directly related to the formalization of our approach to race detection. Section 3 formalizes race detection in the context of channel communication as the sole synchronization mechanism. We turn our attention to the issue of efficiency in Section 4. Section 5 puts our work in the perspective of trace theory. Section 6 gives a detailed comparison of our algorithm and existing race detection algorithms for the acquire-release semantics. Section 7 examines related work and Section 8 provides a conclusion and touches on future work.

Background
Read and write conflicts. Memory accesses conflict if they target the same location and at least one of the accesses is a write; there are no read-read conflicts. A data race consists of conflicting accesses that are unsynchronized. Furthermore, a data race manifests itself when an execution step is immediately followed by another and the two steps are conflicting. This definition is the closest one can get to a notion of simultaneity in an operational semantics, where memory interactions are modeled as instantaneous atomic steps. While manifest races are obvious and easy to account for, races in general can involve accesses that are arbitrarily far apart in a linear execution. A "memory-less" detector can fail to report races, for example non-manifest races, that could otherwise be flagged by more sophisticated race detectors. The ability to flag non-manifest data races is correlated with the amount of information kept and the length of time for which this information is kept. In general, recording more information and storing it for longer leads to higher degrees of "completeness" at the expense of higher run-time overheads (footnote 2).
Footnote 1: Note that while the mentioned semantics of [9] differs from the one presented here, both share some commonalities. Both representations are based on appropriately recording information about previous read and write events in their run-time configuration. In both versions, a crucial ingredient of the book-keeping is connecting events in happens-before relation. The purpose of the book-keeping of events, however, is different: in [9], the happens-before relation serves to operationally formalize the weak memory model (corresponding roughly to PSO) in the presence of channel communication. In the current paper, the same relation serves to obtain a race detector. Both versions of the semantics are connected by the DRF-SC result, as mentioned.
We break down the notions of read-write and write-write conflicts into a more fine-grained distinction. Inspired by the notion of data hazards in the computer architecture literature, we break down read-write conflicts into read-after-write (RaW) and write-after-read (WaR) conflicts. To keep consistent with this nomenclature, we refer to write-write conflicts as write-after-write (WaW) (footnote 3). We make the distinction between the detection of after-write races and the detection of write-after-read ones. As we will see in Section 3.3, the detection of after-write races can be done with little overhead. The detection of after-read races, however, cannot.
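The naming scheme for conflicts can be made concrete with a small Go sketch. It classifies a pair of accesses purely by their order of occurrence in a trace; whether the pair is actually a race additionally depends on the happens-before relation, which this sketch deliberately ignores. The names Access, Kind, and Classify are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Kind distinguishes the two sorts of memory access.
type Kind int

const (
	Read Kind = iota
	Write
)

// Access is one memory event in a linearized execution.
type Access struct {
	Kind Kind
	Loc  string // name of the shared variable
}

// Classify names the conflict between an earlier and a later access, using
// the order of occurrence in the trace (not happens-before).
// It returns "" when the two accesses do not conflict.
func Classify(earlier, later Access) string {
	if earlier.Loc != later.Loc {
		return "" // different locations never conflict
	}
	switch {
	case earlier.Kind == Write && later.Kind == Read:
		return "RaW"
	case earlier.Kind == Read && later.Kind == Write:
		return "WaR"
	case earlier.Kind == Write && later.Kind == Write:
		return "WaW"
	}
	return "" // read-read: no conflict
}

func main() {
	w := Access{Kind: Write, Loc: "z"}
	r := Access{Kind: Read, Loc: "z"}
	fmt.Println(Classify(w, r), Classify(r, w), Classify(w, w), Classify(r, r) == "")
}
```

RaW and WaW together form the "after-write" group, since in both cases the earlier access is a write.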
When reading or writing a variable, it must be checked that conflicting accesses happened-before the current access. The check must happen from the perspective of the thread attempting the access. In other words, the question of whether an event occurred in the "definite past" (i.e., whether an event is in happened-before relation with "now") is thread-local; threads can have different views on whether an event belongs to the past. This thread-local nature is less surprising than it may sound: if one thread executes two steps in sequence, the second step can safely assume that the first has taken effect; after all, that is what the programmer must have intended by sequentially composing instructions in the given program order. Such guarantees hold locally, which is to say that the semantics respects program order within a thread. It is possible, however, for steps to not take effect in program order. A compiler or hardware may rearrange instructions, and it often does so in practice. What must remain true is that these reorderings cannot be observable from the perspective of a single thread. When it comes to more than one thread, however, agreement on what constitutes the past cannot be achieved without synchronization. Synchronization and consensus are integrally related (footnote 4). Specifically, given a thread t, events from a different thread t′ are not in the past of t unless synchronization forces them to be.
Footnote 2: It should go without saying that observing one execution as being race free is not enough to assert race-freedom of the program, even if one has observed a complete trace of a terminating run of a program. Completeness can at best be expected with respect to alternative schedules or linearizations of a given execution.
Footnote 3: The mentioned "temporal" ordering and the use of the word "after" refer to the occurrence of events in the trace or execution of the running program. It is incorrect to conflate the concept of happens-before with the ordering of occurrences in a trace. For instance, in a RaW situation, the read step occurs after a write in an execution, i.e., the read is mentioned after the write in the linearization. This order of occurrence does not mean, however, that the read happens-after the write or, conversely, that the write happens-before the read. Actually, for a RaW race (same as for the other kinds of races), the read occurs after the write but the accesses are concurrent, which means that they are unordered as far as the happens-before relation is concerned.
Footnote 4: In the context of channel communication and weak memory, the connection between synchronization and consensus is discussed in a precise manner in our previous work; see the consensus lemmas of [9].
Synchronization via bounded channels. In the calculus presented here, channel communication is the only way in which threads synchronize. Channels can be created dynamically and closed; they are also first-class data, which means channel identifiers can be passed as arguments, stored in variables, and sent over channels. Send and receive operations are central to synchronization. Clearly, a receive statement is synchronizing in that it is potentially blocking: a thread blocks when attempting to receive from an empty channel until, if ever, a value is made available by a sender. Since channels here are bounded, there is also potential for blocking when sending, namely, when attempting to send on a channel that is full. The happens-before memory model stipulates, not surprisingly, a causal relationship between the communicating partners [11]:
    A send on c happens-before the corresponding receive from c completes.    (1)
Given that channels have finite capacity, a thread remains blocked when sending on a full channel until, if ever, another process frees a slot in the channel's buffer. In other words, the sender is blocked until another thread receives from the channel. Correspondingly, there is a happens-before relationship between a receive and a subsequent send on a channel with capacity k [11]:
    The i-th receive from c happens-before the (i + k)-th send on c completes.    (2)
Interestingly, because of this rule, a causal connection is forged between the sender and some previous receiver who is otherwise unrelated to the current send operation. When multiple senders and receivers share a channel, rule (2) implies that it is possible for two threads to become related (via happens-before) without ever directly exchanging a message (footnote 5). The indirect relation between a sender and a prior receiver, postulated by rule (2), allows channels to be used as locks. In fact, free and taken binary locks are analogous to empty and full channels of capacity one.
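To make rules (1) and (2) tangible, the following Go sketch enumerates the happens-before edges they induce among the first n sends and receives on a channel of capacity k. The names Edge and ChannelEdges, and the labels sd_i and rv_i, are illustrative assumptions; each edge is to be read as "From happens-before To completes".

```go
package main

import "fmt"

// Edge records that operation From happens-before the completion of To.
type Edge struct{ From, To string }

// ChannelEdges lists the happens-before edges that rules (1) and (2) induce
// among the first n sends and receives on a channel of capacity k.
// Operations are numbered from 1.
func ChannelEdges(k, n int) []Edge {
	var edges []Edge
	for i := 1; i <= n; i++ {
		// Rule (1): the i-th send happens-before the i-th receive completes.
		edges = append(edges, Edge{fmt.Sprintf("sd_%d", i), fmt.Sprintf("rv_%d", i)})
		// Rule (2): the i-th receive happens-before the (i+k)-th send completes.
		if i+k <= n {
			edges = append(edges, Edge{fmt.Sprintf("rv_%d", i), fmt.Sprintf("sd_%d", i+k)})
		}
	}
	return edges
}

func main() {
	for _, e := range ChannelEdges(1, 2) {
		fmt.Printf("%s happens-before %s completes\n", e.From, e.To)
	}
}
```

For capacity 1, the edge from rv_1 to sd_2 is the one that relates the second sender to the first receiver, even if the two threads never exchange a message directly.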
A process takes and releases locks for the purpose of synchronization (such as assuring mutually exclusive access to shared data) without being aware of "synchronization partners." In the (mis-)use of channels as locks, there is also no inter-process communication. Instead, a process "communicates" with itself: in a proper lock protocol, the process holding a lock (i.e., having performed a send onto a channel) is the only one supposed to release the lock (i.e., to perform the corresponding receive). Thus, a process using a channel as a lock receives its own previously sent message; there is no direct inter-process exchange. Note, however, that synchronization still occurs: subsequent accesses to a critical region are denied by sending onto a channel and making it full. See Section 3.5.2 for a more technical elaboration.
To establish a happens-before relation between sends and receives, note the distinction between a channel operation and its completion in the formulation of rules (1) and (2). The order of events in a concurrent system is partial; not only that, it is strictly partial since we don't think of an event as happening-before itself. A strict partial order is an irreflexive, transitive, and asymmetric relation. In the case of synchronous channels, if we were to ignore the distinction between an event and its completion, according to rule (1), a send would then happen-before its corresponding receive, and, according to rule (2), the receive would happen-before the send. This cycle breaks asymmetry. Asymmetry can be repaired by interpreting a send/receive pair on a synchronous channel as a single operation; indeed, it can be interpreted as a rendezvous.
The distinction between a channel operation and its completion is arguably more impactful when it comes to buffered channels. For one, it prevents sends from being in happens-before with other sends, and receives from being in happens-before with other receives. To illustrate, let sd_i and rv_i represent the i-th send and receive on a channel. If we remove from rules (1) and (2) the distinction between an operation and its completion, the i-th receive would then happen-before the (i + k)-th send (based on rule (2)) and the (i + k)-th send would happen-before the (i + k)-th receive (based on rule (1)):
    rv_i happens-before sd_(i+k) happens-before rv_(i+k)
By transitivity of the happens-before relation, we would then conclude that the i-th receive happens-before the (i + k)-th receive, which would happen-before the (i + 2k)-th receive and so on. As a consequence, a receive operation would have a lingering effect throughout the execution of the program; similarly for send operations. This accumulation of effects is not only expensive to implement from a language design perspective, but it is also counterintuitive for the application programmer, who would be forced to reason about arbitrarily long histories.

Data-race detection
We start in Section 3.1 by presenting the abstract syntax of our calculus and, in Section 3.2, an overview of the operational semantics used for data-race detection. The race detector itself is introduced incrementally. We start in Section 3.3 with a simple detector that has a small footprint but that is limited to detecting after-write races. We build onto this first iteration of the detector in Section 3.4, making it capable of detecting after-write as well as after-read races. The detector's operation is illustrated by examples in Section 3.5. Later, in Section 4, we turn to the issue of efficiency and introduce "garbage collection" as a means to reduce the detector's footprint. These race detectors can be seen as augmented versions of an underlying semantics without additional book-keeping related to race checking. This "undecorated" semantics, including the definition of internal steps and a notion of structural congruence, can be found in Appendix A.

A calculus with shared variables and channel communication
We formalize our ideas in terms of an idealized language shown in Figure 1 and inspired by the Go programming language. The syntax is basically unchanged from [9].
Values v can be of two forms: r denotes local variables or registers; n is used to denote references or names in general and, specifically, p for processes or goroutines, m for memory events, and c for channel names. We do not explicitly list values such as the unit value, booleans, integers, etc. We also omit compound local expressions like e_1 + e_2. Shared variables are denoted by x, z, etc.; load z represents reading the shared variable z into the thread, and z := v denotes writing to z. References are dynamically created and are, therefore, part of the run-time syntax. Run-time syntax is highlighted in the grammar with an underline as in n. A new channel is created by make(chan T, v), where T represents the type of values carried by the channel and v a non-negative integer specifying the channel's capacity. Sending a value over a channel and receiving a value as input from a channel are denoted respectively as v_1 ← v_2 and ← v. After the operation close, no further values can be sent on the specified channel. Attempting to send values on a closed channel leads to a panic.
Starting a new asynchronous activity, called goroutine in Go, is done using the go-keyword. In Go, the go-statement is applied to function calls only. We omit function calls, asynchronous or otherwise, as they are orthogonal to the memory model's formalization. The select-statement, here written using the ∑-symbol, consists of a finite set of branches (or communication clauses in Go-terminology). These branches act as guarded threads. General expressions in Go can serve as guards. Our syntax requires that only communication statements (i.e., channel sending and receiving) and the default-keyword can serve as guards. This does not reduce expressivity and corresponds to an A-normal form representation [31]. At most one branch is guarded by default in each select-statement. The same channel can be mentioned in more than one guard. "Mixed choices" [26,27] are also allowed, meaning that sending- and receiving-guards can both be used in the same select-statement. We use stop as syntactic sugar for the empty select statement; it represents a permanently blocked thread. The stop-thread is also the only way to syntactically "terminate" a thread, meaning that it is the only element of t without syntactic sub-terms.
Fig. 1: Abstract syntax (values: v ::= r | n)
The let-construct let r = e in t combines sequential composition and scoping for local variables r. After evaluating e, the rest t is evaluated where the resulting value of e is handed over using r. The let-construct acts as a binder for variable r in t. When r does not occur free in t, let boils down to sequential composition and, therefore, is more conveniently written with a semicolon. See also Figure 15 in the appendix for syntactic sugar.

Overview of the operational semantics
To capture the notion of ordering of events between threads, an otherwise unadorned operational semantics (equation (7)) is equipped with additional information: each thread and memory location tracks the events it is aware of as having happened-before; see the happens-before set E_hb in the run-time configurations of equations (3) and (4). This set is present in terms corresponding to threads, p⟨E_hb, t⟩, as well as memory locations, (|E_hb, z:=v|) or m(|E^r_hb, z:=v|). Depending on the capabilities of the race detector, slightly different information is tracked as having happened-before (i.e., stored in a happens-before set).

After-write races
When detecting after-write races (i.e., RaW and WaW), in order to know whether a subsequent access to the same variable occurs without proper synchronization, one has to remember additional information concerning past write events. Specifically, it must be checked that all write events to the same variable happened-before the current access. The happens-before set is then used to store information pertaining to write events; read events are not tracked. Also, terms representing a memory location have a different shape when compared to the undecorated semantics. In the undecorated semantics, the content v of a variable z is written as a pair (|z:=v|). When after-write races come into play, it is not enough to store the last value written to each variable; we also need to identify write events associated with the variable. Thus, an entry in memory takes the form (|E_hb, z:=v|) where E_hb holds identifiers m, m′, etc. that uniquely identify write events to z; contrast the run-time configurations in equations (7) and (3). The number of prior write events that need to be tracked can be reduced for the sake of efficiency, in which case the term representing a memory location takes the form m(|E^r_hb, z:=v|) where m is the identifier of the most recent write to z. See equation (4).

Write-after-read races
Besides the detailed coverage of RaW and WaW races in Section 3.3, we describe the detection of write-after-read races in Section 3.4. When it comes to WaR, the race checker needs to remember information about past reads in addition to past write events. Abstractly, a read event represents the fact that a load-statement has executed. Thus, the set E_hb of an entry (|E_hb, z:=v|) in memory holds identifiers of both read and write events. In the strong semantics, a read always observes one definite value which is the result of one particular write event. Therefore, the configuration contains entries of the form m(|E^r_hb, z:=v|) where m is the identifier of the "last" write event and E^r_hb is a set of identifiers of read events, namely those that accumulated after m. Note that "records" of the form m(|E^r_hb, z:=v|) can be seen as n + 1 recorded events, one write event together with n ≥ 0 read events. This definition of records with one write per variable stands in contrast to a weak semantics, where many different write events may be observable by a given read [9].

Synchronization
Channel communication propagates happens-before information between threads, and thus affects synchronization. In the operational rules, each channel c is actually realized with two channels, which we refer to as forward, c_f, and backward, c_b; see Figure 4. The forward part serves to communicate a value transmitted from a sender to a receiver; it also stipulates a causal relationship between the communicating partners [11]; see rule (1) in Section 2. To capture this relationship in the context of race checking, the sender also communicates its current information about the happens-before relation to the receiver. The communication of happens-before information is accomplished by the transmission of E_hb over channels; see rule R-REC in Figure 4.
The memory model also stipulates a happens-before relationship between a receive and a subsequent send on a channel with capacity k; see rule (2) in Section 2. While we refer to the forward channel as carrying a message from a sender to a receiver, the backward part of the channel is used to model the indirect connection between some prior receiver and a current sender; see R-SEND in Figure 4.
The interplay between forward and backward channels can also be understood as a form of flow control. Entries in the backward channel's queue are not values deposited by threads. Instead, they can be seen as tickets that grant senders a free slot in the communication channel, i.e., the forward channel (footnote 6). Thus, the number of "messages" in the backward channel captures the notion of fullness: a channel is full if the backward channel is empty. See rule R-SEND in Figure 4 or Figure 18 for the underlying semantics without race checking. When a channel of capacity k is created, the forward queue is empty and the backward queue is initialized so that it contains dummy elements E^⊥_hb (cf. rule R-MAKE). The dummy elements represent the number of empty or free slots in the channel. Upon creation, the number of dummy elements equals the capacity of the channel.
As discussed in Section 2, there is a distinction between a synchronization operation and its completion. A send/receive pair on a synchronous channel can be seen as a rendezvous operation, captured in our semantics by the R-REND reduction rule of Figure 4. When it comes to asynchronous communication, the distinction between a channel operation and its completion is handled by the fact that send and receive operations update a thread's local state but do not immediately transmit the updated state onto the channel; see rules R-SEND and R-REC in Figure 4.
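The forward/backward construction can be sketched in Go as follows. This is a minimal model read off the prose above, not a transcription of the rules in Figure 4: happens-before sets are plain Go maps, closing of channels and the rendezvous case are omitted, and the names HB, Chan, Make, Send, and Recv are hypothetical. The sketch assumes that each operation deposits the set the thread held before the operation, in line with the remark that the updated state is not immediately transmitted.

```go
package main

import "fmt"

// HB is a happens-before set: identifiers of events known to lie in the past.
type HB map[string]bool

// union returns a fresh set containing the elements of a and b.
func union(a, b HB) HB {
	out := HB{}
	for e := range a {
		out[e] = true
	}
	for e := range b {
		out[e] = true
	}
	return out
}

// msg is an entry of the forward queue: a value plus the sender's set.
type msg struct {
	val int
	hb  HB
}

// Chan models a channel c as a forward queue c_f and a backward queue c_b.
type Chan struct {
	fwd []msg
	bwd []HB
}

// Make creates a channel whose backward queue holds k empty "tickets".
func Make(k int) *Chan {
	c := &Chan{}
	for i := 0; i < k; i++ {
		c.bwd = append(c.bwd, HB{})
	}
	return c
}

// Send takes a ticket from c_b, deposits (v, sender's old set) on c_f, and
// returns the sender's new set. ok is false when the channel is full.
func (c *Chan) Send(hb HB, v int) (newHB HB, ok bool) {
	if len(c.bwd) == 0 {
		return hb, false // full: the sender would block
	}
	ticket := c.bwd[0]
	c.bwd = c.bwd[1:]
	c.fwd = append(c.fwd, msg{v, hb})
	return union(hb, ticket), true
}

// Recv takes a message from c_f, puts a ticket carrying the receiver's old
// set on c_b, and returns the value together with the receiver's new set.
func (c *Chan) Recv(hb HB) (v int, newHB HB, ok bool) {
	if len(c.fwd) == 0 {
		return 0, hb, false // empty: the receiver would block
	}
	m := c.fwd[0]
	c.fwd = c.fwd[1:]
	c.bwd = append(c.bwd, hb)
	return m.val, union(hb, m.hb), true
}

func main() {
	c := Make(1)
	p1, _ := c.Send(HB{}, 0) // p1 occupies the only slot
	p1["m1"] = true          // p1 performs a write labeled m1
	_, ok := c.Send(HB{}, 0)
	fmt.Println("send on full channel succeeds:", ok)
	_, p1, _ = c.Recv(p1)    // p1 frees the slot, leaving a ticket
	p2, _ := c.Send(HB{}, 0) // p2 picks up the ticket left by p1
	fmt.Println("p2 knows m1:", p2["m1"])
}
```

The usage in main mirrors the flow-control reading: the second sender learns of the first thread's write through the ticket on the backward queue, without receiving any message from it.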

Detecting read-after-write (RaW) and write-after-write (WaW) races
To detect "after-write" races, run-time configurations are given the syntax of equation (3). Configurations are considered up to structural congruence, with the empty configuration • as neutral element and parallel composition ∥ as associative and commutative. The definition is standard and included in Appendix A.1. Likewise relegated to the appendix are local reduction rules, i.e., those not referring to shared variables or channels (see Appendix A.2).
In the configurations, a triple (|E^z_hb, z:=v|) not only stores the current value of z but also records the unique identifiers m, m′, etc. of every write event to z in E^z_hb (footnote 7). A write to memory updates a variable's value and also generates a fresh identifier m. In order to record the write event, the tuple (m, !z) is placed in the happens-before set of the term representing the memory location that has been written to. The initial configuration starts with one write event per variable and the semantics maintains this uniqueness as an invariant. In effect, the collection of recorded write events behaves as a mapping from variables to values (footnote 8).
A thread t is represented as p⟨E_hb, t⟩ at run-time, with p serving as identifier. To be able to determine whether a next action should be flagged as a race or not, a goroutine keeps track of happens-before information corresponding to past write events. An event mentioned in E_hb is an event of the past, as opposed to being an event that simply occurred in a prior step. An event is "concurrent" if it occurred in a prior step but is not in happens-before relation with the current thread state. Concurrent memory events are potentially in conflict with a thread's next step. More precisely, if the memory record (|E^z_hb, z:=v|) is part of the configuration, then it is safe for thread p⟨E_hb, t⟩ to write to z if E^z_hb ⊆ E_hb. Otherwise, there exists a write to z that is not accounted for by thread p and a WaW conflict is raised. The check is similar when reading from a variable.
Data races are marked as a transition to an exception E; see the derivation rules of Figure 3 and, when write-after-read races are considered, Figure 7. The exception takes as argument a set containing the prior memory operations that conflict and are concurrent with the attempted memory access.
Goroutines synchronize via message passing, which means that channel communication must transfer happens-before information between goroutines. Suppose a goroutine p has just updated variable z, thus generating the unique label m. The tuple (m, !z) is placed in the happens-before set of both the thread p and the memory record associated with z. At this point, p is the only goroutine whose happens-before set contains the label m associated with this write-record. No other goroutine can read or write to z without causing a data race. When p sends a message onto a channel, the information about m is also sent. Suppose now that a thread p′ reads from the channel and receives the corresponding message before p makes any further modifications to z. The tuple (m, !z) is added to the happens-before set of p′, so both p and p′ are aware of the most recent write to z. The existence of m in both goroutines' happens-before sets implies that either p or p′ is allowed to update z's value. The rules for channel communication are given in Figure 4. They will remain unchanged when we extend the treatment to WaR conflicts. The exchange of happens-before information via channel communication is also analogous to the treatment of the weak semantics in [9]. Finally, goroutine creation is a synchronizing operation, where the child inherits the happens-before set from the parent; see Figure 5.
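The subset check behind after-write detection can be illustrated with a short Go sketch. It is an approximation under stated assumptions rather than the semantics of Figure 3: event identifiers are integers, a record simply accumulates every write identifier, and Learn merges a thread's current set in place of an actual channel transfer. All names (HB, Thread, Record, NewVar, Learn, Write, Read) are hypothetical.

```go
package main

import "fmt"

// HB is a happens-before set holding identifiers of write events.
type HB map[int]bool

// Thread stands for the run-time term p⟨E_hb, t⟩, without the code t.
type Thread struct {
	Name string
	HB   HB
}

// Record stands for the memory term (|E_hb^z, z:=v|).
type Record struct {
	Writes HB
	Val    int
}

var fresh int // hypothetical source of fresh event identifiers

// NewVar creates a variable whose initial write is known to thread t.
func NewVar(t *Thread, v int) *Record {
	fresh++
	t.HB[fresh] = true
	return &Record{Writes: HB{fresh: true}, Val: v}
}

// Learn merges the set of from into to; it stands in for to receiving a
// message from from, or for to being spawned by from.
func Learn(to, from *Thread) {
	for m := range from.HB {
		to.HB[m] = true
	}
}

// unseen returns the recorded writes that thread t does not know about.
func unseen(t *Thread, r *Record) []int {
	var out []int
	for m := range r.Writes {
		if !t.HB[m] {
			out = append(out, m)
		}
	}
	return out
}

// Write performs z := v, or returns the conflicting writes (a WaW race).
func Write(t *Thread, r *Record, v int) []int {
	if c := unseen(t, r); len(c) > 0 {
		return c
	}
	fresh++
	r.Writes[fresh], t.HB[fresh] = true, true
	r.Val = v
	return nil
}

// Read performs load z, or returns the conflicting writes (a RaW race).
func Read(t *Thread, r *Record) (int, []int) {
	if c := unseen(t, r); len(c) > 0 {
		return 0, c
	}
	return r.Val, nil
}

func main() {
	p := &Thread{Name: "p", HB: HB{}}
	z := NewVar(p, 0)
	q := &Thread{Name: "q", HB: HB{}}
	Learn(q, p) // q is spawned by p and inherits p's set
	Write(p, z, 42)
	_, race := Read(q, z)
	fmt.Println("unsynchronized read races:", len(race) > 0)
	Learn(q, p) // q receives a message that p sent after its write
	v, race := Read(q, z)
	fmt.Println("synchronized read:", v, len(race) > 0)
}
```

Before the second call to Learn, the record for z contains a write identifier that q has never seen, so both a read and a write by q are flagged; afterwards the subset check succeeds.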

Detecting write-after-read (WaR) races
In the previous section, the detection of read-after-write and write-after-write races required happens-before sets to contain write labels only. The detection of write-after-read races requires recording read labels as well. A successful read of variable z causes a fresh read label, say m′, to be generated. The pair (m′, ?z) is added to the reader's happens-before set as well as to the record associated with z in memory; see rule R-READ of Figure 6.
Fig. 6: Operational semantics augmented for data-race detection
In order for a write to memory to be successful, the writing thread must not only be aware of previous write events to a given shared variable, but must also account for all accumulated reads of the variable. A write-after-read data race is raised when a write is attempted by a thread and the thread is unaware of some previous reads of z. In other words, there exists some read label in the happens-before set associated with the variable's record, say r ∈ E^z_hb↓?, that is not in the thread's happens-before set, r ∉ E_hb. The projection ↓? essentially filters out write events from the happens-before set. Under these circumstances, the precondition E^z_hb↓? ⊈ E_hb of the R-WRITE-E_WaR rule is met and a race is reported. Compared to the detector of Section 3.3, the reporting of WaW races in rule R-WRITE-E_WaW is augmented with the precondition E^z_hb↓? ⊆ E_hb. Without this precondition, there would be non-determinism when reporting WaW and WaR conflicts (footnote 9). Note, however, that when both WaW and WaR apply, the read in the WaR race happens-after the write involved in the WaW race. We choose to resolve this non-determinism by reporting the most recent conflict.
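The preconditions on writes can be sketched in Go as a pair of checks over sets of tagged events. The sketch assumes a record that holds both read and write events of one variable, and it only decides the outcome of an access; updating the record is left out. The names Event, HB, CheckWrite, and CheckRead are hypothetical.

```go
package main

import "fmt"

// Event is a recorded memory event, (m, !z) or (m, ?z); the variable is
// implicit because each record belongs to exactly one variable.
type Event struct {
	ID     int
	IsRead bool
}

// HB is a happens-before set of events.
type HB map[Event]bool

// missing returns the events of rec, restricted to reads or to writes,
// that do not occur in the thread's set hb.
func missing(rec, hb HB, reads bool) []Event {
	var out []Event
	for e := range rec {
		if e.IsRead == reads && !hb[e] {
			out = append(out, e)
		}
	}
	return out
}

// CheckWrite decides the outcome of a write attempted by a thread with set
// hb on a variable whose record holds the events rec.
func CheckWrite(rec, hb HB) (kind string, conflicts []Event) {
	if r := missing(rec, hb, true); len(r) > 0 {
		return "WaR", r // some read is unaccounted for
	}
	if w := missing(rec, hb, false); len(w) > 0 {
		return "WaW", w // all reads are known, but some write is not
	}
	return "ok", nil
}

// CheckRead decides the outcome of a load; reads only conflict with writes.
func CheckRead(rec, hb HB) (kind string, conflicts []Event) {
	if w := missing(rec, hb, false); len(w) > 0 {
		return "RaW", w
	}
	return "ok", nil
}

func main() {
	w1 := Event{ID: 1, IsRead: false}
	r2 := Event{ID: 2, IsRead: true}
	rec := HB{w1: true, r2: true} // z was written (1) and then read (2)

	kind, _ := CheckWrite(rec, HB{}) // both WaW and WaR apply
	fmt.Println(kind)
	kind, _ = CheckWrite(rec, HB{w1: true}) // only the read is unknown
	fmt.Println(kind)
	kind, _ = CheckWrite(rec, HB{w1: true, r2: true})
	fmt.Println(kind)
}
```

Testing for unknown reads first corresponds to the extra precondition on the WaW rule: a WaW race is reported only when every recorded read is already in the writer's past.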
The detector presented here can flag all conflicts: read-after-write, write-after-write, and write-after-read. In Section 4 we also make the detector efficient by "garbage collecting" stale information. But before then, let us look at a couple of examples that illustrate the detector's operation.

Examples
We will look at two examples of properly synchronized programs. The first is a typical usage of channel communication; one in which an action is placed in the past of another. The second example relies on mutual exclusion instead. In this case, we know that actions are not concurrent, but we cannot infer an order between them. By contrasting the two examples in Section 3.5.3, we derive observations related to determinism and constructivism.

Message passing
Message passing, depicted in Figure 8, involves a producer writing to a shared variable and notifying another thread by sending a message onto a channel. A consumer receives from the channel and reads from the shared variable. The access to the shared variable is properly synchronized. Given the operational semantics presented in this section, we can arrive at this conclusion as follows. A fresh label, say m, is generated when p_1 writes to z. The memory record involving z is updated with this fresh label, and the pair (m, !z) is placed into p_1's happens-before set, thus yielding E_hb1. A send onto c sends not only the message value, 0 in this case, but also the happens-before set of the sender, E_hb1; see rule R-SEND. The act of receiving from c blocks until a message is available. When a message becomes available, the receiving thread receives not only a value but also the happens-before set of the sender at the time that the send took place; see rule R-REC. Thus, upon receiving from c, p_2's happens-before set is updated to contain (m, !z). Receiving from the channel places the write to z by p_1 into p_2's definite past. The race checker makes sure of this fact by inspecting p_2's happens-before set when p_2 attempts to load from z. In other words, the race checker checks that the current labels associated with z in the configuration are also present in the happens-before set of the thread performing the load.
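In Go itself, the message-passing pattern might look as follows. This is a sketch in the spirit of the example, not a reproduction of Figure 8: the written value 42, the channel capacity of 1, and the auxiliary done channel used to return the result are assumptions made for illustration.

```go
package main

import "fmt"

// MessagePassing runs the producer/consumer pattern once and returns the
// value the consumer loads from the shared variable z.
func MessagePassing() int {
	var z int              // shared variable
	c := make(chan int, 1) // channel used for synchronization
	done := make(chan int) // hands the consumer's result back to the caller

	go func() { // producer p1
		z = 42 // write to z
		c <- 0 // the send carries the write into the receiver's past
	}()

	go func() { // consumer p2
		<-c       // completes only after the producer's send
		done <- z // load z: ordered after the write, hence no race
	}()

	return <-done
}

func main() {
	fmt.Println(MessagePassing())
}
```

Because the receive orders the load after the write, the function always returns 42, and running the program with Go's race detector enabled (go run -race) should report nothing.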
The message passing example illustrates synchronization as the imposition of an order between events belonging to different threads. The message places the producer's write in the past of the consumer's read. Next, we will look into an example in which synchronization is achieved via mutual exclusion. Two threads, p 1 and p 2 , are competing to write to the same variable. We will not be able to determine which write happens-before the other. Even though we cannot infer the order, we can determine that a happens-before order exists and, therefore, that the program is properly synchronized. Figure 9 shows a typical mutual exclusion scenario. It involves two threads writing to a shared variable z. Before writing, a thread sends a message onto a channel c with capacity | c | = 1. After writing, it receives from c. Note that the channel is being used as a semaphore [8]. Sending on the channel is analogous to a semaphore wait or P operation. Receive is analogous to signal or V. The wait decrements the value of the semaphore and, if the new value is negative, the process executing the wait is blocked. A signal increments the value of the semaphore variable, thus allowing another process (potentially coming from the pool of previously blocked processes) to resume. Similarly, a send operation decrements the number of available slots in the channel's queue, while a receive increments it. Sending on a channel with capacity 1 can only take place if the channel is empty, meaning that all previous sends are matched with a corresponding receive.

Mutual exclusion
A send and its corresponding receive do not directly contribute to synchronization in this example. The send is matched by a receive from the same thread; nothing new is learned from this exchange. To illustrate this point, which may come as a surprise, let us look at an execution. Say p 1 is the first to send 0 onto c. Then p 1 's happens-before set E hb 1 is placed onto the channel along with the value of 0. The thread then proceeds to write to z, which generates a fresh label, say m′; the pair (m′, !z) is placed in p 1 's happens-before set. When receiving from c, p 1 does not learn anything new! It receives the message 0 and a "stale" happens-before set E hb 1 . The receive causes the receiver's happens-before set to be updated, but the "update" is completely moot: the stale set is already contained in the receiver's current happens-before set, which therefore remains unchanged.
The explanation for why the program is synchronized, in this case, is more subtle. It involves reasoning about the channel's capacity. Recall that, according to rule (2) on page 4, the i th receive from a channel with capacity k happens before the (i + k) th send onto the channel completes. Since channel capacity is 1 in our example, rule (2) implies that the first receive from the channel happens-before the second send completes. If p 1 is the first to write to z, then p 1 is also the first to receive from c. Receiving from c places p 1 's happens-before set onto the backward channel (see rule R-REC). This happens-before set contains the entry (m′, !z) registering p 1 's write to z. Upon sending onto c, p 2 receives from the backward channel and learns of p 1 's previous write. Thus, by the time p 2 writes to z, the write by p 1 has been placed in p 2 's definite past. Since no concurrent accesses exist, the race checker does not flag this execution as racy.
Similarly, p 2 could first send onto c and write to z. The argument for the proper synchronization of this alternate run would proceed in the same way. Therefore, even though it is not possible to infer who, among p 1 and p 2 , writes to z first, we know that one of the writes is in a happens-before relation with the other. This knowledge is enough for us to conclude that the program is properly synchronized.
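The mutual exclusion scenario can likewise be sketched in Go. The sketch below is an illustration, not the program of Figure 9: the function name `mutualExclusion`, the values written, and the `sync.WaitGroup` (used only so the caller can wait for both threads) are assumptions added here.

```go
package main

import (
	"fmt"
	"sync"
)

// mutualExclusion runs two goroutines that each write to z while
// "holding" a channel of capacity 1, used as a binary semaphore.
func mutualExclusion() int {
	var z int
	c := make(chan int, 1) // | c | = 1
	var wg sync.WaitGroup  // bookkeeping for this sketch only

	for id := 1; id <= 2; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			c <- 0 // wait / P: blocks while the single slot is taken
			z = id // write to the shared variable
			<-c    // signal / V: frees the slot for the other thread
		}(id)
	}

	wg.Wait()
	return z // 1 or 2, depending on which thread wrote last
}

func main() {
	fmt.Println(mutualExclusion())
}
```

The final value of `z` depends on scheduling, which matches the observation that the order of the two writes cannot be inferred even though one of them necessarily happens-before the other.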

Determinism, confluence, and synchronization
In the message passing example of Section 3.5.1, we are able to give a constructive proof-sketch of the synchronization between p 1 and p 2 ; the "proof" puts an event from p 1 in the past of p 2 . In the mutual exclusion example of Section 3.5.2, no such guarantee is possible. Instead, we give a non-constructive "proof" that p 1 and p 2 are synchronized by arguing that either p 1 's actions are in the past of p 2 's or vice versa. The law of excluded middle is used in this non-constructive argument.
The absence of constructivism is tied to the absence of determinism. While in the message passing example the program is deterministic, in the mutual exclusion example it is not. There is no data race in the mutual exclusion example, but there is still a "race" insofar as the two threads compete for access to a shared resource. The resource, in this case, is the channel, which is being used as a lock. The two threads race towards acquiring the lock (i.e. sending onto the channel) first. The initial configuration has two transitions, one in which p 1 acquires the lock first and one in which p 2 does. These transitions are non-confluent.
When it comes to reasoning about programs that model hardware, the lack of constructivism and the non-confluence in the use of channels as locks is a hindrance. Deterministic languages and constructive logics are needed in order to rule out scenarios in which two logic gates attempt to drive the same via with different logic values (i.e. a short circuit) [2]. In the case of channel communication and in the absence of shared memory, determinism can be achieved by enforcing ownership on channels; for example, by making sure a single thread can read and a single thread can write on a given channel at any given point in the execution [36]. It is possible for the ownership on channels to be passed around the threads in a way that preserves determinism [37].
The examples show that the absence of data races is not enough to ensure determinism. In general, however, determinism is not a requirement. Many applications require "only" data-race freedom.

Efficient data-race detection
We have been gradually introducing a data-race checker. In Section 3.3, we presented a simple checker that flags after-write races (WaW and RaW) but is not equipped for write-after-read (WaR) detection. In Section 3.4, we augmented the detector to handle WaR. Here, we discuss how these detectors can be implemented efficiently, where efficiency is gained by employing "garbage collection" to reduce the detector's memory footprint. Note that keeping one record per variable is already a form of efficiency gain. In a relaxed memory model, since there may be more than one value associated with a variable at any point in the execution, one might keep one record per memory event [9]. The first step towards a smaller footprint is to realize that, if the underlying memory model supports the DRF-SC guarantee, a data-race detector can be built assuming sequential consistency. The reason is that, when a data race is flagged, execution stops at the point at which the weak and strong memory models' executions would diverge.
Knowing that memory events can overtake each other, in this section we discuss how stale or redundant information can be garbage collected. More precisely, we show how to garbage collect the data structures that hold happens-before information, that is, the thread-local happens-before set and the per-memory-location one.

Most recent write
Terms representing a memory location have taken different shapes when compared to the undecorated semantics. In the undecorated semantics, the content v of a variable z is written as a pair (|z:=v| ). For after-write race detection, an entry in memory took the form of (|E hb , z:=v| ) with E hb holding information about prior write events. Our first optimization comes from realizing that we do not need to keep a set of prior write events. We can record only the most recent write and still be able to flag all after-write racy executions. With this optimization, we may fail to report all accesses involved in the race, but we will still be able to report the execution as racy and to flag the most recent conflicting write event. This optimization is significant; it reduces the arbitrarily large set of prior write events to a single point.
An intuitive argument for the correctness of the optimization comes from noticing that a successful write to a variable can be interpreted as the writing thread taking ownership of the variable. Suppose a goroutine p has just updated variable z. At this point, p is the only goroutine whose happens-before set contains the label, say m, associated with this write-record. The placement of the new label into p's happens-before set can be seen as recording p's ownership of the variable: a data race is flagged if any other thread attempts to read or write to z without first synchronizing with p; see the check (m, !z) ∈ E hb in the premise of the R-WRITE and R-READ rules of Figure 10.
When p sends a message onto a channel, the information about m is also sent. Suppose now that a thread p′ reads from the channel and receives the corresponding message before p makes any further modifications to z. The tuple (m, !z) containing the write-record's label is added to p′'s happens-before set. Now both p and p′ are aware of the most recent write to z. The existence of m in both goroutines' happens-before sets implies that either p or p′ is allowed to update z's value. We can think of the two goroutines as sharing z. Among p and p′, whoever updates z first (re)gains the exclusive rights to z.
It may be worth making a parallel with hardware and cache coherence protocols. Given the derivation rules, we can write a race detector as a state machine. Compared to the Modified-Exclusive-Shared-Invalid protocol (MESI), our semantics does not have the modified state: all changes to a variable are immediately reflected in the configuration, and there is no memory hierarchy in the memory model. As hinted above, the other states can be interpreted as follows: If the label of the most recent write to a variable is only recorded in one goroutine's happens-before set, then we can think of the goroutine as having exclusive rights to the variable. When a number of goroutines contain the pair (m, !z) in their happens-before sets with m being the label of the most recent write, then these goroutines can be thought to be sharing the variable. Other goroutines that are unaware of the most recent write can be said to hold invalid data.
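The exclusive/shared/invalid reading can be made concrete with a small Go sketch. The types `Label`, `HB`, and `State` and the function `classify` are hypothetical names introduced for illustration; happens-before sets are simplified to the write labels a goroutine knows for a single variable z.

```go
package main

import "fmt"

// Label identifies a write event. HB is a goroutine's happens-before set,
// simplified to the write labels it knows for one variable z.
type Label int
type HB map[Label]bool

type State string

const (
	Exclusive State = "exclusive"
	Shared    State = "shared"
	Invalid   State = "invalid"
)

// classify assigns each goroutine a MESI-like state with respect to z,
// given the label of the most recent write to z.
func classify(mostRecent Label, threads map[string]HB) map[string]State {
	aware := 0
	for _, hb := range threads {
		if hb[mostRecent] {
			aware++
		}
	}
	states := make(map[string]State, len(threads))
	for name, hb := range threads {
		switch {
		case !hb[mostRecent]:
			states[name] = Invalid // unaware of the most recent write
		case aware == 1:
			states[name] = Exclusive // sole holder of the label
		default:
			states[name] = Shared // several goroutines hold the label
		}
	}
	return states
}

func main() {
	const m Label = 1
	threads := map[string]HB{"p": {m: true}, "q": {}}
	fmt.Println(classify(m, threads)) // p is exclusive, q is invalid
	threads["q"][m] = true            // q receives p's happens-before set
	fmt.Println(classify(m, threads)) // p and q now share z
}
```

There is no counterpart to the modified state in this sketch, in line with the remark that all changes to a variable are immediately reflected in the configuration.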

Runtime configuration and memory related reduction rules
Given the "most recent write" optimization above, and if we were satisfied with detecting after-write conflicts only, an entry in memory would take the form m(|z:=v| ), with the label m uniquely identifying the event associated with v having been stored into z. Being able to flag after-write but not write-after-read races may be an adequate trade-off between completeness and efficiency. By not having to record read events, a simplified detector tailored for after-write race detection has a much smaller footprint than when write-after-read conflicts are also taken into account. Besides, a write-after-read race that is not flagged in an execution may realize itself as a read-after-write race in another run, and then be flagged by the simplified detector. 10 In contrast, the detection of write-after-read races requires more book-keeping: we need read-labels in addition to write-labels. This addition is required because a WaR conflict can ensue between an attempted write and any previous unsynchronized read to the same variable. Therefore, the race-checker is made to remember all such potentially troublesome reads. 11 The runtime configuration is thus modified, this time so as to contain entries of the form m(|E r hb , z:=v| ). The label m identifies the most recent write event to z and the set E r hb holds read-event identifiers, namely, the identifiers of reads that accumulated after m.
Note that records of the form m(|E r hb , z:=v| ) can be seen as n + 1 recorded events: one write together with n ≥ 0 read events.
The formal semantics maintains the following invariants. First, the happens-before information E r hb in m(|E r hb , z:=v| ) contains information of the form (m′, ?z) only, i.e., there are no write events and all read-events concern variable z. Also, the event labels are unique for both reads and writes. In an abuse of notation, we may refer to m being in E r hb and write m ∈ E r hb meaning, more precisely, (m, ?z) ∈ E r hb .
10 ... another via the permutation of independent events. It can be shown that if S 0 −h→ S n is a run containing a write-after-read race, then there exists an equivalent run in which the race materializes as a read-after-write race.
11 Since, depending on scheduling, a WaR data race can manifest itself as a RaW race, one option would be to not add instrumentation for WaR race detection and, instead, hope to flag the RaW manifestation. Such practical considerations illustrate the trade-off between completeness and run-time overhead.

Garbage collection of happens-before sets
Knowledge of past events contained in a happens-before set E hb is naturally monotonically increasing. For example, each time a goroutine learns about happens-before information, it adds to its pool of knowledge. In particular, events that are known to have "happened-before" cannot, by learning new information, become "concurrent." An efficient semantics, however, does not accumulate happens-before information indiscriminately; instead, it purges redundant information. We say "redundant" from the point of view of flagging racy executions: every racy execution is still flagged, but the report may leave out conflicting accesses that have been overtaken by more recent memory events.

Garbage collection on writes
For a thread t to successfully write to z, all previously occurring accesses to z must be in happens-before with the thread's current state. One optimization comes from realizing that we can purge all information about prior accesses to the variable z from the happens-before set of the writing thread t. We call these prior accesses redundant from the point of view of flagging racy executions. The reason for the correctness of this optimization is as follows: All future accesses of t to z are synchronized with the redundant accesses; after all, the accesses are recorded in t's happens-before set. Therefore, from the perspective of t, these accesses do not affect data-race detection. For the same reason, if a thread t′ synchronizes with t, there is no race to report if and when t′ accesses memory. The absence of these redundant accesses from t′'s happens-before set is, therefore, inconsequential. Finally, if t′ does not synchronize with t, then an access to z by t′ is racy because it is unsynchronized with t's most recent write, regardless of the redundant prior accesses. Note that this optimization allows us to flag all racy executions even if we fail to report some of the accesses involved in the race. Rule R-WRITE of Figure 10 embodies this discussion. Before writing, the rule checks that the attempted write happens-after all previously occurring accesses to z. This check is done by two premises: premise (m, !z) ∈ E hb makes sure that the most recent write to z, namely, the one that produced event (m, !z), is in happens-before with the current thread state E hb . As per discussion in Section 4.1, being synchronized with the most recent write means the thread is synchronized with all writes up to that point in the execution. The other premise, E r hb ⊆ E hb , makes sure that the attempted write is in happens-after with the read accesses to z. If these two premises are satisfied, the write can proceed and prior accesses to z are garbage collected from the point of view of t.
The filtering of redundant accesses is done by subtracting E hb ↓ z from E hb , where ↓ z projects the happens-before set down to operations on variable z. Finally, the write rule also garbage collects the in-memory record E r hb by setting it to ∅, 12 meaning that no read events have accumulated after the write yet.
12 As per discussion in Section 4.1, a term representing a memory location m(|E r hb , z:=v| ) records in E r hb all the reads to z that have accumulated after the write that generated the write label m. When a new write m′ of value z := v′ ensues, we update the memory term to record this new write and we reset its corresponding E r hb to ∅.
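The write rule with garbage collection can be sketched in Go as follows. This is a minimal sketch under assumptions, not the rule of Figure 10 verbatim: the types `Event`, `Set`, and `Record`, the function `write`, and the convention that the caller supplies the fresh label are all introduced here for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a labelled memory event: (m, !z) if Write is true, (m, ?z) otherwise.
type Event struct {
	Label int
	Var   string
	Write bool
}

// Set is a happens-before set.
type Set map[Event]bool

// Record models a memory entry m(|E^r_hb, z:=v|).
type Record struct {
	W     Event // most recent write (m, !z)
	Reads Set   // reads accumulated after W
	Val   int
}

var ErrRace = errors.New("data race")

// write checks both premises of the write rule, then returns the writer's
// garbage-collected happens-before set. fresh is the new label m'.
func write(hb Set, rec *Record, v, fresh int) (Set, error) {
	if !hb[rec.W] {
		return hb, ErrRace // most recent write is not in the writer's past
	}
	for r := range rec.Reads {
		if !hb[r] {
			return hb, ErrRace // some accumulated read is not in the writer's past
		}
	}
	next := Set{}
	for e := range hb {
		if e.Var != rec.W.Var { // subtract the projection of hb onto z
			next[e] = true
		}
	}
	w := Event{Label: fresh, Var: rec.W.Var, Write: true}
	next[w] = true
	rec.W, rec.Reads, rec.Val = w, Set{}, v // reset the in-memory read set
	return next, nil
}

func main() {
	w0 := Event{Label: 0, Var: "z", Write: true}
	rec := &Record{W: w0, Reads: Set{}}
	t := Set{w0: true} // thread t knows the initial write to z
	u := Set{}         // thread u has not synchronized with anyone

	t, err := write(t, rec, 1, 1)
	fmt.Println(len(t), err)
	_, err = write(u, rec, 2, 2)
	fmt.Println(err)
}
```

In this usage, the write by `t` succeeds and leaves a single event about z in its set, while the write by `u` is flagged because `u` never learned of the most recent write.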

Garbage collection on reads
We also garbage collect on load operations. Say t reads from z, thus generating event (m′, ?z). Let us call redundant the memory accesses to z in t's happens-before set at the time event (m′, ?z) takes place, with the exception of (m, !z). A read operation can only conflict with a future write; there are no read-read conflicts. For a future write to take place, the writing thread will need to synchronize with a thread that "knows" about the read m′. 13 Any thread that knows of m′ would also know about the redundant accesses to z and know of (m, !z). In other words, m and m′ subsume all happened-before accesses of z from the perspective of t. Therefore, we can garbage collect all such accesses by filtering them out of the thread's happens-before set. These redundant accesses are also filtered out of the in-memory happens-before set.
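A matching Go sketch of the read rule with garbage collection is given below. As with any such sketch, the names (`Event`, `Set`, `Record`, `read`) and the caller-supplied fresh label are assumptions made for illustration, and the exact shape of the rule in Figure 10 may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a labelled memory event: (m, !z) if Write is true, (m, ?z) otherwise.
type Event struct {
	Label int
	Var   string
	Write bool
}

type Set map[Event]bool

// Record models a memory entry m(|E^r_hb, z:=v|).
type Record struct {
	W     Event
	Reads Set
	Val   int
}

var ErrRace = errors.New("data race")

// read checks that the most recent write is in the reader's past, records
// the fresh read event (label fresh), and garbage collects both the
// thread-local and the in-memory happens-before sets.
func read(hb Set, rec *Record, fresh int) (Set, int, error) {
	if !hb[rec.W] {
		return hb, 0, ErrRace
	}
	r := Event{Label: fresh, Var: rec.W.Var, Write: false}
	next := Set{rec.W: true, r: true} // keep (m, !z), add (m', ?z)
	for e := range hb {
		if e.Var != rec.W.Var { // drop the other accesses to z
			next[e] = true
		}
	}
	reads := Set{r: true}
	for e := range rec.Reads {
		if !hb[e] { // keep only reads that are not in the reader's past
			reads[e] = true
		}
	}
	rec.Reads = reads
	return next, rec.Val, nil
}

func main() {
	w0 := Event{Label: 0, Var: "z", Write: true}
	rec := &Record{W: w0, Reads: Set{}, Val: 42}
	t, u := Set{w0: true}, Set{w0: true}

	t, _, _ = read(t, rec, 1) // t reads: in-memory reads = {1}
	u, _, _ = read(u, rec, 2) // u reads concurrently: {1, 2}
	t, v, err := read(t, rec, 3)
	fmt.Println(v, err, len(t), len(u), len(rec.Reads))
}
```

After the third read, the earlier read by `t` is subsumed by its newer one, so the in-memory set holds only the read by `u` and the latest read by `t`.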

Off-line garbage collection and channel communication
A thread that mostly communicates over channels but does few memory accesses can, by virtue of its communication, accumulate events in its happens-before set. These events may become redundant from the point of view of flagging racy executions, and therefore, can be garbage collected. The garbage collector rule R-GC of Figure 11 can be run nondeterministically during the execution of a program. The rule could also be run before a thread sends onto or receives from a channel. Sending and receiving on a channel c causes the thread to deposit its happens-before set onto c's forward and backward channels respectively. By running the GC rule before the send and receive operations, we ensure that the happens-before sets deposited onto the forward and backward channels do not contain redundant information.
Fig. 11: Off-line garbage collection
13 "Knowing about the read m′" is a necessary condition for a thread to successfully write to z, but it is not a sufficient one. There may exist other reads, say m′′, m′′′, etc., that are concurrent with m′. A thread needs to synchronize with all such concurrent reads before it can successfully write to z.

Connections with trace theory
It is tempting to think of happens-before in terms of observations, where a and b are in happens-before if and only if we observe a followed by b, and never the other way around. This intuition is captured by the following tentative definition: Let idx (a, h) be the index of event a in a run h. Given the set of runs H starting from an initial configuration, we say that event a happens-before b if and only if, for all runs h ∈ H, idx (a, h) < idx (b, h).
When it comes to weak memory systems, there exist events that are ordered according to the above tentative definition but that are not in happens-before relation. Take the improperly synchronized message-passing example of Figure 12 as an example. In this example, a thread p 0 writes to a shared variable z and sets a flag; another thread, p 1 , checks the flag and reads from z if the flag has been set. If A and B are the first and second instructions in thread p 0 , and C and D are the loads of the flag and of the shared variable z in p 1 , then program order gives rise to A → hb B and C → hb D. We also have that the load of z in D only occurs if the value of the flag observed by thread p 1 is true, which means it was previously set by thread p 0 in B. Therefore, in all runs in which D is observed, B necessarily occurs earlier in the execution. This necessity does not, however, place B and D in happens-before relation. Under many flavors of weak memory, the memory accesses between the two threads are not synchronized. As the example shows, our tentative definition of happens-before as always-occurring-before or necessarily-occurring-before does not work for weak memory systems. How about for sequentially consistent ones?
In the program of Figure 13, thread p 0 sends values 0 and 1 into channel c consecutively. Concurrently, thread p 1 writes 42 to a shared variable z and receives from the channel, while thread p 2 first receives from the channel and conditionally reads from z. From this program, we construct an example in which events are necessarily ordered but are not in happens-before, even if we assume sequential consistency. To illustrate this point, let us consider an execution of the program. Let (o) p be a trace event capturing the execution of operation o by thread p. Let also z! and z? represent a write and read operation on the shared variable z, and sd c and rv c represent send and receive operations on channel c. Assuming channel capacity | c | ≥ 2, consider a trace of the program in which p 1 receives from c before p 2 does. Note that the if-statement's reduction is interpreted as an internal or silent transition. Given that p 1 receives from c before p 2 does, the value received by p 2 must be 1 as opposed to 0. Therefore, p 2 takes the branch and reads from the shared variable z. Figure 14 shows the partial order on events for this execution.
Fig. 14: Partial order on conditional-race example.
Program order is captured by the vertical arrows in the diagram; channel communication is captured by the solid diagonal arrows. As per discussion in Section 3.2.3, we make the distinction between a channel operation and its completion. A channel operation is depicted as two half-circles; the operation's completion is captured by the bottom half-circle. That way, a send (top of the half-circle) happens-before its corresponding receive completes (bottom half). Now, given that the send operations are in happens-before, meaning (sd c 0) p 0 → hb (sd c 1) p 0 , and that channels are First-In-First-Out (FIFO), the reception of value 0 from c must occur before the reception of 1. This requirement is captured by the dotted arrow in the diagram. However, according to the semantics of channel communication (i.e. rules (1) and (2) of page 4), this order does not impose a happens-before relation between the receiving events. In other words, there exist events that are necessarily ordered, but not in happens-before relation to one another.
Our operational semantics mimics the Go memory model in defining synchronization in terms of channel communication. Specifically, we abide by rules (1) and (2), which establish a happens-before relation between a send and the completion of its corresponding receive, and, due to the boundedness of channels, between a receive and the completion of a future send. However, these are not the only impositions by the semantics on the order of events. Channels act as FIFO queues in both Go [4] and our operational semantics. However, neither Go nor our operational semantics establishes a happens-before relation between consecutive sends or consecutive receives. For example, the i th send on a channel c does not happen-before the (i + 1) th send on c. Therefore, there exist events that are necessarily ordered, but that are not in happens-before relation. The failure of our tentative definition of happens-before as necessarily-occurring-before, given earlier in this section, has subtle implications as discussed next.
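The conditional-race example can be replayed against a simplified model of the semantics. The Go sketch below is an assumption-laden illustration: it models only the forward direction of channel c (the backward channel plays no role here since both sends precede the receives and the capacity is at least 2), and the names `Event`, `Set`, `msg`, and `conditionalRace` are hypothetical.

```go
package main

import "fmt"

// Event is a labelled memory event; Set is a happens-before set.
type Event struct {
	Label int
	Var   string
	Write bool
}
type Set map[Event]bool

func clone(s Set) Set {
	c := Set{}
	for e := range s {
		c[e] = true
	}
	return c
}

func union(a, b Set) Set {
	u := clone(a)
	for e := range b {
		u[e] = true
	}
	return u
}

// msg is a channel entry: a value together with the sender's happens-before set.
type msg struct {
	val int
	hb  Set
}

// conditionalRace replays the execution in which p1 receives before p2 and
// reports whether p2's read of z is flagged as racy.
func conditionalRace() bool {
	var queue []msg // FIFO queue of channel c
	hb0, hb1, hb2 := Set{}, Set{}, Set{}

	// p0: sd c 0; sd c 1
	queue = append(queue, msg{0, clone(hb0)})
	queue = append(queue, msg{1, clone(hb0)})

	// p1: z := 42, then rv c (receives 0)
	w := Event{Label: 1, Var: "z", Write: true}
	hb1[w] = true
	first := queue[0]
	queue = queue[1:]
	hb1 = union(hb1, first.hb)

	// p2: rv c (receives 1 because of FIFO), then conditionally reads z
	second := queue[0]
	hb2 = union(hb2, second.hb)
	if second.val == 1 {
		return !hb2[w] // flagged iff the write is not in p2's past
	}
	return false
}

func main() {
	fmt.Println(conditionalRace())
}
```

The read is flagged even though the two receives always separate the write from the read in this execution: FIFO order determines the value p2 receives, but the happens-before set p2 inherits comes from p0, which never learned of p1's write.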

Happens-before, traces, and commutativity of operations
Traces come from observing the execution of a program and are expressed as strings of events. In a concurrent system, however, events may not be causally related, which means that the order of some events is not pre-imposed. In reality, instead of sequences, events in a concurrent system form a partially ordered set (see Figure 14 for an example). As advocated by Mazurkiewicz [22], it is useful to combine sequential observations with a dependency relation for studying "the nonsequential behaviour of systems via their sequential observations." By defining an independence relation on events, it is possible to derive a notion of equivalence on traces: two traces are equivalent if it is possible to transform one into the other "by repeatedly commuting adjacent pairs of independent operations" [16].
One way to define independence is as follows: Given a run R i Clearly, if a happens-before b, then a and b cannot be swapped in a trace. So, independence between two events means (at least) the absence of happens-before relation between them. But happens-before is not all that needs to be considered in the definition of independence.
When translating a partial order of events to a trace, not every linearization that respects the happens-before relation is a valid trace. Some linearizations of the partial order may not be "realizable" by the operational semantics. In other words, there can be traces that abide by the happens-before relation but that cannot be generated from the execution of a program. For example, we can obtain the following linearization given the partial order of Figure 14: This linearization respects the partial order based on the happens-before relation: program order is respected, so is the relation between sends and their corresponding receives. However, this linearization breaks the first-in-first-out assumption on channels.
FIFO is broken because, in order for p 2 to read from z, it must be that it received the value of 1 from the channel. But p 2 is the first thread to receive from the channel and, since 0 was the first value into the channel, it must also have been the first value read from the channel. Therefore, the linearization in Trace 6 is not "realizable" by the operational semantics. While happens-before restricts the commutation of trace operations, there exist other operations that are ordered (though not ordered by happens-before) and that, consequently, must not commute. The difficulty in reconciling the commutativity of trace events with the happens-before relation remains counterintuitive today, even though its origins are related to an observation made years ago in a seminal paper by Lamport [17]. In the paper, Lamport points out that "anomalies" can arise when there exist orderings that are external to the definition of happens-before; see the "Anomalous Behavior" section of [17]. In order to avoid these anomalies, one suggestion from the paper is to expand the notion of happens-before so that, if a and b are necessarily ordered, then a and b are also in happens-before.
Let us analyze the consequences of rolling FIFO notions into the definition of happens-before. Given the example of Figure 13, since the sends are ordered in a happens-before relation, and the channel is FIFO, one can argue that the receive events should also be ordered by happens-before. According to this argument, we ought to promote the dotted line in Figure 14 to a solid → hb arrow. This modification would make the example well-synchronized. On the one hand, given that the write to z by p 1 and the read from z by p 2 are always separated by events (specifically, by the two receive events), interpreting the two memory accesses as being synchronized seems rather fitting: the two memory accesses cannot happen simultaneously, nor can they exist side-by-side in a trace.
On the other hand, there are downsides to this approach. For one, the resulting semantics deviates from Go's, but, more importantly, such a change does impact synchronization in counterintuitive ways. Specifically, making the dotted arrow a happens-before arrow would imply that a receiver (in this case p 2 ) can learn about prior events that are not known by the corresponding sender. If the dotted arrow is promoted to a synchronization arrow, the write (z!) p 1 is communicated to p 2 via p 0 without p 0 itself being "aware" of the write. In other words, the write identifier is transmitted via p 0 but is not present in p 0 's happens-before set.
We follow Go and allow for some events to always occur in order without affecting synchronization. Consequently, such ordered events are not considered to be in happens-before order. A less clear consequence, however, is that races can no longer be defined as simultaneous (or side-by-side) accesses to a shared variable. This point is explored next.

Manifest data races
Section 2 mentioned the concept of manifest data race; below we give a concrete definition.
Definition 1 (Manifest data race). A well-formed configuration R contains a manifest data race if either hold: for some p 1 ≠ p 2 .
Manifest data races can also be defined on traces.
Definition 2 (Manifest data race). A well-formed trace h contains a manifest data race if either are a sub-sequence of h (and where p 1 ≠ p 2 ).
While manifest races are obvious, races in general may involve accesses that are arbitrarily "far apart" in a linear execution. By bringing conflicting accesses side-by-side, we could show irrefutable evidence of a race that, otherwise, may be obscured in a trace. Let h′ be derivable from h by the repeated commutation of adjacent pairs of independent operations. If h′ contains a manifest data race, then we say h contains a data race. This definition of races seems unequivocal. From here, soundness and completeness of a race detector may be defined accordingly. Theorems 3 and 4 are also clear and unequivocal. More importantly, they link two world views: the view of races as unsynchronized accesses with respect to the happens-before relation and a view of races in terms of commutativity of trace events à la Mazurkiewicz. The problem with the concept of manifest data race and Theorems 3 and 4, however, is that when the definition of independence is made to respect FIFO order as well as the happens-before relation, the notion of manifest data race is no longer attainable. In other words, given a definition of independence which respects FIFO and happens-before, there exist racy traces from which a manifest data race is not derivable.
The program of Figure 13 gives rise to such an example. The access to z by p 2 only occurs if p 2 receives the second message sent on the channel. In other words, the existence of event (z?) p 2 in a trace is predicated on the order of execution of channel operations: p 2 only reads from z if the other thread, p 1 , receives from c before p 2 does. 14 This requirement places the receive operations between the memory operations. Therefore, a trace in which (z!) p 1 and (z?) p 2 are side-by-side is not attainable. Yet, as discussed previously, the accesses to z are not ordered by happens-before, and, therefore, are concurrent. Since the accesses are also conflicting, they constitute a data race.
It seems that Mazurkiewicz traces are "more compatible" with confluence checking than data-race checking. In data-race checking, there are non-confluent runs that do not exhibit data races; these runs are non-confluent because they have "races on channels." In our example, the two receives from p 1 and p 2 are in competition for access to the channel. These receive operations are concurrent and non-confluent. Finally, the example also hints at the perhaps more fundamental observation: that races have little to do with simultaneous accesses to a shared variable but instead with unsynchronized accesses. While simultaneous accesses are clearly unsynchronized, not all unsynchronized accesses may be made simultaneous. 15

Comparison with vector-clock based race detection
Vector clocks (VCs) are a mechanism for capturing the happens-before relation over events emanating from a program's execution [21]. A vector clock V is a function Tid→Nat which records a clock, represented by a natural number, for each thread in the system. "VCs are partially-ordered (⊑) in a pointwise manner, with an associated join operation (⊔) and minimal element (⊥ V ). In addition, the helper function inc t increments the t-component of a VC." [10] Using vector clocks, Pozniansky and Schuster [28] proposed a data-race detection algorithm that became well known and is sometimes referred to as DJIT + . Their algorithm works as follows. Each thread is associated with a clock. Also, each thread t keeps track of the last operation "known" to t as having been performed by another thread u. More precisely, C t is a vector clock for which the operation associated with C t (u) happened-before t's current operation. The algorithm also keeps track of memory operations. Each memory location x has two vector clocks, one associated with reads, R x , and another with writes, W x . The clock of the last read from variable x by thread t is recorded in R x (t); similarly, W x (t) records the clock of the last write to x by t.
A race is flagged when a thread t attempts to read from x while being "unaware" of some recent write to x.
14 In this example, we use the value of the message received on a channel to branch upon. But since a receive from a channel changes a thread's "visibility" of what is in memory, it is possible to craft a similar example in which all message values are unit but in which a thread's behavior changes due to a change in the ordering of the receives. 15 There may not exist a configuration from which two transitions are possible; transitions that involve conflicting memory accesses. Yet, it is possible for two access separated "in time" to be unsynchronized.
Precisely, a race is flagged when t attempts to read from x and there exists a write to x by thread u, W x (u), that is not accounted for by t, meaning W x (u) ≥ C t (u), or, more concisely, W x C t . If t succeeds in reading from x, then R x (t) is updated to the value of C t (t). Similarly, a race is also flagged when t attempts to write to x while being unaware of some recent read or write to x, meaning R x C t or W x C t . If t succeeds in writing to x, then W x (t) is updated to C t (t).
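The read and write checks described above can be sketched in Go as follows. This is only an illustrative sketch, not the implementation of Pozniansky and Schuster: the names VC, Leq, Detector, Read, and Write are hypothetical, thread identifiers are plain integers, and starting every thread at clock 1 is an assumption.

```go
package main

import "fmt"

// VC is a vector clock: one logical clock per thread id.
// Missing entries read as 0, which plays the role of the minimal element.
type VC map[int]int

// Leq reports whether a is pointwise below b, i.e. a(u) <= b(u) for all u.
func Leq(a, b VC) bool {
	for u, c := range a {
		if c > b[u] {
			return false
		}
	}
	return true
}

// Detector holds the per-thread clocks C and the per-variable clocks R and W.
type Detector struct {
	C map[int]VC    // C[t][u]: last operation of u known to t
	R map[string]VC // R[x][t]: clock of the last read of x by t
	W map[string]VC // W[x][t]: clock of the last write to x by t
}

// NewDetector starts every thread at clock 1 (an assumption of this sketch).
func NewDetector(threads ...int) *Detector {
	d := &Detector{C: map[int]VC{}, R: map[string]VC{}, W: map[string]VC{}}
	for _, t := range threads {
		d.C[t] = VC{t: 1}
	}
	return d
}

// Read returns false (a race) if some write to x is not covered by C[t].
func (d *Detector) Read(t int, x string) bool {
	if !Leq(d.W[x], d.C[t]) {
		return false
	}
	if d.R[x] == nil {
		d.R[x] = VC{}
	}
	d.R[x][t] = d.C[t][t]
	return true
}

// Write returns false (a race) if some read or write of x is not covered by C[t].
func (d *Detector) Write(t int, x string) bool {
	if !Leq(d.R[x], d.C[t]) || !Leq(d.W[x], d.C[t]) {
		return false
	}
	if d.W[x] == nil {
		d.W[x] = VC{}
	}
	d.W[x][t] = d.C[t][t]
	return true
}

func main() {
	d := NewDetector(1, 2)
	fmt.Println("write by 1 race-free:", d.Write(1, "x"))
	// Thread 2 never synchronized with thread 1, so the write is unknown to it.
	fmt.Println("read by 2 race-free:", d.Read(2, "x"))
}
```

In this run the read by thread 2 is flagged, since $W_x(1) = 1$ while $C_2(1) = 0$.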
A thread's clock is advanced when the thread executes synchronization operations, which have bearing on the happens-before relation. The algorithm was proposed in the setting of locks; each lock m is associated with a vector clock $L_m$. When a thread t releases a lock m, the lock's vector clock $L_m$ is updated to $C_t$ and the thread's clock is advanced, meaning $C_t := inc_t(C_t)$. If a thread u acquires m, then $C_u$ is updated to $C_u \sqcup L_m$. We can think of lock release as placing a message, namely the vector clock associated with the releasing thread, into a buffer of size one. Acquiring a lock is analogous to receiving from a channel with buffer size one: the receiving thread updates its vector clock by incorporating the vector clock previously "stored" in the lock. Thus, in comparison with the approach presented in our paper, lock operations are a special case of buffered channel communication. Our paper accounts for channels of arbitrary size and takes into account the implications of capacity limitations, as per rule (2) and the discussion in Section 2.
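The clock updates on release and acquire can be sketched in Go as below. The names VC, Join, Release, and Acquire are hypothetical, and locks are identified by strings purely for illustration; the sketch only mirrors the update rules stated above.

```go
package main

import "fmt"

// VC is a vector clock: one logical clock per thread id (missing entries are 0).
type VC map[int]int

// Join returns the pointwise maximum of a and b as a fresh vector clock.
func Join(a, b VC) VC {
	out := VC{}
	for u, c := range a {
		out[u] = c
	}
	for u, c := range b {
		if c > out[u] {
			out[u] = c
		}
	}
	return out
}

// Release models thread t releasing lock m:
// L_m := C_t, followed by C_t := inc_t(C_t).
func Release(C map[int]VC, L map[string]VC, t int, m string) {
	L[m] = Join(C[t], nil) // store a copy, like a message placed in a buffer of size one
	C[t][t]++
}

// Acquire models thread u acquiring lock m: C_u := C_u joined with L_m.
func Acquire(C map[int]VC, L map[string]VC, u int, m string) {
	C[u] = Join(C[u], L[m])
}

func main() {
	C := map[int]VC{1: {1: 1}, 2: {2: 1}}
	L := map[string]VC{}

	Release(C, L, 1, "m")
	Acquire(C, L, 2, "m")

	// Thread 2 now knows about thread 1's operations up to the release.
	fmt.Println("C_1:", C[1], "C_2:", C[2], "L_m:", L["m"])
}
```

After the acquire, operations performed by thread 1 before the release are covered by thread 2's clock, whereas operations thread 1 performs after the release carry a strictly larger clock and remain unknown to thread 2.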
Another significant difference between our approach and DJIT+ is that we dispense with the notion of vector clocks. Vector clocks are a conceptual vehicle for capturing the partial order of events. Instead of relying on VCs, our formalization is tied directly to the concept of happens-before. For one, vector clocks are expensive. Common operations on VCs consume O(τ) time, and a VC's representation requires O(τ) storage, where τ is the number of threads spawned during the execution of a program [10]. It turns out that not all uses of VCs in DJIT+ are strictly necessary. For example, Flanagan and Freund [10] introduce the concept of an epoch, which consists of a pair c@t where c is a clock and t a thread identifier. They then replace $W_x$, the vector clock recording writes to x, with a single epoch. This epoch captures the clock and thread identity associated with the most recent write to x. Similarly, in our approach, a memory location is associated with the identifier of only the most recent write to a given variable. Any thread that is "aware" of the identifier is allowed to write to the corresponding variable. In our case, however, we do not need to record which thread performed the most recent write.
Compared to DJIT+, FASTTRACK reduces the dependency on vector clocks by also replacing $R_x$ with the epoch of the most recent read of x. However, since reads are not totally ordered, FASTTRACK dynamically switches back to a vector clock representation when needed. The algorithm, however, resorts to a full VC even when only two read operations on the same variable are not related by happens-before, thus incurring O(τ) memory, where τ is the number of threads. Similar to FASTTRACK, we record the most recent (unordered) reads which, in the best case, involves an O(1) memory footprint. Differently from FASTTRACK, however, the size of the happens-before set $E^r_{hb}$ associated with reads of a variable grows gradually, which means we only require O(τ) memory when all threads have independently performed reads of a given variable.
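To make the contrast with clocks and epochs concrete, the following Go sketch tracks event identifiers instead. It is a loose illustration under stated assumptions, not the paper's formal semantics: the names EventID, Set, Thread, Var, Monitor, and Merge are hypothetical, identifier 0 stands for "no write yet", and Merge stands in for a receive that delivers the sender's happens-before information.

```go
package main

import "fmt"

// EventID is a fresh label for a recorded read or write event (hypothetical).
type EventID int

// Set is a set of event identifiers.
type Set map[EventID]bool

// Thread carries the set of events it is "aware" of.
type Thread struct{ HB Set }

// Var records the most recent write and the recent, mutually unordered reads.
type Var struct {
	LastWrite EventID // 0 means no write so far (assumption of this sketch)
	Reads     Set
}

// Monitor hands out fresh event identifiers.
type Monitor struct{ next EventID }

func NewThread() *Thread { return &Thread{HB: Set{}} }
func NewVar() *Var       { return &Var{Reads: Set{}} }

func (m *Monitor) fresh() EventID { m.next++; return m.next }

// Write is race-free only if t is aware of the last write and of all recorded reads.
func (m *Monitor) Write(t *Thread, x *Var) bool {
	if x.LastWrite != 0 && !t.HB[x.LastWrite] {
		return false
	}
	for r := range x.Reads {
		if !t.HB[r] {
			return false
		}
	}
	id := m.fresh()
	x.LastWrite, x.Reads = id, Set{}
	t.HB[id] = true
	return true
}

// Read is race-free only if t is aware of the last write. Reads already known
// to t are dropped, so the read set grows only with unordered reads.
func (m *Monitor) Read(t *Thread, x *Var) bool {
	if x.LastWrite != 0 && !t.HB[x.LastWrite] {
		return false
	}
	for r := range x.Reads {
		if t.HB[r] {
			delete(x.Reads, r)
		}
	}
	id := m.fresh()
	x.Reads[id] = true
	t.HB[id] = true
	return true
}

// Merge models a receive: the receiver learns the events carried by the message.
func Merge(to *Thread, msg Set) {
	for e := range msg {
		to.HB[e] = true
	}
}

func main() {
	m, x := &Monitor{}, NewVar()
	t1, t2 := NewThread(), NewThread()

	fmt.Println(m.Write(t1, x)) // true: nothing to be aware of yet
	fmt.Println(m.Read(t2, x))  // false: t2 is unaware of t1's write
	Merge(t2, t1.HB)            // t2 "receives" t1's happens-before information
	fmt.Println(m.Read(t2, x))  // true
	fmt.Println(m.Read(t1, x))  // true: unordered with t2's read
	fmt.Println(len(x.Reads))   // 2: one entry per unordered read
	fmt.Println(m.Write(t1, x)) // false: t1 is unaware of t2's read
}
```

The read set holds a single entry as long as reads are ordered and grows by one entry per additional unordered read, which is the gradual growth described above.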
Most significantly, both DJIT+ and FASTTRACK require one vector clock per thread, which means that, in both the best and the worst case, O(τ) memory is required per thread. Our approach is more nuanced. In the best case, our memory consumption per thread is O(1). In the worst case, it is O(ντ), where ν is the number of shared variables in a program.$^{16}$ The average case will depend on the access and synchronization pattern of threads and thus requires empirical observations under different workloads.

Related work
Race detection via the analysis of source code is an undecidable problem. Regardless, race detectors based on the static analysis of source code [23,38,3] exist and have found application in industry. More recently, Blackshear et al. [3] implement a static analysis tool called RACERD to help with the parallelization of previously sequential Java source code. The tool over-approximates the behavior of programs and can, thereby, reject programs that turn out to be data-race free. This over-approximation was not a hindrance, as even conservative parallelization efforts can lead to gains over purely sequential code.
By and large, however, instead of flagging races in a program as a whole, race detectors have resorted to the analysis of particular runs of a program. To that end, detectors instrument the program so that races are either flagged during execution, in what is called on-line or on-the-fly race detection, or found in logs captured during execution and analyzed postmortem. Even still, dynamic race detection is NP-hard [24] and many techniques have been proposed for detection at scale. Broadly, these techniques involve static analysis used to reduce the number of runtime checks [10,30], and heuristics that trade false-positive [32,29,5] or false-negative rates [20] for better space/time utilization. For example, by allowing races to sometimes go undetected, sampling race detectors let go of completeness in favor of lower overheads. One common heuristic, called the cold region hypothesis, is to sample more frequently from less executed regions of the program. This rule of thumb hinges on the assumption that faults are more likely to already have been identified and fixed if they occur in the hot regions of a program [20]. Alternatively, by going after a proxy instead of an actual race, imprecise race detectors let go of soundness. The prominent examples here are Eraser's LockSet [32] and Locksmith [29], which enforce a lock-based synchronization discipline. A violation of the discipline is a code smell but not necessarily a race. The amalgamation of different approaches has also been investigated, leading to hybrid race detectors. For example, O'Callahan and Choi [25] combined LockSet-based detection with happens-before information reconstructed from vector clocks; Choi et al. [5] extended LockSet to incorporate static analyses.
Another avenue of inquiry has led to predictive race detection [35,15], which attempts to achieve higher detection capabilities by extrapolating beyond individual runs. Huang et al. [15] incorporate abstracted control flow information and formulate race detection as a constraint solving problem. With the goal of observing more races per run, Smaragdakis et al. [35] introduce a new relation, called causally-precedes, which is a generalization of the happens-before relation.
A number of papers address race detection in the context of channel communication [6,7,37]. Some of the papers, however, do not speak of shared memory but, instead, define races as conflicting channel accesses. In that setting, the lack of conflicting accesses to channels implies determinacy. A different angle is taken by Terauchi and Aiken [37], who, among different kinds of channels, define a buffered channel whose buffer is overwritten by every write (i.e. send) but never modified by a read (i.e. receive). This kind of channel, referred to as a cell, behaves, in essence, as shared memory. The goal of Terauchi and Aiken [37] is, still, determinacy. Having modeled shared memory as a kind of channel, they achieve determinacy by ensuring the absence of conflicting accesses to channels. Our goal, however, is different: we aim to detect data races but do not want to go as far as ensuring determinacy. Therefore, our approach allows "races" on channel accesses. From a different perspective, however, the work of Terauchi and Aiken [37] can be seen as complementary to ours: we conjecture that their type system can serve as the basis for a static data-race detector.
Among the dynamic data-race detection tools from industry, Banerjee et al. [1] discuss different race detection algorithms including one used by the Intel Thread Checker. The authors describe adjacent conflicts, which are similar to our notion of side-by-side or manifest data races. The paper also classifies races similarly to our WaR, RaW, and WaW classification.
Go has a race detector integrated into its toolchain [12]. The -race command-line flag instructs the Go compiler to instrument memory accesses and synchronization events. The race detector is built on top of Google's sanitizer project [13] and TSan in particular [33,14]. TSan is part of LLVM's runtime libraries [34,19]. It works by instrumenting memory accesses and by monitoring lock acquisition and release as well as thread forks and joins. Note, however, that channel communication is the vehicle for achieving synchronization in Go. Even though locks exist, they are part of a package, while channels are built into the language. Yet, the race detector for Go sits at a layer underneath. In this paper we study race detection with channel communication taking a central role.
In the absence of the DRF-SC guarantee, the full C/C++11 memory model can harbor data races involving weak memory behavior. With the goal of finding data races in production-level C/C++11 code, Lidbury and Donaldson [18] extend the ThreadSanitizer (TSan) tool [33,14] to support a class of non-sequentially-consistent executions.

Conclusion
We presented a dynamic race detector for a language in the style of Go, featuring channel communication as the sole synchronization primitive. The proposed detector records and analyzes information locally and is well-suited for online detection.
Our race detector is built upon a previous result [9], where we formalize a weak memory model inspired by the Go specification [11]. In that setting, we recorded memory read and write events that were in happens-before relation with respect to a thread's present operation. This information was stored in a set called $E_{hb}$, or the happens-before set of a thread, and it was used to regulate a thread's visibility of memory events. The core of the paper was a proof of the DRF-SC guarantee, meaning, we proved that the proposed relaxed memory model behaves sequentially consistently in the absence of data races. The proof hinges on the fact that, in the absence of races, all threads agree on the contents of memory; see the consensus lemma in [9]. The scaffolding used in the proof of the consensus lemma contains the ingredients of the race detector presented in this paper. Based on our experience, we conjecture that one can automatically derive a race detector given a weak memory model and its corresponding proof of the DRF-SC guarantee.
In the DRF-SC proof of [9], we show that if a program is racy, it behaves sequentially consistently up to the point at which the first data race is encountered. In other words, this first point of divergence sets in motion all behavior that is not sequentially consistent and that arises from the weakness of the memory model. With this observation, we argue that a race detector can operate under the assumption of sequential consistency. This is a useful simplification, as sequentially consistent memory is conceptually much simpler than relaxed memory. If the data-race detector flags the first evidence of a data race, then program behavior is sequentially consistent up to that point.
Avenues for future work abound. For example, one notable extension would be to statically analyze a target program with the goal of removing dynamic checks. Here we may be able to borrow from the research on static analysis for dynamic race-detection in the context of lock-based synchronization disciplines.

A Strong semantics
For completeness and for reference, we include here the operational semantics without augmenting it with any information relevant for race checking. It is thereby a conventional operational semantics and corresponds to the strong semantics from [9]. The surface syntax is unchanged from Figure 1, including the abbreviations $e; t$ for $\mathbf{let}\ r = e\ \mathbf{in}\ t$ (when $r \notin fv(t)$) and $\mathbf{stop}$ for $\sum_0$. The operational semantics is formulated using run-time configurations as given in equation (7).
For race detection, we used the "same" run-time configurations, except that they were augmented with additional information (cf. equation (3) for the intermediate formulation of the race-detecting semantics and equation (4), respectively). Compared to the race-detecting semantics, the configurations here carry less information. In particular, the recorded events don't carry identifying labels and threads don't keep track of happens-before information as they do for the race checker.$^{17}$

A.1 Structural congruence
Configurations are interpreted only up to structural congruence: parallel composition is associative and commutative, with the empty configuration as neutral element. The ν-binder is used to manage the scopes of dynamically created names. Besides that, syntax is considered tacitly up to renaming of bound names, in particular, ν-bound names.
Dynamically created names are channel names. In the augmented semantics, where processes are named and events carry labels, names for those entities can also be created on the fly, and they are subject to the congruence rules for ν-bound names.

A.2 Local steps
The rules from Figure 16 concern reduction steps that don't affect the memory or involve channel communication.

A.3 Memory interactions and channel communication
Reading and writing, the two basic memory interactions, are covered in Figure 17 and channel communication in Figure 18. Compared to the semantics for race detection (cf. Figures 6 and 4), the semantics here does without the extra information and the bookkeeping of happens-before information. Related to that, the recorded events don't carry any names to identify the event. For the channel communication in Figure 18, happens-before information is no longer communicated. In particular, the forward channel carries only the communicated value v (cf. rules R-SEND and R-REC). To realize the boundedness of the channels, the semantics still maintains the two parts of a channel: the forward channel for communication, and the backward channel for "flow-control." The backward channel $c_b$ does not carry any information, just a number of entries representing the still empty slots in the forward channel. We use the unit value () for that, and initially, the backward channel is filled with a number of ()'s corresponding to the capacity of the channel (see rule R-MAKE).
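The interplay between the forward and the backward channel can be mimicked with two queues, as in the sequential Go sketch below. It is a simplified illustration and not a transcription of the rules in Figure 18: the names Chan, Make, Send, and Recv are hypothetical, values are integers, only capacities greater than zero are covered, and a blocked operation is modeled by returning false.

```go
package main

import "fmt"

// Chan models a channel of non-zero capacity as two queues.
type Chan struct {
	fwd []int      // forward queue: the communicated values
	bwd []struct{} // backward queue: one unit value per empty slot
}

// Make fills the backward queue with as many units as the capacity.
func Make(capacity int) *Chan {
	return &Chan{bwd: make([]struct{}, capacity)}
}

// Send consumes a unit from the backward queue and enqueues v on the
// forward queue. It returns false when no empty slot is left.
func (c *Chan) Send(v int) bool {
	if len(c.bwd) == 0 {
		return false // channel full: a real sender would block here
	}
	c.bwd = c.bwd[1:]
	c.fwd = append(c.fwd, v)
	return true
}

// Recv dequeues a value from the forward queue and returns a unit to the
// backward queue. It returns false when there is nothing to receive.
func (c *Chan) Recv() (int, bool) {
	if len(c.fwd) == 0 {
		return 0, false // channel empty: a real receiver would block here
	}
	v := c.fwd[0]
	c.fwd = c.fwd[1:]
	c.bwd = append(c.bwd, struct{}{})
	return v, true
}

func main() {
	c := Make(2)
	fmt.Println(c.Send(1), c.Send(2)) // both slots get used
	fmt.Println(c.Send(3))            // no empty slot left
	v, ok := c.Recv()
	fmt.Println(v, ok)     // the oldest value is received, freeing a slot
	fmt.Println(c.Send(3)) // the freed slot can be used again
}
```

At every point the lengths of the two queues add up to the capacity, which is how the backward queue enforces the bound without carrying any data.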
For channel communication, the semantics distinguishes between synchronous communication, i.e., a "rendezvous" over a channel of capacity 0, and asynchronous communication, with a channel of non-zero, but finite capacity. For the asynchronous case, both sending and receiving ...
The rules dealing with the select statement are given in Figure 19. The R-SEL-SEND and R-SEL-REC rules apply to asynchronous channels and are analogous to R-SEND and R-REC. The R-SEL-SYNC rules apply to open synchronous channels (i.e. the forward and backward queues are empty). The R-SEL-REC⊥ rule is analogous to R-REC⊥. Finally, the default rule (R-SEL-DEF) applies when no other select rule applies.