Generalized and Customizable Sets in R

This introduction to the R package sets is a (slightly) modified version of Meyer and Hornik (2009a), published in the Journal of Statistical Software. We present data structures and algorithms for sets and some generalizations thereof (fuzzy sets, multisets, and fuzzy multisets) available for R through the sets package. Fuzzy (multi-)sets are based on dynamically bound fuzzy logic families. Further extensions include user-definable iterators and matching functions.


Introduction
Few will deny the importance of sets and set theory, which form the foundation of modern mathematics. For theory-building, axiomatic approaches (e.g., Zermelo 1908; Fraenkel 1922) are typically used. However, even the primal, "naive" concept of sets representing "collections of distinct objects" (Cantor 1895), discarding order and count information, seems both natural and practical. The main operation being "is-element-of", sets alone are of limited practical use: most of the time they serve as basic building blocks for more complex structures such as relations and generalized sets. A common way of generalizing sets is to consider pairs (X, m) with set X ("universe") and membership function $m : X \to D$ mapping each member to its "grade". The subset of X of elements with non-zero membership is called "support".

In multisets, elements may appear more than once, i.e., $D = \mathbb{N}_0$ (m is also called the multiplicity function). There are many applications in computer science and other disciplines (for a survey, see, e.g., Singh, Ibrahim, Yohanna, and Singh 2007). In statistics, multisets appear as frequency tables. Fuzzy sets have become quite popular since their introduction by Zadeh (1965). Here, the membership function maps into the unit interval. An interesting characteristic of fuzzy sets is that the actual behavior of set operations depends on the underlying fuzzy logic employed, which can be chosen according to domain-specific needs. Fuzzy sets are actively used in fields such as machine learning, engineering, medical science, and artificial intelligence (Dubois, Prade, and Yager 1996). Fuzzy multisets (Yager 1986) combine both approaches by allowing each element to map to more than one fuzzy membership grade, i.e., D is the power set of multisets over the unit interval. Examples for the application of fuzzy multisets can be found in the field of information retrieval (e.g., Matthé, Caluwe, de Tré, Hallez, Verstraete, Leman, Cornelis, Moelants, and Gansemans 2006).
The use of sets and variants thereof is common in modern general purpose programming languages: Java and C++ provide corresponding abstract data types (ADTs) in their class libraries, and Pascal and Python offer sets as a native data type. Indeed, since set elements are order-invariant and unique, lookup mechanisms can be implemented very efficiently (for example via hashing, resulting in nearly constant run-time complexity, compared to linear search, requiring n/2 steps on average for n elements). Surprisingly enough, sets are not standard in many mathematical programming environments such as MATLAB and Mathematica, and also R. Although the two latter offer set operations such as union and intersection, these are applied to linearly indexable structures (lists and vectors, respectively), interpreting them as sets. When it comes to R, this emulation is far from complete, and occasionally leads to inconsistent behavior. First of all, the existing infrastructure has no clear concept of how to compare elements, leading to possibly confusing results when different data types are involved in computations:

R> s <- list(1, "1")
R> union(s, s)

The reason is that most of the existing operations rely on match(), which automatically performs type conversions that are undesirable in this context. Also, quite a few other basic operations such as the Cartesian product, the power set, the subset predicate, etc., are missing, let alone more specialized operations such as the closure under union or intersection. Furthermore, the current facilities do not make use of a class system, making extensions hard (if not impossible). Another consequence is that no distinction can be made between sequences (ordered collections of objects) and sets (unordered collections of objects), which is key for the definition of complex structures where both concepts are combined, such as relations. Also, there is no support in base R for extensions such as fuzzy sets or multisets.
A few extension packages available from the Comprehensive R Archive Network deal with fuzzy concepts. Package fuzzyFDR (Lewin 2007) calculates fuzzy decision rules for multiple testing, but does not provide any explicit data structures for fuzzy sets. The main functions in fso (Roberts 2007) for fuzzy set ordination compute and return, among other information, membership values represented by numeric matrices for some variables of the input data. fuzzyRankTests (Geyer 2007) provides statistical tests based on fuzzy p values and fuzzy confidence intervals, the latter being returned as two separate numeric vectors for values and memberships. The gcl package (Vinterbo 2007) infers fuzzy rules from the input data, encapsulated in a classifying function returned by the training function. The rules are composed of triangular fuzzy sets, represented by triples describing the triangles' corner points for which the memberships become (0, 1, 0), respectively. Similarly, the FKBL package for fuzzy knowledge base learning (Alvarez 2007) uses sequences of triangular fuzzy sets, defined by a vector of corner points. Finally, fuzzyOP (Aklan, Altindas, Macit, Umar, and Unal 2008) provides support for fuzzy numbers: A set of n numbers is represented by a k × 2n numeric matrix, where two consecutive columns represent (at most k) supporting points and memberships, respectively, of the corresponding piecewise linear membership function. If some numbers have fewer supporting points than others, the remaining cells are filled with missing values (NAs).
The sets package (Meyer and Hornik 2009b) presented here provides a flexible and customizable basic infrastructure for finite sets and the generalizations mentioned above, including basic operations for fuzzy logic. Apart from complementing the data structures implemented in base R, extension packages like the ones mentioned above could gain in flexibility from building on a common infrastructure, facilitating data exchange and leveraging synergies.
The remainder of the paper is structured as follows. In Section 2, we discuss the design rationale of data structures and core algorithms. Section 3 introduces the most important set operations. Section 4 starts with constructors and specific methods for generalized sets, followed by a more focused presentation of the fuzzy logic infrastructure, and of functionality for handling and visualizing membership information. Section 5 shows how generalized sets can further be customized by specifying user-definable matching functions and iterators. Section 6 presents two examples before Section 7 concludes.

Design issues
There are many ways of implementing sets. Choice and efficiency largely depend on the domain range (i.e., the number of possible values for each element). If the domain is relatively small, i.e., in the range of integral data types such as byte, integer, word, etc., probably the most efficient representation is an array of bits representing the domain elements, as in Pascal (Wirth 1983). Operations such as union and intersection can then straightforwardly be implemented using logical OR and AND, respectively. This approach obviously fails for intractably large domains (e.g., strings or recursive objects). Without further application knowledge, one needs to resort to generic container ADTs with efficient element access such as hash tables or search trees (for unique elements). Operations can then be implemented following the classical element-based definitions: union by inserting all elements of the smaller set into the larger one; intersection by creating a new set with all elements of the smaller set also contained in the larger one; etc.
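The bit-array representation for small domains can be illustrated with a minimal base R sketch. It is not how the sets package stores its objects; the names `domain` and `as_bits()` are hypothetical and chosen only for this illustration.

```r
# Illustrative small domain (an assumption for this sketch).
domain <- 0:15

# Encode a set as a logical vector with one flag per domain element.
as_bits <- function(elems) domain %in% elems

A <- as_bits(c(1, 3, 5))
B <- as_bits(c(3, 4))

union_AB <- domain[A | B]         # logical OR yields the union
intersection_AB <- domain[A & B]  # logical AND yields the intersection
```

Both operations touch every domain element exactly once, which is why the approach only pays off when the domain is small.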
Clearly, set comparison must be permutation invariant. Some care is needed for nested sets. Consider, e.g., the comparison of A = {1, {2, 3}} and B = {1, {3, 2}}, which clearly are identical. To implement set equality, a matching operator would be used to check if all elements of A are contained in B. If elements were internally stored in the order given at creation, the objects representing {2, 3} and {3, 2} would be different. Comparing two set elements for equality would thus require recursively comparing all elements down the nested structures, which can quickly become computationally infeasible. We avoid this by using a canonical ordering during set creation, guaranteeing that identical sets have identical physical representation as well. We chose to sort elements using the natural order for numeric values, the Unicode character representation for strings, and the serialization byte sequence (as strings) for other objects. Eventually, the ordered elements are stored in a list.
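The effect of a canonical ordering can be sketched in base R. This is a simplification under stated assumptions: sets are modeled as plain lists, and all elements are ordered by their deparsed text rather than by the type-specific rules described above. The helper `canonical()` is hypothetical.

```r
# Hypothetical helper: bring a (possibly nested) set, stored as a list,
# into a canonical element order.
canonical <- function(x) {
  if (!is.list(x)) return(x)
  elems <- lapply(x, canonical)  # canonicalize nested sets first
  keys <- vapply(elems,
                 function(e) paste(deparse(e), collapse = ""),
                 character(1))
  elems[order(keys, method = "radix")]  # locale-independent ordering
}

A <- list(1, list(2, 3))
B <- list(1, list(3, 2))

same_as_given <- identical(A, B)                        # order-sensitive
same_canonical <- identical(canonical(A), canonical(B)) # order-insensitive
```

After canonicalization a single `identical()` call suffices, so no recursive element-by-element matching is needed at comparison time.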
For the sets package, further limitations are imposed by the extensions presented in Sections 4 and 5: Generalized sets require, for each element, the membership information, and we also support user-defined, high-level matching functions for comparing elements. Since operations defined for generalized sets basically operate on the memberships, it seems appropriate to store these as (generic) vectors in the same order as the corresponding elements. Thus, memberships of separate sets can simply be combined element-wise.
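As a rough illustration of element-wise combination, the following base R sketch models multisets as named numeric vectors of multiplicities. The helper `combine_memberships()` and the example values are assumptions made for this sketch, not part of the package.

```r
# Hypothetical helper: align two membership vectors on the joint support
# and combine them element-wise with the function f.
combine_memberships <- function(mx, my, f) {
  support <- sort(union(names(mx), names(my)))
  pad <- function(m) {
    out <- setNames(numeric(length(support)), support)
    out[names(m)] <- m  # elements missing from m keep membership 0
    out
  }
  res <- setNames(f(unname(pad(mx)), unname(pad(my))), support)
  res[res > 0]          # elements with zero membership are dropped
}

X <- c(A = 4, B = 5, C = 6)
Y <- c(B = 7, C = 1, D = 2)

u <- combine_memberships(X, Y, pmax)  # union: maximum multiplicity
i <- combine_memberships(X, Y, pmin)  # intersection: minimum multiplicity
s <- combine_memberships(X, Y, `+`)   # sum: added multiplicities
```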
Many operations (e.g., testing for equality, subsetting, intersection, etc.) are based on matching elements of the sets involved. This is implemented by inserting the elements of the larger set into a hash table (we use hashed environments), and looking up the elements of the smaller set in this table (Knuth 1973, p. 391). As hash key, we use the elements' character representation. Since different objects might map to the same hash key, we actually store the indexes of the list elements, and match the actual objects using a simple linear search. (Note that since the element list is sorted, elements with the same representation are grouped, so the search will typically be fast.)

The implementation is based on R's S3 class system, allowing the definition of generic functions, dispatching appropriate methods depending on the first argument's class information. Objects for sets, generalized sets, and customizable sets have classes 'set', 'gset', and 'cset', respectively, with 'set' inheriting from 'gset', in turn inheriting from 'cset'. Suitable operators (such as & for intersection) are then "overloaded" to dispatch the right internal function corresponding to the operands' classes by defining corresponding methods for group generics.

Additionally, all operations can directly be accessed using the corresponding name combined with a set_, gset_, or cset_ prefix to give the user the choice of up- or downcasts when objects of different class levels are involved in one computation. For example, consider the union of the set {1} and the fuzzy set {2/0.5}: using the generic operator will give an error since the operands' classes differ. The user needs, in fact, to resolve the semantic ambiguity by explicitly choosing the intended operation: If the result should be a generalized (fuzzy) set, gset_union() should be used. To make the result a set (stripping membership information), set_union() is employed instead.
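The hash-based lookup described at the beginning of this passage can be sketched in base R with a hashed environment. This is a minimal illustration under our own assumptions, not the package's internal code; `hash_match()` and its key function are hypothetical.

```r
# Hypothetical sketch: look up the elements of x in table via a hashed
# environment keyed by the elements' character representation.
hash_match <- function(x, table) {
  # The "k:" prefix only guards against empty variable names.
  key <- function(e) paste0("k:", paste(as.character(e), collapse = " "))
  buckets <- new.env(hash = TRUE)
  for (i in seq_along(table)) {
    k <- key(table[[i]])
    buckets[[k]] <- c(buckets[[k]], i)  # store indexes, not objects
  }
  vapply(x, function(e) {
    # Linear search among candidates sharing the same key.
    for (i in buckets[[key(e)]]) {
      if (identical(e, table[[i]])) return(i)
    }
    NA_integer_
  }, integer(1))
}

# 1, 1L and "1" share the key "k:1" but are distinct objects.
pos <- hash_match(list(1L, "1", 2), list(1, 1L, "1"))
```

Storing indexes per key keeps colliding objects apart, so the final `identical()` test decides the match.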

Sets
The basic constructor for creating sets is the set() function accepting any number of R objects as arguments.

R> set(1) <= set(1,2)
[1] TRUE

Note that all predicate functions are vectorized for convenience. The sequence 1:4 as one element would be looked up by using list(1:4) on the left-hand side. The class-specific functions dispatched by the generic operators are set_contains_element(), set_is_equal(), etc. Other than these predicate functions, one can use length() for the cardinality.

R> s ^2L
{(1L, 1L), (1L, 2L), (1L, 3L), (2L, 1L), (2L, 2L), (2L, 3L), (3L, 1L), (3L, 2L), (3L, 3L)}

and 2^ for the power set.

The class-specific functions set_union(), set_intersection(), and set_symdiff() accept more than two arguments. It is also possible to compute the relative complement of a set X in Y, basically removing the elements of X from Y.

Note, however, that for sets (as opposed to generalized sets), the concept of a "universe" is not necessarily required, and therefore the absolute complement of a set is not a well-defined operation. In the sets package, objects of class 'set' are special cases of generalized sets. To stay faithful to the simplicity of the original set concept, we define a 'set' object's universe to be identical to the set itself. The absolute complement of a 'set' object is therefore always the empty set:

{}

set_combn() returns the set of all subsets of specified length. closure() and reduction() compute the closure and reduction under union or intersection for a set family (i.e., a set of sets). The Summary() group methods will also work if defined for the elements:

R> range(s)
[1] 1 3

Because set elements are unordered, positional subscripting is not allowed. However, sets can be subset, and elements replaced, by using the elements themselves as indices:

{"bar", "foo", 1, 2}

R> sapply(s, sqrt)
[1] 1.000000 1.414214 1.732051
R> for (i in s) print(i)

Note that for() only works because the underlying C code ignores the class information, and directly processes the low-level list representation instead. This will be replaced by a more intelligent "foreach" mechanism as soon as it exists in base R. sapply() and lapply() call the generic as.list() function before iterating over the elements. Since a corresponding method exists for set objects, this is "safer" than using for().

Using set_outer(), it is possible to apply a function on all factorial combinations of the elements of two sets. If only one set is specified, the function is applied on all pairs of this set.

Generalized sets
There are several extensions of sets such as fuzzy sets and multisets. Both can be seen as special cases of fuzzy multisets. All have in common that they are defined on some universe of elements, and that each element maps to some membership information. We present how generalized sets are constructed, and demonstrate the effect of choosing different fuzzy logic families.

Constructors and specific methods
Generalized sets are created using the gset() function. The required arguments depend on whether membership information is specified extensionally (listing members) or intensionally (giving a rule for membership):

1. Extensional specification:
   a) Specify support and memberships as separate vectors. If memberships are omitted, they are assumed to be 1.
   b) Specify a set of elements along with their individual membership grades, using the element function (e()).
2. Intensional specification: Specify universe and membership function.

Note that for efficiency reasons, gset() will not store elements with zero membership grades, and the specification of a universe is only required with membership functions. For convenience (and storage efficiency), a default universe can be defined using sets_options(). Set-specific universes supersede the default universe, if any. If no universe (general or specific) is defined, the support of a set will be interpreted as its universe. For multisets, the definition of a (general or set-specific) universe can be complemented by a maximum multiplicity or bound.

Without membership information, gset() creates a set (the support is converted to a set internally). Note, however, that unlike for 'set' objects, it is possible to define a universe that differs from (i.e., is a proper superset of) the support:

R> gset(support = X, universe = LETTERS[1:10])
{"A", "B", "C"}

A multiset requires an integer membership vector. For fuzzy sets, the memberships need to lie in the unit interval. As an alternative to separate support/membership specification, each element can be paired with its membership value using e():

R> gset(elements = list(e("A", 0.1), e("B", 0.2), e("C", 0.3)))

Fuzzy sets can, additionally, be created using a membership function, applied to a specified (or the default) universe. For fuzzy multisets, the membership argument expects a list of membership grades, either specified as vectors, or as multisets.

gset_cardinality() returns the (relative) cardinality of a generalized set, computed as the sum (mean) of all memberships. gset_support(), gset_memberships(), gset_height() and gset_core() can be used to retrieve support, memberships, height (maximum membership degree), and the core (elements with membership 1), respectively, of a generalized set. gset_charfun() returns a (point-wise defined) characteristic function for a given gset. Note that in general, this will be different from the characteristic function possibly used for the creation.

As for sets, the usual operations such as union and intersection are available:

R> X <- gset(c("A", "B", "C"), 4:6)
R> Y <- gset(c("B", "C", "D"), 1:

Additionally, the product (gset_product()), sum (+), and difference (-) of sets are defined, which multiply, add, and subtract multiplicities (or memberships for fuzzy sets). For fuzzy (multi-)sets, not only the relative, but also the absolute complement (!) is defined. gset_mean() creates a new set by averaging corresponding memberships using the arithmetic, geometric or harmonic mean. Note that missing elements have 0 membership degree.

The membership vector of a generalized set can be transformed via gset_transform_memberships(), applying any vectorized function to the memberships:

R> x <- gset(1:10, 1:10/10)
R> gset_transform_memberships(x, pmax, 0.5)

Note that for multisets, an element's membership (multiplicity) m is interpreted as a vector of ones of length m, yielding possibly unexpected results. For multisets, the rep() function is a more natural choice for membership transformations:

R> rep(x, 0.5)

{1}
In addition, three convenience functions are defined for fuzzy (multi-)sets: gset_concentrate() and gset_dilate() apply the square and the square root function, and gset_normalize() normalizes the memberships to a specified maximum:

{"B", "C"}
The method can also be used for ν-cuts, selecting elements according to their multiplicity.

Characteristic functions and their visualization
The sets package provides several generators of characteristic functions to be used as templates for the creation of fuzzy sets, including the following shapes: Gaussian curve (fuzzy_normal()), double Gaussian curve (fuzzy_two_normals()), bell curve (fuzzy_bell()), sigmoid curve (fuzzy_sigmoid()), Π-like curves (fuzzy_pi3(), fuzzy_pi4()), trapezoid (fuzzy_trapezoid()), and triangle (fuzzy_triangular(), fuzzy_cone()). For example, a fuzzy normal function and a corresponding fuzzy set are created using:

R> N <- fuzzy_normal(mean = 0, sd = 1)
R> N(-3:3)
[1] 0.0111090 0.1353353 0.6065307 1.0000000 0.6065307 0.1353353 0.0111090

For convenience, we also provide wrappers that directly generate corresponding sets, given a specified universe. (If no universe is specified, the default universe is used; if this is also missing, the universe currently defaults to seq(0, 20, 0.1).) It is also possible to create function generators for characteristic functions from other functions (such as distribution functions).

(<<gset(201)>> denotes an object of class 'gset' with 201 elements, the size of the default universe.)

The sets package provides support for visualizing the membership functions of generalized sets, and in particular fuzzy sets. For (fuzzy) multisets, the plot method produces a (grouped) barplot for the membership vector (see Figure 1, top left):

R> plot(fuzzy_bell)
There is a plot method for tuples for visualizing a sequence of sets (see Figure 1, bottom left):
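The values printed for N(-3:3) in this section are consistent with an unnormalized Gaussian curve that peaks at 1. The following base R sketch reproduces them; `gaussian_membership()` is a hypothetical stand-in written for this illustration, and we do not claim that it mirrors the internals of fuzzy_normal().

```r
# Hypothetical generator: returns a membership function with maximum 1
# at `mean` and width controlled by `sd`.
gaussian_membership <- function(mean = 0, sd = 1) {
  function(x) exp(-(x - mean)^2 / (2 * sd^2))
}

N <- gaussian_membership(mean = 0, sd = 1)
grades <- round(N(-3:3), 7)
```

Unlike a probability density, the curve is not scaled to integrate to 1, since membership grades only need to lie in the unit interval.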

User-definable extensions
We added customizable sets extending generalized sets in two ways: First, users can control the way elements are matched, i.e., define equivalence classes of elements. Second, arbitrary iteration orders can be specified.

Matching functions
By default, sets and generalized sets use identical() to match elements, which is maximally restrictive. Note that this differs from the behavior of R's equality operator or match(), which perform implicit type conversions and thus confound, e.g., 1, 1L and "1". In the following example, note that on most computer systems, 3.3 − 2.2 will not be identical to 1.1 due to numerical issues.
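The differences between these comparison semantics can be checked directly in base R. The variable names are ours, and the floating-point result assumes the usual IEEE 754 double arithmetic.

```r
x <- 3.3 - 2.2

strict <- identical(x, 1.1)            # FALSE: differs in the last bits
tolerant <- isTRUE(all.equal(x, 1.1))  # TRUE: equal up to a tolerance

int_vs_double <- identical(1, 1L)      # FALSE: storage types differ
coerced <- match(1, "1")               # 1L: both sides coerced to character
equal_op <- 1 == "1"                   # TRUE: == also coerces
```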
Customizable sets can be created using the cset() constructor, specifying the generalized set and some matching function.

R> X <- cset(x, matchfun = match)
R> print(X)
{"1", 1.1}

Matching functions take two vector arguments, say, x and table, with table being a vector where the elements of x are looked up. The function should be vectorized in x, i.e., return the first matching position for each element of x. In order to make use of non-vectorized predicates such as all.equal(), the sets package provides matchfun() to generate one. sets_options() can be used to conveniently switch the default match and/or order function if a number of 'cset' objects need to be created.
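To make the contract for matching functions concrete, here is a base R sketch of a generator that turns a binary predicate into a function vectorized in x. It is a hypothetical analogue written for illustration; the name `make_matchfun()` is ours and the code is not taken from the package's matchfun().

```r
# Hypothetical generator: wrap a binary predicate into a matching
# function returning, for each element of x, the first matching
# position in table (or NA if there is none).
make_matchfun <- function(pred) {
  function(x, table) {
    vapply(seq_along(x), function(i) {
      hits <- which(vapply(seq_along(table),
                           function(j) isTRUE(pred(x[[i]], table[[j]])),
                           logical(1)))
      if (length(hits)) hits[1L] else NA_integer_
    }, integer(1))
  }
}

# all.equal() returns TRUE or a character description of the
# difference, hence the isTRUE() wrapper.
approx_match <- make_matchfun(function(a, b) isTRUE(all.equal(a, b)))
pos <- approx_match(c(3.3 - 2.2, 5), c(2.2, 1.1))
```

With such a tolerance-based predicate, 3.3 − 2.2 and 1.1 fall into the same equivalence class, whereas identical() would keep them apart.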

Iterators
In addition to specifying matching functions, it is possible to change the order in which iterators such as as.list() (but not for(); see end of Section 3) process the elements. Note that the behavior of as.list() influences the labeling and print methods for customizable sets. Sets and generalized sets use the canonical internal ordering for iterations. With customizable sets, a "natural" ordering of elements can be kept by specifying either a permutation vector or an order function:

R> cset(letters[1:5], orderfun = 5:1)
{"e", "d", "c", "b", "a"}
R> FUN <- function(x) order(as.character(x), decreasing = TRUE)
R> Z <- cset(letters[1:5], orderfun = FUN)
R> print(Z)
{"e", "d", "c", "b", "a"}

Examples
In the following, we present two examples for the use of multisets and fuzzy multisets.

Multisets
Multisets are frequent in statistics since they can be seen as frequency tables of some objects.
Using the sets package, a "generalized" table can easily be constructed from a list of R objects using the as.gset() coercion function. Assume, e.g., that one samples a number of fourfold tables given the margins using r2dtable():

R> set.seed(4711)
R> l <- r2dtable(1000, r = 1:2, c = 2:1)

Since the sums of the first row and of the second column are constrained to 1, the top left cell entry can only be 0 or 1. Also, given the marginals, there is only one degree of freedom in fourfold tables, so the value of this first cell determines the others, and thus only two possible tables exist. To count them, we can simply use as.gset(), which will construct a multiset from the list:

R> s <- as.gset(l)
R> print(s)
{<<2x2 matrix>> [330], <<2x2 matrix>> [670]}

Replace the matrices by the first cells' values. The estimated probabilities of having 0 or 1 in the first cell can thus be obtained by:

R> gset_memberships(s) / 1000
   1    2
0.33 0.67

The probability for 0 clearly corresponds to the p value of the corresponding Fisher test.
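As a cross-check that uses only base R, the first cell of each sampled table can be tabulated directly and compared with the exact hypergeometric probabilities. The variable names are ours, and the simulated counts may differ slightly between R versions, so the sketch does not rely on specific frequencies.

```r
set.seed(4711)
l <- r2dtable(1000, r = 1:2, c = 2:1)

# Top left cell of every table; given the margins it is either 0 or 1.
first_cell <- vapply(l, function(m) m[1, 1], numeric(1))
counts <- table(first_cell)
estimated <- counts / length(l)

# Exact probabilities: the first row (size 1) is drawn from 2 items of
# column 1 and 1 item of column 2.
exact <- dhyper(0:1, m = 2, n = 1, k = 1)
```

The exact values are 1/3 and 2/3, in line with the estimates of 0.33 and 0.67 reported in the text.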

Fuzzy multisets
Fuzzy multisets can be used to represent objects appearing several times with different membership grades (e.g., weights, degrees of credibility, ...). Mizutani, Inokuchi, and Miyamoto (2008) describe an interesting application of fuzzy multisets to text mining: The occurrences of some terms of interest ("neural network", "fuzzy", "image") in titles, abstracts, and keywords of 30 documents on fuzzy theory are represented by fuzzy multisets, with varying memberships depending on whether a term occurs in the title (degree 1), the keywords (degree 0.6), and/or the abstract (degree 0.2).

R> data("fuzzy_docs")
R> print(fuzzy_docs[8:9])

This information is then used to compute distances between documents, and ultimately to compare several (non-linear) clustering methods regarding their abilities of recovering the true underlying structure. In fact, it is known that the first 12 documents are related to neural networks, and the remaining 18 to image processing.

In the following, we will perform simple hierarchical clustering. We start by computing a distance matrix for the 30 documents. The sets package implements the Jaccard dissimilarity, defined for two generalized sets X and Y as $1 - |X \cap Y| / |X \cup Y|$, where $|\cdot|$ denotes the cardinality for generalized sets. A corresponding dissimilarity matrix can be obtained using, e.g., the proxy package (Meyer and Buchta 2009).
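Independently of any package, the Jaccard dissimilarity defined here can be sketched in base R for plain fuzzy sets. The sketch assumes Zadeh's logic (minimum for intersection, maximum for union), models fuzzy sets as named membership vectors, and does not cover fuzzy multisets; the helpers `align()` and `jaccard_dissimilarity()` are hypothetical.

```r
# Hypothetical helper: extend a membership vector to a common universe,
# using membership 0 for missing elements.
align <- function(m, universe) {
  out <- setNames(numeric(length(universe)), universe)
  out[names(m)] <- m
  out
}

jaccard_dissimilarity <- function(mx, my) {
  u <- union(names(mx), names(my))
  x <- align(mx, u)
  y <- align(my, u)
  # Cardinality of a fuzzy set: sum of its memberships.
  1 - sum(pmin(x, y)) / sum(pmax(x, y))
}

X <- c(A = 0.2, B = 1.0)
Y <- c(B = 0.5, C = 0.3)
d_xy <- jaccard_dissimilarity(X, Y)  # 1 - 0.5 / 1.5
d_xx <- jaccard_dissimilarity(X, X)  # identical sets have distance 0
```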

Conclusion
In this paper, we described the sets package for R, providing infrastructure for sets and generalizations thereof such as fuzzy sets, multisets and fuzzy multisets. The fuzzy variants make use of a dynamic fuzzy logic infrastructure offering several fuzzy logic families. Generalized sets are further extended to allow for user-defined iterators and matching functions. Current work focuses on data structures and algorithms for relations, an important application of sets.

A. Available fuzzy logic families
Let us refer to $N(x) = 1 - x$ as the standard negation, and, for a t-norm $T$, let $S(x, y) = 1 - T(1 - x, 1 - y)$ be the dual (or complementary) t-conorm. Available specifications and corresponding families are as follows, with the standard negation used unless stated otherwise.
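The duality between a t-norm and its t-conorm can be checked numerically. The following base R sketch uses the minimum and product t-norms, both of which appear among the families of this appendix; the function names are ours and independent of the package's fuzzy logic infrastructure.

```r
# Two t-norms, vectorized in both arguments.
t_min <- function(x, y) pmin(x, y)
t_product <- function(x, y) x * y

# Dual t-conorm S(x, y) = 1 - T(1 - x, 1 - y) of a given t-norm.
dual <- function(tnorm) function(x, y) 1 - tnorm(1 - x, 1 - y)

s_max <- dual(t_min)          # should coincide with max(x, y)
s_probsum <- dual(t_product)  # should coincide with x + y - x * y

x <- c(0.3, 0.6, 1.0)
y <- c(0.5, 0.7, 0.4)
```

Because the dual is built generically, the same `dual()` call can be applied to any other vectorized t-norm.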
"Zadeh" Zadeh's logic with $T = \min$ and $S = \max$. Note that the minimum t-norm, also known as the Gödel t-norm, is the pointwise largest t-norm, and that the maximum t-conorm is the smallest t-conorm.
"drastic" The drastic logic with t-norm $T(x, y) = y$ if $x = 1$, $x$ if $y = 1$, and $0$ otherwise, and complementary t-conorm $S(x, y) = y$ if $x = 0$, $x$ if $y = 0$, and $1$ otherwise. Note that the drastic t-norm and t-conorm are the smallest t-norm and largest t-conorm, respectively.
"product" The family with the product t-norm $T(x, y) = xy$ and dual t-conorm $S(x, y) = x + y - xy$.
"Fodor" The family with Fodor's nilpotent minimum t-norm given by $T(x, y) = \min(x, y)$ if $x + y > 1$, and $0$ otherwise, and the dual t-conorm given by $S(x, y) = \max(x, y)$ if $x + y < 1$, and $1$ otherwise.
"Frank" The family of Frank t-norms $T_p$, $p \ge 0$, which gives the Zadeh, product and Łukasiewicz t-norms for $p = 0$, $1$, and $\infty$, respectively, and otherwise is given by $T(x, y) = \log_p\bigl(1 + (p^x - 1)(p^y - 1)/(p - 1)\bigr)$.

The following parametric families are obtained by combining the corresponding families of t-norms with the standard negation and complementary t-conorm.

"Dubois-Prade" The family of t-norms $T_p$, $0 \le p \le 1$, introduced by Dubois and Prade, which gives the minimum and product t-norms for $p = 0$ and $1$, respectively, and otherwise is given by $T_p(x, y) = xy / \max(x, y, p)$.

Figure 1: Membership plots for fuzzy sets. Top left: grouped barplot for a fuzzy multiset. Top right: graph of a bell curve. Bottom left: sequence of triangular functions. Bottom right: two combinations of a normal and a trapezoid function (dotted lines: basic shapes; solid (red) line: union; dashed (green) line: arithmetic mean).

Figure 3: Dendrogram for the fuzzy_docs data, using Ward's method on a term-document matrix generated for the data.