
Cleaning uncertain graphs via noisy crowdsourcing

World Wide Web

Abstract

The uncertain graph is an important data model for many real-world applications. To answer queries on uncertain graphs, the edges in these graphs are associated with existential probabilities that represent the likelihood that each edge exists. Almost all work in this area focuses on improving the efficiency of query processing. However, another issue deserves attention: the query results from uncertain graphs are sometimes uninformative because of the edge uncertainty. We adopt a crowdsourcing-based approach to make the query results more informative. To save the monetary and time cost of crowdsourcing, we should select the optimal edges to clean so as to maximize the quality improvement. However, noise in the crowdsourcing results makes the problem more complex. We prove that the problem is #P-hard and propose an efficient algorithm to derive the optimal edge. Our experimental results show that the proposed algorithm outperforms random selection by up to 22 times in quality improvement and is up to 5 times faster in elapsed time than the each-edge-comparison approach, which shows that the algorithm is both effective and efficient.


Notes

  1. https://www.mturk.com/

  2. https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge

  3. https://www.nature.com/articles/nature04670

  4. http://mips.helmholtz-muenchen.de/genre/proj/mpact/

References

  1. Aggarwal, C.C.: Managing and mining uncertain data. Springer, US (2009)

  2. Ball, M.O.: Computational complexity of network reliability analysis: an overview. IEEE Trans. Reliab. 35(3), 230–239 (1986)

  3. Brabham, D.C.: Crowdsourcing as a model for problem solving: an introduction and cases. Convergence: The International Journal of Research into New Media Technologies 14(1), 75–90 (2008)

  4. Chen, M., Gu, Y., Bao, Y., Yu, G.: Label and distance-constraint reachability queries in uncertain graphs. In: Database Systems for Advanced Applications, pp 188–202. Springer International Publishing, Cham (2014)

  5. Cheng, J., Huang, S., Wu, H., Fu, W.C.: TF-label: a topological-folding labeling scheme for reachability querying in a large graph. In: ACM SIGMOD International Conference on Management of Data, pp. 193–204 (2013)

  6. Cheng, R.: Querying and cleaning uncertain data. Springer, Berlin (2009)

  7. Cheng, R., Chen, J., Xie, X.: Cleaning uncertain data with quality guarantees. Proceedings of the VLDB Endowment 1(1), 722–735 (2008)

  8. Doan, A.H., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the world-wide Web. Commun. ACM 54(4), 86–96 (2011)

  9. Fishman, G.S.: A comparison of four Monte Carlo methods for estimating the probability of s-t connectedness. IEEE Trans. Reliab. 35(2), 145–155 (1986)

  10. Jin, R., Hong, H., Wang, H., Ning, R., Xiang, Y.: Computing label-constraint reachability in graph databases. In: ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June, pp. 123–134 (2010)

  11. Jin, R., Liu, L., Ding, B., Wang, H.: Distance-constraint reachability computation in uncertain graphs. Very Large Data Bases 4(9), 551–562 (2011)

  12. Jin, R., Liu, L., Ding, B., Wang, H.: Distance-constraint reachability computation in uncertain graphs. Proceedings of the VLDB Endowment 4(9), 551–562 (2011)

  13. Karp, R.M., Luby, M.G.: A new monte-carlo method for estimating the failure probability of an (1983)

  14. Khan, A., Chen, L.: On uncertain graphs modeling and queries. VLDB Endowment (2015)

  15. Krogan, N.J., Cagney, G., Yu, H., Zhong, G., Guo, X., Ignatchenko, A., Li, J., Pu, S., Datta, N., Tikuisis, A.P.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084), 637–643 (2006)

  16. Lin, X., Xu, J., Hu, H.: Range-based skyline queries in mobile environments. IEEE Trans. Knowl. Data Eng. 25(4), 835–849 (2013)

  17. Lin, X., Peng, Y., Choi, B., Xu, J.: Human-powered data cleaning for probabilistic reachability queries on uncertain graphs. IEEE Trans. Knowl. Data Eng. 29(7), 1452–1465 (2017)

  18. Marcus, A., Wu, E., Karger, D., Madden, S., Miller, R.: Human-powered sorts and joins. Proceedings of the VLDB Endowment 5(1), 13–24 (2011)

  19. Mo, L., Cheng, R., Li, X., Cheung, D.W.: Cleaning uncertain data for top-k queries. In: IEEE International Conference on Data Engineering, pp. 134–145 (2013)

  20. Niedermayer, J., Emrich, T., Renz, M., Mamoulis, N., Chen, L., Kriegel, H.P.: Probabilistic nearest neighbor queries on uncertain moving object trajectories. Proceedings of the VLDB Endowment 7(3), 205–216 (2013)

  21. Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive skyline computation in database systems. ACM Trans. Database Syst. 30(1), 41–82 (2005)

  22. Ruoming Jin, Lin Liu, B.H.: Distance constraint reachability computation in. PVLDB 4(9), 2011 (2012)

  23. Solecki, B., Solecki, B., Solecki, B.: Kdd cup 2013 - author-paper identification challenge: second place team. In: Kdd Cup 2013 Workshop, pp. 3 (2013)

  24. Soliman, M.A., Ilyas, I.F., Chang, C.C.: Top-k query processing in uncertain databases. In: IEEE International Conference on Data Engineering, pp. 896–905 (2007)

  25. Tao, Y., Xiao, X., Pei, J.: Efficient skyline and top-k retrieval in subspaces. IEEE Trans. Knowl. Data Eng. 19(8), 1072–1088 (2007)

  26. Tong, Y., Chen, L., Cheng, Y., Yu, P.S.: Mining frequent itemsets over uncertain databases. Proceedings of the VLDB Endowment 5(11), 1650–1661 (2012)

  27. Tong, Y., Chen, L., Ding, B.: Discovering threshold-based frequent closed itemsets over probabilistic data. In: IEEE International Conference on Data Engineering, pp. 270–281 (2012)

  28. Tong, Y., Cao, C.C., Zhang, C.J., Li, Y.: Crowdcleaner: Data cleaning for multi-version data on the Web via crowdsourcing. In: IEEE International Conference on Data Engineering, pp. 1182–1185 (2014)

  29. Verroios, V., Garcia-Molina, H.: Entity resolution with crowd errors. In: IEEE International Conference on Data Engineering, pp. 219–230 (2015)

  30. Wang, J., Li, G., Kraska, T., Franklin, M.J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: ACM SIGMOD International Conference on Management of Data, pp. 229–240 (2013)

  31. Widom, J., Agrawal, A.P., Benjelloun, O., Ch, A., Chaumond, J., Murthy, R., Mutsuzaki, M., Sugihara, T., Theobald, M.: Chapter 5 trio: A system for data, uncertainty, and lineage (2013)

  32. Xu, K., Zou, L., Yu, J.X., Chen, L., Xiao, Y., Zhao, D.: Answering label-constraint reachability in large graphs. In: ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October, pp. 1595–1600 (2011)

  33. Zhang, C.J., Chen, L., Jagadish, H.V., Cao, C.C.: Reducing uncertainty of schema matching via crowdsourcing. Proceedings of the VLDB Endowment 6(9), 757–768 (2013)

  34. Zhang, C.J., Chen, L., Tong, Y., Liu, Z.: Cleaning uncertain data with a noisy crowd. In: IEEE International Conference on Data Engineering, pp 6–17 (2015)


Acknowledgments

This research is funded by NSFC (No. 61773167) and the Natural Science Foundation of Shanghai (No. 17ZR1444900).

Author information

Corresponding author

Correspondence to Yan Yang.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Web and Big Data

Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao

Appendix

1.1 New Reachability

The reachability when the crowd returns ‘yes’ is

$$ R_{G,q}^{y}=R_{G,q}+P_{q,e}^{*}({p_{e}^{y}}-p_{e}) $$
(24)

and the reachability when the crowd returns ‘no’ is

$$ R_{G,q}^{n}=R_{G,q}+P_{q,e}^{*}({p_{e}^{n}}-p_{e}) $$
(25)

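where \({p_{e}^{y}}\) is the new probability of edge e if the crowd’s answer is ‘yes’, \({p_{e}^{n}}\) is defined analogously for the answer ‘no’, and \(R_{G,q}\) is the reachability before cleaning (written \(R_{G,q}^{0}\) in the proof below).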

Proof

First, we divide the possible graphs of G into two parts: those containing e (\(G_{e}\)) and those without e (\(G_{\overline {e}}\)). For every \(pg \in PG_{G_{e}}\), there is a corresponding \(pg^{\prime } \in PG_{G_{\overline {e}}}\) such that the edge sets \(E_{pg} \subseteq E\) and \(E_{pg^{\prime }} \subseteq E\) are identical except that \(E_{pg^{\prime }}\) does not contain e. Then, we have

$$ Pr(pg^{\prime})=\frac{(1-p_{e})Pr(pg)}{p_{e}} $$
(26)

By (2), \(R_{G,q}^{0}\) can also be represented as:

$$ R_{G,q}^{0}=\sum\limits_{pg \in PG_{G_{e}}} \left( Pr(pg)r_{q}^{pg} + Pr(pg^{\prime})r_{q}^{pg^{\prime}} \right) $$
(27)

Furthermore, \(G_{e}\) can be divided into \(G(P_{e},P_{\overline e})\) and \(G_{e}-G(P_{e},P_{\overline e})\). Obviously, for \(pg \in PG_{G(P_{e}, P_{\overline e})}\), \(r_{q}^{pg}= 1\) and \(r_{q}^{pg^{\prime }}= 0\). For \(pg \in PG_{G_{e}-G(P_{e}, P_{\overline e})}\), since whether e exists or not does not influence \(r_{q}\), we have \(r_{q}^{pg}=r_{q}^{pg^{\prime }}\).

Then, for \(pg \in PG_{G_{e}}\), the new graph probabilities are \(Pr_{e \rightarrow y}(pg) = \frac {{p_{e}^{y}} Pr(pg)}{p_{e}}\) and \(Pr_{e \rightarrow n}(pg) = \frac {{p_{e}^{n}} Pr(pg)}{p_{e}}\), where \(e \rightarrow y\) (\(e \rightarrow n\)) denotes that edge e is cleaned toward existence (nonexistence). For the corresponding \(pg^{\prime } \in PG_{G_{\overline e}}\), \(Pr_{e \rightarrow y}(pg^{\prime }) = \frac {(1-{p_{e}^{y}}) Pr(pg)}{p_{e}}\) and \(Pr_{e \rightarrow n}(pg^{\prime }) = \frac {(1-{p_{e}^{n}}) Pr(pg)}{p_{e}}\). Hence, for \(pg \in PG_{G_{e}}\),

$$\begin{array}{@{}rcl@{}} Pr(pg) + Pr(pg^{\prime}) &=& Pr_{e \rightarrow y}(pg) + Pr_{e \rightarrow y}(pg^{\prime})\\ &=& Pr_{e \rightarrow n}(pg) + Pr_{e \rightarrow n}(pg^{\prime}) \end{array} $$

From the above analysis, for \(pg \in PG_{G_{e}-G(P_{e}, P_{\overline e})}\), where \(r_{q}^{pg}=r_{q}^{pg^{\prime }}\), we have

$$\begin{array}{@{}rcl@{}} && Pr(pg)r_{q}^{pg} + Pr(pg^{\prime})r_{q}^{pg^{\prime}}\\ &=& Pr_{e \rightarrow y}(pg)r_{q}^{pg} + Pr_{e \rightarrow y}(pg^{\prime})r_{q}^{pg^{\prime}}\\ &=& Pr_{e \rightarrow n}(pg)r_{q}^{pg} + Pr_{e \rightarrow n}(pg^{\prime})r_{q}^{pg^{\prime}} \end{array} $$
(28)

For \(pg \in PG_{G(P_{e}, P_{\overline e})}\), we have

$$ Pr(pg)r_{q}^{pg} + Pr(pg^{\prime})r_{q}^{pg^{\prime}}=Pr(pg) $$
(29)
$$ Pr_{e \rightarrow y}(pg)r_{q}^{pg} + Pr_{e \rightarrow y}(pg^{\prime})r_{q}^{pg^{\prime}} = \frac{{p_{e}^{y}} Pr(pg)}{p_{e}} $$
(30)
$$ Pr_{e \rightarrow n}(pg)r_{q}^{pg} + Pr_{e \rightarrow n}(pg^{\prime})r_{q}^{pg^{\prime}} = \frac{{p_{e}^{n}} Pr(pg)}{p_{e}} $$
(31)

Then, summing over all \(pg \in PG_{G_{e}}\), \(R_{G,q}^{0}\) is the sum of (29) and the first line of (28); \(R_{G,q}^{y}\) is the sum of (30) and the second line of (28); and \(R_{G,q}^{n}\) is the sum of (31) and the third line of (28). Since the three lines of (28) are equal, only the terms from (29)–(31) differ. Therefore,

$$\begin{array}{@{}rcl@{}} R_{G,q}^{y}-R_{G,q}^{0} &= &\sum\limits_{pg \in PG_{G(P_{e},P_{\overline e})}} \frac{({p_{e}^{y}} -p_{e})Pr(pg)}{p_{e}}\\ &=& ({p_{e}^{y}}-p_{e})P_{q,e}^{*}\\ R_{G,q}^{n}-R_{G,q}^{0} &= &\sum\limits_{pg \in PG_{G(P_{e},P_{\overline e})}} \frac{({p_{e}^{n}} -p_{e})Pr(pg)}{p_{e}}\\ &= &({p_{e}^{n}}-p_{e})P_{q,e}^{*} \end{array} $$
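The identities (24) and (25) can be checked numerically on a small graph. The following is a minimal sketch, not the authors' implementation: it enumerates every possible graph of a three-edge toy example, and it assumes that \(P_{q,e}^{*}\) can be read as the probability, over the other edges, that the target is reachable with e present but not with e absent. The graph, the node names and the helper functions are all hypothetical.

```python
from itertools import product

def reachable(edges_present, s, t):
    """Depth-first search over the directed edges present in one possible graph."""
    adj = {}
    for u, v in edges_present:
        adj.setdefault(u, []).append(v)
    stack, seen = [s], {s}
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def reachability(probs, s, t):
    """Exact reachability by enumerating all 2^|E| possible graphs."""
    edges = list(probs)
    total = 0.0
    for mask in product([0, 1], repeat=len(edges)):
        pr = 1.0
        for e, bit in zip(edges, mask):
            pr *= probs[e] if bit else 1.0 - probs[e]
        present = [e for e, bit in zip(edges, mask) if bit]
        if reachable(present, s, t):
            total += pr
    return total

def pivotal_probability(probs, e, s, t):
    """Assumed reading of P*: reachability with e forced present minus with e forced absent."""
    with_e = dict(probs)
    with_e[e] = 1.0
    without_e = dict(probs)
    without_e[e] = 0.0
    return reachability(with_e, s, t) - reachability(without_e, s, t)

# Hypothetical toy graph: two routes from s to t, one of them through edge e = (s, a).
probs = {("s", "a"): 0.5, ("a", "t"): 0.8, ("s", "t"): 0.4}
e = ("s", "a")
r0 = reachability(probs, "s", "t")
p_star = pivotal_probability(probs, e, "s", "t")

# Suppose a 'yes' answer raises p_e from 0.5 to 0.9 (illustrative value only).
p_yes = 0.9
updated = dict(probs)
updated[e] = p_yes
r_yes_direct = reachability(updated, "s", "t")       # recomputed from scratch
r_yes_formula = r0 + p_star * (p_yes - probs[e])     # right-hand side of (24)
```

In this example both `r_yes_direct` and `r_yes_formula` come out to about 0.832, so the update needs only \(R_{G,q}\) and \(P_{q,e}^{*}\) rather than a second enumeration.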

1.2 NECP is #P-hard

Proof

To prove that NECP is #P-hard, we reduce ECP to NECP; Lin et al. [17] have proven that ECP is #P-hard. ECP corresponds to the situation where the crowd is always accurate.

In detail, for B to-clean edges, there are \(2^{B}\) cleaning results in total: \(CS=\{000...000,000...001,000...011,...,111...111\}\), each element of which is a B-bit sequence where each bit represents the cleaning result of the corresponding edge. For simplicity, we assume that all edges of an uncertain graph G have the same existential probability p.

For ECP, the expected query result quality after cleaning is

$$ Q=(1-p)^{B} Q_{1} + (1-p)^{B-1} p Q_{2} + (1-p)^{B-2} p^{2} Q_{3} + ... + p^{B} Q_{2^{B}} $$
(32)

where \(Q_{1}\), \(Q_{2}\), \(Q_{3}\), ..., \(Q_{2^{B}}\) are the query result qualities corresponding to the elements of CS.

For NECP, the expected query result quality after cleaning is

$$\begin{array}{@{}rcl@{}} Q^{N} & =&(1-Pr(C_{r}))^{B} Q_{1} + (1-Pr(C_{r}))^{B-1} Pr(C_{r}) Q_{2}\\ &&+ (1-Pr(C_{r}))^{B-2} (Pr(C_{r}))^{2} Q_{3} + ... + (Pr(C_{r}))^{B} Q_{2^{B}} \end{array} $$
(33)

Simplifying (5), we have

$$\begin{array}{@{}rcl@{}} Pr(C_{r})&=&(2P_{c} - 1)p + 1-P_{c}\\ 1-Pr(C_{r})&=&-(2P_{c} - 1)p + P_{c} \end{array} $$

We can see that \(Pr(C_{r})\) is a linear expression in p; in particular, setting \(P_{c}=1\) gives \(Pr(C_{r})=p\), so that \(Q^{N}=Q\). Therefore, being able to compute \(Q^{N}\) implies being able to compute Q, which shows that solving NECP is \(\#\)P-hard. □
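The relationship between (32) and (33) can be illustrated with a short sketch. It assumes that \(P_{c}\) denotes the crowd's accuracy, uses the linear form of \(Pr(C_{r})\) given above, and enumerates the \(2^{B}\) cleaning results with their weights. The function names and the toy quality function are hypothetical stand-ins; the real \(Q_{i}\) would come from the query-quality metric defined in the paper.

```python
from itertools import product

def prob_yes(p, pc):
    """Pr(C_r) = (2*Pc - 1)*p + 1 - Pc, linear in the edge probability p."""
    return (2 * pc - 1) * p + 1 - pc

def expected_quality(num_edges, prob, quality):
    """Weighted sum over all 2^B cleaning results, as in (32) and (33).

    `prob` is the per-edge probability of a 1 bit: p for ECP, Pr(C_r) for NECP.
    `quality` maps a tuple of bits to the quality Q_i of that cleaning result.
    """
    total = 0.0
    for bits in product([0, 1], repeat=num_edges):
        weight = 1.0
        for b in bits:
            weight *= prob if b else 1.0 - prob
        total += weight * quality(bits)
    return total

def toy_quality(bits):
    """Hypothetical quality: the number of edges cleaned to 'exists'."""
    return float(sum(bits))

B, p = 3, 0.3
q_ecp = expected_quality(B, p, toy_quality)                          # (32)
q_necp_exact = expected_quality(B, prob_yes(p, 1.0), toy_quality)    # (33) with Pc = 1
q_necp_noisy = expected_quality(B, prob_yes(p, 0.8), toy_quality)    # (33) with Pc = 0.8
```

With a perfectly accurate crowd (`pc = 1.0`) the NECP value coincides with the ECP value, which is the special case the reduction relies on; with `pc = 0.8` the weights shift because `prob_yes(0.3, 0.8)` is 0.38 rather than 0.3.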

1.3 Calculating \(P_{q,e}^{*}\)

Proposition 2: Calculating \(P_{q,e}^{*}\) can be reduced to calculating reachability.

Proof

Calculating the reachability \(R_{q}\) in (3) in theory requires enumerating all \(sg \in SG_{G,q}\); similarly, calculating \(P_{q,e}^{*}\) requires enumerating all \(sg \in SG_{G(P_{e},P_{\overline e}),q}\). Identifying \(SG_{G(P_{e},P_{\overline e})}\) is impractical, and is even more troublesome than enumerating \(SG_{G,q}\).

In Example 2, we mentioned that the Monte Carlo method can approximate \(Pr(p_{1} \vee p_{2} \vee p_{3} \vee ... \vee p_{n})\) (assuming \(n=|AP|\)). Similarly, the Monte Carlo method is also applicable to calculating \(P_{q,e}^{*}\). First, we denote the set of all paths passing through (not passing through) edge e by \(AP_{e}\) (\(AP_{\overline e}\)). Then, we have

$$ \sum\limits_{sg \in SG_{G(P_{e},P_{\overline e})}} Pr(sg)=Pr\left(\bigvee_{p \in AP_{e}} p\right)\left(1-Pr\left(\bigvee_{p \in AP_{\overline e}} p\right)\right) $$
(34)

By (34), we only need to compute two parts separately: \(Pr(\bigvee _{p \in AP_{e}} p)\) and \(Pr(\bigvee _{p \in AP_{\overline e}} p)\). Calculating each of them is equivalent to calculating reachability. □
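A Monte Carlo reachability estimator of the kind referred to above can be sketched as follows. This is an illustrative sampler rather than the paper's procedure: it draws possible graphs edge by edge and counts how often the target is reached. The second helper estimates a pivotal probability for e by forcing the edge present and absent under the same random seed, which is a substitute technique and not the decomposition in (34). All names and the toy graph are hypothetical.

```python
import random

def mc_reachability(probs, s, t, samples=20000, seed=0):
    """Estimate the s-t reachability by sampling possible graphs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Sample one possible graph: keep each edge independently with its probability.
        adj = {}
        for (u, v), p in probs.items():
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        # Depth-first search from s in the sampled graph.
        stack, seen = [s], {s}
        found = False
        while stack:
            u = stack.pop()
            if u == t:
                found = True
                break
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        hits += found
    return hits / samples

def mc_pivotal(probs, e, s, t, samples=20000, seed=0):
    """Estimate how often e decides reachability: e forced present minus e forced absent.

    Both runs share the seed, so they sample the same states for the other edges.
    """
    with_e = dict(probs)
    with_e[e] = 1.0
    without_e = dict(probs)
    without_e[e] = 0.0
    return (mc_reachability(with_e, s, t, samples, seed)
            - mc_reachability(without_e, s, t, samples, seed))

# Hypothetical toy graph; its exact reachability is 0.64 and e = (s, a) is pivotal with probability 0.48.
probs = {("s", "a"): 0.5, ("a", "t"): 0.8, ("s", "t"): 0.4}
estimate = mc_reachability(probs, "s", "t")
pivotal_estimate = mc_pivotal(probs, ("s", "a"), "s", "t")
```

The estimates converge at the usual \(O(1/\sqrt{N})\) rate in the number of samples N, so the sample count trades accuracy against running time.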

About this article

Cite this article

Wu, Y., Lin, X., Yang, Y. et al. Cleaning uncertain graphs via noisy crowdsourcing. World Wide Web 22, 1523–1553 (2019). https://doi.org/10.1007/s11280-018-0624-8


  • DOI: https://doi.org/10.1007/s11280-018-0624-8
