Covers the design, analysis, and modeling of social and information networks, including their applications for on-line information access, communication, and interaction, and their roles as datasets in the exploration of questions in these and other domains, including connections to the social and biological sciences. Analysis and modeling of such networks includes topics in ACM Subject classes F.2, G.2, G.3, H.2, and I.2; applications in computing include topics in H.3, H.4, and H.5; and applications at the interface of computing and other disciplines include topics in J.1--J.7. Papers on computer communication systems and network protocols (e.g. TCP/IP) are generally a closer fit to the Networking and Internet Architecture (cs.NI) category.
Moltbook is a social media platform in which posts and comments are authored exclusively by autonomous AI agents. We present the Moltbook Observatory Archive, an incremental dataset that passively records agent profiles, posts, comments, community metadata ("submolts"), platform-level time-series snapshots, and word-frequency trend aggregates obtained by continuously polling the Moltbook API. Data are stored in a live SQLite observatory database and exported as date-partitioned Parquet files to enable efficient analysis and reproducible research. The documented release covers 78 days of platform activity (2026-01-27 to 2026-04-14) and contains 2,615,098 posts and 1,213,007 comments from 175,886 unique posting agents across 6,730 communities. This is, to our knowledge, the first large-scale observational dataset of a social network populated exclusively by autonomous AI agents. The archive is intended to support research on multi-agent communication, emergent social behavior, and safety-relevant phenomena in agent-only online environments, and it is released under the MIT license with code for collection and export.
It is a fundamental question in epidemiology to estimate, model and predict the growth rate of a pandemic. Analogously, analysing the diffusion of innovation, (fake) news, memes, and rumours is of key importance in the social sciences. The resulting epidemic growth curves can be classified according to their growth rates. These have been found to range from exponential to both faster super-exponential curves and slower subexponential or polynomial curves. Previous research has lacked a unified explanatory framework capable of accommodating super-exponential, (stretched) exponential, and polynomial growth patterns within the same contact network. In this paper we propose a simple agent-based network model that can capture all these phases. We provide such a framework by modelling how transmission rates depend on spatial distance and on individuals' numbers of contacts. By comparing the growth rate of spreading processes with or without degree-dependent and/or distance-dependent contact rates through data-driven and synthetic simulations on real and modelled networks with underlying geometry, we find evidence that even a 'sublinear presence' of these causes may significantly slow the growth rate on the same underlying network. We find that the growth rate is governed by a combination of three factors: geometry, the prevalence of weak ties, and superspreaders. We confirm our results with rigorous proofs in a theoretical model, using a spatial multiscale argument in long-range heterogeneous first passage percolation. Our results give a plausible explanation of why the consecutive waves of a single pandemic can differ in their growth even if their spreading mechanisms are similar.
Exploring similar nodes in attributed networks represents a key challenge in data mining. While recent representation learning methods embed networks into low-dimensional vectors, they often implicitly assume a uniform and continuous feature space. This paper proposes a visual analytics approach using dimensionality reduction to help clarify the true topological structure of high-dimensional feature spaces formed by nodes' neighborhood attribute profiles. Analyzing inter-firm transaction networks indicates that structural roles can form complex, non-linear manifolds with density biases. Comparing this feature space with industry classifications suggested: (1) supply chain hierarchies transition continuously; (2) categories treated identically under general semantics can be clearly separated by actual transaction networks; and (3) a single industry label may fragment into multiple regions. These findings suggest potential limitations in assuming identical semantics imply similar structural roles and highlight the possible need for new similarity metrics aligned with manifold topology.
Matter seeks to resolve longstanding interoperability problems in the Internet of Things (IoT), yet little is known about how developers experience the standard in day-to-day work. This paper examines over 13,000 issues from the official Project CHIP GitHub repository to understand the kinds of problems contributors report when implementing and integrating Matter. Using topic modeling and qualitative analysis, we identify four recurring areas of concern (Testing, Interoperability, Development, and Platform and Network) and describe how they manifest in the evolution of the codebase and tooling. The findings reveal systematic technical and integration challenges and point to concrete opportunities to refine Matter's test infrastructure, cross-vendor guidance, and documentation as the standard continues to mature.
Analysis of yearly peridynamics co-authorship graphs links a post-2019 deviation in network trends to pandemic effects on researcher ties.
Peridynamics is a fast growing field of continuum mechanics, especially developed for the modeling and simulation of fracture problems, initiated a quarter century ago. In this study, we analyze the evolution of the peridynamics community since its inception in terms of publication co-authorship. For this purpose, we construct a peridynamics co-authorship network for each year from 2000 to 2024 and perform network analysis based on selected metrics. Nodes represent scientists, and links connect co-authoring scientists with link weights representing the number of co-authorships (based on the total number of co-authors per publication). Network-level metrics are used to quantify the evolution of the field, and node-level metrics are used to identify trends in the most collaborative scientists in peridynamics. We noticed a deviation in network trends that occurred in the years since 2019, and we subsequently performed a country-based analysis with insights about the impact of the COVID-19 pandemic on the evolution of the peridynamics co-authorship network.
Human-annotated data remains foundational for machine learning and social media analysis. However, traditional data collection often relies on cumbersome pipelines that isolate content from its original source, compromising ecological validity. To address these challenges, we present Social-Annotate, a flexible browser extension that facilitates direct data collection on online platforms. By injecting customizable forms into webpages, the tool captures annotations while users interact with the native environment. Social-Annotate offers non-technical users a no-code design interface for the survey forms. Since injecting custom elements directly into host platforms creates a brittle dependency on evolving interfaces, we integrate a self-healing agent powered by large language models. This automated pipeline autonomously detects structural changes, regenerates valid target selectors, and validates them within a live browser environment. Our extensible platform readily supports 12 platforms, including social media such as $\mathbb{X}$, Instagram, and TikTok and the P2P messaging platforms WhatsApp and Telegram. Social-Annotate significantly reduces data collection overhead and developer maintenance, enabling researchers of all technical backgrounds to focus on data analysis rather than engineering. Moreover, Social-Annotate provides an ecosystem for conducting intervention studies through dynamic content manipulation.
Classical models of opinion dynamics represent individual opinions as scalar or vector values governed by the classical probability theory, either as deterministic quantities or random variables. This framework does not account for empirically observed phenomena such as cognitive ambivalence (where an individual simultaneously holds conflicting views) and order effects (where survey responses depend on the order in which questions are asked). We propose a quantum model of opinion dynamics in which each agent's cognitive state is represented by a density matrix that encodes both the expressed opinion and cognitive ambivalence. Survey questions become non-commuting self-adjoint operators, which provides a principled explanation for order effects. Our model also identifies quantities without classical counterparts, including quantum coherence and pairwise opinion covariances. Under a product state approximation, the quantum model reduces to the classical Friedkin--Johnsen opinion model. We test the framework on synthetic and real-world networks and observe that pairwise correlations follow network-dependent transient dynamics but converge to the same steady state regardless of the network, and that quantum coherence decays exponentially at a rate independent of the network.
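For readers unfamiliar with the classical limit mentioned in this abstract, the standard Friedkin-Johnsen update is x(t+1) = Λ W x(t) + (I - Λ) x(0), where W is a row-stochastic influence matrix and Λ is a diagonal matrix of susceptibilities. The following is a minimal sketch of that classical update only; the function name and the toy inputs are illustrative assumptions, and nothing here reproduces the paper's quantum model.

```python
def friedkin_johnsen(W, susceptibility, x0, steps):
    """Iterate the classical Friedkin-Johnsen opinion update.

    W              -- row-stochastic influence matrix (list of lists)
    susceptibility -- per-agent weight in [0, 1] on neighbours' opinions
    x0             -- initial (innate) opinions
    steps          -- number of synchronous updates
    """
    n = len(x0)
    x = list(x0)
    for _ in range(steps):
        x = [
            susceptibility[i] * sum(W[i][j] * x[j] for j in range(n))
            + (1 - susceptibility[i]) * x0[i]
            for i in range(n)
        ]
    return x


# Hypothetical two-agent example: agent 0 is fully stubborn,
# agent 1 is fully susceptible and averages both opinions.
W = [[0.5, 0.5], [0.5, 0.5]]
opinions = friedkin_johnsen(W, [0.0, 1.0], [1.0, 0.0], steps=60)
```

In this toy case the stubborn agent keeps its innate opinion and the susceptible agent is drawn toward it over time.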
We investigate the emergence of structural disparities in networks of collaborating large language model (LLM) agents. When LLM agents autonomously choose collaborators, the resulting communication network exhibits preferential-attachment dynamics: agents that are already prominent become increasingly likely to attract additional connections. In some cases, weaker LLM agents (agents with a smaller base model or an older version) can disproportionately occupy central and influential network positions relative to stronger LLM agents. We interpret this as a type-dependent glass-ceiling effect (GCE). We model the network of LLM agents as a time-evolving sequence of directed weighted graphs, where the vector-valued edge weights represent cumulative tokens exchanged, number of interaction rounds, and reasoning effort. Using a contraction mapping argument on the mean-field dynamics, we prove that the importance (centrality) of each agent type converges to a unique stable equilibrium. To ground the model in LLM decision mechanisms, we introduce a cross-attention-inspired utility for collaborator selection. This utility specifies the local connection dynamics and, together with the mean-field model, yields a predictive characterization of the limiting network structure and its type-dependent centrality gaps. To validate the theory, we develop an experimental testbed with 100 LLM agents. Our experiments show that autonomous network formation can generate persistent centrality disparities, with their magnitude and direction depending on model family, model size, system-prompt design, and task context. They further show that the effect of preferential attachment depends on its alignment with model capability: reinforcing it improves collective performance when stronger agents become central, whereas weakening it improves performance when network dynamics instead favor weaker agents.
Recent advances in AI have heightened scholars' and policy makers' concern with social influence and behavioral contagion in online communities. We conduct a field experiment on Reddit to investigate the extent to which online users are susceptible to positive behavioral stimuli from other users and artificial agents. We let apparent human and bot accounts give symbolic awards to users with one of four rationales: praising the recipient's logical argument, emotional sensitivity, or moral integrity, or explaining that the award resulted from a random draw in a lottery. We evaluate how the different rationales for the award affect the recipients' subsequent behavior on the platform in terms of volume, impact, and content, as well as the further behavioral contagion to other users. We find that awards do not increase user activity and downstream impact, and awards from bots with the lottery rationale can in fact reduce them. Nevertheless, awards encourage direct communication between users. These findings highlight the possible resilience of online users to simple behavioral manipulation from platform algorithms and artificial agents, but not necessarily to more sophisticated schemes that simulate human conversation. Transparently labeling automated agents remains essential for ethical and effective platform governance.
Self-supervised Continual Graph Learning (CGL) aims to successively learn from a graph sequence with different tasks without label supervision - a paradigm that has attracted widespread attention. Most existing self-supervised CGL methods rely on instance-level consistency objectives that enforce stability of individual node (or node-pair) embeddings. Due to optimizing nodes in isolation, these methods fail to maintain global relational structure, causing inter-node correspondences to progressively distort under continual learning. To this end, we propose a novel Structure-Aware Optimal Transport (SAOT) framework that explicitly captures and preserves relational structure within graph representations across sequential tasks. Specifically, SAOT leverages optimal transport theory to capture global inter-node correspondences, thereby facilitating and enhancing graph representation learning. Simultaneously, SAOT incorporates a cross-task knowledge distillation mechanism to preserve the previous structural knowledge. Extensive experiments on four CGL benchmark datasets demonstrate that SAOT outperforms existing self-supervised baselines. In particular, SAOT achieves significant performance gains, improving average accuracy by up to 5% on CoraFull-CL and over 15% on Products-CL compared with state-of-the-art methods in the Class-IL setting.
Social media algorithms allocate users' visibility by ranking content within their social networks. Yet, how recommendation logic and network structure jointly shape visibility across content and creators remains largely understudied. In this work, we tackle this question through agent-based simulations using YSocial, a social media virtual twin, in which agents interact under 7 recommendation strategies and 2 network topologies. We find that recommender logic sets the visibility regime: popularity creates a reinforcement loop in which early reactions increase later exposure, concentrating visibility on a small subset of content and limiting creator visibility to those whose content enters this loop, while collaborative filtering distributes visibility broadly across the active catalogue and user base. When the follower graph shapes candidate selection, network structure changes the direction of inequality: under popularity ranking, creator-level concentration becomes comparable to global popularity, but visibility is systematically redirected toward creators who are already socially popular. Network topology modulates the magnitude of these effects without changing their qualitative ordering. These results show that visibility allocation should be evaluated across content, creators, network position, and temporal reinforcement, and that controlled simulations can help test how feed design distributes visibility before deployment.
Understanding how information propagation affects epidemic dynamics has become an emerging topic of interest. However, the influence of interpersonal relationship heterogeneity on information acquisition and disease transmission has been largely overlooked. In this work, we introduce a hypergraph structure for Cyber-Physical Systems (CPSs) with two distinct layers. The upper layer, referred to as the cyber layer, consists of a mixed hypergraph, capturing both pairwise propagation and higher-order diffusion of epidemic-related information. The lower layer, referred to as the physical layer, employs a Susceptible-Infected-Susceptible (SIS) process to capture epidemic spreading. This work introduces an adaptive perception-protection mechanism based on Jaccard similarity, which accounts for interpersonal heterogeneity. In this mechanism, individuals receive information based on their relationships with neighbors and take protective measures accordingly. We analyze the impact of interpersonal relationships and the adoption of neighborhood-based self-protection strategies on epidemic dynamics. Furthermore, we conduct a theoretical analysis based on the Microscopic Markov Chain Approach (MMCA), analytically derive the outbreak threshold, and confirm the results with extensive Monte Carlo (MC) simulations. The results show that stronger interpersonal relationships can promote information propagation, significantly increase the threshold for epidemic outbreaks, and effectively suppress the scale of the epidemic. The study provides theoretical support for designing epidemic control strategies considering interpersonal heterogeneity and improves the understanding of epidemic spreading on hypergraphs.
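The Jaccard similarity named in this abstract is the ratio of the size of the intersection of two sets to the size of their union. The sketch below applies it to the neighbour sets of two individuals; that choice of sets is an assumption made for illustration, since the abstract does not say which sets the paper compares, and the function name is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two collections.

    Returns 0.0 when both collections are empty (a convention chosen
    here to avoid dividing by zero).
    """
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical neighbour sets of two individuals in a contact network.
neighbours_u = {1, 2, 3}
neighbours_v = {2, 3, 4}
similarity = jaccard(neighbours_u, neighbours_v)  # 2 shared / 4 total
```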
Personalization algorithms determine what content users encounter on online platforms. Auditing these systems is difficult because independent auditors have only black-box access to the algorithms, while personalization depends on users' attributes, behavior, and evolving interaction histories. Existing auditing methods face a tradeoff: studies with real users capture realistic behavior but are costly and hard to control, whereas sock-puppet audits scale more easily but often rely on scripted behavior that limits realism. Beyond this, both approaches struggle to decouple user attributes from user behavior, limiting our ability to causally understand personalization. To address this gap, we introduce a framework for black-box audits of personalization algorithms using generative AI agents as behavioral engines for synthetic accounts. Each agent is instantiated with a fixed persona, grounded in demographic and political survey data, and interacts with a platform's content by reasoning about it and choosing actions. Because behavior is fixed within each persona while platform-visible signals such as age, gender, or location can be experimentally perturbed, our design enables counterfactual auditing of how platforms respond to user attributes. As a case study, we deploy 1,120 agents on X shortly after the 2024 U.S. election, spanning 14 personas and three counterfactual conditions, collecting over 200,000 content exposures. We find that X's algorithmic feed amplifies toxic, polarizing, political, and right-leaning content relative to the chronological feed, with amplification varying sharply by user ideology. Counterfactual analyses show that demographic signals affect content delivery in persona-dependent ways: pooled effects are largely null, while subgroup-level effects vary in direction and magnitude. Our work establishes GenAI-based agents as a new tool for algorithmic auditing.
ConsumerSim matches official indices better around major events and improves housing forecasts using individual responses to signals.
Consumer confidence is typically modeled as a persistent macroeconomic index, yet its movements arise from households that interpret economic information through heterogeneous constraints, exposures, prior beliefs, and attention. We introduce ConsumerSim, a generative Human--Environment response framework that reconstructs Consumer Confidence Index (CCI) dynamics from a microdata-calibrated synthetic population, time-stamped macroeconomic, financial, policy, and news signals, survey-like response generation, post-stratified belief expansion, and behavioral inertia alignment. Across U.S., EU27, and Japanese official CCI target series, ConsumerSim ranks first among persistence, time-series, regression, and information-augmented baselines on the reported reconstruction metrics, with clear gains around high-salience shocks. Its reconstructed signal also improves short-horizon prediction of real activity, most consistently for housing outcomes. Mechanism analyses show that CCI movements concentrate around salient events; subgroup trajectories often align in direction while differing in magnitude; and signal sensitivity varies across income, homeownership, education, and political-alignment groups. Population-expansion and ablation results indicate that representative aggregation, situational signals, persona heterogeneity, and inertia are necessary for both accuracy and diagnosis. The findings support a behavioral view of consumer confidence as an interpretable Human--Environment response process rather than a purely aggregate time series.
When editorial boards resign from their journals and publishers and declare their independence, two competing journals can result: the original journal under a new editorial board (a "zombie" journal), and a new journal established by the departing editors (a "breakaway"). The bibliometric community saw such an event when the board of Journal of Informetrics left Elsevier to found Quantitative Science Studies. We analyzed 39 breakaway-zombie journal pairs that have formed since 1989 and their declarations of independence to understand why and how they happen. Results show that declarations of independence were motivated by concerns related to governance and business model and overwhelmingly happened at journals owned by the Big Five publishers. Breakaway editors tended to found new journals at smaller publishers and adopt diamond publishing models. These findings suggest that dissatisfaction with commercial publishing models is growing, and that community-led alternatives can motivate change.
Community-detection algorithms usually return a single partition, even when independent initializations or small data perturbations yield several plausible outputs. We probe this output distribution through three paired observables: hard-partition variation of information (VI), a residual-gated fixed-point VI, and a cutoff-free Jensen-Shannon distance between belief-propagation (BP) marginal fields. For the symmetric sparse stochastic block model, linearizing BP around the uninformative fixed point gives the Kesten-Stigum onset at $\mathrm{snr}=(c_{\rm in}-c_{\rm out})/(q\sqrt{c})=1$. The hard VI maximum is instead a finite-size, readout-dependent detector curve on the detectable side, typically $\mathrm{snr}^\star \simeq 1.05\text{-}1.10$; moving the polarization cutoff from 0.001 to 0.1 shifts it across 1.047-1.128. The nontrivial-readout activation obeys $\mathrm{snr}_{50}(\tau)-1 = 0.0086 + 0.522\,\tau$ ($R^2=0.996$). Long-budget residual gating separates readout and critical slowing from fixed-point dispersion: at $\mathrm{snr}=1.05$ and 1.10 the hard VI is 1.49 and 1.58 bits but the gated subsets have zero VI, whereas from 1.15 to 1.30 nearly all runs pass the gate and retain VI 1.31 down to 1.24 bits. A high-replication audit through $N=100000$ disfavors a zero-asymptote power law and finds a small plateau $\mathrm{snr}^\star-1 \simeq 0.024$ (graph-bootstrap 90% interval [0.0227, 0.0316]). On real networks, a label-free Bethe-Hessian modularity margin with a Chung-Lu null gate is run on political blogs and six SNAP graphs: the measurement stays label-free, while heterogeneous networks can retain null-significant structure even after strong edge subsampling. The result is a detector-output decomposition near the Kesten-Stigum boundary, reporting hard readout, relaxation dynamics, and fixed-point-field dispersion separately.
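As a small worked illustration of the two formulas quoted in this abstract, the sketch below evaluates the signal-to-noise ratio and the reported linear fit for the activation point. It assumes the usual convention for the symmetric sparse stochastic block model in which the mean degree is c = (c_in + (q - 1) c_out) / q; the abstract does not spell out c, so that definition and both function names are assumptions.

```python
import math


def sbm_snr(c_in, c_out, q):
    """snr = (c_in - c_out) / (q * sqrt(c)) for a symmetric q-group SBM.

    The mean degree c = (c_in + (q - 1) * c_out) / q is the standard
    convention and is assumed here, not taken from the abstract.
    """
    c = (c_in + (q - 1) * c_out) / q
    return (c_in - c_out) / (q * math.sqrt(c))


def snr50_fit(tau):
    """Reported linear fit: snr_50(tau) - 1 = 0.0086 + 0.522 * tau."""
    return 1.0 + 0.0086 + 0.522 * tau


# Two groups with c_in = 6 and c_out = 2 give c = 4, which places the
# model exactly at the Kesten-Stigum onset snr = 1.
at_threshold = sbm_snr(6, 2, 2)

# Activation points implied by the fit at the two cutoffs quoted above.
low_cutoff = snr50_fit(0.001)
high_cutoff = snr50_fit(0.1)
```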
Mean-field linearization shows that total infections are upper-bounded by a supermodular function, making greedy edge deletion efficient on networks.
In this paper, we investigate discrete SIS (Susceptible-Infected-Susceptible) models. We focus on minimizing epidemic spreading over networks by extending an existing edge deletion algorithm to the SIS model. To achieve this, we employ the mean-field approximation to linearize the network dynamics into a deterministic SIS model. We analytically demonstrate that the total number of infections is upper-bounded by a supermodular function, thereby ensuring the efficiency of the edge-deletion approach. To evaluate the proposed method, we conduct experiments on synthetic Erdős-Rényi networks and a real-world dataset collected from the BBC Pandemic Haslemere app. Numerical simulations validate our theoretical results, confirming that both configurations converge to the stable, disease-free equilibrium.
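To make the idea concrete, here is a minimal sketch of greedy edge deletion against a linearized mean-field discrete-time SIS recursion, p(t+1) = (1 - delta) p(t) + beta A p(t). The recursion is the common textbook linearization and the brute-force greedy loop is a generic illustration; both are assumptions, since the abstract does not give the paper's exact update rule or algorithm, and all names are hypothetical.

```python
def linear_sis_total(adj, p0, beta, delta, steps):
    """Sum of linearized infection probabilities over time.

    Uses p(t+1) = (1 - delta) * p(t) + beta * A p(t). Values may exceed 1
    because the linear recursion is only an upper-bound surrogate.
    """
    n = len(p0)
    p = list(p0)
    total = sum(p)
    for _ in range(steps):
        p = [
            (1 - delta) * p[i] + beta * sum(adj[i][j] * p[j] for j in range(n))
            for i in range(n)
        ]
        total += sum(p)
    return total


def greedy_edge_deletion(adj, p0, beta, delta, steps, budget):
    """Remove up to `budget` undirected edges, one at a time, each time
    picking the edge whose removal most reduces the linearized total."""
    adj = [row[:] for row in adj]  # work on a copy
    n = len(adj)
    removed = []
    for _ in range(budget):
        best_edge = None
        best_total = linear_sis_total(adj, p0, beta, delta, steps)
        for i in range(n):
            for j in range(i + 1, n):
                if adj[i][j]:
                    adj[i][j] = adj[j][i] = 0  # try removing the edge
                    total = linear_sis_total(adj, p0, beta, delta, steps)
                    adj[i][j] = adj[j][i] = 1  # restore it
                    if total < best_total:
                        best_total, best_edge = total, (i, j)
        if best_edge is None:
            break  # no removal helps
        i, j = best_edge
        adj[i][j] = adj[j][i] = 0
        removed.append(best_edge)
    return removed, adj


# Hypothetical path graph 0-1-2 with node 0 initially infected.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
removed, pruned = greedy_edge_deletion(path, [1.0, 0.0, 0.0], 0.3, 0.5, 3, 1)
```

On this toy path the greedy step cuts the edge next to the initially infected node, which isolates the seed. The exhaustive inner loop is only practical for small graphs.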
Complex contagion models, in which adoption requires reinforcement from multiple neighbors, have been extensively studied in the monotone (no-recovery) setting, but the phase diagram of threshold models with SIS-like recovery on networks remains unmapped. We study a stochastic Watts-threshold SIS model on Erdos-Renyi and Barabasi-Albert networks and reconstruct its extinction-persistence phase boundary in the joint parameter space of transmission rate $\beta$, adoption threshold $\theta$, and infectious duration $d$. Using adaptive Delaunay-based sampling and weighted logistic regression on over 180,000 Monte Carlo trials, we find that: (i) the boundary is well described by a six-parameter interaction model whose structure is invariant across both topologies; (ii) the transition is sharp, with the 10-90\% extinction-probability band spanning only $\Delta\theta \approx 0.005$-$0.008$; and (iii) the adoption threshold is the dominant parameter governing epidemic feasibility, with transmission rate and infectious duration playing secondary and asymmetric roles. The characterization provides a quantitative reference for the complex-contagion analogue of the classical SIS epidemic threshold.
Despite the promise of decentralization, measurement studies have identified a conspicuous lack of decentralization in blockchains. Centralization has been observed in almost all layers of the blockchain, in decentralized applications, and in decentralized autonomous organizations. In many cases, it is practically impossible to definitively determine the extent of centralization in the system. While multiple works have proposed methods to decrease centralization, by and large blockchains continue to be significantly centralized.
In this paper, we develop a general framework for building verifiably decentralized blockchain systems. Our framework is motivated by the core observation that the richness and diversity of collaborative interactions between users -- rather than resource uniformity -- captures the essence and extent of decentralization in a blockchain system. Existing blockchains do not have any incentive mechanisms to encourage inter-coalition collaboration, which directly contributes to centralization. We propose a novel reward design that incentivizes users to collaborate with other users without forming isolated coalitions. Technically, our method uses a Sybil-resistant asymmetric Shapley value for reward attribution within a collaboration group, and the theory of expander graphs for measuring and enforcing decentralization.
Our framework is general and can be adapted to alleviate centralization in any layer, application, or decentralized organization. It also has important implications beyond the topic of centralization. For example, we show that our solution can naturally address the blockchain scalability problem. We also identify a new class of decentralized collaborative applications that have hitherto been unexplored in blockchains.
Study of 3 million scientists over 120 years finds cross-field mobility prevents the typical age-related drop in new ideas.
Modern science is organized around specialization in training and teamwork. Scientists develop deep expertise within a field and combine complementary knowledge through collaboration to solve complex problems. Yet whether specialization is the most effective path to sustained innovation remains unclear. Here we introduce a quantitative framework that distinguishes generalists from specialists based on scaling patterns of disciplinary mobility while remaining independent of career age and productivity. Applying this framework to 49 million publications produced by 3 million scientists between 1900 and 2020, we examine how research style relates to innovation, learning, collaboration, and productivity. We find that scientists who move across fields are more likely to sustain innovative contributions throughout their careers, whereas those who remain within narrow fields exhibit the age-related decline in innovation. Generalists are less anchored to the literature of their training. They are more likely to pursue research independently, and, when they collaborate, they preferentially partner with other generalists. Teams with a greater share of generalists produce more innovative research, even after accounting for differences in knowledge diversity. Despite these advantages, generalists publish fewer papers on average and have become less common over time. These findings reveal a tension between the longevity of scientific careers and the longevity of scientific innovation.
When tens of thousands of autonomous AI agents interact in topical online forums, do they develop distinct community-specific linguistic identities? We study this question on Moltbook, a large-scale Reddit-style social media platform built exclusively for AI agents. Using the public Moltbook Observatory Archive dataset with over 3.1 million posts and 1.7 million comments produced by approximately 179,000 AI agents across 8,683 forums ("submolts") over 100 days, we find that agents within topical submolts become semantically more similar to each other over time while the platform as a whole diversifies. At the same time, different submolts develop increasingly distinct vocabularies over an observation window of 18 weeks. Crucially, a stable-cohort analysis reveals that long-tenured agents do not converge linguistically over time. Instead, community-level linguistic differentiation operates through selective attraction - newcomers arrive already linguistically compatible with their chosen community - and differential retention - conforming agents remain active longer. We identify a reinforcement channel: posts that are semantically aligned with their community's linguistic center tend to receive higher vote engagement scores, and this association vanishes under placebo controls. Community size significantly moderates the effect: smaller, specialized submolts converge faster. Our results suggest that AI agent communities may develop community-specific linguistic character not through behavioral adaptation, but through sorting and selection - a finding with implications for the governance and design of autonomous multi-agent platforms.
Characterizing the scenario underlying an epidemic from its disease cascade is an important task in simulation analytics. We propose boundary degree, the count of an infected node's contacts in the underlying contact network that were not infected, as a per-node cascade feature for this task. Through systematic ablation on realistic social contact networks of Tennessee and Virginia, we show that boundary degree alone improves scenario identification accuracy by 19%. Edge features, whose importance was observed empirically by prior work, consistently improve accuracy across all settings; we provide theoretical grounding for this observation. These effects are complementary. We prove that certain epidemic scenarios are indistinguishable without boundary or edge information. Prior feature engineering approaches included aggregate boundary statistics, but these were not among the top-ranked feature groups; the per-node representation we propose reveals their importance clearly. Our results suggest that contact tracing applications should track contacts with non-infected individuals, not only transmissions.
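The boundary degree defined in this abstract can be computed directly from a contact list and the set of infected nodes. The following is a minimal sketch of that definition only; the adjacency-dictionary representation, the function name, and the toy data are assumptions for illustration.

```python
def boundary_degree(contacts, infected):
    """For each infected node, count its contacts that were not infected.

    contacts -- dict mapping each node to an iterable of its contacts
    infected -- iterable of infected nodes
    """
    infected = set(infected)
    return {
        node: sum(1 for other in contacts.get(node, ()) if other not in infected)
        for node in infected
    }


# Hypothetical contact network: a-b, a-c, b-d; nodes a and b are infected.
contacts = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
features = boundary_degree(contacts, {"a", "b"})
```

Here each infected node has exactly one uninfected contact, so both receive a boundary degree of 1.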
How academic advantages are transmitted within families is usually studied as occupational inheritance, but it is not clear whether scholarly research orientations persist across generations and whether it is an advantage when they do. To address this, we link Wikidata kinship records with OpenAlex bibliometric profiles to study 3,229 documented parent-child scholar pairs and 488,659 publications. Field-level research similarity was evident but not universal: whilst the median similarity was 0.546, 25.3% of parent-child pairs had no Field overlap (i.e., similarity 0). Overall, parent-child pairs were substantially more similar than publication-period-matched comparison pairs (median 0.098). Direct academic interaction was uncommon: 10.4% of parent-child pairs had co-authored, 9.8% of children had cited their parents, and 6.9% of parents had cited their children. Nevertheless, each 0.1 increase in Field similarity was associated with 38-39% higher adjusted odds of co-authorship and cross-citation. There was also intergenerational continuity in academic achievement and recognition. Parents' publication volume and field-normalized citation impact were positively associated with those of their children. Children of national academy members had approximately twice the odds of becoming national academy members themselves (Odds Ratio = 2.04), while children of prizewinning parents had 46% higher odds of winning prizes (Odds Ratio = 1.46). However, children of national academy members showed lower research similarity to their parents. Greater research differentiation was associated with higher field-normalized citation impact among children, but not with publication output or higher odds of academic recognition. Academic families therefore appear to transmit resources and advantages, with the sole exception that diverging from parental fields seems to confer a citation advantage.
Heterogeneous graph neural networks (HGNNs) have achieved strong performance in modeling complex graph-structured data with multiple node and relation types. However, their robustness under realistic black-box adversarial settings remains insufficiently explored. Existing attacks on HGNNs usually assume access to model gradients, soft prediction scores, or the complete graph structure, which is often unavailable when HGNN-based services are deployed as closed systems. In this paper, we propose Blackknife, a hard-label, query-limited, and structure-limited black-box evasion attack framework for heterogeneous graph neural networks. Blackknife assumes no access to the victim model architecture, parameters, gradients, logits, confidence scores, or the full graph structure. Instead, it only relies on locally observable one-hop heterogeneous structures and a small number of hard-label queries. To generate effective perturbations under these strict constraints, Blackknife first constructs a local relation-aware surrogate model from observable heterogeneous neighborhoods. It then relaxes discrete edge addition and deletion operations into continuous soft weights and optimizes them through projected gradient descent. Finally, the optimized perturbations are discretized into relation-preserving structural rewiring operations and verified using limited hard-label feedback from the victim model. Extensive experiments on three benchmark heterogeneous graph datasets, including ACM, DBLP, and IMDB, demonstrate that Blackknife consistently achieves strong attack success rates against representative HGNN models. The results further show that Blackknife remains effective under topology-based defense strategies, revealing the vulnerability of HGNNs to local structure-limited black-box attacks.
A growing number of techniques leverage the spatial structures that underlie many real-world datasets. Despite these advances, the complementary task of estimating spatial structures and understanding their role within these techniques has often been overlooked. In neurophysiological data analysis specifically, numerous methods exist to estimate brain connectivity, but most are not explicitly model-based, dynamic, multivariate, or directed. To address these limitations, we previously introduced noise-driven heat modelling on graphs for neurophysiological connectivity estimation. In this study, we extend this framework by relaxing earlier noise assumptions and adding regularisation to improve robustness. We also develop a simulation procedure to characterise and evaluate our technique in a controlled setting. Finally, we demonstrate that the technique is able to capture meaningful spatial structure across two experiments, each using two real-world datasets. The explicit model formulation of our connectivity estimator has the potential to improve the interpretability of graph-based techniques across a wide range of applications. The code implementing our method is available at https://github.com/sgoerttler/Heat_Connectivity.
Computational knowledge graphs assign philosophical concepts to traditions based on corpus frequency: the school that mentions a concept most becomes its attributed tradition. We argue this conflates three measurements: textual power, historical priority, and philosophical significance, demonstrated using the darshana-graph, a knowledge graph of 28,322 relationships across Hindu, Buddhist, and Jain traditions. Seven of the top 25 concepts by betweenness centrality predate their attributed school by 288 to 2,288 years. Moksha, attributed to Advaita Vedanta, appears first in Jain sources over 1,200 years earlier. The most reliable snapshot, at 300 BCE using only explicitly dated sources, shows a genuinely pluralistic structure: 59% Vedic, 24% Jain, 18% Buddhist. We also quantify a critical distortion in the temporal method: between 300 CE and 800 CE the network grows from 18 to 1,028 nodes, with 97.4% carrying Advaita proxy dates, revealing that apparent dominance reflects textual survival, not philosophical history. Beyond correcting attribution bias, the temporally grounded graph enables structural homology analysis across traditions. Ego-network feature vectors applied to 48 temporally labelled concepts across eight traditions identify cross-tradition concept pairs with high structural similarity. The method recovers known correspondences including purusha-jiva (Samkhya/Jain, sim 0.990) and prakriti-maya (Samkhya/Vedic, sim 0.972), and surfaces novel homologies. Nibbana and samsara score 0.954 despite being doctrinal opposites: both function as the ultimate reference concept in their tradition's soteriology. Cetana (Buddhist intention) and ajiva (Jain non-living matter) score 0.923, a pairing absent from the literature. These are not claims of doctrinal equivalence but of measurable structural homology: different philosophical vocabularies navigating a shared conceptual space.
People's opinions can change both from their interactions with each other and from their interactions with media sources. Bounded-confidence models (BCMs) of opinion dynamics provide one framework to study such dynamics. In a BCM, the nodes of a network are agents with continuous-valued opinions, and these agents interact with each other via the edges of the network. In this paper, we extend the original Deffuant--Weisbuch (DW) BCM by incorporating influence from two media sources -- one with a positive value and one with a negative value -- to capture the effects of a polarized media landscape. We show both numerically and analytically that our extended DW model exhibits drifting behavior in which a large cluster of opinions shifts toward one of the media agents. We analyze how the drift trajectory and speed depend on the model parameters, and we identify conditions in which drift is promoted or suppressed. Our results provide insight into how competing media sources can influence collective opinion formation in social systems.
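A rough sketch of the kind of dynamics described above follows. The pairwise rule is the standard Deffuant--Weisbuch update; how the paper couples agents to the two media sources is not stated in the abstract, so the `media_update` rule, the probability `p_media`, and all parameter values here are assumptions chosen purely for illustration.

```python
import random


def dw_pair_update(x_i, x_j, confidence, mu):
    """Standard DW rule: compromise only if opinions differ by less than the bound."""
    if abs(x_i - x_j) < confidence:
        return x_i + mu * (x_j - x_i), x_j + mu * (x_i - x_j)
    return x_i, x_j


def media_update(x_i, media_opinion, confidence, mu):
    """ASSUMED coupling: the agent moves toward a fixed media opinion; the media never moves."""
    if abs(x_i - media_opinion) < confidence:
        return x_i + mu * (media_opinion - x_i)
    return x_i


def simulate(opinions, edges, steps, confidence=0.3, mu=0.5,
             media=(-1.0, 1.0), p_media=0.2, seed=0):
    """Asynchronous simulation: each step is one media exposure or one pairwise interaction."""
    rng = random.Random(seed)
    x = list(opinions)
    for _ in range(steps):
        if rng.random() < p_media:
            i = rng.randrange(len(x))
            x[i] = media_update(x[i], rng.choice(media), confidence, mu)
        else:
            i, j = rng.choice(edges)
            x[i], x[j] = dw_pair_update(x[i], x[j], confidence, mu)
    return x


# Hypothetical ring of 6 agents with evenly spread initial opinions.
initial = [-0.9, -0.5, -0.1, 0.1, 0.5, 0.9]
ring = [(k, (k + 1) % 6) for k in range(6)]
final = simulate(initial, ring, steps=2000)
print(final)
```

With `mu` at most 0.5 every update is a convex combination, so opinions stay inside the interval spanned by the initial opinions and the two media values.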
Temporal link prediction is usually evaluated by predictive performance on unseen edges, but in probabilistic temporal graphs this criterion can conflate model error with irreducible uncertainty. We study this issue by characterising an inherent estimation--prediction tradeoff in binary logistic models where regimes that maximise Fisher information and improve parameter recoverability are also those with the highest entropy, making individual predictions intrinsically harder even under perfect parameter recovery. We propose a probabilistic causal framework for generating temporal graphs with transient edges and known ground-truth causal structure, allowing temporal link prediction to be evaluated jointly with causal parameter recovery. For the proposed binary logistic parametrisation, we derive the Cram\'{e}r--Rao bound and validate the tradeoff between parameter estimation error and irreducible predictive loss. Our results show that predictive accuracy alone may not reflect whether a model has learned the underlying causal mechanism, motivating benchmarks that distinguish reducible model error from intrinsic process uncertainty.
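The estimation--prediction tradeoff can be seen in the simplest case. The sketch below assumes a single Bernoulli edge indicator with success probability `p = sigmoid(eta)`; the paper's full parametrisation is not given in the abstract, so this is only the textbook one-parameter version. For that case the per-observation Fisher information about `eta` is `p(1-p)` and the predictive entropy is the binary entropy of `p`; both peak at `p = 0.5`.

```python
import math


def sigmoid(eta: float) -> float:
    return 1.0 / (1.0 + math.exp(-eta))


def fisher_information(eta: float) -> float:
    """Per-observation Fisher information about eta for a Bernoulli(sigmoid(eta)) outcome."""
    p = sigmoid(eta)
    return p * (1.0 - p)


def bernoulli_entropy(eta: float) -> float:
    """Entropy (in nats) of the outcome: the irreducible part of the log-loss."""
    p = sigmoid(eta)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


etas = [-4.0, -2.0, 0.0, 2.0, 4.0]
table = [(eta, fisher_information(eta), bernoulli_entropy(eta)) for eta in etas]
for eta, info, ent in table:
    print(f"eta={eta:+.1f}  fisher={info:.4f}  entropy={ent:.4f}")
```

The regime where each observation is most informative about the parameter is the same regime where individual edges are hardest to predict, even with the true parameter in hand.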
Papers that name a repository versus DOIs it declares produce opposite conclusions on how science and software influence each other.
Software and scientific knowledge co-evolve, yet they are catalogued in separate corpora that rarely speak to one another. We bridge them at global scale by linking World of Code (a near-complete mirror of public version-control history) to Semantic Scholar and OpenAlex through a typed cross-corpus graph of 69.8M edges over eight relation types (paper-to-software mentions, software-to-paper citations, software dependencies, authorship, affiliation, and identity bridges). Anchoring on 18,247 curated science repositories, we ask two reciprocal questions: what is the impact of science on software, and of software on science? To test whether this Science-Software Supply Chain (S3C) view is feasible, we run basic investigations rather than claim a definitive measurement. The two directions appear to illuminate different, complementary strata: the literature's reach into software is dominated by a reproducibility and packaging layer (nf-core, Nextflow, Bioconda) and sequence-analysis tools, whereas software's reach back into science is proxied by a largely invisible machine-learning and data-science infrastructure tier (PyTorch, seaborn, NLTK). The direct paper-names-software channel is too sparse to rank: a human-curated gold benchmark links none of its 65 in-scope cases. Dependency reuse stands in as a proxy and is at most weakly coupled to citation count and to stars (Spearman rho=0.36). Our most cautionary finding is about measurement itself: the reuse-citation coupling flips sign and confidence across two reasonable ways of pairing a repository with a citation count, through papers that name it (n=137, rho=0.05, CI straddling zero) versus DOIs a repository declares for itself (n=1,067, rho=0.13, CI [0.07,0.19]). With linkage this sparse, the sign of a headline correlation depends on which gap one tolerates, so we report both and refrain from a strong decoupling claim.
Temporal (or time-evolving) networks provide a natural framework for modeling complex systems with time-dependent interactions, where understanding the evolution of community structures is a central challenge. While random walk-based approaches to community detection in static networks are well established through the spectral analysis of associated transfer operators, extending these ideas to temporal networks is nontrivial due to the inherent time-dependence of the underlying dynamics. In this work, we develop a general framework for community detection in temporal networks that is based on multi-view canonical correlation analysis (mCCA). We show that the proposed formulation admits a spectral characterization via a time-reversible random walk on an augmented space-time network, providing a clear dynamical interpretation of temporal communities as metastable structures of the process. Furthermore, we analyze key spectral properties of the resulting transfer operators and the interplay between spatial and temporal effects, which allows us to distinguish between structural features and artifacts induced by the snapshot coupling. Finally, we derive a reduced-order model, which preserves the essential spectral properties while significantly improving computational efficiency. We show that the proposed approach effectively detects communities in temporal networks and captures their evolution.
Social media popularity prediction aims to forecast the future reach or influence of online content from early-stage observations. Accurate prediction enables key downstream applications, such as advertising optimization and strategic content planning by users, creators, and platforms. Despite substantial progress, existing work on popularity prediction often fails to jointly consider multimodal content and temporal social interaction signals. Moreover, the literature remains highly fragmented across datasets, modalities, observation windows, prediction targets, and evaluation protocols. This fragmentation prevents fair comparison and obscures a systematic understanding of how textual, visual, temporal, and interaction-based signals jointly shape popularity dynamics. To address these challenges, we introduce MMG-Pop, a Multi-modal Graph-based Popularity Prediction benchmark, which unifies datasets, modalities, temporal interaction signals, and representative baselines under a standardized evaluation protocol. Furthermore, we propose MMG-PopNet, a unified multi-modal graph-based network that jointly models the aforementioned multi-modal signals and graph-structured social interactions. Extensive experiments on MMG-Pop, comprising four datasets across the Bluesky and Reddit platforms, demonstrate the superior performance of MMG-PopNet and yield new insights into cross-platform training generalization, multi-task prediction benefits, multi-modality contributions, and the limitations of LLM-based prediction. These findings establish a unified foundation for future research on social dynamics modeling and intervention under heterogeneous modalities and socially-aware agentic ecosystem paradigms.
We address the problem of inferring a directed network from nodal measurements generated by linear diffusion dynamics on the sought graph. Observations are modeled as the outputs of a graph convolutional filter, i.e., a polynomial (with unknown coefficients) of a local diffusion graph-shift operator encoding the latent graph topology, excited with an ensemble of independent graph signals with arbitrarily-correlated nodal components. Unlike prior efforts that considered undirected graphs and white signal excitations, here the graph-shift operator and the observations' covariance matrix are not simultaneously diagonalizable. In this challenging context, we first rely on measurements of the output signals along with prior statistical information on the inputs to identify the diffusion filter. This system identification problem involves solving a system of quadratic matrix equations, which we show is identifiable under spectral-diversity assumptions on the input covariances. For algorithmic purposes we recast it as a smooth quadratic minimization subject to Stiefel manifold constraints. Subsequent identification of the network topology given the graph filter estimate boils down to finding a sparse and structurally admissible shift that commutes with the given filter, thus forcing the latter to be a polynomial in the sought graph-shift operator. A joint graph filter and topology identification algorithm is also proposed, which alternates between the aforementioned steps in a mutually reinforcing fashion to offer improved sample complexity. Numerical tests corroborate the effectiveness of the proposed algorithms on synthetic digraphs and real-data case studies, and illustrate their potential utility in urban mobility analyses as well as portfolio optimization.
Decentralized online social networks such as Mastodon distribute moderation power across thousands of independently governed servers, raising fundamental questions about how local block decisions shape global structure and information flow. In this paper, we analyze Mastodon at the instance level by constructing a signed, directed, temporal network in which positive edges aggregate inter-instance follow relationships and negative edges encode daily block actions. Using one year of data, we show that despite continuous moderation activity and changing roles among instances, the network exhibits strong structural stability: signed dyadic motifs and degree distributions display highly persistent dynamics, and aggregated transition matrices satisfy Markovian equilibrium conditions over intermediate time scales. Building on the marked asymmetry between instances that predominantly issue bans and those that are mostly banned, we then study information diffusion on the positive network via a hybrid contagion model that combines simple contagion within groups and complex contagion across groups. We find that information originating in the minority of moderating instances spreads more efficiently, both internally and toward the majority, while the opposite direction is fragile and sensitive to contagion parameters. Echo-chamber effects emerge even in a globally balanced signed network and become stronger under stricter contagion conditions. Together, these results show that decentralized moderation in Mastodon generates a stable macroscopic configuration that both structures and constrains information exchange, effectively isolating norm-violating domains without centralized control.
The broadcast of disinformation in online social networks (OSN) is a growing concern examined across several disciplines, including human-computer interaction (HCI). The pervasive issue has been prompting novel approaches to identify the malicious actors behind the dissemination of deceptive and fabricated content. Analyzing the characteristics and activities of these actors, we designed a taxonomy informed by collaboration with subject matter experts (SMEs) and a review of the academic literature. Our study explores how to distinguish the characteristics, activities, and strategies of malicious actors on OSN and examines how they contribute to the spread of disinformation. We describe the design process and the application of the taxonomy in a case study analyzing anti-migration discourse in social media channels, and reflect on its potential to aid researchers and practitioners in the responsible design of network systems.
Large-scale point cloud maps are essential for robotics and spatial intelligence tasks. UAVs provide an efficient means for large-scale map acquisition; however, due to limited flight endurance and onboard storage, mapping a large-scale scene within a single flight remains difficult. Existing multi-session map merging methods can extend the mapping range, yet in UAV scenarios they still struggle to simultaneously suppress long-range drift and preserve local geometric accuracy. To address this issue, an uncertainty-aware multi-session point cloud map merging and coarse-to-fine optimization system is proposed. The proposed method first performs initial multi-session map merging based on a scene graph, and then incorporates RTK observations through an RTK spatiotemporal alignment module, where temporal offsets are estimated using Dynamic Time Warping (DTW), and continuous RTK constraints are recovered using Multi-Output Gaussian Processes (MOGP) under incomplete sampling and frame dropouts. On this basis, a unified uncertainty-aware factor graph is constructed, and local geometric accuracy is further improved through iterative plane-factor refinement. Experiments on real-world datasets validate the effectiveness and robustness of the proposed method. To facilitate further research and development in the community, our code and dataset will be publicly released.
Community detection is a key task in network analysis, providing insight into the structural organization of complex systems. Effective resistance, a graph-theoretic metric derived from electrical network theory, has emerged as a powerful tool for evaluating connectivity and influence within networks. This paper proposes an effective resistance-based community detection algorithm that computes the similarity between nodes from effective resistance values and produces a weighted graph. A sparse graph is then generated by computing the minimum spanning tree (MST) of the weighted graph and applying a threshold sparsification strategy to the non-MST edges. Modularity is then maximized on the resulting sparse graph using the Clauset-Newman-Moore algorithm. The algorithm is evaluated on both synthetic and real-world networks, demonstrating its effectiveness compared to popular existing methods. The results show that the effective resistance-based approach accurately captures community structure while maintaining computational efficiency.
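As background for the quantity this abstract builds on, effective resistance between two nodes can be computed by grounding one node and solving a reduced Laplacian system. The sketch below is a small exact baseline, assuming a connected, undirected graph given as a dense adjacency matrix with zero diagonal; the function names are illustrative and this is not the paper's algorithm, which also involves MST-based sparsification and modularity maximization.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Gauss-Jordan elimination over exact fractions (assumes a nonsingular system)."""
    n = len(matrix)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def effective_resistance(adj, i, j):
    """Inject unit current at i, ground node j, and read off the potential at i."""
    if i == j:
        return Fraction(0)
    keep = [k for k in range(len(adj)) if k != j]
    lap = [[Fraction(sum(adj[r])) if r == c else Fraction(-adj[r][c]) for c in keep]
           for r in keep]
    rhs = [Fraction(1 if r == i else 0) for r in keep]
    potentials = solve(lap, rhs)
    return potentials[keep.index(i)]


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # 0 - 1 - 2
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # three unit resistors in a cycle
print(effective_resistance(path, 0, 2), effective_resistance(triangle, 0, 1))
```

The two toy graphs match the series and parallel rules for resistors: two unit edges in series give 2, and one unit edge in parallel with a two-edge path gives 2/3.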
The forest matrix of a signed graph plays an important role in network science and social opinion dynamics, yet existing algorithms are mainly designed for unsigned graphs and are difficult to extend to signed graphs. In this paper, we study the problem of efficiently estimating the forest matrix of signed graphs with n nodes and introduce the signed forest matrix theorem, which establishes the relationship between generalized spanning converging forests and the forest matrix. Based on this result, we propose a novel algorithm GSCF, built on a variant of loop-erased random walks, to generate generalized spanning converging forests in expected O(n) time. We further develop two sampling algorithms, FMDE and FMDE+, for estimating the diagonal of the forest matrix, both with time complexity O(l·n), where l is the number of samples. Extensive experiments on various signed graphs show that our methods achieve high estimation accuracy, significantly improve computational efficiency, and scale to graphs with over twenty million nodes. Our source code is publicly available at https://github.com/HaoxinSun98/SignedForestDiagonal.
In this paper, we address the problem of fast computation and optimization of opinion-based quantities in the Friedkin-Johnsen (FJ) model. We first introduce the concept of partial rooted forests, based on which we present an efficient algorithm for computing relevant quantities using this method. Furthermore, we study two optimization problems in the FJ model: the Opinion Minimization Problem and the Polarization and Disagreement Minimization Problem. For both problems, we propose fast algorithms based on partial rooted forest samplings. Our methods reduce the time complexity from linear to sublinear. Extensive experiments on real-world networks demonstrate that our algorithms are both accurate and efficient, outperforming state-of-the-art methods and scaling effectively to large-scale networks.
The forest matrix of a graph, particularly its diagonal elements, has far-reaching implications in network science and machine learning. The state-of-the-art algorithms for computing the diagonal of the forest matrix are based on fast Laplacian solvers. However, these algorithms encounter limitations when applied to digraphs, because such solvers do not extend to directed graphs. To overcome this issue, we propose three novel sampling-based algorithms: SCF, SCFV, and SCFV+. Our first algorithm, SCF, leverages a probabilistic interpretation of the diagonal of the forest matrix and uses an extension of Wilson's algorithm to sample spanning converging forests. To reduce the variance in forest sampling, we develop two novel variance-reduction techniques. The first technique, leading to the SCFV algorithm, is inspired by opinion dynamics in graphs and applies matrix-vector iteration to spanning forest sampling. While SCFV achieves reduced variance compared to SCF, the cross-product term in its variance expression can be complex and potentially large in certain graphs. Therefore, we develop another technique, leading to a new iteration equation and the SCFV+ algorithm. SCFV+ achieves further reduced variance without the cross-product term in the variance of SCFV. We prove that SCFV+ can achieve a relative error guarantee with high probability and maintain linear time complexity relative to the number of nodes in the graph, a stronger theoretical result than those of state-of-the-art algorithms. Finally, we conduct extensive experiments on various real-world networks, showing that our algorithms achieve better estimation accuracy and are more time-efficient than the state-of-the-art algorithms. In particular, our algorithms scale to massive graphs with more than twenty million nodes, both undirected and directed.
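For orientation, the quantity being estimated can be computed directly on small graphs. The sketch below assumes the usual definition of the forest matrix as the inverse of I + L, with L = D - A and D the diagonal matrix of out-degrees (the digraph convention here is an assumption). It obtains each diagonal entry by a plain fixed-point iteration on a dense adjacency matrix. It is a ground-truth baseline for toy inputs only, not SCF, SCFV, or SCFV+, which rely on forest sampling.

```python
def forest_matrix_diagonal(adj, iterations=200):
    """Diagonal of (I + L)^{-1}, with L = D_out - A, via Jacobi-style iteration.

    Column i solves (I + D - A) x = e_i, rewritten as
    x_r = (e_i[r] + sum_c A[r][c] * x_c) / (1 + d_r), which is a contraction.
    """
    n = len(adj)
    out_deg = [sum(row) for row in adj]
    diagonal = []
    for i in range(n):
        x = [0.0] * n
        for _ in range(iterations):
            x = [
                ((1.0 if r == i else 0.0) + sum(adj[r][c] * x[c] for c in range(n)))
                / (1.0 + out_deg[r])
                for r in range(n)
            ]
        diagonal.append(x[i])
    return diagonal


edge = [[0, 1], [1, 0]]                   # single undirected edge
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # 0 - 1 - 2
print(forest_matrix_diagonal(edge))
print(forest_matrix_diagonal(path))
```

On the three-node path the entries are 5/8, 1/2, 5/8: the middle node has the smallest value, consistent with reading a diagonal entry as the probability that the node is a root in a random spanning forest.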
The social compass model has been recently proposed as a model for depolarization in populations where individuals have multiple, possibly correlated, opinions. Previous work has focused on the steady state of this model, but has not addressed the dynamics leading to depolarization. We show that the macroscopic dynamics of the social compass model can be described using the Ott-Antonsen Ansatz and that, for initially clustered opinions, the resulting equations reduce to a finite-dimensional system of ordinary differential equations. We study the linear stability of the polarized state and find a dispersion relation for the growth rate of perturbations from this state. We find that the critical coupling for depolarization depends only on the first inverse moment of the conviction distribution, whereas the rate of depolarization depends on higher moments. Consequently, conviction distributions with the same critical coupling can exhibit vastly different depolarization timescales. We also demonstrate how our analysis can be extended to study depolarization in the presence of community structure.
As AI agent protocols proliferate, the governance structures shaping their interoperability standards remain empirically underexamined. We introduce an LLM-powered comparative pipeline for large-scale governance discourse analysis, integrating automated annotation, neural topic modeling, and multi-layer network analysis to study socio-technical power structures at scale. We validate it on two contrasting standards for agent interoperability: ERC-8004 (permissionless, on-chain) and Google A2A (corporate-led). Analyzing 4,323 governance participation records, we combine LLM-assisted coding, topic modeling, and multi-layer network analysis to examine how institutional design shapes thematic priorities and community structure. We find that while governance form influences substantive focus, both regimes exhibit comparable levels of participation inequality and community fragmentation. Discourse alignment is denser in the permissionless setting, suggesting that open governance may foster greater thematic convergence despite decentralized participation. These findings illustrate how LLM-assisted methods can advance the empirical study of technology governance, with implications for designing more equitable agentic AI standards. All data and code are openly available.
Complex systems, from gene regulatory networks to neural circuits and transportation infrastructures, exhibit rich functional behaviour that topology alone does not capture. Here we show that functional memory exhibits a universal organisational regularity: in every biological, ecological, social, and technological domain studied, real interaction strengths organise memory at greater hierarchical depth than random weight assignment on the same topology, across thirty-four networks spanning several orders of magnitude in size and density. Using a thermodynamic description of multiscale information flow, we quantify how memory is distributed across path lengths and show that functional memory organisation collapses onto four recurrent dynamical organisations, revealing an intrinsically low-dimensional structure. Comparing each network against null models that selectively perturb weighted transport geometry, mesoscale structure, and directionality reveals that these ingredients contribute distinct and non-equivalent roles: weight geometry systematically governs memory depth, mesoscale structure shapes memory organisation across scales, and directionality modulates the sensitivity of the cascade to structural perturbation. The same comparison provides an operational criterion for whether network weights encode genuine functional interaction structure. These results establish weighted transport geometry as a primary organiser of functional memory and show that weighted interactions carry dynamical structure that binary topology alone cannot recover.
Opinions play a crucial role in shaping collective phenomena such as political polarization, cultural integration and demographic change. By continuously changing social environments in which opinions evolve, human migration serves as an important driver of collective opinion formation. While migration and opinion dynamics have both been extensively studied, the few existing models that couple the two are primarily deterministic and therefore cannot capture demographic fluctuations, finite-size effects or stochastic transitions between emergent collective states. To address this limitation, we introduce a unifying stochastic framework for opinion dynamics over migration networks that couples local opinion transitions, demographic processes and migration between communities. The dynamics are formulated through a spatio--temporal master equation, which provides a probabilistic description of the underlying population process. From this microscopic representation, we derive deterministic mean-field equations governing the co-evolution of community sizes and opinion compositions, thereby linking agent-level interactions to macroscopic population behavior. Using two representative case studies, we demonstrate how stochasticity and migration can qualitatively change the emergent dynamics and collective outcomes, including the emergence of consensus, polarization and the stabilization of oscillatory opinion dynamics. These examples highlight the rich interplay between social interactions, demographic change and migration in deterministic and stochastic settings, and they demonstrate that migration should be viewed as an integral component of collective opinion formation rather than only an external demographic process.
When institutions decide by consensus, the official record shows agreement but hides who shaped what was decided. We introduce a way to recover that hidden structure from the one trace consensus cannot suppress: the documentary record of what actors choose to work on. Adapting tools from economic complexity, we map a ``space of concerns'' in which issues lie close when the same actors repeatedly specialize in both -- turning a flat agenda into a measurable topology of attention. Across six decades of the Antarctic Treaty (6,591 documents, 66 actors), engagement is structured, local, and persistent, and the most specialized actors produce binding law at roughly five times the baseline rate. The approach generalizes to any document-rich consensus forum, showing that unanimity does not erase political structure -- it relocates it upstream, into the organization of attention.
Unified reevaluation of 9 models shows prior-data fitted networks win but at higher inference cost.
Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attracted a lot of attention. While many different types of models are called GFMs, particular interest has been paid to GFMs designed for node property prediction tasks, which is one of the most popular settings in Graph ML with lots of real-world applications from fraud detection in financial and social networks to recommendation systems for e-commerce and user-generated content platforms. While a number of GFMs for this task have been recently proposed, the field has not converged to a unified evaluation setting, and different works evaluate their models in widely different ways, preventing reliable comparison of GFMs with each other and with other types of models. In this work, we conduct a fair and rigorous reevaluation of 9 recent GFMs for node property prediction, comparing them to strong Graph Neural Network (GNN) baselines. We find that, among these GFMs, only the most recent ones based on the Prior-data Fitted Networks paradigm outperform well-tuned GNNs in predictive performance, although at a higher inference cost.
Systematic tests show structure controls success at reaching low-energy states in Hamiltonian problems.
Instantaneous Quantum Polynomial-time (IQP) circuits are promising candidates for near-term quantum advantage due to the conjectured classical hardness of their sampling task. However, their capabilities for optimization remain largely unexplored. We present a systematic investigation of the performance and trainability of IQP circuits for Hamiltonian optimization. Our results reveal a trade-off between optimization performance and circuit connectivity, demonstrating that the circuit structure plays a key role in determining the ability of IQP circuits to reach low-energy states.
We study community detection in the two-block stochastic block model under the setting where multiple independent graph samples drawn from the same distribution are available. Building on a recently simplified spectral algorithm that preserves the independence of adjacency matrix entries throughout, we show that averaging $m$ independent samples before applying spectral partitioning reduces the error bound $\gamma$ exponentially in $m$: specifically, one can find a $\gamma$-correct partition with probability $1 - o(1)$ whenever $\frac{(a-b)^2}{a+b} \geq \frac{C}{m} \log \frac{2}{\gamma}$, improving the single-sample requirement by a factor of $m$. The key technical contribution is a multi-sample analogue of the spectral norm bound on the noise matrix, which propagates through the Davis-Kahan subspace angle analysis to yield the improved recovery guarantee. We provide experimental validation across a range of graph sizes ($n$ up to $1000$) and sample counts ($m$ up to $9$), demonstrating that the derived bounds are sharp and that even two or three samples yield dramatic improvements in recovery accuracy. Our results offer a rigorous theoretical foundation for graph data augmentation strategies used in modern graph representation learning.
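The averaging step described above can be illustrated with a minimal sketch. It assumes a plain second-eigenvector partition rather than the simplified spectral algorithm the abstract builds on; the function names, edge probabilities, graph size, and seed are illustrative choices, not values from the paper.

```python
import numpy as np

def sample_sbm(labels, p, q, rng):
    """Draw one symmetric adjacency matrix from a two-block SBM (hypothetical helper)."""
    n = len(labels)
    same_block = labels[:, None] == labels[None, :]
    probs = np.where(same_block, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # strict upper triangle, no self-loops
    return (upper | upper.T).astype(float)

def spectral_partition(adj):
    """Split nodes by the sign of the eigenvector of the second-largest eigenvalue."""
    _, vecs = np.linalg.eigh(adj)  # eigh returns eigenvalues in ascending order
    return (vecs[:, -2] > 0).astype(int)

def accuracy(pred, truth):
    """Fraction of correctly labelled nodes, up to swapping the two labels."""
    agree = float(np.mean(pred == truth))
    return max(agree, 1.0 - agree)

rng = np.random.default_rng(0)
n, m = 200, 4                          # illustrative graph size and sample count
labels = np.repeat([0, 1], n // 2)
samples = [sample_sbm(labels, 0.30, 0.10, rng) for _ in range(m)]
averaged = np.mean(samples, axis=0)    # average the m independent graphs entrywise
acc_avg = accuracy(spectral_partition(averaged), labels)
print(f"accuracy on averaged graph: {acc_avg:.3f}")
```

Averaging keeps the expected adjacency matrix unchanged while shrinking the variance of each entry by a factor of $m$, which is the intuition behind the improved condition on $\frac{(a-b)^2}{a+b}$.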
Node centrality is a fundamental problem in network analysis, yet classical metrics fail to capture the collective, coalitional nature of influence. We present a systematic empirical evaluation of the Shapley-value-based framework for the sphere of influence problem -- selecting $m$ nodes to maximize network coverage under three reachability criteria: single-hop, $k$-hop, and multi-path connectivity -- using exact polynomial-time algorithms due to Michalak et al. Evaluation across three diverse real-world networks (Euroroad, Facebook TV Shows, and Cora) demonstrates that practical approximation ratios consistently approach 0.9, substantially exceeding the theoretical $(1-1/e)$ lower bound, and that the Shapley-based approach dramatically outperforms a degree-based baseline, particularly in hub-and-spoke topologies. In the most striking case, Shapley-based selection identifies just 26 nodes (under 1\% of the Cora network) sufficient to influence half the graph under 3-hop reachability, compared to substantially larger sets required by the naive baseline.
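For the single-hop criterion, the exact Shapley value has a well-known closed form from Michalak et al.: each node $v$ receives $\sum_{u \in \{v\} \cup N(v)} \frac{1}{1 + \deg(u)}$. The sketch below implements only that formula; the adjacency-dictionary representation and the function name are assumptions made for illustration, and the $k$-hop and multi-path variants are not covered.

```python
def shapley_one_hop(adj):
    """Shapley value of each node in the single-hop coverage game.

    adj maps each node to the set of its neighbours (undirected, no self-loops).
    """
    return {
        v: sum(1.0 / (1 + len(adj[u])) for u in adj[v] | {v})
        for v in adj
    }

# Star graph with centre 0 and three leaves (illustrative input).
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
values = shapley_one_hop(star)
print(values)
```

The values sum to the number of nodes, as the efficiency property of the Shapley value requires when the grand coalition covers the whole graph.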
As large language models (LLMs) are increasingly used in media production from journalism to filmmaking, what impact do they have on the stories being told? Prior work has shown LLMs to perpetuate social biases, including those related to gender. We complement existing literature on gender bias in LLM outputs by auditing the network structure of LLM-generated movie screenplays through automating the Bechdel test, a popular measure of women's representation in literary and film works. We also introduce the use of social network analysis measures to further analyze representational bias in LLM-generated scripts. We evaluate screenplays generated by three state-of-the-art LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) against 768 corresponding human-written screenplays, finding that human-written scripts are more likely to pass the Bechdel test. However, other network analyses, such as centrality, homophily, and triadic relationships, demonstrate that in some cases LLM-generated scripts have less bias, although all script types demonstrate some representational bias under most measures. We conclude by discussing the continued need for further quantitative assessments of media representations and AI-generated content.
Digital twins are too often described as realistic simulations, anatomical avatars, dashboards, or data mirrors. Those artifacts can be useful, but they miss the defining property of a digital twin: bidirectional feedback between a physical counterpart and a virtual counterpart. The physical system continuously updates the virtual one; the virtual system informs actions that change measurement, intervention, operation, or governance in the physical world. We propose this bidirectional feedback as the organizing principle for digital twins and apply it to a nested, multi-scale hierarchy of biological and social organization, in which lower-level units combine into higher-level systems, producing desirable properties at each level, from cells and tissues to organs, individuals, organizations, and populations at large. Neuroinformatics is a stress test for this view because brain health, dementia, epilepsy, and other neurological diseases require the integration of cells, circuits, behavior, care pathways, and the translation of discovery to practice. Examples from epilepsy care and consortium-scale brain-cell atlas production show that digital twinning is not merely multi-scale modeling. It is a rich, multidisciplinary paradigm of computing for designing, governing, and driving feedback loops that turn data into accountable action.
Data analysis finds activity at network distance two boosts liking probability even with no active direct neighbors.
The present study investigates direct and indirect social contagion mechanisms in an online social network environment. Using a large-scale dataset comprising approximately 290,000 users from the VKontakte platform, we examine the factors associated with the probability that a user likes a post. Our analysis shows that, while demographic and structural characteristics of individual nodes, such as gender and degree, contribute to the observed dynamics, the strongest associations arise from activity in the user's local network. In particular, active nodes (users who have already liked the post) at distances d = 1 and d = 2 play a central role in shaping liking behavior. We find a substantial association between second-order activity and liking probability, which persists even in the absence of active direct neighbors and is consistent with indirect influence pathways in the network. No significant association is detected for nodes at distance three or beyond. The results also support the structural diversity hypothesis: the number of connected components among active friends is a significant predictor of liking.
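The structural diversity predictor mentioned above, the number of connected components among a user's active friends, can be computed as in the following sketch. The adjacency-dictionary input and the function name are assumptions for illustration; the abstract does not describe the authors' implementation.

```python
def active_friend_components(adj, user, active):
    """Count connected components in the subgraph induced by the user's active friends.

    adj maps each node to the set of its neighbours (undirected graph);
    active is the set of users who have already liked the post.
    """
    friends = adj[user] & active
    seen, components = set(), 0
    for start in friends:
        if start in seen:
            continue
        components += 1            # a new, previously unreached component
        seen.add(start)
        stack = [start]
        while stack:               # depth-first search restricted to active friends
            node = stack.pop()
            for neighbour in adj[node] & friends:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components

# User 0 has friends 1-4; friends 1 and 2 know each other; 1, 2, 3 are active.
graph = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
n_components = active_friend_components(graph, 0, {1, 2, 3})
print(n_components)
```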
Spectral community detection estimates latent labels from the leading eigenspace of a network adjacency matrix, but releasing the resulting labels can disclose sensitive relational information. We consider this problem under differential privacy for both ordinary and bipartite networks. For ordinary networks, the protected unit is a single edge, leading to edge differential privacy (edge-DP). For bipartite networks, the inferential target is the community structure of the left-side nodes, while the protected unit is an entire right-side incidence profile, leading to column-node-DP. We propose NetPTR, a private spectral clustering procedure that releases a noisy empirical spectral embedding after a stability test. The algorithm requires perturbation bounds for empirical eigenspaces under neighboring-network changes, which yield computable stability certificates and local sensitivity bounds. For ordinary networks, we establish edge-DP and the error bound under the degree-corrected stochastic blockmodel, which separates the non-private spectral clustering error from the additional privacy-induced error. It therefore guarantees weak consistency in sparse networks and exact recovery in moderate sparse networks. A matching lower bound shows that the required privacy budget is sharp up to logarithmic factors. We further develop a column-node-DP algorithm for bipartite networks and prove consistency under a bipartite degree-corrected block model. Simulations and real-data examples illustrate the resulting privacy--accuracy tradeoff.
Coordinated foreign influence operations pose a growing threat to online platforms, but detecting state-linked troll activity and tracking its evolution remain challenging. This paper presents an explainable machine learning framework for theory-guided detection and longitudinal analysis of suspected trolling within Korean online news comment sections. Our hierarchical model classifies comments along three dimensions central to influence campaigns: foreign origin, moral-emotional framing, and target country. To support explainability, it also extracts brief span-level textual evidence that provides human-interpretable rationales. We apply the approach to 112M South Korean news comments authored by 4M users over nearly 20 years, identifying 23,998 accounts exhibiting behavior consistent with coordinated manipulation. Analyzing these accounts, we find that they predominantly rely on morally condemning rhetoric rather than direct promotion of foreign-aligned narratives; this rhetoric receives significantly higher user engagement. Among the highest-engagement comments, the moral condemnation most frequently targets domestic political figures (e.g., presidents or party leaders) on both the left and the right, potentially amplifying polarization. Our framework supports transparent platform governance through explainable, evidence-based moderation. These observed rhetorical and engagement patterns can inform how platforms and observatories prioritize defenses and intervene before harmful narrative-target combinations achieve widespread reach.
Feedback volume, flow alignment, locality, and complexity combine into D(G) without collinearity or score dilution.
Directed acyclic graphs (DAGs) are fundamental to the study of causal structures, hierarchical systems, and information flow. While directedness and acyclicity are defined as binary properties, real-world networks often exhibit continuous degrees of "DAG-ness" due to structural noise, back-edges, or localized feedback loops. Our previous attempt to quantify DAG-ness as a continuous measure suffered from topological redundancy, where overlapping cyclic penalties artificially deflated scores for networks with minor feedback. In this paper, we resolve these limitations by introducing a strictly orthogonal, 4-dimensional continuous DAG-ness framework. By independently measuring the volume of feedback $A(G)$, the alignment of flow $F(G)$, the macroscopic locality of feedback $M(G)$, and dynamical pathway complexity $S(G)$, the proposed measure eliminates collinearity and the "Dilution Trap." Empirical evaluation on synthetic diagnostic graphs demonstrates enhanced mathematical stability, while deterministic application to classical number-theoretic systems (the Kaprekar and Collatz graphs) confirms the framework's ability to rigorously isolate topological flow from dynamical entrapment. The resulting composite score $D(G)$ provides a highly scalable, interpretable, and mathematically sound metric for structural network analysis.
LLM "agent societies" are studied via demonstrations of emergent consensus or polarization -- with no measurable control parameter, no theory of when each regime appears, and no test of whether an outcome is a genuine social dynamic or a model artifact. We introduce the coupling gain gamma, measured per-agent by counterfactually perturbing a neighbour's stated opinion. (i) gamma is stable and model-distinguishing -- across five frontier models it spans 0.15-0.43 (n=20, 95% CIs <= 0.025), paraphrase-invariant; social-neighbour gamma roughly equals numeric-anchor gamma, so gamma is evidence-coupling, not uniquely social. (ii) Classical dynamics with measured (not assumed) coefficients organise the regime: Friedkin-Johnsen for consensus/pluralism, signed-Laplacian/structural-balance for polarization. (iii) Frontier LLMs do not spontaneously backfire (beta <= 0), so default societies do not self-polarize -- polarization is always induced; the beta>0 branch arises only in the FJ surrogate, never in the agents. (iv) A randomized-initial-condition diagnostic -- the (slope, bias) of final vs. initial opinion -- separates genuine averaging from model-prior artifacts (boundary-censoring ruled out by construction via interior-valued facts); applied to a published "emergent consensus" result (Chuang et al. 2023) it reveals a model-specific conflation: averaging on debatable claims, prior-artifact on settled facts. (v) Coupling is context-dependent: pairwise gamma does not predict multi-neighbour outcomes -- it can order them backwards -- whereas a modality-matched group coupling does (sixteen closed+open models, Pearson r=-0.70, permutation p=0.008). The regime laws take this matched coupling, not the single-neighbour gamma: emergent consensus must be read from coupling in the target interaction. We contribute a measurement protocol and a validity instrument, not new theory.
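As a rough illustration of the Friedkin-Johnsen regime named in (ii), the sketch below iterates the standard update $x(t+1) = \gamma W x(t) + (1-\gamma)\,x(0)$. Treating the measured coupling gain gamma as the Friedkin-Johnsen susceptibility is an assumption made here for illustration only; the abstract does not state how the measured coefficients enter the model, and the weight matrix and opinions below are made up.

```python
import numpy as np

def friedkin_johnsen(W, x0, gamma, steps=200):
    """Iterate x <- gamma * W x + (1 - gamma) * x0 for a row-stochastic W."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    for _ in range(steps):
        x = gamma * (W @ x) + (1.0 - gamma) * x0
    return x

# Three agents, each averaging the other two (illustrative network).
W = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
x0 = np.array([0.0, 0.5, 1.0])
gamma = 0.3                        # hypothetical value inside the reported 0.15-0.43 range
x_final = friedkin_johnsen(W, x0, gamma)
print(x_final)
```

With gamma below 1 the opinions contract toward each other but stay anchored to their initial values, so the fixed point is pluralistic rather than a full consensus.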
Multicultural Singapore hosts overlapping language publics (English, Chinese, and Malay) that discuss the same out-groups in parallel, a natural setting to ask whether online hate shares a structure across languages and whether what a community $\textit{produces}$ is what it $\textit{amplifies}$. From a Singapore-centric 2025 Facebook, Reddit, and YouTube corpus (31.0M items; 1.76M comments mentioning eleven identity groups), we benchmark eight open large language models as hate annotators against a human-adjudicated gold set, adopt the best (Phi-4: accuracy 0.95, Cohen's $\kappa$=0.91, recall 1.00 on an independent manual check), and replicate every finding under a second model. The results converge on one thesis, $\textit{layered cultural contingency}$: cross-lingual divergence falls monotonically as one moves from what a community hates to how and why it hates. Which out-groups are targeted is culturally specific (language $\times$ target $V$=0.25), but the threat frames and the binding moral grammar of hate (sanctity and loyalty, $55-75\%$, not fairness) are far more shared across languages, with divergence dropping to $V$=0.08 for moral foundations and 0.07 for emotion. Hate is contempt-driven and voices an out-group, anti-immigration grievance rather than an anti-system one. Reception is selectively nativist: hateful comments are amplified less than neutral mentions overall, yet anti-immigrant hate is preferentially amplified while religious and anti-LGBTQ hate is not, and volume does not track 2025 Singapore key events. We further show that absolute hate prevalence is not well defined at the LLM-annotator level, with agreement ceilings at $\kappa\approx0.42$ across models, so we report relative structure as primary. The findings bear directly on cross-lingual content moderation.
Epidemiological models often rely on survey data to represent how individuals make health-related decisions, such as whether to vaccinate or adopt protective behaviors. However, repeated large-scale surveys are costly, time-consuming, and limited in the range of scenarios they can capture. In this work, we investigate whether large language models (LLMs) can generate synthetic survey responses that reproduce patterns observed in real populations. Using longitudinal data from the FluPaths surveys, we first identify groups associated with broadly positive or negative attitudes toward vaccination through clustering analysis. We then evaluate several LLMs using a cluster-informed prompting approach to generate synthetic survey responses across multiple epidemic waves. Across models, the synthetic data generally reproduce the distributions of demographic characteristics, vaccination-related beliefs, risk perceptions, and health behaviors observed in the survey data. However, they are less successful at capturing how these factors vary together within respondents. Some models reproduce group-level vaccination trends more reliably than others, although performance varies across waves. We also trained a classifier to distinguish real from synthetic records and found that the generated responses remained identifiable as synthetic. Overall, our findings suggest that LLM-generated survey data may provide a useful tool for exploratory data augmentation and we hope that it could support agent-based epidemic modeling approaches. However, the generated data should not be treated as a substitute for human survey data without further methodological improvements and validation.
Online participation is often measured through visible expression, especially posting, yet many consequential forms of engagement occur through less vocal actions such as liking and following. Here we study how users inhabit Bluesky by reconstructing participation profiles from more than three billion activity records produced by a near-complete sample accounting for more than 80\% of registered users. We aggregate behavior into monthly user-level observations and distinguish two dimensions that are often conflated in platform analytics: intensity, capturing how much users engage, and style, capturing how engagement is expressed across actions. We find that vocal production is highly concentrated, but low-posting behavior does not imply absence from platform participation. High-intensity engagement is most strongly associated with liking rather than posting, while posting-oriented participation is more common among low-intensity users, indicating that visibility and sustained engagement should not be conflated. Transition patterns suggest that high-intensity likers and posters could be described as attractors; network-building redirects users within the active space; whereas observed inactivity acts as a persistent boundary that selectively limits re-entry. Higher-order motifs further show that inactivity often interrupts rather than erases prior regimes, and that low-intensity liking can precede durable high-intensity engagement. These results show that online participation is structured by differentiated low-vocality practices, calling for a shift from post-centered measures of activity toward dynamic accounts of platform presence. We identify a broader challenge for computational social science: platform participation cannot be adequately understood through the behavior of vocal minorities alone.
Graph convolutional networks (GCNs) have demonstrated significant success in capturing complex user-item relationships for collaborative filtering (CF). However, due to their reliance on extensive model training, training-free graph filtering (GF)-based CF methods have emerged as a promising alternative, offering computational efficiency by smoothing graph signals via matrix operations. In particular, polynomial GF-based approaches demonstrate improved accuracy through their ability to design more expressive and flexible filtering functions. Despite these advantages, existing GF methods suffer from a critical memory bottleneck: they necessitate storing the full item similarity graph, incurring prohibitive memory costs for large-scale datasets, which limits their practical applicability. To tackle this challenge, we propose Mem-GF (Memory-efficient GF), a new GF-based CF method that departs from conventional designs by principally leveraging the structure of Krylov subspaces as a core mechanism for approximating polynomial graph filters without explicitly storing the item similarity graph. We theoretically analyze the minimum Krylov subspace size that guarantees lossless approximation. Through extensive experiments, we demonstrate that Mem-GF achieves up to 5.74$\times$ lower memory usage and 4.38$\times$ speedup in runtime, while consistently exceeding the recommendation accuracy of state-of-the-art GF and GCN-based methods. Mem-GF robustly scales to datasets with tens of millions of interactions, establishing itself as a practically viable and theoretically grounded solution for efficient CF.
Visualizing knowledge structures as graphs is common, but making them intuitively understandable remains challenging. Existing methods, such as macroscopic statistical metrics and whole-graph visualizations, often fail to capture local differences in conceptual relationships and suffer from severe visual clutter as networks grow large. To address these limitations, we propose a comparative visualization method that combines Edge-Difference Graphs with network flow analysis. The method first constructs Edge-Difference Graphs by extracting edges unique to each graph from graphs sharing a common node set, reducing redundancy while preserving the overall graph structure. It then identifies diverse paths between specific nodes by solving a minimum-cost maximum-flow problem. By incorporating a cost based on Adamic-Adar similarity, it penalizes routes that pass through generic hub concepts, enabling the extraction of contextually specific paths. We applied the method to networks of 20th-century French philosophers constructed from the French, German, English and Japanese editions of Wikipedia. The results reveal distinctive relational paths that reflect how each linguistic community receives and contextualizes these philosophers. This study provides a framework for the comparative analysis of large-scale knowledge structures and deepens our understanding of cultural and structural differences.
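The reason an Adamic-Adar-based cost discourages routes through generic hubs can be seen from the index itself, which weights each common neighbour $z$ by $1/\log \deg(z)$. The sketch below computes only the standard index; how the authors turn it into an edge cost for the minimum-cost maximum-flow problem is not specified in the abstract, and the toy graph and names are hypothetical.

```python
import math

def adamic_adar(adj, u, v):
    """Adamic-Adar index: sum of 1 / log(degree) over common neighbours of u and v."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

# 'hub' links to four concepts; 'niche' links only to a and b (illustrative graph).
concepts = {
    "a": {"hub", "niche"},
    "b": {"hub", "niche"},
    "c": {"hub"},
    "d": {"hub"},
    "hub": {"a", "b", "c", "d"},
    "niche": {"a", "b"},
}
score_ab = adamic_adar(concepts, "a", "b")
print(score_ab)
```

Here the low-degree node contributes $1/\log 2$ to the score while the hub contributes only $1/\log 4$, so shared specific concepts count for more than shared generic ones.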
Prompted by previous research on strategies for reducing interpersonal conflict and addressing problematic behaviors in online communities, a randomized controlled trial on Reddit compared various responses for reducing the rate of personal insults users post to the site. We generated replies from five deescalation strategies and used an automated procedure for posting them as replies to insulting comments. The findings reveal that automated replies to insults can effectively reduce their rate. Appreciation performed best. Not all strategies performed well, though. We conclude that automated responses are a viable tool for addressing some problematic behaviors. We discuss their potential utility and limitations.
Networks are shaped by competing structural mechanisms, such as communities, geometry, or hubs. In a dynamic network the most predictive mechanism can change, and a model tied to one mechanism, or to fixed weights, cannot adapt as the dominant structure shifts. We develop dynamic Bayesian predictive synthesis for networks, in which a mechanism is an agent forecasting the next snapshot's edges and a synthesis layer combines them with time-varying weights. At each step the method returns a calibrated edge forecast and inference on the mechanism weights, with intervals valid given the fitted agents, so it also reports which mechanism is most informative. Inference of this kind requires a sparse-safe parametrization and an identification theory, under which a single graph identifies and estimates the weights. A sharp threshold separates distinguishable from indistinguishable mechanisms, a change in the active mechanism is tracked at an optimal per-switch cost, and for a single snapshot the method reduces to calibrated link prediction. On real networks, simulations, and benchmarks, the synthesis gives accurate, calibrated forecasts and recovers the leading mechanism when
Collective artificial intelligence, where multiple agents work on shared tasks, holds potential to solve expansive problems in fields from medicine to collective governance. But while prescriptive engineering solutions abound, we lack descriptive scientific understanding of artificial collectives, and therefore principles for how to design resource efficient multi-agent systems. Through systematic experiments with optimizing agents, we characterize how agent interpretive abilities, rationality bounds, and task qualities interact to shape collective performance. Agents range from specialists, with narrow interpretive abilities, to generalists, with broad ones. Collectives of specialists correspond to sparse, centralized networks, while collectives of generalists correspond to dense, decentralized ones. We show that interpretive network properties have small performance effects on average (0.07 standard deviations of performance). However, for specific task qualities, these effects are 4.5 times larger (0.33 sd) and can reach much higher for certain task qualities (1.84 sd). This leads collectives of generalists to perform better on tasks that involve generating, choosing, and coordinating, while collectives of specialists with a few generalist mediators perform better on tasks that involve negotiating. Rationality bounds then moderate these relationships. At loose bounds, specialists outperform generalists through more effective sampling of high-dimensional decision spaces. At tight bounds, generalists outperform specialists through better gradient estimation. A fundamental trade-off between performance and convergence speed emerges at moderate bounds. These findings suggest that multi-agent design could benefit from matching interpretive networks to both task demands and agents' computational limits, with implications for the efficiency and energy costs of multi-agent systems.
Adverse social interactions (ASIs) can shape how online communities evolve over time. However, structure-based ASIs and content-based ASIs are often studied separately and at a single analytical scale. In this study, we propose a multi-level framework to examine how adverse social interactions appear locally, spread through neighborhoods, and disrupt cohesive subgroups. Using large-scale datasets from X and Bluesky, we analyze friend and foe patterns at the micro level, peer influence through matched triadic designs at the meso level, and subgroup disruption against random and recommendation-based references at the macro level. Our results show that structural disconnection and toxic communication provide complementary signals: structural negativity more persistently marks subgroup disruption, while toxic communication captures broader conflict both within and across communities. These findings suggest that adverse social interactions are multi-scale processes that influence how online communities form, fracture, and evolve. Our source code is publicly available at https://github.com/XueqiC/Adverse-Social-Interactions.
Smallholder maize farmers in Uganda continue to face limited market access, weak bargaining power, low price transparency, and heavy reliance on intermediaries. These challenges are compounded by poor produce coordination, delayed payments, and weak visibility into cooperative transactions. This paper presents Farmer Connect, a cooperative-based digital platform designed to support produce management, marketplace coordination, and transparent earnings tracking among farmer groups. The system supports four user roles: administrators, supervisors, farmers, and customers. Its core functions include farmer group management, contribution recording and verification, marketplace listing, order processing, First In First Out based produce allocation, earnings visibility, mobile money payment support, and notification services. The platform was implemented using a mobile-first architecture with cloud-based backend services and an administrative web dashboard. Functional implementation showed that the system was able to support the major workflows required for group-based maize marketing and cooperative coordination, with approximately 85% of identified user requirements implemented. The study shows that cooperative-centered digital platforms can provide a practical framework for improving transparency, coordination, and buyer access for smallholder farmers.
Human collective participation is rarely steady in time: it is bursty, with short episodes of intense activity separated by long quiet intervals. In crisis response and community mobilization, predicting when people act matters as much as predicting whether they act. Such settings are increasingly modeled with LLM-based social simulators, yet these simulators are validated on whether each action is individually plausible, not on whether actions are timed as in reality. Their temporal realism, the degree to which simulated activity reproduces the bursty, heavy-tailed timing of real human systems, thus remains untested. We examine this gap using a multi-year, city-scale log of offline volunteering in Shenzhen that spans the COVID-19 pandemic. Empirically, we establish that bursty timing is common at individual and tracked-group levels, that it is largely endogenous and self-exciting, and that it is amplified by the pandemic rather than produced by daily activity cycles. A standard LLM-only simulator reproduces almost none of this timing: its synchronous schedule has no self-excitation channel, so agents act on a near-regular clock. Guided by these findings, we build a simulator in which a data-calibrated self-excitation channel and a crisis-period regime decide when each agent acts and query the LLM only at those moments, leaving it to decide which task to join and whether to commit. The LLM-only baseline yields no bursty agents (median burstiness $B=-0.14$); a single data-calibrated gate is then sufficient to lift per-agent timing above the burst threshold (median $B\approx0.37$) without degrading LLM content decisions. These results indicate that temporal realism in LLM-based crisis-response simulation is best achieved by decoupling when agents act, governed by an explicit self-excitation and crisis-activation mechanism, from what they do, governed by the LLM.
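The abstract does not define the burstiness score $B$. A common choice, assumed here, is the Goh-Barabási coefficient $B = (\sigma - \mu)/(\sigma + \mu)$ computed from the mean $\mu$ and standard deviation $\sigma$ of inter-event times. Under that assumption a perfectly regular clock gives $B = -1$, a Poisson process gives $B \approx 0$, and heavy-tailed timing pushes $B$ toward 1. The event times below are made up for illustration.

```python
import numpy as np

def burstiness(event_times):
    """Goh-Barabasi burstiness of a sequence of event timestamps (assumed definition)."""
    gaps = np.diff(np.sort(np.asarray(event_times, dtype=float)))
    mu, sigma = gaps.mean(), gaps.std()
    return (sigma - mu) / (sigma + mu)

regular = np.arange(10.0)                                # one action per time unit
clustered = [0, 1, 2, 3, 100, 101, 102, 103, 200]        # short bursts, long pauses
b_regular = burstiness(regular)
b_clustered = burstiness(clustered)
print(b_regular, b_clustered)
```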
Code evolution yields methods with higher accuracy and efficiency than standard designs when tested across 580 networks from varied fields.
The problem of predicting links in complex networks appears in different disciplines and has led to a variety of ingenious human-designed methods. We use this rich program space to explore the performance and behavior of automated code-evolution systems tasked to obtain machine-designed methods for link prediction. Despite being trained on limited data, algorithms evolved through code evolution outperform human-designed methods (with an average AUC score of 0.915 vs. 0.783, computed over 580 networks) and show improved computational efficiency, allowing them to be applied to networks with millions of links. The discovered methods follow approaches that have been employed in human-designed methods, but contain key innovations in the selection and combination of node- and link-features. This illustrates the role modern large language models and genetic algorithms can play in algorithmic innovation and scientific discovery more generally.
Many real-world networks are incomplete, making link prediction a fundamental challenge in network science. To train parameters and evaluate algorithms, observed links are usually divided into three subsets, namely training, validation, and probe sets. This division implicitly involves two sampling processes: first-stage sampling yields the probe set and second-stage sampling obtains the validation set. To date, our understanding of how these two sampling processes affect algorithm performance remains quite limited. To address this issue, we propose a sampling scheme called $\beta$-sampling, where the sampling probability of a link is proportional to the product of the degrees of its two endpoints raised to the power of $\beta$. Experiments on 45 real-world networks reveal that the structural characteristics of missing links, as simulated via varying probe sets, substantially impact prediction accuracy. When missing links tend to connect high-degree nodes, such links can be predicted accurately with ease. Furthermore, even with a fixed probe set, second-stage sampling still exerts a significant influence on prediction accuracy. Notably, the optimal second-stage sampling strategy differs from \textit{random sampling} (which randomly selects links to form the validation set) and \textit{consistent sampling} (which guarantees that links in the validation and probe sets share identical structural characteristics).
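A minimal sketch of $\beta$-sampling follows, taking the selection probability of link $(u, v)$ to be proportional to $(k_u k_v)^{\beta}$ as stated above. Drawing the probe set without replacement through NumPy's sequential weighted sampling is an assumption, as are the function names and the toy edge list; the abstract does not say how multiple links are drawn.

```python
from collections import Counter

import numpy as np

def beta_sampling_probs(edges, beta):
    """Per-link selection probabilities proportional to (k_u * k_v) ** beta."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    weights = np.array([(degree[u] * degree[v]) ** beta for u, v in edges], dtype=float)
    return weights / weights.sum()

def sample_links(edges, beta, size, rng):
    """Draw `size` distinct links using the beta-sampling probabilities."""
    probs = beta_sampling_probs(edges, beta)
    chosen = rng.choice(len(edges), size=size, replace=False, p=probs)
    return [edges[i] for i in chosen]

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]   # illustrative network
probs_uniform = beta_sampling_probs(edges, 0)       # beta = 0 recovers random sampling
probs_hubs = beta_sampling_probs(edges, 1)          # beta > 0 favours high-degree endpoints
probe = sample_links(edges, 1, 2, np.random.default_rng(0))
print(probs_uniform, probs_hubs, probe)
```

Setting $\beta < 0$ in the same function shifts the probe set toward links between low-degree nodes.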
As individuals turn to the Internet to find answers to questions they may have, several Question Answering (QA) forums have evolved, where users knowledgeable in certain topics can contribute their expertise to answering these requests for information. While these are currently volunteer based, we consider a future version employing knowledge workers who are experts in certain topics. In such a system, the request-answer processes forming the queuing system may utilize schedulers that assign requests in different topics to the experts in the forum, who may be able to answer them according to their expertise levels in different topics. With this model, we calculate the capacity of the system for handling the requests while keeping the system stable, and design schedulers that achieve capacity. We also investigate how collaboration between experts in answering requests can potentially increase capacity.
We present a multilingual computational discourse analysis of how language constructed the algorithmic consecration of Vozinha, the 40-year-old Cape Verde goalkeeper, after Spain 0-0 Cape Verde at the 2026 FIFA World Cup. The study contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline combining LLM-assisted suggestion with human validation; and an analysis of cross-lingual narrative diffusion across discourse phases. We treat the platform follower count itself, narrated as "50k to 8M", as a linguistic object: a circulating and narratable proof of visibility rather than a mere measurement. The follower-growth timeline is used only as contextual metadata: we reconstruct a conservative phase structure, not a continuous API-native series, and type every datapoint by value class, confidence, and evidence type. The only exact primary scraper anchor is 8,235,652 followers at 2026-06-16 15:47 UTC; all other figures are reported as estimated ranges or thresholds, including an estimated pre-match baseline of 45k-56k. Findings suggest that distinct languages carried distinct frames: Portuguese mobilization, Spanish crisis, English nation-making, and a shared platform-metric spectacle through which peripheral athletic performance became globally visible. As a v0.1 pilot, the paper releases the corpus schema, frame taxonomy, annotation guidelines, hashed visual-evidence log, and typed timeline, while flagging full double annotation and inter-annotator agreement as planned work.
From ancient Mesopotamia to modern cities, dense human settlements coincide with bursts of economic productivity, cultural innovation, and social change. But how does packing people more tightly together alter social organization in ways that reshape collective outcomes? Here, I use a minimal agent-based model to isolate the effect of population density, holding population size and individual behavior fixed while varying only how closely individuals are placed in space. In the model, individuals form social ties gradually, favoring those nearby and those already well-connected. Under these simple rules, varying population density alone is sufficient to reorganize social network structure: sparse populations develop locally clustered communities, while denser ones form globally integrated networks with shorter social distances and a tightly interconnected core of popular individuals. This structural transition occurs sharply over a narrow range of densities and is governed by whether physical proximity or social popularity dominates tie formation. Simulating contagions on these networks reveals that the consequences of this shift depend on what is spreading. Simple contagions (e.g., information or disease) reach a majority of individuals more quickly in denser populations. Complex contagions (e.g., social norms or collective behaviors) do not spread faster, but instead achieve broader and more reliable adoption as density increases. Together, these results show that population density can act as a structural force independent of the economic and behavioral mechanisms typically invoked to explain why cities are engines of change.
Grassroots platforms keep ownership with users through cryptographic signatures instead of corporate servers.
Legal precedents protect computer code as copyrightable expression. They have enabled centralized digital platforms -- operating from corporate servers that hold all user data -- to construct private governance regimes through the interaction of copyright, contract, and technical architecture: people who create virtually all platform value must surrender effective copyright control through Terms of Service agreements as a condition of participation.
In contrast, grassroots platforms consist of cryptographically-identified people operating their networked smartphones independently of any server or global resource; each person holds their own data on their own device, with no third party in possession or intermediation. Here, we define the notion of a digital speech act -- a deliberate volitional act by a person of cryptographically signing personal content with the person's private key, carried out on the person's own device -- through which the person simultaneously establishes attribution, accountability, and authorship over the signed content. We contend that (i) digital speech acts qualify for copyright protection under existing U.S. precedent: Burrow-Giles locates authorship in volitional creative choices despite mechanical or algorithmic processes, Feist supplies the minimal-creativity threshold, and persistent device storage satisfies the Copyright Act's fixation requirement; (ii) the digital social contract underlying grassroots platforms preserves this copyright by design -- signed content cannot be unbundled from its signature, and the full provenance chain accumulates as content is forwarded -- so that copyright ownership and physical possession of authenticated digital expressions coalesce in the person; and (iii) this coalescence of legal ownership and physical possession provides the foundations for digital sovereignty and democratic self-governance.
How vulnerable are online social networks to adversaries who seek to amplify opinion polarization by manipulating opinions, and how difficult is it to mitigate such manipulation? Existing studies have examined this question using mathematical models of opinion dynamics. While these models offer valuable theoretical insights, they rely on simplified assumptions about interactions, message content, and opinion updates, limiting the adversarial strategies they can capture and the applicability of their findings to real-world settings. Large language model (LLM)-based simulations provide a richer alternative: agents can be assigned diverse personas, communicate through natural language, and respond to persuasive or adversarial content in a context-dependent way. This enables the study of manipulation strategies that are difficult to represent using classical mathematical models. To the best of our knowledge, this study provides the first systematic analysis of polarization amplification and mitigation in an LLM-based simulated social network framework.
In our framework, LLM agents with diverse personas interact over a social network by exchanging natural language posts and updating their opinions accordingly. We show that even an adversary with a limited manipulation budget can considerably increase polarization. We then study two classes of defense mechanisms: reactive mitigations, which assign specific users to actively counter manipulation, and proactive interventions, which increase resistance through general mechanisms not tied to particular users. Our results show that although these mechanisms reduce the impact of adversarial attacks, they generally do not restore the network to its baseline polarization state. These findings suggest that neither approach fully overcomes the vulnerability of the network, highlighting the potential risk of such attacks.
30-year analysis of 166 countries finds distance increases reliance on performance signals for partner choice.
Researchers have long suspected that international research collaboration (IRC) and scientific and technological (S\&T) performance are subject to reciprocal causality, yet the endogenous co-evolution of these twin phenomena has yet to be tested by large-scale empirical analysis. This study tests IRC network effects on national research performance and vice versa simultaneously using a longitudinal co-evolution model on three decades of global network and national performance data. Stochastic actor oriented models (SAOM) are used to analyze data on 166 countries from 1993 to 2022. Yearly IRC networks are constructed from Web of Science's XML database, and performance data are gathered from Elsevier's fractional field-weighted citation index (FWCI). The models also account for geographic, economic, demographic, and political factors, as well as endogenous network processes. The results provide support for reciprocal co-evolution. However, notably, geographic distance appears to moderate the interaction between research performance and network dynamics, suggesting researchers may rely more on visible performance metrics when selecting geographically distant collaborators. This finding points to the role of citation based performance metrics as a signaling mechanism for collaborator selection.
The rapid growth of the genealogical sector, spanning platforms with billions of records and millions of users, has produced some of the largest and most complex networks available for analysis. Despite substantial advances in genealogical network research, it remains unclear whether human kinship networks exhibit universal structural properties. We address this by developing an integrated approach to genealogical network analysis that combines network-theoretic structure with an inferred notion of time. Using over one hundred datasets from the Kinsources repository, we reinterpret standard network measures in genealogical terms and introduce \emph{pseudogenerations}, a method for extracting temporal structure directly from network topology.
Within this framework, we identify common features shared across datasets. We find that genealogical networks exhibit scale-free--like degree and component-size distributions, multiscale family organization, and small-world behavior with respect to genetic and union-based distances. We show that 2-components provide a natural unit of genealogical structure, observe consistent disassortative mixing, and find that recorded unions are strongly biased toward short genetic distances relative to potential pairings. We also document temporal and demographic patterns, including shifts in recorded parental and child information, as well as correlations among recorded unions, parents, and children. These results suggest that diverse genealogical datasets share a common set of structural and temporal characteristics, providing evidence for universal features of human kinship networks and establishing a general framework for their comparative analysis.
Analysis of Alzheimer's support exchanges shows LLMs use less first-person past language than humans, creating a narrative authenticity gap.
Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs -- LLaMA, GPT-4o-mini, and MedGemma -- we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.
Social media posts often include misinformative or misleading content, diminishing the expected credibility of content feeds. We present an optimization-based method to improve the credibility of news content on social media feeds by refining existing content rankings. This method is based on a dual-objective optimization approach that minimizes the Spearman's footrule distance to the original ranking to maintain the original content order while incorporating an additional linear cost objective to elevate the expected credibility of the content feed. Additionally, we propose a robust semi-automated pipeline for assigning credibility scores to content based on a mixture of retrieval-augmented score assignments and human-generated fact-checks. This semi-automated pipeline helps ground the credibility assignment using human-generated labels while ensuring the algorithm extends to posts with few or no human-generated labels. We showcase our approach through an experimental setup using real-world data collected over X (Twitter), where we assign the credibility scores based on a mixture of user-generated community notes and retrieval augmented generation. The method we present leads to at most 7% deviation in both optimization objectives from the Pareto optimal front with known initial ranking values. Additionally, the algorithm allows for incorporating different measures for source credibility, making it applicable across various social media platforms.
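For reference, Spearman's footrule distance between two rankings of the same items is the sum, over items, of the absolute difference between their positions. The sketch below computes only that distance; it does not reproduce the paper's dual-objective optimization, and the function name `footrule_distance` is hypothetical.

```python
def footrule_distance(original, reranked):
    """Spearman's footrule: sum of absolute position shifts between two rankings.

    Both arguments are lists containing the same items, ordered from the
    top of the feed to the bottom.
    """
    if set(original) != set(reranked) or len(original) != len(reranked):
        raise ValueError("rankings must be permutations of the same items")

    # Position of each item in the reranked feed.
    new_position = {item: i for i, item in enumerate(reranked)}
    return sum(abs(i - new_position[item]) for i, item in enumerate(original))


if __name__ == "__main__":
    feed = ["a", "b", "c", "d"]
    print(footrule_distance(feed, ["b", "a", "c", "d"]))  # swapping the top two posts
    print(footrule_distance(feed, ["d", "c", "b", "a"]))  # fully reversed feed
```

A small distance means the refined feed stays close to the original ordering, which is the quantity traded off against the credibility objective.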
The total biharmonic distance, which is the sum of the biharmonic distance between every pair of nodes in a network, is a key metric for evaluating network connectivity and robustness. In this paper, we study the problem of minimizing the total biharmonic distance by adding $k$ nonexistent edges for a given graph $G$ and budget $k$. The problem is computationally challenging. We show that the objective function of the problem is monotone but not supermodular. To solve this problem, we propose simple greedy algorithms with cubic time complexity. To mitigate the high time complexity of these greedy algorithms, we apply several techniques, including the projection method, the Laplacian solver, and convex hull approximation. These techniques reduce the time complexity of our proposed algorithms from cubic to nearly linear while providing error guarantees. Finally, extensive experiments on real datasets demonstrate both the efficiency and effectiveness of our proposed algorithms.
Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.
TT-SR matches or beats baselines on synthetic benchmarks by typing edges from separate sender and receiver roles on each node.
Directed community detection is challenging because edge directions encode asymmetric source-target relations. Most directed modularity and random-walk methods assign one label to each vertex, whereas recent bimodularity-based methods cluster directed edges more freely. We propose TT-SR, a Two-Tier Sender-Receiver framework that lies between these two viewpoints. Each vertex is assigned a sender role and a receiver role, and each directed edge receives the type induced by the sender role of its source and the receiver role of its target. Thus, TT-SR is more expressive than one-label vertex clustering while remaining more interpretable than unrestricted edge clustering. The method generates candidate sender-receiver assignments from count-residual, stationary-flow, degree-corrected, and order-score views. The candidates are refined by local role updates and selected by a two-tier rule: a degree-corrected profile score provides the primary structural criterion, while Bernoulli density and order-flow scores are used only as secondary ranking signals. We justify the main spectral views through sender-receiver modularity relaxations and interpret the degree-corrected score as a likelihood-based residual comparison. Experiments on pathway-type, co-block, and ordered-flow synthetic benchmarks show that TT-SR achieves the strongest or essentially tied strongest edge-community recovery across three scale settings. The gains are most pronounced on degree-corrected co-block and ordered-flow graphs. Real-network diagnostics further indicate that TT-SR aligns well with Email-Eu-core metadata and extracts strong sender-receiver bicommunity summaries on unlabeled directed networks.
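The edge-typing rule stated in this abstract (an edge's type is the pair formed by the sender role of its source and the receiver role of its target) can be sketched as follows. This covers only the typing step under given role assignments; candidate generation, refinement, and the two-tier selection rule are not reproduced, and the names `edge_types` and `type_counts` are hypothetical.

```python
from collections import Counter


def edge_types(edges, sender_role, receiver_role):
    """Map each directed edge (u, v) to the type (sender_role[u], receiver_role[v])."""
    return {(u, v): (sender_role[u], receiver_role[v]) for u, v in edges}


def type_counts(edges, sender_role, receiver_role):
    """Count how many directed edges fall into each (sender, receiver) type."""
    return Counter(edge_types(edges, sender_role, receiver_role).values())


if __name__ == "__main__":
    # Toy directed graph with illustrative role labels (not from the paper).
    edges = [("a", "b"), ("b", "c"), ("c", "a")]
    sender = {"a": 0, "b": 0, "c": 1}
    receiver = {"a": 1, "b": 0, "c": 0}
    print(edge_types(edges, sender, receiver))
    print(type_counts(edges, sender, receiver))
```

Because a vertex's sender and receiver roles may differ, its outgoing and incoming edges can land in different edge communities, which a single vertex label cannot express.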