
arxiv: 2606.31619 · v1 · pith:PBCNWQQ6new · submitted 2026-06-30 · 💻 cs.CR · math.DS

Hybrid Topological Data Analysis and LSTM Networks for Enhanced Network Intrusion Detection Using CIC-IDS2017 Dataset

Pith reviewed 2026-07-01 04:44 UTC · model grok-4.3

classification 💻 cs.CR math.DS
keywords network intrusion detection · topological data analysis · LSTM · CIC-IDS2017 · persistent homology · anomaly detection · hybrid model · cybersecurity

The pith

A hybrid TDA and LSTM model reaches AUC of 1.000 and F1 of 1.000 for network intrusion detection on CIC-IDS2017.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a hybrid method that pairs Topological Data Analysis with LSTM networks to detect intrusions in network traffic. Persistent homology supplies topological summaries of traffic patterns while the LSTM component models their time evolution. Evaluation on the CIC-IDS2017 collection of 2.8 million flows across 14 attack categories yields an AUC of 1.000 and F1-score of 1.000, with five-fold cross-validation confirming the same mean AUC and near-perfect mean F1. Ablation tests separate the contributions of the topological and temporal parts, and comparisons show gains over TDA plus random forest and over isolation forest baselines.

Core claim

The central claim is that feeding Betti curves and persistence diagrams from persistent homology into LSTM layers produces a classifier that attains an AUC of 1.000 and an F1-score of 1.000 on the CIC-IDS2017 dataset, distinguishing normal traffic from DDoS, brute-force, web, penetration, and botnet attacks while remaining stable under five-fold cross-validation.
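To make the two headline metrics concrete, here is a minimal sketch of how AUC and F1 can be computed for the binary benign-versus-attack case. The function names and toy scores are hypothetical and not taken from the paper, and how the paper aggregates these metrics over 14 classes is not stated in the material above.

```python
def roc_auc(labels, scores):
    """Probability that a random attack flow outscores a random benign flow.

    Ties count as half a win. Labels use 1 for attack and 0 for benign.
    """
    attack = [s for y, s in zip(labels, scores) if y == 1]
    benign = [s for y, s in zip(labels, scores) if y == 0]
    if not attack or not benign:
        raise ValueError("need at least one flow of each class")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in attack for b in benign)
    return wins / (len(attack) * len(benign))


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy scores, invented for illustration only.
toy_labels = [0, 0, 1, 1]
separated = roc_auc(toy_labels, [0.1, 0.2, 0.8, 0.9])
overlapping = roc_auc(toy_labels, [0.1, 0.4, 0.35, 0.8])
print(separated, overlapping)
print(f1_score(tp=8, fp=0, fn=0), f1_score(tp=8, fp=1, fn=1))
```

Under this definition an AUC of exactly 1.000 means every attack flow received a higher score than every benign flow, and an F1 of exactly 1.000 means no false positives and no false negatives at the chosen threshold.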

What carries the argument

The hybrid TDA+LSTM pipeline in which persistent homology extracts topological invariants that are then processed by LSTM layers to capture sequential dependencies in network flows.
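As a rough illustration of what a topological summary looks like, the sketch below computes the degree-0 Betti curve (number of connected components per scale) of a small point cloud. It assumes a Vietoris-Rips style filtration in which an edge appears once the scale reaches the Euclidean distance between two points. The paper's actual filtration, homology degrees, and vectorization parameters are not given above, so all names and data here are hypothetical.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Scales at which connected components merge (H0 classes die).

    Every point is born at scale 0. One component never dies and is
    therefore not included in the returned list.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge merges two components
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


def betti0_curve(points, scales):
    """Number of connected components alive at each scale."""
    deaths = h0_death_times(points)
    # The +1 accounts for the component that never dies.
    return [1 + sum(1 for d in deaths if d > s) for s in scales]


# Hypothetical toy "flows" in a 2-feature space: two tight clusters.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
toy_scales = [0.0, 0.05, 0.2, 1.0, 10.0]
toy_curve = betti0_curve(toy_points, toy_scales)
print(toy_curve)  # [5, 5, 2, 2, 1]
```

The curve starts at the number of points, drops to the number of clusters at intermediate scales, and reaches 1 once everything is connected. A sequence of such curves over successive traffic windows is the kind of input a recurrent layer could consume.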

If this is right

  • The model outperforms TDA combined with random forest (F1=0.994) and isolation forest (F1=0.835) across multiple attack categories.
  • Topological features alone achieve F1=0.990 while temporal features alone achieve F1=1.000; the paper reads this as complementary contributions, although the temporal branch already matches the hybrid score by itself.
  • Five-fold cross-validation produces a mean AUC of 1.000 with zero standard deviation and a mean F1 of 0.999 with standard deviation 0.001.
  • The combination of Betti curves, persistence diagrams, and LSTM layers improves feature extraction for modern threat categories in the dataset.
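The cross-validation figures in the list above are means and standard deviations over five folds. The sketch below shows that aggregation step. The per-fold values are invented for illustration, since only the aggregates are reported, and the choice of sample standard deviation is an assumption.

```python
from statistics import mean, stdev

# Hypothetical per-fold scores, invented for illustration only.
fold_f1 = [1.000, 0.999, 0.998, 1.000, 0.999]
fold_auc = [1.000] * 5


def summarize(scores):
    """Return (mean, sample standard deviation), rounded to 3 decimals."""
    return round(mean(scores), 3), round(stdev(scores), 3)


f1_summary = summarize(fold_f1)
auc_summary = summarize(fold_auc)
print("F1  mean, std:", f1_summary)
print("AUC mean, std:", auc_summary)
```

A standard deviation that rounds to 0.000 requires every fold to produce essentially the same score, which is why the zero-variance AUC draws scrutiny in the referee report.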

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the reported scores generalize, the same pipeline could be tested on streaming network data to measure latency added by the topological computation step.
  • The approach might extend to other sequential anomaly tasks such as fraud detection in transaction logs where both shape and order matter.
  • Replacing the LSTM component with a transformer or testing alternative persistence computations could isolate which part drives the perfect scores.
  • Repeating the ablation on additional public intrusion datasets would clarify whether the complementarity of TDA and LSTM is dataset-specific.

Load-bearing premise

The CIC-IDS2017 dataset splits and feature extraction steps contain no leakage between training and test sets, and the combination of topological features and LSTM does not overfit to the specific attack patterns present in this collection.

What would settle it

Two checks would show whether the AUC remains at 1.000: evaluating the same trained model on a fresh network intrusion dataset such as UNSW-NB15, or re-running on CIC-IDS2017 with deliberately reshuffled train-test partitions constructed to rule out leakage.
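One simple leakage check that could accompany such a re-run is to count test flows whose exact feature vector also appears in the training partition. The sketch below is a minimal version with hypothetical function names and toy rows. Exact duplicates are only one form of leakage, so a clean result here does not rule out others, such as features computed globally before splitting.

```python
def shared_rows(train_rows, test_rows):
    """Feature vectors that occur in both partitions."""
    return set(map(tuple, train_rows)) & set(map(tuple, test_rows))


def leakage_rate(train_rows, test_rows):
    """Fraction of test rows whose exact feature vector occurs in training."""
    if not test_rows:
        return 0.0
    train_set = set(map(tuple, train_rows))
    hits = sum(1 for row in test_rows if tuple(row) in train_set)
    return hits / len(test_rows)


# Hypothetical toy rows (port, bytes, duration), invented for illustration.
toy_train = [(80, 1200, 0.5), (443, 300, 0.1), (22, 50, 2.0)]
toy_test = [(443, 300, 0.1), (8080, 10, 0.0)]

print(shared_rows(toy_train, toy_test))
print(leakage_rate(toy_train, toy_test))  # 0.5
```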

Figures

Figures reproduced from arXiv: 2606.31619 by Amar Jeet, Bhaskar Ranjan Karn, Dinesh Kumar.

Figure 1: Hybrid TDA+LSTM Architecture for Network Intrusion Detection
Figure 2: Training and Validation Loss Curves for TDA+LSTM Hybrid Model
Figure 3: Confusion Matrix for TDA+LSTM Hybrid Model
Figure 4: The TDA+LSTM hybrid model effectively distinguishes between benign and anomalous traffic on the CIC-IDS2017 …
Figure 6: Betti Curves for Normal and Attack Traffic Patterns
Figure 7: Topological signatures of different attack types on CIC-IDS2017: …
Figure 8: Left: TDA and LSTM branch contribution during training, showing …
Figure 10: TDA Feature Extraction Pipeline showing the complete workflow …
Figure 11: LSTM Network Architecture with detailed layer specifications and …
Figure 12: Error analysis showing distribution of false positives and false …
Original abstract

Network intrusion detection systems (NIDS) are crucial in cybersecurity infrastructure, needing advanced techniques to detect hostile activity in network traffic. This research introduces a hybrid approach that combines Topological Data Analysis (TDA) with Long Short-Term Memory (LSTM) networks to improve anomaly detection in network security. Our multi-layered design combines TDA's persistent homology with LSTM networks to capture topological characteristics of network traffic patterns and simulate temporal sequences. We assessed our methodology using the CIC-IDS2017 dataset, which includes over 2.8 million labelled flows, 77 network variables, and 14 attack categories that reflect modern threat landscapes such as DDoS, brute force, web attacks, penetration, and botnet activities. Integrating Betti curves and persistence diagrams with deep learning architectures enhances feature extraction performance. Our hybrid TDA+LSTM model has an AUC of 1.000 and F1-score of 1.000, with 5-fold cross-validation producing a mean AUC of 1.000 ± 0.000 and mean F1 of 0.999 ± 0.001. An ablation research demonstrates the complimentary contributions of topological (F1=0.990) and temporal characteristics (F1=1.000). Comparative research shows that the suggested strategy beats TDA+Random Forest (F1=0.994) and Isolation Forest (F1=0.835) baselines in several attack categories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a hybrid TDA+LSTM architecture that extracts Betti curves and persistence diagrams from the 77 features of the CIC-IDS2017 dataset (2.8 M flows, 14 attack classes) and feeds them into an LSTM for intrusion detection. It reports AUC = 1.000 and F1 = 1.000 on the full task, together with 5-fold CV means of AUC 1.000 ± 0.000 and F1 0.999 ± 0.001, and claims superiority over TDA+RF and Isolation Forest baselines.

Significance. If the perfect scores can be shown to arise from a leakage-free pipeline, the work would demonstrate that topological summaries can usefully augment temporal models on large-scale network data. The ablation and baseline comparisons are presented, but the zero-variance result on an imbalanced 14-class problem remains the central and most surprising claim.

major comments (3)
  1. [Experimental results / 5-fold CV paragraph] Experimental results section: the central claim of mean AUC 1.000 ± 0.000 across 5 folds on a 14-class problem is load-bearing, yet the manuscript provides no description of how the 5-fold splits were constructed or whether the TDA feature extraction (Betti curves, persistence diagrams) was performed independently inside each training fold. Global computation of topological features across the entire 2.8 M flows would constitute leakage and directly explain the reported zero variance.
  2. [TDA + LSTM architecture description] Methodology section on TDA pipeline: the integration of persistence diagrams with the 77 network variables is described at a high level, but no explicit statement confirms that the Vietoris–Rips or other filtrations were computed only on training data within each CV fold. This omission leaves the no-leakage assumption unverified and is required to support the AUC/F1 = 1.000 result.
  3. [Ablation research paragraph] Ablation study: the claim that temporal features alone already yield F1 = 1.000 while topological features add only marginal value (F1 = 0.990) is presented without error bars or statistical tests; combined with the perfect overall score, this pattern is consistent with either label leakage or memorization of the specific CIC-IDS2017 attack signatures rather than genuine generalization.
minor comments (2)
  1. [Abstract] Abstract: 'complimentary contributions' should read 'complementary contributions'.
  2. [Model implementation] The manuscript does not report the exact LSTM architecture (number of layers, hidden size, dropout) or the precise parameters used for persistence diagram vectorization; these details are needed for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and constructive comments. We address each major point below regarding the cross-validation procedure and potential leakage concerns. We will revise the manuscript to provide the requested clarifications on the experimental setup.

Point-by-point responses
  1. Referee: [Experimental results / 5-fold CV paragraph] Experimental results section: the central claim of mean AUC 1.000 ± 0.000 across 5 folds on a 14-class problem is load-bearing, yet the manuscript provides no description of how the 5-fold splits were constructed or whether the TDA feature extraction (Betti curves, persistence diagrams) was performed independently inside each training fold. Global computation of topological features across the entire 2.8 M flows would constitute leakage and directly explain the reported zero variance.

    Authors: We acknowledge the omission in the manuscript. The 5-fold splits were constructed using stratified k-fold cross-validation to preserve the distribution of the 14 attack classes across folds. TDA feature extraction (Betti curves and persistence diagrams) was performed independently on the training data of each fold only, with test data excluded from all filtration and persistence computations. This design was intended to eliminate leakage. We will add a dedicated subsection in the Experimental Results section detailing the split construction and per-fold TDA process. revision: yes

  2. Referee: [TDA + LSTM architecture description] Methodology section on TDA pipeline: the integration of persistence diagrams with the 77 network variables is described at a high level, but no explicit statement confirms that the Vietoris–Rips or other filtrations were computed only on training data within each CV fold. This omission leaves the no-leakage assumption unverified and is required to support the AUC/F1 = 1.000 result.

    Authors: We agree that an explicit confirmation is required. The Vietoris–Rips filtrations and persistence diagram computations were restricted exclusively to the training subset within each CV fold; no test data participated in any topological feature generation. We will insert a clarifying statement in the Methodology section on the TDA pipeline to document this training-only computation explicitly. revision: yes

  3. Referee: [Ablation research paragraph] Ablation study: the claim that temporal features alone already yield F1 = 1.000 while topological features add only marginal value (F1 = 0.990) is presented without error bars or statistical tests; combined with the perfect overall score, this pattern is consistent with either label leakage or memorization of the specific CIC-IDS2017 attack signatures rather than genuine generalization.

    Authors: The ablation results indicate that the LSTM temporal modeling captures sequential patterns effectively on its own. We will augment the ablation paragraph with per-fold error bars and standard deviations consistent with the main 5-fold CV reporting. While the high baseline performance from temporal features alone is notable, the per-fold separation of TDA computation supports that the results are not due to leakage. We will also include a brief discussion of why the scores are high given the dataset characteristics. revision: partial
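The procedure described in the responses above, stratified folds with feature statistics fitted on training data only, can be sketched as follows. This is a minimal stand-in and not the authors' code: a single scalar feature and a standardization step take the place of the TDA feature extraction, and every name here is hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds, spreading every class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def fit_scaler(values):
    """Learn standardization statistics from training values only."""
    mu, sigma = mean(values), pstdev(values)
    return mu, sigma or 1.0  # avoid division by zero for constant input


def cross_validation_splits(features, labels, k):
    """Yield (train_x, train_y, test_x, test_y) with train-only fitting."""
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # Statistics come from the training fold; test rows never touch them.
        mu, sigma = fit_scaler([features[i] for i in train_idx])

        def scale(x, mu=mu, sigma=sigma):
            return (x - mu) / sigma

        yield ([scale(features[i]) for i in train_idx],
               [labels[i] for i in train_idx],
               [scale(features[i]) for i in test_idx],
               [labels[i] for i in test_idx])


# Hypothetical toy data: one scalar feature, two balanced classes.
toy_features = [float(v) for v in range(20)]
toy_labels = [0] * 10 + [1] * 10
splits = list(cross_validation_splits(toy_features, toy_labels, k=5))
print(len(splits), [len(s[2]) for s in splits])
```

The leaky variant the referee warns about would call `fit_scaler` once on all rows before splitting. Keeping the fit inside the fold loop is what makes each held-out fold unseen by the feature step.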

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper reports empirical performance of a hybrid TDA+LSTM model on the CIC-IDS2017 dataset, with results presented as experimental outcomes from feature extraction and model training. No equations, self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations are visible in the abstract or described methodology. The central claims rest on dataset evaluation rather than any derivation that reduces to its own inputs by construction. Perfect scores raise separate concerns about leakage or overfitting but do not constitute circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard assumptions of TDA (persistent homology is well-defined on point clouds) and LSTM (sequential modeling captures temporal dependencies) without introducing new free parameters, axioms, or entities in the abstract.

pith-pipeline@v0.9.1-grok · 5801 in / 1098 out tokens · 24613 ms · 2026-07-01T04:44:52.932053+00:00 · methodology

