Most network traffic is ordinary. That is what makes intrusion detection difficult. A model can perform well on aggregate accuracy and still fail at the task that matters most: catching the attacks that appear least often. In a security system, this is not a small evaluation detail. It is the difference between a model that looks good and a model that is useful.
The question I wanted to explore was simple: what happens when the most important class is the class the model sees least?
This project became an attempt to build around that failure mode: clean the data carefully, learn a compressed representation of traffic behavior, preserve the original packet-level evidence, and evaluate the model through the classes that accuracy can hide.
Accuracy can hide the failure that matters most
Accuracy is a dangerous comfort metric in intrusion detection. If most traffic is benign, a model can learn the majority class well and still remain weak on rare attack classes. The headline number may look strong, but the security question is narrower and harder: did the model catch the behavior that appears least often?
For this reason, the first useful question was not "how high is the accuracy?" It was "where does the model fail when the class distribution is heavily skewed?"
In this project, the confusion matrix and per-class metrics are not secondary visuals after a final score. They are the main evidence. They show where the model succeeds, where it confuses one class for another, and where rare failures might disappear behind the average.
The held-out test set is dominated by Normal and DDoS traffic. Botnet and PortScan are the classes that can disappear behind aggregate accuracy.
This is the first reason the article starts with distribution instead of architecture. Before discussing the model, the data already tells us what kind of evaluation will be honest. If a rare attack class occupies only a small part of the test set, then one aggregate score can easily make the model look more reliable than it actually is.
Overall accuracy is dominated by the majority classes. Per-class recall shows whether the rare attacks were actually recovered.
The point is not that accuracy is useless. It is that accuracy is incomplete. It answers one question: how often was the model right overall? Intrusion detection needs a harder question: which attacks did the model miss, and how costly are those misses?
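To make the gap between the two questions concrete, here is a minimal sketch in plain Python. The class counts and the predictions are invented for illustration and are not the project's test set; the helper names `accuracy` and `per_class_recall` are my own.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions for a class / true members of it."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}

# Illustrative counts only, not the project's data.
y_true = ["Normal"] * 900 + ["DDoS"] * 80 + ["Botnet"] * 20
# A hypothetical model that handles the common classes and misses every Botnet flow.
y_pred = ["Normal"] * 900 + ["DDoS"] * 80 + ["Normal"] * 20

overall = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)

print(f"accuracy = {overall:.3f}")
for label, value in recall.items():
    print(f"recall[{label}] = {value:.3f}")
```

In this toy case the model scores 0.98 overall while its Botnet recall is 0.0. The headline number and the security answer point in opposite directions.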
The dataset was already telling a story
The first design decision was not architectural. It was about hygiene.
Raw security datasets often contain messy labels, uneven class populations, redundant columns, and feature values that need to be normalized before a model can learn anything meaningful. If that foundation is weak, the rest of the pipeline becomes hard to trust. A sophisticated model cannot compensate for unclear labels or corrupted tensors.
The preprocessing stage mapped the raw traffic labels onto the target classes, removed unusable features, scaled the numerical values, and saved the tensors needed by later phases. That step sounds mundane, but it is what makes the later model evaluation interpretable.
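The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the project's preprocessing code: the `LABEL_MAP` entries, the rule that "unusable" means constant or non-finite, and the choice of zero-mean, unit-variance scaling are all assumptions made for the example.

```python
import math

# Hypothetical raw-to-target label map; the real mapping depends on the dataset.
LABEL_MAP = {"BENIGN": "Normal", "DDoS": "DDoS", "Bot": "Botnet", "PortScan": "PortScan"}

def clean_rows(rows, labels):
    """Keep rows that have a known label and only finite feature values."""
    kept_rows, kept_labels = [], []
    for row, label in zip(rows, labels):
        target = LABEL_MAP.get(label.strip())
        if target is None or not all(math.isfinite(v) for v in row):
            continue
        kept_rows.append(row)
        kept_labels.append(target)
    return kept_rows, kept_labels

def drop_constant_columns(rows):
    """Remove columns whose value never changes; return rows and kept indices."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows], keep

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Tiny invented sample: one row has an infinite value, one has an unknown label.
rows = [[1.0, 5.0, 0.0], [3.0, 5.0, 2.0], [float("inf"), 5.0, 1.0], [5.0, 5.0, 4.0]]
labels = ["BENIGN", " Bot", "DDoS", "Unknown"]

kept_rows, kept_labels = clean_rows(rows, labels)
reduced, keep = drop_constant_columns(kept_rows)
scaled = standardize(reduced)
print(kept_labels, keep, scaled)
```

Dropping constant columns before scaling also guarantees that no column has zero variance when the division happens.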
In this article, the rare-class problem is not treated as something to patch at the end. It appears at every stage: how the data is cleaned, how the split is performed, how resampling is applied, and how the final model is judged.
Making rare attacks visible without leaking evidence
Class imbalance is not solved by simply generating more examples. If synthetic samples leak into the test set, the evaluation stops measuring generalization and starts measuring contamination.
That is why the split has to happen before resampling. The model is allowed to see a balanced training distribution, but the final test set must remain untouched. Otherwise, the confusion matrix would no longer represent a blind evaluation.
The split happens before augmentation. SMOTE changes the training distribution, not the evidence used for final evaluation.
This distinction matters because rare-class recovery can look artificially strong if the evaluation set has been shaped by the same augmentation strategy used for training. In a security model, that would create exactly the kind of false confidence the system is supposed to avoid.
The important rule is simple: augmentation belongs on the training branch. The test branch is evidence. It should stay blind.
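The ordering can be sketched as follows. This is a simplified SMOTE-style interpolation written in plain Python so the example stays self-contained; it is not the library implementation, and the data, the function names, and the parameters (`test_fraction`, `k`) are assumptions for illustration.

```python
import random

def stratified_split(y, test_fraction, rng):
    """Split each class separately so rare classes appear on both sides."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def smote_like(X, y, minority, n_new, k, rng):
    """Create n_new synthetic minority rows by interpolating toward a near neighbor."""
    pool = [x for x, label in zip(X, y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(pool)
        others = [p for p in pool if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

rng = random.Random(0)
# Invented two-feature data: 40 majority rows, 10 minority rows.
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(10)]
y = ["Normal"] * 40 + ["Botnet"] * 10

# 1. Split first. The test branch is frozen from this point on.
train_idx, test_idx = stratified_split(y, 0.2, rng)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# 2. Resample the training branch only.
n_new = y_train.count("Normal") - y_train.count("Botnet")
X_train_bal = X_train + smote_like(X_train, y_train, "Botnet", n_new, k=3, rng=rng)
y_train_bal = y_train + ["Botnet"] * n_new

print(len(X_train_bal), len(X_test))
```

Because `smote_like` only ever receives `X_train`, no synthetic row can be built from, or placed into, the test branch.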
Learning structure without throwing away evidence
Once the data was clean and the split was protected, the next question was how the model should represent network traffic.
A raw feature vector can carry useful packet-level evidence, but it can also be noisy and high-dimensional. A representation learner can compress that space, but compression creates its own risk: useful details may disappear inside the bottleneck.
The pipeline therefore separates representation learning from final classification. The autoencoder learns a compressed view of traffic behavior, the clustering stage adds a latent grouping signal, and the classifier receives both learned structure and original evidence.
The pipeline separates label hygiene, representation learning, latent clustering, raw evidence preservation, imbalance handling, and final evaluation.
The goal of this structure was not to make the pipeline look complex. It was to keep each responsibility separate. Data cleaning should not be confused with representation learning. Representation learning should not be confused with imbalance handling. Imbalance handling should not be confused with final evidence.
This separation made the system easier to reason about. If the final model failed on a rare class, the failure could be inspected through the data split, resampling behavior, latent representation, and final confusion matrix instead of being hidden inside a single end-to-end score.
The autoencoder had to compress without memorizing
A plain autoencoder can learn to reconstruct its input too easily. If the bottleneck simply memorizes surface-level variation, the learned features may look compact without becoming more useful.
The SSCAE in this project was used as a pressure mechanism. Noise injection makes reconstruction harder. Sparsity encourages the model to avoid activating everything at once. The contractive penalty discourages fragile representations that change too sharply with small input variation.
Noise injection, sparsity, and a contractive penalty push the bottleneck toward a more stable representation of traffic behavior.
This is why I think of the autoencoder less as a feature generator and more as a filter. It is trying to pass forward the structure that survives compression, not every fluctuation present in the original traffic features.
The article should not overclaim this step. A stable bottleneck is not proof of deployment reliability. It is one part of the evidence chain: useful only when the downstream classifier and evaluation still behave honestly.
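To show how the three pressures combine into one objective, here is a minimal sketch for a single sigmoid encoder layer with a linear decoder. It is an illustration under stated assumptions, not the project's SSCAE: the layer shapes, the L1 form of the sparsity term (a KL-divergence penalty is another common choice), and the weights `sparsity_weight` and `contractive_weight` are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b):
    """Single sigmoid layer: h_j = sigmoid(sum_i W[j][i] * x[i] + b[j])."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bj) for row, bj in zip(W, b)]

def decode(h, V, c):
    """Linear decoder: x_hat_i = sum_j V[i][j] * h[j] + c[i]."""
    return [sum(v * hj for v, hj in zip(row, h)) + ci for row, ci in zip(V, c)]

def regularized_loss(x, W, b, V, c, noise_std, sparsity_weight, contractive_weight, rng):
    """Return the total loss and its three terms for one input vector."""
    # Noise injection: encode a corrupted input, but score against the clean one.
    noisy = [v + rng.gauss(0.0, noise_std) for v in x]
    h = encode(noisy, W, b)
    x_hat = decode(h, V, c)
    reconstruction = sum((a - t) ** 2 for a, t in zip(x_hat, x)) / len(x)
    # Sparsity: mean absolute activation (assumed L1 form).
    sparsity = sum(abs(hj) for hj in h) / len(h)
    # Contractive penalty: squared Frobenius norm of the encoder Jacobian,
    # which for a sigmoid layer is sum_j (h_j * (1 - h_j))^2 * sum_i W[j][i]^2.
    contractive = sum((hj * (1 - hj)) ** 2 * sum(w * w for w in row) for hj, row in zip(h, W))
    total = reconstruction + sparsity_weight * sparsity + contractive_weight * contractive
    terms = {"reconstruction": reconstruction, "sparsity": sparsity, "contractive": contractive}
    return total, terms

rng = random.Random(0)
x = [0.2, -1.0, 0.5]                           # one invented, already scaled flow
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]       # 2 hidden units, 3 inputs
b = [0.0, 0.0]
V = [[0.1, 0.0], [-0.3, 0.2], [0.5, 0.1]]      # 3 outputs, 2 hidden units
c = [0.0, 0.0, 0.0]

total, terms = regularized_loss(x, W, b, V, c, noise_std=0.1,
                                sparsity_weight=1e-3, contractive_weight=1e-2, rng=rng)
print(total, terms)
```

Each term penalizes a different shortcut: copying noise, activating every unit, and reacting sharply to small input changes.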
Raw evidence still had to survive compression
The compressed representation alone was not enough for the final classifier. That was intentional.
If the classifier only sees latent features, it inherits every tradeoff made by the representation learner. Compression can help expose structure, but it can also remove details that matter for classification. For intrusion detection, those details may be exactly where rare attack behavior appears.
The final classifier therefore receives a hybrid input: learned latent features, a cluster signal, and the original scaled traffic features. The learned representation gives the model a compressed view. The raw features preserve packet-level evidence.
The hybrid tensor preserves the original traffic features instead of forcing the classifier to rely only on compressed representations.
This is the part of the project I care about most architecturally. The model is not asked to choose between learned structure and raw evidence. It gets both. That makes the system less elegant in the abstract, but more practical as an evaluation pipeline.
The final input is an 85-dimensional hybrid evidence tensor: 15 learned features, 1 cluster feature, and 69 original traffic features.
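A minimal sketch of how one row of that tensor could be assembled, assuming the three parts are simply concatenated in the order listed and the cluster signal is stored as a single numeric id. The function name `build_hybrid_row` and the column order are assumptions for illustration; only the dimensions come from the project.

```python
LATENT_DIM, CLUSTER_DIM, RAW_DIM = 15, 1, 69
HYBRID_DIM = LATENT_DIM + CLUSTER_DIM + RAW_DIM

def build_hybrid_row(latent, cluster_id, raw_scaled):
    """Concatenate latent features, one cluster feature, and raw scaled features."""
    if len(latent) != LATENT_DIM:
        raise ValueError(f"expected {LATENT_DIM} latent features, got {len(latent)}")
    if len(raw_scaled) != RAW_DIM:
        raise ValueError(f"expected {RAW_DIM} raw features, got {len(raw_scaled)}")
    return list(latent) + [float(cluster_id)] + list(raw_scaled)

# Placeholder values standing in for one flow.
latent = [0.0] * LATENT_DIM
raw_scaled = [1.0] * RAW_DIM
row = build_hybrid_row(latent, cluster_id=3, raw_scaled=raw_scaled)

print(len(row))
```

Checking the dimensions at assembly time is cheap, and it catches the case where a preprocessing change silently alters the number of raw features.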
The confusion matrix became the real evaluation
Once the model produced predictions, the confusion matrix became the center of the evaluation.
A single score compresses the result into something convenient. The confusion matrix makes the result uncomfortable again. It shows which classes were recovered, which classes were confused, and where rare behavior still leaked into the wrong prediction.
Aggregate accuracy compresses the result. The confusion matrix shows which classes were actually recovered and which ones were confused.
This is the figure that determines whether the article’s thesis holds. If rare classes disappear here, the system has not solved the problem. If they are visible, recoverable, and still inspectable, then the model is at least answering the right evaluation question.
The matrix also prevents the article from becoming a success story too early. Even when the diagonal looks strong, the off-diagonal cells matter. They show the remaining confusion that an aggregate metric would smooth away.
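Reading the off-diagonal cells can be made routine with a small helper. The sketch below builds a confusion matrix from label lists and lists every nonzero confusion; the predictions are invented for illustration and the helper names are my own.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def off_diagonal(matrix, labels):
    """List (true, predicted, count) for every nonzero confusion, largest first."""
    cells = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(cells, key=lambda cell: cell[2], reverse=True)

labels = ["Normal", "DDoS", "Botnet", "PortScan"]
# Invented predictions, not the project's results.
y_true = ["Normal"] * 6 + ["DDoS"] * 3 + ["Botnet"] * 2 + ["PortScan"]
y_pred = ["Normal"] * 5 + ["Botnet"] + ["DDoS"] * 3 + ["Botnet", "Normal"] + ["PortScan"]

matrix = confusion_matrix(y_true, y_pred, labels)
confusions = off_diagonal(matrix, labels)

for true_label, predicted_label, count in confusions:
    print(f"{true_label} predicted as {predicted_label}: {count}")
```

Listing confusions as (true, predicted) pairs keeps the direction of each error explicit. That matters because a missed Botnet flow and a false Botnet alert have different costs.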
Recall improved, but precision told the cautionary story
Rare-class recall is important because missed attacks are expensive. But recall alone is not enough.
A model can recover more rare attacks by becoming too willing to call traffic suspicious. That improves recall while creating false positives. In a real analyst workflow, those false positives are not abstract. They become noise someone has to investigate.
Per-class metrics reveal tradeoffs that aggregate accuracy hides.
This is why the per-class metric figure is not just supporting evidence. It is a correction. It keeps the article honest by showing the tradeoff that remains after rare-class recovery improves.
In this run, Botnet recall is the encouraging signal. Botnet precision is the cautionary signal. Both have to be read together.
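The tradeoff is easy to reproduce from any confusion matrix. In the sketch below the counts are invented to show the pattern and are not the numbers from this run; `precision_recall` is a hypothetical helper.

```python
def precision_recall(matrix, labels):
    """Per-class precision and recall from a matrix with true rows, predicted columns."""
    report = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        actual = sum(matrix[k])                    # row total: true members of the class
        predicted = sum(row[k] for row in matrix)  # column total: flows given this label
        report[label] = {
            "precision": true_positive / predicted if predicted else 0.0,
            "recall": true_positive / actual if actual else 0.0,
        }
    return report

labels = ["Normal", "DDoS", "Botnet", "PortScan"]
# Illustrative counts only.
matrix = [
    [960, 5, 30, 5],   # true Normal
    [2, 396, 1, 1],    # true DDoS
    [1, 0, 19, 0],     # true Botnet
    [1, 0, 0, 29],     # true PortScan
]

report = precision_recall(matrix, labels)
for label in labels:
    scores = report[label]
    print(f"{label}: precision={scores['precision']:.2f} recall={scores['recall']:.2f}")
```

In this toy matrix Botnet recall is 0.95 but Botnet precision is 0.38, because 30 Normal flows were also flagged as Botnet. Most of those alerts would send an analyst after benign traffic.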
Trust still needs more than metrics
Model quality is not only about the final classifier. The representation learner also has to train stably.
The convergence curve is useful because it shows whether optimization settled into a consistent direction. But it should not be mistaken for proof that the system will generalize to live traffic. Training stability is a trust signal, not a deployment guarantee.
Training curves show whether optimization stabilized. Generalization still needs held-out or time-shifted evaluation.
That distinction matters. A smooth training curve tells us the optimization process did not obviously collapse. It does not tell us how the model will behave under new traffic distributions, future attack patterns, or different capture conditions.
For an intrusion detection system, trust also has a time dimension. A detector that is accurate but slow is still incomplete. Security decisions lose value when they arrive too late.
Latency turns the model from an offline classifier into something closer to a deployable detector.
Latency is not the main story of this article, but it is an important reality check. Detection quality answers whether the system can catch the signal. Latency answers whether it can catch it in time.
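One way to measure this is to time each prediction call and report percentiles instead of a single average, since tail latency is what delays an alert. The sketch below is a generic timing harness; `dummy_predict` stands in for the real classifier, and the batch sizes and warmup count are arbitrary assumptions.

```python
import math
import statistics
import time

def measure_latency_ms(predict, batches, warmup=2):
    """Time each predict(batch) call in milliseconds after a few warmup calls."""
    for batch in batches[:warmup]:
        predict(batch)
    timings = []
    for batch in batches:
        start = time.perf_counter()
        predict(batch)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings

def summarize(timings):
    """Median, nearest-rank 95th percentile, and worst case."""
    ordered = sorted(timings)
    p95_index = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }

def dummy_predict(batch):
    """Stand-in for the real classifier; flags rows with a positive feature sum."""
    return [sum(row) > 0 for row in batch]

# 50 invented batches of 64 flows with 85 features each.
batches = [[[0.1] * 85 for _ in range(64)] for _ in range(50)]
timings = measure_latency_ms(dummy_predict, batches)
summary = summarize(timings)
print(summary)
```

The numbers depend on the machine, so they are only comparable across runs on the same hardware and batch size.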
Feature importance was only a sanity check
Feature importance is useful, but it is easy to overread.
A ranked feature list does not give a complete, human-readable explanation of the model. It does not prove causality, and it does not replace deeper error analysis. But it can still act as a sanity check: is the classifier relying on plausible traffic signals, latent features, or strange artifacts that should make us suspicious?
A ranked view of feature reliance helps inspect whether the classifier is leaning on meaningful traffic signals or accidental artifacts.
This figure is useful because the system intentionally combines two kinds of evidence: learned representation and original raw features. If the final model only relied on one side, that would say something about the usefulness of the hybrid design. If both sides appear, the figure becomes a small check that the classifier is actually using the mixed evidence it was given.
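That check can be reduced to one number per evidence source. The sketch below sums importance scores by column block, assuming the hybrid columns are ordered as 15 latent, 1 cluster, then 69 raw features. Both the column order and the importance values are assumptions for illustration.

```python
# Assumed column layout of the 85-dimensional hybrid input.
GROUPS = {
    "latent": range(0, 15),
    "cluster": range(15, 16),
    "raw": range(16, 85),
}

def importance_by_source(importances):
    """Share of total importance carried by each block of columns."""
    total = sum(importances)
    return {
        name: sum(importances[i] for i in columns) / total
        for name, columns in GROUPS.items()
    }

# Invented scores: every feature weighs 1.0 except the cluster feature at 5.0.
importances = [1.0] * 15 + [5.0] + [1.0] * 69
shares = importance_by_source(importances)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

A share near zero for either the latent or the raw block would suggest the classifier is ignoring one side of the hybrid input.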
What I would be careful about
I would not present this project as a solved intrusion detection system.
It is a strong experiment in class-imbalance-aware modeling, but there are still several places where the evaluation needs to stay humble.
- SMOTE is not real attack diversity. It can make rare classes visible during training, but synthetic samples do not replace genuinely diverse attack traffic.
- Benchmark traffic is not live traffic. A model that works on one dataset can still struggle when the distribution shifts.
- False positives still matter. Rare-class recall is valuable, but analyst workflows can break if the model produces too many alerts.
- Feature importance is not explanation. It is a useful inspection tool, not a full account of why the model made a decision.
- Time-split evaluation would be stronger. A future version should test the model against newer or temporally separated traffic.
These limitations do not weaken the project. They make the evaluation more honest. In security, pretending uncertainty does not exist is more dangerous than showing where the system still needs pressure.
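The time-split idea from the list above can be sketched in a few lines. The `timestamp` and `label` fields, the cutoff, and the records are all hypothetical. The point is only that every test flow is captured after every training flow.

```python
def time_split(records, cutoff):
    """Train on flows captured before the cutoff, test on flows at or after it."""
    train = [r for r in records if r["timestamp"] < cutoff]
    test = [r for r in records if r["timestamp"] >= cutoff]
    return train, test

def missing_classes(train, test):
    """Classes seen in training that never appear in the later test window."""
    return {r["label"] for r in train} - {r["label"] for r in test}

# Invented flows with integer capture times.
records = [
    {"timestamp": t, "label": label}
    for t, label in [
        (1, "Normal"), (2, "DDoS"), (3, "Normal"), (4, "Botnet"),
        (6, "Normal"), (7, "PortScan"), (8, "Botnet"), (9, "Normal"),
    ]
]

train, test = time_split(records, cutoff=5)
missing = missing_classes(train, test)
print(len(train), len(test), missing)
```

Unlike a stratified random split, a time split does not guarantee that every class appears on both sides, so checking for missing classes is part of the evaluation.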
What this project taught me
The main lesson was not that one model architecture is enough to solve intrusion detection.
The lesson was that evaluation has to be designed around the failure mode. If rare attacks matter most, then the system cannot be judged by a metric that lets those attacks disappear.
The final output of this project is not only a trained classifier. It is a way of looking at the classifier: through class distribution, split hygiene, latent representation, raw evidence, confusion matrix behavior, latency, and feature reliance.
A security model is not useful because it performs well on average. It is useful when it exposes where rare failures occur, what evidence supports the decision, and whether the result can be trusted in deployment.