--[ Cyber Data Science for DCO

$ getent passwd jbaxter
├─── name: Jacob Baxter
├──── org: Orang Labs
└─ social:
   └─ twitter: @BenevOrang

A former Army Officer, Jacob is a technologist who has always followed his curiosity and interest in technology, with a core belief in its ability to make the world a better place. He has a background in Applied Mathematics and got into Cybersecurity as an Army Officer, working a defensive mission set on traditional large enterprise networks while based in Augusta, GA. He has been in the field for about 6 years and has also spent time doing research at DARPA and in the DOD on using programming, data science, and machine learning to help improve Cybersecurity. Presently he spends the majority of his time working on tough research questions as a Research Fellow in the United States Military Academy's Cyber Research Center and trying to make cool tech at Orang Labs. He's been a practicing Cyber Data Scientist for over 5 years, with experience applying Data Science to Cyber problems in DOD DCO, at DARPA, and in his current roles. If he weren't working the 9 to 5, all of his friends know he'd be found on a side street in Bali, speaking Indonesian with the vendor at the local Ayam Goreng (fried chicken) or Sate cart, waiting for his next Ultimate Frisbee game to start.


This workshop will teach the basics of performing Data Science and Machine Learning in Python for Cyber Security Applications. We'll cover statistical, graph, unsupervised, and supervised approaches to analytics. Attendees will walk away with the ability to use Python to answer questions about Cyber Data, such as projecting and grouping similar IPs based on their network flows, predicting unlabeled admin accounts from account behavior, and visualizing these types of problems to help communicate to other analysts and stakeholders.


Students need to come prepared with the following:

  • Knowledge: Students should have a basic familiarity with Python, as well as familiarity with how Networked Systems work and the types of data traditionally observed from a defensive perspective, such as Flow, Zeek, Authentication logs, etc.

    If you’re a bit rusty at Python and/or have never touched pandas, I would highly recommend checking out something like https://www.codecademy.com/learn/paths/data-science, focusing specifically on their pandas and matplotlib tutorials. We’re going to be doing a lot with pandas DataFrames, which are kind of like having a little in-memory database or Excel file. They’ll be your mini Data Science SIEM for the work we do. So if you already work in Splunk or ELK a bunch, a lot of what you might pick up in this course is other ways to do some of the same kinds of queries you already do.

    If you’d like to brush up on some math that can be helpful, I love: https://www.youtube.com/watch?v=kjBOesZCoqc&list=PL0-GT3co4r2y2YErbmuJw2L5tW4Ew2O5B. Ultimately, Linear Algebra can often be thought of as just a type of data structure that allows us to do things very efficiently, take advantage of how we can encode most data into geometric spaces, and then leverage nice math to find cool insights. Grant’s videos are highly visual and aim for intuition, rather than what you might see (and dread) in a college course.

    From a Cyber perspective, I plan for us to primarily go over some anonymized flow records and Windows event logs; if you haven’t seen much of these, I’d definitely recommend reading up on them. Flow is very similar to the Zeek conn table.

  • Infrastructure: Students should have a data-science-capable laptop with ~16 GB of memory. They are also encouraged to have JupyterLab installed and to be familiar with installing Python packages and calling them from a Jupyter Notebook.
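
To give a feel for the "mini Data Science SIEM" idea from the Knowledge notes above, here is a minimal sketch of two SIEM-style queries in pandas. The records are made up for illustration, and the column names (src_ip, dst_ip, dst_port, bytes) are assumptions loosely modeled on common flow fields, not the data or schema used in the workshop.

```python
import pandas as pd

# Hypothetical flow records, invented for this example only.
flows = pd.DataFrame(
    {
        "src_ip": ["10.0.0.5", "10.0.0.5", "10.0.0.7", "10.0.0.7", "10.0.0.9"],
        "dst_ip": ["10.0.1.1", "10.0.1.2", "10.0.1.1", "10.0.1.1", "10.0.1.3"],
        "dst_port": [443, 53, 443, 443, 22],
        "bytes": [1200, 90, 5300, 700, 4100],
    }
)

# Roughly "stats count, sum(bytes), dc(dst_ip) by src_ip" in SIEM terms:
# one summary row per source IP, busiest talkers first.
per_src = (
    flows.groupby("src_ip")
    .agg(
        flow_count=("bytes", "count"),
        total_bytes=("bytes", "sum"),
        distinct_dsts=("dst_ip", "nunique"),
    )
    .reset_index()
    .sort_values("total_bytes", ascending=False)
)

# Roughly a "where dst_port=443" filter using a boolean mask.
https_flows = flows[flows["dst_port"] == 443]

print(per_src)
print(https_flows)
```

If you can read this and map it onto a search you would write in Splunk or ELK, you likely have enough pandas to follow along.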