Smartphone usage is increasing steadily, and with it the number of Android application users. This growth has led attackers to create malicious Android applications as tools for stealing sensitive data and for committing identity theft and fraud against mobile banking and mobile wallet services. Many malicious-application detection tools are available, but effective and efficient tools are still needed to handle the new, more complex malicious apps that attackers create. In this paper we propose using machine learning approaches to detect malicious Android applications. We first gather a dataset of known malicious apps as a training set. Then, with the support vector machine and decision tree algorithms, we compare the training dataset against the trained dataset and can detect up to 93.2% of unknown or new malicious Android applications.
Numerous malware detection tools have been developed, but some of them may not be able to detect newly created or unknown malicious applications infected by Trojans, worms, or spyware. Detecting a large number of malicious applications among millions of Android applications remains a challenging task with traditional methods. Existing non-machine-learning approaches detect malicious applications based on their characteristics, properties, and behaviour.
- Newly updated or newly created malicious applications are hard to identify.
- Non-machine-learning approaches are neither reliable nor efficient.
- Existing approaches cover only 30 of 300 app permissions; this limited coverage leaves room for different types of attacks.
Android is one of the most widely used mobile operating systems worldwide. Due to its technological impact, its open-source code, and the possibility of installing third-party applications without any central control, Android has become a malware target. Although it includes security mechanisms, recent reports of malicious activity and of Android's vulnerabilities show the importance of continuing to develop methods and frameworks that improve its security.
To prevent malware attacks, researchers and developers have proposed different security solutions that apply static analysis, dynamic analysis, and artificial intelligence. Indeed, data science has become a promising area in cybersecurity, since data-driven analytical models allow for the discovery of insights that can help to predict malicious activities.
Cyber threats can be analyzed with two techniques: static analysis and dynamic analysis. Most importantly, these are the approaches that yield the features used in data science.
Static analysis: methods that obtain information about the software under analysis without executing it, for example by studying its code, its calls, and its resources.
Dynamic analysis: an approach that analyzes the cyber threat during its execution, that is, it collects information about its behavior; network flows are one example of such features.
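As a minimal illustration of the static case, the sketch below inspects an application package without running it. It relies only on the fact that an APK is a ZIP archive; the function names and the toy archive built in memory are hypothetical stand-ins, not part of our extractor.

```python
import io
import zipfile


def list_apk_entries(apk_bytes: bytes) -> dict:
    """Return {entry name: uncompressed size} for an APK without executing it."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return {info.filename: info.file_size for info in apk.infolist()}


def build_toy_apk() -> bytes:
    """Build a tiny stand-in archive; a real APK would be read from disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", "<manifest/>")
        apk.writestr("classes.dex", b"dex\n035\x00")
        apk.writestr("res/layout/main.xml", "<LinearLayout/>")
    return buffer.getvalue()


entries = list_apk_entries(build_toy_apk())
for name, size in sorted(entries.items()):
    print(f"{name}: {size} bytes")
```

Entry names and sizes are only the simplest static features; the manifest and the code inside the archive carry the information used later for classification.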
- Improves the detection rate for malicious applications.
- Machine learning is more efficient than non-machine-learning approaches.
- Able to detect new Android malware applications.
- Only 22 of 135 permissions need to be considered, which improves runtime performance by 85.6%.
In 2016 we explored the Android Malware Genome Project (MalGenome), a dataset that was active from 2012 until the end of 2015. It contains 1260 malware applications grouped into 49 families. Today other collections are available: Drebin, a research project offering 5560 applications from 179 malware families; AndroZoo, a collection of 5669661 Android applications from different sources (including Google Play); VirusShare, a repository that provides malware samples to cybersecurity researchers; and DroidCollector, which provides around 8000 benign applications and 5560 malware samples, together with samples of network traffic as pcap files.
- Data Pre-processing
- Feature Extraction and EDA
- Static Analysis
- Dynamic Analysis
Data Pre-processing
Organize your selected data by formatting, cleaning and sampling from it.
Three common data pre-processing steps are:
Formatting: The data you have selected may not be in a format that is suitable for you to work with. The data may be in a relational database and you would like it in a flat file, or the data may be in a proprietary file format and you would like it in a relational database or a text file.
Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are incomplete and do not carry the data you believe you need to address the problem. These instances may need to be removed. Additionally, there may be sensitive information in some of the attributes, and these attributes may need to be anonymized or removed from the data entirely.
Sampling: There may be far more selected data available than you need to work with. More data can result in much longer running times for algorithms and larger computational and memory requirements. You can take a smaller representative sample of the selected data that may be much faster for exploring and prototyping solutions before considering the whole dataset.
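The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual pipeline: the table schema, the column names, and the sample rows are hypothetical, and truncated hashing stands in for whatever anonymization a real dataset would require.

```python
import csv
import hashlib
import io
import random
import sqlite3

# Hypothetical schema: one row per app with an uploader, a count and a label.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apps (package TEXT, uploader TEXT, n_permissions INTEGER, label TEXT)"
)
conn.executemany(
    "INSERT INTO apps VALUES (?, ?, ?, ?)",
    [
        ("com.example.a", "alice@example.com", 5, "benign"),
        ("com.example.b", "bob@example.com", None, "malware"),  # incomplete row
        ("com.example.c", "carol@example.com", 17, "malware"),
        ("com.example.d", "dave@example.com", 3, "benign"),
        ("com.example.e", "erin@example.com", 21, "malware"),
        ("com.example.f", "frank@example.com", 4, "benign"),
    ],
)


def format_rows(connection):
    """Formatting: pull the relational table into plain dictionaries."""
    cursor = connection.execute(
        "SELECT package, uploader, n_permissions, label FROM apps"
    )
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


def clean_rows(rows):
    """Cleaning: drop incomplete rows and replace the sensitive column by a hash."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue
        row = dict(row)
        row["uploader"] = hashlib.sha256(row["uploader"].encode()).hexdigest()[:12]
        cleaned.append(row)
    return cleaned


def sample_rows(rows, per_label, seed=0):
    """Sampling: draw up to per_label rows from each class."""
    rng = random.Random(seed)
    sample = []
    for label in sorted({row["label"] for row in rows}):
        group = [row for row in rows if row["label"] == label]
        sample.extend(rng.sample(group, min(per_label, len(group))))
    return sample


def to_csv(rows):
    """Write the rows as flat CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


formatted = format_rows(conn)
cleaned = clean_rows(formatted)
sampled = sample_rows(cleaned, per_label=2)
print(to_csv(sampled))
```

Sampling per class rather than over the whole table keeps both labels represented in the smaller set, which matters when one class is much larger than the other.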
Feature Extraction and EDA
We propose a method for detecting malicious Android applications with machine learning techniques by analyzing the permissions extracted from the application itself. The features used for classification are the presence of the uses-permission and uses-feature tags in the manifest, together with the number of permissions each application requests; that is, each individually requested permission and the uses-feature tag. This makes it possible to detect malicious Android applications based on permissions and 20 features taken from Android application packages.
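The manifest features described above could be collected roughly as follows. The sketch assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary XML format, so a decoding tool is needed first); the sample manifest and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by android:* attributes in the manifest.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"

# Hypothetical, already-decoded manifest used only for illustration.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <uses-permission android:name="android.permission.READ_CONTACTS"/>
    <uses-feature android:name="android.hardware.camera"/>
    <application android:label="Demo"/>
</manifest>
"""


def extract_manifest_features(manifest_xml: str) -> dict:
    """Collect requested permissions, declared features and the permission count."""
    root = ET.fromstring(manifest_xml)
    permissions = sorted(
        tag.get(NAME_ATTR) for tag in root.iter("uses-permission") if tag.get(NAME_ATTR)
    )
    uses_features = sorted(
        tag.get(NAME_ATTR) for tag in root.iter("uses-feature") if tag.get(NAME_ATTR)
    )
    return {
        "permissions": permissions,
        "uses_features": uses_features,
        "permission_count": len(permissions),
    }


features = extract_manifest_features(SAMPLE_MANIFEST)
print(features)
```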
In this first step, we analyze some features in order to test the following hypothesis: there is a difference between the permissions used by a set of malware samples and those used by a set of benign samples. For this approach we developed code that extracts the permissions of each application and writes them to a CSV file.
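A minimal sketch of this step is given below: it writes a binary permission matrix to CSV and then measures, for each permission, how much more often malware requests it than benign apps do. The four toy apps, the shortened permission names, and the rate-difference measure are assumptions made for illustration; they are not our dataset or necessarily the statistic used in the study.

```python
import csv
import io

# Toy data: (label, set of requested permissions).
APPS = [
    ("malware", {"INTERNET", "SEND_SMS", "READ_CONTACTS"}),
    ("malware", {"INTERNET", "SEND_SMS"}),
    ("benign", {"INTERNET"}),
    ("benign", {"INTERNET", "CAMERA"}),
]


def build_permission_csv(apps):
    """Write one row per app: its label plus a 0/1 column per permission."""
    columns = sorted(set().union(*(perms for _, perms in apps)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["label"] + columns)
    for label, perms in apps:
        writer.writerow([label] + [int(name in perms) for name in columns])
    return out.getvalue()


def permission_rate_gap(csv_text):
    """Return {permission: malware request rate minus benign request rate}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    permissions = [name for name in rows[0] if name != "label"]
    gap = {}
    for perm in permissions:
        rates = {}
        for label in ("malware", "benign"):
            group = [row for row in rows if row["label"] == label]
            rates[label] = sum(int(row[perm]) for row in group) / len(group)
        gap[perm] = rates["malware"] - rates["benign"]
    return gap


csv_text = build_permission_csv(APPS)
gap = permission_rate_gap(csv_text)
for perm, value in sorted(gap.items(), key=lambda item: -item[1]):
    print(f"{perm}: {value:+.2f}")
```

A permission with a gap near zero, such as one requested by almost every app, says little about the class, while a large positive gap marks a candidate feature for the classifier.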
For this approach, we used a set of 4705 benign and 7846 malicious applications. All of the files were processed by our feature extractor script. The aim of this analysis is to answer the following question: the static analysis above showed that many applications use a network connection, in other words, they try to communicate or transmit information; is it then possible to distinguish between malware and benign applications using network traffic?
We concluded that the best network features are:
(R1) TCP packets: the number of TCP packets sent and received during communication.
(R2) Non-TCP packets: the total number of packets that are not TCP.
(R3) External IPs: the number of external addresses (IPs) with which the application tried to communicate.
(R4) Volume of bytes: the number of bytes sent from the application to external sites.
(R5) UDP packets: the total number of UDP packets transmitted in a communication.
(R6) Source application packets: the number of packets sent from the application to a remote server.
(R7) Remote application packets: the number of packets received from external sources.
(R8) Source application bytes: the volume (in bytes) of the communication between the application and the server.
(R9) Remote application bytes: the volume (in bytes) of the data from the server to the emulator.
(R10) DNS queries: the number of DNS queries.
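The features R1 to R10 can be computed from a list of captured packets, as in the sketch below. It assumes the pcap file has already been parsed into simple records by some external tool; the Packet record, the emulator address, and the sample traffic are hypothetical. R8 is read here as bytes sent by the application, which makes it coincide with R4; the original definitions may separate the two differently.

```python
from dataclasses import dataclass

APP_IP = "10.0.2.15"  # assumed address of the application inside the emulator


@dataclass
class Packet:
    protocol: str  # "TCP", "UDP", or another protocol name
    src: str       # source IP address
    dst: str       # destination IP address
    size: int      # packet size in bytes
    is_dns_query: bool = False


def network_features(packets, app_ip=APP_IP):
    """Compute the ten traffic features for one application run."""
    sent = [p for p in packets if p.src == app_ip]
    received = [p for p in packets if p.dst == app_ip]
    return {
        "R1_tcp_packets": sum(p.protocol == "TCP" for p in packets),
        "R2_non_tcp_packets": sum(p.protocol != "TCP" for p in packets),
        "R3_external_ips": len({p.dst for p in sent}),
        "R4_bytes_sent_external": sum(p.size for p in sent),
        "R5_udp_packets": sum(p.protocol == "UDP" for p in packets),
        "R6_source_app_packets": len(sent),
        "R7_remote_app_packets": len(received),
        # Assumption: same reading as R4 (bytes sent by the application).
        "R8_source_app_bytes": sum(p.size for p in sent),
        "R9_remote_app_bytes": sum(p.size for p in received),
        "R10_dns_queries": sum(p.is_dns_query for p in sent),
    }


# Hypothetical capture using documentation-range IP addresses.
CAPTURE = [
    Packet("TCP", APP_IP, "192.0.2.10", 60),
    Packet("TCP", "192.0.2.10", APP_IP, 1500),
    Packet("UDP", APP_IP, "203.0.113.53", 70, is_dns_query=True),
    Packet("UDP", "203.0.113.53", APP_IP, 120),
    Packet("TCP", APP_IP, "198.51.100.7", 400),
    Packet("ICMP", APP_IP, "198.51.100.7", 64),
]

traffic = network_features(CAPTURE)
for name, value in traffic.items():
    print(f"{name}: {value}")
```

Each application then contributes one row of ten numbers, which can be joined with the permission features before training a classifier.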