Extracting Phishing Website Features in URL and Prediction using Machine Learning
Malicious websites largely promote the growth of Internet criminal activities and constrain the development of Web services. As a result, there has been strong motivation to develop a systematic solution that stops users from visiting such sites. We propose a learning-based approach that classifies websites into three classes: benign, spam and malicious. Our mechanism analyzes only the Uniform Resource Locator (URL) itself, without accessing the content of the website. It therefore eliminates run-time latency and the possibility of exposing users to browser-based vulnerabilities. By employing learning algorithms, our scheme achieves better generality and coverage than blacklisting services.
URLs of the websites are separated into 3 classes:
- Benign: Safe websites with normal services
- Spam: Websites that attempt to flood the user with advertising, or sites such as fake surveys and online dating.
- Malware: Website created by attackers to disrupt computer operation, gather sensitive information, or gain access to private computer systems.
A poorly structured NN model may cause the model to underfit the training dataset. On the other hand, exaggeration in restructuring the system to suit every single item in the training dataset may cause the system to be overfitted. One possible solution to avoid the overfitting problem is by restructuring the NN model in terms of tuning some parameters, adding new neurons to the hidden layer or sometimes adding a new layer to the network. A NN with a small number of hidden neurons may not have a satisfactory representational power to model the complexity and diversity inherent in the data. On the other hand, networks with too many hidden neurons could overfit the data. However, at a certain stage the model can no longer be improved, therefore, the structuring process should be terminated. Hence, an acceptable error rate should be specified when creating any NN model, which itself is considered a problem since it is difficult to determine the acceptable error rate a priori. For instance, the model designer may set the acceptable error rate to a value that is unreachable which causes the model to stick in local minima or sometimes the model designer may set the acceptable error rate to a value that can further be improved.
- Loading the entire dataset takes time.
- The process is not accurate.
- Analysis is slow.
Lexical features are based on the observation that the URLs of many illegal sites look different from those of legitimate sites. Analyzing lexical features lets us capture this property for classification. We first distinguish the two parts of a URL, the host name and the path, from which we extract a bag of words (strings delimited by ‘/’, ‘?’, ‘.’, ‘=’, ‘-’ and ‘’).
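The host/path split and bag-of-words extraction described above could be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `tokenize_url` is hypothetical, and the underscore in the delimiter set is an assumption, since the last delimiter in the list above is blank.

```python
import re
from urllib.parse import urlsplit

# Delimiters named in the text: / ? . = -
# The underscore is an ASSUMPTION (the last delimiter in the source is blank).
_DELIMITERS = re.compile(r"[/?.=\-_]")


def tokenize_url(url):
    """Return (host_tokens, path_tokens) for a URL string."""
    # urlsplit only recognises the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    # Drop the empty strings produced by adjacent delimiters.
    host_tokens = [t for t in _DELIMITERS.split(host) if t]
    path_tokens = [t for t in _DELIMITERS.split(path) if t]
    return host_tokens, path_tokens


host_tokens, path_tokens = tokenize_url(
    "http://secure-login.example.com/account/confirm.php?id=42"
)
print(host_tokens)  # ['secure', 'login', 'example', 'com']
print(path_tokens)  # ['account', 'confirm', 'php', 'id', '42']
```

Keeping host and path tokens in separate lists lets later features count them independently.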
We find that phishing websites tend to have longer URLs, more levels (delimited by dots), more tokens in the domain and path, and longer tokens. In addition, phishing and malware websites may pose as benign ones by including popular brand names as tokens outside the second-level domain. Phishing and malware websites may also use an IP address directly in place of a host name to disguise the suspicious URL, which is very rare in benign cases. Finally, phishing URLs are found to contain several suggestive word tokens (confirm, account, banking, secure, ebayisapi, webscr, login, signin); we check for the presence of these security-sensitive words and include the binary value in our features.
Intuitively, malicious sites are usually less popular than benign ones. For this reason, site popularity can be considered an important feature. The traffic rank feature is acquired from Alexa.com.
Host-based features are based on the observation that malicious sites are often registered in less reputable hosting centres or regions.
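As a rough sketch of how the lexical features described above might be turned into numbers, the snippet below computes them from the URL string alone. The function name `lexical_features`, the feature names, the `BRANDS` list and the underscore delimiter are all assumptions made for illustration. The second-level domain is taken naively as the label before the last dot, which ignores multi-part suffixes such as co.uk. Traffic rank and host-based features need external lookups and are left out.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Security-sensitive words listed in the text.
SENSITIVE_WORDS = {"confirm", "account", "banking", "secure",
                   "ebayisapi", "webscr", "login", "signin"}
# HYPOTHETICAL brand list, for illustration only.
BRANDS = ("paypal", "ebay", "amazon")
# Underscore is an assumed delimiter (see the tokenization step).
_DELIMITERS = re.compile(r"[/?.=\-_]")


def _is_ip(host):
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Return a dict of simple lexical features for one URL."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    path = (parts.path + ("?" + parts.query if parts.query else "")).lower()
    host_tokens = [t for t in _DELIMITERS.split(host) if t]
    path_tokens = [t for t in _DELIMITERS.split(path) if t]
    tokens = host_tokens + path_tokens

    uses_ip = _is_ip(host)
    labels = host.split(".")
    # Naive assumption: the second-level domain is the label before the last.
    if not uses_ip and len(labels) >= 2:
        outside_sld = labels[:-2] + labels[-1:] + path_tokens
    else:
        outside_sld = labels + path_tokens

    return {
        "url_length": len(url),
        "host_dots": host.count("."),
        "host_token_count": len(host_tokens),
        "path_token_count": len(path_tokens),
        "longest_token": max((len(t) for t in tokens), default=0),
        "uses_ip": int(uses_ip),
        "brand_outside_sld": int(
            any(b in tok for tok in outside_sld for b in BRANDS)),
        "has_sensitive_word": int(bool(SENSITIVE_WORDS.intersection(tokens))),
    }


print(lexical_features("http://paypal.com.evil.example/secure/login.php"))
```

Each URL becomes a fixed-length row of numbers, which is the shape a scikit-learn classifier expects.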
- All URLs in the dataset are labelled.
- We trained two supervised learning algorithms, random forest and support vector machine, using the scikit-learn library.
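The training step might look roughly like the sketch below, which fits a random forest and a support vector machine with scikit-learn. The feature rows in `X_TOY` are synthetic placeholders invented for illustration, not values from the project's dataset, and the hyperparameters, the scaling step before the SVM and the helper name `train_models` are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# SYNTHETIC placeholder rows: [url_length, host_dots, uses_ip, has_sensitive_word]
X_TOY = [
    [22, 1, 0, 0], [25, 2, 0, 0], [30, 1, 0, 0],
    [48, 2, 0, 0], [55, 3, 0, 0], [60, 3, 0, 1],
    [75, 4, 1, 1], [90, 5, 0, 1], [82, 4, 1, 1],
]
Y_TOY = ["benign"] * 3 + ["spam"] * 3 + ["malicious"] * 3


def train_models(X, y):
    """Fit a random forest and an SVM on labelled feature rows."""
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    # SVMs are sensitive to feature scale, so standardise the inputs first.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
    return forest, svm


forest, svm = train_models(X_TOY, Y_TOY)
new_row = [[80, 4, 1, 1]]  # feature row for an unseen URL (also synthetic)
print(forest.predict(new_row), svm.predict(new_row))
```

With a real dataset, the rows would come from the feature extraction step, and part of the labelled data would be held out to measure accuracy.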
Extracting Phishing Website Features using Machine Learning
Hardware and Software Requirements:
- Windows 7, 8 or 10 (64-bit)
- 4 GB RAM
- Data Set