Introduction to Machine Learning for Beginners

Published on: 22 February 2021

Introduction

Machine learning is a branch of research in artificial intelligence (AI). The goal of machine learning is to understand the structure of data and to fit that data into models that can be understood and used by people.

Although this practice falls within computer science, the approach of machine learning differs from traditional computational approaches. In traditional computing, algorithms are sets of explicitly stated, programmed instructions that computers follow to perform calculations or solve problems.

Machine learning algorithms, on the other hand, allow computers to train on input data and use statistical analysis to produce an output from a range of possible outputs. For this reason, machine learning helps computers build models from example data, so that they can automate decision-making based on the inputs they receive.

Today, every user benefits from machine learning through technology, even if they often don't realize it. Facial recognition technology allows social media platforms to help users tag or share photos of friends. Optical Character Recognition (OCR) systems convert images of text into characters. Recommendation engines, powered by machine learning, suggest which movie or TV show to watch based on user preferences. Self-driving cars that rely on machine learning are being tested ahead of their launch on the market.

Machine learning is a constantly evolving field. For this reason, there are some considerations to keep in mind when working with machine learning methods or when analyzing their processes.

In this tutorial you will find an introduction to the main machine learning methods, supervised and unsupervised learning, and to the most common algorithmic approaches, including the k-nearest neighbor algorithm, decision tree learning and deep learning.

You will also learn which programming languages are most used in machine learning, and discover the advantages and disadvantages of each. Finally, we will address human biases, how they can find their way into machine learning algorithms, and what can be done to prevent them when creating algorithms.

Machine Learning Methods

In machine learning, tasks are generally grouped into broad categories. These categories are based on how learning occurs or how feedback on the learning is given to the system being developed.

Two of the most widely used machine learning methods are supervised learning, which trains an algorithm on input and output data labeled by humans, and unsupervised learning, which provides the algorithm with no labeled data at all, so that it must find structure within the input data on its own.

Supervised learning

For example, with supervised learning, an algorithm can be fed data consisting of images of sharks labeled as 'fish' and images of oceans labeled as 'water'. Trained on this dataset, the supervised learning algorithm will later be able to identify new images of sharks or oceans and apply the previously specified tags.

A common use of supervised learning is to take historical data and predict future events on the basis of how likely they are to resemble past ones. For example, it can be used to predict which products an online shopper might like, or to filter spam emails.

In supervised learning, the algorithm can also learn from tagged photos, for example photos of dogs, in order to recognize and categorize untagged photos of the same subject.
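To make this concrete, here is a minimal sketch of supervised classification using scikit-learn, tying back to the shark and ocean example above. The feature vectors and labels are invented for illustration: imagine each image has already been reduced to two numeric features, so this is only an assumed setup, not a full image pipeline.

```python
# Minimal supervised-learning sketch with scikit-learn.
# The numbers below are hypothetical "image features"; in practice an
# image would first be converted into a numeric feature vector.
from sklearn.linear_model import LogisticRegression

# Labeled training data (human-provided tags).
X_train = [[0.9, 0.1], [0.8, 0.2],   # shark images -> "fish"
           [0.1, 0.9], [0.2, 0.8]]   # ocean images -> "water"
y_train = ["fish", "fish", "water", "water"]

# Train the classifier on the labeled examples.
model = LogisticRegression()
model.fit(X_train, y_train)

# Classify a new, previously unseen image.
print(model.predict([[0.85, 0.15]]))  # expected: ['fish']
```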

Unsupervised learning

Unsupervised learning works with unlabeled data, leaving the algorithm to find commonalities among its input data on its own. Since unlabeled data is far more abundant than labeled data, machine learning methods that facilitate unsupervised learning are particularly valuable.

The goal of unsupervised learning may simply be to discover hidden patterns within a dataset, but it may also be feature learning, which allows the machine to discover on its own the representations needed to classify raw data.

Unsupervised learning is commonly used with transactional data. You may have a huge dataset of customers and their purchases, but as a human you would not be able to make sense of it or work out which similar attributes can be extracted from customer profiles and their types of purchases.

If this data were fed to an unsupervised learning algorithm, the machine might determine, for example, that women of a certain age group who buy neutral soaps are more likely to be pregnant. As a result, a marketing campaign for pregnancy and baby products could be targeted at this audience.
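As a rough illustration of this kind of grouping, the sketch below uses k-means clustering from scikit-learn. The customer records (age and a hypothetical monthly spend figure) are invented, and no labels are provided: the algorithm is simply asked to find two groups.

```python
# Minimal unsupervised-learning sketch using k-means clustering.
# The customer data (age, monthly spend) is invented for illustration;
# note that no labels are given to the algorithm.
from sklearn.cluster import KMeans

customers = [
    [23, 120], [25, 130], [27, 115],   # one spending pattern
    [52, 20],  [55, 25],  [60, 15],    # another spending pattern
]

# Ask the algorithm to discover two groups on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)  # e.g. [0 0 0 1 1 1]: two clusters found without any labels
```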

Without corrective feedback on its choices, an unsupervised learning algorithm can build meaningful relationships between data points that, considered individually, would be difficult to relate, however complex the data may be. This type of learning is also often used for anomaly detection, such as spotting fraudulent credit card use, and for recommender systems that suggest a product you might buy based on your previous purchases.

Returning to the earlier example, in unsupervised learning untagged photos of dogs can be used as input data so that the system records the details that will help it classify further untagged dog photos on its own.

Algorithmic approaches

As a field, machine learning is closely related to computational statistics. Therefore, prior knowledge of statistics is useful for understanding and mastering machine learning algorithms.

In this regard, it is useful to remember the concepts of correlation and regression, as they are techniques widely used to understand the relationships between quantitative variables.

Correlation is a measure of association between two variables that are not designated as either dependent or independent. Regression, at a basic level, is used to examine the relationship between one dependent variable and one independent variable. Since regression statistics can be used to anticipate the dependent variable when the independent variable is known, regression enables prediction capabilities.
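The sketch below shows both ideas in a few lines of Python using NumPy; the sample values for x and y are invented purely for illustration.

```python
# A small sketch of correlation and simple linear regression with NumPy.
# The sample values are invented for illustration only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

# Correlation: strength of the association between the two variables.
r = np.corrcoef(x, y)[0, 1]
print(f"correlation: {r:.3f}")

# Regression: fit y = a*x + b, then predict y for an unseen value of x.
a, b = np.polyfit(x, y, deg=1)
print(f"predicted y at x=6: {a * 6 + b:.2f}")
```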

Machine learning approaches are continuously being developed. For the purposes of a developer, some of the more popular approaches used in machine learning are explained below.

K-nearest neighbor

The k-nearest neighbor algorithm is a pattern recognition model that can be used for classification as well as regression. Often abbreviated as k-NN, the k in the name is a positive integer, typically small.

In k-NN classification, the output assigns the input data to a specific class: the algorithm assigns the object to the class most common among its k closest neighbors. If k equals 1, then the object is simply assigned the class of its single nearest neighbor.
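Here is a minimal k-NN classification sketch with scikit-learn; the 2-D points and class labels are invented for illustration.

```python
# Minimal k-NN classification sketch with scikit-learn.
# The points and class labels are invented for illustration.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],   # class "A"
     [6, 6], [6, 7], [7, 6]]   # class "B"
y = ["A", "A", "A", "B", "B", "B"]

# With k=3, a new point is assigned to the majority class
# among its three closest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 2]]))  # expected: ['A']
```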

Decision tree

A decision tree is generally used to visually represent decisions or to show how they are reached. When working with machine learning and data mining, decision trees serve as predictive models. These models map observations about the data to conclusions about the data's possible target values.

The goal of decision tree learning is to create a model that, given the data received as input, is able to predict the possible outcomes.

In the predictive model, the data's attributes, determined through observation, are represented by the branches, while the conclusions about the possible predicted values are represented by the leaves.

When learning is "a tree", the source data is divided into subsets based on the attribute value test. The test is repeated recursively on each of the derived subsets. Once the subset in a node has the equivalent value to the target value, the recursive process (loop) will be completed.

Deep Learning

Deep learning attempts to imitate how the human brain processes light and sound stimuli into sight and hearing. A deep learning architecture is inspired by biological neural networks and consists of multiple layers in an artificial neural network, built with hardware and GPUs (Graphics Processing Units).

Deep learning uses a cascade of layers of non-linear processing units to extract or transform features or representations of the data. The output of one layer serves as the input of the next. In deep learning, the algorithms can be supervised and used to classify data, or unsupervised and used for pattern analysis.
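As a very small sketch of this layered structure, the example below stacks a few dense layers with Keras, assuming TensorFlow is installed; the layer sizes and training data are invented placeholders.

```python
# Minimal layered (deep) neural network sketch with Keras.
# Layer sizes and data are invented placeholders; assumes TensorFlow is installed.
import numpy as np
from tensorflow import keras

# Each layer's output becomes the next layer's input.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # final output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Train briefly on random placeholder data just to show the workflow.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
model.fit(X, y, epochs=3, verbose=0)
print(model.predict(X[:1], verbose=0))
```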

Among the machine learning algorithms currently in use and under development, deep learning absorbs the largest amounts of data and has been able to beat humans in some cognitive tasks. Thanks to these results and its characteristics, deep learning is today the approach best placed to make a meaningful contribution to the development of artificial intelligence.

Computer vision and speech recognition have both made significant advances thanks to deep learning approaches. IBM Watson is a well-known example of a system that leverages deep learning.

Programming languages for Machine Learning

When choosing a language to specialize in for machine learning, it is worth considering the skills requested in current job postings as well as the libraries available in the various languages.

According to several analyses of job postings, the ranking of the most requested programming languages for machine learning places Python first, followed by Java, R and finally C++.

Python's popularity is partly due to the development of many deep learning frameworks in the language, such as TensorFlow, PyTorch and Keras. With a readable syntax that lends itself well to scripting, Python is strong and effective both for preprocessing data and for working with data directly.

The scikit-learn library is built on packages that Python users will already be familiar with: NumPy, SciPy and Matplotlib.
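As a quick, hypothetical illustration of that interoperability, the sketch below passes plain NumPy arrays straight to a scikit-learn estimator; the numbers are invented.

```python
# Small sketch showing scikit-learn working directly on NumPy arrays.
# The values are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # inputs as a NumPy array
y = np.array([2.0, 4.1, 6.0, 8.2])           # targets as a NumPy array

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)         # learned slope and intercept
print(model.predict(np.array([[5.0]])))      # prediction for a new input
```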

Java is widely used in enterprise programming and is generally the choice of front-end desktop application developers who also work on machine learning within their companies. Although it is usually not the first choice for someone new to programming who wants to learn about machine learning, it is a natural fit for those who already have a background in Java development.

In terms of machine learning applications in industry, Java tends to be used more than Python for network security, including in cyberattack and fraud detection use cases.

Among the machine learning libraries for Java are: 

  • Deeplearning4j, a distributed, open source deep learning library written for both Java and Scala;
  • MALLET (MAchine Learning for LanguagE Toolkit), which enables machine learning applications on text, including natural language processing, topic modeling, document classification and clustering;
  • Weka, a collection of machine learning algorithms to be used for data mining.

R is an open source programming language used primarily for statistical computing. It has grown in popularity in recent years and is favored by many in academia. R is not typically used in production environments, but it has gained ground in industrial applications as a result of the growing interest in data science.

Popular machine learning packages in R include:

  • caret (abbreviation of Classification And REgression Training) for the creation of predictive models;
  • randomForest for classification and regression;
  • e1071 which includes functions for statistics and probability theory.

C++ is a good language for machine learning, for artificial intelligence in video games, and for robotics applications (robot locomotion, for example). If you have worked on embedded computing hardware or are an electronics engineer, you may prefer languages such as C or C++ for the high level of control they provide.

Some C++ libraries related to machine learning are mlpack, which is designed for scalability; Dlib, which offers a wide range of machine learning algorithms; and Shark, which is modular and open source.

Human biases

It may be assumed that the information derived from computational analysis is always objective, but in reality this is not the case: even the outputs of machine learning are not neutral. Human bias plays a key role in how data is collected and organized, and in the algorithms that determine how machine learning will interact with that data.

If, for example, people provide images of 'fish' as data to train an algorithm and these images show only goldfish, then the machine will not be able to label sharks as 'fish'. The inputs will therefore have created a bias against sharks, which will be excluded from the 'fish' label.

These biases can clearly compromise the objectivity of the machine, but they can also be detrimental to the users who interact with it. Imagine, for example, the harm a machine with gender or racial bias could cause as a result of improper algorithm training. Moreover, an algorithm saturated with such biases may fail to show a user the right advertisements or job opportunities because it lacks the 'elasticity' to do so.

There are several ways to counter these biases. One is to make sure that many people work on a project and that just as many test and review it, in order to favor an analysis that is as collective and objective as possible.

Another method is to engage third-party groups to monitor and review the algorithm, to create alternative systems that can detect these biases, and to include ethics reviews as part of data science projects. Raising awareness of our own and collective biases, conscious or unconscious, coupled with building fairness into machine learning projects, helps discourage bias in this field.

Conclusions

In this tutorial you have explored several use cases of machine learning, along with the most popular methods and approaches applied in this area and the programming languages that can be used to develop such projects.

Understanding the logic behind the main algorithms in use today is fundamental to understanding how machine learning works. Keep in mind that machine learning is an ever-changing field: to stay current, it is essential to keep up with the subject and with the new methods and algorithms that will emerge in the future.