Down in the data mine, something stirs…

VIDEO CONTENT: The future, as far as transport is concerned, is all about four key areas: Energy – Data – Congestion – Disruptors.

Generally speaking, organisations have loads of data and, today, easily available methods to display it. Indeed, data mining – the process of extracting and discovering patterns in large data sets, using methods at the intersection of machine learning, statistics and database systems – is now big business.

That’s all well and good when the data is in a simple format, typically involving numbers. We can then easily correlate types of traffic flows with the time of day on certain routes, for example. Modelling – the art of accurately predicting what is likely to happen in the future – is now a well-established science.
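To make the traffic example concrete, here is a minimal sketch in Python of relating flow to time of day. The observations are made-up numbers for a single hypothetical route, and the function name is ours, not taken from any real transport system.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (hour of day, vehicles counted on one route).
# These numbers are invented purely for illustration.
observations = [
    (8, 1200), (8, 1300), (8, 1250),
    (12, 700), (12, 650),
    (17, 1400), (17, 1500),
    (22, 300), (22, 250),
]


def average_flow_by_hour(rows):
    """Group vehicle counts by hour and return the mean count for each hour."""
    grouped = defaultdict(list)
    for hour, vehicles in rows:
        grouped[hour].append(vehicles)
    return {hour: mean(counts) for hour, counts in grouped.items()}


averages = average_flow_by_hour(observations)
# The hour with the highest average flow in this toy data set.
peak_hour = max(averages, key=averages.get)

print(averages)
print("Busiest hour:", peak_hour)
```

Because every value is a number, a few lines of grouping and averaging are enough to see the pattern. That is exactly what gets harder once the data is written in words.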

Transport for London has been a big user of data, which now drives everything from modelling passenger numbers to controlling the traffic lights in real time to reduce delays on key routes.

But what happens when data is in the form of language and words?

Enter a new phrase – text mining – which does a similar thing.

At its simplest, text mining is finding a word in a document: the ‘find and replace’ command in Word that we’re all familiar with.

But, as anyone who has tried to find something with a specific phrase in Google has discovered, it’s not always that easy. Advanced Google users know the tricks of using symbols such as ‘+’ to target their searches, but even then it’s with limited accuracy.

Now, with the application of artificial intelligence (AI) and machine learning (ML), something called semantic search capabilities (SSC) becomes possible using Natural Language Processing (NLP).

Hang on! Too much jargon, you cry!

And, you’re right. So, hold tight for a quick trip…

When words become numbers

Think what would be possible if we could treat data held in language (written text) in the same way as numeric data.

Numeric data can be readily analysed with the help of a computer using standard statistical techniques, devised to analyse data expressed as numbers.

In contrast, language can be easily understood by a human but not by a computer, making analysis of data held in documents difficult.
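One common first step towards treating words as numbers is the ‘bag of words’: count how often each word appears, then compare the counts. The sketch below is a generic textbook technique in Python, not the method used by any system described in this article, and the three example sentences are invented.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn text into a bag-of-words vector: each word mapped to its count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example sentences.
report_1 = to_vector("Worker slipped on the ladder and hurt an ankle")
report_2 = to_vector("Ladder gave way and the worker hurt an ankle")
report_3 = to_vector("Traffic lights reduce delays on key routes")

similar = cosine_similarity(report_1, report_2)
different = cosine_similarity(report_1, report_3)
print(similar, different)
```

The two ladder reports score far higher against each other than against the traffic sentence. Once text is a vector of numbers, the standard statistical toolkit can be applied to it, although simple word counts still know nothing about meaning.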

If a collection of reports and other documents held by an organisation is too large for a human to read and analyse, computer techniques are needed to extract useful information, summarise it and use it for further analysis.

Text mining is a range of methods that allow computers to extract information from text, perhaps to identify things such as activities, equipment and places, and to explore the relationships between them.

Text mining goes much further than simple keyword searches and requires the computer to keep track of the meaning of words.

For example, if you are investigating causes of back pain, you only want to analyse documents that use the word ‘back’ when it refers to the rear torso, but exclude uses such as ‘she came back into the room’, or ‘he stepped back’.
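As a rough illustration of the ‘back’ problem, the Python sketch below looks at the words either side of ‘back’ to guess whether it refers to the body. The cue-word lists are hypothetical and chosen only for this example. Real text mining systems use far richer language models, so treat this as a toy heuristic rather than how any actual tool works.

```python
import re

# Hypothetical cue words, chosen for this example only.
BEFORE_CUES = {"my", "his", "her", "their", "lower", "upper", "injured", "strained"}
AFTER_CUES = {"pain", "injury", "strain", "ache"}


def mentions_back_as_body_part(sentence):
    """Crude guess: is 'back' used as a body part, judged by its neighbouring words?"""
    words = re.findall(r"[a-z]+", sentence.lower())
    for i, word in enumerate(words):
        if word != "back":
            continue
        before = words[i - 1] if i > 0 else ""
        after = words[i + 1] if i + 1 < len(words) else ""
        if before in BEFORE_CUES or after in AFTER_CUES:
            return True
    return False


for sentence in [
    "He strained his back lifting a kerb stone",
    "Reported lower back pain after the shift",
    "She came back into the room",
    "He stepped back",
]:
    print(mentions_back_as_body_part(sentence), "-", sentence)
```

A plain keyword search would return all four sentences. Even this crude context check keeps the first two and drops the others, although it is easily fooled, by ‘the back of the van’ or ‘gave her back the keys’ for instance, which is why proper language analysis is needed.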

Text mining techniques can go beyond analysing individual documents, making links between concepts (in this case back pain), identifying common topics, and automatically summarising content across multiple documents.

By analysing a large collection of documents, it is possible to gather evidence and gain insights that are unlikely to be discovered by manually reading a smaller sample of reports. 

Training computers

By training computers how to read text almost like a human, experts at the National Centre for Text Mining, University of Manchester (NaCTeM) have created a system that can do in-depth analysis instantly.

NaCTeM is proving that applying Natural Language Processing (NLP) tools to data really works – performance of the backend NLP tools has an accuracy of up to 90% compared with traditional keyword approaches.

This opens the door to new, more efficient searching of valuable information for all kinds of industries.

In collaboration with the Health and Safety Executive (HSE), NaCTeM has developed a new text search system.

Under the Reporting of Injuries, Diseases and Dangerous Occurrences Regulations (RIDDOR) 2013, employers in all sectors are legally required to report accidents to the HSE.

With the demise of coal mining, and despite the many gains made in recent years, construction remains the most dangerous UK occupation with 40 deaths in the last year, plus a further 81,000 people living with ill-health suffered at work.

The newly-developed RIDDOR Text Analysis Tool uses natural language processing and machine learning to perform semantic searching on accident and incident textual data (there’s a useful video about this below).

Previously, trawling through the vast volume of information generated by accidents has been a slow, laborious task. Until now it has only been possible to search through RIDDOR reports using ‘codified data’ – terms such as type of injury or the age of the person who’s been injured. This then brings up a long list of documents that have to be read through to get to the relevant facts.

The RIDDOR Text Analysis Tool is a big step forward. It ‘mines’ through the free text of the HSE documents to explore the contents and present the relevant information in an instantly usable form.

This will enable health and safety managers, contractors and HSE inspectors to extract pertinent safety-critical concepts and associations without the labour of trawling through thousands of pages of text.

It brings together HSE data and the power of artificial intelligence – in particular natural language processing and deep learning – to make devising risk assessments more accurate, effective, intuitive and much easier.

Professor Sophia Ananiadou, Director of NaCTeM, Department of Computer Science, led the team which developed the tool. She explains: “It makes sense of large volumes of textual data in an efficient and intuitive manner; it gets right to the heart of what the searcher is looking for. In an industry where risk management is crucial, this is a solution to the problem of overlooking risks due to the overwhelming amount of text.

“The tool allows users to drill down in an interactive manner and easily find the information they need. If that’s guidance on what to do to prevent falls, it will present a set of short summaries, with key words (such as ‘fall’, ‘ankle’ or ‘ladder’) highlighted, giving immediate access to the relevant information. This could include what happened, what needs to happen to prevent any further occurrences and any protective measures taken.”

For example, if you enter ‘ladder’ and ‘ankle’ into the search engine, the tool instantly brings up clusters of words gathered from reports containing them – and others such as ‘slab’, ‘kerb’ and ‘slipped’ that guide you to think more widely around the subject.

Alongside this is a list of 587 documents (found in 0.007 seconds) – but presented with relevant extracts such as ‘ladder gave way, slipped along the plane of elevation’ rather than just allowing you to download the full document.
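The general idea of finding matching reports and surfacing the other words that keep them company can be sketched in a few lines of Python. This is not the RIDDOR Text Analysis Tool’s code: the reports, the stop-word list and the `search` function are all invented for illustration, and the sketch matches exact words only, with none of the semantic analysis the real tool performs.

```python
import re
from collections import Counter

# Hypothetical free-text reports, invented for illustration.
reports = [
    "Operative slipped on ladder and twisted ankle on the kerb",
    "Ladder gave way, worker slipped and broke ankle",
    "Worker tripped over slab and bruised knee",
]

# A tiny hand-picked list of common words to ignore.
STOPWORDS = {"on", "and", "the", "over", "a"}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def search(query_terms, documents):
    """Return documents containing every query term, plus counts of the other words in them."""
    query = {term.lower() for term in query_terms}
    matches = [doc for doc in documents if query <= set(tokenize(doc))]
    related = Counter(
        word
        for doc in matches
        for word in tokenize(doc)
        if word not in query and word not in STOPWORDS
    )
    return matches, related


matches, related = search(["ladder", "ankle"], reports)
print(len(matches), "matching reports")
print(related.most_common(3))
```

In this toy data, ‘slipped’ appears in both matching reports, so it rises to the top of the related words, much as the clusters described above nudge the searcher towards connected hazards.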

Tim Yates, HSE data scientist, adds: “This will help users carrying out risk assessments by quickly providing insights, for example before using a piece of equipment for the first time.”

The RIDDOR Text Analysis Tool has powerful potential for construction contractors looking for information on a particular type of accident or incident with a view to preventing it from happening in the future, he says: “The tool can provide users with many examples of incident reports containing specific construction-related aspects, so they can gain insight about the hazards associated with, say, equipment and activities that they are involved with.”

“This will transform the way we use incident and accident data. This will help industry come up with evidence-based solutions to prevent harm from happening in the future.”

The next steps will include engaging with construction companies to develop the system further, then taking it wider, to other industries.

At this stage, it’s being used as a safety tool, which has obvious benefits for the transport sector.

It seems inevitable that the tool will be developed further, to be of much wider use in transport and transport planning. This is an exciting and tantalising prospect.

If you think that using semantic search capabilities could help your industry, please get in touch with Sophia Ananiadou, Director of NaCTeM, Department of Computer Science at  [email protected]
