
I have a dataset of programming skills that I would like to preprocess/clean and group into some more general categories.

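  • What text cleaning can I apply to the following text to make it clean? Examples from the dataset below: Visual C and C are the same, or Yii and Yii Framework are the same.
  • Is there any kind of programmer/software engineering and project management dictionary or ontology that could help me classify the following into more abstract categories?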

This is my dataset:

C++
C
CAE
Programming
Matlab
Simulations
Finite Element Analysis
Software Engineering
Algorithms
Linux
Software Development
Python
Engineering
CAD
Numerical Analysis
Fortran
Java
Mechanical Engineering
C/C++ STL
CFD
Optimization
ANSYS
AutoCAD
LaTeX
ANSA
Eclipse
HTML
Machine Learning
Software Design
SQL
UML
Abaqus
C#
MySQL
Aerodynamics
Catia
JavaScript
PHP
Microsoft Office
Nastran
OpenGL
Stress Analysis
CSS
Qt
Modeling
Structural Analysis
Computational Geometry
Fluid Mechanics
Mathematica
Parallel Computing
Visual Studio
XML
CATIA
Computational Mechanics
D
Fluid Dynamics
LS-DYNA
NetBeans
Object Oriented Design
Objective-C
R&amp
Windows
Composites
Computer Science
Customer Service
Inventor
Manufacturing
Operating Systems
Parallel Programming
Pro Engineer
Research
Solidworks
Business Strategy
Crashworthiness
jQuery
Management
Microsoft Excel
OpenFOAM
Pattern Recognition
Shell Scripting
TCP/IP
Vim
?TA
Android Development
Autodesk Inventor
Automotive Engineering
Blender
CFX
Databases
Git
Joomla
Mathematics
Microsoft Visual Studio...
NVH
Optimizations
Photoshop
PostgreSQL
Product Design
Product Development
Scripting
Solid Mechanics
Subversion
Unix
Web Development
Analysis
Artificial Intelligence
Automotive
Business Development
C 
CAD/CAM
CUDA
Data Analysis
Data Mining
Electrical Engineering
Engineering Design
Engineering Management
Heat Transfer
High Performance...
Hypermesh
Image Processing
Java Enterprise Edition
Mercurial
mETA
Microsoft SQL Server
Microsoft Word
MPI
Multithreading
Negotiation
New Business Development
OpenMP
Perl
PowerPoint
Project Engineering
Project Management
Prolog
Pthreads
Robotics
Simulation
SolidWorks
Thermodynamics
Visual Basic
3D Studio Max
Accounting
Agile Methodologies
ANSA/mETA
Ant
Apache
AutoCAD Mechanical
Biomechanics
Biomedical Engineering
Budgets
Business Analysis
Computer Vision
Corporate Communications
CVS
Delphi
Design Patterns
Dynamics
EJB
Embedded C
Embedded Systems
Energy
English
Financial Analysis
Fortran 95
Genetic Algorithms
Haskell
Hibernate
HTML 5
iOS development
JPA
JSP
JUnit
Marketing Strategy
Materials Science
Meshing
Meta
MongoDB
Multithreaded...
Network Programming
Neural Networks
Numerical Simulation
OOP
Parallel Algorithms
Parallel Processing
Piping
Post Processing
Powertrain
Presentations
Public Relations
Radioss
Sales
Scientific Computing
Scrum
SOAP
Software Project...
Solid Edge
Spring
Star-CCM+
Strategic Planning
Teaching
Team Leadership
Template Metaprogramming
Test Driven Development
Ubuntu
Unigraphics
Unit Testing
Vehicles
Visual C++
Web Applications
Web Services
Weblogic
Wireshark
WordPress
.NET
?TA Post-processor
?TA Post Processor
?ヤチ
3D
3D Modeling
Account Management
Account Reconciliation
Accounts Payable
Acoustics
Active Directory
Adjoint Optimization
Aerospace
Agile Project Management
Algorithm Development
Analog Photography
Android
AngularJS
ANSA Pre-processor
ANSA/META
ANSY CFX
ANSYS FLUENT
Apache HTTP Server
Applied Mathematics
Approximation Algorithms
Architecture
ARM Assembly
Artificial Neural...
ASME
Assembly Language
Astrophysics
Automotive Design
AVL Boost
B2B
Balanced Scorecard
BEM
Benchmarking
Bind
Biomaterials
boost
Boost C++
Business Coaching
C/C++
C++ Builder
C++ Language
CAD/CAM Software
CAE Process Automation
Carbon Fiber
Casting
CATIA V5
CATIA, CFD, ANSA, ?TA
Channel Partners
Characterization
Cilk
Civil Engineering
ClearCase
ClearQuest
Cluster
Cluster Development
CNC
Coaching
Cocoa
Combustion
Company Presentations
Competitive Analysis
Compiler Construction
Compression Algorithms
Computation Geometry
Computational Physics
Computer Graphics
Computer Repair
Consecutive...
Constitutive Modeling
Corel
Corporate Identity
Corporate Sales...
Crash
Crisis Communications
CRM
CSS3
Data Acquisition
Data Exchange
Data Management
Data Privacy
Database Administration
Database Design
Decision Support
Digital Photography
Direct Sales
DirectX
Discrete Mathematics
Distributed Systems
Domain Specific...
Driving License
Dynamic Programming
Dynamical Systems
Economics
ECU manager- MoTeC
Editing
Education
Electronic Engineering
Electronics
Emacs
Embedded Software
Employee Training
Energy Derivatives
Engine bench data...
Engine calibration
Engine Modelling
Engine Performance
Engineering Analysis
Entrepreneurship
Ergonomics
ERP
Event Management
Event Planning
Evolutionary Algorithms
Evolutionary Computation
Experimentation
Fatigue Analysis
FEM analysis
Financial Reporting
Finite Cell Method
Fixed Assets
Fluid-Structure...
Functional Programming
General Ledger
Generative Programming
Glade
Glassfish
GLSL
GNU Make
GPU Computing
GPU Programming
Graphics
Grid Generation
GT-Power
GUI development
Hadoop
Human-computer...
Human Factors
Human Factors...
Illustrator
Image Segmentation
Information Architecture
Information Systems
Informix
Integration
Internal Communications
International Business
International Sales
Interpreting
Isogeometric analysis
Italian languages
J2EE
Java RMI
JavaSE
JBoss
JBoss Application Server
JDBC
JMS
jQuery Mobile
JSON
JT Open Toolkit
Kanban
KDevelop
Key Account Management
Kinematics
Language Services
Latex
Lex
Lightroom
Linear Algebra
Linguistics
Linux server...
Linux System...
Localization
Machine Embroidery
Machining
Management Consulting
MapReduce
Market Analysis
Market Research
Marketing Communications
Materials Testing
Mathematical Modeling
Mathematical Programming
MATLAB
Maven
Mechanical Behavior of...
Mechanical Testing
Mechanism Design
Media Relations
Medical Devices
Medical Translation
Mesh Generation
MetaPost
Microcontrollers
Microscopy
Microsoft Windows
Microstructure
Mobile Application...
ModeFrontier
Monte Carlo
Moodle
Morphing
Motion Analysis
MSC.Patran
Nanoindentation
NASTRAN
Network Administration
Network Simulator
Node.js
NoSQL
OGRE
Online Gaming
Open Source
openACC
openCL
OpenCL
OpenCV
Optical Microscopy
Outlook
Pamcrash
Paraview
pascal
Pascal
Patents
Pedestrian Safety
Performance Management
pFEM
phpMyAdmin
Physical Modeling
Physics
PL/SQL
Plasticity
Plex
Polymers
POSIX Threads
Pre-sales
Presentation Skills
Press Releases
Pressure Vessels
PRINCE2
Problem Solving
Process Improvement
Product Management
Product Marketing
Program Management
Programming Languages
Project Planning
Prototyping
PTC Creo
Public Speaking
qt
Qt Creator
Quantum Mechanics
Quartz
RadTherm
Requirements Analysis
REST
Revenue Recognition
Reverse Engineering
RPAS
Safety
Sales Management
SAP2000
Scanning Electron...
Scheme
Science
Scientific Visualization
SDL Trados
SEM
Sensitivity Analysis
Servlets
Shape Analysis
Shape Recognition
Shape Registration
Shared Memory
Signal Processing
Simulation Software
SImulations
Simulink
SIP
Skilled Negotiator
SNMP
Social Media
Social Networking
Socket Programming
Sockets
Software Architectural...
Software Documentation
Software Quality...
Software Testing
Solaris
SpaceClaim
Spectroscopy
Spring Data
Spring Framework
Spring MVC
SQL Server
Squeak and Rattle
STAR-CD
Star CCM+
STAR CCM+
Start-ups
Statistics
Steel Design
Steel Structures
STEP ISO 10303
stl
STL
Strategic Alliances
Strategy
Structural Optimization
Struts
Struts2
Subtitling
Swing
System Administration
Tax Advisory
TCL
Tcl-Tk
Team Building
Team Management
Teamwork
Technical Translation
Technical Writing
Telecommunications
Tenrox
Testing
Time Series Analysis
Tomcat
Tortoise SVN
TR-069
Track testing
Trados
Translation
Tribology
Turbomachinery
Turbulence
Turbulence Modeling
Typo3
Unity3D
Unix Shell Scripting
User Experience
User Interface Design
Vaadin
VBA
Vehicle Dynamics
VHDL
Virtual Reality
Visual C#
VTK
Web Design
Website Localization
Websphere
WebSphere
WebSphere Application...
WebSphere MQ
Weka
Widgets
win32
Windows 7
Windows Azure
Windows Server
Wordfast
Wordpress
Workflow Reference Model
wxWidgets
XQuery
XSLT
Yacc
Yii
Yii Framework

1 Answer

1

There are two ways to clean up and categorize the dataset:

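  1. Manually.
  2. Use a text extraction API, which will give you some idea of the hierarchy. You can use AlchemyAPI, TextMiner, etc. to see which terms get grouped together. It will not give you an exact classification, but it will give you a broad picture of the categories.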
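As a rough illustration of the manual route, the sketch below normalizes case and whitespace, merges known variants through a hand-written alias table, and looks up a coarse category. The `ALIASES` and `CATEGORIES` tables and all function names are hypothetical, chosen only for this example; they are not part of any library, and a real table would have to be built by reviewing the full list.

```python
import re
from collections import defaultdict

# Hypothetical alias table (an assumption for illustration):
# maps a normalized variant to the canonical name it should merge into.
ALIASES = {
    "yii framework": "yii",
    "visual c#": "c#",
    "visual c++": "c++",
    "star ccm+": "star-ccm+",
    "posix threads": "pthreads",
    "simulations": "simulation",
}

# Hypothetical hand-made categories, keyed by canonical name.
CATEGORIES = {
    "c++": "Programming languages",
    "c#": "Programming languages",
    "yii": "Web frameworks",
    "star-ccm+": "CAE/CFD tools",
    "pthreads": "Parallel computing",
}


def normalize(skill: str) -> str:
    """Trim, lowercase, collapse inner whitespace, then apply the alias table."""
    s = re.sub(r"\s+", " ", skill.strip().lower())
    return ALIASES.get(s, s)


def group_variants(skills):
    """Group raw skill strings that share the same canonical form."""
    groups = defaultdict(set)
    for skill in skills:
        groups[normalize(skill)].add(skill)
    return dict(groups)


def categorize(skill: str) -> str:
    """Return the coarse category, or 'Uncategorized' if none is defined."""
    return CATEGORIES.get(normalize(skill), "Uncategorized")


if __name__ == "__main__":
    sample = ["Yii", "Yii Framework", "stl", "STL", "Star CCM+", "Star-CCM+", "C ", "C"]
    for canonical, variants in sorted(group_variants(sample).items()):
        print(canonical, sorted(variants), categorize(canonical))
```

Case and whitespace normalization alone already merges pairs such as `stl`/`STL` and `C `/`C`; entries like `Yii Framework` need an explicit alias, which is where the manual effort goes.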
Answered 2016-06-25T16:37:55.420